JANUARY 30 2025
How to Use Dgraph for OLAP Data Analysis
A practical guide to performing OLAP data analysis with Dgraph
Raphaël Derbier, Staff Architect
When it comes to data analytics, you don't always have the luxury of using the best-of-the-best tools for every task.
A tool that gets the job done today beats waiting for one that could do it better but that you don't have on hand, whether due to time, resources, or availability. A perfect example is the tooling for OLTP vs. OLAP.
In this blog post, we'll explore how Dgraph — a graph database optimized for Online Transaction Processing (OLTP) and deeply nested queries — can also be used effectively for Online Analytical Processing (OLAP) use cases. We'll highlight Dgraph's analytical capabilities through examples and practical techniques for designing analytical solutions without needing an additional OLAP solution.
Let's dive in.
What Is OLTP vs. OLAP?
OLTP (Online Transaction Processing) focuses on processing day-to-day transactions, while OLAP (Online Analytical Processing) is geared toward analyzing data from multiple sources to support business decision-making.
Dgraph — though primarily designed for OLTP — has robust features that make it capable of addressing OLAP needs by leveraging its graph structure and DQL (Dgraph Query Language).
Relationships Form the Dimensionality
In Dgraph, relationships between nodes naturally form the dimensions required for OLAP-style analysis.
DQL's aggregation and math functions, combined with thoughtful graph design, allow you to create a comprehensive analytical solution directly within Dgraph.
The examples below use a dataset about donations to public schools in the U.S., built from public data provided by DonorsChoose.org in a Kaggle project dataset. You can also find the data ready to load into Dgraph in this Dgraph benchmarks GitHub repository.
Basic Count
# number of projects per school
{
  stats(func: type(School)) {
    School.name
    count(~Project.school)
  }
}
Dgraph returns the school names and the count result:
{
  "data": {
    "stats": [
      {
        "School.name": "Abbott Middle School",
        "count(~Project.school)": 16
      },
      {
        "School.name": "Lincoln Elementary School",
        "count(~Project.school)": 7
      },
      {
        "School.name": "Rosemont Early Education Center",
        "count(~Project.school)": 5
      }
    ]
  }
}
DQL's block structure and alias feature allow you to align query responses with a format easily consumable by a visualization tool.
For instance, to use the query result in a Python script with Plotly, you can modify the query like this:
# number of projects per school
{
  school(func: type(School)) {
    category: School.name
    value: count(~Project.school)
  }
}
Using this result, you can create a bar chart in Python:
import json

import pandas as pd
import plotly.express as px
import pydgraph

# Connect to a local Dgraph Alpha
client_stub = pydgraph.DgraphClientStub('localhost:9080')
client = pydgraph.DgraphClient(client_stub)

def bar_chart(payload, title='bar chart'):
    keys = payload.keys()
    serie = list(keys)[0]
    df = pd.json_normalize(payload[serie])
    cols = df.columns.to_list()
    cols.remove('category')
    fig = px.bar(df,
                 y='category',
                 x=cols,
                 title=title,
                 orientation='h',
                 labels={"category": serie, "value": cols[0]},  # customize axis names
                 text_auto=True  # automatically display bar values
                 )
    return fig

query = '''
# number of projects per school
{
  school(func: type(School)) {
    category: School.name
    value: count(~Project.school)
  }
}
'''

res = client.txn(read_only=True).query(query)
res = json.loads(res.json)
fig = bar_chart(res, "Number of projects per school")
fig.show()
Aggregations & Variables
Dgraph variables add flexibility by enabling filtering, ordering, and querying additional data.
Value variables in DQL are map structures: a variable maps the UIDs of the enclosing block to the corresponding values. Value variables are used either by extracting the UIDs with `uid(var-name)` or by reading the variable's value in the query context using `val(var-name)`.
The query context is a node in the graph. If the current node's UID is not in the variable map, the value is undefined and not displayed in the result.
Here's an example that counts projects per school and orders them by project count:
# number of projects per school ordered by project count
{
  var(func: type(School)) {
    c as count(~Project.school)
  }
  serie(func: uid(c), orderdesc: val(c)) {
    category: School.name
    project_count: val(c)
  }
}
The `var` block defines a variable but does not return data in the response. The named block uses the variable data along with additional node information.
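To build intuition for how a value variable behaves, here is a minimal Python sketch of the map semantics described above. The UIDs, counts, and school names are made up for illustration; the dict plays the role of `c as count(~Project.school)`, and the sort emulates `serie(func: uid(c), orderdesc: val(c))`.

```python
# A value variable is a map from node UID to value.
c = {"0x1": 16, "0x2": 7, "0x3": 5}  # uid -> project count (sample data)
names = {"0x1": "Abbott Middle School",
         "0x2": "Lincoln Elementary School",
         "0x3": "Rosemont Early Education Center"}

# uid(c) selects exactly the UIDs present in the map;
# orderdesc: val(c) sorts them by the mapped value.
ordered = sorted(c, key=c.get, reverse=True)
serie = [{"category": names[uid], "project_count": c[uid]} for uid in ordered]
print(serie[0])  # {'category': 'Abbott Middle School', 'project_count': 16}
```

A UID absent from the map simply never appears in `ordered`, which mirrors how Dgraph omits nodes whose value is undefined.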
Grouping & Filtering by Dimensions
Dgraph's `@groupby` directive allows for powerful OLAP-style groupings.
Here's an example of counting nodes by type:
# group by a scalar predicate
# count of nodes per dgraph.type (attribute)
{
  stats(func: has(dgraph.type)) @groupby(dgraph.type) {
    count: count(uid)
  }
}
{
  "data": {
    "stats": [
      {
        "@groupby": [
          {
            "dgraph.type": "City",
            "count": 10
          },
          {
            "dgraph.type": "School",
            "count": 10
          },
          {
            "dgraph.type": "State",
            "count": 13
          },
          {
            "dgraph.type": "Category",
            "count": 15
          },
          {
            "dgraph.type": "Project",
            "count": 73
          },
          {
            "dgraph.type": "Donor",
            "count": 190
          },
          {
            "dgraph.type": "Donation",
            "count": 224
          }
        ]
      }
    ]
  }
}
In the previous example, we group the data using `dgraph.type`, a scalar predicate, so the groups are keyed by string values.
The `groupby` directive is more powerful when we group by a relationship to other nodes: in that case, the aggregated values can be associated with the related nodes in a query variable and reused.
For example, in our data model, a `Project` has a relationship to a `Category` entity. We can group projects per category using the following:
# count of projects per category (relationship)
{
  stats(func: has(<Project.category>)) @groupby(Project.category) {
    count: count(uid)
  }
}
{
  "data": {
    "stats": [
      {
        "@groupby": [
          {
            "Project.category": "0x4b61b90",
            "count": 1
          },
          ...
          {
            "Project.category": "0x4b61b9a",
            "count": 10
          },
          {
            "Project.category": "0x4b61b94",
            "count": 13
          },
          {
            "Project.category": "0x4b61b92",
            "count": 22
          }
        ]
      }
    ]
  }
}
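Note that `@groupby` results come back nested one level deeper than regular query blocks, under an `"@groupby"` key. A small Python sketch (using an abridged copy of the response above) shows how a client can flatten them into a plain dict:

```python
import json

# Abridged @groupby response, as returned by Dgraph above.
raw = '''
{
  "data": {
    "stats": [
      {
        "@groupby": [
          {"Project.category": "0x4b61b90", "count": 1},
          {"Project.category": "0x4b61b94", "count": 13},
          {"Project.category": "0x4b61b92", "count": 22}
        ]
      }
    ]
  }
}
'''

resp = json.loads(raw)
groups = resp["data"]["stats"][0]["@groupby"]
counts = {g["Project.category"]: g["count"] for g in groups}
print(counts)  # {'0x4b61b90': 1, '0x4b61b94': 13, '0x4b61b92': 22}
```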
The `groupby` result is now expressed in node UIDs: each element in the response corresponds to a node. So we can set a Dgraph variable and use `val()`, `orderdesc`, and filters to better structure the query, like this:
# count of projects per category for categories with at least 10 projects
# ordered by number of projects
{
  var(func: has(<Project.category>)) @groupby(Project.category) {
    ProjectPerCategory as count: count(uid)
  }
  stat(func: uid(ProjectPerCategory), orderdesc: val(ProjectPerCategory))
    @filter(ge(val(ProjectPerCategory), 10)) {
    Category.name
    project_count: val(ProjectPerCategory)
  }
}
Aggregation with Multiple Hops
Aggregations are not limited to directly connected entities (i.e., one hop away).
For example, let's find the top three cities by average donation, computing the number of donations, the sum of donations, and the average donation.
In our graph data model we have the following relationships:
City → School → Project → Donation
Aggregating `Donation` data per `City` requires following three hops. Let's try a few different approaches.
Approach 1: Nested Aggregation
The first approach is to use the fact that DQL queries are nested by nature: query the graph-nested structure until you can access the value to aggregate and compute aggregated values of variables at different nested levels.
{
  var(func: type(City)) {
    ~School.city {
      School.projects {
        Project.donations {
          a as Donation.amount
        }
        s1 as sum(val(a))
        c1 as count(Project.donations)
      }
      s2 as sum(val(s1))
      c2 as sum(val(c1))
    }
    s3 as sum(val(s2))
    c3 as sum(val(c2))
    avg as math(s3 / c3)
  }
  stats(func: uid(c3), orderdesc: val(avg), first: 3) {
    City.name
    sum_donation: val(s3)
    count_donation: val(c3)
    avg_donation: val(avg)
  }
}
`c1 as count(Project.donations)` is computed at the `School.projects` level, while `a as Donation.amount` is computed at a lower level of the tree, in `Project.donations`.
Each variable is re-aggregated at the parent level to create a new variable of the right node type:
- `s1` and `c1` are variables of `Project` type (i.e., `s1` maps UIDs of `Project` nodes to values).
- `s2` and `c2` are variables of `School` type.
- `s3` and `c3` are variables of `City` type.
{
  "data": {
    "stats": [
      {
        "City.name": "Stockton",
        "sum_donation": 3857.38,
        "count_donation": 16,
        "avg_donation": 241.08625
      },
      {
        "City.name": "San Mateo",
        "sum_donation": 4779.32,
        "count_donation": 41,
        "avg_donation": 116.56878
      },
      {
        "City.name": "San Jose",
        "sum_donation": 267,
        "count_donation": 3,
        "avg_donation": 89
      }
    ]
  }
}
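The bottom-up re-aggregation can be mirrored in plain Python. The sketch below walks a made-up City → School → Project → Donation tree from the leaves upward, recomputing sums and counts at each level exactly as the `s1/c1`, `s2/c2`, and `s3/c3` variables do (sample amounts are illustrative, not from the dataset):

```python
# Sample nested tree for one city (made-up data).
city = {
    "name": "Stockton",
    "schools": [
        {"projects": [
            {"donations": [100.0, 50.0]},
            {"donations": [25.0]},
        ]},
        {"projects": [
            {"donations": [10.0, 10.0, 5.0]},
        ]},
    ],
}

s3 = 0.0
c3 = 0
for school in city["schools"]:
    s2, c2 = 0.0, 0
    for project in school["projects"]:
        s1 = sum(project["donations"])   # like s1 as sum(val(a))
        c1 = len(project["donations"])   # like c1 as count(Project.donations)
        s2 += s1                         # like s2 as sum(val(s1))
        c2 += c1
    s3 += s2                             # like s3 as sum(val(s2))
    c3 += c2
avg = s3 / c3                            # like avg as math(s3 / c3)
print(s3, c3)  # 200.0 6
```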
Approach 2: Variable Propagation
The second approach to multi-hop aggregation uses variable propagation.
In this case, we use the fact that Dgraph accumulates variables when traversing the graph (see the variable propagation documentation).
{
  var(func: type(Donation)) {
    s1 as Donation.amount
    c1 as math(1)
    Donation.project {
      Project.school {
        School.city {
          s as math(s1)
          c as math(c1)
          avg as math(s1 / c1)
        }
      }
    }
  }
  stats(func: type(City), orderdesc: val(avg), first: 3) {
    City.name
    sum_donation: val(s)
    count_donation: val(c)
    avg_donation: val(avg)
  }
}
`s1 as Donation.amount` is the amount of each donation. When reaching the nested block `Donation.project {}`, Dgraph automatically sums this variable, which is now the sum of `Donation.amount` for each project. The variable is again propagated to the nested level `Project.school {}`.
Finally, inside the `School.city {}` block, the variable is the sum of `Donation.amount` over all the paths reaching a city. As expected, the result is the same.
Let's take a look at the query metrics:
Aggregation:
{
  "extensions": {
    "server_latency": {
      "encoding_ns": 111750,
      "parsing_ns": 287250,
      "processing_ns": 4670001,
      "assign_timestamp_ns": 815625,
      "total_ns": 6111500
    },
    "txn": { "start_ts": 348486 },
    "metrics": {
      "num_uids": {
        "": 0,
        "City.name": 3,
        "Donation.amount": 224,
        "Project.donations": 146,
        "Project.school": null,
        "School.city": null,
        "School.projects": 10,
        "_total": 393,
        "dgraph.type": 0,
        "~School.city": 10
      }
    }
  }
}
Propagation:
{
  "extensions": {
    "server_latency": {
      "encoding_ns": 194750,
      "parsing_ns": 487124,
      "processing_ns": 10162374,
      "assign_timestamp_ns": 2094542,
      "total_ns": 13222375
    },
    "txn": { "start_ts": 348479 },
    "metrics": {
      "num_uids": {
        "": 0,
        "City.name": 3,
        "Donation.amount": 224,
        "Project.donations": null,
        "Project.school": 55,
        "School.city": 9,
        "School.projects": null,
        "_total": 515,
        "dgraph.type": 0,
        "~School.city": null
      }
    }
  }
}
In the variable propagation approach, we traverse from `Donation` to `Project`, and we have 224 donations, so we follow this relationship 224 times. In the aggregation approach, we navigate from `Project` to `Donation`, and we have only 146 projects. The total number of nodes the algorithm visits is higher for the variable propagation approach.
Note that adding a count index on the `Project.donations` predicate significantly improves the aggregation query.
Of course, the right approach depends on your dataset and use case.
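When comparing approaches, the `_total` figure under `extensions.metrics.num_uids` is a convenient single number to pull out of each response. A small helper, fed here with the totals from the metrics above:

```python
# Abridged responses containing only the metric we compare.
aggregation = {"extensions": {"metrics": {"num_uids": {"_total": 393}}}}
propagation = {"extensions": {"metrics": {"num_uids": {"_total": 515}}}}

def total_uids(resp):
    """Total number of UIDs touched while processing the query."""
    return resp["extensions"]["metrics"]["num_uids"]["_total"]

print(total_uids(aggregation), total_uids(propagation))  # 393 515
```

On a real response object from pydgraph, the same dictionary is available after parsing `res.json`; the helper shown here is a hypothetical convenience, not part of the client API.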
Aggregation along Hierarchical Dimensions
For this next example, let's find the number and sum of donations per state and city. The x-axis of our analysis will be a hierarchy: states, then the cities they contain.
As this hierarchy is modeled as nodes and relationships in our graph, we can obtain the data for this analysis in a single query:
# Number and sum of donations per state and city
{
  var(func: type(State)) {
    city: ~City.state {
      ~School.city {
        School.projects {
          Project.donations {
            a as Donation.amount
          }
          c as count(Project.donations)
          s as sum(val(a))
        }
        s1 as sum(val(s))
        c1 as sum(val(c))
      }
      s2 as sum(val(s1))
      c2 as sum(val(c1))
    }
    s3 as sum(val(s2))
    c3 as sum(val(c2))
  }
  stats(func: type(State)) {
    state: State.name
    amount: val(s3)
    count: val(c3)
    city: ~City.state {
      City.name
      amount: val(s2)
      count: val(c2)
    }
  }
}
The result will look like this:
{
  "data": {
    "stats": [
      {
        "state": "California",
        "amount": 19556.48,
        "count": 224,
        "city": [
          {
            "City.name": "American Cyn",
            "amount": 3003.92,
            "count": 39
          },
          {
            "City.name": "Newark",
            "amount": 1978.12,
            "count": 23
          },
          {
            "City.name": "Brentwood"
          },
          {
            "City.name": "San Mateo",
            "amount": 4779.32,
            "count": 41
          }
        ]
      },
      {
        "state": "New York"
      }
    ]
  }
}
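Notice that cities and states without donations come back without `amount` and `count` keys, as Brentwood and New York do above. A client that feeds this result into a charting library should default the missing aggregates, for example like this sketch (using an abridged copy of the result):

```python
# Abridged hierarchical result, as returned by Dgraph above.
stats = [
    {"state": "California", "amount": 19556.48, "count": 224,
     "city": [
         {"City.name": "American Cyn", "amount": 3003.92, "count": 39},
         {"City.name": "Brentwood"},
     ]},
    {"state": "New York"},
]

# Flatten the state > city hierarchy into rows, defaulting missing values.
rows = []
for state in stats:
    for city in state.get("city", []):
        rows.append({
            "state": state["state"],
            "city": city["City.name"],
            "amount": city.get("amount", 0),
            "count": city.get("count", 0),
        })
print(rows)
```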
Depending on the data analytics use case, we can filter out the cities and states without donations and order the results:
# Number and sum of donations per state and city
{
  var(func: type(State)) {
    city: ~City.state {
      ~School.city {
        School.projects {
          Project.donations {
            a as Donation.amount
          }
          c as count(Project.donations)
          s as sum(val(a))
        }
        s1 as sum(val(s))
        c1 as sum(val(c))
      }
      s2 as sum(val(s1))
      c2 as sum(val(c1))
    }
    s3 as sum(val(s2))
    c3 as sum(val(c2))
  }
  stats(func: type(State), orderdesc: val(c3)) @filter(gt(val(c3), 0)) {
    state: State.name
    amount: val(s3)
    count: val(c3)
    city: ~City.state @filter(gt(val(c2), 0)) (orderdesc: val(c2)) {
      City.name
      amount: val(s2)
      count: val(c2)
    }
  }
}
Aggregation on Multiple Dimensions
To get the number of donations per school and category, we cannot follow a single path across the graph, since schools and categories are not directly related. A single graph query cannot produce a result organized as school > category > count or category > school > count.
In this case, we need to split the analysis into multiple queries.
For example, let's create a parameterized query that returns the total donation amount per school for a given category:
# Total donations per school in a given category
query stat_per_school_for_category($category: string) {
  # e.g. "Literacy & Language"
  var(func: eq(Category.name, $category)) {
    c1_projects as ~Project.category {
      c1_schools as Project.school
    }
  }
  var(func: uid(c1_schools)) {
    School.projects @filter(uid(c1_projects)) {
      Project.donations {
        d as Donation.amount
      }
      s as sum(val(d))
    }
    c1_s as sum(val(s))
  }
  stats(func: uid(c1_s)) {
    School.name
    total_donation: val(c1_s)
  }
}
Note that we can compute the same result using pattern matching with the `@cascade` directive in DQL:
# Total donations per school in a given category
query stat_per_school_for_category($category: string) {
  # e.g. "Literacy & Language"
  var(func: type(School)) {
    School.projects @cascade {
      Project.category @filter(eq(Category.name, $category))
      Project.donations {
        d as Donation.amount
      }
      s as sum(val(d))
    }
    c1_s as sum(val(s))
  }
  stats(func: uid(c1_s)) {
    School.name
    total_donation: val(c1_s)
  }
}
Note: Using the metrics information provided by Dgraph, you can analyze the performance of your queries and select the approach that works best for your use case.
From a client application, we can now query the list of categories and use the parameterized query to get the data per school.
Here is an example in Python:
query_categories = '''
{
  categories(func: has(Category.name)) {
    label: Category.name
  }
}
'''

query = '''
query stat_per_school_for_category($category: string) {
  var(func: eq(Category.name, $category)) {
    c1_projects as ~Project.category {
      c1_schools as Project.school
    }
  }
  var(func: uid(c1_schools)) {
    School.projects @filter(uid(c1_projects)) {
      Project.donations {
        d as Donation.amount
      }
      s as sum(val(d))
    }
    c1_s as sum(val(s))
  }
  stats(func: uid(c1_s)) {
    School.name
    total_donation: val(c1_s)
  }
}
'''

res = client.txn(read_only=True).query(query_categories)
categories = json.loads(res.json)["categories"]

# Run the parameterized query once per category
rows = []
for category in categories:
    res = client.txn(read_only=True).query(query, variables={"$category": category["label"]})
    for entry in json.loads(res.json)["stats"]:
        rows.append({
            "Category": category["label"],
            "School": entry["School.name"],
            "Total Donation": entry.get("total_donation", 0),
        })

# Convert to DataFrame
df = pd.DataFrame(rows)

# Create a bubble chart
fig = px.scatter(
    df,
    x="School",
    y="Category",
    size="Total Donation",
    hover_name="Total Donation",
    title="Donations to schools per category",
    size_max=30,  # maximum bubble size
)

# Show the chart
fig.show()
Best Practices for Using Dgraph for OLAP Data Analysis
In this post, we've explored some approaches to query Dgraph and produce OLAP-style analysis.
Here is a summary of the design best practices we have used:
- Use relationships to node types instead of scalar attributes for dimensions
- Create hierarchies as entity relationships
- Index relationships with `@reverse` to follow directed relationships from their target to their source
- Add a `count` index on relationships used in the `count()` function
- Create parameterized queries for your analytics use cases
Conclusion
Dgraph's flexible graph model and powerful DQL capabilities make it a great choice for analytical use cases. By leveraging its inherent relationships, variables, and aggregation functions, you can create insightful and efficient OLAP-style analyses directly within Dgraph.
Whether it's basic counts, hierarchical aggregations, or multi-dimensional data, Dgraph offers a seamless and performant solution for your analytical needs.
Interested in trying it out for yourself? Get started with Dgraph today and experience why developers love the only open source, AI-ready graph database.