JANUARY 30 2025

How to Use Dgraph for OLAP data analysis

A practical guide to performing OLAP data analysis with Dgraph

Raphaël Derbier
Staff Architect, Hypermode

When it comes to data analytics, you don't always have the luxury of using the best-of-the-best tools for every task.

Whatever tool gets the job done today beats waiting for a better one you don't have on hand, whether due to time, resources, or availability. A perfect example is the split between the tools typically used for OLTP and OLAP.

In this blog post, we'll explore how Dgraph — a graph database optimized for Online Transaction Processing (OLTP) and deeply nested queries — can also be used effectively for Online Analytical Processing (OLAP) use cases. We'll highlight Dgraph's analytical capabilities through examples and practical techniques for designing analytical solutions without needing an additional OLAP solution.

Let's dive in.

What Is OLTP vs. OLAP?

OLTP (Online Transaction Processing) focuses on processing day-to-day transactions, while OLAP (Online Analytical Processing) is geared toward analyzing data from multiple sources to support business decision-making.

Dgraph — though primarily designed for OLTP — has robust features that make it capable of addressing OLAP needs by leveraging its graph structure and DQL (Dgraph Query Language).

Relationships Form the Dimensionality

In Dgraph, relationships between nodes naturally form the dimensions required for OLAP-style analysis.

DQL's aggregation and math functions, combined with thoughtful graph design, allow you to create a comprehensive analytical solution directly within Dgraph.

The examples below use a dataset about donations to public schools in the U.S., built from public data provided by DonorsChoose.org in a Kaggle dataset. You can also find the data ready to load into Dgraph in this Dgraph benchmarks GitHub repository.
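
For reference, the queries in this post assume a graph model along the lines of the sketch below. This is an approximation, not the exact schema from the benchmark repository: the @reverse directives enable the reverse-edge traversals (~Project.school, ~School.city, and so on) used later, and @count speeds up edge counting.

# Partial DQL schema sketch for the donations dataset (an assumption)
School.name:       string @index(term) .
School.city:       uid @reverse .
School.projects:   [uid] .
Project.school:    uid @reverse .
Project.category:  uid @reverse .
Project.donations: [uid] @count .
Donation.amount:   float .
Donation.project:  uid .
City.name:         string .
City.state:        uid @reverse .
State.name:        string .
Category.name:     string @index(exact) .

type School {
  School.name
  School.city
  School.projects
}
# ...similar type definitions for Project, Donation, City, State, and Category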

Basic Count

# number of projects per school
{
  stats(func:type(School)) {
     School.name
     count(~Project.school)
  }
}

Dgraph returns the school names and the count result:

{
  "data": {
    "stats": [
      {
        "School.name": "Abbott Middle School",
        "count(~Project.school)": 16
      },
      {
        "School.name": "Lincoln Elementary School",
        "count(~Project.school)": 7
      },
      {
        "School.name": "Rosemont Early Education Center",
        "count(~Project.school)": 5
      },
      ...
    ]
  }
}

DQL's block structure and alias feature allow you to align query responses with a format easily consumable by a visualization tool.

For instance, to use the query result in a Python script with Plotly, you can modify the query like this:

# number of projects per school
{
  school(func:type(School)) {
     category:School.name
     value:count(~Project.school)
  }
}

Using this result, you can create a bar chart in Python:

import pydgraph
import plotly.express as px
import json
import pandas as pd

# Connect to a Dgraph instance (the address below is an assumption; adjust it
# for your deployment)
client_stub = pydgraph.DgraphClientStub('localhost:9080')
client = pydgraph.DgraphClient(client_stub)

def bar_chart(payload, title='bar chart'):
    keys = payload.keys()
    serie = list(keys)[0]
    df = pd.json_normalize(payload[serie])
    cols = df.columns.to_list()
    cols.remove('category')

    fig = px.bar(df,
                 y='category',
                 x=cols,
                 title=title,
                 orientation='h',
                 labels={"category":serie,"value": cols[0]},  # Customize x-axis name
                 text_auto=True  # Automatically display bar values
    )

    return fig

query = '''
# number of projects per school
{
  school(func:type(School)) {
     category:School.name
     value:count(~Project.school)
  }
}
'''

res = client.txn(read_only=True).query(query)
res = json.loads(res.json)
fig = bar_chart(res, "Number of projects per school")
fig.show()

[Bar chart: number of projects per school]

Aggregations & Variables

Dgraph variables add flexibility by enabling filtering, ordering, and querying additional data.

Value variables in DQL are map structures: a variable maps the UIDs of the enclosing block to the corresponding values. Value variables are used either by extracting the UIDs with uid(var-name) or by getting the variable's value in the query context using val(var-name).

The query context is a node in the graph. If the current node UID is not in the variable map, the value is undefined and not displayed in the result.

Here's an example that counts projects per school and orders them by project count:

# number of projects per school ordered by project count
{
  var(func:type(School)) {
     c as count(~Project.school)
  }
  serie(func:uid(c), orderdesc:val(c)) {
     category:School.name
     project_count: val(c)
  }
}

The var block defines variables but does not return data in the response. The named block then uses the variable data along with additional node information.

Grouping & Filtering by Dimensions

Dgraph's @groupby directive allows for powerful OLAP-style groupings.

Here's an example of counting nodes by type:

# groupby a scalar predicate
# count of nodes per dgraph.type (attribute)

{
   stats(func:has(dgraph.type)) @groupby(dgraph.type) {
      count:count(uid)
  }
}
{
  "data": {
    "stats": [
      {
        "@groupby": [
          {
            "dgraph.type": "City",
            "count": 10
          },
          {
            "dgraph.type": "School",
            "count": 10
          },
          {
            "dgraph.type": "State",
            "count": 13
          },
          {
            "dgraph.type": "Category",
            "count": 15
          },
          {
            "dgraph.type": "Project",
            "count": 73
          },
          {
            "dgraph.type": "Donor",
            "count": 190
          },
          {
            "dgraph.type": "Donation",
            "count": 224
          }
        ]
      }
    ]
  }
}

In the previous example, we group the data using dgraph.type, a scalar predicate, so the groups are keyed by string value.

The @groupby directive is more powerful when we group by a relationship to other nodes: in that case, the aggregated values can be associated with the related nodes through a query variable and reused.

For example, in our data model, a Project has a relationship to a Category entity. We can group projects by category using the following query:

# count of projects per category (relationship)
{
   stats(func:has(<Project.category>)) @groupby(Project.category) {
      count:count(uid)
  }
}
{
  "data": {
    "stats": [
      {
        "@groupby": [
          {
            "Project.category": "0x4b61b90",
            "count": 1
          },
          ...
          {
            "Project.category": "0x4b61b9a",
            "count": 10
          },
          {
            "Project.category": "0x4b61b94",
            "count": 13
          },
          {
            "Project.category": "0x4b61b92",
            "count": 22
          }
        ]
      }
    ]
  }
}

The grouping is now expressed in node UIDs: each element in the response corresponds to a node. We can therefore capture the counts in a Dgraph variable and use val(), ordering, and filters to better structure the query:

# count of projects per category for categories with at least 10 projects
# ordered by number of projects

{
  var(func:has(<Project.category>)) @groupby(Project.category) {
      ProjectPerCategory as count:count(uid)
  }
  stat(func:uid(ProjectPerCategory),orderdesc:val(ProjectPerCategory))
  @filter(ge(val(ProjectPerCategory),10)) {
     Category.name
     project_count: val(ProjectPerCategory)
  }
}
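
As with the earlier school counts, this result can be fed straight into a chart. Here is a small sketch that reuses the bar_chart helper and the pydgraph client defined earlier, with the fields aliased to category and value so they match what the helper expects:

query = '''
# count of projects per category for categories with at least 10 projects
{
  var(func: has(<Project.category>)) @groupby(Project.category) {
      ProjectPerCategory as count(uid)
  }
  stat(func: uid(ProjectPerCategory), orderdesc: val(ProjectPerCategory))
  @filter(ge(val(ProjectPerCategory), 10)) {
     category: Category.name
     value: val(ProjectPerCategory)
  }
}
'''

res = json.loads(client.txn(read_only=True).query(query).json)
fig = bar_chart(res, "Number of projects per category")
fig.show()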

Aggregation with Multiple Hops

Aggregations are not limited just to directly connected entities (i.e., only one hop).

For example, let's find the top three cities by average donation, reporting the number of donations, the sum of donations, and the average donation for each.

In our graph data model, we have the following chain of relationships: City ← School ← Project ← Donation (each Donation points to a Project, each Project to a School, and each School to a City).

Aggregating Donation data per City requires us to follow three hops. Let's try a few different approaches.

Approach 1: Nested Aggregation

The first approach is to use the fact that DQL queries are nested by nature: query the graph-nested structure until you can access the value to aggregate and compute aggregated values of variables at different nested levels.

{
  var(func:type(City)) {
      ~School.city {
        School.projects {
            Project.donations {
               a as Donation.amount
            }
            s1 as sum(val(a))
            c1 as count(Project.donations)
        }
        s2 as sum(val(s1))
        c2 as sum(val(c1))
      }
      s3 as sum(val(s2))
      c3 as sum(val(c2))
      avg as math(s3 / c3)
  }
  stats(func:uid(c3),orderdesc:val(avg),first:3) {
     City.name
     sum_donation:val(s3)
     count_donation:val(c3)
     avg_donation:val(avg)
  }
}

c1 as count(Project.donations) is computed at the School.projects level. a as Donation.amount is computed at a lower level of the tree, inside Project.donations.

Each variable is re-aggregated at the parent level to create a new variable of the right node type:

  • s1 and c1 are variables of Project type (i.e., s1 maps the UIDs of Project nodes to values).
  • s2 and c2 are variables of School type.
  • s3 and c3 are variables of City type.

The query returns the following result:

{
  "data": {
    "stats": [
      {
        "City.name": "Stockton",
        "sum_donation": 3857.38,
        "count_donation": 16,
        "avg_donation": 241.08625
      },
      {
        "City.name": "San Mateo",
        "sum_donation": 4779.32,
        "count_donation": 41,
        "avg_donation": 116.56878
      },
      {
        "City.name": "San Jose",
        "sum_donation": 267,
        "count_donation": 3,
        "avg_donation": 89
      }
    ]
  }
}

Approach 2: Variable Propagation

The second approach to multi-hop aggregation uses variable propagation.

In this case, we use the fact that Dgraph accumulates variables when traversing the graph (see the variable propagation documentation).

{
  var(func:type(Donation)) {
    s1 as Donation.amount
    c1 as math(1)
    Donation.project {
      Project.school {
        School.city {
          s as math(s1)
          c as math(c1)
          avg as math(s1/c1)
        }
      }
    }
  }

  stats(func:type(City),orderdesc:val(avg),first:3) {
    City.name
    sum_donation:val(s)
    count_donation:val(c)
    avg_donation:val(avg)
  }
}

s1 as Donation.amount is the amount for each donation. When reaching the nested block Donation.project {}, Dgraph automatically sums this variable, which is now the sum of Donation.amount for each project. The variable is again propagated to the nested level Project.school {}.

Finally, inside the School.city {} block, the variable is now the sum of Donation.amount for all the paths reaching a city. As expected, the result is the same.

Let's take a look at the query metrics for both approaches:

Approach 1 (nested aggregation):
{
  "extensions": {
    "server_latency": {
      "encoding_ns": 111750,
      "parsing_ns": 287250,
      "processing_ns": 4670001,
      "assign_timestamp_ns": 815625,
      "total_ns": 6111500
    },
    "txn": {
      "start_ts": 348486
    },
    "metrics": {
      "num_uids": {
        "": 0,
        "City.name": 3,
        "Donation.amount": 224,
        "Project.donations": 146,
        "Project.school": null,
        "School.city": null,
        "School.projects": 10,
        "_total": 393,
        "dgraph.type": 0,
        "~School.city": 10
      }
    }
  }
}

Approach 2 (variable propagation):
{
  "extensions": {
    "server_latency": {
      "encoding_ns": 194750,
      "parsing_ns": 487124,
      "processing_ns": 10162374,
      "assign_timestamp_ns": 2094542,
      "total_ns": 13222375
    },
    "txn": {
      "start_ts": 348479
    },
    "metrics": {
      "num_uids": {
        "": 0,
        "City.name": 3,
        "Donation.amount": 224,
        "Project.donations": null,
        "Project.school": 55,
        "School.city": 9,
        "School.projects": null,
        "_total": 515,
        "dgraph.type": 0,
        "~School.city": null
      }
    }
  }
}

In the variable propagation approach, we traverse from Donation to Project, and since there are 224 donations, we follow this relationship 224 times. In the nested aggregation approach, we navigate from Project to Donation and touch fewer UIDs overall (393 versus 515 in the metrics above): the algorithm visits Project nodes more often with variable propagation.

Note that adding a count index on the Project.donations predicate significantly improves this query's performance.
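
For example, a schema alteration along these lines adds the count index. This is a sketch using the pydgraph client from the earlier examples; the predicate definition should match your actual schema:

# Add a @count index so count(Project.donations) is answered from the index
# instead of walking the edges
op = pydgraph.Operation(schema='Project.donations: [uid] @count .')
client.alter(op)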

Of course, the right approach depends on your dataset and use case.

Aggregation along Hierarchical Dimensions

For this next example, let's find the number and sum of donations per state and city. The x-axis of our analysis will be a hierarchy: states, and then the cities they contain.

As this hierarchy is modeled as nodes and relationships in our graph, we can obtain the data for this analysis in a single query:

# Number of projects per state and city
{
  var(func: type(State)) {
    city: ~City.state {
      ~School.city {
        School.projects {
          Project.donations {
            a as Donation.amount
          }
          c as count(Project.donations)
          s as sum(val(a))
        }
        s1 as sum(val(s))
        c1 as sum(val(c))
      }
      s2 as sum(val(s1))
      c2 as sum(val(c1))
    }
    s3 as sum(val(s2))
    c3 as sum(val(c2))
  }

  stats(func: type(State)) {
    state: State.name
    amount: val(s3)
    count: val(c3)

    city: ~City.state {
      City.name
      amount: val(s2)
      count: val(c2)
    }
  }
}

The result will look like this:

{
  "data": {
    "stats": [
      {
        "state": "California",
        "amount": 19556.48,
        "count": 224,
        "city": [
          {
            "City.name": "American Cyn",
            "amount": 3003.92,
            "count": 39
          },
          {
            "City.name": "Newark",
            "amount": 1978.12,
            "count": 23
          },
          {
            "City.name": "Brentwood"
          },
          {
            "City.name": "San Mateo",
            "amount": 4779.32,
            "count": 41
          }
        ]
      },
      {
        "state": "New York"
      }
    ]
  }
}

Depending on the data analytics use case, we can filter out the cities and states without donations and then order the results:

# Number of projects per state and city
{
  var(func: type(State)) {
    city: ~City.state {
      ~School.city {
        School.projects {
          Project.donations {
            a as Donation.amount
          }
          c as count(Project.donations)
          s as sum(val(a))
        }
        s1 as sum(val(s))
        c1 as sum(val(c))
      }
      s2 as sum(val(s1))
      c2 as sum(val(c1))
    }
    s3 as sum(val(s2))
    c3 as sum(val(c2))
  }

  stats(func: type(State), orderdesc: val(c3)) @filter(gt(val(c3), 0)) {
    state: State.name
    amount: val(s3)
    count: val(c3)

    city: ~City.state @filter(gt(val(c2), 0)) (orderdesc: val(c2)) {
      City.name
      amount: val(s2)
      count: val(c2)
    }
  }
}
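
To visualize the hierarchy, one option is a Plotly sunburst chart with states as the inner ring and cities as the outer ring. The sketch below reuses the pydgraph client and imports from the earlier examples and assumes the filtered query above is stored in the query variable:

res = json.loads(client.txn(read_only=True).query(query).json)

# Flatten the nested state -> city structure into rows
rows = []
for state in res["stats"]:
    for city in state.get("city", []):
        rows.append({
            "State": state["state"],
            "City": city["City.name"],
            "Amount": city.get("amount", 0),
        })

df = pd.DataFrame(rows)
fig = px.sunburst(df, path=["State", "City"], values="Amount",
                  title="Donations per state and city")
fig.show()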

Aggregation on Multiple Dimensions

If we want the number of donations per school and category, we cannot simply follow a path across the graph, since schools and categories are not directly related. A single graph query cannot produce a result organized as school > category > count or category > school > count.

In this case, we need to split the analysis into multiple queries.

For example, let's create a parameterized query that returns, for a given category, the total donation amount per school:

# Total donation amount per school for a given category
query stat_per_school_for_category($category: string) {
  # Literacy & Language
  var(func: eq(Category.name, $category)) {
    c1_projects as ~Project.category {
      c1_schools as Project.school
    }
  }

  var(func: uid(c1_schools)) {
    School.projects @filter(uid(c1_projects)) {
      Project.donations {
        d as Donation.amount
      }
      s as sum(val(d))
    }
    c1_s as sum(val(s))
  }

  stats(func: uid(c1_s)) {
    School.name
    total_donation: val(c1_s)
  }
}

Note that we can compute the same result using pattern matching with a @cascade directive in DQL:

# Total donation amount per school for a given category
query stat_per_school_for_category($category: string) {
  # Literacy & Language
  var(func: type(School)) {
    School.projects @cascade {
      Project.category @filter(eq(Category.name, $category))
      Project.donations {
        d as Donation.amount
      }
      s as sum(val(d))
    }
    c1_s as sum(val(s))
  }

  stats(func: uid(c1_s)) {
    School.name
    total_donation: val(c1_s)
  }
}

Note: Using the metrics information provided by Dgraph, you can analyze the performance of your queries and select the approach that works best for your use case.
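
As a quick illustration, the sketch below retrieves those metrics over Dgraph's HTTP endpoint (assuming a recent Dgraph version with a local Alpha listening on port 8080) by posting a DQL query and reading the extensions field of the response:

import requests

# Post a DQL query to the Alpha's /query endpoint; the JSON response includes
# an "extensions" object with server_latency and num_uids metrics
dql = '''
{
  stats(func: type(School)) {
    School.name
    count(~Project.school)
  }
}
'''

resp = requests.post(
    "http://localhost:8080/query",
    data=dql,
    headers={"Content-Type": "application/dql"},
)
body = resp.json()
print(body["extensions"]["server_latency"])
print(body["extensions"]["metrics"]["num_uids"])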

From a client application, we can now query the list of categories and use the parameterized query to get the data per school.

Here is an example in Python:

query_categories = '''
{
  categories(func:has(Category.name)) {
    label:Category.name
  }
}
'''

query = '''
query stat_per_school_for_category($category: string) {
  # Literacy & Language
  var(func: eq(Category.name, $category)) {
    c1_projects as ~Project.category {
      c1_schools as Project.school
    }
  }

  var(func: uid(c1_schools)) {
    School.projects @filter(uid(c1_projects)) {
      Project.donations {
        d as Donation.amount
      }
      s as sum(val(d))
    }
    c1_s as sum(val(s))
  }

  stats(func: uid(c1_s)) {
    School.name
    total_donation: val(c1_s)
  }
}
'''

res = client.txn(read_only=True).query(query_categories)
res = json.loads(res.json)

# Run the parameterized query for each category and collect the rows
rows = []
for category in res["categories"]:
    stats_res = client.txn(read_only=True).query(query, variables={"$category": category["label"]})
    stats = json.loads(stats_res.json)

    for entry in stats["stats"]:
        rows.append({
            "Category": category["label"],
            "School": entry["School.name"],
            "Total Donation": entry.get("total_donation", 0),
        })

# Convert to DataFrame
df = pd.DataFrame(rows)

# Create Bubble Chart
fig = px.scatter(
    df,
    x="School",
    y="Category",
    size="Total Donation",
    hover_name="Total Donation",
    title="Number of donations in schools per category",
    size_max=30,  # Maximum bubble size
)

# Show the chart
fig.show()

[Bubble chart: total donations per school and category]

Best Practices for Using Dgraph for OLAP Data Analysis

In this post, we've explored some approaches to query Dgraph and produce OLAP-style analysis.

Here is a summary of the design best practices we have used:

  • Use relationships to node types instead of scalar attributes for the dimensions
  • Model hierarchies as entities and relationships
  • Index relationships with @reverse to follow directed relationships from their target to their source
  • Add a @count index on relationships used with the count() function
  • Create parameterized queries for your analytics use cases

Conclusion

Dgraph's flexible graph model and powerful DQL capabilities make it a great choice for analytical use cases. By leveraging its inherent relationships, variables, and aggregation functions, you can create insightful and efficient OLAP-style analyses directly within Dgraph.

Whether it's basic counts, hierarchical aggregations, or multi-dimensional data, Dgraph offers a seamless and performant solution for your analytical needs.

Interested in trying it out for yourself? Get started with Dgraph today and experience why developers love the only open source, AI-ready graph database.