AUGUST 8 2019
Datetime Indexes in Dgraph

I recently started working at Dgraph Labs in Bengaluru as a Software Engineer. One of the first issues that I worked on was related to the Dgraph's datetime datatype. This article covers how I discovered and resolved the issue. The article assumes a basic understanding of Dgraph, but you can learn everything you need in the Docs.
Bug Discovery
While working on Flock, I observed that timestamp comparisons were not working correctly in Dgraph when the data or query had timezone information in it. To see the issue, let's first quickly set up Dgraph using the commands below.
The easiest way is to use Docker Compose.
Create a docker-compose.yml
file with the contents below.
version: "3.2"
services:
zero:
image: dgraph/dgraph:v1.0.16
restart: on-failure
command: dgraph zero --my=zero:5080
server:
image: dgraph/dgraph:v1.0.16
ports:
- 8080:8080
restart: on-failure
command: dgraph alpha --my=server:7080 --lru_mb=2048 --zero=zero:5080
ratel:
image: dgraph/dgraph:v1.0.16
ports:
- 8000:8000
command: dgraph-ratel
Next, from the folder containing the newly created file, run:
docker-compose up
We will use Ratel to showcase the issue. Ratel allows you to run queries, mutations and manage the Dgraph cluster. To open Ratel, you can go to the URL http://localhost:8000/?latest in your browser after running the commands above.
Now, let's set up the schema. To set up the schema, you have to go to the
Schema
tab on the left navigation bar in Ratel, click on Bulk Edit
and paste
the following schema:
created_at : datetime @index(hour) .
Then, click on Apply Schema
. This will create a new predicate created_at
of
type datetime
.
Now you can perform the following mutation to add data to Dgraph in the
Console
tab after choosing Mutate
option. This will store two tweets with a
creation time and author name.
{
"set": [
{
"uid" :"_:user1",
"created_at": "2019-03-28T14:00:00-06:00",
"author_name": "Rahul"
},
{
"uid" :"_:user2",
"created_at": "2019-03-28T18:00:00+01:00",
"author_name": "Ashish"
}
]
}
Finally, run the following query to get all the tweets that were created after
UTC March 28th, 3 PM
. You can run queries in the Console
after selecting
Query
option:
{
tweets(func: gt(created_at, "2019-03-28T15:00:00+00:00")) {
created_at
uid
}
}
This is the response we get:
{
"extensions": { ... },
"data": {
"tweets": [
{
"created_at": "2019-03-28T18:00:00+01:00",
"uid": "0x2"
}
]
}
}
The created_at
timestamp for the both the users is as follows. For brevity, we
will remove the date (2019-03-28
) from timestamps in the rest of the article.
| | Created At | In UTC | Greater than query ts (15h +00:00) | | :---: | :--------: | :--------: | :--------------------------------: | | user1 | 14h -06:00 | 20h +00:00 | True | | user2 | 18h +01:00 | 17h +00:00 | True |
Even when the created_at
timestamps for both the users are larger than the
timestamp in the query, only user2
shows up in the response of the query.
Before I explain how we found and fixed the root cause of the issue, let me
first explain how indexes work in Dgraph.
Indexes in Dgraph
Along with JSON, Dgraph takes RDF data format as input which looks like
<Subject> <Predicate> <Object>
(SPO Triples). The Subject
is
generally a UID
value representing a node in the graph whereas the Predicate
represents a relationship of the node with the Object
value. Indexes are
created on a given Predicate
to quickly find nodes in the graph for the
provided Object
value.
Dgraph indexes are stored in BadgerDB,
an embedded key-value store written purely in Go. For each object value, one or
more tokens are generated based on the provided tokenizer. For example, in the
case of a hash
tokenizer, the tokenizer computes the hash of the Object
value and returns it as a token.
Once the tokens are generated, data is stored in badgerDB in the following
format. Each key in badgerDB, is a combination of predicate and generated token,
and the value is the list of UIDs that match the predicate and the token (the
key). An example is shown below for created_at
predicate:
| key | value | | :----------------------------: | :--------------: | | < created_at, Tokenizer-Tok1 > | List(uid1, uid3) | | < created_at, Tokenizer-Tok2 > | List(uid2) |
We prepend a unique Tokenizer
identifier in the key (whether it's a hash
tokenizer, int tokenizer, hour tokenizer etc.) to the token generated, as shown
above, to be able to traverse through all the tokens belonging to only that
particular tokenizer (index).
Datetime Tokenizers
For a datetime index, Dgraph provides tokenizers for hour
, day
, month
and
year
granularity. Each key generated by the tokenizer can be treated as a time
bucket of specified granularity, storing the UIDs with a date falling within
that bucket. The granularity of a datetime index should be carefully chosen to
keep the size of each bucket reasonable. Choosing a lower granularity time index
would wipe out the performance benefits of UID consolidation into time buckets
(think one entry per bucket). On the other hand, choosing very high
granularity would end up putting too many UIDs within the same bucket, which
could also affect performance (millions of entries per bucket).
Root Cause
In order to generate tokens for datetime datatype, we use the Hour()
, Day()
,
Month()
and Year()
methods of the time
package as you can see in the
(source code).
These methods return the information accounting for the timezone information in
the timestamp. For example, for timestamp 14h +06:00
, the Hour()
method
returns the value 14
and for timestamp 14h +00:00
, the Hour()
method still
returns the same value 14
. Due to this behavior, the index is not correctly
built.
In our example above, the data has two timestamps 1) 14h -06:00
and 2)
18h +01:00
and we have created an hour
index on the created_at
predicate.
The hour
index incorrectly stores roughly the following information:
<created_at, HourTokenizer-2019-03-28T14> -> List(0x01)
<created_at, HourTokenizer-2019-03-28T18> -> List(0x02)
The query is trying to find out all users having created_at
timestamp bigger
than 15h +00:00
. While processing the query, Dgraph computes the same hour
tokenizer for this timestamp to be 2019-03-28T15
. Then, it scans through all
the users in the index having the token bigger than 2019-03-28T15
and only
returns the user with uid 0x02
.
Fixing the Bug
Once the issue was clear, the fix was pretty straight forward. Instead of
building tokens directly on the given timestamp, we first convert the timestamp
into UTC()
timezone and then compute various tokens. This normalizes all
timestamps into single timezone and the Hour()
, Day()
, Month()
and
Year()
information can be correctly compared. The PR for this fix is available
here. Once this change is
included, the index looks like this instead:
<created_at, HourTokenizer-2019-03-28T17> -> List(0x01)
<created_at, HourTokenizer-2019-03-28T20> -> List(0x02)
The timestamp in the query now generates the token as 2019-03-28T15
and we see
both the users while scanning through the index. The correct response,
therefore, looks like as follows:
{
"extensions": { ... },
"data": {
"tweets": [
{
"created_at": "2019-03-28T14:00:00-06:00",
"uid": "0x1"
},
{
"created_at": "2019-03-28T18:00:00+01:00",
"uid": "0x2"
}
]
}
}
You can see that both the users are included in this response.
Conclusion
In this article, we discussed how indexes are implemented in Dgraph. I also showed how I discovered and resolved the issue with datetime indexes. While backporting it to v1.0 series would require change in data format on disk, the fix will be available in v1.1 which we plan to release end of this month. Stay tuned!