MAY 31 2024
Add OpenAI, Mistral or open-source embeddings to your knowledge graph.
“Use vector embeddings in Dgraph.”

Dgraph v24 introduces the long-awaited merge of vector and graph databases: vector support, an HNSW index, and similarity search in both DQL and GraphQL. It is a major step toward supporting GenAI RAG, classification, entity resolution, semantic search, and many other AI and graph use cases.
Vector databases are vector first: they let you find similar vectors, but what you actually need is real data. That data is stored either as a payload associated with the vector or as a reference ID. With the payload approach, creating multiple vectors for the same data introduces duplicates and synchronization issues. With the reference-ID approach, you need extra queries to fetch the data you need.
Dgraph is entity first: you can add many vector predicates to the same entity type. For example, a Product may have one vector embedding built from the text description and another created from the product image. When searching for similarity, you get back similar entities, i.e. the data and not only the vector, so you don't need extra queries to get the information you need. The entities found are part of the graph, so you can also query any of their relationships in the same graph request.
Dgraph is a full database and does not have the limitations of in-memory solutions: vectors are treated like any other predicate, stored and indexed in the core database.
This blog post shows how to get started with vector embeddings in Dgraph using OpenAI, Mistral, or Hugging Face embedding models. It details how a Product embedding has been added to an existing Dgraph instance, as shown in the video example.
Adding a vector predicate to an existing entity type
In our example, the following minimal GraphQL schema is deployed in Dgraph, and the database is populated with existing Products:

```graphql
type Product {
  id: String! @id
  description: String @search(by: [term])
  title: String @search(by: [term])
  imageUrl: String
}
```
With v24 we can declare a new vector predicate and specify an index using the @search directive. Vector predicates support the hnsw index with the euclidean, cosine, or dotproduct metric. A vector is a predicate of type [Float!] with the directive @embedding.
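To make the metric choice concrete, here is a small Python sketch of the three distance computations. This is just the underlying math, not Dgraph code; Dgraph computes these internally when you query the index.

```python
import math

def euclidean(a, b):
    # Straight-line distance; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; ignores vector magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def dot_product(a, b):
    # Larger means more similar; assumes comparable magnitudes.
    return sum(x * y for x, y in zip(a, b))

# Orthogonal unit vectors: euclidean sqrt(2), cosine distance 1, dot product 0.
v1, v2 = [1.0, 0.0], [0.0, 1.0]
```

Euclidean is a good default for normalized embeddings; cosine is preferred when vector magnitudes vary; dotproduct suits models trained with inner-product objectives.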
Here is the updated GraphQL schema:

```graphql
type Product {
  id: String! @id
  description: String @search(by: [term])
  title: String @search(by: [term])
  imageUrl: String
  characteristics_embedding: [Float!]
    @embedding
    @search(by: ["hnsw(metric: euclidean, exponent: 4)"])
}
```
Notes
- You can add more than one embedding to an entity type.
- You don't specify the vector size; the first mutation sets it. If you are using an embedding model producing vectors of size 384, for example, all values of the predicate must have the same dimension. If you decide to change the embedding model, you can easily drop all the predicate values and recompute the embeddings of your entities with the new model, which may produce vectors of a different dimension.
- When deploying the updated schema, your existing data is untouched; you have just added a new predicate and a vector index.
For our test with a local instance of Dgraph, we simply deploy the schema using:

```shell
curl -X POST http://localhost:8080/admin/schema --silent --data-binary '@./schema.graphql'
```
GraphQL API
Dgraph uses the deployed GraphQL schema to expose a GraphQL API with queries, mutations, and subscriptions for the declared types. For each entity type with at least one vector predicate, Dgraph v24 generates two new queries:

- querySimilar<Entity>ByEmbedding returns the topK entities closest to a given vector. The typical use case is semantic or natural-language search: the client application computes the vector from a sentence, i.e. a request expressed in natural language, using the same model used for the entities' embeddings.
- querySimilar<Entity>ById returns the topK entities closest to a given entity. The typical use case is recommendation systems using similarity search.
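As a sketch of how a client could call the first query once embeddings exist: the helper below builds the GraphQL request payload. The endpoint URL and the helper names are illustrative assumptions, not part of the pydgraph examples.

```python
import json
import urllib.request

GRAPHQL_ENDPOINT = "http://localhost:8080/graphql"  # local Dgraph GraphQL endpoint

def build_similar_by_embedding(vector, top_k=10):
    # Build the payload for the query Dgraph generates for the Product type.
    query = """
    query QuerySimilarProductByEmbedding($vector: [Float!]!) {
      querySimilarProductByEmbedding(
        by: characteristics_embedding
        topK: %d
        vector: $vector
      ) {
        id
        title
        vector_distance
      }
    }
    """ % top_k
    return {"query": query, "variables": {"vector": vector}}

def run(payload):
    # POST the query; requires a running Dgraph instance.
    req = urllib.request.Request(
        GRAPHQL_ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# The vector would come from the same model used for the entities' embeddings:
# payload = build_similar_by_embedding(model.encode("warm hiking boots").tolist())
# result = run(payload)
```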
Before experimenting with those new queries in the GraphQL API, we need to populate our graph with embeddings.
Adding embeddings
We are using a Python script from the examples folder of the pydgraph repository.
The script is provided as-is, as an example. Adapt the logic to your needs.
The logic of the shared Python script is as follows:
- Use paginated queries so there is no size limit.
- Use an embedding config file.
- Find all entities of a given type. We have two options: get all entities, or get only the entities for which the vector predicate is not present. The latter lets us run the script again to process newly added entities.
- For each entity, use a DQL query to retrieve the predicates needed.
- Create a text prompt from the values of the retrieved predicates and a text template (with mustache notation, rendered by pybars).
- Compute the vector embedding of the prompt using an OpenAI, Mistral, or Hugging Face model.
- Mutate the vector value in Dgraph.
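The pagination step can be sketched as follows. This is a simplified stand-in for the script's logic; fetch_page is a placeholder for the actual DQL query with first/offset arguments.

```python
def paginate(fetch_page, page_size=100):
    # Repeatedly call fetch_page(first, offset) until a short or empty
    # page is returned, so no single query hits a size limit.
    offset = 0
    while True:
        page = fetch_page(first=page_size, offset=offset)
        if not page:
            break
        for entity in page:
            yield entity
        if len(page) < page_size:
            break
        offset += page_size

# Example with an in-memory stand-in for the DQL query:
data = [{"uid": f"0x{i}"} for i in range(250)]
fake_fetch = lambda first, offset: data[offset:offset + first]
# paginate(fake_fetch) yields all 250 entities in pages of 100, 100, and 50.
```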
For our Product we defined the following embedding configuration:

```json
{
  "embeddings": [
    {
      "entityType": "Product",
      "attribute": "characteristics_embedding",
      "index": "hnsw(metric: \"euclidean\")",
      "provider": "huggingface",
      "model": "sentence-transformers/all-MiniLM-L6-v2",
      "config": {
        "dqlQuery": "{ title:Product.title }",
        "template": "{{title}}"
      }
    }
  ]
}
```
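To illustrate how the dqlQuery result and the mustache template combine into a prompt, here is a minimal sketch. It uses a simple regex substitution as a stand-in for the pybars compiler the script actually uses, and the sample Product title is made up.

```python
import re

def render(template, values):
    # Replace each {{name}} placeholder with the corresponding value,
    # a simplified stand-in for pybars' mustache rendering.
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: str(values.get(m.group(1), "")),
                  template)

# The dqlQuery "{ title:Product.title }" aliases Product.title to "title",
# so a retrieved entity looks like this:
entity = {"title": "Hiking boots"}

# The rendered prompt is what gets sent to the embedding model.
prompt = render("{{title}}", entity)
```

With richer templates, e.g. "{{title}}. {{description}}", several predicates can be combined into one prompt per entity.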
Note that the script uses a DQL query on data generated from a GraphQL schema. You can learn more about this topic in the doc section GraphQL - DQL interoperability.
In a terminal window, declare the Dgraph gRPC endpoint. For example, for a local instance:

```shell
export DGRAPH_GRPC=localhost:9080
```

If needed, for cloud instances, declare an admin client key:

```shell
export DGRAPH_ADMIN_KEY=<Dgraph cloud admin key>
```

and simply run the script:

```shell
python ./computeEmbeddings.py
```
We are using Python 3.11 with:
- openai 1.27.0
- mistralai 0.1.8
- pybars3 0.9.7
- sentence-transformers 2.2.2
Similarity Queries
Having vector predicates populated with your embeddings is all you need to perform similarity queries using the auto-generated queries in the GraphQL API.
In our example, we identified one Product, with id 059446790X, and performed a similarity search:
```graphql
query QuerySimilarProductById {
  querySimilarProductById(
    id: "059446790X"
    by: characteristics_embedding
    topK: 10
  ) {
    id
    title
    vector_distance
  }
}
```
Note that you specify in the query the predicate name (here characteristics_embedding) to be used for the similarity search. As previously mentioned, you may have more than one vector attached to the Product entity, and you can perform different similarity queries (similar description, similar image, etc.).
vector_distance is a generated field providing the distance between the given vector and each entity's vector. It can be used to compute a similarity score or to apply thresholds.
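For example, a client could turn vector_distance into a bounded similarity score and filter on it. This is client-side post-processing, not something Dgraph computes for you, and the 1/(1+d) mapping is one common convention among several; the sample results below are made up.

```python
def similarity_score(distance):
    # Map a non-negative distance to (0, 1]; identical vectors score 1.0.
    return 1.0 / (1.0 + distance)

def filter_by_threshold(results, min_score=0.5):
    # Keep only results whose similarity score meets the threshold.
    return [r for r in results
            if similarity_score(r["vector_distance"]) >= min_score]

# Hypothetical results from querySimilarProductById:
results = [
    {"id": "059446790X", "title": "A", "vector_distance": 0.0},
    {"id": "0594467901", "title": "B", "vector_distance": 0.8},
    {"id": "0594467902", "title": "C", "vector_distance": 3.0},
]
# filter_by_threshold(results) keeps A (score 1.0) and B (~0.56), drops C (0.25).
```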
Conclusion
Dgraph added vector support as a first-class citizen with fast HNSW index support.
Using vector predicates to store embeddings, computed by ML models such as OpenAI, Mistral, Hugging Face, or others, is a surprisingly powerful approach to many AI or NLP use cases.
In this blog post, we showed how to quickly add embeddings to existing entities stored in Dgraph. Let us know what you are building by combining the power of Dgraph and ML models.
Photo by Tuur Tisseghem