JULY 24 2024
Product recommendation using RAG on Dgraph
Combine Dgraph vector search and LLM language processing power to generate accurate product recommendations.

In this blog post, we will use the Dgraph database to store retail product information (Amazon products data) and language models to reply to users asking for a product recommendation.
The blog post provides Python code snippets to explain the main steps. The complete Jupyter Notebook and associated data folder are available for you to play with.
The language models are used in three different ways:
- we use LLM text analysis capabilities to craft a graph database query that fetches the relevant information needed to generate a response.
- we use a small model to generate and store text embeddings, used to find products, categories, brands, characteristics, etc. by semantic similarity.
- we use an LLM to generate a response to the user question based on the data retrieved from the graph.
This is a case of Retrieval Augmented Generation (RAG), and NLP (Natural Language Processing) leveraging graph structures.
Why Dgraph for AI?
Dgraph is particularly suited for knowledge graph and AI applications due to several key features and capabilities:
- Graph Database Structure: Dgraph is designed as a native graph database, which means it stores data in a graph structure consisting of nodes, edges, and properties. This is inherently aligned with the way knowledge graphs represent relationships and entities, making it easier to model complex interconnections.
- Native Vector Support: Any node may have any number of vector predicates that are indexed using the HNSW algorithm for fast similarity retrieval (see the sample schema declaration after this list).
- Scalability: Dgraph is built to scale horizontally, handling large volumes of data and high query loads efficiently. This is crucial for AI applications that often require processing vast amounts of interconnected data.
- High Performance: Dgraph provides fast query execution and low latency, which are essential for real-time AI applications. Its performance optimizations, such as parallel query execution and efficient data storage, make it capable of handling demanding workloads.
- Flexible Schema: Dgraph supports flexible schema definitions, allowing for dynamic data models that can evolve. This is beneficial for AI applications where the data schema might need to adapt to new requirements or insights.
- Rich Querying Capabilities: Dgraph's query language, DQL (Dgraph Query Language), is declarative, which means that queries return a response in a similar shape to the query. DQL allows for complex graph traversals and pattern matching, which are essential for extracting insights and relationships in knowledge graphs. It also supports advanced features like recursive queries, aggregations, and, most importantly, vector similarity search.
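For reference, declaring an HNSW-indexed vector predicate takes a single line in the DQL schema. A minimal sketch, with an illustrative predicate name:

Product.embedding: float32vector @index(hnsw(metric:"euclidean")) .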
Solution overview
Create a knowledge graph consisting of product information: categories, brands, age groups, colors, measurements, materials, and characteristics. On user request:
- create an intent representing which part of the graph should be used to reply to the request.
- convert the intent into a Dgraph DQL query and execute the query, using similarity search for the best filtering.
- use the retrieved structured data and a proper prompt to generate the final response to the user.
Setup
Create a file .env in the folder containing this Python notebook with one line for your OpenAI API key:
OPENAI_API_KEY=sk-....
We just need some Python packages for Dgraph, OpenAI, Hugging Face, and a few tools we are using.
# Optional script to install all the required packages
!pip3 install pydgraph
!pip3 install openai
!pip3 install sentence_transformers
!pip3 install pybars3
!pip3 install python-dotenv
import os
import json
import pydgraph
from pybars import Compiler
# Activate the provider you want to use for embeddings and LLM
# from openai import OpenAI
# from mistralai.client import MistralClient
from sentence_transformers import SentenceTransformer
from dotenv import load_dotenv
load_dotenv()
assert os.getenv("OPENAI_API_KEY") is not None, "Set OPENAI_API_KEY in your .env file"
Dataset
Dgraph supports the JSON and RDF formats. In this notebook we are using RDF. RDF is a powerful notation for knowledge graphs. It describes information in triples of the form Subject - Predicate - Object (S-P-O).
The original dataset is in JSON format and is 2.7 MB. We have generated an RDF file with the same information. The RDF file is only 361 KB!
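For illustration, a product and its category could be described with triples like these (the values are made up; the predicate names match the ones used later in this post):

_:p1 <dgraph.type> "Product" .
_:p1 <Product.Name> "Flower Pot Stand" .
_:p1 <Product.category> _:c1 .
_:c1 <dgraph.type> "category" .
_:c1 <category.Value> "home decoration" .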
Loading the dataset
Connecting to Dgraph
See Learning Environment to set up a Docker image with dgraph/standalone:latest, or use your on-prem or cloud instance.
dgraph_grpc = "localhost:9080"
client_stub = pydgraph.DgraphClientStub(dgraph_grpc)
client = pydgraph.DgraphClient(client_stub)
print(f"Connected to DGraph at {dgraph_grpc}")
The notebook provides a complete setup covering on-prem and cloud instances.
Housekeeping
First we clean the DB. You may want to skip this step.
# Drop all data including schema from the Dgraph instance.
# This is useful for small examples such as this one since it puts Dgraph into a clean state.
confirm = input("drop schema and all data (y/n)?")
if confirm == "y":
    op = pydgraph.Operation(drop_all=True)
    client.alter(op)
    print("schema and data deleted")
Deploying the Graph schema
In the Dgraph schema, we tell the system which indexes we want on the different predicates, declare node types, and specify relationship cardinalities.
# add predicates to Dgraph type schema
with open('data/products.schema', 'r') as file:
    dqlschema = file.read()
op = pydgraph.Operation(schema=dqlschema)
client.alter(op)
print("schema updated:")
print(dqlschema)
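We don't reproduce the whole schema file here, but a plausible excerpt, sketched from the predicates used in this post rather than copied from the file, looks like this. Note the @reverse directive on the relationship predicate: it is what enables reverse traversals such as ~Product.category in the queries below.

Product.Name: string @index(term) .
Product.Title: string .
Product.category: [uid] @reverse .
category.Value: string @index(term) .
category.embedding: float32vector @index(hnsw(metric:"euclidean")) .

type Product {
  Product.Name
  Product.Title
  Product.category
}
type category {
  category.Value
  category.embedding
}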
Importing data
As the dataset is small we can load all the data in one mutation:
def mutate_rdf(nquads, client):
    ret = {}
    body = "\n".join(nquads)
    if len(nquads) > 0:
        txn = client.txn()
        try:
            res = txn.mutate(set_nquads=body)
            txn.commit()
            ret["nquads"] = len(nquads)
            ret["total_ns"] = res.latency.total_ns
        except pydgraph.errors.AbortedError as err:
            print("AbortedError %s" % err)
        except Exception as inst:
            print(inst)
        finally:
            txn.discard()
    return ret

with open('data/products.rdf') as f:
    data = f.readlines()
mutate_rdf(data, client)
For large datasets, refer to the Import data options.
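For datasets too large for a single mutation but small enough to load from Python, a simple variant, sketched here as a hypothetical helper, is to send the N-Quads in fixed-size batches (this assumes triples sharing blank nodes end up in the same batch):

def mutate_rdf_batched(nquads, client, batch_size=1000):
    # reuse mutate_rdf on successive chunks of the RDF file
    for i in range(0, len(nquads), batch_size):
        mutate_rdf(nquads[i:i + batch_size], client)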
Simple graph query
As our data is now in a graph database, we can traverse the graph, search for nodes, count relationships, etc. To verify that we have data in the DB, let's execute a simple query to find the top 3 categories and their number of products:
query = '''
{
  var(func:type(category)) {
    np as count(~Product.category)
  }
  productsPerCategory(func:uid(np), orderdesc:val(np), first:3) {
    category:category.Value
    number_of_products:val(np)
  }
}
'''
res = client.txn(read_only=True).query(query)
res = json.loads(res.json)
print("Top 3 categories with the most products:")
print(json.dumps(res, indent=4))
The expected result is:
Top 3 categories with the most products:
{
    "productsPerCategory": [
        {
            "category": "home decoration",
            "number_of_products": 20
        },
        {
            "category": "books",
            "number_of_products": 17
        }
    ]
}
Similarity search with vector embeddings
We don't want to constrain the question to only use terms present in the database. For example, the user may want "some clothes of dark color". We need to search our graph by similarity and not only by terms. We will use the power of Dgraph vectors and language model vector embeddings.
Creating vector indexes
Dgraph is a graph database with native vector support, an HNSW index, and similarity search. For this use case, we use the Python script shared in the blog post Add OpenAI, Mistral or open-source embeddings to your knowledge graph to compute and add vector embeddings to all our entities.
For example, with an embedding on the color entities, we will be able to search for colors similar_to "dark color".
Refer to the notebook for the details of the embedding logic.
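buildEmbeddings comes from that blog post. A simplified sketch of what it does, assuming the Hugging Face provider and omitting the only_missing filtering and error handling, could look like this (the plain string form of the vector in the N-Quad assumes the predicate is declared as float32vector in the schema):

compiler = Compiler()

def huggingfaceEmbeddings(model, texts):
    # encode a list of texts into a list of float vectors
    return [v.tolist() for v in model.encode(texts)]

def buildEmbeddings(embedding_def, only_missing=True):
    # sketch: fetch the nodes, render the handlebars template, embed, mutate the vectors back
    model = SentenceTransformer(embedding_def["model"])
    template = compiler.compile(embedding_def["config"]["template"])
    # inject the configured field selection into a query over the entity type
    fields = embedding_def["config"]["dqlQuery"].strip(" {}")
    query = f'{{ nodes(func: type({embedding_def["entityType"]})) {{ uid {fields} }} }}'
    res = json.loads(client.txn(read_only=True).query(query).json)
    nquads = []
    for node in res.get("nodes", []):
        text = template(node)
        vector = huggingfaceEmbeddings(model, [text])[0]
        predicate = f'{embedding_def["entityType"]}.{embedding_def["attribute"]}'
        nquads.append(f'<{node["uid"]}> <{predicate}> "{vector}" .')
    mutate_rdf(nquads, client)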
The embeddings are then computed using a simple configuration file:
embedding_config = [
    {
        "entityType": "Product",
        "attribute": "embedding",
        "index": "hnsw",
        "provider": "huggingface",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "config": {
            "dqlQuery": "{ title: Product.Title}",
            "template": "{{title}} "
        }
    },
    {
        "entityType": "age_group",
        "attribute": "embedding",
        "index": "hnsw",
        "provider": "huggingface",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "config": {
            "dqlQuery": "{ value: age_group.Value}",
            "template": "{{value}} "
        }
    },
    {
        "entityType": "brand",
        "attribute": "embedding",
        "index": "hnsw",
        "provider": "huggingface",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "config": {
            "dqlQuery": "{ value: brand.Value}",
            "template": "{{value}} "
        }
    },
    ...
]

for embedding_def in embedding_config:
    buildEmbeddings(
        embedding_def,
        only_missing=True
    )
    print(f"Embeddings done for {embedding_def['entityType']}.{embedding_def['attribute']}")
In our example we used the Hugging Face Sentence Transformers model all-MiniLM-L6-v2 for all our embeddings.
The template is a handlebars template that generates the text to be embedded from the dqlQuery result. By using a DQL query, we can build complex embeddings (or graph embeddings) for any node type: the embedded text can include text from connected nodes at any level.
In our use case, the embeddings are kept simple.
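For instance, a hypothetical richer configuration could embed each product together with the categories it is connected to:

{
    "entityType": "Product",
    "attribute": "embedding",
    "index": "hnsw",
    "provider": "huggingface",
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "config": {
        "dqlQuery": "{ title: Product.Title categories: Product.category { value: category.Value } }",
        "template": "{{title}} in categories {{#each categories}}{{value}} {{/each}}"
    }
}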
Querying the graph using Dgraph similarity function
Once the embeddings have been added to each node using mutations, we can use the similar_to function in DQL queries. For example:
sentence = "looking for something to make my home pretty"
# Get the sentence embedding with the same model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentence_embedding = model.encode(sentence).tolist()
# Use Dgraph similar_to function to find similar categories and use Graph relations to get the products for this category
txn = client.txn(read_only=True)
query = f'''
{{
result(func: similar_to(category.embedding,3,"{sentence_embedding}")) {{
category:category.Value
products:~Product.category (first:3) {{
name:Product.Name
}}
}}
}}'''
try:
res = txn.query(query)
data = json.loads(res.json)
print(json.dumps(data,indent=4))
finally:
txn.discard()
The query looks for the 3 category nodes closest to the provided prompt and gets at most 3 products for each category. The response looks like the following:
{
    "result": [
        {
            "category": "wedding decor",
            "products": [
                {
                    "name": "Romantic LED Light Valentine's Day Sign"
                }
            ]
        },
        {
            "category": "home decor",
            "products": [
                {
                    "name": "Fall Pillow Covers"
                }
            ]
        },
        {
            "category": "home garden balcony decor",
            "products": [
                {
                    "name": "Flower Pot Stand"
                }
            ]
        }
    ]
}
Extracting entities from the prompt
In the previous query, we assumed that the question was about products found by category, so we could write the DQL query by hand.
We can go further and use an LLM to analyze the user prompt and determine the correct criteria to use before querying the graph structure. In this example our dataset is small, but the approach must work for large graphs: loading all the data into the LLM context may not be practical and may exceed the LLM token window. The whole idea is to extract the subset of the data needed to reply to the user question.
We will use OpenAI and a prompt built with our knowledge of the graph structure, i.e., the description of the entities and predicates that can be found in the graph (aka the ontology).
We define the ontology and a way to represent it as text:
entities = [
    {
        "entity_name": "Product",
        "description": "Item detailed type",
        "predicates": {
            "category": {"description": "Item category, for example 'home decoration', 'women clothing', 'office supply'"},
            "color": {"description": "color of the item"},
            "brand": {"description": "if present, brand of the item"},
            "characteristic": {"description": "if present, item characteristics, for example 'waterproof', 'adhesive', 'easy to use'"},
            "measurement": {"description": "if present, dimensions of the item"},
            "age_group": {"description": "target age group for the product, one of 'babies', 'children', 'teenagers', 'adults'."}
        }
    }
]
def ontologyPrompt(ontology):
    # Create a textual description of the ontology to help prompting the LLM
    entities = [f'\'{e["entity_name"]}\'' for e in ontology]
    list_entities = ", ".join(entities)
    s = f"Identify if the user question is about one of the entities {list_entities}."
    s += "\nIdentify criteria about predicates depending on the entity."
    for e in ontology:
        s += f'\nFor \'{e["entity_name"]}\' look for:'
        for p in e["predicates"]:
            s += f'\n- \'{p}\': {e["predicates"][p]["description"]}'
    return s
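For our single-entity ontology, print(ontologyPrompt(entities)) produces the text we will embed in the system prompt:

Identify if the user question is about one of the entities 'Product'.
Identify criteria about predicates depending on the entity.
For 'Product' look for:
- 'category': Item category, for example 'home decoration', 'women clothing', 'office supply'
- 'color': color of the item
- 'brand': if present, brand of the item
- 'characteristic': if present, item characteristics, for example 'waterproof', 'adhesive', 'easy to use'
- 'measurement': if present, dimensions of the item
- 'age_group': target age group for the product, one of 'babies', 'children', 'teenagers', 'adults'.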
Using metadata in an ontology structure is an elegant and generic way to provide information about the domain we are dealing with to both the LLM (textual part) and the query builder (structured knowledge).
We use a prompt including the ontology information to ask OpenAI to identify an intent from the user prompt:
system_prompt = f'''
You are analyzing a user prompt to fetch information from a knowledge graph.
{ontologyPrompt(entities)}
Return a json object following the example:
{{
    "entity": "product",
    "intent": "one of 'list', 'count'",
    "criteria": [
        {{ "predicate": "category", "value": "clothing"}},
        {{ "predicate": "color", "value": "blue"}},
        {{ "predicate": "age_group", "value": "adults"}}
    ]
}}
If there are no relevant entities in the user prompt, return an empty json object.
'''
from openai import OpenAI
llm = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
# Ask the LLM to extract an intent from the user prompt
def text_to_intent(prompt, model="gpt-4o-mini"):
    completion = llm.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    )
    intent = json.loads(completion.choices[0].message.content)
    intent['prompt'] = prompt
    return intent
For example, the analysis of the request "do you have clothes for teenagers in dark colors?" results in the intent:
{
    "entity": "product",
    "intent": "list",
    "criteria": [
        {
            "predicate": "category",
            "value": "clothing"
        },
        {
            "predicate": "color",
            "value": "dark"
        },
        {
            "predicate": "age_group",
            "value": "teenagers"
        }
    ],
    "prompt": "do you have clothes for teenagers in dark colors?"
}
The intent structure is easily translated into a graph traversal with constraints and filters.
One key idea is to use a similarity search instead of a keyword or term search. In the above example, dark is not a color value: a keyword search would not find any result, while a similarity search should find "black" and "dark blue" as good matches.
The query generated from the previous intent looks like this:
query test($categoryvect: float32vector, $colorvect: float32vector, $age_groupvect: float32vector) {
  category as var(func:similar_to(category.embedding,1,$categoryvect))
  color as var(func:similar_to(color.embedding,1,$colorvect))
  age_group as var(func:similar_to(age_group.embedding,1,$age_groupvect))
  products(func:type(Product)) @filter(
      uid_in(Product.category, uid(category))
      AND uid_in(Product.color, uid(color))
      AND uid_in(Product.age_group, uid(age_group)) ) {
    name:Product.Name
    title:Product.Title
    age_group:Product.age_group {
      value:age_group.Value
    }
    brand:Product.brand {
      value:brand.Value
    }
    color:Product.color {
      value:color.Value
    }
    category:Product.category {
      value:category.Value
    }
    characteristic:Product.characteristic {
      value:characteristic.Value
    }
    material:Product.material {
      value:material.Value
    }
    measurement:Product.measurement {
      value:measurement.Value
    }
  }
}
The DQL query is created from 4 parts:
- the list of vectors used as query parameters.
- var blocks to find matching nodes for each criterion in the intent.
- a main entity type query with filters on the relationships to matching nodes.
- the information to retrieve for each node of the main entity type.
The query parts are inferred from the intent and the ontology. In the example we have hardcoded the fact that we are dealing with the Product type, but this could easily be generated from the intent "entity" information.
Here is the code used to build the DQL query:
# use the same embedding model for the user input and for the searched entities
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def create_embedding(text):
    # print(f"create embedding for {text}")
    return huggingfaceEmbeddings(model, [text])[0]

# For each criterion, compute an embedding of the criterion value,
# build a sequence of var blocks finding the most similar node (category, characteristic, brand, etc.),
# and build a filter to keep only the Products connected to the corresponding nodes.
def intent_to_dql(intent):
    vect = []
    vars = []
    filters = []
    variables = {}
    for criteria in intent['criteria']:
        variables[f"${criteria['predicate']}vect"] = f"{create_embedding(criteria['value'])}"
        vect.append(f"${criteria['predicate']}vect: float32vector")
        vars.append(f"{criteria['predicate']} as var(func:similar_to({criteria['predicate']}.embedding,1,${criteria['predicate']}vect))")
        filters.append(f"uid_in(Product.{criteria['predicate']}, uid({criteria['predicate']}))")
    all_filters = "\n AND ".join(filters)
    all_vars = "\n".join(vars)
    query = f"""
    query test({','.join(vect)}){{
        {all_vars}
        products(func:type(Product)) @filter( {all_filters} ) {{
            name:Product.Name
            title:Product.Title
            age_group:Product.age_group {{
                value:age_group.Value
            }}
            brand:Product.brand {{
                value:brand.Value
            }}
            color:Product.color {{
                value:color.Value
            }}
            category:Product.category {{
                value:category.Value
            }}
            characteristic:Product.characteristic {{
                value:characteristic.Value
            }}
            material:Product.material {{
                value:material.Value
            }}
            measurement:Product.measurement {{
                value:measurement.Value
            }}
        }}
    }}
    """
    return {"query": query, "variables": variables}
Generating a response from the retrieved sub-graph
We simply instruct an LLM to reply to the user request using the data retrieved from the graph. This allows us to create a graph query that is good enough and may retrieve more data than needed. We then let the LLM use what is relevant for the request.
def rag(prompt, payload):
    model = "gpt-4o-mini"
    rag_prompt = f'''
    You are suggesting products based on user input and available items.
    Reply to the user with suggestions from the following data that match the criteria
    {payload}
    If possible explain why the items are suggested.
    If there are no relevant items reply that we don't have any items that match the criteria.
    '''
    completion = llm.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": rag_prompt},
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content
The last function is a good summary of the approach in a few lines of code:
def reply(sentence):
    intent = text_to_intent(sentence)
    dql = intent_to_dql(intent)
    res = client.txn(read_only=True).query(dql["query"], variables=dql["variables"])
    payload = json.loads(res.json)
    return rag(sentence, payload)
Let's test our RAG solution:
example_queries = [
    "Which pink items are suitable for children?",
    "Do you have a helmet with anti allergic padding?",
]

for q in example_queries:
    print()
    print(f"> {q}")
    print()
    r = reply(q)
    print(r)
Here are the results we got:
> Which pink items are suitable for children?
I have two great suggestions for pink items that are suitable for children:
- Suitcase Music Box
  - Title: Suitcase Music Box, Mini Music Box Clockwork Music Box for Children
  - Age Group: Children
  - Color: Pink
  - Category: Toys & Games
  - Characteristics: This music box features a make-up mirror, jewelry box functionality, requires no batteries, operates with a clockwork mechanism, and has a storage compartment. It's a delightful toy that can also serve as a charming decorative piece for a child's room.
- Unicorn Curtains
  - Title: Eiichuang Unicorn Curtains Rod Pocket Blackout Cute Cartoon Pink Unicorn Wearing a Crown Mermaid Pattern Print Room Darkening Window Drapes for Kids Girls Bedroom Nursery, 2 Panels Set, 29 x 63 Inch
  - Age Group: Children
  - Color: Pink
  - Category: Home Decoration
  - Characteristics: These curtains feature a fun unicorn design and are room darkening with a rod pocket for easy hanging. They are perfect for a child's bedroom or nursery, creating a whimsical atmosphere.
Both items are not only visually appealing with their pink color but also serve functional purposes for children's enjoyment and room decor.
The second question:
> Do you have a helmet with anti allergic padding?
Yes, we have a helmet that features anti-allergic interior padding. I recommend the Steelbird Hi-Gn SBH-11 HUNK Helmet.
Here are some details about it:
- Brand: Steelbird
- Color: Glossy Black and Blue
- Measurement: 580 mm (M)
- Category: Motorcycle gear
- Characteristics:
  - Anti Allergic Interior
  - High Impact ABS Material Shell
  - Italian Design Hygienic Interior
  - Neck Protector For Extra Comfort
  - Multipored Breathable Padding
  - Multi-layer EPS (Thermocol)
  - Replaceable and washable interior
  - Anti-bacteria coating
This helmet not only provides anti-allergic features but also has a variety of other comfort and safety attributes, making it a great choice for your motorcycle gear.
Conclusion
In this blog post, we demonstrated the integration of Dgraph database and Language Models to create an intelligent product recommendation system. By leveraging Dgraph’s graph database structure and native vector support, along with the powerful capabilities of language models, we achieved efficient storage, retrieval, and response generation for retail product information.
We explored the following key aspects:
- Crafting Graph Database Queries: Using language models to analyze user queries and generate corresponding graph database queries for fetching relevant product information.
- Generating and Storing Text Embeddings: Utilizing a smaller model to create and store text embeddings that enhance the semantic similarity search for products, categories, brands, and more.
- Generating Responses: Employing a language model to formulate comprehensive responses to user queries based on the retrieved data from the graph.
The approach exemplifies the use of Retrieval Augmented Generation (RAG) and Natural Language Processing (NLP) leveraging graph structures, and provides a general workflow for RAG-over-graph use cases. It can be improved on various points, including:
- use a different embedding model.
- create a more complex intent structure covering aggregation, counting, and complex criteria (e.g., could you build an intent for the question "How many products do you have in home decoration under 100$?"; a sketch follows this list).
- train a model to generate the query instead of crafting it.
- train a model to generate the intent from user input instead of using an LLM.
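As a sketch of the extended intent idea, a hypothetical intent for that question could use a 'count' intent and add an "operator" field for non-similarity criteria; the query builder would then emit a count() aggregation and an inequality filter instead of pure similarity filters. Note that "price" is not part of our current ontology, so supporting it would also require a price predicate in the graph:

{
    "entity": "product",
    "intent": "count",
    "criteria": [
        { "predicate": "category", "value": "home decoration" },
        { "predicate": "price", "operator": "lt", "value": 100 }
    ],
    "prompt": "How many products do you have in home decoration under 100$?"
}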
Happy coding with Dgraph, embeddings, and language models!
Photo by SHVETS production from Pexels