JULY 24 2024
Product recommendation using RAG on Dgraph
Combine Dgraph vector search and LLM language processing power to generate accurate product recommendations.

In this blog post, we will use the Dgraph database to store retail product information (Amazon products data) and language models to reply to users asking for a product recommendation.
The blog post provides Python code snippets to explain the main steps. The complete Jupyter Notebook and associated data folder are available for you to play with.
The language models are used in three different ways:
- we use LLM text analysis capabilities to craft a graph database query that fetches the relevant information needed to generate a response.
- we use a small model to generate and store text embeddings, used to find products, categories, brands, characteristics, etc. by semantic similarity.
- we use an LLM to generate a response to the user question based on the data retrieved from the graph.
This is a case of Retrieval Augmented Generation (RAG), and NLP (Natural Language Processing) leveraging graph structures.
Why Dgraph for AI?
Dgraph is particularly suited for knowledge graph and AI applications due to several key features and capabilities:
- Graph Database Structure: Dgraph is designed as a native graph database, which means it stores data in a graph structure consisting of nodes, edges, and properties. This is inherently aligned with the way knowledge graphs represent relationships and entities, making it easier to model complex interconnections.
- Native Vector Support: Any node may have any number of vector predicates that are indexed using the HNSW algorithm for fast similarity retrieval (see the sample schema declaration after this list).
- Scalability: Dgraph is built to scale horizontally, handling large volumes of data and high query loads efficiently. This is crucial for AI applications that often require processing vast amounts of interconnected data.
- High Performance: Dgraph provides fast query execution and low latency, which are essential for real-time AI applications. Its performance optimizations, such as parallel query execution and efficient data storage, make it capable of handling demanding workloads.
- Flexible Schema: Dgraph supports flexible schema definitions, allowing for dynamic data models that can evolve. This is beneficial for AI applications where the data schema might need to adapt to new requirements or insights.
- Rich Querying Capabilities: Dgraph's query language, DQL (Dgraph Query Language), is declarative, which means that queries return a response in a similar shape to the query. DQL allows for complex graph traversals and pattern matching, which are essential for extracting insights and relationships in knowledge graphs. It also supports advanced features like recursive queries, aggregations, and, most importantly, vector similarity search.
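For reference, declaring an HNSW-indexed vector predicate takes a single line in the DQL schema. A minimal sketch, with an illustrative predicate name:

Product.embedding: float32vector @index(hnsw(metric:"euclidean")) .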
Solution overview
Create a knowledge graph consisting of product information: categories, brands, age groups, colors, measurements, materials, and characteristics. On user request:
- create an intent representing which part of the graph should be used to reply to the request.
- convert the intent into a Dgraph DQL query and execute the query, using similarity search for the best filtering.
- use the retrieved structured data and a proper prompt to generate the final response to the user.
Setup
Create a file .env in the folder containing this Python notebook with one line for your OpenAI API key:
OPENAI_API_KEY=sk-....
We just need some Python packages for Dgraph, OpenAI, Hugging Face, and a few tools we are using.
# Optional script to install all the required packages
!pip3 install pydgraph
!pip3 install openai
!pip3 install sentence_transformers
!pip3 install pybars3
!pip3 install python-dotenv
import os
import json
import pydgraph
from pybars import Compiler
# Activate the provider you want to use for embeddings and LLM
# from openai import OpenAI
# from mistralai.client import MistralClient
from sentence_transformers import SentenceTransformer
from dotenv import load_dotenv
load_dotenv()
assert os.getenv("OPENAI_API_KEY") is not None, "Set OPENAI_API_KEY in your .env file"
Dataset
Dgraph supports the JSON and RDF formats. In this notebook we are using RDF. RDF is a powerful notation for knowledge graphs. It describes information in triples of the form Subject - Predicate - Object (S-P-O).
The original dataset is in JSON format and is 2.7 MB. We have generated an RDF file with the same information. The RDF file is only 361 KB!
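For illustration, a product and its category could be described with triples like these (the values are made up; the predicate names match the ones used later in this post):

_:p1 <dgraph.type> "Product" .
_:p1 <Product.Name> "Flower Pot Stand" .
_:p1 <Product.category> _:c1 .
_:c1 <dgraph.type> "category" .
_:c1 <category.Value> "home decoration" .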
Loading the dataset
Connecting to Dgraph
See Learning Environment to set up a Docker image with dgraph/standalone:latest, or use your on-prem or cloud instance.
dgraph_grpc = "localhost:9080"
client_stub = pydgraph.DgraphClientStub(dgraph_grpc)
client = pydgraph.DgraphClient(client_stub)
print(f"Connected to DGraph at {dgraph_grpc}")
The notebook provides a complete setup covering on-prem and cloud instances.
Housekeeping
First we clean the DB. You may want to skip this step.
# Drop all data including schema from the Dgraph instance.
# This is useful for small examples such as this one since it puts Dgraph into a clean state.
confirm = input("drop schema and all data (y/n)?")
if confirm == "y":
    op = pydgraph.Operation(drop_all=True)
    client.alter(op)
    print("schema and data deleted")
Deploying the Graph schema
In the Dgraph schema, we tell the system which indexes we want on the different predicates, declare node types, and specify relationship cardinalities.
# add predicates to Dgraph type schema
with open('data/products.schema', 'r') as file:
    dqlschema = file.read()
op = pydgraph.Operation(schema=dqlschema)
client.alter(op)
print("schema updated:")
print(dqlschema)
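We don't reproduce the whole schema file here, but a plausible excerpt, sketched from the predicates used in this post rather than copied from the file, looks like this. Note the @reverse directive on the relationship predicate: it is what enables reverse traversals such as ~Product.category in the queries below.

Product.Name: string @index(term) .
Product.Title: string .
Product.category: [uid] @reverse .
category.Value: string @index(term) .
category.embedding: float32vector @index(hnsw(metric:"euclidean")) .

type Product {
  Product.Name
  Product.Title
  Product.category
}
type category {
  category.Value
  category.embedding
}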
Importing data
As the dataset is small we can load all the data in one mutation:
def mutate_rdf(nquads, client):
    ret = {}
    body = "\n".join(nquads)
    if len(nquads) > 0:
        txn = client.txn()
        try:
            res = txn.mutate(set_nquads=body)
            txn.commit()
            ret["nquads"] = len(nquads)
            ret["total_ns"] = res.latency.total_ns
        except pydgraph.errors.AbortedError as err:
            print("AbortedError %s" % err)
        except Exception as inst:
            print(inst)
        finally:
            txn.discard()
    return ret

with open('data/products.rdf') as f:
    data = f.readlines()
mutate_rdf(data, client)
For large datasets, refer to the Import data options.
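For datasets too large for a single mutation but small enough to load from Python, a simple variant, sketched here as a hypothetical helper, is to send the N-Quads in fixed-size batches (this assumes triples sharing blank nodes end up in the same batch):

def mutate_rdf_batched(nquads, client, batch_size=1000):
    # reuse mutate_rdf on successive chunks of the RDF file
    for i in range(0, len(nquads), batch_size):
        mutate_rdf(nquads[i:i + batch_size], client)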
Simple graph query
As our data is now in a graph database, we can traverse the graph, search for nodes, count relationships, etc. To verify that we have data in the DB, let's execute a simple query to find the top 3 categories and their number of products:
query = '''
{
  var(func:type(category)) {
    np as count(~Product.category)
  }
  productsPerCategory(func:uid(np), orderdesc:val(np), first:3) {
    category:category.Value
    number_of_products:val(np)
  }
}
'''
res = client.txn(read_only=True).query(query)
res = json.loads(res.json)
print("Top 3 categories with the most products:")
print(json.dumps(res, indent=4))
The expected result is:
Top 3 categories with the most products:
{
    "productsPerCategory": [
        {
            "category": "home decoration",
            "number_of_products": 20
        },
        {
            "category": "books",
            "number_of_products": 17
        }
    ]
}
Similarity search with vector embeddings
We don't want to constrain the question to only use terms present in the database. For example, the user may want "some clothes of dark color". We need to search our graph by similarity and not only by terms. We will use the power of Dgraph vectors and language model vector embeddings.
Creating vector indexes
Dgraph is a graph database with native vector support, an HNSW index, and similarity search. For this use case, we use the Python script shared in the blog post Add OpenAI, Mistral or open-source embeddings to your knowledge graph to compute and add vector embeddings to all our entities.
For example, with an embedding on the color entities, we will be able to search for colors similar_to "dark color".
Refer to the notebook for the details of the embedding logic.
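buildEmbeddings comes from that blog post. A simplified sketch of what it does, assuming the Hugging Face provider and omitting the only_missing filtering and error handling, could look like this (the plain string form of the vector in the N-Quad assumes the predicate is declared as float32vector in the schema):

compiler = Compiler()

def huggingfaceEmbeddings(model, texts):
    # encode a list of texts into a list of float vectors
    return [v.tolist() for v in model.encode(texts)]

def buildEmbeddings(embedding_def, only_missing=True):
    # sketch: fetch the nodes, render the handlebars template, embed, mutate the vectors back
    model = SentenceTransformer(embedding_def["model"])
    template = compiler.compile(embedding_def["config"]["template"])
    # inject the configured field selection into a query over the entity type
    fields = embedding_def["config"]["dqlQuery"].strip(" {}")
    query = f'{{ nodes(func: type({embedding_def["entityType"]})) {{ uid {fields} }} }}'
    res = json.loads(client.txn(read_only=True).query(query).json)
    nquads = []
    for node in res.get("nodes", []):
        text = template(node)
        vector = huggingfaceEmbeddings(model, [text])[0]
        predicate = f'{embedding_def["entityType"]}.{embedding_def["attribute"]}'
        nquads.append(f'<{node["uid"]}> <{predicate}> "{vector}" .')
    mutate_rdf(nquads, client)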
The embeddings are then computed using a simple configuration file:
embedding_config = [
    {
        "entityType": "Product",
        "attribute": "embedding",
        "index": "hnsw",
        "provider": "huggingface",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "config": {
            "dqlQuery": "{ title: Product.Title}",
            "template": "{{title}} "
        }
    },
    {
        "entityType": "age_group",
        "attribute": "embedding",
        "index": "hnsw",
        "provider": "huggingface",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "config": {
            "dqlQuery": "{ value: age_group.Value}",
            "template": "{{value}} "
        }
    },
    {
        "entityType": "brand",
        "attribute": "embedding",
        "index": "hnsw",
        "provider": "huggingface",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "config": {
            "dqlQuery": "{ value: brand.Value}",
            "template": "{{value}} "
        }
    },
    ...
]

for embedding_def in embedding_config:
    buildEmbeddings(
        embedding_def,
        only_missing=True
    )
    print(f"Embeddings done for {embedding_def['entityType']}.{embedding_def['attribute']}")
In our example we used the Hugging Face Sentence Transformers model all-MiniLM-L6-v2 for all our embeddings.
The template is a handlebars template that generates the text to be embedded from the dqlQuery result. By using a DQL query, we can build complex embeddings (or graph embeddings) for any node type: the embedded text can include text from connected nodes at any level.
In our use case, the embeddings are kept simple.
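For instance, a hypothetical richer configuration could embed each product together with the categories it is connected to:

{
    "entityType": "Product",
    "attribute": "embedding",
    "index": "hnsw",
    "provider": "huggingface",
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "config": {
        "dqlQuery": "{ title: Product.Title categories: Product.category { value: category.Value } }",
        "template": "{{title}} in categories {{#each categories}}{{value}} {{/each}}"
    }
}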
Querying the graph using Dgraph similarity function
Once the embeddings have been added to each node using mutations, we can use the similar_to function in DQL queries. For example:
sentence = "looking for something to make my home pretty"
# Get the sentence embedding with the same model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentence_embedding = model.encode(sentence).tolist()
# Use Dgraph similar_to function to find similar categories and use Graph relations to get the products for this category
txn = client.txn(read_only=True)
query = f'''
{{
result(func: similar_to(category.embedding,3,"{sentence_embedding}")) {{
category:category.Value
products:~Product.category (first:3) {{
name:Product.Name
}}
}}
}}'''
try:
res = txn.query(query)
data = json.loads(res.json)
print(json.dumps(data,indent=4))
finally:
txn.discard()
The query looks for the 3 category nodes closest to the provided prompt and gets at most 3 products for each category. The response looks like the following:
{
    "result": [
        {
            "category": "wedding decor",
            "products": [
                {
                    "name": "Romantic LED Light Valentine's Day Sign"
                }
            ]
        },
        {
            "category": "home decor",
            "products": [
                {
                    "name": "Fall Pillow Covers"
                }
            ]
        },
        {
            "category": "home garden balcony decor",
            "products": [
                {
                    "name": "Flower Pot Stand"
                }
            ]
        }
    ]
}
Extracting entities from the prompt
In the previous query, we assumed that the question was about products found by category, so we could write the DQL query by hand.
We can go further and use an LLM to analyze the user prompt and determine the correct criteria to use before querying the graph structure. In this example our dataset is small, but the approach must work for large graphs: loading all the data into the LLM context may not be practical and may exceed the LLM token window. The whole idea is to extract the subset of the data needed to reply to the user question.
We will use OpenAI and a prompt built with our knowledge of the graph structure, i.e., the description of the entities and predicates that can be found in the graph (aka the ontology).
We define the ontology and a way to represent it as text:
entities = [
    {
        "entity_name": "Product",
        "description": "Item detailed type",
        "predicates": {
            "category": {"description": "Item category, for example 'home decoration', 'women clothing', 'office supply'"},
            "color": {"description": "color of the item"},
            "brand": {"description": "if present, brand of the item"},
            "characteristic": {"description": "if present, item characteristics, for example 'waterproof', 'adhesive', 'easy to use'"},
            "measurement": {"description": "if present, dimensions of the item"},
            "age_group": {"description": "target age group for the product, one of 'babies', 'children', 'teenagers', 'adults'."}
        }
    }
]
def ontologyPrompt(ontology):
    # Create a textual description of the ontology to help prompting the LLM
    entities = [f'\'{e["entity_name"]}\'' for e in ontology]
    list_entities = ", ".join(entities)
    s = f"Identify if the user question is about one of the entities {list_entities}."
    s += "\nIdentify criteria about predicates depending on the entity."
    for e in ontology:
        s += f'\nFor \'{e["entity_name"]}\' look for:'
        for p in e["predicates"]:
            s += f'\n- \'{p}\': {e["predicates"][p]["description"]}'
    return s
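For our single-entity ontology, print(ontologyPrompt(entities)) produces the text we will embed in the system prompt:

Identify if the user question is about one of the entities 'Product'.
Identify criteria about predicates depending on the entity.
For 'Product' look for:
- 'category': Item category, for example 'home decoration', 'women clothing', 'office supply'
- 'color': color of the item
- 'brand': if present, brand of the item
- 'characteristic': if present, item characteristics, for example 'waterproof', 'adhesive', 'easy to use'
- 'measurement': if present, dimensions of the item
- 'age_group': target age group for the product, one of 'babies', 'children', 'teenagers', 'adults'.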
Using metadata in an ontology structure is an elegant and generic way to provide information about the domain we are dealing with to both the LLM (textual part) and the query builder (structured knowledge).
We use a prompt including the ontology information to ask OpenAI to identify an intent from the user prompt:
system_prompt = f'''
You are analyzing a user prompt to fetch information from a knowledge graph.
{ontologyPrompt(entities)}
Return a json object following the example:
{{
    "entity": "product",
    "intent": "one of 'list', 'count'",
    "criteria": [
        {{ "predicate": "category", "value": "clothing"}},
        {{ "predicate": "color", "value": "blue"}},
        {{ "predicate": "age_group", "value": "adults"}}
    ]
}}
If there are no relevant entities in the user prompt, return an empty json object.
'''
from openai import OpenAI
llm = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
# Ask the LLM to extract an intent from the user prompt
def text_to_intent(prompt, model="gpt-4o-mini"):
    completion = llm.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    )
    intent = json.loads(completion.choices[0].message.content)
    intent['prompt'] = prompt
    return intent
For example, the analysis of the request "do you have clothes for teenagers in dark colors?" results in the intent:
{
    "entity": "product",
    "intent": "list",
    "criteria": [
        {
            "predicate": "category",
            "value": "clothing"
        },
        {
            "predicate": "color",
            "value": "dark"
        },
        {
            "predicate": "age_group",
            "value": "teenagers"
        }
    ],
    "prompt": "do you have clothes for teenagers in dark colors?"
}
The intent structure is easily translated into a graph traversal with constraints and filters.
One key idea is to use a similarity search instead of a keyword or term search. In the above example, dark is not a color value: a keyword search would not find any result, while a similarity search should find "black" and "dark blue" as good matches.
The query generated from the previous intent looks like this:
query test($categoryvect: float32vector, $colorvect: float32vector, $age_groupvect: float32vector) {
  category as var(func:similar_to(category.embedding,1,$categoryvect))
  color as var(func:similar_to(color.embedding,1,$colorvect))
  age_group as var(func:similar_to(age_group.embedding,1,$age_groupvect))
  products(func:type(Product)) @filter(
      uid_in(Product.category, uid(category))
      AND uid_in(Product.color, uid(color))
      AND uid_in(Product.age_group, uid(age_group)) ) {
    name:Product.Name
    title:Product.Title
    age_group:Product.age_group {
      value:age_group.Value
    }
    brand:Product.brand {
      value:brand.Value
    }
    color:Product.color {
      value:color.Value
    }
    category:Product.category {
      value:category.Value
    }
    characteristic:Product.characteristic {
      value:characteristic.Value
    }
    material:Product.material {
      value:material.Value
    }
    measurement:Product.measurement {
      value:measurement.Value
    }
  }
}
The DQL query is created from 4 parts:
- the list of vectors used as query parameters.
- var blocks to find matching nodes for each criterion in the intent.
- a main entity type query with filters on the relationships to matching nodes.
- the information to retrieve for each node of the main entity type.
The query parts are inferred from the intent and the ontology. In the example we have hardcoded the fact that we are dealing with the Product type, but this could easily be generated from the intent "entity" information.
Here is the code used to build the DQL query:
# use the same embedding model for the user input and for the searched entities
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def create_embedding(text):
    # print(f"create embedding for {text}")
    return huggingfaceEmbeddings(model, [text])[0]

# For each criterion, compute an embedding of the criterion value,
# build a sequence of var blocks finding the most similar node (category, characteristic, brand, etc.),
# and build a filter to keep only the Products connected to the corresponding nodes.
def intent_to_dql(intent):
    vect = []
    vars = []
    filters = []
    variables = {}
    for criteria in intent['criteria']:
        variables[f"${criteria['predicate']}vect"] = f"{create_embedding(criteria['value'])}"
        vect.append(f"${criteria['predicate']}vect: float32vector")
        vars.append(f"{criteria['predicate']} as var(func:similar_to({criteria['predicate']}.embedding,1,${criteria['predicate']}vect))")
        filters.append(f"uid_in(Product.{criteria['predicate']}, uid({criteria['predicate']}))")
    all_filters = "\n AND ".join(filters)
    all_vars = "\n".join(vars)
    query = f"""
    query test({','.join(vect)}){{
        {all_vars}
        products(func:type(Product)) @filter( {all_filters} ) {{
            name:Product.Name
            title:Product.Title
            age_group:Product.age_group {{
                value:age_group.Value
            }}
            brand:Product.brand {{
                value:brand.Value
            }}
            color:Product.color {{
                value:color.Value
            }}
            category:Product.category {{
                value:category.Value
            }}
            characteristic:Product.characteristic {{
                value:characteristic.Value
            }}
            material:Product.material {{
                value:material.Value
            }}
            measurement:Product.measurement {{
                value:measurement.Value
            }}
        }}
    }}
    """
    return {"query": query, "variables": variables}
Generating a response from the retrieved sub-graph
We simply instruct an LLM to reply to the user request using the data retrieved from the graph. This allows us to create a graph query that is good enough and may retrieve more data than needed. We then let the LLM use what is relevant for the request.
def rag(prompt, payload):
    model = "gpt-4o-mini"
    rag_prompt = f'''
    You are suggesting products based on user input and available items.
    Reply to the user with suggestions from the following data that match the criteria
    {payload}
    If possible explain why the items are suggested.
    If there are no relevant items reply that we don't have any items that match the criteria.
    '''
    completion = llm.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": rag_prompt},
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content
The last function is a good summary of the approach in a few lines of code:
def reply(sentence):
    intent = text_to_intent(sentence)
    dql = intent_to_dql(intent)
    res = client.txn(read_only=True).query(dql["query"], variables=dql["variables"])
    payload = json.loads(res.json)
    return rag(sentence, payload)
Let's test our RAG solution:
example_queries = [
    "Which pink items are suitable for children?",
    "Do you have a helmet with anti allergic padding?",
]

for q in example_queries:
    print()
    print(f"> {q}")
    print()
    r = reply(q)
    print(r)
Here are the results we got:
> Which pink items are suitable for children?
I have two great suggestions for pink items that are suitable for children:
- Suitcase Music Box
  - Title: Suitcase Music Box, Mini Music Box Clockwork Music Box for Children
  - Age Group: Children
  - Color: Pink
  - Category: Toys & Games
  - Characteristics: This music box features a make-up mirror, jewelry box functionality, requires no batteries, operates with a clockwork mechanism, and has a storage compartment. It's a delightful toy that can also serve as a charming decorative piece for a child's room.
- Unicorn Curtains
  - Title: Eiichuang Unicorn Curtains Rod Pocket Blackout Cute Cartoon Pink Unicorn Wearing a Crown Mermaid Pattern Print Room Darkening Window Drapes for Kids Girls Bedroom Nursery, 2 Panels Set, 29 x 63 Inch
  - Age Group: Children
  - Color: Pink
  - Category: Home Decoration
  - Characteristics: These curtains feature a fun unicorn design and are room darkening with a rod pocket for easy hanging. They are perfect for a child's bedroom or nursery, creating a whimsical atmosphere.
Both items are not only visually appealing with their pink color but also serve functional purposes for children's enjoyment and room decor.
The second question:
> Do you have a helmet with anti allergic padding?
Yes, we have a helmet that features anti-allergic interior padding. I recommend the Steelbird Hi-Gn SBH-11 HUNK Helmet.
Here are some details about it:
- Brand: Steelbird
- Color: Glossy Black and Blue
- Measurement: 580 mm (M)
- Category: Motorcycle gear
- Characteristics:
  - Anti Allergic Interior
  - High Impact ABS Material Shell
  - Italian Design Hygienic Interior
  - Neck Protector For Extra Comfort
  - Multipored Breathable Padding
  - Multi-layer EPS (Thermocol)
  - Replaceable and washable interior
  - Anti-bacteria coating
This helmet not only provides anti-allergic features but also has a variety of other comfort and safety attributes, making it a great choice for your motorcycle gear.
Conclusion
In this blog post, we demonstrated the integration of Dgraph database and Language Models to create an intelligent product recommendation system. By leveraging Dgraph’s graph database structure and native vector support, along with the powerful capabilities of language models, we achieved efficient storage, retrieval, and response generation for retail product information.
We explored the following key aspects:
- Crafting Graph Database Queries: Using language models to analyze user queries and generate corresponding graph database queries for fetching relevant product information.
- Generating and Storing Text Embeddings: Utilizing a smaller model to create and store text embeddings that enhance the semantic similarity search for products, categories, brands, and more.
- Generating Responses: Employing a language model to formulate comprehensive responses to user queries based on the retrieved data from the graph.
The approach exemplifies the use of Retrieval Augmented Generation (RAG) and Natural Language Processing (NLP) leveraging graph structures, and provides a general workflow for RAG-over-graph use cases. It can be improved on various points, including:
- use a different embedding model.
- create a more complex intent structure covering aggregation, counting, and complex criteria (e.g., could you build an intent for the question "How many products do you have in home decoration under 100$?"; a sketch follows this list).
- train a model to generate the query instead of crafting it.
- train a model to generate the intent from user input instead of using an LLM.
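As a sketch of the extended intent idea, a hypothetical intent for that question could use a 'count' intent and add an "operator" field for non-similarity criteria; the query builder would then emit a count() aggregation and an inequality filter instead of pure similarity filters. Note that "price" is not part of our current ontology, so supporting it would also require a price predicate in the graph:

{
    "entity": "product",
    "intent": "count",
    "criteria": [
        { "predicate": "category", "value": "home decoration" },
        { "predicate": "price", "operator": "lt", "value": 100 }
    ],
    "prompt": "How many products do you have in home decoration under 100$?"
}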
Happy coding with Dgraph, embeddings, and language models!
Photo by SHVETS production from Pexels