NOVEMBER 24 2020
Using Dgraph For Entity Resolution
Dgraph streamlines Entity Resolution. Entity Resolution use cases can include: Know Your Customer, Customer 360, Spend Analytics, Fraud Detection, etc.

Introduction
In this blog, we will introduce the concept of "Entity Resolution" and its relevance. We will discuss three aspects of Entity Resolution: Data Standardization, Duplicate Detection, and Data Merging. Finally, we will use Dgraph to model a "Person" entity, add some example data and demonstrate Dgraph's capabilities that support Entity Resolution on the Person entity.
Part 1: What is Entity Resolution?
- For many businesses, it is important to uniquely identify customers and partners. This unique identification is the basis for several business policies. Certain business policies such as a maximum discount, one-time offers, and maximum claim amounts, etc., require clean and duplicate-free data. This is where a database that can actively detect duplicates and merge attributes can play a pivotal role.
- Entity Resolution is the process of comparing key attributes of customer or partner records to identify the same or completely different entities. For example, you can uniquely identify a business partner by the registered name and address. A person is identified with their full name, date of birth, gender, and address.
- Poor Entity Resolution practices result in gaps in processes that lead to inefficiencies and mistakes that can often be exploited by malicious actors. Fraud frequently occurs by slightly changing customer attributes and claiming inappropriate benefits. For example, there are cases in the health insurance industry where scammers claimed benefits for 3+ limbs by submitting spurious, duplicate claims.
Business Processes in Entity Resolution
Entity Resolution typically requires specialized processing steps. These include data standardisation, duplicate detection, and data merging. We will go through a brief description of each process to understand the context.
Data Standardization
Data Standardization involves tight control of the values associated with certain attributes. For example, you may want to operate a clean reference list of countries and state names from an ISO standard, as opposed to a free-flow entry of these attributes. This ensures that the information stored is not ambiguous. You may apply a policy where any customer data not containing a country name from the clean reference list gets rejected. This forces usage of high-quality data, which in turn results in higher confidence of data stored in the database.
In the process of data standardization, you try to match the existing value of an attribute to a clean list. When a suitable match is found, you “hook” or connect the attribute to this clean list, making it available for any subsequent processes. In this blog series, we will demonstrate how you can take an ISO value for a country and use it to standardize customer data.
Duplicate Detection
This process is critical in uniquely identifying an entity. In distributed systems, there is often a lack of a single unique “ID” attribute that can disambiguate and uniquely identify an entity. Thus, every entity needs to be frequently compared with other entities in the system to identify and establish uniqueness. In other words, if the attributes of an entity are not the same as any other entity within the current system, you can declare it as unique, at least in that particular system.
You can pick which attributes are critical in the makeup of an entity. For example, a name and address might be essential for a person or organization, whereas a text description could be critical for products.
Due to human error or other unavoidable circumstances, there may be slight spelling variations in a name or address. As a result, when comparing attributes, you cannot go by approaches that involve exact matches exclusively. This is why the duplicate detection mechanism usually takes a more flexible approach to achieve matches. These include fuzzy text matching, and probability-based duplicate detection approaches, etc. Dgraph provides text indices that allow strong and fuzzy text comparisons. These indices let you perform matches at scale and on-demand.
Data Merging
Once you have identified duplicates, you will find that these duplicates might have the same values for their attributes while others may have different values - some duplicates have an empty attribute value. It’s often not straightforward to understand how to resolve available values of attributes into a single golden copy. For example, the lastest timestamp may prove the deciding factor in mobile numbers, while an insured address may be preferable as a residence address.
The process of data merging is all about identifying a unique golden value for a given attribute while detecting potential conflicts, determining the right mechanism (last updated, source of an attribute, etc.) to break the deadlock, and finally, saving this value for the attribute. The final set of attributes is often called a “golden record.” Additionally, the decision-making mechanism needs to be auditable so that business processes down the line have confidence in using the golden record attributes.
Dgraph provides a sophisticated query language and data structures that allow us to “construct” this golden record from candidates. We can do this in a manner that is entirely expressible in the form of a graph and visible to stakeholders. These features make activities in this complicated process transparent and command higher confidence in the data.
An Example of Entity Resolution: Three records for James Kramer
Schema and Initial Data Load
Let us now demonstrate how Dgraph can help with a simple scenario. Here are the schema and the mutation to set up the example.
<country>: uid @reverse .
<countryName>: string @index(term) .
<duplicated>: [uid] @count .
<name>: string @index(exact, trigram) .
<otherNames>: [string] @index(exact) .
<streetName>: string @index(trigram) .
<tag>: string @index(exact) .
type <Country> {
name
otherNames
}
{
set{
_:p1 <dgraph.type> "Person" .
_:p1 <name> "James Kramer" .
_:p1 <streetName> "One Main Street, Seattle WA 98101" .
_:p1 <countryName> "USA" .
_:p2 <dgraph.type> "Person" .
_:p2 <name> "James Kraamer" .
_:p2 <streetName> "1 Main Street, Seattle WA 98101" .
_:p2 <countryName> "United States Of America" .
_:p3 <dgraph.type> "Person" .
_:p3 <name> "James Kramer" .
_:p3 <streetName> "One Main Str, Seattle WA 98101" .
_:p3 <countryName> "Unidentified" .
_:usa <dgraph.type> "Country" .
_:usa <name> "USA" .
_:usa <otherNames> "United States Of America" .
_:usa <otherNames> "U.S.A." .
_:usa <otherNames> "U.S." .
_:usa <otherNames> "US" .
}
}
- A schema containing name, address (
streetName
), and a country name for aPerson
is provided. - The schema also accommodates a
Country
type which will store country names as per ISO code with allowed aliases (otherNames
). - Three records representing duplicate
Person
records as per the scenario discussed are created. USA
is created as a valid 3-character abbreviation for theCountry
type, as well as allowed aliases.
The picture above shows how the data looks in Ratel, a web UI built by Dgraph that helps to explore Dgraph elements including schema, data and cluster visually. Please note that these nodes are distinct and no relationship between them has been asserted.
Part 2: Entity resolution using Dgraph
Let us now begin processing of the data stored in Dgraph. As part of our process, we will first standardize the data; then, we will detect duplicates, and finally figure out the conflicts and resolution mechanisms involved in merging.
Standardization: Match country names to ISO code and known aliases, and connect them
As a first step, we want to link the country attribute (countryName
) of the
Person
records with the corresponding Country
node.
upsert{
query{
qCountryOfPerson(func: uid(0x1) ) {
cn as countryName
}
qMatchWithISOCountry(func: type( Country)) @filter( eq(name,val(cn))
OR eq(otherNames,val(cn)) ){
standardCountryUID as uid
standardName as name
}
}
mutation @if( gt(len(standardName),0) ) {
set{
uid(cn) <country> uid(standardCountryUID) .
}
}
}
Result of this upsert query is visualized in Ratel as below.
- The
upsert
query is executed for eachPerson
entity by passing in itsuid
. ThecountryName
attribute is compared with either the 3-character ISO code (name
) or any of the allowed aliases (otherNames
). - An existing match of type
Country
is confirmed by the@if
check. - If a
Country
node exists, thePerson
nodes are linked. - We can query the
Person
nodes again, and this time we'll find that two of the nodes which had the appropriate ISO code or an approved alias are now connected to the corresponding country node. In other words, we can say that the person's country value is now standardized.
This standardization will prove beneficial for several other business processes down the line. There will be no need to remember ad-hoc representations of country names, and all querying can be driven off these standards-based Country nodes.
Duplicate Detection and Tagging: Fuzzy Match on Street Names
We can now go person by person and tag duplicates. In the query below, a person
(uid
0x1
) is being matched with others in the system on the streetName
attribute.
upsert{
query{
qGetStreetNameOfPerson(func: uid(0x1)){
snSource as streetName
snName as name
}
q1(func: type(Person)) @filter(match(streetName,val(snSource),4)
and match(name,val(snName),2)) {
snTarget as streetName
}
v as var(func: eq(tag,"0x1-match"))
}
mutation @if( gt(len(snTarget),0) ) {
set{
uid(v) <duplicated> uid(snTarget) .
uid(v) <duplicated> uid(snSource) .
uid(v) <tag> "0x1-match" .
}
}
}
The view from Ratel after the duplicate detection has run for this person is as
below.
- Duplicates are detected based on the values of specific attributes. For people
and organizations, this is often a name and address. In this demonstration, we
are matching on street names. The
uid
0x1
points to one of thePerson
nodes. This node is matched onname
andstreetName
attributes with otherPerson
nodes in the system. - A fuzzy text match function provided by Dgraph is used to handle minor errors
and noise in the data. The
match
function checks for street name similarities within a distance of four characters. We also do a similar check for thename
attribute within a distance of two characters. Please refer to the Dgraph documentation on fuzzy matching for more information here. - If at least one match is found, the matched node is hooked into a common parent node. As a result, all matches will be hooked up to a common parent, with a “duplicated” edge leading to the matched nodes.
- We have now achieved detection and tagging of duplicates.
De-duplication: Conflict Detection and Merging
With duplicates now tagged, we can then move on to the task of merging attribute
values into a single golden record. This task involves selecting a particular
value for each attribute from the duplicates and nominating it as the final
golden attribute value. This set of golden attributes is often called a golden
record and will be used in all downstream business processes such as spend
analytics, fraud detection, reporting, etc. Let us assume that the uid
0x5
is the parent node of tagged duplicates for the person James Kramer
.
{
qConflictOnCountry(func: uid(0x5)) @normalize{
duplicateCount: count(duplicated)
duplicated{
c as count(country)
countries as country
}
countryValuesAvailable: sum(val(c))
}
qUniqueCountries(func: uid(countries)) {
uniqueCountryNodes: count(uid)
uniqueCountryNodeName: name
}
}
Query Response:
{
"data": {
"qConflictOnCountry": [
{
"duplicateCount": 3,
"countryValuesAvailable": 2
}
],
"qUniqueCountries": [
{
"uniqueCountryNodes": 1
},
{
"uniqueCountryNodeName": "USA"
}
]
}
}
The task of merging can be simple or complex, depending on the attribute
involved. In our case, selecting a street name may be as simple as picking the
first one amongst the duplicates. This is because we used the street name as an
attribute for detecting duplicates. However, this is not the case for the
country attribute. Remember that while two of the records had a clean country
node attached, one of them had none. You can see this by running the query
qConflictOnCountry
. Out of three duplicate nodes, only two nodes have valid
country data available.
In this case, we may apply a conflict resolution business rule that states: “If
the valid countries available in the duplicate converge to any single value, we
will pick this value as a golden record”. This business rule is translated into
the query qUniqueCountries
. We can see that indeed there is one unique country
node, and the value is USA
. This can now be used as the final golden value for
the person's Country
attribute.
A collection of such conflict detection and resolution rules can be proposed for
each attribute to complete the merging activity, and the final set of attributes
for the Person
can be defined.
Summary
In this blog post, you learned the intricacies involved in Entity Resolution. Dgraph fully supports executing the business processes involved. In this walkthrough, we used Dgraph's capability to store, dynamically link, and traverse nodes to achieve each process's needs. Finally, Dgraph is a fast, transactional database, which allows such scenarios to be executed in an on-demand manner. This set of diverse capabilities, now also available as a fully managed service (https://dgraph.io/cloud), will help you achieve your goals of higher confidence in data, customer intimacy, and effective fraud detection.
Interested in getting started with Dgraph? You can get started here. Please say hello to our community here; we will be glad to help you in your journey!