Using Dgraph For Entity Resolution

Introduction

In this blog post, we introduce the concept of "Entity Resolution" and explain why it matters. We discuss three aspects of Entity Resolution: Data Standardization, Duplicate Detection, and Data Merging. Finally, we use Dgraph to model a "Person" entity, add some example data, and demonstrate the Dgraph capabilities that support Entity Resolution on the Person entity.

Part 1: What is Entity Resolution?

For many businesses, it is important to uniquely identify customers and partners. This unique identification is the basis for several business policies. Policies such as maximum discounts, one-time offers, and maximum claim amounts require clean, duplicate-free data. This is where a database that can actively detect duplicates and merge attributes can play a pivotal role.
Entity Resolution is the process of comparing key attributes of customer or partner records to determine whether they describe the same entity or different ones. For example, you can uniquely identify a business partner by its registered name and address. A person can be identified by their full name, date of birth, gender, and address.
Poor Entity Resolution practices leave gaps in processes, leading to inefficiencies and mistakes that malicious actors can exploit. Fraud frequently occurs when someone slightly changes customer attributes and claims benefits they are not entitled to. For example, there are cases in the health insurance industry where scammers claimed benefits for 3+ limbs by submitting spurious, duplicate claims.

Business Processes in Entity Resolution

Entity Resolution typically requires specialized processing steps. These include data standardization, duplicate detection, and data merging. We will briefly describe each process to establish the context.

Data Standardization

Data Standardization involves tight control of the values associated with certain attributes. For example, you may want to maintain a clean reference list of country and state names from an ISO standard, as opposed to free-text entry of these attributes. This ensures that the information stored is not ambiguous. You may apply a policy where any customer data not containing a country name from the clean reference list gets rejected. This forces the use of high-quality data, which in turn gives higher confidence in the data stored in the database.

In the process of data standardization, you try to match the existing value of an attribute to a clean list. When a suitable match is found, you “hook” or connect the attribute to this clean list, making it available for any subsequent processes. In this blog series, we will demonstrate how you can take an ISO value for a country and use it to standardize customer data.
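
The matching step described above can be sketched outside of any database. The following Python snippet is a minimal illustration, not Dgraph code: the `REFERENCE_COUNTRIES` table and the `standardize_country` helper are hypothetical names, and the case-insensitive, whitespace-trimmed comparison is an assumption made for this example.

```python
# Hypothetical clean reference list: one canonical code plus its allowed aliases.
REFERENCE_COUNTRIES = {
    "USA": {"United States Of America", "U.S.A.", "U.S.", "US"},
}


def standardize_country(raw_value, reference=REFERENCE_COUNTRIES):
    """Return the canonical code for raw_value, or None when nothing matches."""
    # Assumption for this sketch: ignore surrounding whitespace and letter case.
    candidate = raw_value.strip().casefold()
    for code, aliases in reference.items():
        known = {code.casefold()} | {alias.casefold() for alias in aliases}
        if candidate in known:
            return code
    return None


for value in ["USA", "United States Of America", "Unidentified"]:
    print(value, "->", standardize_country(value))
```

A value that returns `None` is the case where a rejection policy, or a manual review step, would apply.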

Duplicate Detection

This process is critical in uniquely identifying an entity. In distributed systems, there is often a lack of a single unique “ID” attribute that can disambiguate and uniquely identify an entity. Thus, every entity needs to be frequently compared with other entities in the system to identify and establish uniqueness. In other words, if the attributes of an entity are not the same as any other entity within the current system, you can declare it as unique, at least in that particular system.

You can pick which attributes are critical in the makeup of an entity. For example, a name and address might be essential for a person or organization, whereas a text description could be critical for products.

Due to human error or other unavoidable circumstances, a name or address may contain slight spelling variations. As a result, you cannot rely exclusively on exact matches when comparing attributes. Duplicate detection therefore usually takes a more flexible approach, such as fuzzy text matching or probability-based duplicate detection. Dgraph provides text indices that allow both exact and fuzzy text comparisons. These indices let you perform matches at scale and on demand.

Data Merging

Once you have identified duplicates, you will find that some of them share the same value for an attribute, others carry different values, and some have no value at all. It’s often not straightforward to resolve the available values of an attribute into a single golden copy. For example, the latest timestamp may be the deciding factor for mobile numbers, while an insured address may be preferable as a residence address.

The process of data merging is all about identifying a unique golden value for a given attribute while detecting potential conflicts, determining the right mechanism (last updated, source of an attribute, etc.) to break the deadlock, and finally, saving this value for the attribute. The final set of attributes is often called a “golden record.” Additionally, the decision-making mechanism needs to be auditable so that business processes down the line have confidence in using the golden record attributes.
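
As a rough illustration of one such mechanism, the Python sketch below applies a "latest non-empty value wins" rule and records which rule and source produced the result, so the decision can be audited. The `pick_latest` helper, the field names, the source names, and the sample phone numbers are all hypothetical; a real system would choose a rule per attribute.

```python
from datetime import date


def pick_latest(candidates):
    """Pick the most recently updated non-empty value and record why it won."""
    usable = [c for c in candidates if c["value"]]
    if not usable:
        return {"value": None, "rule": "no non-empty value available",
                "considered": len(candidates)}
    winner = max(usable, key=lambda c: c["updated"])
    return {
        "value": winner["value"],
        "rule": "latest non-empty value",
        "source": winner["source"],
        "considered": len(candidates),
    }


# Hypothetical mobile-number candidates collected from three duplicate records.
mobile_candidates = [
    {"value": "555-0100", "source": "crm", "updated": date(2020, 3, 1)},
    {"value": "", "source": "claims", "updated": date(2020, 9, 15)},
    {"value": "555-0199", "source": "web-form", "updated": date(2020, 6, 30)},
]

golden = pick_latest(mobile_candidates)
print(golden)
```

Note that the empty value from the most recent record is skipped, so the winner is the newest record that actually carries a value.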

Dgraph provides a sophisticated query language and data structures that allow us to “construct” this golden record from candidates. We can do this in a manner that is entirely expressible in the form of a graph and visible to stakeholders. These features make activities in this complicated process transparent and command higher confidence in the data.

An Example of Entity Resolution: Three records for James Kramer

Schema and Initial Data Load

Let us now demonstrate how Dgraph can help with a simple scenario. Here are the schema and the mutation to set up the example.

<country>: uid @reverse .
<countryName>: string @index(term) .
<duplicated>: [uid] @count .
<name>: string @index(exact, trigram) .
<otherNames>: [string] @index(exact) .
<streetName>: string @index(trigram) .
<tag>: string @index(exact) .
type <Country> {
	name
	otherNames
}

{
  set{
    _:p1 <dgraph.type> "Person" .
    _:p1 <name> "James Kramer" .
    _:p1 <streetName> "One Main Street, Seattle WA 98101" .
    _:p1 <countryName> "USA" .

    _:p2 <dgraph.type> "Person" .
    _:p2 <name> "James Kraamer" .
    _:p2 <streetName> "1 Main Street, Seattle WA 98101" .
    _:p2 <countryName> "United States Of America" .

    _:p3 <dgraph.type> "Person" .
    _:p3 <name> "James Kramer" .
    _:p3 <streetName> "One Main Str, Seattle WA 98101" .
    _:p3 <countryName> "Unidentified" .

    _:usa <dgraph.type> "Country" .
    _:usa <name> "USA" .
    _:usa <otherNames> "United States Of America" .
    _:usa <otherNames> "U.S.A." .
    _:usa <otherNames> "U.S." .
    _:usa <otherNames> "US" .
  }
}

A schema containing name, address (streetName), and a country name for a Person is provided.
The schema also accommodates a Country type which will store country names as per ISO code with allowed aliases (otherNames).
Three records representing duplicate Person records as per the scenario discussed are created.
USA is created as a valid 3-character abbreviation for the Country type, as well as allowed aliases.

View from Ratel

The picture above shows how the data looks in Ratel, a web UI built by Dgraph that helps to explore Dgraph elements including schema, data and cluster visually. Please note that these nodes are distinct and no relationship between them has been asserted.

Part 2: Entity resolution using Dgraph

Let us now begin processing of the data stored in Dgraph. As part of our process, we will first standardize the data; then, we will detect duplicates, and finally figure out the conflicts and resolution mechanisms involved in merging.

Standardization: Match country names to ISO code and known aliases, and connect them

As a first step, we want to link the country attribute (countryName) of the Person records with the corresponding Country node.

upsert {
  query {
    qCountryOfPerson(func: uid(0x1)) {
      cn as countryName
    }
    qMatchWithISOCountry(func: type(Country)) @filter(eq(name, val(cn))
      OR eq(otherNames, val(cn))) {
      standardCountryUID as uid
      standardName as name
    }
  }
  mutation @if(gt(len(standardName), 0)) {
    set {
      uid(cn) <country> uid(standardCountryUID) .
    }
  }
}

The result of this upsert query is visualized in Ratel as below.

The upsert query is executed for each Person entity by passing in its uid. The countryName attribute is compared with either the 3-character ISO code (name) or any of the allowed aliases (otherNames).
An existing match of type Country is confirmed by the @if check.
If a matching Country node exists, the Person node is linked to it.
We can query the Person nodes again, and this time we'll find that two of the nodes which had the appropriate ISO code or an approved alias are now connected to the corresponding country node. In other words, we can say that the person's country value is now standardized.

This standardization will prove beneficial for several other business processes down the line. There will be no need to remember ad-hoc representations of country names, and all querying can be driven off these standards-based Country nodes.

Duplicate Detection and Tagging: Fuzzy Match on Street Names

We can now go person by person and tag duplicates. In the query below, a person (uid 0x1) is being matched with others in the system on the streetName attribute.

upsert {
  query {
    qGetStreetNameOfPerson(func: uid(0x1)) {
      snSource as streetName
      snName as name
    }
    q1(func: type(Person)) @filter(match(streetName, val(snSource), 4)
      AND match(name, val(snName), 2)) {
      snTarget as streetName
    }
    v as var(func: eq(tag, "0x1-match"))
  }
  mutation @if(gt(len(snTarget), 0)) {
    set {
      uid(v) <duplicated> uid(snTarget) .
      uid(v) <duplicated> uid(snSource) .
      uid(v) <tag> "0x1-match" .
    }
  }
}

The view from Ratel after the duplicate detection has run for this person is as below.

Duplicates are detected based on the values of specific attributes. For people and organizations, this is often a name and address. In this demonstration, we are matching on street names. The uid 0x1 points to one of the Person nodes. This node is matched on name and streetName attributes with other Person nodes in the system.
A fuzzy text match function provided by Dgraph is used to handle minor errors and noise in the data. The match function checks for street name similarities within a (Levenshtein) distance of four characters. We also do a similar check for the name attribute within a distance of two characters. Please refer to the Dgraph documentation on fuzzy matching for more information.
If at least one match is found, the matched node is hooked into a common parent node. As a result, all matches will be hooked up to a common parent, with a “duplicated” edge leading to the matched nodes.
We have now achieved detection and tagging of duplicates.
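
To make the distance thresholds above concrete, here is a small Python sketch that computes the Levenshtein edit distance for the three example records and applies the same limits (four for streetName, two for name). It is only an illustration of the idea: the `levenshtein` and `is_fuzzy_match` helpers are written for this example, the keys `p1` to `p3` mirror the blank node names in the mutation, and Dgraph's own implementation may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,          # delete char_a
                current[j - 1] + 1,       # insert char_b
                previous[j - 1] + cost,   # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def is_fuzzy_match(source: str, target: str, max_distance: int) -> bool:
    """True when the two strings are within max_distance edits of each other."""
    return levenshtein(source, target) <= max_distance


# The three Person records from the example data: (name, streetName).
people = {
    "p1": ("James Kramer", "One Main Street, Seattle WA 98101"),
    "p2": ("James Kraamer", "1 Main Street, Seattle WA 98101"),
    "p3": ("James Kramer", "One Main Str, Seattle WA 98101"),
}

source_name, source_street = people["p1"]
matches = [
    key
    for key, (name, street) in people.items()
    if is_fuzzy_match(street, source_street, 4)
    and is_fuzzy_match(name, source_name, 2)
]
print(matches)
```

In this sketch the source record also matches itself, because its distance to itself is zero. That mirrors the query above, which does not exclude the source node from the candidates.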

De-duplication: Conflict Detection and Merging

With duplicates now tagged, we can then move on to the task of merging attribute values into a single golden record. This task involves selecting a particular value for each attribute from the duplicates and nominating it as the final golden attribute value. This set of golden attributes is often called a golden record and will be used in all downstream business processes such as spend analytics, fraud detection, reporting, etc. Let us assume that the uid 0x5 is the parent node of tagged duplicates for the person James Kramer.

{
  qConflictOnCountry(func: uid(0x5)) @normalize{
    duplicateCount: count(duplicated)
    duplicated{
      c as count(country)
      countries as country
    }
    countryValuesAvailable: sum(val(c))
  }

  qUniqueCountries(func: uid(countries)) {
      uniqueCountryNodes: count(uid)
      uniqueCountryNodeName: name
  }
}

Query Response:

{
  "data": {
    "qConflictOnCountry": [
      {
        "duplicateCount": 3,
        "countryValuesAvailable": 2
      }
    ],
    "qUniqueCountries": [
      {
        "uniqueCountryNodes": 1
      },
      {
        "uniqueCountryNodeName": "USA"
      }
    ]
  }
}

The task of merging can be simple or complex, depending on the attribute involved. In our case, selecting a street name may be as simple as picking the first one amongst the duplicates. This is because we used the street name as an attribute for detecting duplicates. However, this is not the case for the country attribute. Remember that while two of the records had a clean country node attached, one of them had none. You can see this by running the query qConflictOnCountry. Out of three duplicate nodes, only two nodes have valid country data available.

In this case, we may apply a conflict resolution business rule that states: “If the valid countries available in the duplicates converge to a single value, we will pick this value as the golden value”. This business rule is translated into the query qUniqueCountries. We can see that there is indeed one unique country node, and the value is USA. This can now be used as the final golden value for the person's Country attribute.
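
The convergence rule itself is simple enough to sketch in a few lines of Python. This is an illustration of the rule only, not of how Dgraph evaluates the query; the `resolve_by_convergence` helper is a hypothetical name, and `None` stands in for a duplicate with no standardized country attached.

```python
def resolve_by_convergence(values):
    """Return (golden_value, explanation) under the convergence rule."""
    available = [v for v in values if v is not None]
    distinct = set(available)
    if len(distinct) == 1:
        # All available values agree, so that value becomes the golden value.
        return distinct.pop(), f"{len(available)} of {len(values)} duplicates agree"
    if not distinct:
        return None, "no value available"
    # More than one distinct value: leave the conflict for another rule.
    return None, f"conflict between {sorted(distinct)}"


# Country values of the three duplicates: two are linked to USA, one has none.
golden_country, reason = resolve_by_convergence(["USA", "USA", None])
print(golden_country, "-", reason)
```

Returning an explanation alongside the value keeps the decision auditable. When the values do not converge, the function reports the conflict rather than guessing, so a different rule can take over.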

A collection of such conflict detection and resolution rules can be proposed for each attribute to complete the merging activity, and the final set of attributes for the Person can be defined.

Summary

In this blog post, you learned the intricacies involved in Entity Resolution. Dgraph fully supports executing the business processes involved. In this walkthrough, we used Dgraph's capability to store, dynamically link, and traverse nodes to achieve each process's needs. Finally, Dgraph is a fast, transactional database, which allows such scenarios to be executed in an on-demand manner. This set of diverse capabilities, now also available as a fully managed service (https://dgraph.io/cloud), will help you achieve your goals of higher confidence in data, customer intimacy, and effective fraud detection.

Interested in getting started with Dgraph? Please say hello to our community; we will be glad to help you in your journey!

NOVEMBER 24 2020