Master Data with DataStudio

How to Automate Data Reconciliation for Large, Multi-Variate Data Sets


All data that a business needs is seldom from a single source. Different data sources might capture different facets or attributes pertaining to a particular entity as shown in Figure 1. Some of the common attributes between sources might also present conflicting information about the entity. Consider an event related to catastrophic equipment failure for an industrial company. Information about this failure might be recorded in a variety of databases. For example, the maintenance database might record information such as the date of failure, root cause and preventive actions taken to address the failure, while a different safety events database might record the injuries, personnel involved, compensation paid and other safety-related meta data.

Disparate Data Sources
Figure 1: Disparate Data Sources

Frequently, disparate data sources do not have unique identifiers that map events or entities between them. That is, the failure event might have two different identifiers in the maintenance and safety databases. As a result, it becomes too difficult to answer questions like, “what is the true cost of the equipment failure that caused the injury?

Most companies do not have the resources and budget to generate a common schema for enterprise entities which could cater to multiple departments including operations, finance, marketing, accounts, etc. IT departments may make an attempt to do so, oftentimes unsuccessfully. This is because implementing an enterprise Master Data Management (MDM) systems can be very expensive and involves significant manual work. The complexity of the project goes up exponentially if IT has inherited several disparate systems through past mergers and acquisitions. To this end, a common need of industrial data users is to integrate data from multiple databases, where data pertaining to the same entities is collected from various sources, reconcile conflicting information and aggregate complementary information.

Traditional Approach

The enterprise data integration and reconciliation task typically involves multiple steps as shown below

In the IT world, this process is often called data mastering, and solved using software such as Informatica MDM. The master data collects data from various sources and identifies identical or similar data and creates a database which becomes the single source of truth.

This whole process involves significant manual work and is usually performed by a data expert using software. This manual process is also custom based, i.e. it greatly depends on the data sources and hence, needs to be altered as the sources change. Also, when the data includes geospatial components and text, which is not in the same format in all sources, the process becomes even more difficult.

The DeepIQ Approach

DeepIQ has developed patent pending, AI based automated data reconciliation software that uses the power of distributed computation and machine learning to automate the process of master data discovery.

DeepIQ’s software uses a combination of statistical and machine learning algorithms to achieve a high-quality data mastering process. First, a series of statistical algorithms process all the sources to identify all versions of an entity or event across sources. Next, the machine learning algorithms refine the output of statistical algorithm to prune out poor quality results and generate high fidelity mapping of sources to globally unique entities. A final prioritization or fusion approach amalgamates all sources to create golden records of each unique entity.

These sequences of algorithms can be very computationally intensive. Consider for example, creating a material master for a combination of two sources, each with a million records. A naïve implementation to identify the unique entities in the union of the two sources would require a trillion computations, where each entity in a source is compared to every entity in another sourc

DeepIQ uses two strategies to handle this computationally intensive task. First, it runs natively on a distributed computational platform that supports Adobe Spark. Based on these techniques, DeepIQ’s software is able to run the end-to-end master data workflows in just a few minutes even dealing with sources with multiple millions of entities.

DeepIQ’s DataStudio provides an easy to use, drag and drop interface for the relational ingestion and data preparation like fixing time zones and filtering. DataStudio enables the user to create simple workflows for Master Data

The Spark code is generated from a combination of the user generated workflows and the DeepIQ libraries. This code is passed to Spark execution.

Success Story

An E&P company has grown through mergers and acquisitions. As a result, it has multiple data sources from different organizations within the company containing information on all of the materials that it procures, stores and uses. The original material master was no longer complete, and all downstream applications suffered from duplicate material records, conflicting material configuration data and incorrect stock estimates.

The company decided to drive optimal processes in purchasing and inventory management to minimize stock outages or unnecessary purchases. A traditional master data management solution would have taken more than a year to complete with multiple people and several technology stacks. The company conducted a pilot with a leading vendor that used a machine learning process to automate this task. The software license costs alone were in excess of $1 million per year and yielded mixed results at best.

At this point, the company started an engagement with DeepIQ to automate this process and achieve immediate results. DeepIQ installed its DataStudio software and configured it to run on customer’s existing infrastructure. Within four weeks, using DataStudio, our client configured the workflow to:

Sample Master Data Workflow
Figure 3: Sample Master Data Workflow

The solution was designed so that as new records were created, updated or deleted in the source systems, the master data was updated. Processing all three data sources, including deduplication of the data within each source, took only 12 minutes to complete on a medium sized cluster.

To ensure data quality, any records deemed to be between 75% to 90% match were sent to a human data steward for verification. Random matches were also sent to the data steward. The decisions made by the data steward were used to further train the master data model enabling the matching algorithm to improve over time.

The following diagram shows a sample of a typical match and the results

 Sample Match Results
Figure 4: Sample Match Results


DeepIQ’s patent pending Master Data capability captures the complex interactions between different data entities even when the training data is sparse. This unique capability can be a gamechanger for your enterprise data, allowing users to more robustly analyse their business using the full scope of their dataset.

For further information, please contact us at

Download the whitepaper here