Data auditing using Incline’s AutoBot: Faster than Eliud Kipchoge

This blog is the second in a series of articles introducing Incline’s AutoBot, a one-of-a-kind product that helps analysts automate key stages of model building. Read more about the AutoBot’s overall product offering here: https://lnkd.in/dbK2F8y

The prime focus of data analytics is to deliver actionable insights that drive growth for businesses. Ideally, then, the majority of an analyst’s time would be spent on the real-world implications of the analysis for a business’s operations or its end users, and on ensuring that the best techniques are used to achieve the biggest impact. However, this is generally not the case.

In data analytics, and more specifically in predictive model building, analysts spend disproportionately more time preparing and validating the inputs and outputs of a model than optimizing its performance and deriving practical, actionable insights. To correct this imbalance, Incline has recently developed a model building tool named AutoBot: a powerful web-based tool, built by analysts, that automates key bottlenecks in the model building process and refocuses analysts’ time on actual analysis and optimization.

This post highlights an example of the AutoBot’s data audit capability and discusses its benefits. Further capabilities will be covered in later posts.

The success of any predictive model hinges on the quality of the raw data, which makes an initial data audit imperative. Traditional manual data auditing, including variable validation, is a time-intensive and tedious process and, if done incorrectly, carries a high risk of causing costly delays in downstream model building.

With AutoBot, the decision to perform an automatic data audit is as easy as a single check on the interface (below) and can be performed independently or in conjunction with a model build. 

Manual data auditing normally involves writing code designed to audit one particular data set. Even when adapting existing code, this process can take up to two hours.

Using the AutoBot to automate data auditing can significantly reduce this time. As an example, the audit of a dataset of transactions (1,065 rows x 13 columns) took only 5 seconds to complete. Increasing the table size to 188,000 rows x 39 columns still took under 30 seconds.

The first output of the audit is a data snapshot showing the first few rows of the table, which provides a visual sense check of what the observations contain.

This is particularly useful for quickly identifying what each column shows and which column identifies each observation.

The second output of the audit is a statistical summary of each column.

The statistical summary shows each column’s count, populated records, distinct records, nulls, blanks, minimum, mean, quartiles (Q1 and Q3), median, maximum, and mode (value and frequency). This is a quick and efficient way of identifying whether there are any obvious anomalies in the data.
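
To make these statistics concrete, here is a minimal Python sketch of how such a per-column summary could be computed by hand. It is not the AutoBot’s implementation: the function name `summarise_column` and the sample values are hypothetical, and the quartile method (the default of `statistics.quantiles`) is an assumption, since the tool’s exact definitions are not described here.

```python
from collections import Counter
from statistics import mean, median, quantiles


def summarise_column(values):
    """Return audit statistics for one numeric column (illustrative only)."""
    nulls = sum(1 for v in values if v is None)
    blanks = sum(1 for v in values if v == "")
    # Populated records are everything that is neither null nor blank.
    populated = [v for v in values if v is not None and v != ""]
    # Assumption: quartiles use the statistics module's default method.
    q1, _, q3 = quantiles(populated, n=4)
    mode_value, mode_frequency = Counter(populated).most_common(1)[0]
    return {
        "count": len(values),
        "populated": len(populated),
        "distinct": len(set(populated)),
        "nulls": nulls,
        "blanks": blanks,
        "min": min(populated),
        "q1": q1,
        "median": median(populated),
        "mean": mean(populated),
        "q3": q3,
        "max": max(populated),
        "mode": mode_value,
        "mode_frequency": mode_frequency,
    }


# Hypothetical Transaction Value column with one null and one blank.
summary = summarise_column([120, 80, 80, None, 250, "", 95])
for name, value in summary.items():
    print(f"{name}: {value}")
```

Even on this tiny sample, the gap between the populated count and the total count points straight at the null and the blank, which is the kind of anomaly the summary is meant to surface.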

Once an anomaly is identified, it takes a combination of technical and industry knowledge to decide whether the outlier is genuinely incorrect and should be excluded from further analysis, or whether it instead calls for a change in how the data is understood.

Examples of how the audit can show where something in the data is wrong (not limited to):

  • Unique identifier is not unique
  • A necessary field such as Transaction Type contains blanks
  • A numeric or date field contains values outside the range of what is possible
  • A distinct count of 1 where there should be a variety of entries
  • A numeric field is recorded in character format
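
The checks in this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the AutoBot works internally: the function `find_red_flags`, its parameters, the field names and the sample rows are all hypothetical, and the valid range is something the analyst would have to supply from domain knowledge.

```python
def find_red_flags(rows, id_field, required, numeric, valid_ranges=None):
    """Return human-readable warnings for a list of dict rows (illustrative)."""
    flags = []

    # Unique identifier is not unique.
    ids = [row[id_field] for row in rows]
    if len(set(ids)) < len(ids):
        flags.append(f"{id_field}: identifier is not unique")

    # A necessary field contains blanks or nulls.
    for field in required:
        if any(row[field] in (None, "") for row in rows):
            flags.append(f"{field}: required field contains blanks")

    # A numeric value falls outside the range the analyst considers possible.
    for field, (low, high) in (valid_ranges or {}).items():
        for row in rows:
            value = row[field]
            if isinstance(value, (int, float)) and not low <= value <= high:
                flags.append(f"{field}: value {value} is outside {low} to {high}")

    # A distinct count of 1 where a variety of entries is expected.
    for field in rows[0]:
        if len(rows) > 1 and len({row[field] for row in rows}) == 1:
            flags.append(f"{field}: only one distinct value")

    # A numeric field is recorded in character format.
    for field in numeric:
        if any(isinstance(row[field], str) and row[field] != "" for row in rows):
            flags.append(f"{field}: numeric field stored as text")

    return flags


# Hypothetical transactions with several deliberate problems.
rows = [
    {"id": 1, "type": "debit", "value": "100", "currency": "ZAR"},
    {"id": 2, "type": "", "value": 250, "currency": "ZAR"},
    {"id": 2, "type": "credit", "value": 9_999_999, "currency": "ZAR"},
]
flags = find_red_flags(
    rows,
    id_field="id",
    required=["type"],
    numeric=["value"],
    valid_ranges={"value": (-100_000, 100_000)},
)
for flag in flags:
    print(flag)
```

Note that a flag such as "only one distinct value" is a prompt for investigation rather than proof of an error: a single currency may be exactly what the data should contain.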

Examples of how the audit can change your view on what the data contains (not limited to):

  • If a column such as Transaction Value contains negative values, the data could include credits or reimbursements as well as debits
  • If a column such as Gender has blank rows, the field could be optional at the point of capture
  • If Gender has blank rows and the maximum Transaction Value is substantially higher than the expected value, mean or mode, the data could include corporate as well as personal transactions

Further benefits of using the AutoBot for data auditing include:

  • The field formats are modified to best suit the characteristic of the variable. If there is an alphanumeric field saved as VARCHAR(50) but the greatest length of any observation is 7, it will modify the field to CHAR(7), thus optimizing the computation time
  • There is no restriction on the number of columns or rows the dataset may contain
  • Any punctuation or characters that would cause issues are automatically accommodated
  • A time log is kept throughout the process and is easily accessible
  • Any error in the process is logged and informs the user of what and where the issue is
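
The field format adjustment described in the first benefit above can be illustrated with a short Python sketch. This is an assumption about the general idea rather than the AutoBot’s actual logic: the helper `suggest_char_type` and the sample values are hypothetical.

```python
def suggest_char_type(values, declared_length):
    """Suggest a tighter character type from the longest observed value."""
    # Ignore nulls and blanks when measuring the longest observation.
    longest = max((len(v) for v in values if v), default=0)
    if 0 < longest < declared_length:
        return f"CHAR({longest})"
    # Nothing to gain: keep the declared type.
    return f"VARCHAR({declared_length})"


# Hypothetical alphanumeric column declared as VARCHAR(50).
suggested = suggest_char_type(["TX00017", "TX42", "", None, "TX9"], 50)
print(suggested)  # CHAR(7)
```

A suggestion like this is only safe for the data audited so far; a later load with longer values would need the wider type again.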

It is clear that one has to understand what the data represents to benefit from the data audit; however, the AutoBot enables analysts to reach these conclusions sooner. This ultimately leaves more time for refining the raw data, fine-tuning the rest of the process, and delivering high-quality, actionable insights to clients on time.

At Incline, we have quite a few avid runners and congratulate Eliud Kipchoge on smashing the world record marathon time yesterday. His incredible record of 2:01:39 inspires us personally and professionally, reminding us to always strive to be the best and not stop when we get there. We are confident we will never be able to run as fast as him; however, at least now with AutoBot, we can ensure data auditing will never again take as long as his world record time.

Details of further capabilities of the AutoBot will follow in future articles.

Follow Incline’s LinkedIn and Facebook accounts to keep up with articles and industry-related news.

Visit us at www.incline.co.za

For more information contact info@incline.co.za