Dealing with Missing Values and Outliers When Conducting a Data Audit

We are working with a transnational data set containing all the transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

In this exercise, we will apply some commonly used data audit techniques, with the goal of keeping only the data that is relevant for analysis. At the end, we will compare what the data looks like after the audit with the original data.

Data Schema

The data contains eight columns and 541,909 rows. The columns are:

  • Invoice Number
  • Stock Code
  • Description
  • Quantity
  • Invoice Date
  • Unit Price
  • Customer ID
  • Country

In later exercises, we will use these variables to predict customers' behavior when purchasing items online. But to do so, we first need a data set that is free of the most common anomalies: missing values, categorical variables dominated by a single repeated value, and outliers.
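As a quick sanity check before the audit, the sketch below loads the export (assumed to be saved as "Online Retail.csv", the same file used later in this post) and confirms its shape and column names; the exact names in the CSV may differ slightly from the list above (e.g. InvoiceNo rather than Invoice Number).

# Load the raw export and confirm its shape and schema
retail <- read.csv("Online Retail.csv", stringsAsFactors = FALSE)

dim(retail)     # expect 541909 rows and 8 columns
names(retail)   # the eight variables listed above
str(retail)     # column types at a glance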

Missing Value Treatment

As a first step, we will try to find the missing values in the data. Missing values are entries that carry no usable information, either because they are blank or because they contain a placeholder rather than a real value. Let’s say we have a list of names as follows:

  • “John A”
  • “William S”
  • “??”
  • “Scott G”
  • “  ”
  • “Rick R”

In the above list, the values “??” and “  ” are missing values. They are not helpful for any predictive modeling, so the only options are to replace them or remove them. Applying the same check to our data column by column, we get the following results for the affected columns:

VARIABLE    | MISSING VALUES                                                                                                            | # MISSING VALUES | # TOTAL VALUES | % OF MISSING VALUES
Description | ?, ? sold as sets?, ??, ???, ????damages????, ????missing, ???lost, ???missing, ?display?, ?lost, ?missing, ?sold as sets? | 67               | 541909         | 0.012
UnitPrice   | "-11062.06" and "r"                                                                                                       | 3                | 541909         | 0.0006
CustomerID  | Blanks                                                                                                                    | 135080           | 541909         | 24.93
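The counts in the table above can be reproduced with a short sketch along the following lines (this assumes the data is already loaded into retail as shown earlier; the question-mark pattern for Description is an assumption based on the placeholder values listed in the table):

# Count blank and placeholder entries per column (a rough sketch, not the exact audit script)
n_rows <- nrow(retail)

blank_customers  <- sum(is.na(retail$CustomerID) |
                        trimws(as.character(retail$CustomerID)) == "")
odd_descriptions <- sum(grepl("\\?", retail$Description))   # placeholders such as "??", "?missing", "?lost"

round(100 * blank_customers  / n_rows, 2)   # should be close to the 24.93% reported above
round(100 * odd_descriptions / n_rows, 3)   # a rough proxy for the 0.012% reported above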

Now the challenge is: what do we do with these missing values? The answer usually depends on the data set at hand. In our data, the percentages of missing values in the affected columns are 0.012, 1.960, 0.0006, and 24.93. The first three percentages are very small, and considering that we have 541,909 rows, discarding those rows will still leave the data substantially large for predictive modeling.

You may also observe that some values in the Quantity field are negative. Note that these are not incorrect values; they belong to cancelled invoices, which can be identified by the letter C prefixed to the invoice number, e.g. C536391.

As far as the CustomerID column is concerned, we see a considerable percentage of missing values (about 25%). When the percentage of missing values exceeds 20%, the usual recommendation is to replace them with estimated values rather than delete the rows; this is known as imputation in data analysis terms. But since CustomerID is a random, system-generated value, we cannot reasonably guess a CustomerID; therefore, all rows with a blank CustomerID should be deleted.

We achieved this using R. Below is the code for missing value removal.

#Load the data into a data frame
retail <- read.csv("Online Retail.csv")

#Mark blank values as NA
retail[retail == ""] <- NA

#Remove all rows containing NA values using na.omit()
retail <- na.omit(retail)
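A quick check (a sketch; the exact count can vary slightly depending on how the CSV was exported) confirms that no missing values remain and shows how many rows are left:

# Verify the result of the missing-value removal
sum(is.na(retail))   # should now be 0
nrow(retail)         # about 406,829 rows should remain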

Observations

After removing the missing values from the data, we are left with 406,829 rows. Now we will check whether the values of our variables are well spread. This can be done by plotting the distribution and examining how close it is to a normal spread, as sketched below.
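A simple histogram of Quantity is enough for this check; below is a minimal base-R sketch, with the axis limits chosen to match the range discussed next (an assumption, not part of the original script):

# Plot the distribution of Quantity to inspect its spread
hist(retail$Quantity,
     breaks = 100,
     xlim   = c(-100000, 100000),   # wide limits; the reason for this range is explained below
     main   = "Distribution of Quantity",
     xlab   = "Quantity")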

[Plot: distribution of Quantity after removing missing values]

Let us analyze the above plot. We see that most of the values lie between 0 and 5,000. The question then arises: why does the plot extend to +/-100,000? To answer this, we run a quick summary of the Quantity column.

summary(retail$Quantity)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-81000.00      2.00      5.00     12.06     12.00  81000.00

Looking at the value under Max., we see that the maximum value is 81,000, while the quartiles tell us that most of the values lie between 2 and 12. This explains the heavily skewed plot.

Therefore, out of the 397,924 rows, most of the Quantity values lie between 2 and 12, and values lying far outside this range should be treated as outliers. This brings us to our second data audit factor: outliers.

Outlier Analysis

An outlier is an observation that lies an abnormal distance from the other values in a random sample from a population. In a sense, this definition leaves it to the analyst (or a consensus process) to decide what counts as abnormal. For example, we considered 81,000 an outlier because it lies a significant distance from the bulk of the distribution. We also applied the IQR method to Quantity and UnitPrice and obtained the following results.

VARIABLE  | MIN | Q1   | MEDIAN | Q3   | MAX  | MEAN   | LOWER FENCE | UPPER FENCE
Quantity  | 12  | 2    | 5      | 12   | 26   | 7.2876 | -13         | 27
UnitPrice | 0   | 1.95 | 1.95   | 3.75 | 8143 | 3.11   | -2.5        | 7.5

The rule of thumb says we should exclude all values that fall below the lower fence or above the upper fence, as sketched below.
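Below is a minimal sketch of this rule in R, assuming the standard 1.5 x IQR multiplier; the fences it produces should be close to the values in the table above.

# Compute the IQR fences for a numeric column
fences <- function(x) {
  q   <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

qf <- fences(retail$Quantity)    # roughly c(-13, 27) for this data
pf <- fences(retail$UnitPrice)   # roughly c(-2.5, 7.5) for this data

# Share of values flagged as outliers for each variable
mean(retail$Quantity  < qf["lower"] | retail$Quantity  > qf["upper"])
mean(retail$UnitPrice < pf["lower"] | retail$UnitPrice > pf["upper"])

# Keep only the rows that fall inside both pairs of fences
retail <- retail[retail$Quantity  >= qf["lower"] & retail$Quantity  <= qf["upper"] &
                 retail$UnitPrice >= pf["lower"] & retail$UnitPrice <= pf["upper"], ]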

Observations

When we applied this rule to our data, we found that Quantity and UnitPrice contained 6.46% and 8.78% outliers, respectively. After removing the outliers from the data set, we are left with 343,712 rows, which is still a good amount of data for modeling. Let us now see whether we get a better visualization for Quantity this time.

[Plot: distribution of Quantity after outlier removal]

The plot confirms the earlier analysis: most of the values lie between 1 and 12, and the distribution is now much more evenly spread.

Conclusion

This was a small exercise describing the need for, and the impact of, a data audit before applying modeling techniques to a data set. We saw that the missing values and outliers would not have helped in predictive modeling of customer behavior, and we are now left with meaningful data to use for modeling.

Jasminder Singh

Senior Quality Assurance Engineer

Jasminder is a Senior Quality Assurance Engineer at 3Pillar Global. His expertise is in Quality Assurance, where he has worked across teams and helped them deliver high-quality products. He has completed many data-audit projects, applying various data analysis techniques to filter a given data set and remove unwanted data and outliers. His current area of interest is automation testing. In his free time, he loves to watch and play sports and to play video games.
