October 30, 2017
Dealing with Missing Values and Outliers When Conducting a Data Audit
We are working with a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
In this exercise, we will apply some commonly known data audit techniques, and our goal will be to make the available data most relevant. In the end, we will see what the data looks like after the data audit as compared to the original data.
Data Schema
The data contains eight columns and more than 541,909 rows. These rows are:
- Invoice Number
- Stock Code
- Description
- Quantity
- Invoice Date
- Unit PRice
- Customer ID
- Country
In further exercises, we will try and predict the behavior of the customer while purchasing items online using the above variables. But to do so, we need to create a dataset that is free of most of the anomalies like missing values, a single repetitive value for categorical variables, and outliers.
Missing Value Treatment
As a first step, we will try to find out the missing values in the data. Missing values are certain entries in the data that do not add any value to the data. Let’s say we have a list of names as follows:
- "John A"
- "William S"
- "??"
- "Scott G"
- " "
- "Rick R"
In the above list values “??” and “ ” are missing values. These values are not helpful in any predictive modeling, so the only option would be to replace or ignore these values. Now, if we apply the same measure to our data column-wise, we get the following results for four columns:
VARIABLES |
MISSING VALUE |
# MISSING VALUES |
# TOTAL VALUES |
% OF MISSING VALUES |
Description |
?, ? sold as sets?, ??, ???, ????damages????, ????missing, ???lost, ???missing, ?display?, ?lost, ?missing, ?sold as sets? |
67 |
541909 |
0.012 |
UnitPrice |
"-11062.06" and "r" |
3 |
541909 |
0.0006 |
CustomerID |
Blanks |
135080 |
541909 |
24.93 |
Now the challenge is – what to do with these missing values? The answer usually depends on the data set we have. In our data, we can observe that the percentage of missing values in the four columns are 0.012, 1.960, 0.0006 and 24.93. The first three percentages are very small, and considering that we have 541,909 rows, chunking out a few rows will still keep our data substantially large for predictive modeling.
You may observe that there are some values for the “Quantity” field that have negative values. Please note that they are not incorrect values. These are values for which the invoice has been cancelled. This can be verified by the letter C before every cancelled invoice. Eg. - C536391.
As far as the CustomerID column is considered, we see a considerable missing value percentage (25%). Usually, it is recommended to replace the missing values when the percentage exceeds 20%. This is known as Imputation in data analysis terms. But since our column is a random system-generated value, we cannot possibly assume a CustomerID; therefore, all of the blank rows should be deleted.
We achieved this using R. Below is the code for missing value removal.
#Load data into the dataset
retail <- read.csv("Online Retail.csv")
#Assign blank value as NA
retail[retail == ""] <- NA
#Remove the NA values using na.omit()
retail <- na.omit(retail)
Observations
After removing the missing values from the data, we are left with 406,829 rows. Now we will try to see if we have a good spread of values for our variables. This can be checked by plotting a graph and examining if we are close to a normal spread.
Let us try and analyze the above plot. We see that most of the values lie between 0 and 5000. Now the question arises: why are we considering values up to +/-100000 in the plot? To answer this, we must run a little summary analysis on the Quantity column.
summary(testvar$Quantity)
Min. | 1St Qu. | Median | Mean | 3rd Qu. | Max. |
-81000.00 | 2.00 | 5.00 | 12.06 | 12.00 | 81000.00 |
If we look at the value under Max, we see that the maximum value is 81000, but IQR analysis tells us that most of the values lie between 2 and 12. This should explain the biased plot.
Therefore, we observe that out of the 397,924 rows, most of the values lie between 2 and 12 and values greater than 12 should be considered as outliers. This introduces our second data audit factor: Outliers.
Outlier Analysis
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. For example, we considered 81,000 as an outlier because it was at a significant distance from our supposed normal distribution. We also applied IQR on Quantity and unit price and got the following results.
VARIABLE |
MIN |
Q1 |
MEDIAN |
Q3 |
MAX |
MEAN |
LOWER FENCE |
UPPER FENCE |
Quantity |
12 |
2 |
5 |
12 |
26 |
7.2876 |
-13 |
27 |
UnitPrice |
0 |
1.95 |
1.95 |
3.75 |
8143 |
3.11 |
-2.5 |
7.5 |
Now, the thumb rule says that we should exclude all of the values that are less than the lower fence and greater than the upper fence.
Observations
When we applied the same rule to our data, we observed that quantity and Unit price contained 6.46% and 8.78% of outliers, respectively. After removing the outliers from the data set, we now have 343,712 rows with us, which is still a good amount of data for modeling. Now we will try and see if we get a better visualization for Quantity this time.
From the above plot, it can be concluded that our above analysis was correct, because most of the values are between 1 and 12 and the distribution is now evenly spread.
Conclusion
This was a small exercise describing the need and impact of a data audit on a data set before applying further techniques for modeling. In this exercise, we saw that some missing values and outliers would not have been helpful in predictive modeling of the customer behavior, and now we have meaningful data to use for modeling.