In my previous post on understanding data for analysis, I described the common approaches for the analysis of single variables. In this post, I’ll summarize the common approaches for analyzing the relationships between multiple variables in your data. Why is an analysis of the relationships important? Let’s start with a paradox.
An intriguing effect is sometimes observed when the analysis of single variables leads to a trend that reverses or disappears when the variables are combined or the effect of confounding variables are taken into account. In the fall of 1973 at UC Berkeley, there were allegations of gender bias in the graduate school admissions. It was observed that men had a 44% admission rate against 35% for women, a difference that was unlikely to be caused by a random anomaly.
However, a breakup of the admission rates by all departments revealed that the allegations were not true. The conclusion from the new data was women tended to apply to competitive departments with low rates of admission while men tended to apply to less competitive departments with higher rates of admission. The moral of the story is it is important to analyze the relationships between variables.
Covariance is a measure of the tendency of two variables to vary together. Covariance is expressed as:
Additionally, Python libraries, such as NumPy, account for corrections from small sample sizes. Covariance is interpreted as:
Covariance is rarely used in summary statistics because it is hard to interpret. By itself, it does not provide a sense of how much the two variables vary together, but only their ‘direction’ (if you consider each series to be a vector). The unit of Covariance is also confusing because it is the product of two different units. Pearson’s Correlation divides the Covariance with the product of the standard deviations of both series resulting in a dimensionless value.
Pearson’s Correlation is bounded in [-1, 1].
Sx, Sy are the standard deviations of the X, Y series respectively. Standard Deviation (σ) is a measure of the spread of the distribution.
Crosstabs (short for cross tabulations) are counts of the intersection of two categorical variables (while Covariance and Correlations hold true for continuous variables). For example, if you have two categorical variables – X with values (1, 2, 3) and Y with values (‘A’, ‘B’, ‘C’), a crosstab will be a 3×3 matrix that counts the number of times each value occurs together in the data set.
Crosstabs are suggested only when you suspect a relationship between two categorical variables as the matrices can become large and hard to analyze.
Pearson’s Correlation is misleading in the face of non-linear relationships between the variables and if the variables are not normally distributed. It is also susceptible to outliers. Spearman’s Rank Correlation corrects for these effects by computing the Correlation between the ranks of each series. The rank of a value in the series is its index in the sorted list. The computation of the Spearman’s Rank Correlation is more expensive than Pearson’s Correlation because it involves sorting the two series or computing the ranks by index hashing. The formula is the same as Pearson’s correlation, but the series X, Y are the ranks of the values in the original series. This is best explained with a table – follow the colors to see the value to rank transformation:
The formula can be expressed as:
We know enough to analyze a wine quality data set, obtained from UC Irvine’s archives. The data set has 11 input variables and one output variable – quality – that was assessed by expert testers. The input variables are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, and alcohol.
It helps to have some domain knowledge of the data set as you can use this knowledge to avoid calculating correlations for known relationships and detect spurious correlations as well. This data set, however, is small enough to compute a correlation matrix for all the input variables.
A correlation matrix denotes the correlation coefficients between the input variables. Let’s additionally add a dash of data visualization to the matrix so that we don’t end up staring at numbers. This is called a Correlogram.
I analyzed the input variables and observed that they did not conform to a normal distribution. I also compared the Pearson’s correlation and Spearman’s Rank correlation coefficients for a couple of input variables and found significant differences. Given these observations, I decided to compute the matrix using Spearman’s Rank correlation. Using R:
df <- read.csv("winequality-red.csv", header=TRUE, sep=";") idf <- df[,1:11] mcor <- cor(idf, method=c("spearman")) #install.packages("corrplot") library(corrplot) corrplot(mcor, type="upper", order="hclust", tl.col="black", tl.srt=45)
This is what we get:
What can we decipher from the Correlogram?
Scatter Plots are a simple way to visualize the relationship between two (or more) variables. For two dimensions, they plot the location of the data point. The more correlated the variables are, the narrower a band towards which the plot tends. Scatter plots can become confusing when there are a large number of points or outliers ; hexbin plots are used in these cases. A hexbin plot divides the graph into hexagonal bins and colors each bin according to the number of points in each bin.
Let’s draw scatter and hexbin plots for pH versus fixed acidity and Total versus free sulfur dioxide. Using Python:
import pandas as pd import matplotlib.pyplot as plt columns = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH", "sulphates", "alcohol"] df = pd.read_csv("winequality-red.csv", sep=";") df.plot.scatter(x=columns, y=columns) df.plot.hexbin(x=columns, y=columns, gridsize=10) df.plot.scatter(x=columns, y=columns) df.plot.hexbin(x=columns, y=columns, gridsize=25) plt.show()
Scatter plot – fixed acidity vs pH
Hexbin plot – fixed acidity vs pH
Scatter plot – free vs total sulfur dioxide
Hexbin plot – free vs total sulfur dioxide
There are a couple of interesting things worth mentioning:
Single and multiple variable summaries provide you with a starting platform for a data audit strategy. You need to couple this with effective visualizations to understand your data and check for misleading summary statistics.