April 15, 2016
Using Grid Heat Maps for Data Visualization
Heat maps represent values in a matrix as colors. Traditionally, heat maps have been used to indicate the level of activity in different systems. For example, a load test result can represent requests to different parts of the application as a heat map. The heat map appears as a mass of colors chosen from a color scheme with gradients from one color to the other.
Here is a typical example from Wikipedia:
By Plumbago - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=23016243
Above is a geographical heat map of ocean salinity using a rainbow colormap.
Another interesting use of heat maps is to understand the degree of relationship between two variables. This results in a grid where the axes are obtained from the range of each variable. The rest of this post describes the usage of grid heat maps in different scenarios.
House Hunting?
This visualization, taken from Trulia’s trends, depicts how house-hunting activity varies with the day of the week and the time of the day. The full visualization suggests that most house hunting is done on weekdays at 9 PM and on Sunday evenings.
Web Usage
A web application’s logs can be analyzed to understand the usage patterns. If you take the day of the week on the y-axis and the time of the day on the x-axis, the grid color can be determined by the number of requests or by user sessions, measured over a period of time.
Grid heat maps are not limited to time units on both axes. The next three examples show usage in other domains.
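As a rough illustration of the counting step, here is a minimal Python sketch. It assumes the request timestamps have already been extracted from the logs as strings in `YYYY-MM-DD HH:MM:SS` form; the function name `bin_requests` and the sample timestamps are made up for this example.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def bin_requests(timestamps):
    """Count requests per (day-of-week, hour) cell."""
    counts = Counter()
    for ts in timestamps:
        moment = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        # weekday() returns 0 for Monday through 6 for Sunday
        counts[(moment.weekday(), moment.hour)] += 1
    # 7 rows (Mon..Sun) by 24 columns (hours 0..23)
    return [[counts[(day, hour)] for hour in range(24)] for day in range(7)]


# Hypothetical timestamps, as if pulled from an access log
sample = ["2016-04-11 09:15:02", "2016-04-11 09:47:30", "2016-04-12 21:05:11"]
grid = bin_requests(sample)
print(DAYS[0], "09:00 requests:", grid[0][9])
```

Each cell of `grid` can then be mapped to a color, exactly as in the other grid heat maps in this post.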
Weekly Inventory Prediction
In a recent project, I proposed a prediction model that analyzed weather trends and advised on the inventory for perishable items for each day of the week. In order to depict this, I plotted the items (categorized as A, B, C…) on the x-axis and the day of the week (Mon, Tue…) on the y-axis. The grid color was influenced by the amount of inventory to maintain for a particular item and day. The resulting visualization was quite similar to the web usage example.
Correlation Matrix
A correlation matrix lists the correlation coefficients between every pair of variables in a data set. A heat map grid can represent these coefficients, giving a visual picture of the dependence between the variables and making strong dependencies easy to spot. A coefficient close to +1 indicates a strong positive dependency, a coefficient close to -1 indicates a strong inverse dependency, and a coefficient close to zero indicates a weak linear dependency.
The data source is the mtcars data set from the R development environment. It comprises different aspects of automobile design and performance for 32 automobiles. You can refer to the data set to understand the variables used in the correlation matrix. In the matrix, the blue circles indicate positive correlation, while red circles indicate negative correlation.
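To make the coefficients concrete, here is a small Python sketch that computes a Pearson correlation matrix by hand. The column names and numbers are invented for illustration and are not taken from mtcars; `pearson` and `correlation_matrix` are helper names chosen for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name: {name: coefficient}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values, chosen so the relationships are easy to see
columns = {
    "weight": [1.0, 2.0, 3.0, 4.0],
    "mileage": [40.0, 30.0, 20.0, 10.0],
    "doors": [2.0, 4.0, 4.0, 2.0],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

In this toy data, weight and mileage move in exactly opposite directions, while weight and doors show no linear relationship at all.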
Confusion Matrix
A confusion matrix is a table that describes the performance of a classifier on test data for which the true labels are known. A typical confusion matrix looks quite like a correlation matrix, except that each cell counts the number of times an item of one class (from the test data) was assigned a particular label; the cells off the diagonal count the mislabelled items. A grid heat map can quickly show the degree of confusion.
This data set represents classification of images taken by satellites. The type of satellite image is a function of the image features. Can you tell which are the most mislabelled images?
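If you want to build such a matrix yourself, the counting is simple. The sketch below assumes rows are true classes and columns are predicted classes; the class names and labels are hypothetical and have nothing to do with the satellite data set.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix


# Hypothetical classes and labels, for illustration only
classes = ["water", "forest", "urban"]
true_labels = ["water", "water", "forest", "urban", "urban", "forest"]
predicted = ["water", "forest", "forest", "urban", "water", "forest"]

matrix = confusion_matrix(true_labels, predicted, classes)
# Everything off the diagonal is a mislabelled item
mislabelled = sum(
    count
    for r, row in enumerate(matrix)
    for c, count in enumerate(row)
    if r != c
)
for label, row in zip(classes, matrix):
    print(label, row)
print("mislabelled:", mislabelled)
```

The heat map then colors each cell by its count, so bright cells away from the diagonal stand out as the most confused pairs.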
Clock In-Out time
Enough of examples. Let us understand how to build a grid heat map with a faux problem (but real data!).
You are the operations head in an organization and you are health conscious. You want to provide fresh fruits to employees because you are concerned that they keep snacking on unhealthy choices. In order to do that, you want to time the shelf stocking (when do the fruits come out and when do they go in). One way to time these activities is to find out when most employees clock in and when they clock out.
We start by collecting the raw data and, after converting it to a suitable format, we get the in-out records for a month for every employee. Here’s a sample of three days for an employee:
Mon::10:57:21::18:50:05,Tue::09:54:11::18:37:54,Wed::10:25:21::18:06:50
Each record denotes the day of the week, in-time in 24-hour clock format and out-time in 24-hour clock format.
The next step is to read each record and bin the in and out times in a matrix with hours-of-the-day as the x-axis and day-of-week as the y-axis. For example, the record shown above will increase the count in (Mon, 10), (Tue, 9), (Wed, 10) cells of the in-matrix and (Mon, 18), (Tue, 18), (Wed, 18) of the out-matrix.
You get two matrices for day-of-week versus in-time hours and day-of-week versus out-time hours. The cell value is the number of times any employee clocks in (or out) on the day of the week and the hour. Each matrix would look like this:
Day | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
Mon | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 31 | 135 | 332 | 428 | 202 | 68 | 13 | 4 | 3 | 36 | 6 | 3 | 2 | 0 | 0 | 0 |
Tue | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 34 | 108 | 287 | 365 | 141 | 51 | 8 | 3 | 0 | 23 | 2 | 0 | 0 | 0 | 1 | 0 |
Wed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 26 | 112 | 285 | 337 | 171 | 60 | 13 | 3 | 4 | 22 | 5 | 3 | 0 | 0 | 0 | 3 |
Thu | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 27 | 95 | 283 | 358 | 149 | 58 | 14 | 0 | 1 | 27 | 7 | 1 | 3 | 0 | 0 | 0 |
Fri | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 34 | 110 | 265 | 324 | 164 | 42 | 18 | 4 | 6 | 18 | 6 | 1 | 2 | 0 | 0 | 0 |
Sat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 3 | 1 | 5 | 6 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
Sun | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 6 | 3 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
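The binning step described above can be sketched in a few lines of Python. This is only an illustration that assumes the record format shown in the sample (`day::in-time::out-time`, comma separated); the helper names `empty_matrix` and `bin_records` are made up here.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def empty_matrix():
    """7 rows (days of the week) by 24 columns (hours of the day)."""
    return [[0] * 24 for _ in DAYS]


def bin_records(line, in_matrix, out_matrix):
    """Add one employee's comma-separated records to both matrices."""
    for record in line.strip().split(","):
        day, in_time, out_time = record.split("::")
        row = DAYS.index(day)
        # Only the hour part of HH:MM:SS decides the column
        in_matrix[row][int(in_time.split(":")[0])] += 1
        out_matrix[row][int(out_time.split(":")[0])] += 1


in_matrix, out_matrix = empty_matrix(), empty_matrix()
sample = "Mon::10:57:21::18:50:05,Tue::09:54:11::18:37:54,Wed::10:25:21::18:06:50"
bin_records(sample, in_matrix, out_matrix)
print(in_matrix[0][10], in_matrix[1][9], out_matrix[2][18])
```

Running `bin_records` over every employee's line for the month fills in the full matrices.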
Let us denote the frequency of in-time values with a blue scheme and the frequency of out-time values with a red scheme. However, we have a problem that we did not see in the previous examples: so far, each cell has shown a single variable, but in this problem there are two, the in-time and the out-time. For the purpose of this visualization, we will consider only the larger value in each cell, because the chances of people leaving the office when it is time to arrive at work, and vice versa, are quite low. We merge the two matrices cell by cell, with precedence given to the variable with the larger value (ties go to the in-time):
    import csv

    # newline="" is the csv module's recommended setting in Python 3
    with open("data/in_out_series.csv", mode="w", newline="") as outfile:
        writer = csv.writer(outfile)
        # "series" is 0 for in-time and 1 for out-time
        writer.writerow(["day", "hour", "value", "series"])
        for row_index, row in enumerate(in_matrix):
            for col_index, in_value in enumerate(row):
                out_value = out_matrix[row_index][col_index]
                # day and hour are written 1-based
                in_out_row = [row_index + 1, col_index + 1]
                if in_value >= out_value:
                    in_out_row.append(in_value)
                    in_out_row.append(0)
                else:
                    in_out_row.append(out_value)
                    in_out_row.append(1)
                writer.writerow(in_out_row)
This gives us a CSV series that is loaded by D3.js. The supporting JavaScript creates two color sequences for blue (inColors) and red (outColors), generated from this excellent ColorBrewer scale. It uses these sequences to create a blue and red scale:
    // buckets is fixed at 9; so we have 9 colors for blue and red
    var blueScale = d3.scale.quantile()
        .domain([0, buckets - 1, d3.max(data, function (d) { return d.value; })])
        .range(inColors);
    var redScale = d3.scale.quantile()
        .domain([0, buckets - 1, d3.max(data, function (d) { return d.value; })])
        .range(outColors);
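If you are wondering how a quantile scale decides on a color, here is a rough Python sketch of the idea: the sorted domain is cut into as many quantile bands as there are colors, and a value gets the color of the band it falls into. This is my approximation of the behaviour, not D3's actual code; the color names `c0` to `c8` are placeholders rather than real ColorBrewer values, and 428 is simply the largest count in the matrix shown earlier.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_scale(domain, colors):
    """Return a function mapping a value to one of the given colors."""
    # len(colors) - 1 cut points split the domain into len(colors) bands
    thresholds = quantiles(sorted(domain), n=len(colors), method="inclusive")

    def scale(value):
        return colors[bisect_right(thresholds, value)]

    return scale


buckets = 9
colors = ["c%d" % i for i in range(buckets)]  # placeholder color names
max_value = 428  # largest cell value in the sample matrix above
scale = quantile_scale([0, buckets - 1, max_value], colors)

for value in (0, 6, 135, 428):
    print(value, scale(value))
```

Note how the middle domain value (`buckets - 1`) pulls several thresholds toward the low end, so small counts are spread over the lighter colors instead of all collapsing into one.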
Next, each grid cell is drawn as a ‘card.’ All cards start with the same color and transition to a color either in the blue or red scale, depending on the “series” attribute:
    var cards = svg.selectAll(".hour")
        .data(data, function (d) { return d.day + ':' + d.hour; });

    cards.enter().append("rect")
        .attr("x", function (d) { return (d.hour - 1) * gridSize; })
        .attr("y", function (d) { return (d.day - 1) * gridSize; })
        .attr("rx", 4)
        .attr("ry", 4)
        .attr("class", "hour bordered")
        .attr("width", gridSize)
        .attr("height", gridSize)
        .style("fill", inColors[0]) // == outColors[0], initial color is same
        .append("title");

    // "series" is the CSV column written earlier: 0 for in-time, 1 for out-time
    cards.transition().duration(1000)
        .style("fill", function (d) {
            return d.series == 0 ? blueScale(d.value) : redScale(d.value);
        });
The final visualization:
Isn’t it easy to spot that everyone comes in only after 8 AM and most people leave by 9 PM? So, now you know when to put fresh fruits on the table and when to put them away.
Write to us or leave us a comment if you think this can help you with a business case.