In my previous blog about building an Information Palace that clusters information automatically into different nodes, I wrote about using Apache Spark for creating the clusters from the collected information.
This post outlines the core clustering approach, which uses the new DataFrame API in Apache Spark, and uses a zoomable circle packing (built with the excellent D3.js) to represent the clusters. In addition, the common terms and the unique terms at each level of the hierarchy are shown when a cluster or a node is zoomed in. The hierarchy has three levels: the overall data set, each cluster in the data set, and each data node in a cluster.
In addition to the traditional clustering by machine learning, I have added a “Node Size” attribute, which may present a new approach to the organization and presentation of unstructured data such as news or any free text. The “Node Size” attribute is calculated from the unique terms a data node contributes to the corpus, over and above the common terms that allow it to be clustered with similar data nodes.
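To make the “Node Size” idea concrete, here is a minimal plain-Python sketch of one plausible interpretation: a node's size is the number of terms it contributes that no other node in the corpus contains. The function name and data shapes are illustrative, not from the original implementation.

```python
# Hypothetical sketch: "Node Size" as the count of terms unique to a node.
def node_sizes(corpus):
    """corpus: dict mapping node label -> set of terms in that node."""
    sizes = {}
    for label, terms in corpus.items():
        # Union of every other node's terms.
        others = set().union(*(t for l, t in corpus.items() if l != label))
        # Terms only this node contributes to the corpus.
        sizes[label] = len(terms - others)
    return sizes

corpus = {
    "a.txt": {"napoleon", "france", "waterloo"},
    "b.txt": {"napoleon", "france", "corsica", "elba"},
}
print(node_sizes(corpus))  # {'a.txt': 1, 'b.txt': 2}
```

Under this interpretation, b.txt would be rendered larger: it shares the common terms that cluster it with a.txt, but contributes more unique vocabulary.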
The thinking behind this approach is that if several articles share a common set of words, the article with more unique words is likely to cover the subject in greater detail than the others. In addition, there is a non-deterministic limit for these unique words, beyond which the clustering algorithm will eject the article from the cluster. This thinking is further validated by how typical search engines use an inverted document index to find the most relevant documents for a given search term.
For the media industry, suggesting a particular source of information over other similar sources by using the “Node Size” attribute can be a simpler alternative to, or even a correlating factor with, user trends. I think this is particularly suitable for businesses such as law firms, healthcare, and banking, where relevance must be determined from the articles or data sources themselves.
The rest of this post outlines the implementation of the clustering approach.
In order to test the effectiveness of the clustering approach, I curated data from different sources on three historical figures: Napoleon Bonaparte, Sir Winston Churchill, and Mahatma Gandhi. Each data point is represented by a file with the following format:
source: the source URL for the content. For this data set it is the relative path to the file, for example data/churchill-file-1.txt.
delimiter: the literal |||| separates the source from the content.
content: the extracted content from the source, as plain text.
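A parser for this format can be sketched in a few lines of Python (a hypothetical helper, not code from the original implementation), assuming each file is the source path, the |||| delimiter, then the content:

```python
# Split a data-point file into (source, content) on the |||| delimiter.
def parse_data_point(text):
    source, _, content = text.partition("||||")
    return source.strip(), content.strip()

raw = "data/churchill-file-1.txt||||We shall fight on the beaches..."
source, content = parse_data_point(raw)
print(source)  # data/churchill-file-1.txt
```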
In this representation, I have 3 data sources for Napoleon Bonaparte, 2 for Sir Winston Churchill, and 4 for Mahatma Gandhi. The perfect clustering algorithm would create 3 clusters and put the data sources for each historical figure in its own cluster.
The data set is loaded as a paired RDD with the source as the key and the content as the value. In Spark, an RDD (Resilient Distributed Dataset) is the primary abstraction of a collection of items. Two successive map operations are executed on the values to tokenize the content into sentences and tokenize each sentence into a flat list of words.
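The two map operations can be sketched in plain Python (the post itself runs them as Spark transformations on the paired RDD's values; the sentence and word patterns here are illustrative assumptions):

```python
import re

# First map: split content into sentences on terminal punctuation.
def sentences(content):
    return [s for s in re.split(r"(?<=[.!?])\s+", content) if s]

# Second map: flatten the sentences into a single list of lowercase words.
def words(sentence_list):
    return [w.lower() for s in sentence_list for w in re.findall(r"[a-zA-Z']+", s)]

# Stand-in for the paired RDD: source -> content.
pairs = {"data/churchill-file-1.txt": "We shall fight. We shall never surrender!"}
tokens = {src: words(sentences(text)) for src, text in pairs.items()}
print(tokens["data/churchill-file-1.txt"])
# ['we', 'shall', 'fight', 'we', 'shall', 'never', 'surrender']
```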
The first data frame is built from this RDD and it has a text label and a sequence of tokens (words). A data frame is analogous to a SQL table: it holds multiple columns with labels. Spark can create data frames from multiple sources such as Parquet, Hive, or a Spark RDD. Just like some operations on an RDD, data frame transformations are “lazy” – they are executed on the Spark cluster only when Spark is asked to compute a result. Separately, an estimator is trained on a data frame in a step known as a “fit”; a fit produces a model, which is used to make predictions. Multiple transformers and estimators can be chained together (in a directed acyclic graph) to make up a pipeline for machine learning.
The initial data frame has “label” for the source (the paired RDD key) and “tokens” for the words (the paired RDD value).
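The transformer/estimator/fit pattern can be illustrated with a toy sketch in plain Python (Spark ML's real API differs; all class names here are illustrative, and the “data frame” is just a list of dicts with the “label” and “tokens” columns described above):

```python
# A transformer maps a data frame to a new data frame with extra/changed columns.
class LowercaseTokens:
    def transform(self, df):
        return [{**row, "tokens": [t.lower() for t in row["tokens"]]} for row in df]

# An estimator: fit() learns from the data and returns a model (a transformer).
class VocabularyEstimator:
    def fit(self, df):
        vocab = sorted({t for row in df for t in row["tokens"]})
        index = {t: i for i, t in enumerate(vocab)}

        class VocabularyModel:
            def transform(self, inner_df):
                return [
                    {**row, "indices": [index[t] for t in row["tokens"]]}
                    for row in inner_df
                ]

        return VocabularyModel()

# Chain the stages, as a pipeline would.
df = [{"label": "data/churchill-file-1.txt", "tokens": ["We", "shall", "fight"]}]
df = LowercaseTokens().transform(df)
model = VocabularyEstimator().fit(df)   # the "fit" produces a model
df = model.transform(df)
print(df[0]["indices"])  # [2, 1, 0] -- positions in the sorted vocabulary
```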
A series of transformations are added to the pipeline to add additional columns to the data frame. The transformations can be expressed as a DAG where each node denotes the column added to the frame and a callout is the applied transformation.
All the transformers in the pipeline marked with a green background are known as “User Defined Transformers” (UDTs) – Spark makes it simple to use the provided transformers as well as to roll your own transformers. I will cover some of the important transformers:
One of these transformers emits a (hashed index, term) tuple for each term. The term sorter joins the index from this tuple to the TF-IDF vector to produce a sequence of (TF-IDF score, term) tuples.
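The term-sorter join can be sketched in plain Python (a hedged stand-in for the Spark transformer; the names and the dict representation of the TF-IDF vector are assumptions):

```python
# Join (hashed index, term) tuples with a TF-IDF vector (index -> score)
# to recover (score, term) pairs, highest-weighted terms first.
def sort_terms(index_term_pairs, tfidf_vector):
    scored = [(tfidf_vector.get(idx, 0.0), term) for idx, term in index_term_pairs]
    return sorted(scored, reverse=True)

pairs = [(7, "napoleon"), (3, "the"), (11, "waterloo")]
vector = {7: 2.1, 3: 0.1, 11: 1.4}
print(sort_terms(pairs, vector))
# [(2.1, 'napoleon'), (1.4, 'waterloo'), (0.1, 'the')]
```

Sorting by score is what lets the visualization surface the most characteristic terms at each level of the hierarchy.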
A Data Frame can be queried for its schema. Once the transformations have been added (not necessarily applied), you can see the schema:
The Spark machine learning library has a bunch of clustering algorithms. While KMeans is probably the simplest to understand and implement, it is also quite effective and is put to use in lots of applications. KMeans is also implemented as a transformer that takes the “Features” column from the data frame and predicts the clusters in an output column which we call – wait for it – “Cluster”! Once the transformation (actually the pipeline) is applied, the new column reflects the data we were working towards:
The final visualization is available at http://atg.3pillarglobal.com/infop-static/.
The clustering algorithm produces 3 clusters and almost gets it right – one data node for Mahatma Gandhi is clustered with all the Napoleonic data nodes (the irony of it!); the rest of the data nodes are clustered correctly.
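For readers unfamiliar with KMeans, its assign/update loop can be sketched in a few lines of plain Python (the post uses Spark ML's KMeans transformer on TF-IDF feature vectors; this toy 1-D version only illustrates the algorithm):

```python
import random

def kmeans(points, k, iterations=20, seed=42):
    """Toy 1-D k-means: alternate assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centroids, clusters = kmeans(points, k=2)
print(sorted(round(c, 6) for c in centroids))  # [1.0, 10.0]
```

The real feature vectors are high-dimensional TF-IDF weights, so the distance is Euclidean over many components rather than a 1-D absolute difference, but the loop is the same.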
The KMeans clustering algorithm needs to be told the number of clusters to create. If you ask it to create too few, dissimilar items will appear in the same cluster; too many, and similar items will end up in different clusters. The current implementation uses the “Rule of Thumb” to calculate the number of clusters (k):

k ≈ √(n / 2)

where n is the number of documents to be clustered. In addition, the current implementation caps ‘k’ at 1000.
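The rule of thumb with the cap is a one-liner; rounding up is my assumption here, chosen because it matches the 3 clusters the post reports for its 9 curated documents:

```python
import math

def number_of_clusters(n, cap=1000):
    """Rule-of-thumb k = sqrt(n / 2), rounded up and capped."""
    return min(math.ceil(math.sqrt(n / 2)), cap)

print(number_of_clusters(9))  # 3 -- ceil(sqrt(4.5))
```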
There are quite a few ways in which the clustering approach can be improved or augmented:
Apart from the core algorithm, it would be necessary to evolve this into a system which ingests data and persists the clustered models. If you think of something else, please leave a comment!