October 20, 2015

Building a Software Information Palace

On the television show The Mentalist, the protagonist, Patrick Jane, often describes a “Memory Palace” that he purportedly uses to store vast amounts of information and retrieve it at will. To create this palace, he advises choosing a large, real, physical location with which you are intimately familiar. Once you have such a palace, you slot each piece of information into an appropriate place within it. This, he says, is the best way not only to keep an extensive memory, but also to access needed memories with ease. How, as software engineers, can we build this palace in software?

Below is a video example of what our palace would look like.

Information can be imagined as different data streams flowing in from different directions. Suppose we were to build a dam for these data streams — we could examine the data pool at our leisure and get an idea of how all of this information could be appropriately slotted. These slots would be the nodes of a graph, and this would become the software engineer’s memory palace.

Once these slots (or nodes) are identified, any new information can be added immediately to an appropriate node. You can also retrieve any piece of information you want as soon as you figure out the starting node. Each node can be related to one or more other nodes; each relationship is represented as an edge whose weight corresponds to the distance between the two nodes. These relationships let you traverse connected nodes in order of distance and build a rough idea of which nodes are near a particular node and which are far from it. Over time, as the volume of information grows, you would want to regroup all of the information in the graph, which will probably create new nodes and remove old ones.
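To make the graph idea concrete, here is a minimal Python sketch of such a palace. The class name, method names, node labels, and distances are all made up for illustration; it only shows nodes holding information and weighted edges that let you list a node's neighbors from nearest to farthest.

```python
from collections import defaultdict


class MemoryPalace:
    """Toy weighted, undirected graph of information slots (hypothetical names)."""

    def __init__(self):
        self.items = defaultdict(list)  # node name -> stored pieces of information
        self.edges = defaultdict(dict)  # node name -> {neighbor name: distance}

    def add_item(self, node, item):
        """Slot a piece of information into a node, creating the node if needed."""
        self.items[node].append(item)

    def relate(self, a, b, distance):
        """Record an undirected edge whose weight is the distance between a and b."""
        self.edges[a][b] = distance
        self.edges[b][a] = distance

    def neighbors_by_distance(self, node):
        """Return (neighbor, distance) pairs for a node, nearest first."""
        return sorted(self.edges[node].items(), key=lambda pair: pair[1])


# Illustrative data only: the labels and distances are assumptions.
palace = MemoryPalace()
palace.add_item("spark", "Notes on streaming jobs")
palace.add_item("hadoop", "HDFS block size primer")
palace.add_item("cooking", "Bread recipe")
palace.relate("spark", "hadoop", 0.2)
palace.relate("spark", "cooking", 0.9)

nearest = palace.neighbors_by_distance("spark")
print(nearest)
```

Starting from the "spark" node, the nearest-first listing is what gives you the sense of which topics sit close by and which are far away. Regrouping the graph over time would amount to rebuilding the nodes and edges from scratch, which this sketch does not attempt.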

So what would the product look like? Here is one way to do it: RSS feeds representing information streams are collected throughout the day at a regular interval. These feeds are sent to a system built on Apache Spark. To start with, we need to create our data pool, so over the course of the day the system extracts the content of each RSS item within each feed and stores it in HDFS. As a nightly process, the system examines the data pool and clusters the content by measuring similarity between TF-IDF vectors. The distance between two cluster nodes is measured as the distance between their centroid vectors. Once these initial clusters are created, Spark’s streaming component can calculate the TF-IDF vector for each new piece of incoming content, measure its similarity to each existing cluster, and add it to the closest one. The application can report the largest clusters, so you would know the most-talked-about topics in your areas of interest. It can also report the new pieces of content that were added overnight.
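The assignment step can be sketched in plain Python, without Spark, to show the arithmetic involved. This is only an illustration under several assumptions: the nightly clustering is taken as already done (the two clusters below are hand-made), cosine similarity is used as the similarity measure (the post does not name one), the IDF formula is a common smoothed variant, and every function name and sample sentence is hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive whitespace tokenizer; a real pipeline would do more cleaning."""
    return text.lower().split()


def build_idf(documents):
    """Smoothed inverse document frequency for every term in the data pool."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    return {term: math.log((n + 1) / (count + 1)) + 1 for term, count in df.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the pool are ignored."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors (assumed similarity measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def closest_cluster(vector, centroids):
    """Name of the cluster whose centroid is most similar to the vector."""
    return max(centroids, key=lambda name: cosine(vector, centroids[name]))


# Hand-made clusters standing in for the output of the nightly job.
pool = {
    "spark": ["spark streaming processes data in micro batches",
              "spark jobs read data from hdfs"],
    "football": ["the team won the football match",
                 "the striker scored in the football final"],
}
idf = build_idf([doc for docs in pool.values() for doc in docs])
centroids = {name: centroid([tfidf(d, idf) for d in docs])
             for name, docs in pool.items()}

# Distance between two cluster nodes, taken here as 1 - cosine similarity.
distance = 1.0 - cosine(centroids["spark"], centroids["football"])

# A newly arrived item is vectorized and added to the closest cluster.
new_item = "new spark release improves streaming performance"
assigned = closest_cluster(tfidf(new_item, idf), centroids)
pool[assigned].append(new_item)

# Report clusters from largest to smallest.
top = sorted(pool, key=lambda name: len(pool[name]), reverse=True)
print(assigned, top)
```

In this toy run the new item shares the terms "spark" and "streaming" with the first cluster and nothing with the second, so it lands in "spark", which then becomes the largest cluster. In the system described above, the vectorizing and clustering would run on Spark over the content stored in HDFS rather than on in-memory lists.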

Apart from analyzing and structuring news, this model has potential wherever recommendations need to be made without relying on some sort of collaboration between participating actors. For example, if a jobs site built out this model by analyzing the jobs posted over a period of time, it could see which types of jobs were most in demand and suggest them to job seekers with matching profiles.