Professor Nan Zhang of George Washington University talks with Digital Growth Insights about the possibilities – and limitations – of using third-party Big Data for web analysis.
Thank you for taking some questions from Digital Growth Insights. Please give us a little background about yourself and your research.
I am an associate professor of Computer Science at The George Washington University. My research related to Big Data mainly focuses on the exploration and analytics of very large web databases, specifically those in the deep web – i.e., those that are hidden behind (and only accessible through) proprietary (search and/or browsing) web interfaces and cannot be efficiently crawled by traditional search engines such as Google. My research also involves data security and privacy – for example, what types of sensitive and/or private information can one infer from web databases, and how to prevent undesired privacy disclosure from happening.
Big Data has become a very popular term recently – what does Big Data mean to you?
I think Big Data mainly means two things. One is the ability for a data owner to process an extremely large amount of data within a reasonable time frame. Many business enterprises, scientific research institutions, and government agencies generate huge amounts of data every day – e.g., Walmart handles more than 1 million transactions per hour, while sensors at the Large Hadron Collider produce millions of readings per second. These organizations have full control over their data, and often know what kinds of analytics they want over the data. But they need new technologies in order to efficiently store, manage and process the exascale data. A lot of Big Data related research focuses on addressing the needs of these data owners.
The other perspective on Big Data, to me, is what I’d like to call the third-party usage of Big Data. Most of us do not own petabytes of data (like Walmart does), but still want the ability to take advantage of Big Data publicly available on the web. For example, many people rely on Google Search to find information – which is essentially taking advantage of the massive amounts of data Google collects from around the web. In this case, Google happens to offer its search service to satisfy the needs of third-party users. Unfortunately, there are many other cases where the needs of third-party users are not satisfied.
For example, many of us may be interested in knowing what kinds of people (e.g., young adults or professionals, teenage girls or grandparents) like or hate a product, restaurant, hotel, etc., so we could make better decisions on gift giving, travel planning, etc. We know the information is out there on many web databases – e.g., Amazon, TripAdvisor, Yelp – but we don’t really know how to dig into these websites and get the information we want.
As another example, many investors and economists may be interested in tracking the transaction volumes of consumer goods, real estate properties, etc. We know the related information is readily available at many e-commerce websites, county property tax search pages, or even social media. But once again, it is unclear how to take advantage of this public data (which is nevertheless owned by others). I think, given the sheer volume of data publicly available on the web, enabling efficient third-party analytics over such data is an important area in Big Data research.
What are some of the differences when you’re data mining your own data, versus accessing another entity’s data?
I think there are key differences on both the technical and non-technical sides. Technically, the key challenge in mining others’ data is no longer how to handle huge volumes of data – after all, it is neither economical nor practically feasible for us to download terabytes of data from someone else’s e-commerce website only to track basic statistics such as the average price of all smartphones.
Instead, the key challenge now is the myriad restrictions that data owners place on the (public) web interfaces of their databases – these interfaces were designed for the search and browsing needs of normal users, not the analytical purposes we have in mind. One kind of restriction is on what questions you can ask – e.g., Amazon allows you to ask for the price of a particular phone, but not the average price of all smartphones. Another is on how many questions you can ask – Google, for example, only allows 100 free search API calls a day (per user), with each call returning at most 1,000 documents.
Given these restrictions, the fundamental problem of enabling third-party analytics is how to translate an analytical need into a small number of questions supported by the interfaces of web databases, and then use techniques such as statistical sampling to produce an accurate estimate based on the answers we receive.
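To make the sampling idea concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not a description of any real website's API or of the specific techniques used in this research: the class SimulatedWebDatabase, its methods, and the price data are all hypothetical, and the sketch assumes item IDs can be drawn uniformly at random, which real search interfaces usually do not allow directly. It estimates the average price of a 50,000-item catalog using only 100 single-item lookups.

```python
import math
import random
import statistics


class QueryBudgetExceeded(Exception):
    """Raised when the simulated interface refuses further queries."""


class SimulatedWebDatabase:
    """Hypothetical stand-in for a web database that only answers
    per-item price lookups and enforces a query limit."""

    def __init__(self, prices, daily_limit):
        self._prices = list(prices)
        self._remaining = daily_limit

    def item_count(self):
        # Assumption: the catalog size is known and asking for it is free.
        return len(self._prices)

    def lookup_price(self, item_id):
        # Each lookup consumes one unit of the query budget.
        if self._remaining <= 0:
            raise QueryBudgetExceeded("query limit reached")
        self._remaining -= 1
        return self._prices[item_id]


def estimate_average_price(db, budget, rng):
    """Estimate the mean price from `budget` uniformly sampled lookups.

    Returns (estimate, margin), where margin is an approximate 95%
    confidence half-width based on the normal approximation. The finite
    population correction is ignored, which is reasonable when the
    sample is a small fraction of the catalog.
    """
    ids = rng.sample(range(db.item_count()), budget)
    sample = [db.lookup_price(i) for i in ids]
    mean = statistics.mean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, 1.96 * stderr


# Synthetic catalog: 50,000 made-up prices between 100 and 1,000.
rng = random.Random(0)
catalog = [round(rng.uniform(100, 1000), 2) for _ in range(50_000)]
db = SimulatedWebDatabase(catalog, daily_limit=100)

estimate, margin = estimate_average_price(db, budget=100, rng=rng)
true_mean = statistics.mean(catalog)  # unavailable to a real third party
print(f"estimate: {estimate:.2f} +/- {margin:.2f} (true mean {true_mean:.2f})")
```

The point of the sketch is that the number of questions asked (100) is fixed by the interface's limit rather than by the size of the database, and the answer comes with an error margin instead of being exact. Real interfaces rarely offer uniform random access, so obtaining an unbiased sample through a search form is itself a substantial part of the problem.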
On the non-technical side, the third-party mining of web data may lead to various ethical and privacy issues. For example, many counties have property tax data publicly available (and searchable) on the web. Does this make it ethical to integrate such information and publish the list of all houses and vacation homes owned by an individual?
If the gun-owner-map incident this past January (The Journal News visualized the publicly available gun ownership data for two New York counties on an interactive map, which caused a nationwide uproar) is any indication, this is a very complex issue. That’s because the public might be comfortable with the data being publicly available as it is, but not when a third party decides to take advantage of the data and make it easier to access and analyze.
Another interesting (non-technical) issue is the attitude of web data owners towards such third-party mining efforts. On one hand, most websites happily allow third-party search engines like Google to access their data (i.e., web pages) and make them searchable to the general public. Many owners of large web databases, such as Amazon, even provide third parties with APIs that significantly simplify data access. On the other hand, it is unclear how these data owners will react when third-party mining efforts unveil knowledge that the owner is uncomfortable disclosing. For example, an e-commerce website might not want its competitors to gain a competitive advantage by mining its database and learning the pricing strategy it uses. As complex as this issue might be, I believe that eventually it is in everyone’s interest to enable open and easy access to analytical information over public web databases.
How will the accelerating digitization of information and its availability on the Internet change society in the next 3 – 5 years?
I think it will allow us to make faster and more informed decisions in our everyday lives. Specifically, we will collect a lot more information about ourselves, share such information with many other people, and have better decision-support tools to truly take advantage of the massive amount of data. Just look at what has happened in the last few years. Crowdsourcing platforms like Waze passively (or actively) collect speed information when we are driving and thereby give us “true” real-time access to traffic and accident information.
Many people have started wearing activity monitors (such as the Jawbone Up wristband) that collect information about our everyday activities like running, stair climbing, and even sleeping. We can share the collected information with friends and family to cheer each other on, or even publish the information on social media. There are also a lot more services which make it easier for us to take advantage of data available on the web. For example, new shopping-assistant services use price information publicly available on the web to predict whether airfare (or product price) will rise or fall, and by how much.
With more and more data being collected and shared about ourselves, I think it is more important than ever to find easy ways for individuals like you and me – not just big corporations which own petabytes of data – to retrieve the analytical information that interests us and use it to make our lives better. Doing so requires tools that help us understand what analytical information we need and over which web data sources, and that, as argued above, enable third-party analytics of very large web databases. I think this is still a wide-open space for technological innovation.