A Quick Look Into the Popular Graph Databases

Purpose

The 3Pillar Recommendation Engine currently uses MongoDB to store the processed recommendations for all users in the system. We looked into graph databases because their storage model of nodes and relations maps directly onto the recommendation engine's data model. This results in a low storage footprint and also makes it possible to gain greater insight into the data. To ensure transparency and flexibility, we looked specifically at open source projects that are well supported by the community.
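To make the node-and-relation mapping concrete, here is a minimal in-memory sketch of a recommendation graph in Python. Users and items are nodes, and a LIKES relation connects them. The names used here (`Graph`, `add_like`, `similar_users`, the LIKES relation, and the sample users) are illustrative assumptions; they are not part of the 3Pillar Recommendation Engine or of any graph database API.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory stand-in for a graph store (hypothetical, illustration only)."""

    def __init__(self):
        # user node -> set of item nodes reached through a LIKES relation
        self.likes = defaultdict(set)
        # item node -> set of user nodes (the same relation, reverse direction)
        self.liked_by = defaultdict(set)

    def add_like(self, user, item):
        """Store one (user)-[LIKES]->(item) relation in both directions."""
        self.likes[user].add(item)
        self.liked_by[item].add(user)

    def similar_users(self, user):
        """Rank other users by how many liked items they share with `user`."""
        shared = defaultdict(int)
        for item in self.likes.get(user, ()):
            for other in self.liked_by[item]:
                if other != user:
                    shared[other] += 1
        # Most shared items first; ties broken by name for a stable order.
        return sorted(shared.items(), key=lambda pair: (-pair[1], pair[0]))


graph = Graph()
for user, item in [("alice", "book"), ("alice", "film"),
                   ("bob", "book"), ("bob", "film"),
                   ("carol", "film")]:
    graph.add_like(user, item)

print(graph.similar_users("alice"))
```

Finding similar users is a two-hop traversal (user to items, items back to users), which is the kind of query a graph store can answer by following relations directly rather than joining collections.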

Neo4j

Neo4j is the market leader among graph databases. It is performant, scalable, and flexible. It ships with a rich UI, the Neo4j Browser, for queries, visualization, and data interaction. It is compatible with most programming platforms, including Java, Node.js, PHP, Python, and .NET. It supports the Cypher query language, whose syntax naturally depicts data and relationships. It integrates well with Apache Spark, Elasticsearch, MongoDB, Cassandra, and Docker. The community edition is free under the GPL v3 license.

OrientDB

OrientDB is a fast, scalable, and reliable graph database. Its traversal speed is high and constant, unaffected by database size. User-, role-, and record-level security is built in. It is compatible with most programming platforms, including Java, Node.js, PHP, Python, and .NET. It has strong community support and is licensed under Apache 2.

Titan DB

Titan DB is highly scalable and can store and query graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. It is transactional and supports various backend stores, including Cassandra, HBase, and BerkeleyDB. It integrates with Elasticsearch, Lucene, and Solr, and supports the Gremlin query language. It is licensed under Apache 2.

Apache Giraph

Apache Giraph is an iterative graph processing framework built on Apache Hadoop. Giraph jobs are highly scalable because they run on a Hadoop cluster. Giraph can greatly increase the efficiency of graph computation on huge datasets. It is licensed under Apache 2.

We wanted to take a deep dive into these graph databases to understand how they stack up against each other. At the outset, we eliminated Titan DB because a stable 1.0 version had not been released at the time. We also skipped Apache Giraph because it runs only on Hadoop. This left Neo4j and OrientDB for further exploration.

Test Environment

  • CPU – Intel quad-core i7, 2.2 GHz
  • RAM – 16 GB
  • Hard disk type – Flash
  • Database on same machine as test program (no network latency)

Test Methodology

We used the sample Spring projects and Java drivers for each of the databases. The intent was to programmatically:

  • Add 100k users,
  • Randomly search users across 100k edges in total, and
  • Come up with our average read/write speeds.

We then searched for users with preferences similar to one another's. We ran the suite as Java unit tests and captured the times using timestamps. Neo4j supported indexing on an attribute, so we indexed the userID to obtain both indexed and unindexed speeds. The read queries were not identical across the two databases, so we ran the closest comparable query on each.
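The timestamp-based measurement described above can be sketched as follows. This is not the original Java test suite: a plain dictionary stands in for the database, the user count is reduced so the sketch runs quickly, and the names `time_per_op_ms`, `write_user`, and `read_user` are hypothetical.

```python
import random
import time


def time_per_op_ms(operation, inputs):
    """Run `operation` once per input and return the average time per call in ms."""
    start = time.perf_counter()
    for value in inputs:
        operation(value)
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / len(inputs)


# A dict stands in for the database under test (assumption for illustration).
store = {}


def write_user(user_id):
    store[user_id] = {"userId": user_id}


def read_user(user_id):
    return store[user_id]


USER_COUNT = 10_000  # the article used 100k; reduced here to keep the sketch fast
user_ids = list(range(USER_COUNT))

avg_write_ms = time_per_op_ms(write_user, user_ids)

# Random lookups with a fixed seed so repeated runs hit the same users.
rng = random.Random(42)
lookups = [rng.randrange(USER_COUNT) for _ in range(USER_COUNT)]
avg_read_ms = time_per_op_ms(read_user, lookups)

print(f"average write: {avg_write_ms:.6f} ms/op")
print(f"average read:  {avg_read_ms:.6f} ms/op")
```

Averaging over many operations smooths out timer resolution and one-off pauses. With a real database, `write_user` and `read_user` would wrap the driver calls being compared.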

Results

Ratings are on a scale of 1 to 5, where a higher value is better.

| Dimension | Neo4j | OrientDB | Comments |
|---|---|---|---|
| Stable Java driver? (Yes/No) | Yes | Yes | |
| Maturity of client framework (1-5) | 5 | 2 | |
| Ease of query language (1-5) | 5 | 3 | Gremlin relates to SQL. Cypher relates more to graph queries. |
| Ease of setup (1-5) | 3 | 5 | |
| Average read time (ms per read) | 0.161 (no index), 0.139 (indexed) | 0.955 (search) | Single node retrieval; the gap decreased when searching for multiple nodes. |
| Average write time (ms per edge) | 0.023 | 0.31 | |
| Storage footprint (MB) | 103 | 86 | |
| Average memory footprint (MB) | 172 | 256 | |
| License cost | Community edition is free under GPL v3 license. | Community edition is free under Apache 2 license. | Enterprise version cost is unknown and can be figured out with the vendor directly. |
| Community support (1-5) | 5 | 2 | |

*White paper: http://www.stingergraph.com/data/uploads/papers/ppaa2014.pdf
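The read and write rows in the results table report time per operation rather than operations per second. Assuming operations run one after another on a single thread (an assumption; the concurrency used in the tests is not stated), a per-operation latency converts to throughput as 1000 / latency in ms. The helper name `ops_per_second` is illustrative, and the indexed Neo4j read figure, which is listed without a unit, is assumed to be in milliseconds.

```python
def ops_per_second(latency_ms):
    """Convert an average per-operation latency in ms to operations per second."""
    return 1000.0 / latency_ms


# Latencies taken from the results table (ms per operation).
measurements_ms = {
    "Neo4j read (no index)": 0.161,
    "Neo4j read (indexed)": 0.139,  # unit assumed to be ms
    "OrientDB read (search)": 0.955,
    "Neo4j write (per edge)": 0.023,
    "OrientDB write (per edge)": 0.31,
}

for label, latency_ms in measurements_ms.items():
    print(f"{label}: {ops_per_second(latency_ms):,.0f} ops/s")
```

Under that assumption, the unindexed Neo4j read works out to roughly 6,200 reads per second and the OrientDB search to roughly 1,050.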

Conclusion

Overall, we agree with the market sentiment that Neo4j is the current leader in this space, specifically for recommendation engines. OrientDB is better suited to use cases like data mining and data extraction. However, Neo4j's license terms make it somewhat difficult to work with. Although the community edition is licensed under GPL v3, its locking mechanism is slow under concurrency, making it unsuitable for high-concurrency scenarios. The enterprise edition improves on this, but the last time we checked, the pricing was quite steep. Apache Giraph and Titan DB are really promising projects, and they may overshadow Neo4j some day. Our take is that any graph database that can work over Apache Mesos would be the ultimate winner. Will it be Apache Spark?

Manoj Bisht

Senior Architect

Manoj Bisht is the Senior Architect at 3Pillar Global, working out of our office in Noida, India. He has expertise in building and working with high-performance teams delivering cutting-edge enterprise products. He is also a keen researcher who dives deep into trending technologies. His current areas of interest are data science, cloud services, and microservice/serverless design and architecture. He loves to spend his spare time playing games and also likes traveling to new places with family and friends.
