Enterprise IT Consultant Views on Technologies and Trends

Nov 3 2010   3:01AM GMT

Neo4j, the Database for high performance traversals



Posted by: Sasirekha R
Tags:
graph database
Lucene
Neo4j
NoSQL
Scalability
traversal

Neo4j, the Graph Database – for high performance traversals

Data is naturally ordered in networks in many areas today: the Semantic Web movement (of the W3C), content management, bioinformatics, artificial intelligence, social networks, business intelligence and so on. (Networks are very efficient data storage structures, as seen in the human brain and the World Wide Web.) With this in mind, a team set out to create a transactional persistence engine with high performance, scalability and robustness, but without the disadvantages of the relational model. The result is Neo4j, which provides:

  • An intuitive graph-oriented model for data representation. The programmer works with an object-oriented, flexible network of nodes, relationships and properties, called a node space.
  • A disk-based, native storage manager optimized for storing graph structures.
  • A powerful traversal framework for high-speed traversals in the node space.
  • A simple object-oriented API.

Neo4j is an open-source graph database that stores data as nodes and relationships that hold properties in a key/value fashion. Neo4j is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs (and not tables).

A graph database can be looked upon as a document database with one specific type of document, the relationship, that is optimized for graph operations such as traversals. Some form of graph database is used for features like “People who bought this also bought…” (Amazon-style) or “You might know…” (LinkedIn).

In the Neo4j model, everything is represented by nodes, relationships and properties.

  • A node (or vertex) has an id.
  • A relationship (or edge) connects two nodes, has a well-defined, mandatory type and can optionally be directed.
  • Properties (or attributes) are key/value pairs that can be attached to both nodes and relationships.

It is important to note that relationships in this model are objects in their own right (and not just associations between the nodes). When combined, these nodes, relationships and properties form a node space: a coherent network that can represent business domain data.
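The model described above can be pictured with a few lines of plain Java. This is only a minimal in-memory sketch: the class names (PropertyGraphSketch, Node, Relationship) and the methods createNode, relate and otherNode are made up for illustration and are not the Neo4j API. It shows that a relationship is an object with its own type and properties, and that two nodes need not carry the same property keys.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory model only; these classes are NOT the Neo4j API.
class PropertyGraphSketch {

    /** A node has an id, properties, and the relationships attached to it. */
    static class Node {
        final long id;
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> relationships = new ArrayList<>();
        Node(long id) { this.id = id; }
    }

    /** A relationship is an object in its own right: typed, directed, with properties. */
    static class Relationship {
        final Node start;
        final Node end;
        final String type; // mandatory type, for example "KNOWS"
        final Map<String, Object> properties = new HashMap<>();
        Relationship(Node start, Node end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
        /** Returns the node at the other end, so the edge can be followed in either direction. */
        Node otherNode(Node n) { return n == start ? end : start; }
    }

    private long nextId = 0;

    Node createNode() { return new Node(nextId++); }

    Relationship relate(Node from, Node to, String type) {
        Relationship r = new Relationship(from, to, type);
        from.relationships.add(r); // both endpoints keep a reference to the relationship
        to.relationships.add(r);
        return r;
    }

    public static void main(String[] args) {
        PropertyGraphSketch graph = new PropertyGraphSketch();
        Node alice = graph.createNode();
        alice.properties.put("name", "Alice");
        Node bob = graph.createNode();
        bob.properties.put("name", "Bob");
        bob.properties.put("age", 31); // nodes need not share the same property keys
        Relationship knows = graph.relate(alice, bob, "KNOWS");
        knows.properties.put("since", 2008); // properties on the relationship itself
        System.out.println(knows.otherNode(alice).properties.get("name"));
    }
}
```

Because each node holds its own relationship list, finding a node's neighbours does not require a lookup in any global structure.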

Neo4j enables modeling of semi-structured data. Nodes do not have to conform to a fixed set of properties, and in that sense Neo4j is schemaless. Because structure is optional, schema changes are easy and data can be migrated lazily.

A graph database provides index-free adjacency: each vertex serves as a “mini-index” of its adjacent elements, so the graph can grow in size while the cost of a local step remains the same. Neo4j traverses the graph lazily: nodes and relationships are traversed and returned only when the result iterator asks for them, which improves performance for big and deep traversals.

Neo4j is a navigational database: we navigate from a (given or arbitrary) start node via relationships to the nodes that match a set of criteria. Neo4j provides a powerful traversal framework that makes it very easy to express complex queries (involving traversals and filters).

The Neo4j Traverser implements the Java Iterable interface; it loads nodes and steps through the graph lazily, only when they are requested in a for {…} loop. The four components of a traverser are:

1. The starting node

2. The relationships (types and directions) we wish to traverse

3. A stop criterion to know when to stop traversing (StopEvaluator)

4. A selection criterion to know which nodes to return (ReturnableEvaluator)

The following example uses a traverser with a couple of the common built-in evaluators and defaults:

// Types come from org.neo4j.graphdb: Direction, Node, ReturnableEvaluator, StopEvaluator, Traverser.
// startNode is the Node to start from; KNOWS is a RelationshipType defined by the application.
Traverser friends = startNode.traverse(
    Traverser.Order.BREADTH_FIRST,
    StopEvaluator.DEPTH_ONE,
    ReturnableEvaluator.ALL_BUT_START_NODE,
    KNOWS, Direction.BOTH);
for (Node friend : friends) {
    System.out.println(friend.getProperty("name"));
}

The query asks to visit all nodes at the same depth from the start node before continuing to nodes at more distant levels (Order.BREADTH_FIRST), to stop after one level of traversal (StopEvaluator.DEPTH_ONE), to traverse only relationships of type KNOWS in both directions (Direction.BOTH), and to return all nodes except the start node (ReturnableEvaluator.ALL_BUT_START_NODE).
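To make the lazy, breadth-first behaviour more concrete, here is a minimal plain-Java sketch of the same idea. It is not the Neo4j Traverser: LazyBfsTraverser is a hypothetical class, the graph is assumed to be a simple map from a node name to its neighbours, the stop criterion is reduced to a maximum depth, and the selection criterion is hard-wired to "all but the start node".

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Illustrative sketch only; LazyBfsTraverser is hypothetical, not the Neo4j Traverser.
class LazyBfsTraverser implements Iterable<String> {
    private final Map<String, List<String>> adjacency; // neighbours of each node
    private final String start;
    private final int maxDepth; // stop criterion: nodes at this depth are not expanded

    LazyBfsTraverser(Map<String, List<String>> adjacency, String start, int maxDepth) {
        this.adjacency = adjacency;
        this.start = start;
        this.maxDepth = maxDepth;
    }

    @Override
    public Iterator<String> iterator() {
        final Deque<String> queue = new ArrayDeque<>();
        final Map<String, Integer> depth = new HashMap<>(); // doubles as the visited set
        queue.add(start);
        depth.put(start, 0);

        return new Iterator<String>() {
            // Only one result is computed ahead of what the caller has consumed.
            private String pending = advance();

            // Expands the graph one node at a time until a returnable node is found.
            private String advance() {
                while (!queue.isEmpty()) {
                    String node = queue.poll();
                    int d = depth.get(node);
                    if (d < maxDepth) {
                        for (String n : adjacency.getOrDefault(node, Collections.<String>emptyList())) {
                            if (!depth.containsKey(n)) {
                                depth.put(n, d + 1);
                                queue.add(n);
                            }
                        }
                    }
                    if (!node.equals(start)) { // selection criterion: all but the start node
                        return node;
                    }
                }
                return null; // traversal exhausted
            }

            @Override
            public boolean hasNext() { return pending != null; }

            @Override
            public String next() {
                if (pending == null) throw new NoSuchElementException();
                String result = pending;
                pending = advance();
                return result;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, List<String>> knows = new HashMap<>();
        knows.put("Ann", Arrays.asList("Bob", "Cid"));
        knows.put("Bob", Arrays.asList("Ann", "Dee"));
        knows.put("Cid", Arrays.asList("Ann"));
        knows.put("Dee", Arrays.asList("Bob"));
        for (String friend : new LazyBfsTraverser(knows, "Ann", 1)) {
            System.out.println(friend); // direct friends of Ann only
        }
    }
}
```

If the caller breaks out of the loop early, the rest of the graph is never visited, which is the point of evaluating the traversal lazily.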

Neo4j graph-algo is a component that contains graph algorithms: shortest paths, all paths, all simple paths, Dijkstra and so on. The algorithms considered production quality are available through the org.neo4j.graphalgo.GraphAlgoFactory class.
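As a reminder of what one of these algorithms does, the following is a small plain-Java sketch of Dijkstra's shortest-path algorithm. It does not use or mirror the graph-algo API; DijkstraSketch, addEdge and shortestPath are names invented for this illustration, and the graph is assumed to be undirected with non-negative edge weights.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative plain-Java Dijkstra; it does not use or mirror the graph-algo API.
class DijkstraSketch {
    private final Map<String, Map<String, Double>> edges = new HashMap<>();

    /** Adds an undirected weighted edge; weights are assumed to be non-negative. */
    void addEdge(String a, String b, double weight) {
        edges.computeIfAbsent(a, k -> new HashMap<>()).put(b, weight);
        edges.computeIfAbsent(b, k -> new HashMap<>()).put(a, weight);
    }

    /** Returns the cheapest path from source to target, or an empty list if unreachable. */
    List<String> shortestPath(String source, String target) {
        Map<String, Double> dist = new HashMap<>();
        Map<String, String> previous = new HashMap<>();
        // Queue entries are (node, distance) pairs; outdated entries are skipped when polled.
        PriorityQueue<Map.Entry<String, Double>> queue =
                new PriorityQueue<>((x, y) -> Double.compare(x.getValue(), y.getValue()));
        dist.put(source, 0.0);
        queue.add(new AbstractMap.SimpleEntry<>(source, 0.0));

        while (!queue.isEmpty()) {
            Map.Entry<String, Double> entry = queue.poll();
            String node = entry.getKey();
            if (entry.getValue() > dist.get(node)) continue; // a shorter route was already found
            if (node.equals(target)) break;                   // target settled, stop early
            Map<String, Double> neighbours =
                    edges.getOrDefault(node, Collections.<String, Double>emptyMap());
            for (Map.Entry<String, Double> e : neighbours.entrySet()) {
                double candidate = dist.get(node) + e.getValue();
                if (candidate < dist.getOrDefault(e.getKey(), Double.POSITIVE_INFINITY)) {
                    dist.put(e.getKey(), candidate);
                    previous.put(e.getKey(), node);
                    queue.add(new AbstractMap.SimpleEntry<>(e.getKey(), candidate));
                }
            }
        }

        List<String> path = new ArrayList<>();
        if (!dist.containsKey(target)) return path; // unreachable
        for (String n = target; n != null; n = previous.get(n)) path.add(n);
        Collections.reverse(path);
        return path;
    }
}
```

For example, with edges A-B (1), B-C (2) and A-C (5), shortestPath("A", "C") goes through B, because the two-step route costs 3 rather than 5.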

Neo4j (or, for that matter, any graph database) is not well suited to arbitrary queries such as “How many customers over age 25 with a last name starting with S have purchased an item in the last two months?”, because such queries do not follow relationships.

The Neo4j IndexService provides such indexing capabilities for a Neo4j graph and integrates with it as tightly as possible. With an IndexService you can associate any number of key-value pairs with any node and do fast lookups given such key-value pairs. Currently the main implementation is the LuceneIndexService, which uses Lucene (a text search engine) as the backend. LuceneFulltextIndexService makes it possible to query on individual words.
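The difference between an exact key-value lookup and a word-level (fulltext-style) lookup can be sketched in plain Java. The class below is a hypothetical in-memory stand-in for the idea only; it is not the IndexService API and does not involve Lucene, and the names KeyValueIndexSketch, index, indexWords and getNodes are assumptions made for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: an in-memory stand-in, not the IndexService API or Lucene.
class KeyValueIndexSketch {
    // key -> value -> ids of the nodes indexed under that pair
    private final Map<String, Map<Object, Set<Long>>> entries = new HashMap<>();

    /** Exact indexing: the node is found only by the complete value. */
    void index(long nodeId, String key, Object value) {
        entries.computeIfAbsent(key, k -> new HashMap<>())
               .computeIfAbsent(value, v -> new LinkedHashSet<>())
               .add(nodeId);
    }

    /** Fulltext-style indexing: each lower-cased word of the text is indexed separately. */
    void indexWords(long nodeId, String key, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) index(nodeId, key, word);
        }
    }

    /** Returns the ids of all nodes indexed under the given key-value pair. */
    Set<Long> getNodes(String key, Object value) {
        return entries.getOrDefault(key, Collections.<Object, Set<Long>>emptyMap())
                      .getOrDefault(value, Collections.<Long>emptySet());
    }
}
```

A typical pattern is to use such a lookup to find the start node (for instance by name) and then continue with a traversal from there.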

Neo4j also allows creating a timeline and adding nodes to it, each with a timestamp. Then we can use queries that return all nodes within a specific time period (with optional upper and lower bounds).
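The timeline idea can be illustrated with a sorted map in plain Java. This is a sketch under assumptions, not Neo4j's timeline utility: TimelineSketch, addNode and getNodesBetween are hypothetical names, timestamps are plain long values, both bounds are treated as inclusive, and a null bound stands for "no bound".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative only: TimelineSketch is hypothetical, not Neo4j's timeline utility.
class TimelineSketch {
    // Kept sorted by timestamp so that a range query does not scan every entry.
    private final TreeMap<Long, List<Long>> nodesByTime = new TreeMap<>();

    void addNode(long nodeId, long timestamp) {
        nodesByTime.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(nodeId);
    }

    /** Returns node ids with after <= timestamp <= before; a null bound means unbounded. */
    List<Long> getNodesBetween(Long after, Long before) {
        List<Long> result = new ArrayList<>();
        if (nodesByTime.isEmpty()) return result;
        long from = after != null ? after : nodesByTime.firstKey();
        long to = before != null ? before : nodesByTime.lastKey();
        if (from > to) return result; // empty range
        for (List<Long> ids : nodesByTime.subMap(from, true, to, true).values()) {
            result.addAll(ids);
        }
        return result;
    }
}
```

Results come back in timestamp order, so the same structure also answers "first" and "last" style questions cheaply.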

Neo4j uses the master-slave replication model. All writes must go through the master, and the slaves are read-only. Changes performed on the master are pushed out to the slaves when the logical log is rotated (either when it reaches a configured size or when a method is invoked on the master).
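The push-on-rotation behaviour can be pictured with a toy model in plain Java. The classes below are hypothetical and say nothing about how Neo4j implements replication; in particular, the log is just a list of key-value writes and its "size" is assumed to be a simple entry count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical classes, not Neo4j's replication implementation.
class ReplicationSketch {

    /** Read-only replica: it changes only when the master pushes a batch of log entries. */
    static class Slave {
        private final Map<String, String> data = new HashMap<>();

        void apply(List<String[]> entries) {
            for (String[] e : entries) data.put(e[0], e[1]);
        }

        String read(String key) { return data.get(key); }
    }

    /** Accepts all writes and pushes them to the slaves when its log is rotated. */
    static class Master {
        private final Map<String, String> data = new HashMap<>();
        private final List<String[]> log = new ArrayList<>();
        private final List<Slave> slaves = new ArrayList<>();
        private final int rotateAfter; // rotate once the log holds this many entries

        Master(int rotateAfter) { this.rotateAfter = rotateAfter; }

        void addSlave(Slave s) { slaves.add(s); }

        void write(String key, String value) {
            data.put(key, value);
            log.add(new String[] {key, value});
            if (log.size() >= rotateAfter) rotateLog(); // size-based rotation
        }

        /** Can also be called explicitly to force a push to the slaves. */
        void rotateLog() {
            for (Slave s : slaves) s.apply(log);
            log.clear();
        }

        String read(String key) { return data.get(key); }
    }
}
```

The consequence visible in the sketch is that a slave can lag behind the master: a value written after the last rotation is readable on the master but not yet on the slaves.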

The online backup utility, which synchronizes a destination Neo4j database from a source Neo4j database, can be used to emulate “high availability” (HA) by having the master replicate changes to read-only slaves.

Some points worth noting are:

  • Neo4j is released under a dual license: free open source software, or commercially licensed closed source software.
  • Neo4j has a small footprint: a single jar of less than 500k with one dependency (the Java Transaction API).
  • Neo4j is available as an Amazon Machine Image (AMI) in the Amazon cloud.
  • Gephi, the open source visualization platform (built on top of the NetBeans platform), supports Neo4j.
  • Neo4j may have a steeper learning curve than a relational tool, as users need to do domain modeling.
  • Neo4j has been in production for quite some time and has proven performance.

To use Neo4j (or any graph database) effectively, it is important to focus on the domain model. The suggestion is to forget UML and E/R diagrams and draw the conceptual model of the project’s domain on a whiteboard; representing that drawing in the graph database (which is whiteboard friendly) is then trivial. Once the graph representation is ready, you can build the object-oriented domain layer on top of the graph.

Neo4j provides massive scalability: it can handle graphs of several billion nodes, relationships and properties on a single machine, and it can be sharded by the client to scale out across multiple machines.
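What "sharded by the client" might look like is sketched below in plain Java. Everything here is an assumption made for illustration: the post does not say how the client should partition the graph, hash-based routing is only one possible choice, and each map in the sketch merely stands in for a separate database instance on another machine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical client-side router.
// Each map stands in for a separate database instance.
class ShardRouterSketch {
    private final List<Map<String, String>> shards = new ArrayList<>();

    ShardRouterSketch(int shardCount) {
        for (int i = 0; i < shardCount; i++) shards.add(new HashMap<>());
    }

    /** The same key always maps to the same shard; floorMod keeps the index non-negative. */
    int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shards.size());
    }

    void put(String key, String value) { shards.get(shardFor(key)).put(key, value); }

    String get(String key) { return shards.get(shardFor(key)).get(key); }
}
```

A simple hash like this ignores relationships, so connected nodes may land on different machines; a partitioning that follows the domain (for example, by customer or region) would likely keep more traversals local.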

Neo4j’s intuitive representation and the ease of writing complex, high-performance traversals make it suitable for applications that contain semi-structured data that is naturally ordered in networks.

Most graph databases (that come under the NoSQL umbrella) provide support for RDF and SPARQL. Embracing the W3C linked data technology stack brings data portability and interoperability, and so avoids vendor or product lock-in.
