Graph Database Java OGM Comparison

I have been using graph databases for a while (mostly Neo4j) and I thought it might be a good idea to write down some of the things I noted while using Neo4j in combination with Spring Data Neo4j.

I also compiled a comparison of various graph databases a while ago. My list only includes Java-based graph databases that allow embedding. I won’t go into detail about the feature sets and how those differ between the databases. Instead I just point out the most important aspects I noticed while using the OGMs and graph databases.

Graph Databases

Neo4j

Neo4j was the first graph database I ever used. It is open source and very fast. It also ships with a neat little admin UI which can be used to visualize your graph data.
The database is very easy to embed and comes with a powerful query language (Cypher). I don’t know whether there are any other dedicated OGM/ORM layers for Neo4j besides SDN.
The licensing on the other hand is not very friendly once you decide to embed Neo4j in your application.
Neo4j Community Edition is licensed under the GPL (as is MySQL, for comparison).
This means that as long as you do not embed the database and only use the provided Neo4j REST API, you do not need to license your application under the GPL.
Once you embed the database in your application you must license your application under the GPL. This gets even worse when you decide to utilize the clustering features. In that case you would need to license your application under the AGPL (even if you were using Neo4j through the REST API).

The High Availability mode (Master/Slave Replication) can also be used when embedding the database. I wrote a dummy project a while ago that contains a working example.
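Embedding really does take only a few lines. A minimal sketch, assuming the Neo4j 2.x embedded API (the store path and the Cypher statement are made up for this example):

```java
import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class EmbeddedNeo4jExample {
    public static void main(String[] args) {
        // Store path is illustrative; any writable directory works
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase("/tmp/example-graph.db");
        ExecutionEngine engine = new ExecutionEngine(db);

        // Cypher runs inside a transaction when the database is embedded
        try (Transaction tx = db.beginTx()) {
            engine.execute("CREATE (u:User { username: 'joe' })");
            tx.success();
        }
        db.shutdown();
    }
}
```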

OrientDB

OrientDB is also open source. There is no Cypher but you can use Orient SQL instead. Embedding is also very easy and the licensing under the Apache 2 license is very developer friendly. TinkerPop support is very good.

Sparsity Sparksee

I have never used Sparsity Sparksee but feature-wise it is comparable to the other big graph databases.

Titan DB

Titan DB is an interesting database. The storage layer for this graph database is interchangeable. You can use Berkeley DB, which is quite fast, but it basically limits the size of the graph you can store and you can’t use clustering.
Alternatively you can use Cassandra. Cassandra is slower compared to BerkeleyDB but it supports replication.
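The backend is selected purely through configuration. A rough sketch, assuming Titan's TitanFactory API (version-dependent; the property values and paths are illustrative):

```java
import org.apache.commons.configuration.BaseConfiguration;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class TitanBackends {
    public static void main(String[] args) {
        // BerkeleyDB backend: fast, but single node (directory is made up)
        BaseConfiguration local = new BaseConfiguration();
        local.setProperty("storage.backend", "berkeleyje");
        local.setProperty("storage.directory", "/tmp/titan-berkeley");
        TitanGraph berkeley = TitanFactory.open(local);
        berkeley.shutdown();

        // Cassandra backend: slower, but replicated (hostname is made up)
        BaseConfiguration remote = new BaseConfiguration();
        remote.setProperty("storage.backend", "cassandra");
        remote.setProperty("storage.hostname", "127.0.0.1");
        TitanGraph cassandra = TitanFactory.open(remote);
        cassandra.shutdown();
    }
}
```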

HyperGraphDB

I have never used this database and can’t say much about it, but my impression is that it is very small and the feature set is limited.

Performance comparison

The performance comparison is very superficial and you should keep in mind that the use case should always dictate the choice of database.
I just compared low-level read and write speed because I was interested in those. The benchmark does not cover any kind of graph traversal. I was merely interested in the time it takes each database to return a single node.

All tests were executed within the same JVM that also executed the graph database. I created 10k nodes and read those 10k nodes sequentially and in random order. No warmup phase was added.
As a baseline I chose Neo4j because it had the overall best performance. (lower % is better)
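The relative numbers were computed against the Neo4j baseline. A minimal sketch of such a timing harness in plain Java (the Runnable is a stand-in for the actual node read or write, which is not shown here):

```java
public class MicroBench {

    // Runs op n times and returns the elapsed time in milliseconds.
    static long timeMillis(int n, Runnable op) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            op.run();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Expresses a measurement relative to a baseline (lower % is better).
    static long relativePercent(long measured, long baseline) {
        return Math.round(100.0 * measured / baseline);
    }

    public static void main(String[] args) {
        long[] counter = new long[1];
        // Stand-in workload; a real run would create/read graph nodes here
        long elapsed = timeMillis(10_000, () -> counter[0]++);
        System.out.println("10k ops took " + elapsed + " ms");
    }
}
```

Note that without a warmup phase the JIT compiler skews the first iterations, which is one reason these numbers should be taken with a grain of salt.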

DB                      write 10k   read 10k seq   read 10k random
Neo4j                   100%        100%           100%
OrientDB                171%        101%           104%
Titan DB (Cassandra)    314%        502%           510%
Titan DB (BerkeleyDB)   200%        205%

OGM – Object Graph Mapping

Spring Data Neo4j

I have used SDN a lot and I’m quite impressed by it. Getting started is quite easy and there are a lot of examples out there.
When talking about SDN it is important to note the differences between the versions. SDN 3.3.x currently only supports Neo4j 2.1.x, while SDN 4.x also supports the newer Neo4j 2.x releases.

3.3.x

SDN uses annotations to map entities and relationships. Inheritance of objects is directly mapped to the labels of a node. It is possible to create Spring Data repositories that retrieve objects by property values or by specifying Cypher statements.

What I like is the Cypher paging support. What I do not like is the amount of classes and interfaces you need to create to interface with your objects, but I guess this is always application specific.

Example use case:

  • User.java – Defines the entity
  • UserRepository.java – Defines the SDN user repository
  • UserRepositoryImpl.java – Defines a SDN repository implementation that may contain custom repository method implementations
  • UserActions.java – Interface that contains the methods (is extended by UserRepository and implemented by UserRepositoryImpl)
  • UserService – Defines methods that the implementation may use to manipulate user objects.
  • UserServiceImpl – Implements the defined methods.
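A repository from the list above could look roughly like this (a sketch based on the SDN 3.x GraphRepository API; the finder name and the Cypher statement are made up):

```java
import org.springframework.data.neo4j.annotation.Query;
import org.springframework.data.neo4j.repository.GraphRepository;

// Illustrative SDN 3.x repository for the User entity
public interface UserRepository extends GraphRepository<User> {

    // Derived finder: the query is generated from the property name
    User findByUsername(String username);

    // Custom Cypher statement; {0} is bound to the first method argument
    @Query("MATCH (u:User)-[:MEMBER_OF]->(g:Group) WHERE g.name = {0} RETURN u")
    Iterable<User> findByGroupName(String groupName);
}
```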

Another point that caused a lot of trouble for me was the @Fetch annotation. The getGroups() method would load the full entities (groups) when the @Fetch annotation was added to the method. This could lead to infinite recursion or huge loading times. In the end I removed nearly all @Fetch annotations from my projects.

Instead I used the Neo4jTemplate class to populate the returned entity. When no @Fetch is specified, only a shallow object with no properties is returned. Using neo4jTemplate.fetch() this shallow object can be loaded.
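The workaround could look roughly like this (a sketch; it assumes a getGroups() accessor on the User entity shown below and an injected Neo4jTemplate):

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.neo4j.support.Neo4jTemplate;

public class UserGroupLoader {

    @Autowired
    private Neo4jTemplate neo4jTemplate;

    // Without @Fetch the related groups come back as shallow objects
    // (no properties set); fetch() loads the full entities on demand.
    public Iterable<Group> loadGroups(User user) {
        return neo4jTemplate.fetch(user.getGroups());
    }
}
```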

Additionally SDN 3.x was/is slow as hell when using a remote Neo4j instead of the embedded one. This is another reason why SDN 4.x was developed.

@NodeEntity
public class User extends AbstractPersistable {

	private String lastname;

	private String firstname;

	@Indexed(unique = true)
	private String username;

	@RelatedTo(type = "MEMBER_OF", direction = Direction.OUTGOING, elementClass = Group.class)
	private Set<Group> groups = new HashSet<>();
}

4.x

SDN 4.x currently does not use the Neo4j Core API directly. Instead it relies on the Neo4j REST API. The overall performance for remotely connected Neo4j servers is better (compared to SDN 3.3 in remote mode).
I can only guess why Neo4j/Pivotal Software chose this approach, but my guess is that they started a rewrite of SDN in preparation for the binary protocol support for Neo4j and to speed up SDN when using a remote Neo4j.

TinkerPop

TinkerPop is a collection of APIs that allow transparent and easy interfacing with graph databases. The Blueprints API is the most low-level one; it is used to wrap a graph database’s native API. By doing so it provides a standardized API which other APIs can use to interface with a graph database through this layer. The API layer is very thin. There are various wrappers for many graph databases. I have used the Blueprints Neo4j implementation.

TinkerPop Blueprints is generally a good choice when you want to develop your application but are not yet sure which graph database you will use in the end.
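A minimal Blueprints 2 example using the in-memory TinkerGraph reference implementation (property names and values are made up); swapping in the Neo4j wrapper would only change how the Graph instance is constructed:

```java
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

public class BlueprintsExample {
    public static void main(String[] args) {
        // In-memory reference implementation; a Neo4j-backed graph would be
        // constructed here instead when targeting an embedded Neo4j
        Graph graph = new TinkerGraph();

        Vertex user = graph.addVertex(null);
        user.setProperty("username", "joe");

        Vertex group = graph.addVertex(null);
        group.setProperty("name", "admins");

        // Same vendor-neutral API regardless of the underlying database
        Edge membership = graph.addEdge(null, user, group, "MEMBER_OF");

        graph.shutdown();
    }
}
```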

There are three OGMs based upon the Blueprints API that I have looked at.

TinkerPop 2 – Frames

The Frames API uses annotations similar to SDN, thus switching from SDN to Frames is not that hard. Indices have to be created separately. TinkerPop does not support Cypher; you would need to write your statements in Gremlin instead.

public interface User extends AbstractPersistable {

        @Property("firstname")
        public String getFirstname();

        @Property("firstname")
        public void setFirstname(String name);

        @Property("lastname")
        public String getLastname();

        @Property("lastname")
        public void setLastname(String name);

        @Property("username")
        public String getUsername();

        @Property("username")
        public void setUsername(String name);

        @Adjacency(label = "HAS_USER", direction = Direction.OUT)
        public Iterable<Group> getGroups();
}

TinkerPop – Totorom

I guess Totorom could be seen as a successor to Frames. It is faster compared to Frames and it natively interfaces with the TinkerPop Gremlin query API.
The whole OGM is also very small. Many (all?) annotations are gone. Instead of interfaces you write classes, which makes things a lot easier compared to Frames. In Frames, custom method handlers would need a special annotation (@JavaHandler) and a dedicated handler implementation for the method. With Totorom you just add your custom method.

public class User extends AbstractPersistable {

        public static final String FIRSTNAME_KEY = "firstname";

        public static final String LASTNAME_KEY = "lastname";

        public static final String USERNAME_KEY = "username";

        public String getFirstname() {
                return getProperty(FIRSTNAME_KEY);
        }

        public void setFirstname(String name) {
                setProperty(FIRSTNAME_KEY, name);
        }

        public String getLastname() {
                return getProperty(LASTNAME_KEY);
        }

        public void setLastname(String name) {
                setProperty(LASTNAME_KEY, name);
        }

        public String getUsername() {
                return getProperty(USERNAME_KEY);
        }

        public void setUsername(String name) {
                setProperty(USERNAME_KEY, name);
        }

        public List<Group> getGroups() {
                return out("HAS_USER").toList(Group.class);
        }

}

Ferma

The API of Ferma is very similar to Totorom. Ferma has various operation modes. It also supports the Frames annotations.

public class User extends AbstractPersistable {

        public static final String FIRSTNAME_KEY = "firstname";

        public static final String LASTNAME_KEY = "lastname";

        public static final String USERNAME_KEY = "username";

        public String getFirstname() {
                return getProperty(FIRSTNAME_KEY);
        }

        public void setFirstname(String name) {
                setProperty(FIRSTNAME_KEY, name);
        }

        public String getLastname() {
                return getProperty(LASTNAME_KEY);
        }

        public void setLastname(String name) {
                setProperty(LASTNAME_KEY, name);
        }

        public String getUsername() {
                return getProperty(USERNAME_KEY);
        }

        public void setUsername(String name) {
                setProperty(USERNAME_KEY, name);
        }

        public List<Group> getGroups() {
                return out("HAS_USER").toList(Group.class);
        }

}

Temporary Server Portforward

Sometimes it is useful to redirect all TCP traffic on port 80 from one server to another. This can be important when you change your DNS entries from one server to another and you don’t want to leave the old web server running.

  echo 1 > /proc/sys/net/ipv4/ip_forward
  iptables -F
  iptables -t nat -F
  iptables -X
  iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-destination SOURCE_IP:80
  iptables -t nat -A POSTROUTING -p tcp -d SOURCE_IP --dport 80 -j SNAT --to-source DESTINATIONIP
  iptables -t nat -A POSTROUTING -j MASQUERADE

GlusterFS / Async Writes

Over the weekend I decided to experiment and set up GlusterFS between my root server (plexus) and my file server at home (hydra).

I won’t describe how to set up GlusterFS. There are various good guides available.

  gluster volume create gv0 replica 2 hydra:/opt/gfs/brick1/gv0 plexus:/gfs/brick1/gv0
  gluster volume set gv0 network.ping-timeout 5
  gluster volume start gv0

  root@plexus:/gfs# mount -t glusterfs localhost:/gv0 gv0/
  root@hydra:/gfs# mount -t glusterfs localhost:/gv0 gv0/

At first I hoped that small writes would be cached and asynchronously flushed to the filesystem. I tried to increase the performance.cache-size and performance.write-behind-window-size options but did not succeed in increasing the write performance.

As you can see, the write performance is basically limited by my maximum upload speed.

  # Write 1MB
  root@hydra:/gfs/gv0# dd if=/dev/zero  of=testfile bs=1K count=1024
  1024+0 records in
  1024+0 records out
  1048576 bytes (1.0 MB) copied, 3.36612 s, 312 kB/s

Luckily there are union filesystems like unionfs, which are for example used for live CDs. Such filesystems overlay multiple filesystems into a single virtual one.

I created a new gv0-writecache folder which can be written to. The following command creates a unionfs filesystem that is mounted to the /gfs/union folder.

  mkdir -p /gfs/union
  unionfs-fuse  -o default_permissions -o allow_other /gfs/gv0-writecache=RW:/gfs/gv0=RO /gfs/union/

Subsequent writes to the union folder are now very fast and the files are stored within the /gfs/gv0-writecache folder.

  root@hydra:/gfs/union# dd if=/dev/zero of=test2 bs=1M count=10
  10+0 records in
  10+0 records out
  10485760 bytes (10 MB) copied, 0.0454509 s, 231 MB/s

After that step we only need to sync those files from the write cache to the GlusterFS mount.

  root@hydra:/gfs/gv0-writecache# rsync  --remove-source-files -av /gfs/gv0-writecache/* /gfs/gv0

I also tried to use aufs instead of unionfs-fuse but I ran into various strange errors.

Resizing RAID 5 with mdadm, LVM and LUKS encryption on Debian Wheezy

This guide will show you how to set up a simple mdadm/LVM test environment in which you can experiment with changes you want to execute on a real disk system.

Setup the test environment

You can of course use your own disks. In my case I did a trial run of the needed steps by using losetup to map data files to loop devices.

  dd if=/dev/zero of=diska bs=1M count=300
  dd if=/dev/zero of=diskb bs=1M count=300
  dd if=/dev/zero of=diskc bs=1M count=300
  dd if=/dev/zero of=diskd bs=1M count=300
  losetup /dev/loop0 diska
  losetup /dev/loop1 diskb
  losetup /dev/loop2 diskc
  losetup /dev/loop3 diskd

  # Create raid 5 with one disk missing
  mdadm -Cv /dev/md2 -l5 -n3 /dev/loop0 /dev/loop1 missing

  # Add luks cryptolayer ontop
  cryptsetup luksFormat /dev/md2
  cryptsetup luksOpen /dev/md2 cryptotest

  # Add lvm structure 
  pvcreate /dev/mapper/cryptotest
  vgcreate testvg /dev/mapper/cryptotest
  lvcreate -l 100%FREE -n testlv testvg

  # Create filesystem
  mkfs.ext3  /dev/mapper/testvg-testlv
  mount /dev/mapper/testvg-testlv test/

You can now copy your data onto the RAID 5 and, when done, add the missing disk to the final raid. You can of course use three disks from the start.

Add the missing disk to the RAID 5

  # Add missing device
  mdadm --add /dev/md2 /dev/loop2

Growing the RAID 5 by adding a fourth disk

Once your data has been copied you may want to add the now-free disk to md2.

  # Add disk and resize md2
  mdadm --add /dev/md2 /dev/loop3
  mdadm --grow --raid-devices=4 /dev/md2

  # Check the sync process and the final result
  cat /proc/mdstat
  mdadm --detail /dev/md2
  umount /tmp/test/test

  # Resize the crypt layer and check the result
  cryptsetup resize cryptotest
  fdisk -l /dev/mapper/cryptotest

  # Resize physical volume and check the result
  pvresize /dev/mapper/cryptotest 
  pvdisplay

  # Resize logical volume and check result
  lvextend -l +100%FREE /dev/mapper/testvg-testlv
  lvdisplay 

  # Resize filesystem
  e2fsck -f /dev/mapper/testvg-testlv
  resize2fs /dev/mapper/testvg-testlv
  mount /dev/mapper/testvg-testlv test

Cleanup of the test environment

Of course you should not run these steps on the real raid. They are only needed when you want to clean up your test environment.

  umount /tmp/test/test
  lvremove /dev/testvg/testlv
  vgremove testvg
  pvremove  /dev/mapper/cryptotest 
  cryptsetup close /dev/mapper/cryptotest 
  mdadm --manage /dev/md2 --stop
  losetup -d /dev/loop0
  losetup -d /dev/loop1
  losetup -d /dev/loop2
  losetup -d /dev/loop3