19 May 2013

Semantic Web and Linked Data Storage

The Semantic Web often depends almost entirely on an efficient back-end storage and indexing strategy, since most of the processing stems from it. Leaving the most valuable part of a Semantic Web architecture until the end, and treating it as little more than an interface concern, is a bad move. One should always think through the data layer first. The Semantic Web is like a workflow of services in a pipeline and has to be designed in that manner, as everything depends on resources and the active querying of those resources. In fact, extending the model with linked data VoID interlinks sharply increases the data requirements even further.

There are generally three ways of approaching a back-end for the Semantic Web. The first approach is to treat it as a pure W3C stack in a regular client-server model, with the triplestore as the server and the web interface of services as the client. The second approach is to apply more granularity by using property graphs with the TinkerPop framework. This opens up a whole range of graph properties and NoSQL options. The third approach is to start from a standard relational model and convert it into an RDF repository. In all three cases, an RDF interface layer similar to JDBC is required, and possibly a search indexing layer as well.
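The third approach can be illustrated with a minimal sketch, which is not the method of any particular tool. It assumes a made-up `person` table, a hypothetical `http://example.org/` namespace, and a deliberately simple rule: one subject per row, one predicate per column, and every value emitted as a plain string literal.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, chosen for illustration


def rows_to_ntriples(conn, table, key_column):
    """Yield one N-Triples line per non-null, non-key column value."""
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key_column}"')
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key_column]}>"
        for column, value in record.items():
            if column == key_column or value is None:
                continue  # the key is already in the subject; NULLs produce no triple
            predicate = f"<{BASE}{table}#{column}>"
            # Escape backslashes, quotes and newlines for an N-Triples literal.
            literal = (
                str(value)
                .replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
            )
            yield f'{subject} {predicate} "{literal}" .'


def demo():
    """Build a tiny in-memory table and convert it to triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO person VALUES (?, ?, ?)",
        [(1, "Ada", "London"), (2, "Alan", None)],
    )
    return list(rows_to_ntriples(conn, "person", "id"))


if __name__ == "__main__":
    for line in demo():
        print(line)
```

A real conversion would also need datatypes, foreign keys mapped to links between resources, and a stable URI scheme, which is where most of the design effort goes.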

The two most common interface layers, both of which also come with their own storage layers, are Sesame and Jena. Sesame is the more versatile of the two, providing more robust features, and a majority of triplestores are based on its model. Jena appears to take a stricter, W3C-driven approach. In both cases the bundled storage is not sufficient for production requirements, as the data can grow very quickly. One obviously has to leave room for current and future data needs. Clustering is often required to scale out SPARQL queries. In almost all cases a read-only SPARQL endpoint has to be provided for users to interface with. SPARQL 1.1 even adds update operations such as insert. However, these particular operations should be restricted to admin level.
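The split between a read-only endpoint and admin-only updates can be sketched as a small gatekeeper in front of the store. This is a naive heuristic and not a SPARQL parser: it only looks at the first keyword after the prologue, and the function names are made up for illustration. In practice one would more likely rely on the triplestore's own access control or expose separate query and update endpoints.

```python
import re

# Query forms from SPARQL 1.1 Query and operation keywords from SPARQL 1.1 Update.
QUERY_FORMS = {"SELECT", "ASK", "CONSTRUCT", "DESCRIBE"}
UPDATE_FORMS = {
    "INSERT", "DELETE", "LOAD", "CLEAR", "CREATE",
    "DROP", "COPY", "MOVE", "ADD", "WITH",
}

# Matches one leading PREFIX or BASE declaration.
_PROLOGUE = re.compile(
    r"^\s*(?:PREFIX\s+[^\s:]*:\s*<[^>]*>|BASE\s+<[^>]*>)", re.IGNORECASE
)


def first_keyword(sparql):
    """Return the first keyword after comment-only lines and the prologue."""
    text = re.sub(r"(?m)^\s*#.*$", "", sparql)  # drop comment-only lines
    while True:
        match = _PROLOGUE.match(text)
        if not match:
            break
        text = text[match.end():]
    match = re.match(r"\s*([A-Za-z]+)", text)
    return match.group(1).upper() if match else ""


def is_allowed(sparql, is_admin):
    """Allow read-only queries for everyone and updates only for admins."""
    keyword = first_keyword(sparql)
    if keyword in QUERY_FORMS:
        return True
    if keyword in UPDATE_FORMS:
        return is_admin
    return False  # unknown or empty requests are rejected


if __name__ == "__main__":
    read = "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\nSELECT ?s WHERE { ?s foaf:name ?n }"
    write = "INSERT DATA { <http://example.org/a> <http://example.org/b> <http://example.org/c> }"
    print(is_allowed(read, is_admin=False))
    print(is_allowed(write, is_admin=False))
    print(is_allowed(write, is_admin=True))
```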

Open source triplestores are generally quite limited for production use, so a workaround sometimes has to be applied to meet scalability and storage needs. Currently, the top-performing triplestores include Virtuoso, OWLIM, and AllegroGraph, all very much commercial and with quite a large toolset. The next best triplestore would be Bigdata, which is a fairly good open source option providing clustering, sharding, and full-text indexing. It also has a ZooKeeper connector. For a property graph one can almost always use Neo4j or OrientDB. OrientDB provides a more liberal license option. Solutions that use Hadoop as the underlying back-end storage layer will not perform very well due to the nature of its distributed design. The storage layer could be deployed to a clustered, production-ready environment of 64-bit machines with 4-8 CPU cores.

The Semantic Web is really starting to take off, and more and more interesting options are starting to emerge. However, it is still the case that open source solutions are lacking in production quality and are more experimental, suited to research use. The field is still dominated by commercial players who provide a Swiss army knife of solutions at an obvious premium. There is still a lot to be done, not least in making the Semantic Web more accessible for developers, as the W3C specifications can be quite complex and there is a bewildering set of models that can be applied in varied combinations. Perhaps the introduction of JSON-LD will help make linked data more accessible for front-end developers. Simplicity and convergence are key to making the Semantic Web the next evolution for Big Data and the Internet.
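To give a feel for why JSON-LD may be easier on front-end developers, here is a minimal sketch. The document is ordinary JSON plus an `@context` that maps short keys to full property IRIs. The `naive_triples` helper is hypothetical and is not the JSON-LD expansion algorithm: it assumes a flat document whose context maps each term directly to an IRI string, and the `example.org` identifiers are made up.

```python
import json

# An ordinary JSON object; only "@context" and "@id" mark it as linked data.
DOCUMENT = json.loads("""
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": "http://xmlns.com/foaf/0.1/homepage"
  },
  "@id": "http://example.org/people/ada",
  "name": "Ada",
  "homepage": "http://example.org/~ada"
}
""")


def naive_triples(document):
    """Turn a flat JSON-LD-style object into (subject, predicate, value) tuples.

    Only handles a flat object with a simple term-to-IRI context; nested
    objects, typed values and arrays are out of scope for this sketch.
    """
    context = document.get("@context", {})
    subject = document["@id"]
    triples = []
    for key, value in document.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @context and @id
        predicate = context.get(key, key)  # fall back to the raw key if unmapped
        triples.append((subject, predicate, value))
    return triples


if __name__ == "__main__":
    for triple in naive_triples(DOCUMENT):
        print(triple)
```

A front-end developer can keep reading `DOCUMENT["name"]` as plain JSON, while a linked data consumer uses the context to recover the full property IRIs.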

Java:
Sesame
Jena
Tinkerpop
linkeddataapi
any23
marmotta
stanbol
rdf2go
sesametools
groovysparql
pellet
owl-api
jsonld for java

Python:
Redland
RDFLib
Bulbflow
RDFAlchemy
Fuxi
Surf
ORDF
Django-rdf
Djubby
pysparql
sparta
Oort
sparqlwrapper


JavaScript/Nodejs:
RDFQuery
Tabulator

Semantic NLP:
KEA
OpenNLP
DBpedia Spotlight
Maui

Graph stores:
Neo4j
OrientDB
AllegroGraph
Virtuoso
BigData
Ontotext
Titan
Stardog

W3C:
SPARQL 1.1
RDF
JSON-LD

Reconciliation:
GoogleRefine