Hi, all. Are there any enterprise-grade database engines (Oracle, MS SQL, etc.) that can handle large RDF datasets (320 million triples) and SPARQL queries? I guess my question is also: is SPARQL/RDF/OWL ready to serve large real-world data warehouses for an enterprise? If not, are there efficient mechanisms for adapting SPARQL/RDF to a typical data warehouse star schema?
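To make the star-schema part of the question concrete, this is the rough shape of query I would want to run, with the foreign keys of a fact table becoming object properties that point at dimension resources. All prefixes, class names, and predicate names below are invented for illustration:

    # Hypothetical star-schema query in SPARQL: join a sales fact to its
    # product and store dimensions. Every name here is made up.
    PREFIX dw: <http://example.com/warehouse#>

    SELECT ?productName ?storeRegion ?amount
    WHERE {
      ?sale a dw:SaleFact ;          # one resource per fact-table row
            dw:amount  ?amount ;
            dw:product ?product ;    # "foreign keys" as object properties
            dw:store   ?store .
      ?product dw:name   ?productName .   # dimension lookups
      ?store   dw:region ?storeRegion .
    }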

Thanks!

A: 

I don't know the answer, but maybe look into the Billion Triples Challenge described at http://challenge.semanticweb.org/.

Kaarel
+1  A: 

Following on from Kaarel's suggestion, one of the entries presented at ISWC this year used 4store, which does scale that far. The competitor set it up in some weird configuration which the CTO of Garlik (who develop 4store) described to me and colleagues as 'crazy', but 4store would be capable of that scale - http://4store.org

Virtuoso also supports stores at this scale; they have a live application that you can use to run SPARQL queries over the majority of the major LOD (Linked Open Data) sources, which total around 9 billion triples:

Virtuoso - http://virtuoso.openlinksw.com
LOD Application - http://lod.openlinksw.com/sparql
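
As a trivial sanity check, you can paste a query like this into the LOD endpoint's form and get an answer back over the full dataset (it just fetches ten arbitrary triples):

    # Pull back ten arbitrary triples from the LOD cloud cache.
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10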

RobV
A: 

Intellidimension provides a solution called Semantic Server that is developed on top of Microsoft's SQL Server 2005 or 2008. It easily scales to hundreds of millions of triples, and I know they have at least one customer happily running an enterprise deployment with over a billion statements.

I am one of their customers, working with datasets of more than 100 million statements. Our plans are to move towards tens of billions of statements.

spoon16
+2  A: 

Virtuoso is the datastore used by Bio2RDF and DBpedia.

Pierre
A: 

4store looks to be a good solution; however, the documentation is pretty sparse at this time, and when I last looked at it there was no way to delete an individual triple from the graph.
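
For reference, this is the kind of single-triple delete I was looking for, expressed in SPARQL Update syntax. The URIs below are invented, and whether this works at all depends on a given store's SPARQL Update support:

    # Hypothetical example: remove one specific triple.
    # The resource and predicate URIs are made up for illustration.
    PREFIX ex: <http://example.com/>

    DELETE DATA {
      ex:product42 ex:discontinued "true" .
    }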

I would also take a look at Bigdata.

Here is a quote from their main page summarizing their offering:

Bigdata(R) is an open-source scale-out storage and computing fabric supporting optional transactions, very high concurrency, and very high aggregate IO rates. Bigdata was designed from the ground up as a distributed database architecture optimized for very high aggregate IO rates running over clusters of 100s to 1000s of machines, but can also run in a single-server mode. Bigdata offers a distributed file system, similar to the Google File System but also useful for workflow queues, a data extensible sparse row store, similar to Googles widely recognized bigtable project, and map/reduce processing for parallelizing data intensive workflows over a cluster.

Bigdata(R) comes packaged with a very high-performance RDF store supporting RDF(S) and OWL Lite inference. The Bigdata RDF Store is currently the only RDF database capable of operating distributed on a cluster with dynamic key-range partitioning of indices. The Bigdata RDF Store was designed specifically to meet requirements for very large scale semantic alignment and federation. RDF is a Semantic Web technology particularly well-suited to modeling graph-shaped data and metadata, such as an associative entity-link model, whereby actors are linked to one another in an ad-hoc fashion within the context of an evolving ontology of concepts for entity types and link types related to a particular problem domain. The Bigdata RDF Store is used operationally in data harvesting systems to create mash-ups of structured, semi-structured, and unstructured data from myriad sources in a schema-flexible manner.

grimesjm
+1  A: 

I maintain this list of large triplestores on the W3C wiki:
http://esw.w3.org/topic/LargeTripleStores

There are seven triplestores that are known to be able to hold over a billion triples. Four of them are open source. Please update the above-mentioned wiki page if you have more information.

Obviously, performance depends on what you use it for. I used Virtuoso in a large-scale industrial project, and it is quite fast.

Nicolas Raoul
A: 

Neo4j handles 1+ billion triples out of the box via its SAIL API, while you still have the whole graph available for advanced operations with things like Gremlin or SPARQL.

Disclaimer: I am part of the Neo4j team.