Consider a large and active user base where each user wants to store a profile picture and some additional images or other artifacts. Are there any libraries or frameworks that make it easy to store and query such data?

A reference implementation would be Facebook's Haystack Photo Infrastructure.

The following characteristics are important:

  • The data store should scale well: adding resources should be transparent to the application using the store (a similar question had an answer referring to LinkedIn's Voldemort).
  • Ability to add some metadata alongside the data being stored.
  • Metadata can be queried with good performance (e.g. stored in a configurable index like Lucene/Solr).
  • Quick key-based access and some intermediate caching layer.
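
To make the last point concrete, here is a minimal sketch of a read-through LRU cache placed in front of a slower key-based store. The class name `ReadThroughCache` and the `backingStore` function are hypothetical and only illustrate the idea; none of the products mentioned in this thread is being modeled here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Read-through LRU cache in front of a slower key-based store (hypothetical sketch). */
final class ReadThroughCache<K, V> {
    private final Function<K, V> backingStore; // assumed: a lookup into the real blob store
    private final Map<K, V> cache;
    private int misses = 0;

    ReadThroughCache(final int capacity, Function<K, V> backingStore) {
        this.backingStore = backingStore;
        // accessOrder = true keeps entries ordered from least to most recently used
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least recently used entry once capacity is exceeded
                return size() > capacity;
            }
        };
    }

    /** Returns the cached value, loading it from the backing store on a miss. */
    synchronized V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            misses++;
            value = backingStore.apply(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    /** Number of lookups that had to go to the backing store. */
    synchronized int misses() {
        return misses;
    }
}
```

A single synchronized map like this would likely become a bottleneck under real load; it is only meant to show where such a layer sits relative to the store.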

Any recommendations for libraries or frameworks that can be easily integrated into a Java web application are welcome.

Update: thank you for the first few answers. I have to go into more detail on what type of answers I expect. Tobu's answer, although not Java-related, is very good (just voted up). It is possible to implement a solution with a combination of file system access and a DB and add some layer of caching in between, but I consider that a waste of time if someone more qualified than me has already designed, implemented, and run a better solution. Something built on an underlying DB or JCR implementation is a good fit, but implementing the rest of the infrastructure is not what I want to do.

+2  A: 

MogileFS is what LiveJournal uses. Not particularly Java though.

Tobu
A: 

I feel your requirements are pretty close to what a database provides. Just make sure the table design corresponds to your needs (for example, you could keep big data such as images in a separate table from the metadata).

All your requirements would be covered, including the caching layer in the database (and you could add a caching layer in your application as needed, which would probably also serve the rest of your application).
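
The split between metadata and large binary data can be sketched roughly as follows. This uses two in-memory maps as stand-ins for the two tables, purely to show the shape of the design; the class `SplitMediaStore` and its methods are hypothetical, and a real version would use a database with the same separation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for a "metadata table" plus a separate "blob table" (hypothetical). */
final class SplitMediaStore {
    // Stand-in for a narrow metadata table: id -> attributes
    private final Map<String, Map<String, String>> metadata = new HashMap<>();
    // Stand-in for a separate table holding only the large binary data: id -> bytes
    private final Map<String, byte[]> blobs = new HashMap<>();

    /** Stores the bytes and their attributes under the same id, in separate structures. */
    void put(String id, byte[] data, Map<String, String> attributes) {
        blobs.put(id, data.clone());
        metadata.put(id, new HashMap<>(attributes));
    }

    /** Returns the ids whose metadata matches; the blob data is never touched. */
    List<String> findByAttribute(String name, String value) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : metadata.entrySet()) {
            if (value.equals(entry.getValue().get(name))) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    /** Key-based access to the binary data; returns null if the id is unknown. */
    byte[] load(String id) {
        byte[] data = blobs.get(id);
        return data == null ? null : data.clone();
    }
}
```

The point of the split is that metadata queries scan only the small structure, and the large data is fetched by key only once the matching ids are known.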

KLE
A: 

Apache Jackrabbit is a fully conforming implementation of the Content Repository for Java Technology API (JCR, specified in JSR 170 and 283). It has some performance issues, though (at least in the two-year-old version I use); the best way to work around them is to replicate static images to a web server (using WebDAV, davfs, and rsync).

stacker
+1  A: 

We've had good experiences with the media repository from Fedora Commons (http://www.fedora-commons.org/), which allows you to store media assets alongside their associated metadata. We did not have any problems with scalability or customization, nor was it difficult to replace the underlying storage layer with a triple store (should that be needed in your case). If you need to index your data using Solr, you can use a predefined metadata field ("RELS-EXT") to store XML-based data.

Philipp
Thank you Philipp, great input! We will definitely try this one.
Kariem
A: 

It depends on the quantification of "large and active user base"...

80% of websites could simply use a NoSQL schema-free approach like y_serial:

y_serial.py module :: warehouse Python objects with SQLite

"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."

http://yserial.sourceforge.net

If the photos and artifacts per user are under 2M compressed, performance should be good.

For the remaining 20% of use cases, one can easily import the data from y_serial into Cassandra, which is now used by Facebook, Digg, and Twitter.

code43