Hello everyone,

I'm looking for some architecture ideas on a problem at work that I may have to solve.

The problem:
1) Our enterprise LDAP has become a "contact master" filled with years of stale data and unused, unmaintained attributes.
2) Management has decided that LDAP will no longer serve as a company phone book; it is for authorization purposes only.
3) The company has contact-type data about people in hundreds of different sources. We need to scrub all the junk out of LDAP and give the other applications a central repo to store all this data about a person.

The ideal goal:
1) Have a single source to store all the various attributes about a person.
2) The company probably has info on 500k people (read: 500K rows).
3) I estimate there could be 500 to 1000 optional attributes on these people (read: 500+ columns).
4) Data would primarily be set/get via XML over JMS (this infrastructure is already in place).
5) Individual groups within the company could "own" columns. Only they would be allowed to write to their columns, and they would be responsible for keeping the data clean.
6) A single-record lookup should return in sub-second time.
7) The system should support 1 million requests per hour at peak.
8) The primary goal is to serve real-time data to the enterprise; reporting is a secondary goal.
9) We are a Java, Oracle, Teradata shop; your typical big IT shop.

My thoughts:
1) Originally I thought LDAP might work, but it doesn't scale when new columns are added.
2) My next thought was some kind of NoSQL solution, but from what I have read, I don't think I can get the performance I need, and it's still relatively new. I'm not sure I can get my manager to sign off on something like that for such a critical project.
3) I think there will be a metadata component to the solution that will track who owns the columns, what each column represents, and the original source system.
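
To make that metadata component concrete, here is a very rough sketch of how it might look in SQL (table and column names are placeholders I made up, not a real design):

    -- Groups within the company that can own attributes
    CREATE TABLE attribute_owner (
        owner_id   NUMBER PRIMARY KEY,
        owner_name VARCHAR2(100) NOT NULL            -- e.g. "HR", "Facilities"
    );

    -- One row per attribute ("column") the repository knows about
    CREATE TABLE attribute_def (
        attribute_id   NUMBER PRIMARY KEY,
        attribute_name VARCHAR2(100) NOT NULL UNIQUE,  -- e.g. "desk_phone"
        description    VARCHAR2(400),                  -- what the column represents
        source_system  VARCHAR2(100),                  -- original system of record
        owner_id       NUMBER NOT NULL REFERENCES attribute_owner(owner_id),
        is_required    CHAR(1) DEFAULT 'N' NOT NULL    -- enforced at write time
    );

The write path would consult attribute_def before accepting an update, so only the owning group can write to its columns.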

Thanks for reading, and thanks in advance for any thoughts.

+1  A: 

You may want to look into Len Silverston's Party Model. Here's a link to his book: http://www.amazon.com/Data-Model-Resource-Book-Vol/dp/0471380237.

I have no experience building something on that scale, though I think that modeling it as 500k rows x 500-1000 columns sounds a bit ridiculous.

Shlomo
Thanks for the book recommendation, I'll check it out tonight. Can you elaborate a little bit on your 500k rows by 500 columns comment? If I understand the scope of my data up front, why wouldn't I build with those numbers in mind?
bostonBob
500 columns implies a physical structure (one table) that may not be an optimal solution to your problem. Also, when you have that much optional data, doesn't enforcing 'required' data become a pain?
Shlomo
In my head I guess I see the solution as one large sparse table. Every row would have a UUID. Whether a column is required or not would be up to the column owner and any consumers of that column. Ideally the system would enforce this at write time.
bostonBob
I see it more as a system of tables heavily implementing an inheritance pattern. This would also avoid harsh schema changes most of the time. I found Silverston's book to be a real eye-opener (read some of the comments on Amazon).
Shlomo
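
For what it's worth, a minimal sketch of the kind of inheritance/subtype layout being suggested here might look like this (table names are invented for illustration; Silverston's Party Model is far richer than this):

    -- Supertype: one row per person, nothing but identity
    CREATE TABLE party (
        party_id   NUMBER PRIMARY KEY,
        party_uuid RAW(16) NOT NULL UNIQUE
    );

    -- Subtype owned by HR
    CREATE TABLE party_hr_details (
        party_id    NUMBER PRIMARY KEY REFERENCES party(party_id),
        employee_no VARCHAR2(20),
        hire_date   DATE
    );

    -- Subtype owned by Facilities
    CREATE TABLE party_facility_details (
        party_id   NUMBER PRIMARY KEY REFERENCES party(party_id),
        building   VARCHAR2(50),
        desk_phone VARCHAR2(20)
    );

Adding a new group of attributes means adding a new subtype table rather than widening a single 500-column table.
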
+2  A: 

A couple of thoughts:

1) our enterprise LDAP has become a "contact master" filled with years of stale data and unused and unmaintained attributes.

This isn't really a technological problem. You will have this problem with a new system as well, LDAP or not.

"LDAP ... doesn't scale"

There are lots of huge LDAP systems out there. LDAP is surely a dark art, but I'd be willing to bet that it scales better than any SQL equivalent in this situation. Not to mention that LDAP is a standard for this kind of info, and as such it is accessible from zillions of different kinds of systems.

Maybe what you're looking for is a new LDAP system that's easier to manage / has better admin tools?

Seth
I hope to "solve" the stale data issue by giving ownership of columns to the groups that own the data. They would have to authenticate before they could write to the data store. The data could still become stale, but with metadata on all of the columns we know who owns the data and the last time it was updated.
bostonBob
As far as LDAP scaling, I think I was a little unclear in my original post. Our LDAP is extremely fast and very good at retrieving a single record given a key value. The part of LDAP that doesn't scale is the number of attributes we need; adding an attribute in LDAP requires a schema change. I'd like a solution that is a little more dynamic, where we can create columns and groups of columns and assign ownership of the column groups. Maybe the answer is LDAP; I'm just hoping to find something a little more dynamic.
bostonBob
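
One common way to get that kind of dynamism out of a relational store is an entity-attribute-value (EAV) layout, where new "columns" are just rows in a definitions table. A minimal sketch, with invented names, assuming Oracle:

    -- Attribute definitions can be added at runtime; no DDL change needed
    CREATE TABLE attr_def (
        attr_id   NUMBER PRIMARY KEY,
        attr_name VARCHAR2(100) NOT NULL UNIQUE,
        owner_grp VARCHAR2(100) NOT NULL
    );

    -- One row per (person, attribute) pair that actually has a value
    CREATE TABLE person_attr_value (
        person_uuid RAW(16) NOT NULL,
        attr_id     NUMBER  NOT NULL REFERENCES attr_def(attr_id),
        attr_value  VARCHAR2(4000),
        updated_at  TIMESTAMP DEFAULT SYSTIMESTAMP NOT NULL,
        PRIMARY KEY (person_uuid, attr_id)
    );

    -- Single-record lookup is one indexed range scan plus a join
    SELECT d.attr_name, v.attr_value
      FROM person_attr_value v
      JOIN attr_def d ON d.attr_id = v.attr_id
     WHERE v.person_uuid = :uuid;

The trade-off is that every value is stored as a string and multi-attribute queries get join-heavy, which is part of why the subtype and anchor-modeling approaches discussed elsewhere in this thread may fit better.
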
+1  A: 

SQL

With Teradata-grade tools an SQL-based solution may be feasible. I came across an article on database design a while ago that discussed "anchor modeling".

Basically, the idea is to create a single, dumb, synthetic primary key table, while all real or meta data lives in other tables (subsets) and is attached by way of a foreign key + join.

I see the benefit of this design being two-fold. First, you can more easily compartmentalize data storage either for organizational or performance reasons. Second, you only create additional rows for records that have data in any given subset, so you use less space and indexing and searching are faster.

Subsets might be based on maintainer or some other criteria. XML set/get would be per-subset/record (rather than per global record). All subsets for a given record can be composited and cached. Additional subsets can be created for metadata, search indexes, etc., and these can be queried independently.
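
A minimal sketch of that shape, with table names invented purely for illustration (real anchor modeling adds conventions for historization and naming that are glossed over here):

    -- The "anchor": nothing but a synthetic key, one row per person
    CREATE TABLE person_anchor (
        person_id NUMBER PRIMARY KEY
    );

    -- Each attribute (or small subset of attributes) lives in its own table,
    -- tied back to the anchor by foreign key
    CREATE TABLE person_desk_phone (
        person_id  NUMBER NOT NULL REFERENCES person_anchor(person_id),
        desk_phone VARCHAR2(20) NOT NULL,
        changed_at TIMESTAMP DEFAULT SYSTIMESTAMP NOT NULL,
        PRIMARY KEY (person_id, changed_at)    -- keeps history; latest row wins
    );

    -- Compositing a record means joining only the subsets you need
    SELECT a.person_id, p.desk_phone
      FROM person_anchor a
      LEFT JOIN person_desk_phone p ON p.person_id = a.person_id
     WHERE a.person_id = :id;

Rows exist in person_desk_phone only for people who actually have a desk phone, which is where the space and indexing savings come from.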

NoSQL

NoSQL seems similar to LDAP (in theory, at least) but the benefit of a good NoSQL tool would include greater abstraction of metadata, versioning, and organization. In fact, from what I've read it seems that NoSQL datastores are designed to address some of the issues you've raised with respect to scaling and loosely structured data. There's a good question on SO regarding datastores.

Production NoSQL

Off-hand, there are a handful of large companies using NoSQL in massively-scaled environments, such as Google's Bigtable. It seems like the perfect tool for:

6) a single record lookup should be returned in sub seconds
7) system should support 1 million requests per hour at peak.

Bigtable is only available (to my knowledge) through AppEngine. Other, similar technologies are listed here.

Other Thoughts

The bigger picture view looks more or less the same regardless of the technology you decide to use. E.g. compartmentalize storage, composite views, cache views, stick metadata somewhere so you can find things.

The performance characteristics you're targeting are going to require some kind of caching and/or optimization based on real-world usage patterns. Regardless of the solution you choose, you probably can't resolve that in the design phase.

banzaimonkey
Wow, thanks for the link on anchor modeling, I feel like I owe you a consulting fee now :) Bigtable was my original thought, and that led me to HBase and Hypertable, with Hypertable taking a slight lead. I think I'm going to have to take a step back and dig into this new topic. Thanks again.
bostonBob
@bostonBob Glad I could share some new ideas. :) I updated to include a link regarding datastores, too, that may be useful. Cheers!
banzaimonkey