Now that "NoSQL" or "object-only" storage systems like MongoDB or memcached are really picking up steam, I was wondering whether there are any requests that can be performed with multi-table joins (in SQL, JOIN "table") but cannot be performed on them. In other words, are there any multi-table queries that cannot be handled by several single-table queries in a row?

Basically, is there a use case where a multi-table join cannot be replicated by accessing one table at a time in object-based storage systems?

Here are some examples of normal 3NF queries using has_many and has_many_through relations. These aren't the most complex queries - but they should give you a starting point for the concept. Note that any value in {} means a value taken from the result of the previous query.


Company Has Many Users

SELECT user.*, company.name as company_name FROM user 
LEFT JOIN company ON company.id = user.company_id
WHERE user.id = 4

vs

SELECT * FROM user WHERE id = 4
SELECT * FROM company WHERE id = {user.company_id}
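To make the comparison concrete, here is a minimal sketch of both approaches in Python against SQLite. The schema and data (companies "Acme", user id 4) are made up purely for illustration:

```python
import sqlite3

# Hypothetical schema and data, just to illustrate the equivalence.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, company_id INTEGER);
    INSERT INTO company VALUES (1, 'Acme');
    INSERT INTO user VALUES (4, 'Alice', 1);
""")

# One round trip: the JOIN version.
joined = db.execute("""
    SELECT user.name, company.name AS company_name FROM user
    LEFT JOIN company ON company.id = user.company_id
    WHERE user.id = 4
""").fetchone()

# Two round trips: the "one table at a time" version.
user = db.execute("SELECT * FROM user WHERE id = 4").fetchone()
company = db.execute(
    "SELECT * FROM company WHERE id = ?", (user[2],)).fetchone()

# Same data either way - the difference is the extra round trip.
assert joined == (user[1], company[1])
```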

Club Has Many Students Through Memberships

SELECT student.* FROM student LEFT JOIN membership ON
membership.student_id = student.id WHERE membership.club_id = 5

vs

SELECT * FROM membership WHERE club_id = 5
SELECT * FROM student WHERE id IN ({membership.student_id})
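Since the first query returns several memberships, the second step should collect the ids into one IN query rather than looping one query per membership (the classic N+1 trap). A sketch with made-up club/student data:

```python
import sqlite3

# Hypothetical data for the club/membership/student example.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE membership (club_id INTEGER, student_id INTEGER);
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO membership VALUES (5, 1), (5, 3), (6, 2);
""")

# Step 1: fetch the memberships for club 5.
ids = [row[0] for row in
       db.execute("SELECT student_id FROM membership WHERE club_id = 5")]

# Step 2: one IN query instead of one query per membership.
marks = ",".join("?" * len(ids))
students = db.execute(
    f"SELECT * FROM student WHERE id IN ({marks})", ids).fetchall()
```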

The reason I'm wondering is that I want to know whether object-based systems (which rely on accessing one table's objects at a time) can do what relational databases like PostgreSQL or MySQL can do.

So far the only drawback seems to be that more queries are necessary.

+4  A: 

Just because you can, doesn't mean you should.

Cons of the multiple-SELECT-statement alternative:

  • the fewer trips to the database, the better. TCP overhead cannot be recouped, and it looks like Network Neutrality is officially dead, so we could expect to see a movement away from multi-select/NoSQL because you might have to pay for that bandwidth...
  • because of the delay between the initial and subsequent statements, there's a risk that the supporting data no longer reflects what was in the system when the first query ran
  • less scalable - the larger the data set, the more work the application does to handle business rules and associations, which scale far better in a database
  • more complexity in the application, which also makes the business logic less portable (i.e.: migrating from Java to .NET or vice versa - you're looking at building from scratch, whereas business logic in the DB would minimize that)
OMG Ponies
Not exactly an answer, but you brought up some good points. 1) Fewer trips to the database is a plus, but I'm not sure where the point about paying for bandwidth comes from, since applications use a private LAN. 2) Very true, though I can't actually think of a case. 3) Possibly; however, without JOIN'ed data any caches would be many times smaller, since they only store 1:1 objects from the database. 4) Another good point - the more code written, the more needs to be re-written. Then again, how often do projects move from one language to another? Should this shape your application design?
Xeoncross
@Xeoncross: You're assuming a lot of absolutes... Not everyone is accessing their data over a LAN - hardcore setups will have data centers at different geolocations. Re: coding in different languages - it depends on the client's needs, and those can change at will. I disagree that this is "not exactly an answer" - I pointed out the shortcomings of the alternative queries you posted. Some are applicable to the situation at hand; some pitfalls are only visible on moderate to large projects. That doesn't mean small projects can't benefit; a minimal difference now doesn't mean it's always the case.
OMG Ponies
Correct, I was assuming a small/mid-sized business with only a single data center. You're totally right about needing cheap bandwidth. Then again, the cost of all the data in a 4kb JOIN vs the cost of 3x the packets couldn't be too much higher. At that point the cost of the data transferred is close enough - only the waiting time is the problem.
Xeoncross
@Xeoncross: 3x the packets is a lot of wasted bandwidth, assuming a single user. With multiple users, you're looking at exponential growth...
OMG Ponies
Sorry, I didn't mean 3x the packets. I meant one packet split into three parts. In other words, three queries with three results = one query with combined results.
Xeoncross
@Xeoncross: Understood, but packet overhead itself would be multiplied for however many queries were being run for an equivalent single query statement. The resultset itself wouldn't be multiplied...
OMG Ponies
+1  A: 

You could use NoSQL like an old-fashioned 'hierarchical' database too!

In addition to OMGPonies' answers, reporting is harder to do.

About scaling - that's not right. NoSQL is meant for scaling, if you use it right.

Another reason to go NoSQL: if you are doing all your work in objects, going through O/R mapping to SQL, and doing no work through complicated (i.e., hand-rolled for efficiency) UPDATE statements - e.g., an update of a join, or UPDATE ... WHERE ... IN (...).

If the database is single-purpose (usually the case for high-volume apps), NoSQL is more likely to be OK.

Multipurpose - OLTP - Line of Business - go with SQL.

I could go on but this is eating into my lunch break. Not that I would ever eat into work-time. I prefer to just eat during my lunch break.

FastAl
+2  A: 

Actually, one of the biggest problems is that some of the NoSQL databases are not transactional across multiple queries. ORMs like Hibernate will sometimes issue multiple queries without "joining", but they have the advantage that those queries run within the same transaction.

With NoSQL you do not have that luxury. So this could very easily have misleading results:

SELECT * FROM user WHERE id = 4
SELECT * FROM company WHERE id = {user.company_id}

If the company for user.company_id is deleted between the two statement calls, the second query comes back empty. This is a well-known issue with these databases. So regardless of whether or not you can properly do JOINs, the real issue is not having transactions.
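The race is easy to reproduce. In this Python/SQLite sketch (hypothetical tables; the DELETE stands in for another client's concurrent write), the second query finds nothing because the row vanished between the two statements:

```python
import sqlite3

# Sketch of the race: a deletion lands between the two statements.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user (id INTEGER PRIMARY KEY, company_id INTEGER);
    INSERT INTO company VALUES (1, 'Acme');
    INSERT INTO user VALUES (4, 1);
""")

user = db.execute("SELECT * FROM user WHERE id = 4").fetchone()

# Another client deletes the company here, before our second query runs.
db.execute("DELETE FROM company WHERE id = 1")

company = db.execute(
    "SELECT * FROM company WHERE id = ?", (user[1],)).fetchone()

# The user row now points at a company that no longer exists.
assert company is None
```

Inside a single transaction at a suitable isolation level, the second query would still see the company (or the delete would wait); without one, the application has to cope with the dangling reference itself.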

Otherwise you can model anything so long as it can store bytes :)

Adam Gent
Good point - data not matching what's in the DB is something to watch out for. However, I'm not so sure "fail" is the correct term, since a result of 0 rows from the second query would be perfectly acceptable. In fact, I would rather have more of this real-time interaction than use a single query that might contain data that no longer existed by the time I got it.
Xeoncross
@Xeoncross you are correct. It was a mental/wording slip.
Adam Gent
+3  A: 

1 - running multiple separate queries leaves you with a concurrency mess - by the time you've got something from table 1 it could have been deleted, yet it might still be referenced in table 2 - now assume 5 correlated tables.

2 - running queries with at least moderately complex logic over fields other than the mythical ID

3 - controlling the amount of data fetched (you hardly ever need more than 50% of the data that is needed to deserialize/create valid objects - and even worse, whole trees of connected objects)

4 - correlated queries (nested selects), which a SQL server will optimize like joins down to additive complexity or better (|T1|+|T2|+|T3|+|T4|), while any ORM or NoSQL store has to keep repeating the inner queries, giving rise to multiplicative complexity (|T1|*|T2|*|T3|*|T4|)

5 - dataset sizes - scalability not just in dataset size but also in handling concurrency under updates. Even ORMs that maintain transactions make them so long that the chances of deadlocks increase exponentially.

6 - blind updates (a lot more data touched for no reason) and their dependency and failure based on a blind instrument (the mythical version column, which is realistically needed in maybe 1% of a relational data model, but ORMs and the like have to have it everywhere)

7 - lack of any standards and compatibility - this means that your system and data will always be at much higher risk, dependent on software changes driven by academic adventurism rather than any actual business responsibility, and you should expect to invest a lot of resources just in testing changes.

8 - data integrity - oops, some code just deleted half of today's order records from T1, since there was no foreign key to T2 to stop it. A perfectly normal thing to do with separated queries.

9 - negative maturity trend - it keeps splintering instead of standardizing - give it 20 years and maybe it will get stable
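The additive-vs-multiplicative point in 4 is visible even in a toy Python/SQLite sketch with made-up tables t1 and t2: the join answers in one statement, while the query-per-row style issues one outer query plus one inner query per outer row (the N+1 pattern):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY);
    CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (10, 1), (11, 1), (12, 2);
""")

# Join style: one statement, the server correlates the tables internally.
rows = db.execute(
    "SELECT t1.id, t2.id FROM t1 JOIN t2 ON t2.t1_id = t1.id").fetchall()

# N+1 style: one outer query plus one inner query per outer row.
n_queries = 1  # the outer SELECT
pairs = []
for (t1_id,) in db.execute("SELECT id FROM t1").fetchall():
    n_queries += 1  # one more round trip per t1 row
    for (t2_id,) in db.execute(
            "SELECT id FROM t2 WHERE t1_id = ?", (t1_id,)):
        pairs.append((t1_id, t2_id))

assert sorted(rows) == sorted(pairs)  # same result...
assert n_queries == 1 + 3             # ...but |T1| extra round trips
```

With nested correlations the repeated inner queries compound per level, which is where the multiplicative blow-up comes from.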

Last but not least - it doesn't reduce any complexity (the same correlation between the data is still there), but it makes it very hard to track and manage that complexity, or to have any realistic remedy or transparency when something goes wrong. And it adds the complexity of 1-2 extra layers. If something goes wrong in your SQL tables, you have tools and queries to discover and even fix your data. What are you going to do when some ORM just tells you it has an "invalid pointer" and throws an exception - since surely you don't want an "invalid object"?

I think that's enough :-)

ZXX
This is the best list of cons I have ever seen on this topic. You pointed out some important points that I believe have answered the question.
Xeoncross
Let's just say that I've spent 2 months recovering a codebase from an ORM tie-up :-) and while doing that I was tempted by another 2 ORM "frameworks". The perf went up through the ceiling, CPU usage went down to the floor, and it takes connection stability and pooling for granted now, which counts as a health benefit for the poor programmer :-))
ZXX
+1 Excellent answer.
Martin Smith