ansaurus

Question

Answer 1

A:

Your approach of querying the database is the best one. If you do it every 5 seconds and you have 15 concurrent users then you're looking at ~3 queries a second. It should be a very small query too, returning only one row of data. If your database can't handle 3 transactions a second then you might have to look at a better database because 3 queries/second is nothing.

Timestamp the records in the table so you can quickly see if anything has changed without having to diff each field.

stimms 2009-11-22 17:25:21

Yes, but what if i have 200+ schemas, each schemas is used by around 10 people for 8 hours at day, and every people ofter have 3 or more documents open (as browser's tabs)? (200*10) * 3 = 6000 query every 5s.. The same postgresql (or mysql) handle others application such e-commerce, and other stuff.. I want to be clear: actually, is not a problem 6000+ query, but it maybe will be if my costumer will become 300, or 400, or (hopefully) 1000+.. Im lookig to a way that as fast as this one, but dont require a db query and wont cause problems with heavy workload.

DaNieL 2009-11-22 19:47:21

Answer 2

A:

Hibernate uses a version field to do that. Give every table such a field and use a trigger to increment it on every update. When storing an update, compare the current version with the version when the data was read earlier. If those don't match, throw an exception. Use transactions to make the check-and-update atomic.

wallenborn 2009-11-22 17:26:42

I suppose, then, that you throw it back at the user when you catch that exception.

Ewan Todd 2009-11-22 17:34:21

Obviously i have the last_update field in the table, and for postgresql a trigger that update it when updating the row (mysql use another way).This is not the point, my problem is: how to check every $x seconds if the db record has been updated, but giving the db the less stress possible?

DaNieL 2009-11-22 17:59:30

Normally, i would recommend checking only on update. But this may be a functional requirement of your system: whenever the underlying data changes, inform all clients immediately. In Java, i would implement this using the publisher/subscriber pattern, where the publisher gets fed by the DAOs.I don't know how to do that in PHP, but i would at least try to 1. have a status cache in the webserver layer, 2. have the PHP database update routines notify it, 3. it that is impossible, at least limit the number of threads hitting the database.

wallenborn 2009-11-23 12:53:31

Answer 3

A:

This is slightly off topic, but you can use the PEAR package (or PECL package, I forget which) xdiff to send back good user guidance when you do get a collision.

Ewan Todd 2009-11-22 17:37:38

Answer 4

+1 A:

You will need some type of version stamp field for each record. What it is doesn't matter as long as you can guarantee that making any change to a record will result in that version stamp being different. Best practice is to then check and make sure the loaded record's version stamp is the same as the version stamp in the DB when the user clicks save, and if it's different handle it.

How you handle it is up to you. At the very least you'd want to offer to reload from the DB so the user can verify that they still want to save. One up from that would be to attempt to merge their changes into the new DB record and then ask them to verify that the merge worked correctly.

If you want to periodically poll any DB capable of handling your system should be able to take the poll load. 10 users polling once every 5 seconds is 2 transactions per second. This is a trivial load, and should be no problem at all. To keep the average load close to the actual load, just jitter the polling time slightly (instead of doing it exactly every 5 seconds, do it every 4-6 seconds, for example).

Donnie 2009-11-22 17:45:53

I agree, 2 transactions per second is nothing for a MySQL database, I think he is trying to make sure that if the system becomes very popular overnight that it does not judder, fall and stumble under it's own weight.

Mark Tomlin 2009-11-22 17:50:22

Even under a moderately heavy load this should be ok since you're doing a PK lookup and fetching one field. That's a very rapid operation, and the optimizer should be smart enough to compile it and cache the results until the row actually changes.

Donnie 2009-11-22 19:35:11

Answer 5

A:

First off only update the fields that have changed on when writing to the database, this will decrease database load.

Second, query the timestamp of the last update, if you have a older timestamp then the current version in the database then throw the warning to the client.

Third is to somehow push this information to the client, though some kind of persistent connection with the server, enabling a concurrent two way connection.

Mark Tomlin 2009-11-22 17:46:20

What ares the kind of 'persistent connection'?

DaNieL 2009-11-22 19:41:23

Answer 6

+1 A:

I agree that I probably wouldn't hit the database for this. I suppose I would use APC cache (or some other in-memory cache) to maintain this information. What you are describing is clearly optimistic locking at the detailed record level. The higher the level in the database structure the less you need to deal with. It sounds like you want to check with multiple tables within a structure.

I would maintain a cache (in APC) of the IDs and the timestamps of the last updated time keyed by the table name. So for example I might have an array of table names where each entry is keyed by ID and the actual value is the last updated timestamp. There are probably many ways to set this up with arrays or other structures but you get the idea. I would probably add a timeout to the cache so that entries in the cache are removed after a certain period of time - i.e., I wouldn't want the cache to grow and assume that 1 day old entries aren't useful anymore).

With this architecture you would need to do the following (in addition to setting up APC):

on any update to any (applicable) table, update the APC cache entry with the new timestamp.
within ajax just go as far "back" as php (to obtain the APC cache to check the entry) rather than all of the way "back" to the database.

Arthur Frankel 2009-11-22 21:05:21

This can be a solution.. i'll give it a try.

DaNieL 2009-11-23 08:24:08

Took this way, but using memcache instead of apc.

DaNieL 2009-11-30 08:17:22

Answer 7

+1 A:

Donnie's answer (polling) is probably your best option - simple and works. It'll cover almost every case (its unlikely a simple PK lookup would hurt performance, even on a very popular site).

For completeness, and if you wanted to avoid polling, you can use a push-model. There's various ways described in the Wikipedia article. If you can maintain a write-through cache (everytime you update the record, you update the cache), then you can almost completely eliminate the database load.

Don't use a timestamp "last_updated" column, though. Edits within the same second aren't unheard of. You could get away with it if you add extra information (server that did the update, remote address, port, etc) to ensure that, if two requests came in at the same second, to the same server, you could detect the difference. If you need that precision, though, you might as well use a unique revision field (it doesn't necessarily have to be an incrementing integer, just unique within that record's lifespan).

Someone mentioned persistent connections - this would reduce the setup cost of the polling queries (every connection consumes resources on the database and host machine, naturally). You would keep a single connection (or as few as possible) open all the time (or as long as possible) and use that (in combination with caching and memoization, if desired).

Finally, there are SQL statements that allow you to add a condition on UPDATE or INSERT. My SQl is really rusting, but I think its something like UPDATE ... WHERE .... To match this level of protection, you would have to do your own row locking prior to sending the update (and all the error handling and cleanup that might entail). Its unlikely you'd need this; I'm just mentioning it for completness.

Edit:

Your solution sounds fine (cache timestamps, proxy polling requests to a another server). The only change I'd make is to update the cached timestamps on every save. This will keep the cache fresher. I'd also check the timestamp directly from the db when saving to prevent a save sneaking in due to stale cache data.

If you use APC for caching, then a second HTTP server doesn't make sense - you'd have to run it on the same machine (APC uses shared memory). The same physical machine would be doing the work, but with the additional overhead of a second HTTP server. If you want to off load the polling requests to a second server (lighttpd, in your case), then it would be better to setup lightttpd in front of Apache on a second physical machine and use a shared caching server (memcache) so that the lighttpd server can read the cached timestamps, and Apache can update the cached timestamps. The rationale for putting lighttpd in front of Apache is, if most requests are polling requests, to avoid the heavier-weight Apache process usage.

You probably don't need a second server at all, really. Apache should be able to handle the additional requests. If it can't, then I'd revisit your configuration (specifically the directives that control how many worker processes you run and how many requests they are allowed to handle before being killed).

Richard Levasseur 2009-11-23 15:16:06

Nice answer, clear and complete.I'll give a look at push-model, and if the php caching (as Arthur Frankel suggested, APC) doesnt nuke hit the php performance, this can be the solution.

DaNieL 2009-11-23 16:34:50

Edited my question, please comment if still interested

DaNieL 2009-11-26 16:34:43

Ops, maybe im explained bad my 'solution': i wont take another physical server, i installed lighttpd in the same machine with apache and une memcache becose can be shared between them both!

DaNieL 2009-11-28 12:41:30

Answer 8

+2 A:

I think you can use a condition in the UPDATE statement like WHERE ID=? AND LAST_UPDATE=?.

The idea is that you will only succeed in updating when you are the last one reading that row. If someone else has committed something, you will fail, and once you know you've failed, you can query the changes.

Oso 2009-11-23 15:25:09

+1 becose is a good idea that i didnt have, but still remains the problem to notify the user that the document is changed and giving him the choice to overwrite the other edits, or reload the data.. and that's my first goal!

DaNieL 2009-11-23 16:27:51

Answer 9

A:

Hi DaNieL

Polling is rarely a nice solution.
You could do the timstamp check only when the user (with the open document) is doing something active with the document like scrolling, moving the mouse over it or starts to edit. Then the user gets an alert if the document has been changed.

.....
I know it was not what you asked for but ... why not a edit-singleton?
The singleton could be a userID column in the document-table.
If a user wants to edit the document, the document is locked for edit by other users.

Or have edit-singletons on the individual fields/groups of information.

Only one user can edit the document at a time. If another user has the document open and want to edit a single timestamp check reveal that the document has been altered and is reloaded.

With a singleton there is no polling and only one timestamp check when the user "touches" and/or wants to edit the document.

But perhaps a singleton mechanism doesn't fit your system.

Regards
Sigersted

Sigersted 2009-11-23 22:05:01

Actually, i do exactly like that: hte document page is showed without a form, the user click on a pencil icon, an ajax request change the table with a form.I could check the timestamp before showing the form, or when he press on 'Save', then alert him if timestamp is old.But i'll love to alert the user as soon as the document is been edited; To avoid that a user see (and rely on) old data:let's say the user is mading up a box to ship to the costumer: bad thing if he send the box in the old shipment address.Im really wondering if isnt better to show the document as 'readonly' when already open.

DaNieL 2009-11-24 08:14:22

Answer 10

A:

Ahhh, i though it was easyer.

So, lets make the point: i have a generic database (pgsql or mysql doesn't matter), that contains many generic objects.

I have $x (actually $x = 200, but is growing, hoping will reach 1000 soon) of exact copy of this database, and for each of them up to 20 (avg 10) users for 9 hours at day.

If one of those users is viewing a record, any record, i must advice him if someone edit the same record.

Let's say Foo is watching the document 0001, sit up for a coffee, Bar open and edit the same document, when Foo come back he must see an 'Warning, someone else edited this document! click here to refresh tha page.'.

That'all i need atm, probably i'll extend this situation, adding a way to see the changes and rollback, but this is not the point.

Some of you suggested to check the 'last update' timestamp only when foo try to save the document; Can be a solution too, but i need something in real-time ( 10 sec deelay ).

Long polling, bad way, but seem to be the only one.

So, what i've done:

Installed Lighttp on my machine (and php5 as fastcgi);
Loaded apache2's proxy module (all, or 403 error will hit you);
Changed the lighttpd port from 80 (that is used by apache2) to 81;
Configured apache2 to proxying the request from mydomain.com/polling/* to polling.mydomain.com (served with Lighttp)
Now, i have another sub http-service that i'll use both for polling and load static content (images, etc..), in order to reduce the apache2's load.
Becose i dont want to nuke the database for the timestamp check, i've tryed some caches system (that can be called from php).
1. APC: quite simple to install and manage, very lightweight and faster, this would be my first choice.. if only the cache would be sharable between two cgi process (i need to store in cache a value from apache2's php process, and read it from lighttpd's php process)
2. Memcached: around 4-5 times slower than APC, but run as a single process that can be touched everywhere in my environment. I'll go with this one, atm. (even if is slower, the use i'll do of it is relatively simple).

Now, i just have to try this system loading some test datas to see ho will move 'under pressure' and optimize it.

I suppost this environment will work for other long-polling situations (chat?)

Thanks to everyone who gave me hear!

DaNieL 2009-11-27 13:35:34

ansaurus

tags:

views:

answers:

Best way to handle concurrency issues

related questions