Ok, so the story is like this:

-- I have lots of files (pretty big, around 25GB) that are in a particular format and need to be imported into a datastore

-- these files are continuously updated with data, sometimes new, sometimes the same data

-- I am trying to figure out an algorithm to detect whether something has changed for a particular line in a file, in order to minimize the time spent updating the database

-- the way it currently works is that I drop all the data in the database each time and then reimport it, but this won't work anymore since I'll need a timestamp for when an item has changed.

-- the files contain strings and numbers (titles, orders, prices, etc.)

The only solutions I could think of are:

-- compute a hash for each row from the database, compare it against the hash of the corresponding row from the file and, if they're different, update the database (a rough sketch of this is below the list)

-- keep 2 copies of the files, the previous ones and the current ones, make diffs between them (which is probably faster than updating the db) and update the db based on those diffs.
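For the first option, a rough sketch of what I mean by hashing a row (Java here; MD5 and the line format are just for illustration):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowHash {
    // Hex MD5 digest of one line; the same function is applied to the row
    // as reconstructed from the database and to the row read from the file.
    static String hashRow(String line) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(line.getBytes(StandardCharsets.UTF_8));
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        String fileRow = "1,Some title,2,9.99";   // hypothetical line from the file
        String dbRow   = "1,Some title,2,10.99";  // the same row as currently stored in the db
        if (!hashRow(fileRow).equals(hashRow(dbRow))) {
            System.out.println("row 1 changed -> update the database");
        }
    }
}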

Since the amount of data ranges from very big to huge, I am kind of out of options for now. In the long run I'll get rid of the files and data will be pushed straight into the database, but the problem still remains.

Any advice will be appreciated.

+1  A: 

Instead of computing the hash for each row from the database on demand, why don't you store the hash value?

Then you could just compute the hash values of the rows in the file in question and compare them against the ones stored in the database.
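A rough sketch of that, assuming a hypothetical items table with an extra row_hash column, JDBC access and MySQL/PostgreSQL-style NOW() (all names here are made up for illustration):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HashedUpsert {
    // Assumed table: items(id PRIMARY KEY, data, row_hash, updated_at)
    static void upsertIfChanged(Connection conn, long id, String rowData, String rowHash) throws Exception {
        try (PreparedStatement sel = conn.prepareStatement("SELECT row_hash FROM items WHERE id = ?")) {
            sel.setLong(1, id);
            try (ResultSet rs = sel.executeQuery()) {
                if (!rs.next()) {
                    // Unknown id: insert the row together with its hash
                    try (PreparedStatement ins = conn.prepareStatement(
                            "INSERT INTO items (id, data, row_hash, updated_at) VALUES (?, ?, ?, NOW())")) {
                        ins.setLong(1, id);
                        ins.setString(2, rowData);
                        ins.setString(3, rowHash);
                        ins.executeUpdate();
                    }
                } else if (!rs.getString(1).equals(rowHash)) {
                    // Stored hash differs from the file row's hash: update and stamp the change time
                    try (PreparedStatement upd = conn.prepareStatement(
                            "UPDATE items SET data = ?, row_hash = ?, updated_at = NOW() WHERE id = ?")) {
                        upd.setString(1, rowData);
                        upd.setString(2, rowHash);
                        upd.setLong(3, id);
                        upd.executeUpdate();
                    }
                }
                // Hashes are equal: nothing changed, skip the write entirely
            }
        }
    }
}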

Update:

Another option that came to my mind is to store the Last Modified date/time information in the database and then compare it against that of the file in question. This should work, provided the information cannot be changed either intentionally or by accident.
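For example (a sketch only; the path and the stored import time are placeholders, and in practice the last import time would come from the database):

import java.nio.file.*;
import java.nio.file.attribute.FileTime;
import java.time.Instant;

public class StaleCheck {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get("/data/orders.txt");                    // hypothetical input file
        Instant lastImported = Instant.parse("2010-05-01T00:00:00Z"); // would be read from the database
        FileTime modified = Files.getLastModifiedTime(file);
        // Re-read/re-import only files touched since the last successful import
        if (modified.toInstant().isAfter(lastImported)) {
            System.out.println(file + " changed since the last import, re-processing it");
        }
    }
}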

Anax
If you're saying to compute the hash of the whole file vs. the hash of the whole database, that won't help me. But if you're saying to store the hash per row in the database, yes, that's one of the solutions I thought of. I am just wondering whether that's faster than figuring out whether the data has changed by comparing element with element.
hyperboreean
Abdel Olakara
I don't have any timestamp in that file.
hyperboreean
@hyperboreean, the filesystem should give you the timestamp, but depending on the structure of the 25GB files this might or might not be useful. Further suggestions in my answer.
Unreason
+1  A: 

Well, regardless of what you use, your worst case is going to be O(n), which on n ~ 25GB of data is not so pretty.

Unless you can modify the process that writes to the files.

Since you are not updating all of the 25GB all of the time, that is your biggest potential for saving cycles.

1. Don't write randomly
Why don't you make the process that writes the data append-only? This way you'll have more data, but you'll have the full history and you can track which data you have already processed (what you have already put in the datastore).
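For instance, with an append-only file the importer only has to pick up where it stopped; a minimal sketch (where and how the offset is persisted is up to you):

import java.io.RandomAccessFile;

public class TailImporter {
    // Process only the bytes appended since the last run; 'lastOffset' would be
    // persisted (e.g. in the datastore) after each successful import.
    static long importNewLines(String path, long lastOffset) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(lastOffset);                 // skip everything already processed
            String line;
            while ((line = raf.readLine()) != null) {
                // parse 'line' and insert/update it in the datastore here
            }
            return raf.getFilePointer();          // new offset to persist for the next run
        }
    }
}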

2. Keep a list of changes if you must write randomly
Alternatively, if you really must do random writes, you could keep a list of updated rows. This list can then be processed as in #1, and you can track which changes you have processed. If you want to save some space, you can keep a list of blocks in which the data changed (where a block is a unit that you define).

Furthermore, you can keep checksums/hashes of changed blocks/lines. However, this might not be very interesting - it is not so cheap to compute, and a direct comparison might be cheaper (if you have free CPU cycles during writing, it might save you some reading time later, YMMV).

Note(s)

  • Both #1 and #2 are interesting only if you can make adjustments to the process that writes the data to disk
  • If you cannot modify the process that writes the 25GB of data, then I don't see how checksums/hashes can help - you have to read all the data anyway to compute the hashes (since you don't know what changed), so you can directly compare while you read and come up with a list of rows to update/add, or update/add directly (see the sketch after this list)
  • Using diff algorithms might be suboptimal: a diff algorithm will not only look for the lines that changed, but also check for the minimal edit distance between two text files given certain formatting options (in diff this can be controlled with --minimal to work slower and search for the exact minimal solution, or with -H to use a faster heuristic algorithm, which IIRC becomes O(n log n); that is not bad, but still slower than the O(n) available to you if you do a direct line-by-line comparison)
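To illustrate the direct-comparison route from the second note: one pass over the file, comparing each line against what is already known (a sketch that assumes id-prefixed lines and that the previous state, or at least its per-row content, fits in a map):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FullScanCompare {
    // Single O(n) read: any line whose content differs from what we knew before
    // ends up in the list of rows to update/add in the database.
    static List<String> changedRows(String path, Map<String, String> knownRows) throws Exception {
        List<String> toUpsert = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                String id = line.substring(0, line.indexOf(','));   // assumes "id,rest..." lines
                String previous = knownRows.get(id);
                if (previous == null || !previous.equals(line)) {
                    toUpsert.add(line);                             // new or changed row
                }
            }
        }
        return toUpsert;
    }
}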
Unreason
+1  A: 

The problem definition as I understand it:

Let’s say your file contains

ID,Name,Age
1,Jim,20
2,Tim,30
3,Kim,40

As you stated, rows can be added/updated, hence the file becomes

ID,Name,Age
1,Jim,20    -- to be discarded 
2,Tim,35    -- to be updated
3,Kim,40    -- to be discarded 
4,Zim,30    --  to be inserted 

Now the requirement is to update the database by inserting/updating only the above 2 records, in two SQL queries or one batch query containing two SQL statements.

I am making the following assumptions here:

  • You cannot modify the existing process to create files.
  • You are using some batch processing [reading from file - processing in memory - writing to DB] to upload the data into the database.

Store the hash value of each record [Name,Age] against its ID in an in-memory map, where the ID is the key and the value is the hash [if you require scalability, use Hazelcast].

Your batch framework that loads the data [again assuming it treats one line of the file as one record] needs to check the computed hash value against the one stored for that ID in the in-memory map. The first-time population of the map can also be done using your batch framework for reading the files.

If (ID present)
--- compare the hashes
--- if they are the same, discard the line
--- if they differ, create an update SQL statement
If the ID is not present in the in-memory map, create an insert SQL statement and store the hash value
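A minimal sketch of that check with a plain java.util.HashMap (a Hazelcast IMap could be dropped in at the same place for scalability); the CSV layout and the use of MD5 are assumptions for illustration:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class HashCheck {
    static final Map<String, String> idToHash = new HashMap<>(); // ID -> hash of [Name,Age]

    static String hash(String nameAndAge) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                .digest(nameAndAge.getBytes(StandardCharsets.UTF_8));
        return String.format("%032x", new BigInteger(1, d));
    }

    // Decide what the batch step should do for one line like "2,Tim,35"
    static String sqlFor(String line) throws Exception {
        int comma = line.indexOf(',');
        String id = line.substring(0, comma);
        String hash = hash(line.substring(comma + 1));   // hash of "Name,Age"
        String known = idToHash.get(id);
        if (known == null) {                 // ID not present: insert and remember the hash
            idToHash.put(id, hash);
            return "INSERT ...";             // build the real insert statement here
        }
        if (known.equals(hash)) {            // same hash: discard the line
            return null;
        }
        idToHash.put(id, hash);              // different hash: update
        return "UPDATE ...";                 // build the real update statement here
    }
}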

You might go for parallel processing, chunk processing and in-memory data partitioning using Spring Batch and Hazelcast.

http://www.hazelcast.com/

http://static.springframework.org/spring-batch/

Hope this helps.

A: 

Practically, it's the kind of problem that has to be solved by backup software, so why not use some of their standard solutions? The best one would be to hook the WriteFile calls so that you'll receive callbacks on each update. This would work pretty well with binary records.
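Hooking WriteFile is a Windows-specific trick; a coarser but portable way to get a callback on each update from plain Java (not what the answer literally suggests, just a sketch of the same idea) is the NIO WatchService:

import java.nio.file.*;

public class ChangeWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/data/incoming");   // hypothetical directory holding the files
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watcher.take();        // blocks until something in the directory changes
            for (WatchEvent<?> event : key.pollEvents()) {
                // Re-process only the file that was just written to
                System.out.println("changed: " + dir.resolve((Path) event.context()));
            }
            key.reset();                          // re-arm the key for further events
        }
    }
}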

Something that I cannot understand: the files are actually text files that are not just appended to, but updated? This is highly inefficient (together with the idea of keeping 2 copies of the files, because it will make the file caching work even worse).

ruslik