views: 467

answers: 5

Hi everybody, have a good day. My problem is basically this: I need to process 37,800,000 files.

Each "file" is really more than that, what I have is:

  • 37,800,000 XML documents.
  • More than 120,000,000 TIFF images.

Each XML document references one or more TIFF images and provides a set of common keywords for the images it represents.

What I need to build is a system that parses each one of the XML files (which contain not only the keywords I need, but also a lot of garbage). For each file it needs to store the index in a database (as columns) and the paths of the images (also in the database); only the paths, because I don't think it's a good idea to also store the images themselves inside.

The final purpose is that users can search the db using the index keywords and the system loads the image or images associated with that index.

I already built the parser using XPath, and also defined the schema of the DB (which is simple). But I'm stuck with two things that make my system run very slowly and occasionally throw SQLExceptions:

I guess that, in order not to fill the PC's memory while processing the files, I need some kind of pagination code but in reverse, so that the items are sent to the DB in batches (say, packages of 1,000 documents). How to implement that is the first of my problems.

The second one is that the XML files are not named consecutively, so I need to deal with duplicates this way: when trying to index an image (or images) that already exists (by checking whether its unique key name is already in the DB), I need to compare that image's index date with the latest indexed one to see which of the duplicates must go (the system only cares about the latest index, determined by the index file's date keyword).

Does anyone have an idea of how to solve this? I'm working with Java for the parser and JSP for the image search portal, using MySQL as the database.

Thanks in advance.

This is the structure of one of the index files.

The image file name is in the "dwFileName" attribute of the "FileInfo" element. The file name of the current index document is in "DW5BasketFileName". If there are several images with this same index, there are more index files that are identical except for the extension (it starts at 001 and keeps counting).

The average size of each document is 4 KB.

<DWDocument DW5BasketFileName="DOCU0001.001">
  <FileInfos>
    <ImageInfos>
      <ImageInfo id="0,0,0" nPages="0">
        <FileInfo fileName="c:\bandejas\otra5\D0372001.DWTiff" dwFileName="D0001001.DWTiff" signedFileName="D0372001.DWTiff" type="normal" length="66732" />
      </ImageInfo>
    </ImageInfos>
  </FileInfos>
  <FileDatas />
  <Section number="0" startPage="0" dwguid="d3f269ed-e57b-4131-863f-51d147ae51a3">
    <Metadata version="0">
      <SystemProperties>
        <DocID>36919</DocID>
        <DiskNo>1</DiskNo>
        <PageCount>1</PageCount>
        <Flags>2</Flags>
        <StoreUser>DIGITAD1</StoreUser>
        <Offset>0</Offset>
        <ModificationUser>ESCANER1</ModificationUser>
        <StoreDateTime>2009-07-23T21:41:18</StoreDateTime>
        <ModificationDateTime>2009-07-24T14:36:03</ModificationDateTime>
      </SystemProperties>
      <FieldProperties>
        <TextVar length="20" field="NO__REGISTRO" id="0">10186028</TextVar>
        <TextVar length="20" field="IDENTIFICACION" id="1">85091039325</TextVar>
        <TextVar length="40" field="APELLIDOS" id="32">DYMINSKI MORALES</TextVar>
        <TextVar length="40" field="NOMBRES" id="33">JHONATAN OSCAR</TextVar>
        <Date field="FECHA_DEL_REGISTRO" id="64">1985-10-10T00:00:00</Date>
      </FieldProperties>
      <DatabaseProperties />
      <StoreProperties DocumentName="10/10/1985 12:00:00 a.m." />
    </Metadata>
    <Page number="0">
      <Rendition type="original">
        <Content id="0,0,0" pageNumberInFile="0" />
        <Annotation>
          <Layer id="1" z_order="0" dwguid="5c52b1f0-c520-4535-9957-b64aa7834264">
            <LayerLocation x="0" y="0" />
            <CreateUser>ESCANER1</CreateUser>
            <CreateTime>2009-07-24T14:37:28</CreateTime>
            <Entry dwguid="d36f8516-94ce-4454-b835-55c072b8b0c4">
              <DisplayFlags>16</DisplayFlags>
              <CreateUser>ESCANER1</CreateUser>
              <CreateTime>2009-07-24T14:37:29</CreateTime>
              <Rectangle x="6" y="0" width="1602" height="20" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
            </Entry>
            <Entry dwguid="b2381b9f-fae2-49e7-9bef-4d9cf4f15a3f">
              <DisplayFlags>16</DisplayFlags>
              <CreateUser>ESCANER1</CreateUser>
              <CreateTime>2009-07-24T14:37:31</CreateTime>
              <Rectangle x="1587" y="23" width="21" height="1823" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
            </Entry>
            <Entry dwguid="9917196d-4384-4052-8193-8379a61be387">
              <DisplayFlags>16</DisplayFlags>
              <CreateUser>ESCANER1</CreateUser>
              <CreateTime>2009-07-24T14:37:33</CreateTime>
              <Rectangle x="0" y="1836" width="1594" height="10" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
            </Entry>
            <Entry dwguid="3513e0c8-a6c9-42ec-ae9c-dc084376fcdb">
              <DisplayFlags>16</DisplayFlags>
              <CreateUser>ESCANER1</CreateUser>
              <CreateTime>2009-07-24T14:37:35</CreateTime>
              <Rectangle x="0" y="0" width="23" height="1839" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
            </Entry>
          </Layer>
          <DW4CheckSum dwCheckSum="1479972439" dwDate="131663617" dwTime="319564778" dwImageSize="66732" dwSource="0" source="" />
        </Annotation>
      </Rendition>
    </Page>
  </Section>
</DWDocument>
A: 

Well, I'd read and parse the files with multiple threads (mapping them to objects) and put those objects into a thread-safe queue that checks for duplicates, first in memory and then in the database itself, or, if you have the space, entirely in memory (I don't know how big your XMLs are). While or after doing that, you push the records into the database one after another, pausing whenever you need to check for a duplicate in the database. It's neither pretty nor fast, but considering your files are unordered... maybe the best solution at hand.
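
Something along these lines, as a rough sketch; IndexRecord, parseXml and insertOrUpdate are only placeholders standing in for your own XPath parser and JDBC code, not existing APIs:

import java.io.File;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IndexerPipeline {

    static class IndexRecord {          // stand-in for one parsed index document
        String keyName;                 // unique key name of the image(s)
        long   indexDate;               // index file date, as millis
        String imagePath;
    }

    // unique key name -> latest index date seen so far (in-memory duplicate check)
    static final ConcurrentHashMap<String, Long> latestSeen = new ConcurrentHashMap<String, Long>();

    // bounded queue: parser threads block here instead of filling the heap
    static final BlockingQueue<IndexRecord> queue = new ArrayBlockingQueue<IndexRecord>(10000);

    public static void main(String[] args) throws Exception {
        // a single writer thread drains the queue into MySQL
        Thread writer = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        insertOrUpdate(queue.take());   // blocks while the queue is empty
                    }
                } catch (InterruptedException e) { /* shutdown signal */ }
            }
        });
        writer.start();

        ExecutorService parsers = Executors.newFixedThreadPool(4);
        for (final File xml : new File("/data/indexes").listFiles()) {   // adjust the path
            parsers.submit(new Runnable() {
                public void run() {
                    IndexRecord rec = parseXml(xml);
                    // simplified check; a strict version would loop with putIfAbsent/replace
                    Long prev = latestSeen.get(rec.keyName);
                    if (prev == null || prev < rec.indexDate) {
                        latestSeen.put(rec.keyName, rec.indexDate);
                        try { queue.put(rec); } catch (InterruptedException ignored) { }
                    }
                }
            });
        }
        parsers.shutdown();   // in a real loader you would also signal the writer to stop
    }

    static IndexRecord parseXml(File f) {           // your XPath code goes here
        IndexRecord r = new IndexRecord();
        r.keyName = f.getName();
        r.indexDate = f.lastModified();
        return r;
    }

    static void insertOrUpdate(IndexRecord r) { /* your JDBC insert/update goes here */ }
}

Whether 37.8 million key names fit in the map depends on your heap; if they don't, the in-memory check becomes a cache and the lookup in the database itself is the fallback.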

Femaref
Hi Femaref. But if I do that, won't that just fill up the memory? So there's no silver bullet for this... :( Thanks for the reply.
Sheldon
+3  A: 

The first problem here, I would say, comes from disk access time. Even if your XML files are only 1 KB each, they amount to 37 GB of data, and that takes time to read. Nothing can be done to improve this.

However, you can make sure that you don't waste extra time doing other unnecessarily blocking computation.

  1. If the database is also on the same disk, the batches should be far greater than 1000; you want to hit the database as few times as your memory allows (assuming the XML files are stored consecutively on disk). A JDBC sketch of this follows the list.
  2. Make sure you release your variables as soon as possible so that the garbage collector can free the memory.
  3. You want to do the XML parsing while the computer is waiting on file reads, so you should set up another thread to run the two tasks in parallel.
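
For the batching part, a minimal JDBC sketch; the images table and its columns are made-up names, swap in your own schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInserter {

    // sends records to MySQL in large batches instead of one round trip per row
    static void insertBatch(List<String[]> records) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/images_db", "user", "password");
        conn.setAutoCommit(false);
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO images (keyname, image_path, index_date) VALUES (?, ?, ?)");
        try {
            int count = 0;
            for (String[] rec : records) {
                ps.setString(1, rec[0]);        // keyname
                ps.setString(2, rec[1]);        // image path
                ps.setString(3, rec[2]);        // index date (kept as text for simplicity)
                ps.addBatch();
                if (++count % 5000 == 0) {      // flush well above 1000 rows per round trip
                    ps.executeBatch();
                    conn.commit();
                }
            }
            ps.executeBatch();                  // flush the remainder
            conn.commit();
        } finally {
            ps.close();
            conn.close();
        }
    }
}

With Connector/J, adding rewriteBatchedStatements=true to the JDBC URL usually lets the driver fold a batch into multi-row INSERT statements, which helps a lot for loads like this.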

As for your second problem, you could run, for each image, an UPDATE statement against the rows with the same index and, if no rows are updated, insert the image as a new row. Whether this will perform better than a SELECT followed by an INSERT/UPDATE depends on the percentage of duplicates you have.
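
A sketch of that update-or-insert, again with made-up table and column names, relying on executeUpdate() returning the number of affected rows; it assumes a UNIQUE index on keyname, so the INSERT IGNORE is a no-op when a newer row for the same key already exists:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ImageUpsert {

    static void upsert(Connection conn, String keyName, String path, String indexDate)
            throws SQLException {
        PreparedStatement update = conn.prepareStatement(
                "UPDATE images SET image_path = ?, index_date = ? " +
                "WHERE keyname = ? AND index_date < ?");    // only overwrite older entries
        update.setString(1, path);
        update.setString(2, indexDate);
        update.setString(3, keyName);
        update.setString(4, indexDate);
        int updated = update.executeUpdate();
        update.close();

        if (updated == 0) {     // no older row: insert, or skip if a newer one already exists
            PreparedStatement insert = conn.prepareStatement(
                    "INSERT IGNORE INTO images (keyname, image_path, index_date) VALUES (?, ?, ?)");
            insert.setString(1, keyName);
            insert.setString(2, path);
            insert.setString(3, indexDate);
            insert.executeUpdate();
            insert.close();
        }
    }
}

MySQL can also collapse the two statements into a single INSERT ... ON DUPLICATE KEY UPDATE (see Enrico's comment below).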

I'm going out on a limb and assuming the XML files are not being created faster than you can process them. If that's the case, all you need to do is save the names of the files that have already been processed, either to the database or to a flat file, and read them back the next time the loader starts, making sure you don't have to start all over again.
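
A small sketch of that restart log as a flat file (ProgressLog and the file name are made up, not an existing API):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class ProgressLog {

    private final File log;
    private final Set<String> done = new HashSet<String>();

    ProgressLog(File log) throws IOException {
        this.log = log;
        if (log.exists()) {                         // reload names processed in earlier runs
            BufferedReader in = new BufferedReader(new FileReader(log));
            String line;
            while ((line = in.readLine()) != null) {
                done.add(line.trim());
            }
            in.close();
        }
    }

    boolean alreadyProcessed(String fileName) {
        return done.contains(fileName);
    }

    void markProcessed(String fileName) throws IOException {
        FileWriter out = new FileWriter(log, true); // append mode
        out.write(fileName);
        out.write('\n');
        out.close();
        done.add(fileName);
    }
}

In practice you would keep a single writer open for the whole run instead of reopening the file for each of the 37.8 million documents.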

Vasco Fernandes
Hi Vasco, thanks for your answer. I didn't even consider the point about disk access. Well, the indexer app will be installed on the same machine, but the DB exclusively uses 2 VelociRaptors mounted in RAID 5. I will definitely implement the algorithm using multithreading. I don't know exactly how to do the third point, though. Is it just a matter of starting a new thread, or do I need another technique to overlap it with disk access? I'm just starting out with Java and multithreading, I came from Ruby... :S Also, the files already exist and are stored on that server. I don't know if that changes anything. Thanks!
Sheldon
About duplicates: MySQL has the useful INSERT ... ON DUPLICATE KEY UPDATE statement (http://dev.mysql.com/doc/refman/5.1/en/insert-on-duplicate.html), which handles updating and inserting in a single instruction! Just discovered it in one of my StackOverflow questions: http://stackoverflow.com/questions/1492761/performing-an-update-or-insert-depending-whether-a-row-exists-or-not-in-mysql :-)
Enrico Detoma
If you were using C# you'd have two options; the simpler one would be reading the files asynchronously (maybe you can find a non-native implementation in Java?). The harder way is to create a separate thread; as soon as you create the thread you get parallel disk access for free. The hard part, and that's what you have to worry about, is sharing information between the threads; for that you must brush up on locks, synchronized methods and monitors.
Vasco Fernandes
+2  A: 

For indexing I suggest you use Solr, which is extremely fast at indexing a large number of documents. Solr also has a dedicated class, StreamingUpdateSolrServer, for updating the index using multiple threads and batched commits.

Solr is written in Java and based on the fast full-text search engine library Lucene. It has a simple Java API through which you can submit your documents to the server for indexing.

Here's a discussion about the new StreamingUpdateSolrServer class for indexing documents in large batches.

StreamingUpdateSolrServer is available in the new Solr 1.4 release, which should be out in a few days (I've been working with a nightly 1.4 Solr build since June and I find it pretty stable already).
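
To give an idea of the SolrJ side, a minimal sketch; the Solr URL and the field names are my own assumptions and would have to be declared in your schema.xml:

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexer {

    public static void main(String[] args) throws Exception {
        // queue of 1000 documents, 4 background threads pushing batches to Solr
        StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);

        // one document per index file, taking values from the XML shown in the question
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "DOCU0001.001");               // DW5BasketFileName
        doc.addField("image_path", "D0001001.DWTiff");    // dwFileName
        doc.addField("no_registro", "10186028");
        doc.addField("apellidos", "DYMINSKI MORALES");
        doc.addField("nombres", "JHONATAN OSCAR");
        server.add(doc);

        // ... add the remaining documents in a loop, then commit once at the end
        server.commit();
    }
}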

Enrico Detoma
Thanks Enrico, I will give it a try.
Sheldon
A: 

Just an addition: I would strongly recommend storing the images in the database, too. If you already start with 120 million image files, you will soon be at a point where even modern, state-of-the-art filesystems hit their boundaries.

And you will likely have to get rid of MySQL and get a real DB (DB2, Oracle or SQL Server).

Foxfire
Hi Foxfire. Can you explain in more detail why to put the images inside the DB? Also, one thing that is very important: is MySQL a bad DBMS for large documents?
Sheldon
I do not advise storing images inside the DB. The file system is the right tool for the job, of course with the whole corpus of images split across subdirectories. Storing 120M images is too much unnecessary burden for the database. I do not even advise using a relational DB for storing a 37M-record index, but that's another story, as I explained in my answer :-)
Enrico Detoma
Just the MFT/inode table will likely be about 150 GB in size (that is, without ANY file actually having any content). Next thing: even NTFS and ext4 have a hard limit of 4 billion files, so you won't be able to have more files even if backed by a big SAN (or you need multiple filesystems, and then much fun organising your files). 120 million files will likely take about a day (if you have a decent system) just to open and close them all.
Foxfire
I had a project where we had a similar problem (image files for a digital geo map). And although we had FAR fewer files (not even 1 million), we already had HUGE performance problems storing them in a filesystem, so we switched to the DB approach, which solved most of the problems.
Foxfire
Which DB did you use to store such a large number of images? I always thought that storing images in a DB, which in turn uses the filesystem, was inherently a bad idea, but your comments are very interesting.
Enrico Detoma
The Database was DB2
Foxfire
Thank you very much. I'm actually doing this following Foxfire's recommendations. Another plus I've seen is that storing, searching and retrieving the images benefits from the DB's file management. What I'm trying to do now is convince my client to buy Oracle. I made a test with just 20 GB of data and it performs very fast on Oracle compared to MySQL and SQL Server; I haven't tested DB2.
Sheldon
+1  A: 

I suspect you are loading this data in one giant DB transaction. Your individual data files look like nothing challenging. But the number of entries you are making into this database is 120M x the average number of keywords per image. A naive loader program does this:

start transaction
for each index file
    parse file
    for each keyword
        insert (keyword,imagename) into db
commit transaction

This forces your database to buffer the entire data load in a journal file, in case a rollback is required. Conversely, if you have enabled auto-commit, then you are doing a separate transaction for each keyword, which can also be suboptimal, since it is doing concurrency lock/unlock several hundred million times.

A middle ground is to commit every 'n' inserts, say 10,000:

inserts = 0
reset = 10000
start transaction
for each index file
    parse file
    for each keyword
        insert (keyword,imagename) into db
        inserts += 1
        if inserts % reset == 0
            commit transaction
            start transaction
commit transaction
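
In JDBC that middle ground looks roughly like this; the keywords table and the parseKeywords helper are placeholders for your own schema and XPath code:

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;

public class KeywordLoader {

    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/images_db", "user", "password");
        conn.setAutoCommit(false);                      // we control the transaction boundaries
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO keywords (keyword, imagename) VALUES (?, ?)");

        int inserts = 0;
        final int reset = 10000;
        for (File indexFile : new File("/data/indexes").listFiles()) {
            for (String[] kw : parseKeywords(indexFile)) {
                ps.setString(1, kw[0]);                 // keyword
                ps.setString(2, kw[1]);                 // imagename
                ps.executeUpdate();
                if (++inserts % reset == 0) {
                    conn.commit();                      // keeps the journal small
                }
            }
        }
        conn.commit();                                  // commit the tail end
        ps.close();
        conn.close();
    }

    static List<String[]> parseKeywords(File f) {       // stub standing in for the real parser
        return new ArrayList<String[]>();
    }
}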

Ideally, a loader program is also restartable - this would require a SELECT query for each index file to see if its image exists in your database before inserting the keywords.

inserts = 0
reset = 10000
start transaction
for each index file
    parse file
    if "SELECT count(*) from IMAGES where name='<insert imagename>'" == 0
        for each keyword
            insert (keyword,imagename) into db
            inserts += 1
            if inserts % reset == 0
                commit transaction
                start transaction
commit transaction

But in your case, this may be prohibitively expensive. As an alternative, you could look into what kind of import facility your database supports, and use your program to create one or more text files, probably nothing more complex than a list of comma-delimited keyword,imagename pairs. Then the loader should have options like commit interval or exclusive locking to minimize the lock and journal overhead.
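
For MySQL that import facility is LOAD DATA INFILE; a sketch of driving it from Java, with made-up file, table and column names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoad {

    public static void main(String[] args) throws Exception {
        // keywords.csv is produced by your parser: one "keyword","imagename" pair per line
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/images_db", "user", "password");
        Statement st = conn.createStatement();
        st.execute(
                "LOAD DATA LOCAL INFILE 'keywords.csv' " +
                "INTO TABLE keywords " +
                "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' " +
                "LINES TERMINATED BY '\\n' " +
                "(keyword, imagename)");
        st.close();
        conn.close();
    }
}

Depending on your MySQL configuration you may need to enable local-infile support on both the server and the JDBC connection for the LOCAL variant to work.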

Lastly, leave off the index on the KEYWORDS table until after the data has been loaded. During your load, this index does not help in any way, but you are updating it with every keyword,imagename insert. Adding this index after the table is loaded will take some time, but you will only pay this penalty once at the end, instead of multiple times during the load. (If you do the SELECT code in the restartable option, then do have the IMAGES table index during the load, else each SELECT would have to do a full table scan.)
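
Dropping and re-adding that index around the load is plain DDL; the index and table names below are assumptions:

import java.sql.Connection;
import java.sql.Statement;

public class IndexToggle {

    // drop the keyword index before the bulk load, rebuild it once afterwards
    static void loadWithoutIndex(Connection conn, Runnable load) throws Exception {
        Statement st = conn.createStatement();
        st.execute("ALTER TABLE keywords DROP INDEX idx_keyword");
        try {
            load.run();                     // the actual bulk load
        } finally {
            st.execute("ALTER TABLE keywords ADD INDEX idx_keyword (keyword)");
            st.close();
        }
    }
}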

Paul McGuire
Thank you very much! I'm almost done with that
Sheldon