Hi everibody, Have a good day. So my problem is basically this, I need to process 37.800.000 files.
Each "file" is really more than that, what I have is:
- 37.800.000 XML documents.
- More than 120.000.000 of Tiff images.
Each of the XML documents reference one or more Tiff images and provides a set of common keywords for the images it represent.
What I need to build is a system that parse each one of the XML files (wich not only have ther keywords I need, but a lot of garbage). For each of the files it needs to store the index on a database (as columns) and the path of the images (also in the database), the path only because I don't think is a good idea to store also the images inside.
The final purpose is that users can search the db using the index keywords and the system loads the image or images associated with that index.
I already build the parser using XPath, and also define the schema of the db (wich is simple). But I'm stucked with two things, that causes my system to work very slow and ocassionally throws SQLExceptions:
I guess that, in order to don't full the pc memory while processing files I need a kind of pagination code but inverse, in order to send the corresponding the items to the db (as, say, packages every 1000 documents), so, how to implement that is the first of my problems.
Second one is that the XML files are not consecutive named, so I need to deal with duplicates like this way: when trying to index and existing image or images (By looking if its unique keyname is also in the db), i need to compare that image index date, with the latest indexed image to see wich of duplicates must go ( system's only matter about the latest index, by looking on the index file date keyword).
Anyone have an idea of how to solve this? I'm working with Java for the parser and JSP for the images search portal, also using MySQL.
Thank's in advance.
This is the structure of one of the Index file.
The Image file is inside the "dwFileName" attribute of the "FileInfo" element. The file name of the current index document is "DW5BasketFileName". If there are several images with this same index, there are more index files that are equals except for the extension (it starts with 001 and keep counting.
The average size of every document is 4KB.
<DWDocument DW5BasketFileName="DOCU0001.001">
<FileInfos>
<ImageInfos>
<ImageInfo id="0,0,0" nPages="0">
<FileInfo fileName="c:\bandejas\otra5\D0372001.DWTiff" dwFileName="D0001001.DWTiff" signedFileName="D0372001.DWTiff" type="normal" length="66732" />
</ImageInfo>
</ImageInfos>
</FileInfos>
<FileDatas />
<Section number="0" startPage="0" dwguid="d3f269ed-e57b-4131-863f-51d147ae51a3">
<Metadata version="0">
<SystemProperties>
<DocID>36919</DocID>
<DiskNo>1</DiskNo>
<PageCount>1</PageCount>
<Flags>2</Flags>
<StoreUser>DIGITAD1</StoreUser>
<Offset>0</Offset>
<ModificationUser>ESCANER1</ModificationUser>
<StoreDateTime>2009-07-23T21:41:18</StoreDateTime>
<ModificationDateTime>2009-07-24T14:36:03</ModificationDateTime>
</SystemProperties>
<FieldProperties>
<TextVar length="20" field="NO__REGISTRO" id="0">10186028</TextVar>
<TextVar length="20" field="IDENTIFICACION" id="1">85091039325</TextVar>
<TextVar length="40" field="APELLIDOS" id="32">DYMINSKI MORALES</TextVar>
<TextVar length="40" field="NOMBRES" id="33">JHONATAN OSCAR</TextVar>
<Date field="FECHA_DEL_REGISTRO" id="64">1985-10-10T00:00:00</Date>
</FieldProperties>
<DatabaseProperties />
<StoreProperties DocumentName="10/10/1985 12:00:00 a.m." />
</Metadata>
<Page number="0">
<Rendition type="original">
<Content id="0,0,0" pageNumberInFile="0" />
<Annotation>
<Layer id="1" z_order="0" dwguid="5c52b1f0-c520-4535-9957-b64aa7834264">
<LayerLocation x="0" y="0" />
<CreateUser>ESCANER1</CreateUser>
<CreateTime>2009-07-24T14:37:28</CreateTime>
<Entry dwguid="d36f8516-94ce-4454-b835-55c072b8b0c4">
<DisplayFlags>16</DisplayFlags>
<CreateUser>ESCANER1</CreateUser>
<CreateTime>2009-07-24T14:37:29</CreateTime>
<Rectangle x="6" y="0" width="1602" height="20" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
</Entry>
<Entry dwguid="b2381b9f-fae2-49e7-9bef-4d9cf4f15a3f">
<DisplayFlags>16</DisplayFlags>
<CreateUser>ESCANER1</CreateUser>
<CreateTime>2009-07-24T14:37:31</CreateTime>
<Rectangle x="1587" y="23" width="21" height="1823" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
</Entry>
<Entry dwguid="9917196d-4384-4052-8193-8379a61be387">
<DisplayFlags>16</DisplayFlags>
<CreateUser>ESCANER1</CreateUser>
<CreateTime>2009-07-24T14:37:33</CreateTime>
<Rectangle x="0" y="1836" width="1594" height="10" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
</Entry>
<Entry dwguid="3513e0c8-a6c9-42ec-ae9c-dc084376fcdb">
<DisplayFlags>16</DisplayFlags>
<CreateUser>ESCANER1</CreateUser>
<CreateTime>2009-07-24T14:37:35</CreateTime>
<Rectangle x="0" y="0" width="23" height="1839" flags="0" size="10" color="#ffffff" bkgcolor="#000000" />
</Entry>
</Layer>
<DW4CheckSum dwCheckSum="1479972439" dwDate="131663617" dwTime="319564778" dwImageSize="66732" dwSource="0" source="" />
</Annotation>
</Rendition>
</Page>
</Section>
</DWDocument>