Hello,

I have written an importer which copies data from a flat table into several other tables, mapping them according to a given XML file. This is for a shop database, where each product can have several properties and each property can exist in several languages, so the data quickly adds up to quite a lot.

There are over 50,000 rows as of right now. My current import code looks like this:

string query = "SELECT * FROM " + tableDataProducts + " ORDER BY "
            + productIdField;

        DataSet importData = new DataSet();
        Hashtable data = new Hashtable();

        db.DoSelectQuery(query, ref importData, tableDataProducts);

        foreach (DataRow row in importData.Tables[0].Rows) {
            foreach (MapEntry e in mapping[tableObjPropertyValue]) {
                string value = row[e.ImportXmlAttributeName].ToString();

                if (value.Equals("null",
                            StringComparison.OrdinalIgnoreCase)
                        || value.Length < 1)
                    continue;

                data.Clear();

                data.Add("ProductSN", productIdToSn[row[
                    productIdField].ToString()]);
                data.Add("ObjPropertyGroupID", "0");
                data.Add("ObjPropertyID", e.ObjPropertyID);
                data.Add("LanguageID", e.LanguageID);
                data.Add("Value", value);

                db.DoPreparedInsertQuery(tableObjPropertyValue, data);
            }
        }

As can be seen, I first read the data from the flat import table. I then iterate over each row, which represents a single product, and for each product I iterate over the property mapping, copying each property into a Hashtable called data. Null and empty values are skipped.

After all columns are copied into the hashtable, I insert the row.

Currently, this approach only processes around 700 rows per minute, which results in this import taking approximately one hour. How can I optimize this?

[EDIT]

Here is a simplified version of the XML, as the actual XML is way too big to show here:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<DATAPACKET Version="2.0">
<METADATA>  
<FIELDS>
   <FIELD FieldName="source_id" DisplayLabel="source_id" FieldType="String" FieldClass="TField"/>
   <FIELD FieldName="data_field" DisplayLabel="data_field" FieldType="Unknown" FieldClass="TField"/>
</FIELDS>
</METADATA>
<ROWDATA>
   <ROW source_id="data_1" data_field="some string"/>
   <ROW source_id="data_2" data_field="another string"/>
</ROWDATA>
</DATAPACKET>

This XML is imported into a single table, with each FIELD becoming a column. There is a mapping XML which looks as follows:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<DATAPACKET Version="2.0">
<METADATA>  
<FIELDS>
   <FIELD FieldName="source_id" DisplayLabel="source_id" FieldType="String" FieldClass="TField"/>
   <FIELD FieldName="target" DisplayLabel="target" FieldType="Unknown" FieldClass="TField"/>
</FIELDS>
</METADATA>
<ROWDATA>
   <ROW source_id="data_1" target="products::id"/>
   <ROW source_id="data_2" target="products::name"/>
</ROWDATA>
</DATAPACKET>

The target attribute contains the target table and column in the following format: target='table::column'.
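For illustration, splitting such a target into its parts could look like this (a simplified sketch; the helper name is made up and not part of my actual code):

// Hypothetical helper: splits a mapping target such as "products::name"
// into its table and column parts.
static void ParseTarget(string target, out string table, out string column)
{
    string[] parts = target.Split(new[] { "::" }, StringSplitOptions.None);

    if (parts.Length != 2)
        throw new FormatException("Expected 'table::column', got: " + target);

    table = parts[0];   // e.g. "products"
    column = parts[1];  // e.g. "name"
}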

A: 

Bulk operations are lightning fast in SQL. If you can translate your XML document to a series of SQL queries, that could dramatically improve performance.
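For example, the mapping XML from the question could be translated into one INSERT ... SELECT per target table along these lines (a sketch only; the element names follow the XML above, while the method name, flatTable parameter, and use of LINQ to XML are my own assumptions):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Build one "INSERT ... SELECT" statement per target table,
// so the copying runs entirely inside the database.
static IEnumerable<string> BuildInsertQueries(string mappingXmlPath, string flatTable)
{
    XDocument doc = XDocument.Load(mappingXmlPath);

    // Each ROW maps a flat-table column (source_id) to "table::column".
    var mappings = doc.Descendants("ROW")
        .Select(r => new {
            Source = (string)r.Attribute("source_id"),
            Target = ((string)r.Attribute("target"))
                .Split(new[] { "::" }, StringSplitOptions.None)
        })
        .GroupBy(m => m.Target[0]); // group by target table

    foreach (var table in mappings)
    {
        string columns = string.Join(", ",
            table.Select(m => m.Target[1]).ToArray());
        string sources = string.Join(", ",
            table.Select(m => m.Source).ToArray());

        yield return "INSERT INTO " + table.Key + " (" + columns + ")"
            + " SELECT " + sources + " FROM " + flatTable;
    }
}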

Edit: I don't grok what you're trying to do, but it seems to me you start with a flat table, and end with a bunch of other tables. Why not do it like this:

insert into Product
(id, name)
select source_id, data_field
from FlatTable

This is pretty fast, at the cost of being less flexible than XML mappings.

Andomar
I've added a simplified XML example.
Michael Barth
Could you add a simplified row of the "flat" table? Is there a special meaning to the "data_1" labels, for example does "data_1" correspond to "source_id" because source_id is the first field?
Andomar
The flat table has one row for every field of the first XML (in this case, it has "source_id" and "data_field" columns). The source_id is used by the mapping to identify the column in the flat table; that column's "data_field" value is then copied over into the target (which holds the table and column info for the mapping).
Michael Barth
Alas, that's not possible, since there are some dependencies (e.g. ObjPropertyValue needs an ObjPropertyID, which is generated automatically while filling the table ObjProperty; my program relates those) and other values need to be modified (some IDs are generated by concatenating other IDs) or detected and inserted while importing (the MimeType based on the file extension).
Michael Barth
A: 

Okay, two things: First, I've changed the insert-per-row approach to caching around 1,000 rows and inserting them with a single MySQL INSERT (see multiple inserts).
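The batching looks roughly like this (a simplified sketch, not my actual code; the execute step at the end stands in for the database wrapper):

using System;
using System.Collections.Generic;
using System.Text;

// Sketch of the batching idea, using MySQL's multi-row VALUES syntax:
// INSERT INTO t (a, b) VALUES (...), (...), ...
static void InsertBatched(string table, string[] columns,
                          List<string[]> rows, int batchSize)
{
    for (int offset = 0; offset < rows.Count; offset += batchSize)
    {
        StringBuilder sql = new StringBuilder();
        sql.Append("INSERT INTO ").Append(table)
           .Append(" (").Append(string.Join(", ", columns))
           .Append(") VALUES ");

        int end = Math.Min(offset + batchSize, rows.Count);
        for (int i = offset; i < end; i++)
        {
            if (i > offset)
                sql.Append(", ");

            // NOTE: values are spliced in directly for brevity; real code
            // should escape or parameterize them.
            sql.Append("('").Append(string.Join("', '", rows[i])).Append("')");
        }

        // Execute sql.ToString() via the database wrapper here:
        // one round trip per batch instead of one per row.
    }
}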

Second, and probably most important: I had lots of duplicates per product, which accumulated into a big, bloody mess taking an hour to import. After eliminating those duplicates before the import, I'm down to 10 seconds for the same action...

One should check the result of one's SELECTs for duplicates before importing them. In this case I wanted to select every product exactly once, but I actually selected each product with every language version (meaning I got four versions of each product which are basically the same, just in another language).
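A cheap guard against this kind of duplication, assuming the product ID uniquely identifies a product, is to track the IDs already seen (a sketch reusing importData and productIdField from the question, not my actual fix):

// Skip rows whose product ID was already processed, e.g. when the
// same product shows up once per language version.
// Requires System.Collections.Generic and System.Data.
HashSet<string> seen = new HashSet<string>();

foreach (DataRow row in importData.Tables[0].Rows)
{
    string id = row[productIdField].ToString();

    if (!seen.Add(id))
        continue; // duplicate product, already imported

    // ... existing per-row import code ...
}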

Michael Barth