views:

15

answers:

2

I have a program with a function that retrieves a relatively short text string from one SQL database table ("html"), runs a number of regular expression queries on it, and then stores the output in another database table ("profiles"). "Html" has about 8 million records and "profiles" stores only the results. It is currently very slow - well over 1 second per record on a fast server. I'm hoping someone might be able to point out something obvious that I would be able to fix:

dataComm dc = new dataComm(@"SERVER1", "html");
dataComm dc_bp = new dataComm(@"SERVER1", "profiles");

Int32 numinserted = 0;

for (int selectc = 0; selectc < 85000; selectc++)
{
  DataTable dt = dc.fetchData("SELECT top 100 * from html where processed=0");

  label3.Text = "Iteration " + selectc + " of 85,000";

  string sq="";
  string squpdate = "";

  Int32 thisid=0;

  for (int i = 0; i < dt.Rows.Count; i++)
  {                    
    // CODE THAT RUNS THE REGULAR EXPRESSIONS.
    thisid = Convert.ToInt32(dt.Rows[i]["ID"]);

    sq += @"INSERT INTO profiles (field1, field2, [etc.]) VALUES ('value1','value2', [etc.]); ";
    numinserted++;

    squpdate += "UPDATE html SET processed=1 WHERE ID=" + thisid.ToString() + "; ";
  }

  dc.executeNonQuery(squpdate);

}
A: 

If you have SQL 2005+, you can build some CLR to execute regular expressions in SQL. Very handy. Rather fast. Then, you can handle everything in SQL and push your code towards set-based operation rather than looping.

http://msdn.microsoft.com/en-us/magazine/cc163473.aspx

Brad
A: 

The answer might be simply not running regular expressions on that HTML but parse it in some other way, e.g. with the HTML agility pack. Parsing HTML with regex is "bad", doing a lot of it is very "bad". (It will depend on the regex, but they can be written so that they are really inefficient for this task.)

steinar