views:

708

answers:

3

My program is now still running to import data from a log file into a remote SQL Server Database. The log file is about 80MB in size and contains about 470000 lines, with about 25000 lines of data. My program can import only 300 rows/second, which is really bad. :(

public static int ImportData(string strPath)
{
    //NameValueCollection collection = ConfigurationManager.AppSettings;

    using (TextReader sr = new StreamReader(strPath))
    {
        sr.ReadLine(); //ignore three first lines of log file
        sr.ReadLine();
        sr.ReadLine();
        string strLine;
        var cn = new SqlConnection(ConnectionString);
        cn.Open();

        while ((strLine = sr.ReadLine()) != null)
        {
            {
                if (strLine.Trim() != "") //if not a blank line, then import into database
                {
                    InsertData(strLine, cn);
                    _count++;
                }
            }
        }
        cn.Close();
        sr.Close();

        return _count;
    }
}

InsertData is just a normal insert method using ADO.NET. It uses a parsing method:

public Data(string strLine)
{
    string[] list = strLine.Split(new[] {'\t'});
    try
    {
        Senttime = DateTime.Parse(list[0] + " " + list[1]);
    }
    catch (Exception)
    {
    }

    Clientip = list[2];
    Clienthostname = list[3];

    Partnername = list[4];
    Serverhostname = list[5];
    Serverip = list[6];

    Recipientaddress = list[7];
    Eventid = Convert.ToInt16(list[8]);
    Msgid = list[9];
    Priority = Convert.ToInt16(list[10]);
    Recipientreportstatus = Convert.ToByte(list[11]);
    Totalbytes = Convert.ToInt32(list[12]);
    Numberrecipient = Convert.ToInt16(list[13]);
    DateTime temp;
    if (DateTime.TryParse(list[14], out temp))
    {
        OriginationTime = temp;
    }
    else
    {
        OriginationTime = null;
    }
    Encryption = list[15];
    ServiceVersion = list[16];
    LinkedMsgid = list[17];
    MessageSubject = list[18];
    SenderAddress = list[19];
}

InsertData method:

private static void InsertData(string strLine, SqlConnection cn)
{
    var dt = new Data(strLine); //parse the log line into proper fields 
    const string cnnStr =
        "INSERT INTO LOGDATA ([SentTime]," + "[client-ip]," +
        "[Client-hostname]," + "[Partner-Name]," + "[Server-hostname]," +
        "[server-IP]," + "[Recipient-Address]," + "[Event-ID]," + "[MSGID]," +
        "[Priority]," + "[Recipient-Report-Status]," + "[total-bytes]," +
        "[Number-Recipients]," + "[Origination-Time]," + "[Encryption]," +
        "[service-Version]," + "[Linked-MSGID]," + "[Message-Subject]," +
        "[Sender-Address]) " + " VALUES (     " + "@Senttime," + "@Clientip," +
        "@Clienthostname," + "@Partnername," + "@Serverhostname," + "@Serverip," +
        "@Recipientaddress," + "@Eventid," + "@Msgid," + "@Priority," +
        "@Recipientreportstatus," + "@Totalbytes," + "@Numberrecipient," +
        "@OriginationTime," + "@Encryption," + "@ServiceVersion," +
        "@LinkedMsgid," + "@MessageSubject," + "@SenderAddress)";


    var cmd = new SqlCommand(cnnStr, cn) {CommandType = CommandType.Text};

    cmd.Parameters.AddWithValue("@Senttime", dt.Senttime);
    cmd.Parameters.AddWithValue("@Clientip", dt.Clientip);
    cmd.Parameters.AddWithValue("@Clienthostname", dt.Clienthostname);
    cmd.Parameters.AddWithValue("@Partnername", dt.Partnername);
    cmd.Parameters.AddWithValue("@Serverhostname", dt.Serverhostname);
    cmd.Parameters.AddWithValue("@Serverip", dt.Serverip);
    cmd.Parameters.AddWithValue("@Recipientaddress", dt.Recipientaddress);
    cmd.Parameters.AddWithValue("@Eventid", dt.Eventid);
    cmd.Parameters.AddWithValue("@Msgid", dt.Msgid);
    cmd.Parameters.AddWithValue("@Priority", dt.Priority);
    cmd.Parameters.AddWithValue("@Recipientreportstatus", dt.Recipientreportstatus);
    cmd.Parameters.AddWithValue("@Totalbytes", dt.Totalbytes);
    cmd.Parameters.AddWithValue("@Numberrecipient", dt.Numberrecipient);
    if (dt.OriginationTime != null)
        cmd.Parameters.AddWithValue("@OriginationTime", dt.OriginationTime);
    else
        cmd.Parameters.AddWithValue("@OriginationTime", DBNull.Value);
            //if OriginationTime was null, then insert with null value to this column
    cmd.Parameters.AddWithValue("@Encryption", dt.Encryption);
    cmd.Parameters.AddWithValue("@ServiceVersion", dt.ServiceVersion);
    cmd.Parameters.AddWithValue("@LinkedMsgid", dt.LinkedMsgid);
    cmd.Parameters.AddWithValue("@MessageSubject", dt.MessageSubject);
    cmd.Parameters.AddWithValue("@SenderAddress", dt.SenderAddress);
    cmd.ExecuteNonQuery();
}

How can my program run faster? Thank you so much!

+11  A: 

Use SqlBulkCopy.

Edit: I created a minimal implementation of IDataReader and created a Batch type so that I could insert arbitrary in-memory data using SqlBulkCopy. Here is the important bit:

IDataReader dr = batch.GetDataReader();
using (SqlTransaction tx = _connection.BeginTransaction())
{
    try
    {
        using (SqlBulkCopy sqlBulkCopy =
            new SqlBulkCopy(_connection, SqlBulkCopyOptions.Default, tx))
        {
            sqlBulkCopy.DestinationTableName = TableName;
            SetColumnMappings(sqlBulkCopy.ColumnMappings);
            sqlBulkCopy.WriteToServer(dr);
            tx.Commit();
        }
    }
    catch
    {
        tx.Rollback();
        throw;
    }
}

The rest of the implementation is left as an exercise for the reader :)

Hint: the only bits of IDataReader you need to implement are Read, GetValue and FieldCount.

Matt Howells
SQLBulkCopy is the way to go. I used to use bcp back in SQL 6.5/7.0 days to import data from CSVs and found it to be blisteringly quick. SqlBulkCopy is essentially the same functionality exposed to managed code.
davewasthere
my log file contains 3 three lines of header, and, it needs 2 fields to express datetime. I have to combine them to convert to a Datetime value. How can I do that?
Vimvq1987
Convert your file to a format suitable for SqlBulkCopy as a first step (read line from old file, write line to new file) and then use SqlBulkCopy.
marc_s
You can do whatever parsing you are doing already to get the data into an `object[]` which represents the row of data that needs to be inserted into the table. Then use `SqlBulkCopy` instead of an insert statement.
Matt Howells
+4  A: 

Hmmm, let's break this down a little bit.

In pseudocode what you did is the ff:

  1. Open the file
  2. Open a connection
  3. For every line that has data:
    • Parse the string
    • Save the data in SQL Server
  4. Close the connection
  5. Close the file

Now the fundamental problems in doing it this way are:

  • You are keeping a SQL connection open while waiting for your line parsing (pretty susceptible to timeouts and stuff)
  • You might be saving the data line by line, each in its own transaction. We won't know until you show us what the InsertData method is doing
  • Consequently you are keeping the file open while waiting for SQL to finish inserting

The optimal way of doing this is to parse the file as a whole, and then insert them in bulk. You can do this with SqlBulkCopy (as suggested by Matt Howells), or with SQL Server Integration Services.

If you want to stick with ADO.NET, you can pool together your INSERT statements and then pass them off into one large SQLCommand, instead of doing it this way e.g., setting up one SQLCommand object per insert statement.

Jon Limjap
+2  A: 

You create the SqlCommand object for every row of data. The simplest improvement would therefore to create a

private static SqlCommand cmdInsert

and declare the parameters with the Parameters.Add() method. Then for each data row, set the parameter values using

cmdInsert.Parameters["@paramXXX"].Value = valueXXX;

A second performance improvement might be to skip creation of Data objects for each row, and assign Parameter values directly from the list[] array.

devio
Thank you. I will try! :)
Vimvq1987