I have text files with a lot of uniform rows that I'd like to load into a MySQL database, but the files are not completely uniform. There are several lines at the beginning containing miscellaneous information, and there are timestamps about every 6 lines.

"LOAD DATA INFILE" doesn't seem like the answer here because of my file format. It doesn't seem flexible enough.

Note: The header of the file takes up a pre-determined number of lines. The timestamp is predictable, but there are some other random notes that can pop up that need to be ignored. They always start with several keywords that I can check for, though.

A sample of my file in the middle:

  103.3     .00035
  103.4     .00035
  103.5     .00035
  103.6     .00035
  103.7     .00035
  103.8     .00035
  103.9     .00035
Time: 07-15-2009 13:37
  104.0     .00035
  104.1     .00035
  104.2     .00035
  104.3     .00035
  104.4     .00035
  104.5     .00035
  104.6     .00035
  104.7     .00035
  104.8     .00035
  104.9     .00035
Time: 07-15-2009 13:38
  105.0     .00035
  105.1     .00035
  105.2     .00035

From this I need to load information into three fields. The first field needs to be the filename, and the other two are shown in the example. I could prepend the filename to each data line, but that may not be necessary if I use a script to load the data.

If required, I can change the file format, but I don't want to lose the timestamps and header information.

SQLAlchemy seems like a good possible choice for Python, which I'm fairly familiar with.
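
For what it's worth, the kind of three-column table I have in mind would look roughly like this in SQLAlchemy (the connection string, table name, and column names are just placeholders):

from sqlalchemy import create_engine, MetaData, Table, Column, String, Float

# Placeholder connection string and schema -- adjust to the real database.
engine = create_engine("mysql://user:secret@localhost/mydb")
metadata = MetaData()

readings = Table("readings", metadata,
    Column("filename", String(255)),
    Column("position", Float),
    Column("value", Float),
)
metadata.create_all(engine)

# Inserting one parsed data line would look something like this:
conn = engine.connect()
conn.execute(readings.insert(),
             {"filename": "file1.txt", "position": 103.3, "value": 0.00035})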

I have thousands of lines of data, so loading all the files I already have may be slow at first, but after that I just want to load in the new lines of each file. So I'll need to be selective about what I load, because I don't want duplicate information.

Any suggestions on a selective method for loading data from a text file into a MySQL database? And beyond that, what do you suggest for loading only the lines of a file that are not already in the database?

Thanks all. Meanwhile, I'll look into SQLAlchemy a bit more and see if I get somewhere with that.

+2  A: 

LOAD DATA INFILE has an IGNORE LINES option which you can use to skip the header. According to the docs, it also has a "LINES STARTING BY 'prefix_string'" option which you could use, since all of your data lines seem to start with two blanks while your timestamps start at the beginning of the line.
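
For instance, combining those options from Python might look something like this (an untested sketch: the table "readings", its columns, the file path, the header length, and the connection details are all placeholders to adapt):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cur = conn.cursor()
cur.execute("""
    LOAD DATA INFILE '/path/on/the/mysql/server/datafile.txt'
    INTO TABLE readings
    FIELDS TERMINATED BY ' '   -- runs of spaces between columns may need cleanup first
    LINES STARTING BY '  '     -- keep only lines indented by two blanks
    IGNORE 5 LINES             -- skip the fixed-size header (5 is a guess)
    (position, value)
    SET filename = 'datafile.txt'
""")
conn.commit()
conn.close()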

oggy
This may work for loading the file the first time, but how would you read only the last few new lines to update the database?
mouche
Use IGNORE LINES?
oggy
+2  A: 

Another way to do this is to have Python transform the files for you. You can easily filter the input file into an output file based on whatever criteria you specify. This code assumes you have some function is_data(line) that checks a line against your criteria and returns True if it is a data line.

with file("output", "w") as out:
  for line in file("input"):
    if is_data(line):
      out.write(line)
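
Based on the sample in the question, is_data could be as simple as this (a sketch -- the note keywords are placeholders for whatever your files actually contain):

NOTE_KEYWORDS = ("NOTE", "COMMENT")  # placeholders for the real keywords

def is_data(line):
  stripped = line.strip()
  if not stripped:                        # blank line
    return False
  if stripped.startswith("Time:"):        # timestamp line
    return False
  if stripped.startswith(NOTE_KEYWORDS):  # miscellaneous note line
    return False
  return True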

Additionally, if your files just keep growing by appending new lines, you could have it store and re-read the last recorded offset (this code may not be 100% right, I haven't tested it, but you get the idea):

import os

if os.path.exists("filter_settings.txt"):
  start = long(file("filter_settings.txt").read())
else:
  start = 0

with file("output", "w") as out:
  input = file("input")
  input.seek(start)                 # resume where the previous run left off
  for line in input:
    if is_data(line):
      out.write(line)
  # tell() returns a number, so convert it before writing it back out
  file("filter_settings.txt", "w").write(str(input.tell()))
Christopher
Thanks for the code example. Perhaps Python I/O is a good way to go. I'm going to look into that last snippet. I do continue to append data to the end of my files.
mouche
+1: Two-part pipeline. Python to transform to a "clean" form. MySQL to load. Runs faster broken down this way. And you have a lot of control over the filtering without having to sweat the SQL stuff.
S.Lott