+1  A: 

If your data file contains all your records (i.e. it also includes records whose ids are not duplicated within the file), you could pre-process it and produce a file that only contains records with duplicate ids.

If that is the case, it would reduce the size of the file you need to process with your AWK program.
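
Something like this two-pass awk might do as a starting point, assuming the id is the 8th comma-separated field (as later posts suggest): the first pass counts each id, the second pass only prints lines whose id was seen more than once.

awk -F, 'NR==FNR { count[$8]++; next } count[$8] > 1' infile.txt infile.txt > dups.txt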

David
+1  A: 

How is the input file sorted? Like cat file | sort, or sorted on a single specific field, or on multiple fields? If multiple fields, which fields and in what order? It appears the hour fields use a 24-hour clock, not 12, right? Are all the date/time fields zero-padded (would 9am be "9" or "09")?

Without taking performance into account, it looks like your code has a problem with month boundaries, since it assumes all months are 30 days long. Take the two dates 2008-05-31/12:00:00 and 2008-06-01/12:00:00: those are 24 hours apart, but your code produces the same time code for both (63339969600).
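
As a check, the quoted value is consistent with a conversion along the lines of (year*365 + month*30 + day)*86400 + hour*3600 + minute*60 + second: both dates give 5*30 + 31 = 6*30 + 1 = 181 days, so each one works out to (2008*365 + 181)*86400 + 12*3600 = 63339969600.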

jj33
A: 

jj33, you're right about my code handling the dates wrongly. Thanks a lot!

All records are well formed with respect to having zero-padded dates in 24-hour format. I fail to see how that relates to improving performance, though. Do you have an idea? I want to hear it!

The records are sorted with

sort inputFile > outputFile
ggasp
A: 

jj33, I'll add an array with the number of days elapsed before a given month:

# daysForMonth[m] holds the days elapsed before month m (1-based after split)
split("0 31 59 90 120 151 181 212 243 273 304 334 365", daysForMonth, " ")

I won't consider leap years because I don't need that precision, right?
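
A rough sketch of how that array could plug into the conversion (the date2secs name and argument order are just placeholders; adapt them to however the fields actually get split):

BEGIN { split("0 31 59 90 120 151 181 212 243 273 304 334 365", daysForMonth, " ") }
# daysForMonth[mo] = days elapsed before month mo, so 2008-05-31 and 2008-06-01 no longer collide
function date2secs(y, mo, d, h, mi, s) {
    return ((y * 365 + daysForMonth[mo] + d) * 86400) + h * 3600 + mi * 60 + s
}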

ggasp
+1  A: 

I think you do need to consider leap years. I didn't do the math, but I think that during a leap year, with February hard-coded to 28 days, comparing noon on 2/29 with noon on 3/1 would produce the same duplicate time stamp as before. It looks like you didn't implement it quite that way, though; the way you did implement it, I think you still have the problem, but between dates on 12/31 of $leapyear and 1/1 of $leapyear+1.
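
If leap years do turn out to matter, the same table-based scheme extends without much extra code. This is only a sketch with placeholder function names:

function isleap(y)   { return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0 }
# leap days contributed by whole years before y
function leapdays(y) { return int((y - 1) / 4) - int((y - 1) / 100) + int((y - 1) / 400) }
# total days = y * 365 + leapdays(y) + daysForMonth[mo] + (mo > 2 && isleap(y) ? 1 : 0) + d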

I think you might also get some collisions around daylight-saving time changes, if your code has to handle time zones that observe them.

The file doesn't really seem to be sorted in any useful way. I'm guessing that field $1 is some sort of status (the "OK" you're checking for). So it's sorted by record status, then by DAY, then MONTH, YEAR, HOURS, MINUTES, SECONDS. If it were year, month, day, I think there could be some optimizations there. There still might be, but my brain's going in a different direction right now.

If there are a small number of duplicate keys in proportion to total number of lines, I think your best bet is to reduce the file your awk script works over to just duplicate keys (as David said). You could also preprocess the file so the only lines present are the /OK/ lines. I think I would do this with a pipeline where the first awk script only prints the lines with duplicate IDs and the second awk script is basically the one above but optimized to not look for /OK/ and with the knowledge that any key present is a duplicate key.
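
That pipeline could look something like the sketch below, reusing the two-pass filter idea from earlier and treating close_in_time.awk as a hypothetical name for the original script with the /OK/ test removed (again assuming the key is field 8):

awk -F, 'NR==FNR { if (/OK/) count[$8]++; next } /OK/ && count[$8] > 1' infile.txt infile.txt | awk -F, -f close_in_time.awk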

If you know ahead of time that all or most lines will have repeated keys, it's probably not worth messing with. I'd bite the bullet and write it in C. Tons more lines of code, much faster than the awk script.

jj33
+1  A: 

On many unixen, you can get sort to sort by a particular column, or field. So by sorting the file by the ID, and then by the date, you no longer need to keep the associative array of when you last saw each ID at all. All the context is there in the order of the file.

On my Mac, which has GNU sort, it's:

sort -t, -k8 < input.txt > output.txt

to sort on the ID field. You can add a second sort key with another -k option (e.g. -k8,8 -k3,3). So a unix-style time_t timestamp might not be a bad idea in the file - it's easy to sort, and it saves you all those date calculations. Also (again, at least in GNU awk), there is a mktime function that builds the time_t value for you from the components.
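
For what it's worth, a sketch of the GNU awk route; the field positions and date/time layout here are guesses, so adjust them to the real record format:

gawk -F, '{
    split($2, p, "[-/:]")   # hypothetical: $2 looks like "2008-05-31/12:00:00"
    ts = mktime(p[1] " " p[2] " " p[3] " " p[4] " " p[5] " " p[6])
    print ts "," $0
}' input.txt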

AnotherHowie
+1  A: 

@AnotherHowie, I thought the whole preprocessing could be done with sort and uniq. The problem is that the OP's data seems to be comma-delimited, and (Solaris 8's) uniq doesn't give you any way to specify the field separator, so there wasn't a super clean way to do the preprocessing using standard unix tools. I don't think it would be any faster, so I'm not going to look up the exact options, but you could do something like:

cut -d, -f8 <infile.txt | sort | uniq -d | xargs -i grep {} infile.txt >outfile.txt

That's not very good because it executes a separate grep for every duplicate key. You could probably massage the uniq output into a single regexp to feed to grep, but the benefit would only be known if the OP posted the expected ratio of lines containing suspected duplicate keys to total lines in the file.
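
One way around the per-key grep, at least where fgrep -f is available (again assuming the key is field 8):

cut -d, -f8 infile.txt | sort | uniq -d > dupkeys.txt
fgrep -f dupkeys.txt infile.txt > outfile.txt

Same caveat as the xargs version: the keys get matched anywhere on the line, not just in the key field.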

jj33
A: 

I'm using AIX 5.3 and its AWK doesn't have the mktime function suggested by AnotherHowie.

ggasp
A: 

I'll implement this in C as jj33 suggested and report back how much I gain.

ggasp
+3  A: 

This sounds like a job for an actual database. Even something like SQLite could probably help you reasonably well here. The big problem I see is your definition of "within 4 hours". That's a sliding-window problem, which means you can't simply quantize all the data into 4-hour segments... you have to compute all "nearby" elements for every other element separately. Ugh.
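
To make the sliding-window point concrete: fixed 4-hour buckets would put 11:59:00 and 12:01:00 in different segments even though they're only two minutes apart. If the file were sorted by ID and then by a numeric timestamp (hypothetical layout below, with an epoch value prepended as field 1 the way the mktime sketch above does), a single pass comparing each record to the previous one with the same ID would be one way to catch those pairs:

awk -F, '$8 == id && $1 - ts <= 14400 { print }   # within 4 hours of the previous record with this ID
         { id = $8; ts = $1 }' sorted.txt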

Randal Schwartz