So, for some strange reason I end up with a 100GB log file that is unsorted (actually it's partially sorted), while the algorithms that I'm attempting to apply require sorted data. A line in the log file looks like so

data <date> data data more data

I have access to C# 4.0 and about 4 GB of RAM on my workstation. I would imagine that merge-sort of some kind would be best here, but short of implementing these algorithms myself, I want to ask whether there's some kind of shortcut I could take.

Incidentally, parsing the date string with DateTime.Parse() is very slow and takes up a lot of CPU time; the throughput is a measly 10 MB/sec. Is there a faster way than the following?

    public static DateTime Parse(string data)
    {            
        int year, month, day;

        int.TryParse(data.Substring(0, 4), out year);
        int.TryParse(data.Substring(5, 2), out month);
        int.TryParse(data.Substring(8, 2), out day);

        return new DateTime(year, month, day);
    }

I wrote that to speed up DateTime.Parse() and it actually works well, but is still taking a bucket-load of cycles.

Note that for the current log-file I'm interested in hours, minutes and seconds also. I know that I can provide DateTime.Parse() with format, but that doesn't seem to speed it up all that much.
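
For reference, providing the format explicitly looks roughly like this; the exact format string below is a guess at the log's layout:

    public static DateTime ParseWithFormat(string data)
    {
        // The format string is an assumption -- adjust it to the actual log layout.
        return DateTime.ParseExact(
            data,
            "yyyy-MM-dd HH:mm:ss.fff",
            System.Globalization.CultureInfo.InvariantCulture);
    }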

I'm looking for a nudge in the right direction, thanks in advance.

EDIT: Some people have suggested that I use string comparison in order to compare dates. That would work for the sorting phase, but I do need to parse dates for the algorithms. I still have no idea how to sort a 100GB file on 4GB of free RAM, without doing it manually.

EDIT 2: Well, thanks to several suggestions that I use Windows sort, I found out that there's a similar tool on Linux. Basically you call sort and it fixes everything for you. As we speak it's doing something, and I hope it'll finish soon. The command I'm using is

sort -k 2b 2008.log > 2008.sorted.log

-k specifies that I want to sort on the second field, which is a date-time string in the usual YYYY-MM-DD hh:mm:ss.msek format. I must admit that the man-pages are lacking when it comes to explaining all the options, but I found a lot of examples by running info coreutils 'sort invocation'.

I'll report back with results and timings. This part of the log is about 27GB. I am thinking of sorting 2009 and 2010 separately and then merging the results into a single file with the sort -m option.

Edit 3 Well, checking iotop suggests that it's reading in small chunks of the data file and then furiously doing something in order to process them. This process seems to be quite slow. =(

sort isn't using any memory, and only a single core. When it does read data from the drive it's not processing anything. Am I doing something wrong?

Edit 4 Three hours in and it's still doing the same thing. Now I'm at that stage where I want to try playing with parameters of the function, but I'm three hours invested... I'll abort it in about 4 hours, and try to put it up for overnight computation with smarter memory and space parameters...

Edit 5 Before I went home, I restarted the process with the following command:

sort -k 2b --buffer-size=60% -T ~/temp/ -T "/media/My Passport" 2010.log -o 2010.sorted.log

It returned this, this morning:

sort: write failed: /media/My Passport/sortQAUKdT: File too large

Wraawr! I thought I would just add as many hard drives as possible to speed this process up. Apparently adding a USB-drive was the worst idea ever. At the moment I can't even tell if it's about FAT/NTFS or some such, because fdisk is telling me that the USB drive is a "wrong device"... no kidding. I'll try to give it another go later, for now let's put this project into the maybe failed pile.

Final Notice This time it worked, with the same command as above, but without the problematic external hard drive. Thank you all for your help!

Benchmarking

Using 2 workstation-grade hard disks (at least 70 MB/sec read/write IO) on the same SATA controller, it took me 162 minutes to sort a 30GB log file. I will need to sort another 52GB file tonight; I'll post how that goes.

+15  A: 

Code like this is completely bound by how fast you can get the data off the disk. The file simply can never fit in the file system cache, so you're always waiting on the disk to supply the data. You're doing fairly well at 10 MB/sec; optimizing the code is never going to have a discernible effect.

Get a faster disk. Defrag the one you've got as an intermediate step.

Hans Passant
Well, I don't know what to say, other than you are completely wrong. If I don't parse the string for each line then I can chug along at a much more comfortable 80 MB/sec. If I do parse the string, then the CPU utilization is 25% (4 cores, so 100% of one core) and the entire process slows down to 10 MB/sec or thereabouts.
Gleno
Well, seeing 100% utilization on one core is indeed very bad. This should be completely bogged down by the disk. 80 MB/sec is very good, most consumer level drives can't go faster than ~65 MB/sec. Excellent opportunities for a profiler. Please save yourself the trouble of having to read inaccurate answers by revealing this kind of info in your question instead of a comment.
Hans Passant
Which is exactly why I posted that DateTime.Parse takes forever. I profiled and found out that this was the bottleneck. I'm sorry if I wasn't clear.
Gleno
+4  A: 

The best way of optimising the parsing of the dates is to not parse them at all.

As the dates are in ISO 8601 format, you can just compare them as strings. There is no parsing needed at all.

Regarding the sorting, you should be able to effectively use the fact that it's partially sorted. One approach could be to read the file and write into separate files divided by time ranges, for example daily or hourly. If you make each file small enough you can then read them into memory and sort them, and then just merge all the files.

Another approach could be to read the file and write the records that are in order into one file, and the other ones into another file. Sort the second file (possibly using this process recursively if it's large) and zip the two files together. That is, a modified merge sort.
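
A rough sketch of that split pass, comparing the timestamp portion of each line as a string (the field position and the 19-character timestamp width are assumptions):

    // Route each line to "inorder.log" or "outoforder.log" depending on whether its
    // timestamp is >= the highest in-order timestamp seen so far.
    // Needs: using System.IO;
    static void SplitByOrder(string inputPath)
    {
        using (var reader = new StreamReader(inputPath))
        using (var inOrder = new StreamWriter("inorder.log"))
        using (var outOfOrder = new StreamWriter("outoforder.log"))
        {
            string lastKey = null;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Assumes the timestamp starts after the first space ("YYYY-MM-DD hh:mm:ss").
                string key = line.Substring(line.IndexOf(' ') + 1, 19);
                if (lastKey == null || string.CompareOrdinal(key, lastKey) >= 0)
                {
                    inOrder.WriteLine(line);
                    lastKey = key;
                }
                else
                {
                    outOfOrder.WriteLine(line);
                }
            }
        }
    }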

Guffa
This is actually a very good idea, but as I commented in the edit I do need to parse dates. Actually I need to know differences in seconds between groups of events, so what I'm doing is subtracting two DateTime objects. Seems to go blazingly fast, at the very least compared to the DateTime.Parse().
Gleno
Hm, I like the idea of taking the out-of-place data out so much, that I'm currently just running a test ignoring all of it. Do you suppose that .NET can do orderedseta.Concat(orderedsetb).OrderBy(e => e) efficiently if orderedseta is large and orderedsetb is small? I have my doubts... Maybe there's a specific container that could do this...
Gleno
@Gleno: You need to fit all the data in memory to use the `OrderBy` method; it has to read the entire source into memory before it can start returning any items.
Guffa
@Guffa but if I use some magic container that is guaranteed to be ordered already, wouldn't it auto-magically know what to do?
Gleno
@Gleno: No, the `Concat` method simply returns all items from one collection, then all the items from the other collection, so the `OrderBy` method has no idea that there were two collections.
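
If both sets are already sorted, a hand-written streaming merge can exploit that. A minimal sketch (needs `using System;` and `using System.Collections.Generic;`):

    // Merge two sequences that are each already sorted by keySelector, yielding a
    // fully sorted sequence without buffering everything in memory.
    static IEnumerable<T> MergeSorted<T, TKey>(
        IEnumerable<T> first, IEnumerable<T> second, Func<T, TKey> keySelector)
        where TKey : IComparable<TKey>
    {
        using (var a = first.GetEnumerator())
        using (var b = second.GetEnumerator())
        {
            bool hasA = a.MoveNext();
            bool hasB = b.MoveNext();
            while (hasA && hasB)
            {
                if (keySelector(a.Current).CompareTo(keySelector(b.Current)) <= 0)
                {
                    yield return a.Current;
                    hasA = a.MoveNext();
                }
                else
                {
                    yield return b.Current;
                    hasB = b.MoveNext();
                }
            }
            while (hasA) { yield return a.Current; hasA = a.MoveNext(); }
            while (hasB) { yield return b.Current; hasB = b.MoveNext(); }
        }
    }

Usage would then be something like `MergeSorted(orderedseta, orderedsetb, e => e)`.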
Guffa
+1  A: 

Apart from whatever else you are doing (willw's suggestion is probably helpful), your parsing could be done over multiple threads, provided you have multiple processors or processor cores.

Josh
I tried that, and got mixed results. Mixed in the sense that it works with PLINQ and ForAll(), but I have no idea how well it works, because for some reason multithreading disables Windows Resource Monitor's ability to see how fast I'm reading and writing to the hard drive.
Gleno
+7  A: 

Just to answer your question about sorting a long file that doesn't fit into memory - you'll need to use some external sorting algorithm such as merge sort. The process is roughly as follows:

  • Partition the input into several parts that fit into memory and can be sorted using standard in-memory sorting algorithms (e.g. 100 MB or larger - you'll need to keep ~4 parts in memory at once). Sort all the parts and write them back to disk.

  • Read two parts from the disk (they are both sorted) and merge them, which can be done just by simultaneously iterating over the two inputs. Write the merged data set to another place in the disk. Note that you don't need to read the whole part into memory - just read it/write it in blocks as you go.

  • Repeat merging of parts until you have only a single part (which will be a sorted file with all the data from your original input data set).

You mentioned that the data is partially sorted already, so it would be a good idea to pick some algorithm for in-memory sorting (in the first phase) that is efficient in this case. You can see some suggestions in this question (though I'm not sure if the answer will be the same for very large data sets - and it depends on how close to sorted the input is).
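
For what it's worth, the first two phases are not much code in C# if ordinal comparison on the timestamp is an acceptable ordering. A rough sketch (chunk size, temp file names and key extraction are assumptions; needs `using System.Collections.Generic;` and `using System.IO;`):

    // Phase 1: split the input into sorted chunk files.
    static string Key(string line)
    {
        // Order by everything from the second field on (like `sort -k 2`).
        return line.Substring(line.IndexOf(' ') + 1);
    }

    static string WriteSortedChunk(List<string> lines, int index)
    {
        lines.Sort((x, y) => string.CompareOrdinal(Key(x), Key(y)));
        string path = "chunk" + index + ".tmp";       // temp file name is a placeholder
        File.WriteAllLines(path, lines);
        return path;
    }

    static List<string> SortChunks(string inputPath, int linesPerChunk)
    {
        var chunkFiles = new List<string>();
        var buffer = new List<string>(linesPerChunk);
        using (var reader = new StreamReader(inputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                buffer.Add(line);
                if (buffer.Count == linesPerChunk)
                {
                    chunkFiles.Add(WriteSortedChunk(buffer, chunkFiles.Count));
                    buffer.Clear();
                }
            }
        }
        if (buffer.Count > 0)
            chunkFiles.Add(WriteSortedChunk(buffer, chunkFiles.Count));
        return chunkFiles;
    }

    // Phase 2: merge two sorted chunk files into one; repeat until one file is left.
    static void MergeTwo(string leftPath, string rightPath, string outputPath)
    {
        using (var left = new StreamReader(leftPath))
        using (var right = new StreamReader(rightPath))
        using (var output = new StreamWriter(outputPath))
        {
            string l = left.ReadLine(), r = right.ReadLine();
            while (l != null && r != null)
            {
                if (string.CompareOrdinal(Key(l), Key(r)) <= 0)
                {
                    output.WriteLine(l);
                    l = left.ReadLine();
                }
                else
                {
                    output.WriteLine(r);
                    r = right.ReadLine();
                }
            }
            for (; l != null; l = left.ReadLine()) output.WriteLine(l);
            for (; r != null; r = right.ReadLine()) output.WriteLine(r);
        }
    }

Merging the chunk files pairwise with MergeTwo until only one file remains completes the last phase.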

Tomas Petricek
Thank you for typing it all out, but I kinda know how merge sort works. I was looking for a ready implementation of it - a kind of shortcut. I would imagine that a library such as .NET had it all figured out for me. :)
Gleno
@Gleno: Okay :-). I thought it would help since you wrote _"I still have no idea how to sort 100GB file on 4GB of free ram"_. Anyway, I haven't seen any existing library that does this - it is probably quite difficult to write a general library for this as the use cases will differ quite a bit.
Tomas Petricek
Complicated.... use any sql db
JPCF
+7  A: 

If a string sort will work for you, then just use the Windows SORT command. Sort the file and be done with it. It'll happily sort your 100GB file, and it's simple to use.

If you need to filter and convert the file, specifically the date field, then I would simply write a small conversion program that converts the date field into a zero-filled integer (like # of seconds since 1970, or whatever you like) and rewrites the record. Then you can pipe (|) the output into the sort command; then you have a final, sorted file that's more readily parsed by your utility program.
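
A minimal sketch of such a conversion program, assuming the date and time are the second and third whitespace-separated fields (adjust the split and the format string to the real layout):

    // Read the log from stdin, prefix each line with a zero-padded
    // "seconds since 1970" key, and write to stdout so it can be piped into SORT.
    using System;
    using System.Globalization;

    class Reformat
    {
        static void Main()
        {
            var epoch = new DateTime(1970, 1, 1);
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                string[] parts = line.Split(new[] { ' ' }, 4);
                DateTime stamp = DateTime.ParseExact(
                    parts[1] + " " + parts[2],      // assumed date + time fields
                    "yyyy-MM-dd HH:mm:ss.fff",      // assumed format
                    CultureInfo.InvariantCulture);
                long seconds = (long)(stamp - epoch).TotalSeconds;
                Console.WriteLine("{0:D10} {1}", seconds, line);
            }
        }
    }

Because the key is zero-padded and sits at the front of each line, SORT then needs no extra options.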

I think the mistake you're making is simply trying to do this all in one go. 100GB of data is a lot, and it takes some time to copy, but it doesn't take THAT long. Since you have to sort it, you already have to deal with a copy of the file at some point (i.e. you need as much free space on your machine to handle both copies at some time), even with an external sorting routine like merge sort.

Writing a simple reformatter and piping it into sort will save you a couple of trips through the file and save space on disk, since you'll inevitably need just the two copies.

I would also tweak the formatter into pulling only the fields I'm really interested in, and do all of the "heavy" parsing at that point, so that what you end up with is essentially a formatted file that is easily handled by your reporting routines. That way you'll save time later when potentially running your reports more than once.

Use a simple CSV or, even better, a fixed length file format for output if possible.

Make sure your date information, if you choose to use an integer, has all of the fields the same length. Otherwise the SORT utility won't sort them correctly (you end up with 1 10 2 3 instead of 1 2 3 10. You're better off with 01 02 03 10.).

Edit --

Let's approach it from a different tack.

The biggest question is "do you need all this data". This relates to the earlier suggestion about doing the heavy parsing first. Obviously, the more you can reduce the initial set the better. For example, simply removing 10% of the data is 10GB.

Something I like to think about as a rule of thumb, especially when dealing with a lot of data: "If you have 1 million of something, then every millisecond saved is 20 minutes off the bottom line."

Normally, we really don't think in terms of milliseconds for our work, it's more "seat of the pants", "that feels faster". But the 1ms == 20min/million is a good measure to get a grasp of how much data you're dealing with, and how long stuff should/could take.

For your case, 100GB of data. With a swag of 100 bytes per record, you're talking 1 billion rows. 20,000 minutes per millisecond. -- 5 1/2 hours. Gulp. (It's a rule of thumb; if you do the math it doesn't quite work out to this.)

So, you can appreciate the desire to reduce the raw data if at all possible.

That was one reason I deferred to the Windows SORT command. It's a basic process, but one affected by nuance, and one that can use some optimization. The folks who wrote SORT had the time and opportunity to make it "optimal", in many ways. Whether they did or did not, I can't say. But it's a fair assumption that they would put more time and attention into this process to make their SORT as good as practical, versus you, who are under a tight deadline.

There are 3rd-party sorting utilities for large data sets that probably (ideally) work better for that case. But those are unavailable to you (you could get them, but I don't think you wanted to rush out and get some other utility right away). So, SORT is our best guess for now.

That said, reducing the data set will gain more than any sort utility.

How much detail do you really need? And how much information are you really tracking? For example, if it were, say, web statistics, you may have 1000 pages on your site. But even with hourly numbers for a year, 365 * 24 * 1000, that's only 8.7M "buckets" of information -- a far cry from 1B.

So, is there any preprocessing you can do that does not require sorting? Summarizing the information into a coarser granularity? You can do that without sorting, simply by using memory-based hash maps. Even if you don't have "enough memory" to process all 100GB of data in one throw, you probably have enough to do it in chunks (5 chunks, 10 chunks) and write out the intermediate results.
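
For example, hourly counts can be accumulated in a single sequential pass with a plain dictionary. A sketch, assuming the timestamp is the second field and an hourly count is the kind of summary you want (needs `using System.Collections.Generic;` and `using System.IO;`):

    // Accumulate per-hour counts without any sorting.
    var counts = new Dictionary<string, long>();
    using (var reader = new StreamReader("2008.log"))   // file name is a placeholder
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            int start = line.IndexOf(' ') + 1;
            string hourBucket = line.Substring(start, 13);   // "YYYY-MM-DD hh"
            long count;
            counts.TryGetValue(hourBucket, out count);
            counts[hourBucket] = count + 1;
        }
    }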

You may also have a lot better luck splitting the data as well, into monthly or weekly file chunks. Maybe that's not easily done because the data is "mostly" sorted. But, in that case, if it's by date, the offenders (i.e. the data that's out of order) may well be clustered within the file, with the "out of order" stuff being mixed up just at the boundaries of the time periods (like around day transitions, maybe you have rows like 11:58pm, 11:59pm, 00:00am, 00:01am, 11:58pm, 00:02am). You might be able to leverage that heuristic as well.

The goal being that if you can somewhat deterministically determine the subset that's out of order, and break the file up into chunks of "in order data" and "out of order data", your sorting task may be MUCH MUCH smaller. Sort the few rows that are out of order, and then you have a merge problem (much simpler than a sorting problem).

So, those are tactics you can take approaching the problem. Summarization is obviously the best one, as anything that reduces this data load in any measurable way is likely worth the trouble. Of course it all boils down to what you really want from the data; clearly the reports will drive that. This is also a nice point about "premature optimization". If they're not reporting on it, don't process it :).

Will Hartung
This is by far the best answer so far.
Gabe
Will, I edited my post to reflect that I'm trying your solution. Thanks for the advice. Got any tips on how to speed it up?
Gleno
Gleno: Be sure to tell it to use lots of buffer space and give it plenty of temp space on fast hard drives!
Gabe
Thank you Will, it worked out very well, although I used Linux's sort. :)
Gleno
@Gleno - out of curiosity, how long did your sort take? Any other steps you did?
Will Hartung
@Will Hartung, I am curious myself. Less than 10 hours, but it didn't say exactly. Tonight I'll be sorting a ~ 40 GB file in the same way, and I'll time execution and post results.
Gleno
+3  A: 

I do need to parse dates for the algorithms.

On *NIX, I generally would have first converted the dates into something simple, suitable for text comparison, and made it the first word on the line. It's too early for date/time object creation. My usual date presentation is YYYYMMDD-hhmmss.millis. Make it so that all files have the same date format.

I still have no idea how to sort 100GB file on 4GB of free ram, without doing it manually.

As you have figured it out already, merge sort is the only option.

So to me the task falls into the following steps:

  1. dumb conversion to make dates sortable. Complexity: read/write sequentially 100GB.

  2. split the data into chunks of usable size, e.g. 1GB, and sort every chunk using plain quick sort before writing it to disk. Complexity: read/write sequentially 100GB; memory for quick sort.

  3. merge-sort the small files into one large one. One can do it step-wise, using a program which takes two files and merges them into a new one. Complexity: read/write sequentially 100GB log(N) times (where N is the number of files). HDD space requirement: 2*100GB (last merge of 2 x 50GB files into a single 100GB file).

  4. A program to automate the previous step: pick two (e.g. smallest) files, start a program to sort-merge them into a new file, remove the two original files. Repeat until only one file remains.

  5. (Optional) split the 100GB sorted file into smaller chunks of manageable size. After all you are going to do something with them. Number them sequentially or put first and last time stamps into the file name.

General concept: do not try to find a way to do it fast; piping 100GB around will take time anyway. Plan for the programs in every step to run overnight as a batch, without your attention.

On Linux that is all doable with shell/sort/awk/Perl, and I do not think that it is a problem to write it all in any other programming language. This is potentially 4 programs - but all of them are rather simple to code.

Dummy00001
+11  A: 

Short answer - load the data into a relational database, e.g. Sql Express, create an index, and use a cursor-based solution, e.g. a DataReader, to read each record off and write it to disk.
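
A rough sketch of that route, using SqlBulkCopy to load and a DataReader to stream the rows back out in order; the table, columns, connection string and the ParseStamp helper are placeholders for whatever you actually have (needs `using System;`, `using System.Data;`, `using System.Data.SqlClient;` and `using System.IO;`):

    // Assumes a table Log(Stamp datetime, Line nvarchar(max)) already exists,
    // ideally with an index on Stamp.
    using (var connection = new SqlConnection(connectionString))   // placeholder
    {
        connection.Open();

        // Bulk load in batches.
        var batch = new DataTable();
        batch.Columns.Add("Stamp", typeof(DateTime));
        batch.Columns.Add("Line", typeof(string));
        using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "Log" })
        using (var reader = new StreamReader("2008.log"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                batch.Rows.Add(ParseStamp(line), line);   // ParseStamp: your own parser
                if (batch.Rows.Count == 100000)
                {
                    bulk.WriteToServer(batch);
                    batch.Clear();
                }
            }
            if (batch.Rows.Count > 0)
                bulk.WriteToServer(batch);
        }

        // Stream the rows back out in timestamp order.
        using (var command = new SqlCommand(
            "SELECT Line FROM Log ORDER BY Stamp", connection) { CommandTimeout = 0 })
        using (var rows = command.ExecuteReader())
        using (var output = new StreamWriter("2008.sorted.log"))
        {
            while (rows.Read())
                output.WriteLine(rows.GetString(0));
        }
    }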

James Westgate
+1: I'm not exactly sure why people would go through all sorts of hoops just to duplicate functionality that exists inside every free sql type server on the planet.
Chris Lively
Sorting data in a database engine requires jumping through hoops in the first place: install the DB engine, write a program to load the data, load the data, sort the data, write a program to extract the data, extract the data. And when that's 100GB of data, those load, sort, and extract steps take a long time!
Gabe
@Gabe I differ... sorting problems are already solved by DB's... actually he needs to read the file at least one time. To write an algorithm to parse the file and load those data is so easy... I think this is the solution...
JPCF
JPCF: Every modern OS comes with a `sort` utility that is specifically designed to sort text files. Why introduce a whole new problem to solve a simple one?
Gabe
Does anybody even know which free databases can handle sorting a single 100GB table?
Gabe
@Gabe: 1. Sql Express is free and limited only by the number of concurrent connections. 2. Loading data can be done using SSIS or any number of other tools that may be available - and probably exported again. Edit: Sql Express is actually limited to a 4GB database, so I guess it's the 180-day trial.
James Westgate
James: Do you see how you've turned a sorting problem into a "research databases, learn how to use database, learn how to debug problems in database, learn how to write code to access database, learn how to sort database" problem? If you're already a DBA, it may be the obvious solution, but for the average coder it's much simpler to spend a couple hours writing the merge sort algorithm you learn in CS101.
Gabe
@Gabe: The original post mentions "algorithms that he's attempting to apply" - these algorithms may well be simple SELECT queries, so the database solution may well be a more thorough answer than it looks at first. (Also, using a database from code isn't all that difficult, I'd expect most programmers to be able to pull it off without breaking a sweat.)
Douglas
Douglas: Algorithms that require the source data to be in sorted order are not possible with simple SELECT queries. Any decent CS student should be able to write an algorithm in C/C#/Java that can find the largest gap between successive log entries with O(n) time and O(1) memory complexity. In SQL it's tricky or impossible.
Gabe
I guess we are coming at this from two different sides - student or hobbyist environment / or business environment. They are both valid solutions depending on which one applies. The way the question was written, and the amount of data involved makes me assume that this is a commercial environment.
James Westgate
@Gabe - I can't see how loading data into a table with an index sorted on e.g. date is tricky - it's absolutely trivial. And there may be other reporting or processing requirements on this data. It's an absolute no-brainer to me.
James Westgate
James: I said that writing SQL "that can find the largest gap between successive log entries with O(n) time and O(1) memory complexity" is tricky. You can write the query, but you can't guarantee it will be fast. If you think it's trivial to load 100GB into an indexed table, though, you obviously haven't tried it.
Gabe
As for the commercial environment, I've never seen a business environment where the guy whose job it is to analyze log files was also a database programmer. I think most commercial coders would be able to write a mergesort in the time it would take just to get access to a DBMS that can handle a 100GB table!
Gabe
Disagree. By the time our OP has worked out what he's doing, the code works perfectly, etc., a relational database (using all the cores of the processor) would have indexed that data. It's not THAT big. I've worked with IIS log files and tables that big in Site Server, and that was a few years ago now.
James Westgate
I've had some time to catch up on this discussion, and I have to side heavily with Gabe on this one. I'm not used to working with databases beyond querying them, and my background is in applied maths. The DBA actually supplied me with the file and said it was sorted, albeit in passing.
Gleno
OMG. The DBA could have sorted it! The data is already in a database!
James Westgate
@Gleno: I bet with some donuts you could have gotten the DBA to give you a sorted/formatted file. :)
Paul Nathan
Well, actually it's a bit complicated. The data center is in Switzerland and we would have to resend the data via a 100Mbit uplink, which would have taken its fair share of time since we can use it during off-hours only...
Gleno
+2  A: 

Actually I don't have many ideas about the date conversion, but the things that I would try to use to do that are:

  1. A database with an index on the date column (to make it easy to search the data afterwards).
  2. To insert into this database, use bulk insert.
  3. And some way to parallelize the reading (I think Parallel LINQ would be good and is very easy to use).
  4. Lots of patience (the most important/hard thing)
Leo Nowaczyk
+8  A: 

Why don't you try this relatively unknown tool from Microsoft called logparser? It basically allows you to do an SQL query over a CSV file (or any other formatted text file).

Saves you the trouble of pumping it into a database, doing your sort, and pumping it back out again

Toad
That's a neat idea, but I'm afraid it's like a magic bullet. Sure it says it will sort it for you, but when you are dealing with a large data set a suboptimal approach can really end up biting you in the buttocks.
Gleno
@gleno: it's supposed to work with log files from IIS, which can be huge as well. Who knows, maybe it'll work. Saves you a lot of time if it does. ;^)
Toad
Yes, someone else posted ITT that it might be optimized for large files also. Pretty cool. I'm definitely going to check it out later. Thanks for the advice!
Gleno
A: 

You can try to implement a radix sort algorithm. Because radix sort scans the whole list sequentially and only a few times, it can help here to prevent a giant number of scans and seeks over your 100 GB file.

Radix sort classifies your records on each iteration by one part. This part can be a digit, or a datetime part like year, month or day. In this case you don't even need to convert the string to a DateTime; you can convert only the specific part to an int.

Edit:

For sorting purposes, you can create a temp binary file with only 2 columns: the DateTime (DateTime.ToBinary() as Int64) and the line's address in the source file (as Int64).

Then you get a much smaller file with fixed-size records, only 16 bytes per record, and you can sort it much faster (IO operations will be faster, at least).

Once the temp file is sorted, you can recreate the full sorted 100 GB log file.
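
A sketch of building that index file; it assumes single-byte characters and '\n' line endings so byte offsets can be tracked while reading lines, and it assumes the timestamp occupies the second and third fields (adjust both to the real layout). Needs `using System;`, `using System.Globalization;`, `using System.IO;` and `using System.Text;`:

    // Write (DateTime.ToBinary(), byte offset) as two Int64s per record.
    static void BuildIndex(string logPath, string indexPath)
    {
        long offset = 0;
        using (var reader = new StreamReader(logPath, Encoding.ASCII))
        using (var index = new BinaryWriter(File.Create(indexPath)))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] parts = line.Split(new[] { ' ' }, 4);
                DateTime stamp = DateTime.ParseExact(
                    parts[1] + " " + parts[2],      // assumed date + time fields
                    "yyyy-MM-dd HH:mm:ss.fff",      // assumed format
                    CultureInfo.InvariantCulture);
                index.Write(stamp.ToBinary());      // 8 bytes
                index.Write(offset);                // 8 bytes
                offset += line.Length + 1;          // +1 for the assumed '\n'
            }
        }
    }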

DxCK
The basic problem of any radix sort is memory complexity - O(n). Interestingly, my professor has worked a lot on radix sort algorithms and was the first in the world to introduce O(n log log n) time sorting. Albeit with terrible memory complexity... :)
Gleno
In this case, you can use an additional hard drive as the "memory" for the radix sort. I say additional because you always need to read from the source and write to the destination, so you can read from and write to separate disks to avoid hurting performance with seeks on every read-write cycle.
DxCK
+3  A: 

Assuming that your log file only has 1-2% of the rows out of order, you could make a single pass through the complete log, outputting two files: one file that is in order and another file containing the 1-2% of rows that are out of order. Then sort the out-of-order rows in memory and perform a single merge of the formerly out-of-order rows with the in-order rows. This will be much faster than a full mergesort, which would do many more passes.

Assuming that your log file has no row more than N rows out of place, you could make a single pass through the log with a sorted queue N rows deep. Whenever you encounter a log row that is out of order, just insert it into the proper place in the queue. Since this only requires a single pass through the log, it's going to be as fast as you can get.
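
A sketch of that second approach, comparing lines by their timestamp field with ordinal string comparison (the key extraction is an assumption, and the window size N is up to you). Needs `using System.Collections.Generic;` and `using System.IO;`:

    class TimestampComparer : IComparer<string>
    {
        public int Compare(string x, string y)
        {
            // Compare from the second field on (the date-time), like `sort -k 2`.
            return string.CompareOrdinal(
                x.Substring(x.IndexOf(' ') + 1),
                y.Substring(y.IndexOf(' ') + 1));
        }
    }

    // Keep the next N lines in a sorted buffer; once it holds more than N lines,
    // the smallest one is safe to emit, provided no line is more than N positions
    // out of place.
    static void WindowSort(string inputPath, string outputPath, int windowSize)
    {
        var comparer = new TimestampComparer();
        var window = new List<string>();
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                int pos = window.BinarySearch(line, comparer);
                window.Insert(pos < 0 ? ~pos : pos, line);
                if (window.Count > windowSize)
                {
                    writer.WriteLine(window[0]);
                    window.RemoveAt(0);
                }
            }
            foreach (string remaining in window)
                writer.WriteLine(remaining);
        }
    }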

Gabe
A: 

Not really a solution, but just out of interest, one way to do it would be like this:

  • First break the file down into 1GB files
  • Then, reading 2 files at a time, load the contents into a list of strings and sort it
  • Write it back down to the individual files.

The problem is that you would need to read/write 100 files on each pass and do 100 passes to make sure that the data is sorted.

If my maths is correct: That is 10 000 GB read and 10 000 GB write, at an average 10MB/sec that is 20 000 000 sec which is 231 days

One way that might work is to scan the file once and write to smaller files, one for each time period, for example day or hour. Then sort these individual files.

Shiraz Bhaiji
I didn't get the math, and actually it doesn't seem rational to me! How can it be so long? Isn't that merge sort? Merge sort is supposed to be pretty quick... If you are right, then there is probably no way to sort it much faster than a few months...
Itay
A: 

Wow. First of all, that is a whole new level of documenting obsession.

My actual advice would be: try to consider how necessary this file really is.

About sorting, I have no idea if this will work or not, but you might want to try to build an enumerator that returns the data directly from the hard disk (not saving anything but a few pointers, maybe), and then try to use LINQ's OrderBy, which returns an IEnumerable as well, which you, hopefully, can enumerate and save directly back to the disk.

The only question is whether or not OrderBy saves anything in the RAM.

Itay
Well, the OrderBy method isn't magical... it still needs to do the work somehow. :)
Gleno
it probably works in memory, but it's not necessary :D
Itay
+2  A: 

Pre-emptive comment: My answer only addresses the sub-problem of parsing date time values.

DateTime.Parse contains checks for all possible date formats. If you have a fixed format you can optimize parsing quite well. A simple optimization would be to convert the characters directly:

class DateParserYyyyMmDd
{
    static void Main(string[] args)
    {
        string data = "2010-04-22";

        DateTime date = Parse2(data);
    }

    struct Date
    {
        public int year;
        public int month;
        public int day;
    }

    static Date MyDate;

    static DateTime Parse2(string data)
    {
        MyDate.year = (data[0] - '0') * 1000 + (data[1] - '0') * 100 
            + (data[2] - '0') * 10 + (data[3] - '0');
        MyDate.month = (data[5] - '0') * 10 + (data[6] - '0');
        MyDate.day = (data[8] - '0') * 10 + (data[9] - '0');

        return new DateTime(MyDate.year, MyDate.month, MyDate.day);
    }
}
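
Since the question also needs hours, minutes and seconds, the same character arithmetic extends naturally; the character offsets below assume a fixed "yyyy-MM-dd HH:mm:ss" prefix:

    static DateTime ParseDateTime(string data)
    {
        int year   = (data[0] - '0') * 1000 + (data[1] - '0') * 100
                   + (data[2] - '0') * 10 + (data[3] - '0');
        int month  = (data[5] - '0') * 10 + (data[6] - '0');
        int day    = (data[8] - '0') * 10 + (data[9] - '0');
        int hour   = (data[11] - '0') * 10 + (data[12] - '0');
        int minute = (data[14] - '0') * 10 + (data[15] - '0');
        int second = (data[17] - '0') * 10 + (data[18] - '0');

        return new DateTime(year, month, day, hour, minute, second);
    }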
0xA3
Well, this is like something I tried, but it didn't occur to me that indexing would be faster than substringing, which it obviously must be.
Gleno
+4  A: 

For sorting you could implement a file-based bucket sort:

  1. Open input file
  2. Read file line by line
  3. Get date as string from line
  4. Append line to file <date>.log

The result would be a separate log file for each day, or one for each hour. Choose the granularity so that you get files of a size that you can easily sort.

The remaining task would be to sort the created files and possibly merge the files again.
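
A rough sketch of the splitting pass, keeping one writer open per day so files are not constantly reopened (it assumes the date is the second field; needs `using System.Collections.Generic;` and `using System.IO;`):

    // Append each line to a per-day file named "<YYYY-MM-DD>.log".
    static void SplitByDay(string inputPath)
    {
        var writers = new Dictionary<string, StreamWriter>();
        try
        {
            using (var reader = new StreamReader(inputPath))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    int start = line.IndexOf(' ') + 1;
                    string day = line.Substring(start, 10);   // "YYYY-MM-DD"
                    StreamWriter writer;
                    if (!writers.TryGetValue(day, out writer))
                    {
                        writer = new StreamWriter(day + ".log", append: true);
                        writers.Add(day, writer);
                    }
                    writer.WriteLine(line);
                }
            }
        }
        finally
        {
            foreach (var writer in writers.Values)
                writer.Dispose();
        }
    }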

0xA3
+1, I feel that this could be an easy and elegant solution. Probably a cache of open files should be kept to minimize the impact of continuously opening and closing such files; an LRU cache could be OK, since the OP says that the data is already partially sorted.
Matteo Italia