
I want to read from an unsorted source text file (one record per line) and insert each line/record into a destination text file by specifying the line number where it should go. Where to insert the line/record is determined by comparing the incoming line from the source file against the already ordered lines in the destination file (the destination file starts out empty and is built up one sorted line at a time as I iterate over the source file lines).

Incoming File Example:
1 10/01/2008 line1data
2 11/01/2008 line2data
3 10/15/2008 line3data

Desired Destination File Example:
2 11/01/2008 line2data
3 10/15/2008 line3data
1 10/01/2008 line1data

I could do this by performing the sort in memory via a linked list or similar, but I want to allow this to scale to very large files (and I'm having fun trying to solve this problem, as I'm a C++ newbie :)
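
(For reference, a minimal sketch of that in-memory version, with made-up file names and assuming the date in column 2 is always MM/DD/YYYY, might look like this:)

    // Read every line, key it on the date, and write the lines back out in
    // descending date order (matching the example above).
    #include <fstream>
    #include <functional>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>

    // Rearrange "MM/DD/YYYY" into "YYYYMMDD" so that string comparison
    // matches chronological order.
    std::string dateKey(const std::string& line)
    {
        std::istringstream iss(line);
        std::string id, date;
        iss >> id >> date;                 // e.g. "1" and "10/01/2008"
        return date.substr(6, 4) + date.substr(0, 2) + date.substr(3, 2);
    }

    int main()
    {
        std::ifstream in("source.txt");    // made-up file names
        std::ofstream out("dest.txt");

        // std::greater gives newest-first order, as in the example above.
        std::multimap<std::string, std::string, std::greater<std::string> > sorted;

        std::string line;
        while (std::getline(in, line))
            sorted.insert(std::make_pair(dateKey(line), line));

        for (std::multimap<std::string, std::string, std::greater<std::string> >::iterator
                 it = sorted.begin(); it != sorted.end(); ++it)
            out << it->second << '\n';
    }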

One of the ways to do this may be to open 2 file streams with fstream (1 in and 1 out, or just 1 in/out stream), but then I run into the difficulty that it's hard to find and seek to the right file position, because positions seem to be absolute byte offsets from the start of the file rather than line numbers :)

I'm sure problems like this have been tackled before, and I would appreciate advice on how to proceed in a manner that is good practice.

I'm using Visual Studio 2008 Pro C++, and I'm just learning C++.

Thanks!!!

+1  A: 

If the file is just a plain text file, then I'm afraid the only way to find a particular numbered line is to walk the file counting lines as you go.

The usual 'non-memory' way of doing what you're trying to do is to copy the file from the original to a temporary file, inserting the data at the right point, and then do a rename/replace of the original file.

Obviously, once you've done your insertion, you can copy the rest of the file in one big lump, because you don't care about counting lines any more.
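
Something along these lines, perhaps (the file names are invented, and the plain string comparison just stands in for whatever "goes before" test you really want):

    // Copy dest.txt to a temporary file, dropping the new record in at the
    // first line it should precede, then swap the temporary file back over
    // the original.
    #include <cstdio>
    #include <fstream>
    #include <string>

    void insertSorted(const std::string& newLine)
    {
        std::ifstream in("dest.txt");      // invented file names
        std::ofstream out("dest.tmp");

        bool inserted = false;
        std::string line;
        while (std::getline(in, line))
        {
            if (!inserted && newLine < line)   // substitute your real comparison
            {
                out << newLine << '\n';
                inserted = true;
            }
            out << line << '\n';
        }
        if (!inserted)
            out << newLine << '\n';            // belongs at the end (or file was empty)

        in.close();
        out.close();
        std::remove("dest.txt");
        std::rename("dest.tmp", "dest.txt");
    }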

Will Dean
A: 

I wonder if there are any examples of using Merge Sort and C++ to write to files. Too bad there doesn't seem to be an easy way to read/insert a text file by line number.

Will Dean's post is helpful for me in that I think that's a good way to determine line numbers.

Hopefully when I'm trying to insert at the line number, it doesn't overwrite the existing line (hopefully it just inserts the line).

Hopefully my question is understandable. Thanks again

Rich
A: 

I think the question is more about implementation than about specific algorithms, specifically handling very large datasets.

Suppose the source file has 2^32 lines of data. What would be an efficient way to sort the data?

Here's how I'd do it:

  1. Parse the source file and extract the following information: sort key, offset of the line in the file, and length of the line. Write this information to another file. This produces a dataset of fixed-size elements that is easy to index; call it the index file.

  2. Use a modified merge sort. Recursively divide the index file until the number of elements to sort reaches some minimum amount. A true merge sort recurses down to 1 or 0 elements; I suggest stopping at 1024 or so (this will need fine-tuning). Load that block of data from the index file into memory, quicksort it, and write the data back to disk.

  3. Perform the merge on the index file. This is tricky, but can be done like this: load a block of data from each source (1024 entries, say). Merge into a temporary output file and write. When a block is emptied, refill it. When no more source data is found, read the temporary file from the start and overwrite the two parts being merged - they should be adjacent. Obviously, the final merge doesn't need to copy the data (or even create a temporary file). Thinking about this step, it is probably possible to set up a naming convention for the merged index files so that the data doesn't need to overwrite the unmerged data (if you see what I mean).

  4. Read the sorted index file and pull out from the source file the line of data and write to the result file.

It certainly won't be quick with all that file reading and writing, but it should be quite efficient; the real killer is the random seeking into the source file in the final step. Up to that point, the disk access is mostly linear and should therefore be reasonably efficient.
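
A rough sketch of step 1 only, assuming the record format from the question (the file names and the fixed key size are invented):

    // Walk the source file once and, for each line, append a fixed-size record
    // (sort key, byte offset, length) to a separate index file.
    #include <cstring>
    #include <fstream>
    #include <sstream>
    #include <string>

    struct IndexEntry
    {
        char key[16];   // the raw date from column 2; a real version would
                        // normalise it so that string order matches date order
        long offset;    // where the line starts in the source file
        long length;    // how many bytes the line occupies
    };

    void buildIndex(const char* src, const char* idx)
    {
        std::ifstream in(src, std::ios::binary);
        std::ofstream out(idx, std::ios::binary);

        std::string line;
        long offset = static_cast<long>(in.tellg());
        while (std::getline(in, line))
        {
            std::istringstream iss(line);
            std::string id, date;
            iss >> id >> date;

            IndexEntry e;
            std::memset(&e, 0, sizeof e);
            std::strncpy(e.key, date.c_str(), sizeof e.key - 1);
            e.offset = offset;
            e.length = static_cast<long>(line.size());

            out.write(reinterpret_cast<const char*>(&e), sizeof e);
            offset = static_cast<long>(in.tellg());
        }
    }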

Skizz
are you thinking something along the lines of the initial solution to this: http://cm.bell-labs.com/cm/cs/pearls/sec013.html ?
warren
Sort of (the top diagram I guess). The 'wonder sort' bitmap wouldn't work here because a) there's extra data and b) there can be duplicate keys.
Skizz
+1  A: 

A [distinctly-no-c++] solution would be to use the *nix sort tool, sorting on the second column of data. It might look something like this:

cat <file> | sort -k 2,2 > <file2> ; mv <file2> <file>

It's not exactly in-place, and it fails the request of using C++, but it does work :)

Might even be able to do:

cat <file> | sort -k 2,2 > <file>

I haven't tried that route, though.
* http://www.ss64.com/bash/sort.html - sort man page

warren
Yes, this is the obvious solution, but I am not sure if the implementation of `sort` scales to the mentioned gigantic files.
Svante
true enough... just wanted to provide an alternative :)
warren
@Svante: I have used GNU sort on files over 5 GB in size and it worked well.
Zan Lynx
Thanks, Zan. @warren: by the way, the use of `cat` here is superfluous, `sort` can take the file as an argument. (This kind of using `cat` is an anti-pattern.)
Svante
@Svante - I am aware that `sort` can take the file as an argument, but it makes sense to me to split the command into its parts (display the file, sort it, do something with it) :)
warren
+1  A: 

One way to do this is not to keep the file sorted, but to use a separate index, using Berkeley DB. Each record in the DB has the sort keys and the offset into the main file. The advantage of this is that you can have multiple ways of sorting without duplicating the text file. You can also change lines without rewriting the file, by appending the changed line at the end and updating the index to ignore the old line and point to the new one. We used this successfully for multi-GB text files that we had to make many small changes to.

Edit: The code I developed to do this is part of a larger package that can be downloaded here. The specific code is in the btree* files under source/IO.
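
Not Berkeley DB itself, but a toy sketch of the same idea using std::multimap in place of the database (the file name is invented): the text file is never reordered, only an index of key-to-offset entries is kept, and "reading in sorted order" means seeking to each recorded offset in turn.

    #include <fstream>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>

    int main()
    {
        std::ifstream data("records.txt", std::ios::binary);   // invented name
        std::multimap<std::string, std::streampos> index;      // key -> line offset

        std::string line;
        std::streampos pos = data.tellg();
        while (std::getline(data, line))
        {
            std::istringstream iss(line);
            std::string id, key;
            iss >> id >> key;                    // sort on the date column
            index.insert(std::make_pair(key, pos));
            pos = data.tellg();
        }

        // Walk the index in key order, pulling each line out of the data file.
        data.clear();                            // clear the EOF flag before seeking
        for (std::multimap<std::string, std::streampos>::iterator it = index.begin();
             it != index.end(); ++it)
        {
            data.seekg(it->second);
            std::getline(data, line);
            std::cout << line << '\n';
        }
    }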

KeithB
A: 

Thanks for all the feedback. I like the posts by Will Dean and Skizz the best so far. I'm just trying to do it a certain way for learning purposes, even though the other proposals may be better.

Yeah, I'm hoping for another good post like Will Dean's; it gives the most insight so far into how I'm trying to do it. Although I'm still not sure how to move the file pointer around very well while I'm reading/writing, since it depends on an absolute position from the start of the file.

Skizz has a well thought out post on this, but I'm afraid it's too advanced for me ;)

Rich
Unless you have a good reason not to, it's better to read the file and write to a new one, then remove the old one. It's more efficient than jumping around in a file and reduces the chance of an oops.
Martin Beckett
Yeah, I was thinking of doing just that. I would like to leave the source file intact though for other sorting once I can get past this.
Rich
A: 

Try a modified bucket sort. Assuming the id values lend themselves well to it, you'll get a much more efficient sorting algorithm. You may be able to enhance I/O efficiency by actually writing out the buckets (use small ones) as you scan, thus potentially reducing the amount of randomized file I/O you need. Or not.
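
A rough sketch of just the bucketing pass (the bucket file names and the choice of the date's month as the bucket key are guesses on my part); each bucket can later be sorted on its own, in memory, and the buckets written out in order:

    #include <fstream>
    #include <sstream>
    #include <string>

    int main()
    {
        std::ifstream in("source.txt");          // invented file name
        std::string line;
        while (std::getline(in, line))
        {
            std::istringstream iss(line);
            std::string id, date;
            iss >> id >> date;                   // date looks like MM/DD/YYYY

            // Append the line to that month's bucket file. Reopening the
            // stream per line is slow but keeps the sketch short.
            std::string month = date.substr(0, 2);
            std::ofstream bucket(("bucket_" + month + ".txt").c_str(), std::ios::app);
            bucket << line << '\n';
        }
    }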

Brian
A: 

All of these are good answers. I'm afraid most are too advanced for me at present.

I'm mainly interested in what the code would look like to read records from a source file and then sort/insert into a destination file. Hopefully, there are some good code examples on how to insert a record based on line number into the destination file.

Although, maybe my approach is wrong. Maybe I need to forget about the line-number bit and instead work out where a given line starts by counting the carriage returns/line feeds, so I know its absolute position from the beginning of the file.
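
Something like this small sketch is what that would look like (it assumes lines are numbered from 0 and that every line ends with a newline):

    #include <fstream>
    #include <string>

    // Returns the byte offset at which line `lineNo` starts, or -1 if the
    // file has fewer lines than that.
    long offsetOfLine(const char* path, long lineNo)
    {
        std::ifstream in(path, std::ios::binary);
        long offset = 0;
        std::string line;
        for (long i = 0; i < lineNo; ++i)
        {
            if (!std::getline(in, line))
                return -1;
            offset = static_cast<long>(in.tellg());
        }
        return offset;
    }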

It's great that I'm getting all this feedback. Thanks guys, I can tell there are some real pros here :)

Rich
A: 

"Hopefully, there are some good code examples on how to insert a record based on line number into the destination file."

You can't insert content into the middle of a file (i.e., without overwriting what was previously there); I'm not aware of any production-level filesystem that supports it.

zvrba
I didn't know that. I just assumed C++ supported this because other languages do (the most recent that comes to mind is mIRC scripting)
Rich
A: 

Sorry for all my answers back; I'm just trying to clarify my problem better (guess I should've done that better in my original post ;)

I guess the biggest problem I'm facing is that I haven't seen any C++ examples where lines are inserted into a destination file based on reads (sort comparisons) of that same file. Obviously, I don't think you'd want to keep the file streams to the destination file open until the sorting is complete.

Rich
+1  A: 

The basic problem is that under common OSs, files are just streams of bytes. There is no concept of lines at the filesystem level. Those semantics have to be added as an additional layer on top of the OS-provided facilities. Although I have never used it, I believe that VMS has a record-oriented filesystem that would make what you want to do easier. But under Linux or Windows, you can't insert into the middle of a file without rewriting the rest of the file. It is similar to memory: at the highest level, it's just a sequence of bytes, and if you want something more complex, like a linked list, it has to be added on top.

KeithB
A: 

Alright, the consensus seems to be that I can't do what I want in C++, i.e. insert a record into the destination file at a given line number (without copying the remainder of the file and rewriting it, or something along those lines)...

I'll probably end up sorting a linked list and then writing to a destination file (even though I didn't want to ;)

I'll just leave this post open in case anybody has any more insight. Thanks for all the responses.

Rich