views: 899
answers: 5

Hello,

I have a huge text file with 25k lines. Inside that text file, each line starts with "1 \t (linenumber)".

Example:

1   1 ITEM_ETC_GOLD_01 골드(소) xxx xxx xxx_TT_DESC 0 0 3 3 5 0 180000 3 0 1 0 0 255 1 1 0 0 0 0 0 0 0 0 0 0 -1 0 -1 0 -1 0 -1 0 -1 0 0 0 0 0 0 0 100 0 0 0 xxx item\etc\drop_ch_money_small.bsr xxx xxx xxx 0 2 0 0 1 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 표현할 골드의 양(param1이상) -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx 0 0
1   2 ITEM_ETC_GOLD_02 골드(중) xxx xxx xxx_TT_DESC 0 0 3 3 5 0 180000 3 0 1 0 0 255 1 1 0 0 0 0 0 0 0 0 0 0 -1 0 -1 0 -1 0 -1 0 -1 0 0 0 0 0 0 0 100 0 0 0 xxx item\etc\drop_ch_money_normal.bsr xxx xxx xxx 0 2 0 0 1 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1000 표현할 골드의 양(param1이상) -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx 0 0
1   3 ITEM_ETC_GOLD_03 골드(대) xxx xxx xxx_TT_DESC 0 0 3 3 5 0 180000 3 0 1 0 0 255 1 1 0 0 0 0 0 0 0 0 0 0 -1 0 -1 0 -1 0 -1 0 -1 0 0 0 0 0 0 0 100 0 0 0 xxx item\etc\drop_ch_money_large.bsr xxx xxx xxx 0 2 0 0 1 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 10000 표현할 골드의 양(param1이상) -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx 0 0
1   4 ITEM_ETC_HP_POTION_01 HP 회복 약초 xxx SN_ITEM_ETC_HP_POTION_01 SN_ITEM_ETC_HP_POTION_01_TT_DESC 0 0 3 3 1 1 180000 3 0 1 1 1 255 3 1 0 0 1 0 60 0 0 0 1 21 -1 0 -1 0 -1 0 -1 0 -1 0 0 0 0 0 0 0 100 0 0 0 xxx item\etc\drop_ch_bag.bsr item\etc\hp_potion_01.ddj xxx xxx 50 2 0 0 1 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 120 HP회복양 0 HP회복양(%) 0 MP회복양 0 MP회복양(%) -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx 0 0
1   5 ITEM_ETC_HP_POTION_02 HP 회복약 (소) xxx SN_ITEM_ETC_HP_POTION_02 SN_ITEM_ETC_HP_POTION_02_TT_DESC 0 0 3 3 1 1 180000 3 0 1 1 1 255 3 1 0 0 1 0 110 0 0 0 2 39 -1 0 -1 0 -1 0 -1 0 -1 0 0 0 0 0 0 0 100 0 0 0 xxx item\etc\drop_ch_bag.bsr item\etc\hp_potion_02.ddj xxx xxx 50 2 0 0 2 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 220 HP회복양 0 HP회복양(%) 0 MP회복양 0 MP회복양(%) -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx 0 0
1   6 ITEM_ETC_HP_POTION_03 HP 회복약 (중) xxx SN_ITEM_ETC_HP_POTION_03 SN_ITEM_ETC_HP_POTION_03_TT_DESC 0 0 3 3 1 1 180000 3 0 1 1 1 255 3 1 0 0 1 0 200 0 0 0 4 70 -1 0 -1 0 -1 0 -1 0 -1 0 0 0 0 0 0 0 100 0 0 0 xxx item\etc\drop_ch_bag.bsr item\etc\hp_potion_03.ddj xxx xxx 50 2 0 0 3 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 370 HP회복양 0 HP회복양(%) 0 MP회복양 0 MP회복양(%) -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx 0 0
1   7 ITEM_ETC_HP_POTION_04 HP 회복약 (대) xxx SN_ITEM_ETC_HP_POTION_04 SN_ITEM_ETC_HP_POTION_04_TT_DESC 0 0 3 3 1 1 180000 3 0 1 1 1 255 3 1 0 0 1 0 400 0 0 0 7 140 -1 0 -1 0 -1 0 -1 0 -1 0 0 0 0 0 0 0 100 0 0 0 xxx item\etc\drop_ch_bag.bsr item\etc\hp_potion_04.ddj xxx xxx 50 2 0 0 4 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 570 HP회복양 0 HP회복양(%) 0 MP회복양 0 MP회복양(%) -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx -1 xxx 0 0

Question: How do I directly read, for example, line 5?

+4  A: 

You can use my LineReader class (either the one in MiscUtil or a simple version here) to implement IEnumerable<string> and then use LINQ:

string line5 = new LineReader(file).Skip(4).First();

This assumes .NET 3.5, admittedly. Otherwise, open a TextReader (e.g. with File.OpenText) and just call ReadLine() four times to skip the lines you don't want, and then once more to read the fifth line.

There's no way of "shortcutting" this unless you know exactly how many bytes are in each line.
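The "simple version" mentioned above isn't reproduced in this post; a minimal sketch of what such a LineReader might look like follows (this is an assumption about its shape, not the actual MiscUtil code - the key idea is that it opens the file lazily per enumeration, so Skip(4).First() reads only five lines):

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

// Minimal LineReader sketch: wraps a file as IEnumerable<string>.
public sealed class LineReader : IEnumerable<string>
{
    private readonly Func<TextReader> dataSource;

    // Taking a Func<TextReader> (rather than an already-open reader) lets
    // each enumeration open and close the data source itself.
    public LineReader(string filename)
        : this(() => File.OpenText(filename)) { }

    public LineReader(Func<TextReader> dataSource)
    {
        this.dataSource = dataSource;
    }

    public IEnumerator<string> GetEnumerator()
    {
        using (TextReader reader = dataSource())
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

Because the reader is opened inside the iterator and disposed by the `using` block, the class itself never holds an open handle, which is why it doesn't need to implement IDisposable.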

Jon Skeet
Under the hood that still reads line by line till it gets to the line you want. Is there a way to go *directly* to line 5?
BFree
How is the Stream supposed to know the byte offset of line 5 ahead of time, BFree?
Matthew Flaschen
Is there a reason LineReader doesn't have a Stream constructor overload, only Func<Stream>?
Matthew Flaschen
@Matthew.. Yea, that's what I thought too, was just wondering if there was some way I wasn't aware of.
BFree
Yes - it's responsible for opening and closing the data source. It doesn't need to implement IDisposable, because it does all the opening and closing at the right time.
Jon Skeet
My problem is that the file is big (25k lines). But let's change the subject to make it easier: how does Notepad work? When you press Edit -> Go To, it does it quite fast.
John
Not that I know the inner workings of Notepad, but I would guess that it reads the whole file into memory. Searching in memory is fast (but obviously consumes memory). Also, Notepad needs to know where all the line breaks are to draw the text correctly, so it is likely to parse them at program start.
driis
Absolutely - it's not like it loads the file from disk every time you say "go to line". (Advanced editors *will* do that though - you try loading a 25K-line file into notepad - it'll hang while it loads the lot into memory; a smarter editor will load it as required.)
Jon Skeet
Jon, interesting library- looks like it will be useful for something I intend to do- skip the first X lines in a file using a fluent interface.
RichardOD
@RichardOD: I've found it makes it really easy to write LINQ queries based on text files.
Jon Skeet
+2  A: 

If you are dealing with a fixed-width data format (i.e. you know all the lines to be the same length), you can multiply the line length by your desired line number and use Stream.Seek to find the starting point of the nth line.
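For that fixed-width case, the seek arithmetic might look like this (a sketch; the exact line length, the "\r\n" terminator, and the single-byte ASCII encoding are all assumptions):

```csharp
using System;
using System.IO;
using System.Text;

public static class FixedWidthReader
{
    // Jump straight to line lineIndex (0-based) in a fixed-width file.
    // Assumes every line is exactly lineLength bytes including "\r\n",
    // and a single-byte encoding such as ASCII.
    public static string ReadFixedWidthLine(string path, int lineIndex, int lineLength)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            // The offset of the nth line is simply n * lineLength.
            stream.Seek((long)lineIndex * lineLength, SeekOrigin.Begin);

            byte[] buffer = new byte[lineLength];
            int read = stream.Read(buffer, 0, lineLength);

            // Trim the terminator and decode.
            return Encoding.ASCII.GetString(buffer, 0, read).TrimEnd('\r', '\n');
        }
    }
}
```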

If the lines are not fixed length, you need to find the right number of line breaks until you are at the beginning of the line you want. That is easiest done with StreamReader.ReadLine. (You can make an extension method to expose the file as an IEnumerable<string> as Jon Skeet suggests - this gets you nicer syntax, but under the hood you will still be using ReadLine.)

If performance is an issue, it might be (a little bit) more efficient to scan for <CR><LF> byte sequences in the file manually using the Stream.Read method. I haven't tested that; but the StreamReader obviously needs to do some work to construct a string out of the byte sequence - if you don't care about the first lines, that work can be saved, so in theory you should be able to write a scanning method that performs better. It would be a lot more work for you, however.
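A sketch of that manual byte-scanning idea (assuming '\n' line endings, optionally preceded by '\r'; note that no UTF-8 multi-byte sequence contains the byte 0x0A, so counting that byte is safe for UTF-8 as well as ASCII):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ByteScanner
{
    // Count '\n' bytes until we reach the target line (0-based),
    // then decode only that one line.
    public static string FindLineByByteScan(string path, int lineIndex)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            int newlines = 0;
            int b;
            // Skip lineIndex line breaks without decoding anything.
            while (newlines < lineIndex && (b = stream.ReadByte()) != -1)
            {
                if (b == '\n') newlines++;
            }

            // Collect bytes up to the next line break and decode them.
            List<byte> lineBytes = new List<byte>();
            while ((b = stream.ReadByte()) != -1 && b != '\n')
            {
                if (b != '\r') lineBytes.Add((byte)b);
            }
            return Encoding.UTF8.GetString(lineBytes.ToArray());
        }
    }
}
```

FileStream buffers internally, so the ReadByte loop is a reasonable stand-in for the explicit buffered Stream.Read approach described above.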

driis
The lines are not fixed length, but finding the length of each line can be done. However, that takes as long as reading each line does, for such big files with 25,000 lines.
John
If you're only going to the 5th line, it doesn't need to read all the lines though...
Jon Skeet
If you don't know each line's length ahead of time, there is no option other than running through each line one way or another to find a specific line. There is no magic shortcut. If this is a file that gets appended to, and you need to process new data, you could store the last byte offset between reads and start from there on the next read.
driis
+2  A: 

You can't jump directly to a line in a text file unless every line is fixed width and you are using a fixed-width encoding (i.e. not UTF-8 - which is one of the most common now).

The only way to do it is to read lines and discard the ones you don't want.

Alternatively, you might put an index at the top of the file (or in an external file) that tells it (for example) that line 1000 starts at byte offset [x], line 2000 starts at byte offset [y] etc. Then use .Position or .Seek() on the FileStream to move to the nearest indexed point, and walk forwards.
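A sketch of that index idea (the interval of 1000 lines and the helper names are illustrative choices, not an established API):

```csharp
using System.Collections.Generic;
using System.IO;

public static class LineIndexer
{
    // Scan the file once, recording the byte offset of every
    // interval-th line (0, interval, 2*interval, ...).
    public static Dictionary<int, long> BuildLineIndex(string path, int interval)
    {
        Dictionary<int, long> index = new Dictionary<int, long>();
        using (FileStream stream = File.OpenRead(path))
        {
            long offset = 0;
            int lineNumber = 0;
            index[0] = 0; // line 0 starts at offset 0

            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                offset++;
                if (b == '\n')
                {
                    lineNumber++;
                    if (lineNumber % interval == 0)
                    {
                        index[lineNumber] = offset; // start of this line
                    }
                }
            }
        }
        return index;
    }

    // Seek to the nearest indexed line, then walk forwards.
    public static string ReadLineUsingIndex(
        string path, int lineIndex, int interval, Dictionary<int, long> index)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            int nearest = (lineIndex / interval) * interval;
            stream.Seek(index[nearest], SeekOrigin.Begin);
            using (StreamReader reader = new StreamReader(stream))
            {
                for (int i = nearest; i < lineIndex; i++) reader.ReadLine();
                return reader.ReadLine();
            }
        }
    }
}
```

The index costs one full scan up front, but every subsequent lookup reads at most interval lines instead of lineIndex lines.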

Assuming the simplest approach (no index), the code in Jon's example should work fine. If you don't want LINQ, you can knock up something similar in .NET 2.0 + C# 2.0:

// to read multiple lines in a block
public static IEnumerable<string> ReadLines(
        string path, int lineIndex, int count) {
    if (string.IsNullOrEmpty(path)) throw new ArgumentNullException("path");
    if (lineIndex < 0) throw new ArgumentOutOfRangeException("lineIndex");
    if (count < 0) throw new ArgumentOutOfRangeException("count");
    using (StreamReader reader = File.OpenText(path)) {
        string line;
        while (count > 0 && (line = reader.ReadLine()) != null) {
            if (lineIndex > 0) {
                lineIndex--; // skip
                continue;
            }
            count--;
            yield return line;
        }
    }
}
// to read a single line
public static string ReadLine(string path, int lineIndex) {
    foreach (string line in ReadLines(path, lineIndex, 1)) {
        return line;
    }
    throw new IndexOutOfRangeException();
}

If you need to test values of the line (rather than just line index), then that is easy enough to do too; just tweak the iterator block.

Marc Gravell
+1  A: 

If you are going to be looking up a lot of different lines from the file (but not all), then you may get some benefit from building an index as you go. Use any of the suggestions that are already here, but as you go along build up an array of byte-offsets for any lines that you have already located so that you can save yourself from re-scanning the file from the beginning each time.

ADDENDUM:
There is one more way you can do it fast if you only need the occasional 'random' line, but at the cost of a more complicated search (If Jon's answer is fast enough, I'd definitely stick with that for simplicity's sake).

You could do a 'binary search': start looking halfway down the file for the start of a line; the embedded number on the first line you find tells you roughly which line you have landed on. Then, based on whether the line you are looking for falls before or after that number, you keep splitting recursively.

For extra performance you could also make the assumption that the lines are roughly the same length and have the algorithm 'guess' the approximate position of the line you are looking for relative to the total number of lines in the file and then perform this search from there onwards. If you do not want to make assumptions about the length of the file you can even make it self-prime by just splitting in half first, and using the line number it finds first as an approximation of how many lines there are in the file as a whole.

Definitely not trivial to implement, but if you have a lot of random access in files with a large number of lines, it may pay off in performance gains.

jerryjvl
A: 

If you need to be able to jump to line 24,000, a function that does ReadLine() in the background will be a bit slow.

If the line number is high, you may want to make some sort of educated guess as to where in the file the line may be and start reading from there. That way, to get to line 24,567 you don't have to read 24,566 lines first. You can skip to somewhere in the middle, find out what line you are on based on the number after the \t, and then count from there.
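A sketch of that educated-guess approach, relying on the line number embedded after the tab (the field layout - a leading "1", a tab, then the line number - and the totalLines parameter are assumptions; for line 1 and other edge cases it falls back to a plain scan):

```csharp
using System;
using System.IO;

public static class GuessingReader
{
    // Guess a byte position proportional to the target line number,
    // resync to the next line start, read its embedded line number,
    // and scan forwards from there.
    public static string ReadLineByGuess(string path, int targetLine, int totalLines)
    {
        using (FileStream stream = File.OpenRead(path))
        using (StreamReader reader = new StreamReader(stream))
        {
            // Guess proportionally, backing off a little so we tend to
            // land before the target rather than after it.
            long guess = Math.Max(
                0, stream.Length * (long)(targetLine - 1) / totalLines - 1024);
            stream.Seek(guess, SeekOrigin.Begin);
            reader.DiscardBufferedData();
            reader.ReadLine(); // throw away the partial line we landed in

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] parts = line.Split('\t');
                if (parts.Length < 2) break;

                // The embedded line number is the start of the second
                // tab-separated field.
                string second = parts[1];
                int space = second.IndexOf(' ');
                int current = int.Parse(
                    space < 0 ? second : second.Substring(0, space));

                if (current == targetLine) return line;
                if (current > targetLine) break; // overshot the guess
            }
        }

        // Overshot (or malformed data): fall back to a sequential scan.
        using (StreamReader reader = File.OpenText(path))
        {
            for (int i = 1; i < targetLine; i++) reader.ReadLine();
            return reader.ReadLine();
        }
    }
}
```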

A while back I worked with a dev who had to build a DB before RDBMSs were common. His solution to your problem was similar to what I just described, but in his case he kept a map in a separate file. The map recorded every hundredth line's location in the document. A map like this can be loaded very quickly, which may improve read times. At the time his system was very fast and efficient for read-only data, but not very good for read/write data (every time you change the lines, you have to rebuild the whole map, which is not very efficient).

Sruly