Multi-line regex search in whole file

+3 Q:

I've found loads of examples on how to replace text in files using regex. However, it all boils down to two versions:
1. Iterate over all lines in the file and apply the regex to each single line.
2. Load the whole file.

No. 2 is not feasible with "my" files; they're about 2 GiB.
As for No. 1: this is currently my approach, but I was wondering: what if I need to apply a regex spanning more than one line?

Perhaps you could load 2 lines at a time (or more, depending on how many lines you think your matches are going to span) and overlap them, e.g.: load lines 1-2, then in the next loop load lines 2-3, then 3-4, and run your multi-line regexes over the combined lines in each loop.

Mark B 2009-10-02 13:27:14

Good idea; however, every line would possibly be regex'd multiple times. One would have to consider possible side effects.

Nils 2009-10-02 14:00:37

Hmm yes, you're right. Perhaps only match when the match starts on the first line (before any instance of a line break)?

Mark B 2009-10-02 14:06:04
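
The overlapping-window idea from the comments above could be sketched roughly as follows. This is only an illustration in Python (the thread is about C#, but the structure should carry over); the function name `search_overlapping` and the `window` parameter are made up for this sketch. It assumes a match never spans more than `window` lines, and it reports a match only when it starts in the first line of the window, so the same match is not reported twice.

```python
import re
from collections import deque
from typing import Iterable, Iterator, Tuple


def search_overlapping(lines: Iterable[str], pattern: re.Pattern,
                       window: int = 2) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, matched_text) for matches spanning up to `window` lines."""

    def scan(buf, line_no):
        text = "".join(buf)
        first_len = len(buf[0])
        for m in pattern.finditer(text):
            # Only accept matches that start inside the first line of the
            # window; later ones are picked up when the window slides on.
            if m.start() < first_len:
                yield line_no, m.group(0)

    buf = deque()
    first_line_no = 1  # 1-based number of the first line in the window
    for line in lines:
        buf.append(line)
        if len(buf) == window:
            yield from scan(buf, first_line_no)
            buf.popleft()
            first_line_no += 1
    # Drain the remaining (shorter) windows at the end of the input.
    while buf:
        yield from scan(buf, first_line_no)
        buf.popleft()
        first_line_no += 1
```

With a real file, `lines` would simply be the open file object, so only `window` lines are held in memory at any time. Each line is still scanned up to `window` times, which is the cost Nils points out.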

+1 A:

Regex is not the way to go, especially not with such large amounts of text. Create a little parser of your own:

read the file line by line;
for each line:
- loop through the line char by char, keeping track of any opening/closing string literals
- when you encounter '/*' (and you're not 'inside' a string), store that offset and loop until you encounter the first '*/', then store that offset as well

That will give you the start and end offsets of all comment blocks. You should now be able to replace them by creating a temp file and copying the text from the original file into it (writing something else wherever you're inside a comment block, of course).
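
A minimal sketch of such a scanner, in Python rather than C# and under a few assumptions that are not in the answer: string literals are double-quoted with backslash escapes, the string and comment state is carried across line breaks, and offsets are counted in characters from the start of the file. The name `find_comment_blocks` is invented for the example.

```python
from typing import Iterable, List, Tuple


def find_comment_blocks(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of /* ... */ blocks (end exclusive)."""
    blocks = []
    offset = 0          # absolute offset of the first character of the current line
    in_string = False
    in_comment = False
    start = -1
    for line in lines:
        i = 0
        while i < len(line):
            two = line[i:i + 2]
            if in_comment:
                if two == "*/":
                    blocks.append((start, offset + i + 2))
                    in_comment = False
                    i += 2
                    continue
            elif in_string:
                if line[i] == "\\":
                    i += 2      # skip the escaped character
                    continue
                if line[i] == '"':
                    in_string = False
            elif line[i] == '"':
                in_string = True
            elif two == "/*":
                in_comment = True
                start = offset + i
                i += 2
                continue
            i += 1
        offset += len(line)
    return blocks
```

Only one line is held in memory at a time; the returned offsets can then drive the copy into the temp file.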

Edit: source files of 2GiB??

Bart Kiers 2009-10-02 13:42:09

Did I say source? ;-) No, it's "raw" data, CSV in fact.

Nils 2009-10-02 13:56:42

Ah, I see. Don't know C#, but would imagine it wouldn't even be possible to create such large source files.

Bart Kiers 2009-10-02 14:24:32

I would say you should pre-parse/normalize the data before doing your replacements so that each line describes one possible set of data that needs to have replacements applied. Otherwise you get into complications with data integrity that cannot really be solved without a host of other difficulties.

If there is a way to chunk the data into logical blocks then you could build a program that uses a mapreduce pattern to parse the data.
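
As one possible reading of "one logical record at a time": if the multi-line content comes from quoted CSV fields that contain line breaks (an assumption, the thread does not say so), a CSV reader can reassemble each record before the regex is applied. The sketch below uses Python's standard `csv` module; `replace_in_records` is a made-up name.

```python
import csv
import io
import re


def replace_in_records(src, dst, pattern: re.Pattern, repl: str) -> None:
    """Apply `pattern` to every field of every CSV record, one record at a time."""
    reader = csv.reader(src)                       # joins quoted fields that span lines
    writer = csv.writer(dst, lineterminator="\n")
    for record in reader:
        writer.writerow([pattern.sub(repl, field) for field in record])


# Usage with in-memory data; real files should be opened with newline="".
source = io.StringIO('1,"first\nsecond",x\n2,plain,y\n')
target = io.StringIO()
replace_in_records(source, target, re.compile(r"first\s+second"), "merged")
```

Only one record is in memory at a time, so the size of the file does not matter as long as individual records stay small.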

Harv 2009-10-02 14:47:18

I'm with Bart; you really should be using some kind of parser for this.

Or, if you don't mind spawning a child process, you could just use sed (there's a native port on Windows, or you can use Cygwin).

elo80ka 2009-10-03 00:47:00

Here's the answer:
There is no easy way.

I found a StreamRegex class which might be able to do what I am looking for.
From what I could grasp of the algorithm:

Start at the beginning of the file with an empty buffer
do (
- add a chunk of the file to the buffer
- if there is a match in the buffer
  - mark the match
  - drop all data which appeared before the end of the match from the buffer
) while there is still something of the file left

That way it is not necessary to load the full file, or at least the chances of loading the full file into memory are reduced.
However, the worst case is that there is no match in the whole file; in that case the full file will be loaded into memory.
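
The loop described above might look roughly like this. It is a Python sketch of the idea as summarized here, not the code of the StreamRegex class itself; `stream_search` and `chunk_size` are names chosen for the example. It assumes the pattern never matches the empty string, and it shares the weaknesses of the description: the buffer grows until a match is found, and an open-ended match (such as `a+`) that happens to sit at the end of the buffer may be reported before the next chunk could have extended it.

```python
import io
import re
from typing import BinaryIO, Iterator, Tuple


def stream_search(stream: BinaryIO, pattern: re.Pattern,
                  chunk_size: int = 64 * 1024) -> Iterator[Tuple[int, bytes]]:
    """Yield (absolute_offset, matched_bytes) while reading `stream` in chunks."""
    buffer = b""
    base = 0  # absolute file offset of buffer[0]
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        consumed = 0
        for m in pattern.finditer(buffer):
            yield base + m.start(), m.group(0)   # "mark the match"
            consumed = m.end()
        # Drop everything up to the end of the last match.
        buffer = buffer[consumed:]
        base += consumed


# Usage with in-memory data and a deliberately tiny chunk size.
data = b"x,1\nfoo\nbar,2\nzzz\nfoo\nbar,3\n"
matches = list(stream_search(io.BytesIO(data), re.compile(rb"foo\nbar"), chunk_size=5))
```

A cap on the buffer size (discarding old data once it exceeds the longest expected match) would avoid the worst case, at the price of possibly missing longer matches.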

Nils 2009-11-19 08:31:53

If you don't mind getting your hands a little dirty (and your regex is simple enough, or perhaps you have a strong desire for speed and don't mind suffering a bit), you can use Ragel. It can target C#, though the site doesn't mention it. You'll need to wrap a FileStream to provide a buffered indexer, or use a memory-mapped file (with unsafe pointers) in a 64-bit process, to use this with large files though.
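
The memory-mapped-file part of this suggestion can be illustrated without Ragel. The sketch below swaps in Python's standard `mmap` and `re` modules instead of a Ragel-generated C# scanner, so it only shows the general idea: the operating system pages the file in on demand, and the regex engine scans the mapping as if it were one big byte string. `search_mapped` is a hypothetical name; a 64-bit process is assumed for files of this size, and an empty file cannot be mapped.

```python
import mmap
import re
from typing import Iterator, Tuple


def search_mapped(path: str, pattern: re.Pattern) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, matched_bytes) for a bytes pattern over a memory-mapped file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in pattern.finditer(mm):
                yield m.start(), m.group(0)
```

The pattern has to be compiled from bytes (for example `re.compile(rb"foo\nbar")`), since the mapping exposes raw bytes rather than decoded text.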

Nathan Howell 2009-11-19 08:51:25
