ansaurus

Question

Regex to Match Anything Except Certain Delimiters

Answer 1

+5 A:

People don't seem to get the fact that they don't have to use REs (or SQL, but that's another issue :-) for every task, especially those where procedural code is cleaner.

If you're limiting yourself to using REs, I think that's a lack of vision.

I would simply process the string, token by token, where a token is one of:

a non-delimiter.
a column delimiter.
a row delimiter.

Start with an empty column list, then extract (using indexOf/substring stuff) the text up to the next row or column delimiter, adding that text to the column list.

If the delimiter is column, keep going.

If the delimiter is row, check the number of columns and process the list as required.

If there's no final row delimiter and the column list is non-empty, then the format was invalid.

Sorry if you were really after an RE method but I don't believe it's required (or even desirable) here.

Pseudo-code (only a first cut, may be slightly buggy) follows:

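def processStr(s):
    # A well-formed string must end with a row delimiter.
    if not s.endsWith ("|ROW-DELIM|"):
        error "Invalid format"
    columnList = []
    while not s.equals (""):
        # A row delimiter always exists here because of the check above.
        nextRowDelim = s.indexOf ("|ROW-DELIM|")
        nextColDelim = s.indexOf ("|COL-DELIM|")
        # No more column delimiters: make the row delimiter win the comparison.
        if nextColDelim == NotFound:
            nextColDelim = nextRowDelim + 1
        nextDelim = minimumOf (nextRowDelim, nextColDelim)

        # Everything before the delimiter is one column value (possibly empty).
        columnList.add (s.substring (0, nextDelim))
        s = s.substring (nextDelim)

        if nextDelim == nextRowDelim:
            # End of row: skip the delimiter and hand the columns over.
            s = s.substring (length ("|ROW-DELIM|"))
            processColumns (columnList)
            columnList = []
        else:
            # End of column: skip the delimiter and keep collecting.
            s = s.substring (length ("|COL-DELIM|"))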

You could easily add code to check the correct number of columns in this code, or in processColumns(), if that was your desire.
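For readers who want something they can compile, here is a rough Java sketch of the same token-by-token idea as the pseudo-code above. The class name DelimitedParser and the choice to return the rows as a list (rather than calling processColumns) are assumptions made for illustration, not part of the original answer. It also rejects input where a delimiter overlaps the final row delimiter, a case the pseudo-code does not consider.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is an assumption for this sketch.
final class DelimitedParser {
    static final String ROW_DELIM = "|ROW-DELIM|";
    static final String COL_DELIM = "|COL-DELIM|";

    /** Returns one list of column strings per row. */
    static List<List<String>> parse(String s) {
        if (!s.endsWith(ROW_DELIM)) {
            throw new IllegalArgumentException("Invalid format: no final row delimiter");
        }
        List<List<String>> rows = new ArrayList<>();
        List<String> columns = new ArrayList<>();
        int pos = 0; // scan with an index instead of repeatedly cutting the string
        while (pos < s.length()) {
            int nextRow = s.indexOf(ROW_DELIM, pos);
            int nextCol = s.indexOf(COL_DELIM, pos);
            if (nextRow == -1) {
                // Only possible when a delimiter overlaps the final row delimiter.
                throw new IllegalArgumentException("Invalid format: overlapping delimiters");
            }
            boolean isColumn = nextCol != -1 && nextCol < nextRow;
            int next = isColumn ? nextCol : nextRow;
            columns.add(s.substring(pos, next)); // may be the empty string
            if (isColumn) {
                pos = next + COL_DELIM.length();
            } else {
                rows.add(columns);
                columns = new ArrayList<>();
                pos = next + ROW_DELIM.length();
            }
        }
        return rows;
    }
}
```

A column-count check could go where each finished row is added to the result.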

paxdiablo 2009-03-05 12:07:49

True, I was just going to use String.split [in Java] and do what you suggested, but I thought I might at least be able to use Regex to perform a "matches". Just to verify the String is in the expected format.

codecraig 2009-03-05 12:15:59

+1 (especially since the OP didn't manage to up-vote you)

Tomalak 2009-03-05 12:19:37

@codecraig, given your spec that you can have anything between delimiters (including empty strings) and the fact that you don't need "|COL-DELIM||ROW-DELIM|" at the end of each row, the only real check is that the string ends in a row delim (and maybe number of columns is right and/or consistent).

paxdiablo 2009-03-05 13:01:33

REs probably wouldn't help much there, I'm afraid.

paxdiablo 2009-03-05 13:02:03
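The check described in the comments above (a trailing row delimiter plus a consistent column count) could be sketched in Java roughly as follows. The class and method names are made up for illustration, and the rule that a row delimiter takes precedence when delimiters overlap is an assumption of this sketch.

```java
// Hypothetical validity check; names are assumptions for this sketch.
final class FormatCheck {
    static final String ROW_DELIM = "|ROW-DELIM|";
    static final String COL_DELIM = "|COL-DELIM|";

    /** True if s ends with a row delimiter and every row has the same number of columns. */
    static boolean isWellFormed(String s) {
        if (!s.endsWith(ROW_DELIM)) {
            return false;
        }
        int expected = -1; // column count of the first row, once known
        int pos = 0;
        while (pos < s.length()) {
            int rowEnd = s.indexOf(ROW_DELIM, pos);
            if (rowEnd < 0) {
                return false; // delimiters overlapped the final row delimiter
            }
            String row = s.substring(pos, rowEnd);
            int cols = 1; // n column delimiters separate n + 1 columns
            for (int i = row.indexOf(COL_DELIM); i >= 0;
                     i = row.indexOf(COL_DELIM, i + COL_DELIM.length())) {
                cols++;
            }
            if (expected < 0) {
                expected = cols;
            } else if (cols != expected) {
                return false;
            }
            pos = rowEnd + ROW_DELIM.length();
        }
        return true;
    }
}
```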

Answer 2

+3 A:

You don't have to use ".*" to match "anything". In fact, most of the time, ".*" is wrong.

If your col-delim was a single character (say, ";"), you can use this to match a column:

[^;]*                      // "anything that's *not* a semi-colon"
([^;]*);([^;]*);([^;]*)\n  // three columns, ending with \n
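As an illustration, the three-column pattern above might be applied in Java like this. The class name SemicolonRow and the Optional return type are assumptions for the sketch. Note that [^;] also matches line breaks, so the pattern is applied here to one row at a time with matches().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern from the answer.
final class SemicolonRow {
    // Three groups of "anything that is not a semi-colon", ending with \n.
    static final Pattern ROW = Pattern.compile("([^;]*);([^;]*);([^;]*)\\n");

    /** Returns the three column values, or empty if the line does not fit the pattern. */
    static Optional<List<String>> columns(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(Arrays.asList(m.group(1), m.group(2), m.group(3)));
    }
}
```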

Since this task is essentially parsing CSV, and regex is not entirely the best tool for parsing, I suggest you look for a Java CSV parsing package.

If "|COL-DELIM|" and "|ROW-DELIM|" are indeed fixed sequences of characters, I suggest you split() the string on them instead of relying on regex.

split on "|ROW-DELIM|" to get an array of "row" strings
split each "row" string on "|COL-DELIM|" to get an array of columns
check the array length to ensure you have the correct number of columns
iterate the columns array to process the data.

This approach will of course work for single-character delimiters as well.
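A minimal Java sketch of the split() steps listed above follows. Two details matter in Java specifically: String.split takes a regular expression, so the "|" characters in the delimiters have to be quoted (Pattern.quote does that), and a limit of -1 is needed to keep trailing empty columns. The class name SplitParser and the expectedColumns parameter are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; names are assumptions for this sketch.
final class SplitParser {
    // Quote the delimiters so "|" is not treated as regex alternation.
    static final String ROW_SPLIT = Pattern.quote("|ROW-DELIM|");
    static final String COL_SPLIT = Pattern.quote("|COL-DELIM|");

    static List<String[]> parse(String input, int expectedColumns) {
        // Limit -1 keeps trailing empty strings, including the piece
        // after the final row delimiter.
        String[] parts = input.split(ROW_SPLIT, -1);
        if (!parts[parts.length - 1].isEmpty()) {
            throw new IllegalArgumentException("Input does not end with a row delimiter");
        }
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < parts.length - 1; i++) {
            String[] cols = parts[i].split(COL_SPLIT, -1);
            if (cols.length != expectedColumns) {
                throw new IllegalArgumentException(
                    "Row " + i + " has " + cols.length + " columns, expected " + expectedColumns);
            }
            rows.add(cols);
        }
        return rows;
    }
}
```

With the default limit, split() silently drops trailing empty strings, which would make a row ending in an empty column look one column short.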

Tomalak 2009-03-05 12:09:18

The split() approach will work, but for large data sets it will create lots of temporary objects. This could be a problem; it would be more efficient (from a memory perspective) to parse the string manually.

Mr. Shiny and New 2009-03-05 14:31:35

I don't think you should worry too much about how many arrays and strings this produces along the way. This "overhead" (if you want to call it that) is negligible. The runtime and the garbage collector will take care of this; that's what they are for.

Tomalak 2009-03-05 15:24:05
