I am expecting a String from an application that looks like:

john|COL-DELIM|doe|COL-DELIM|55|ROW-DELIM|george|COL-DELIM|jetson|COL-DELIM|90|ROW-DELIM|

I want to do two things:

1) Verify the string "looks" correct (i.e. does it match a regex)

2) Pull out each "row", then be able to parse each row

The values in between the delimiters (|COL-DELIM| and |ROW-DELIM|) can be any value (not just strings, numbers, whatever).

((.*)(\|COL-DELIM\|)(.*)(\|COL-DELIM\|)(.*)(\|ROW-DELIM\|))+

Naturally, that doesn't work because of the (.*) groups, which happily match across the delimiters as well. Any suggestions?

+5  A: 

People don't seem to get the fact that they don't have to use REs (or SQL, but that's another issue :-) for every task, especially those where procedural code is cleaner.

If you're limiting yourself to using REs, I think that's a lack of vision.

I would simply process the string, token by token, where a token is one of:

  • a non-delimiter.
  • a column delimiter.
  • a row delimiter.

Start with an empty column list, then extract (using indexOf/substring) the text up to the next row or column delimiter, whichever comes first, and add that text to the column list.

If the delimiter is column, keep going.

If the delimiter is row, check the number of columns and process the list as required.

If there's no final row delimiter and the column list is non-empty, then the format was invalid.

Sorry if you were really after an RE method but I don't believe it's required (or even desirable) here.

Pseudo-code (only a first cut, may be slightly buggy) follows:

def processStr(s):
    if not s.endsWith ("|ROW-DELIM|"):
        error "Invalid format"
    columnList = []
    while not s.equals (""):
        nextRowDelim = s.indexOf ("|ROW-DELIM|")
        nextColDelim = s.indexOf ("|COL-DELIM|")
        if nextColDelim == NotFound:
            nextColDelim = nextRowDelim + 1
        nextDelim = minimumOf (nextRowDelim,nextColDelim)

        columnList.add (s.substring (0, nextDelim))
        s = s.substring (nextDelim)

        if nextDelim == nextRowDelim:
            s = s.substring (length ("|ROW-DELIM|"))
            processColumns (columnList)
            columnList = []
        else:
            s = s.substring (length ("|COL-DELIM|"))

You could easily add code to check the correct number of columns in this code, or in processColumns(), if that was your desire.
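For anyone who wants something they can compile, here is a rough Java rendering of the pseudo-code above. It is only a sketch: the class name DelimitedParser and the method parse are made up for illustration, and instead of calling processColumns() it simply collects each row into a list. It walks the string with an index rather than repeatedly chopping the front off, which avoids copying the remainder on every step.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names are illustrative, not from any library.
final class DelimitedParser {
    static final String COL = "|COL-DELIM|";
    static final String ROW = "|ROW-DELIM|";

    // Returns one list of column values per row.
    // Throws if the input does not end with a row delimiter.
    static List<List<String>> parse(String s) {
        if (!s.endsWith(ROW)) {
            throw new IllegalArgumentException("Invalid format: missing final row delimiter");
        }
        List<List<String>> rows = new ArrayList<>();
        List<String> columns = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            int nextRow = s.indexOf(ROW, pos);
            int nextCol = s.indexOf(COL, pos);
            if (nextRow < 0) {
                // Only reachable when two delimiters overlap, e.g. "|COL-DELIM|ROW-DELIM|".
                throw new IllegalArgumentException("Invalid format: overlapping delimiters");
            }
            boolean isCol = nextCol >= 0 && nextCol < nextRow;
            int next = isCol ? nextCol : nextRow;
            columns.add(s.substring(pos, next));
            pos = next + (isCol ? COL.length() : ROW.length());
            if (!isCol) {
                // End of a row: hand the columns over and start a fresh list.
                rows.add(columns);
                columns = new ArrayList<>();
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String sample = "john|COL-DELIM|doe|COL-DELIM|55|ROW-DELIM|"
                      + "george|COL-DELIM|jetson|COL-DELIM|90|ROW-DELIM|";
        System.out.println(parse(sample)); // [[john, doe, 55], [george, jetson, 90]]
    }
}
```

A column-count check would fit naturally at the point where a finished row is added to rows.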

paxdiablo
True, I was just going to use String.split [in Java] and do what you suggested, but I thought I might at least be able to use Regex to perform a "matches". Just to verify the String is in the expected format.
codecraig
+1 (especially since the OP didn't manage to up-vote you)
Tomalak
@codecraig, given your spec that you can have anything between delimiters (including empty strings) and the fact that you don't need "|COL-DELIM||ROW-DELIM|" at the end of each row, the only real check is that the string ends in a row delim (and maybe number of columns is right and/or consistent).
paxdiablo
REs probably wouldn't help much there, I'm afraid.
paxdiablo
+3  A: 

You don't have to use ".*" to match "anything". In fact, most of the time, ".*" is wrong.

If your col-delim was a single character (say, ";"), you can use this to match a column:

[^;]*                      // "anything that's *not* a semi-colon"
([^;]*);([^;]*);([^;]*)\n  // three columns, ending with \n
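To make the single-character case concrete, here is a small Java sketch using java.util.regex. The class SemicolonRows and its methods are hypothetical names chosen for the example. One deliberate change from the pattern above: the character class also excludes the newline, so a column cannot silently swallow a row terminator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; assumes ';' separates columns and '\n' ends a row.
final class SemicolonRows {
    // Three columns of "anything that is not ';' or a newline", then a newline.
    static final Pattern ROW = Pattern.compile("([^;\\n]*);([^;\\n]*);([^;\\n]*)\\n");
    // The whole input must be one or more such rows and nothing else.
    static final Pattern WHOLE = Pattern.compile("(?:" + ROW.pattern() + ")+");

    static boolean looksValid(String input) {
        return WHOLE.matcher(input).matches();
    }

    static List<List<String>> rows(String input) {
        List<List<String>> out = new ArrayList<>();
        Matcher m = ROW.matcher(input);
        while (m.find()) {
            out.add(Arrays.asList(m.group(1), m.group(2), m.group(3)));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "john;doe;55\ngeorge;jetson;90\n";
        System.out.println(looksValid(input)); // true
        System.out.println(rows(input));       // [[john, doe, 55], [george, jetson, 90]]
    }
}
```

Because each column class excludes the delimiter, the match cannot run past a column boundary the way ".*" does.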

Since this task is essentially parsing CSV, and regex is not entirely the best tool for parsing, I suggest you look for a Java CSV parsing package.

If "|COL-DELIM|" and "|ROW-DELIM|" are indeed fixed sequences of characters, I suggest you split() the string on them instead of relying on regex.

  • split on "|ROW-DELIM|" to get an array of "row" strings
  • split each "row" string on "|COL-DELIM|" to get an array of columns
  • check the array length to ensure you have the correct number of columns
  • iterate the columns array to process the data.
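The steps above might look roughly like this in Java. Treat it as a sketch: SplitParser, parse, and the expectedColumns parameter are names invented for the example. Two details of String.split are worth knowing here: it takes a regex, so the delimiters go through Pattern.quote() because "|" is a regex metacharacter, and it drops trailing empty strings unless you pass a negative limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the split-based approach.
final class SplitParser {
    static final String COL = "|COL-DELIM|";
    static final String ROW = "|ROW-DELIM|";

    static List<List<String>> parse(String input, int expectedColumns) {
        if (!input.endsWith(ROW)) {
            throw new IllegalArgumentException("Input must end with a row delimiter");
        }
        // Strip the final row delimiter so the split yields no trailing empty row.
        String body = input.substring(0, input.length() - ROW.length());

        List<List<String>> rows = new ArrayList<>();
        // Limit -1 keeps empty strings, so empty rows and columns are preserved.
        for (String row : body.split(Pattern.quote(ROW), -1)) {
            String[] cols = row.split(Pattern.quote(COL), -1);
            if (cols.length != expectedColumns) {
                throw new IllegalArgumentException(
                    "Expected " + expectedColumns + " columns but found " + cols.length);
            }
            rows.add(Arrays.asList(cols));
        }
        return rows;
    }

    public static void main(String[] args) {
        String sample = "john|COL-DELIM|doe|COL-DELIM|55|ROW-DELIM|"
                      + "george|COL-DELIM|jetson|COL-DELIM|90|ROW-DELIM|";
        System.out.println(parse(sample, 3)); // [[john, doe, 55], [george, jetson, 90]]
    }
}
```

The endsWith check plus the column count is about as much format validation as the stated spec allows, since any value (including an empty one) may appear between delimiters.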

This approach will of course work for single-character delimiters as well.

Tomalak
The split() approach will work but for large data sets will create lots of temporary objects. This could be a problem; it would be more efficient (from a memory perspective) to parse the string manually.
Mr. Shiny and New
I don't think you should worry too much about how many arrays and strings this produces along the way. This "overhead" (if you want to call it that) is negligible. The runtime and the garbage collector will take care of it; that's what they are for.
Tomalak