ansaurus

Question

Answer 1

+1 A:

Is your problem that you have an input list that is guaranteed to be in the format you showed here, and you just need to split it out into individual items? For that, you probably don't need a regular expression at all.

If the strings can't contain commas, just split on comma to get your individual tokens. Then, for the tokens that aren't numbers, strip the leading and trailing quote and replace '' with '. Problem solved, no regex required.
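A minimal sketch of that split-and-strip idea, assuming (as the paragraph above does) that no field contains a comma. The class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class SimpleSplitSketch {
    // Only valid under the assumption that no field contains a comma.
    static List<String> parse(String line) {
        List<String> out = new ArrayList<>();
        for (String raw : line.split(",")) {
            String token = raw.trim();
            // Quoted token: drop the surrounding quotes, then unescape '' to '.
            if (token.length() >= 2 && token.startsWith("'") && token.endsWith("'")) {
                token = token.substring(1, token.length() - 1).replace("''", "'");
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("1, 'a', 'it''s'"));
    }
}
```

Numbers are returned as plain strings; nothing here validates the input.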

Brian Schroth 2009-11-10 13:24:35

yes... from the question it's not clear what he wants to do

mga 2009-11-10 13:36:12

As elaborated in the question's comments, it appears that strings *can* contain commas.

Amber 2009-11-10 13:36:38

And in case a string may contain commas, you just have to implement a custom split method that detects the 'string' types, then proceed as Brian suggested. No regexp needed.
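One way such a custom split could look, as a rough sketch only: it tracks whether the scanner is inside quotes and splits on commas outside them. The names are hypothetical, and it does not reject malformed lines such as an unterminated quote.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplitSketch {
    // Splits on commas that are outside quotes; tokens keep their raw quoting.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\'') {
                // An escaped quote ('') toggles twice, leaving the state unchanged.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim());
        return fields;
    }

    // Removes surrounding quotes and unescapes '' for quoted tokens.
    static String unquote(String token) {
        if (token.length() >= 2 && token.startsWith("'") && token.endsWith("'")) {
            return token.substring(1, token.length() - 1).replace("''", "'");
        }
        return token;
    }

    public static void main(String[] args) {
        for (String t : split("'hel,lo', 'hel'',''lo', ',''hello', 42")) {
            System.out.println(unquote(t));
        }
    }
}
```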

Andreas_D 2009-11-10 13:38:39

Interestingly enough, however, the regex involved isn't actually that complicated. See my answer for a fairly simple one.

Amber 2009-11-10 13:41:42

Sorry if I was not clear enough. The string can contain commas, e.g. 'hel,lo' OR 'hel'',''lo' OR ',''hello' OR any other weird combination you can imagine.

Bob 2009-11-10 13:48:19

well, Andreas_D, I don't think that a custom split function to find the tokens would be a trivial task. Given that commas are allowed, I think a regex is exactly the right solution for the problem. Although I second Bart's suggestion of checking out an existing CSV parser library first (not sure if the one he linked likes the format Bob has of single quotes and escaping quotes with quotes, so it might not be the right choice).

Brian Schroth 2009-11-10 18:16:39

Answer 2

+1 A:

You might be better off doing this as a two-step process; first break it into fields, then post-process the content of each field.

\s*('(?:''|[^'])*'|\d+)\s*(?:,|$)

Should match a single field. Then just iterate through each match (by alternating .find() and then .group(1)) to grab each field in order. You can convert double-apostrophes into singles after pulling the field value out; just do a simple string replace for '' -> '.
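A sketch of that loop in Java, using the field pattern above unchanged. The class and method names are only illustrative, and the code assumes the line is well formed (it does no validation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldRegexSketch {
    // Group 1 is the field: a quoted string ('' is an escaped quote) or digits.
    private static final Pattern FIELD =
        Pattern.compile("\\s*('(?:''|[^'])*'|\\d+)\\s*(?:,|$)");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String field = m.group(1);
            if (field.startsWith("'")) {
                // Strip the outer quotes, then turn '' back into '.
                field = field.substring(1, field.length() - 1).replace("''", "'");
            }
            out.add(field);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("1, 'hel'',''lo', ',''hello'"));
    }
}
```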

Amber 2009-11-10 13:27:02

Answer 3

+2 A:

All your example strings satisfy the following regex:

('(''|[^'])*'|\d+)(\s*,\s*('(''|[^'])*'|\d+))*

Meaning:

(               # open group 1
  '             #   match a single quote
  (''|[^'])*    #   match two single quotes OR a single character other than a single quote, zero or more times
  '             #   match a single quote
  |             #   OR
  \d+           #   match one or more digits
)               # close group 1
(               # open group 3
  \s*,\s*       #   match a comma possibly surrounded by white space characters
  (             #   open group 4
    '           #     match a single quote
    (''|[^'])*  #     match two single quotes OR a single character other than a single quote, zero or more times
    '           #     match a single quote
    |           #     OR
    \d+         #     match one or more digits
  )             #   close group 4
)*              # close group 3 and repeat it zero or more times
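As a side note, Java can compile a regex written in this commented layout if you pass Pattern.COMMENTS, which ignores whitespace and #-comments inside the pattern. A small sketch (the class name is made up, and the comments are shortened):

```java
import java.util.regex.Pattern;

class CommentedRegexSketch {
    // Same expression as above, kept readable with Pattern.COMMENTS.
    static final Pattern LINE = Pattern.compile(
        "(                 # first field\n" +
        "  '(''|[^'])*'    #   quoted string, '' is an escaped quote\n" +
        "  | \\d+          #   OR one or more digits\n" +
        ")\n" +
        "(                 # every further field\n" +
        "  \\s*,\\s*       #   comma with optional surrounding whitespace\n" +
        "  ( '(''|[^'])*' | \\d+ )\n" +
        ")*                # repeated zero or more times\n",
        Pattern.COMMENTS);

    public static void main(String[] args) {
        System.out.println(LINE.matcher("1, 'a', 'b'").matches());
        System.out.println(LINE.matcher("'a', 'b', 'c").matches());
    }
}
```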

A small demo:

import java.util.*;
import java.util.regex.*;

public class Main { 

    public static List<String> tokens(String line) {
        // Validate the whole line first; a malformed line yields null.
        if(!line.matches("('(''|[^'])*'|\\d+)(\\s*,\\s*('(''|[^'])*'|\\d+))*")) {
            return null;
        }
        // Then pull out each field; quoted strings keep their quotes and '' escapes.
        Matcher m = Pattern.compile("'(''|[^'])*+'|\\d++").matcher(line);
        List<String> tok = new ArrayList<String>();
        while(m.find()) tok.add(m.group());
        return tok;
    }

    public static void main(String[] args) {
        String[] tests = {
                "1, 2, 3",
                "'a', 'b',    'c'",
                "'a','b','c'",
                "1, 'a', 'b'",
                "'this''is''one string', 1, 2",
                "'''this'' is a weird one', 1, 2",
                "'''''''', 1, 2",
                /* and some invalid ones */
                "''', 1, 2",
                "1 2, 3, 4, 'aaa'",
                "'a', 'b', 'c"
        };
        for(String t : tests) {
            System.out.println(t+" --tokens()--> "+tokens(t));
        }
    }
}

Output:

1, 2, 3 --tokens()--> [1, 2, 3]
'a', 'b',    'c' --tokens()--> ['a', 'b', 'c']
'a','b','c' --tokens()--> ['a', 'b', 'c']
1, 'a', 'b' --tokens()--> [1, 'a', 'b']
'this''is''one string', 1, 2 --tokens()--> ['this''is''one string', 1, 2]
'''this'' is a weird one', 1, 2 --tokens()--> ['''this'' is a weird one', 1, 2]
'''''''', 1, 2 --tokens()--> ['''''''', 1, 2]
''', 1, 2 --tokens()--> null
1 2, 3, 4, 'aaa' --tokens()--> null
'a', 'b', 'c --tokens()--> null

But, can't you simply use an existing (and proven) CSV parser instead? Ostermiller's CSV parser comes to mind.

Bart Kiers 2009-11-10 13:36:54

That matches the whole line, but your first group `('(''|[^'])*'|\d+)` does the trick (for me at least)

ApoY2k 2009-11-10 13:41:25

thanks for the response. yes it does match the examples but my point is to read out the values in groups.

Bob 2009-11-10 13:44:02

Ah yes, I thought you simply wanted to validate the lines. See my edited answer for how to get the values from the line (if the line is properly formed!).

Bart Kiers 2009-11-10 13:49:23

upvote for commenting your regex! It's a Christmas miracle!

Brian Schroth 2009-11-10 13:50:23

<grumble>Those Christmas advertisements come earlier each year!</grumble> ;)

Bart Kiers 2009-11-10 13:53:42

Bart, great answer and explanation. thanks

Bob 2009-11-10 14:01:43

You're welcome Bob.

Bart Kiers 2009-11-10 14:10:29

Answer 4

A:

Matching quoted strings with RegExp is a difficult proposition. It's helpful for you that your delimiter text isn't just a single quote, but in fact it's a single quote plus one of: comma, start of line, end of line. This means the only time that back-to-back single quotes appear in a legitimate entry will be as part of string escaping.

Writing a regexp to match this isn't too hard for success cases, but for failure cases it can become very challenging.

It might be in your best interests to sanitize the text before matching it. Replace all \ instances with a literal \u005c then all '' instances with a literal \u0027 (in that order). You're providing a level of escaping here which leaves a string with no particular special characters.

Now you can use a simple pattern such as (?:(?:^\s*|\s*,\s*)(?:'([^']*)'|([^,]*?)))*\s*$

Here's a breakdown of that pattern (for clarity, I use the terminology 'set' to indicate non-capturing grouping, and 'group' to indicate capturing grouping):

(?:               Open a non-capturing / alternation set 1
  (?:             Open a non-capturing / alternation set 2
    ^\s*          Match the start of the line and any amount of white space.
    |             alternation (or) for alternation set 2
    \s*,\s*       A comma surrounded by optional whitespace
  )               Close non-capturing set 2 (we don't care about the commas once we've used them to split our data)
  (?:             Open non-capturing set 3
    '([^']*)'     Capturing group #1 matching the quoted string value option.
    |             alternation for set 3.
    ([^,]*?)      Capturing group #2 matching non-quoted entries but not including a comma (you might refine this part of the expression if for example you only want to allow numbers to be non-quoted).  This is a non-greedy match so that it'll stop at the first comma rather than the last comma.
  )               Close non-capturing set 3
)                 Close non-capturing set 1
*                 Repeat the whole set as many times as it takes (the first match will trigger the ^ start of line, the subsequent matches will trigger the ,comma delimiters)
\s*$              Consume trailing spaces until the end of line.

Your quoted parameters will be in capturing group 1, your non-quoted parameters will be in capturing group 2. Everything else will be discarded.

Then loop over the matched entries and reverse the encoding (replace \u0027 with ', and \u005c with \ in that order), and you're done.

This should be fairly fault tolerant and correctly parse some obtuse technically incorrect but recoverable scenarios such as 1, a''b, 2 but still fail on unrecoverable values such as 1, a'b, 2, while succeeding on the technically correct (but probably unintentional) entry 1, 'ab, 2'
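A rough Java sketch of the sanitize, match, and unescape steps described above. One deliberate change, which is my assumption rather than part of the answer: Java keeps only the last capture of a repeated group, so this version matches one field per find() call (anchored with \G) instead of repeating the whole set. The class and method names are hypothetical. Note also that the plain '' replacement mangles an empty quoted value ('') and a value that starts with an escaped quote, so those cases would need extra handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SanitizeThenMatchSketch {
    // One field per find(): a separator (line start or comma), then either a
    // quoted value (group 1) or an unquoted value (group 2).
    private static final Pattern FIELD = Pattern.compile(
        "\\G(?:^\\s*|\\s*,\\s*)(?:'([^']*)'|([^,]*?))(?=\\s*(?:,|$))");

    static List<String> parse(String line) {
        // Sanitize: encode backslashes first, then escaped quotes.
        String safe = line.replace("\\", "\\u005c").replace("''", "\\u0027");
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(safe);
        while (m.find()) {
            String value = m.group(1) != null ? m.group(1) : m.group(2);
            // Reverse the encoding: quotes first, then backslashes.
            out.add(value.replace("\\u0027", "'").replace("\\u005c", "\\"));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("1, 'hel'',''lo', 'a\\b'"));
    }
}
```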

MightyE 2009-11-10 14:46:53


Complex Regex getting value from string
