I've got a tough one.

I've got tab-delimited text to match with a regex.

My regex looks like:

^([\w ]+)\t(\d*)\t(\d+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)$

and an example source text is (tabs converted to \t for clarity):

JJ\t345\t0\tTest\tSome test text\tmore text: pcre:"/\x20\x62\x3b\x0a\x09\x61\x2e\x53\x74\x61\x72/"\tNone

The problem is that the 6th field of my source text contains a regex string. It can therefore contain \x09, which breaks my regex because it is treated as a tab as well.

Is there any way to tell the regex engine, "Match on \t but not on the text \x09"? My guess is no, since they're the same thing.

If not, is there any character that could be safely used for delimiting text that contains a regex string?

A: 

I would recommend encoding all of the characters in the pcre string prior to running the regular expression against it.
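As a rough sketch of what that could look like: the answer does not say which encoding to use, so Base64 is an assumption here, chosen only because its output alphabet never contains a tab. The class name `EncodedField` and the sample field value are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class EncodedField {
    // Encode a field so the result can never contain a tab character.
    // Base64 is an assumed choice; any tab-free encoding would do.
    static String encode(String raw) {
        return Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    // Recover the original field text after the line has been split or matched.
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sixth field that contains a real tab character.
        String field6 = "more text: pcre:\"/a\tb/\"";
        String line = String.join("\t",
                "JJ", "345", "0", "Test", "Some test text", encode(field6), "None");

        // The encoded line has exactly seven tab-separated fields again.
        String[] parts = line.split("\t");
        System.out.println(parts.length);
        System.out.println(decode(parts[5]).equals(field6));
    }
}
```

This only helps if you control how the file is written; the encoding has to happen before the fields are joined with tabs.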

bso
This was an excellent solution to the problem.
Wade Williams
A: 

It appears that any 0x09 would occur inside a quoted string, so maybe you can simply match "\t rather than just \t as the separator between the 6th and 7th fields.
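A minimal sketch of that idea, assuming the 6th field always ends with a double quote (as in the sample line) and that the 7th field never contains a tab. The class name `QuoteTabSeparator` and the shortened test line are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteTabSeparator {
    // Same pattern as in the question, except that field 6 may contain tabs
    // and must end with a double quote right before the final separator.
    static final Pattern LINE = Pattern.compile(
        "^([\\w ]+)\\t(\\d*)\\t(\\d+)\\t([^\\t]+)\\t([^\\t]+)\\t(.+\")\\t([^\\t]+)$");

    // Returns the seven captured fields, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[7];
        for (int i = 0; i < 7; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Field 6 contains a real tab between "a" and "b".
        String line = "JJ\t345\t0\tTest\tSome test text\tmore text: pcre:\"/a\tb/\"\tNone";
        String[] f = parse(line);
        System.out.println(f[5]);
        System.out.println(f[6]);
    }
}
```

Because `.+` is greedy and the last field cannot contain a tab, the match settles on the final "\t in the line, so tabs inside the quoted pcre string stay in field 6.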

mjv
A: 

Seems like a problem with the test case. A regex might have tabs in it, but your sample above doesn't. Your string in Java would look like:

String testString = "JJ\t345\t0\tTest\tSome test text\tmore text: pcre:\"/\\x20\\x62\\x3b\\x0a\\x09\\x61\\x2e\\x53\\x74\\x61\\x72/\"\tNone";

If you look at this string in the debugger you'll have \x09 as 4 characters instead of as 1 (the tab).
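To check this yourself, here is a small sketch that runs the pattern from the question against two lines: one where the 6th field holds the literal text \x09, and one where it holds a real tab. The class name `LiteralEscapeCheck` and the shortened pcre strings are hypothetical.

```java
import java.util.regex.Pattern;

class LiteralEscapeCheck {
    // The pattern from the question, unchanged.
    static final Pattern LINE = Pattern.compile(
        "^([\\w ]+)\\t(\\d*)\\t(\\d+)\\t([^\\t]+)\\t([^\\t]+)\\t([^\\t]+)\\t([^\\t]+)$");

    static boolean matches(String line) {
        return LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        // "\\x09" in Java source is four characters: backslash, x, 0, 9.
        String literal = "JJ\t345\t0\tTest\tSome test text\tpcre:\"/\\x09\\x61/\"\tNone";
        // Here field 6 contains one real tab character instead.
        String realTab = "JJ\t345\t0\tTest\tSome test text\tpcre:\"/\t\\x61/\"\tNone";

        System.out.println("\\x09".length());
        System.out.println(matches(literal));
        System.out.println(matches(realTab));
    }
}
```

If your real input behaves like the first line, the original regex is fine as it stands; it only fails when an actual tab byte ends up inside the field.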

Epsilon Prime