Hi,

I recently figured out that I haven't been using regex properly in my code. Given a tab-delimited string str, I have been using str.split("\t"). Now I realize that this is wrong and that, to match the tabs properly, I should use str.split("\\t").

However, I stumbled upon this fact by pure chance, as I was looking for regex patterns for something else. You see, the faulty code split("\t") has been working quite fine in my case, and now I am confused as to why it works if it's the wrong way to declare a regex for matching the tab character. Hence the question: I would like to actually understand how regex is handled in Java, instead of just copying the code into Eclipse and not really caring why it works...

In a similar fashion, I have come upon a piece of text which is not only tab-delimited but also comma-delimited. More clearly put, the tab-delimited lists I am parsing sometimes include "compound" items which look like item1,item2,item3, and I would like to parse them as separate elements, for the sake of simplicity. In that case the appropriate regex should be line.split("[\\t,]"), or am I mistaken here as well?
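For illustration, here is a minimal sketch of the split described above. The class name, method name, and sample line are made up for the example. The character class [\\t,] matches a single tab or a single comma, so the "compound" items come out as separate elements:

```java
import java.util.Arrays;

public class CompoundSplitDemo {

    // Hypothetical helper: splits a line on every tab or comma.
    // "[\\t,]" reaches the regex engine as [\t,], a character class
    // that matches one tab or one comma.
    static String[] splitOnTabOrComma(String line) {
        return line.split("[\\t,]");
    }

    public static void main(String[] args) {
        // Sample data invented for this sketch.
        String line = "id42\titem1,item2,item3\tend";

        // prints [id42, item1, item2, item3, end]
        System.out.println(Arrays.toString(splitOnTabOrComma(line)));
    }
}
```

Note that two adjacent delimiters (for example a comma directly after a tab) would produce an empty string between them, since each delimiter is matched on its own.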

Thanks in advance,

+9  A: 

When using "\t", the escape sequence \t is replaced by Java with the character U+0009. When using "\\t", the escape sequence \\ in \\t is replaced by Java with \, resulting in \t that is then interpreted by the regular expression parser as the character U+0009.

So both notations will be interpreted correctly. The only difference is when the escape is replaced with the corresponding character: by the Java compiler in the first case, by the regular expression parser in the second.
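A small sketch that makes the two stages visible (the class and method names are invented for this example): the string lengths show what the compiler leaves in each literal, and both splits give the same result.

```java
import java.util.Arrays;

public class TabSplitDemo {

    // "\t" is resolved by the Java compiler: the regex engine receives
    // a single character, U+0009, and matches it literally.
    static String[] splitWithLiteralTab(String line) {
        return line.split("\t");
    }

    // "\\t" is resolved by the compiler to two characters, '\' and 't';
    // the regex engine then interprets that pair as U+0009.
    static String[] splitWithRegexEscape(String line) {
        return line.split("\\t");
    }

    public static void main(String[] args) {
        String line = "alpha\tbeta\tgamma"; // sample data for this sketch

        System.out.println("\t".length());  // prints 1
        System.out.println("\\t".length()); // prints 2

        // Both lines print [alpha, beta, gamma]
        System.out.println(Arrays.toString(splitWithLiteralTab(line)));
        System.out.println(Arrays.toString(splitWithRegexEscape(line)));
    }
}
```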

Gumbo
so how come it is considered "wrong" to use `split("\t")` for parsing a tab-delimited string?
posdef
@posdef Is it? "\t The tab character ('\u0009')" is what the Oracle reference gives you for a tab in a regex. See http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html#sum
InsertNickHere
@InsertNickHere - actually, the Oracle reference tells you to use `\t` in a String. If this String is given by a literal, you need to double the backslash; see the subsequent paragraph: http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html#bs But the reference also says that you can use `x` for the character `x`, so it is NOT wrong to use "\t" or "\u0009".
Carlos Heuberger
@Carlos OK, I didn't know about that.
InsertNickHere
@InsertNickHere: well, the posts I have read clearly state that any regex should be escaped with an extra backslash in Java, which makes my single-backslash expression "wrong". I was a bit dumbfounded, as the "wrong" expression did work in my code. I presume the explanation is as @Gumbo puts it above.
posdef
A: 

\ is considered an escape character in Java, so to get a correct regex you need to escape \ with another \, followed by t to indicate a tab.

This tutorial will help more

Jaydeep
Thanks for the reply, though I think you should read the original question again... I have already stated that I had learned the correct way to get the regex, and the escaping. My question is why the "wrong" regex still worked. By the way, I posted the question AFTER reading the tutorial...
posdef
Java has escape sequences to indicate some special characters, e.g. newline, backslash, tab, etc. So (\\t==\t) and (\t=tab char). But I am not sure how regex internally takes care of the tab.
Jaydeep