tags:

views:

829

answers:

4
+1  Q: 

Java regex split

I have some data formatted like the following:

2009.07.02 02:20:14  40.3727   28.2330        6.4      2.6  -.-  -.-   BANDIRMA-BALIKESIR
2009.07.02 01:38:34  38.3353   38.8157        3.5      2.7  -.-  -.-   KALE (MALATYA)
2009.07.02 00:10:28  38.8838   26.9328        3.0      3.0  -.-  -.-   CANDARLI KÖRFEZI (EGE DENIZI)
2009.07.01 23:33:31  36.8027   34.0975        8.2      2.9  -.-  -.-   GÜZELOLUK-ERDEMLI (MERSIN)
2009.07.01 22:32:44  38.9260   27.0338        5.0      3.4  -.-  -.-   CANDARLI KÖRFEZI (EGE DENIZI)
2009.07.01 22:12:37  40.2120   41.0378        3.7      2.9  -.-  -.-   OVACIK-ILICA (ERZURUM)
2009.07.01 22:10:53  38.9208   26.9502        5.0      3.5  -.-  -.-   ÇANDARLI-DIKILI (IZMIR)
2009.07.01 21:44:29  38.8695   27.1268        6.9      2.9  -.-  -.-   YUNTDAG-BERGAMA (IZMIR)
2009.07.01 21:27:53  38.9073   26.9895        5.0      3.0  -.-  -.-   CANDARLI KÖRFEZI (EGE DENIZI)
2009.07.01 21:18:19  38.9212   26.9060        5.0      3.4  -.-  -.-   CANDARLI KÖRFEZI (EGE DENIZI)
2009.07.01 21:12:15  38.8657   26.9447       13.7      3.8  -.-  -.-   CANDARLI KÖRFEZI (EGE DENIZI)
2009.07.01 21:09:43  38.9260   27.0853        5.0      3.1  -.-  -.-   ZEYTINDAG-BERGAMA (IZMIR)
2009.07.01 21:05:40  38.9153   26.9710        5.0      3.4  -.-  -.-   ÇANDARLI-DIKILI (IZMIR)
2009.07.01 20:29:02  37.6888   38.7212        5.0      3.3  -.-  -.-   AKINCILAR-KAHTA (ADIYAMAN)
2009.07.01 18:17:12  41.2700   36.0502        2.7      2.7  -.-  -.-   TAFLAN- (SAMSUN)
2009.07.01 17:50:03  38.6312   35.7962        5.0      2.8  -.-  -.-   ELBASI-BÜNYAN (KAYSERI)

I would like to split this on whitespace, but I would like the last column not to be split when it contains parentheses. I would like each line to split into 8 pieces. Is this possible?

A: 

Put this into a Regular Expression tool, such as RegexBuddy.

But for your purposes, it will be easy to split on \s+ or \s\s+ and set the limit. It depends on which parts of the text you want, which is why you use the tool to help you write your regex.
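As a rough sketch of the \s\s+ variant: it assumes each row is handled on its own and that columns are always separated by at least two whitespace characters, as they are in the sample. The class and method names are made up for illustration.

```java
import java.util.Arrays;

final class WhitespaceSplitSketch {
    // Splits on runs of two or more whitespace characters and stops after
    // 8 pieces, so the single space inside the timestamp and the single
    // spaces inside the place name are left alone.
    static String[] splitRow(String line) {
        return line.trim().split("\\s\\s+", 8);
    }

    public static void main(String[] args) {
        String line = "2009.07.02 01:38:34  38.3353   38.8157        3.5      2.7  -.-  -.-   KALE (MALATYA)";
        // Prints the 8 pieces, starting with the full timestamp.
        System.out.println(Arrays.toString(splitRow(line)));
    }
}
```

The limit of 8 matters only if a place name ever contains two consecutive spaces; in that case the remainder stays in the last piece.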

If you specifically want to avoid splitting on whitespace that is followed by "(", you can use a zero-width negative lookahead group, something like \s+(?!\(). Note that this doesn't actually solve your problem, because a name like "Words (word word)" still contains other spaces.

dlamblin
This doesn't work. Splitting with \s+ and a limit of 8 results in the last -.- and the name field being combined, and that's assuming you are performing the split on each line. A split with a limit of 9 will work assuming you don't mind the timestamp being split into two parts also.
Trampas Kirk
He specifically said, "I would like each line to split in to 8 pieces". Clearly, I didn't spot the single whitespace character inside the timestamp. You could take each line and call .split("\\s\\s+", 8).
dlamblin
+2  A: 

Why are you using regex here?

The data file is perfectly aligned, you can extract the data with

line.substring(0, 19)   // date and time
line.substring(21, 28)  // latitude
..
..

It is much faster this way.
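A possible sketch of the fixed-width approach follows. The column offsets are an assumption, read off the sample rows as displayed in the question rather than taken from any format specification, so check them against the real file. The class and method names are hypothetical.

```java
final class FixedWidthSketch {
    // Start offset of each of the 8 columns (assumed from the sample rows):
    // timestamp, latitude, longitude, depth, magnitude, two "-.-" columns, name.
    private static final int[] STARTS = {0, 19, 28, 38, 49, 58, 63, 68};

    // Cuts the line at the fixed offsets and trims the padding.
    // A line shorter than the last start offset would throw
    // StringIndexOutOfBoundsException; this sketch does not guard against it.
    static String[] slice(String line) {
        String[] fields = new String[STARTS.length];
        for (int i = 0; i < STARTS.length; i++) {
            int end = (i + 1 < STARTS.length) ? STARTS[i + 1] : line.length();
            fields[i] = line.substring(STARTS[i], Math.min(end, line.length())).trim();
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "2009.07.02 01:38:34  38.3353   38.8157        3.5      2.7  -.-  -.-   KALE (MALATYA)";
        for (String field : slice(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

This only works if the file really pads every row with spaces to the same widths; with tab-delimited input the offsets would not line up.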

J-16 SDiZ
A: 

This looks like formatted text. First guess would be to break on tab chars.

String[] parts = line.split("\t");

If that doesn't work I'd break on spaces not followed by parens. Look in the javadoc under Pattern for lookahead pattern syntax: e.g. if you split

"ABC DEF (GHI)"

on the regex:

String regex="\\ (?!\\()";

(Read this as "space(?!X)", where "(?!X)" is a negative lookahead that rejects the match when X comes next, and X is here the escaped open paren "\(".)

you get two pieces: "ABC" and "DEF (GHI)".

Assuming the text is tab-delimited, parsing by numerical position will not work.
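To see what the lookahead does and does not buy you, here is a small sketch using the same regex idea. The class and constant names are made up for illustration, and the third input is a shortened, ASCII-only variant of one of the place names in the question.

```java
import java.util.Arrays;

final class LookaheadSplitSketch {
    // A single space that is NOT immediately followed by an opening parenthesis.
    static final String SPACE_NOT_BEFORE_PAREN = " (?!\\()";

    static String[] split(String text) {
        return text.split(SPACE_NOT_BEFORE_PAREN);
    }

    public static void main(String[] args) {
        // The space before "(" is protected, the other one is not.
        System.out.println(Arrays.toString(split("ABC DEF (GHI)")));          // [ABC, DEF (GHI)]
        System.out.println(Arrays.toString(split("OVACIK-ILICA (ERZURUM)"))); // [OVACIK-ILICA (ERZURUM)]
        // A space inside the parentheses is still a split point.
        System.out.println(Arrays.toString(split("CANDARLI (EGE DENIZI)")));  // [CANDARLI (EGE, DENIZI)]
    }
}
```

So the lookahead alone keeps names like "KALE (MALATYA)" together, but not names with a space inside the parentheses; a split limit is still needed for those.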

Steve B.
if it is tab-delimited, just use string.split("\t")
J-16 SDiZ
A: 

I guess you need 9 pieces and not 8, because the date and the time are separated by whitespace too. So try line.split("\\s+", 9).
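If you still want 8 pieces in the end, one option is to split into 9 and glue the date and time back together. A minimal sketch, assuming every row has the same nine whitespace-separated parts as the sample; mergeTimestamp is a hypothetical helper, not something from the standard library.

```java
import java.util.Arrays;

final class NinePieceSplitSketch {
    // Splits on any run of whitespace, at most 9 pieces, so everything
    // after the eighth separator (the place name) stays in one piece.
    static String[] splitRow(String line) {
        return line.trim().split("\\s+", 9);
    }

    // Hypothetical helper: joins date and time back into one timestamp.
    // Assumes the input has exactly 9 elements.
    static String[] mergeTimestamp(String[] nine) {
        String[] eight = new String[8];
        eight[0] = nine[0] + " " + nine[1];
        System.arraycopy(nine, 2, eight, 1, 7);
        return eight;
    }

    public static void main(String[] args) {
        String line = "2009.07.01 18:17:12  41.2700   36.0502        2.7      2.7  -.-  -.-   TAFLAN- (SAMSUN)";
        System.out.println(Arrays.toString(mergeTimestamp(splitRow(line))));
    }
}
```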