ansaurus

Question

How to extract fields from a text line that has no constant delimiter?

Answer 1

A:

You can use StrTokenizer from Commons Lang and specify multiple delimiters to split on. There are a number of built-in matcher types that it supports via StrMatcher.

StrTokenizer(char[] input, StrMatcher delim)

e.g.

StrMatcher delims = StrMatcher.charSetMatcher(new char[] {' ', ',', '\n'});
StrTokenizer str = new StrTokenizer(match.toString(), delims);
while (str.hasNext()) {
    System.out.println("Token:[" + str.nextToken() + "]");
}

will give (from the example above):

Token:[3/3/2010]
Token:[11:00:46]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:04:04]
Token:[AM]
Token:[2]
Token:[YaserAlNaqeb]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:04:06]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:04:18]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:14:32]
Token:[AM]
Token:[4]
Token:[Dhileep]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:14:34]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:14:41]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:15:33]
Token:[AM]
Token:[4]
Token:[Dhileep]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:15:42]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:15:42]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:30:22]
Token:[AM]
Token:[34]
Token:[KumarRaju]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:31:28]
Token:[AM]
Token:[39]
Token:[Eldrin]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:31:31]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:31:39]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:32:38]
Token:[AM]
Token:[39]
Token:[Eldrin]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:32:47]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:32:47]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:33:26]
Token:[AM]
Token:[34]
Token:[KumarRaju]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:33:28]
Token:[AM]
Token:[BASEMENT-IN]

Jon 2010-03-04 07:29:00

But what are my multiple delimiters? The number of spaces can change.

MAK 2010-03-04 07:32:03

The number of spaces is not fixed, and field 5 can be empty in some cases.

MAK 2010-03-04 07:43:40

Yep, that's fine, it'll work as per the example above. I dumped your fragment into a sample program, ran it, and it tokenizes fine...

Jon 2010-03-04 07:56:58

It'll also cope with field 5 being empty...

Jon 2010-03-04 08:01:06

Nice! But I need an empty (or one-space) token to be returned for the empty field, to keep the order of the fields.

MAK 2010-03-04 08:06:11

Maybe you should update your question with the extra requirement...

Jon 2010-03-04 08:20:44

Thanks Jon, I did already.

MAK 2010-03-04 08:26:33

Answer 2

+2 A:

Well you can strip off the date and the BASEMENT-FOO data by column number, since they always appear at the same point in the line. Then you can split the remainder based on commas. Whether you need to handle escaped commas \, or commas in quotes "foo, bar" is up to you and your business requirements.
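A minimal sketch of the column-based approach, assuming the timestamp always ends before column 21 and the last field always starts at column 42. Both offsets, the class name and the sample lines are hypothetical (the samples are padded with String.format so the columns are guaranteed); measure the real offsets on your own file. Escaped or quoted commas are not handled.

```java
import java.util.ArrayList;
import java.util.List;

class FixedColumnSplit {
    // Hypothetical column offsets; they must be measured on the real file.
    static final int MIDDLE_START = 21;
    static final int LAST_START = 42;

    /** Returns [timestamp, id, name, location]; id and name are "" when the middle part is blank. */
    static List<String> parse(String line) {
        // Note: a line shorter than LAST_START characters makes substring() throw.
        List<String> fields = new ArrayList<>();
        fields.add(line.substring(0, MIDDLE_START).trim());
        String middle = line.substring(MIDDLE_START, LAST_START).trim();
        // Split the remainder on the first comma only.
        String[] parts = middle.split(",", 2);
        fields.add(parts[0].trim());
        fields.add(parts.length > 1 ? parts[1].trim() : "");
        fields.add(line.substring(LAST_START).trim());
        return fields;
    }

    public static void main(String[] args) {
        String full = String.format("%-20s %-20s %s", "3/3/2010 11:32:38 AM", "39, Eldrin", "BASEMENT-IN");
        String empty = String.format("%-20s %-20s %s", "3/3/2010 11:32:47 AM", "", "BASEMENT-IN");
        System.out.println(parse(full));
        System.out.println(parse(empty));
    }
}
```

Because the empty middle part still produces two empty strings, every line yields four fields in the same order.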

Philip Potter 2010-03-04 07:30:31

Exactly what I was about to answer. It looks like a fixed format file to me.

Software Monkey 2010-03-04 07:34:37

The number of spaces for the "name" field (field 5) can change. I can't count on the column number.

MAK 2010-03-04 07:35:57

@MAK If that's the case, your example would have been clearer if it showed a name so large it pushed BASEMENT-FOO to the right. Because you've made it look as if BASEMENT-FOO will always be in the same column.

Philip Potter 2010-03-04 07:39:05

@Philip Potter: you are right, I just updated my question. Sorry for the confusion.

MAK 2010-03-04 07:45:55

To downvoter: please comment when you downvote, so I can learn what you think I did wrong.

Philip Potter 2010-03-04 08:36:32

Answer 3

+1 A:

You can do:

Read an entire line as a string.
Split the line on whitespace (\s+). You should get 5 or 6 pieces.
piece0, piece1 and piece2 will be the date, the time and AM/PM.
Check whether piece3 is a number: if yes, read the next piece as the name.
The last piece is the BASEMENT-* field.
Convert the pieces from strings to, say, date, time and int as needed.
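The steps above can be sketched roughly as follows. The class name and the returned layout are hypothetical, and two things are my assumptions rather than part of the answer: the number may carry a trailing comma (as in "39,"), and whatever sits between the number and the last piece is treated as the name. With the sample lines quoted in this thread, a line without a number and name splits into only 4 pieces, so the sketch accepts that too.

```java
import java.util.Arrays;

class WhitespaceSplit {
    /** Returns {date, time, amPm, id, name, location}; id and name are "" when absent. */
    static String[] parse(String line) {
        String[] p = line.trim().split("\\s+");
        if (p.length < 4) {
            throw new IllegalArgumentException("Too few pieces: " + line);
        }
        String id = "";
        String name = "";
        int next = 3;
        // Treat piece3 as the id only if it is digits (optionally followed by a comma)
        // and is not itself the last piece.
        if (next < p.length - 1 && p[next].matches("\\d+,?")) {
            id = p[next].replace(",", "");
            next++;
            // Everything between the id and the last piece is taken as the name.
            if (next < p.length - 1) {
                name = String.join(" ", Arrays.copyOfRange(p, next, p.length - 1));
            }
        }
        return new String[] {p[0], p[1], p[2], id, name, p[p.length - 1]};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            parse("3/3/2010 11:32:38 AM 39, Eldrin           BASEMENT-IN")));
        System.out.println(Arrays.toString(
            parse("3/3/2010 11:32:47 AM                      BASEMENT-IN")));
    }
}
```

The result always has six slots, so an absent id or name shows up as an empty string instead of shifting the later fields.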

codaddict 2010-03-04 07:30:47

I'm not sure this will work when field 5 is empty... will it?

MAK 2010-03-04 07:40:24

@MAK: you'll have to modify it a bit. If you are sure that the last piece will have "BASEMENT" as a prefix and that no name starts with "BASEMENT" :) then, if you find a number in piece3, you can check whether the next piece is actually a name or not.

codaddict 2010-03-04 07:43:40

I wish it were that easy :) There is no guarantee that the last field starts with a constant expression.

MAK 2010-03-04 07:48:43

@MAK: I see. But you'll have to find something that differentiates the name from the last field, for example: the name will not be all uppercase, while the last field will be.

codaddict 2010-03-04 07:53:51

Hmm... it seems the last word is always in capital letters, so I think this would work. But isn't there another way that doesn't rely on the case of the last word?

MAK 2010-03-04 08:09:28

Answer 4

A:

Find the columns in each line where blank characters are adjacent to non-blank ones, then do a statistical analysis on those numbers: those which occur in every line or almost every line are very probably the field boundaries.

Similarly for punctuation adjacent to letters, but in general it is impossible to guess whether a - or a , is meant to delimit a field or not. If it occurs in the same position in every line, it might be a delimiter, but in lists of things such as D-FL R-TX D-NY it probably isn't. So there can be no fully automatic solution for arbitrary data.
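A rough sketch of the column analysis from the first paragraph. It only counts one direction of the blank/non-blank transition (a non-blank character that follows a blank one, i.e. a candidate field start), and it never reports column 0. The class name, the minShare threshold and the padded sample lines are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BoundaryGuess {
    /** Columns where a non-blank follows a blank in at least minShare (0..1) of the lines. */
    static List<Integer> likelyFieldStarts(List<String> lines, double minShare) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (int i = 1; i < line.length(); i++) {
                boolean prevBlank = Character.isWhitespace(line.charAt(i - 1));
                boolean curBlank = Character.isWhitespace(line.charAt(i));
                if (prevBlank && !curBlank) {
                    counts.merge(i, 1, Integer::sum);
                }
            }
        }
        List<Integer> starts = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minShare * lines.size()) {
                starts.add(e.getKey());
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String fmt = "%-20s %-20s %s";
        List<String> lines = Arrays.asList(
            String.format(fmt, "3/3/2010 11:32:38 AM", "39, Eldrin", "BASEMENT-IN"),
            String.format(fmt, "3/3/2010 11:32:47 AM", "", "BASEMENT-IN"),
            String.format(fmt, "3/3/2010 11:33:26 AM", "34, KumarRaju", "BASEMENT-OUT"));
        // Columns 21 and 25 occur in only two of the three lines, so they are dropped.
        System.out.println(likelyFieldStarts(lines, 1.0)); // prints [9, 18, 42]
    }
}
```

Lowering minShare (say to 0.6) would also report the optional number and name columns, which is the "almost every line" case.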

Kilian Foth 2010-03-04 07:33:49

Answer 5

A:

Since each field is very distinct (at least in the example you pasted above), you can do this:

Split the string into tokens.
Run each element of the tokenized array through a Regex Pattern.
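One way this could look, as a sketch only: the patterns below are guessed from the sample lines quoted in this thread and are not taken from the answer, and the class and kind names are hypothetical. Anything that matches none of the patterns is reported as a name, so an all-uppercase hyphenated name would be misclassified as a location.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class TokenClassifier {
    // Hypothetical patterns, checked in insertion order.
    static final Map<String, Pattern> KINDS = new LinkedHashMap<>();
    static {
        KINDS.put("date", Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{4}"));
        KINDS.put("time", Pattern.compile("\\d{1,2}:\\d{2}:\\d{2}"));
        KINDS.put("ampm", Pattern.compile("AM|PM"));
        KINDS.put("id", Pattern.compile("\\d+,?"));
        KINDS.put("location", Pattern.compile("[A-Z]+-[A-Z]+"));
    }

    /** Returns the first kind whose pattern matches the whole token, or "name". */
    static String classify(String token) {
        for (Map.Entry<String, Pattern> e : KINDS.entrySet()) {
            if (e.getValue().matcher(token).matches()) {
                return e.getKey();
            }
        }
        return "name";
    }

    public static void main(String[] args) {
        String line = "3/3/2010 11:32:38 AM 39, Eldrin           BASEMENT-IN";
        for (String token : line.trim().split("\\s+")) {
            System.out.println(classify(token) + ": " + token);
        }
    }
}
```

Since each token carries its own kind, a line with no id or name simply produces no "id" or "name" tokens; the caller can fill those slots with empty values.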

Mihir Mathuria 2010-03-04 07:34:35

What about field 5, where the data can be empty?

MAK 2010-03-04 07:42:29

Answer 6

+1 A:

To me there seem to be 3 meta-fields:

3/3/2010 11:32:38 AM 39, Eldrin           BASEMENT-IN          
3/3/2010 11:32:47 AM                      BASEMENT-IN

MF1: 3/3/2010 11:32:38 AM

MF2: 39, Eldrin

MF3: BASEMENT-IN

of which MF2 is optional. My delimiters then would be:

MF1 up to and including [AM|PM]

MF2 number,anything except BASEMENT-*

MF3 BASEMENT-*

I'm not all that good at regexes but I would extract those 3 groups as something like

(anything)(AM|PM)(number,anything)?(BASEMENT-anything)

where the ? means optional group.
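As a sketch, the three groups could be written with java.util.regex like this. I have folded (anything)(AM|PM) into a single MF1 group, and the class name and exact pattern are my own guess at what the pseudo-regex means. It also assumes that the last field always starts with "BASEMENT-".

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaFieldRegex {
    // MF1: everything up to and including AM or PM
    // MF2: optional "number,anything"
    // MF3: BASEMENT-something
    static final Pattern LINE = Pattern.compile(
        "(.*?\\b(?:AM|PM))\\s+(\\d+,.*?)?\\s*(BASEMENT-\\S+)\\s*");

    /** Returns {MF1, MF2, MF3} with MF2 as "" when absent, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String mf2 = m.group(2) == null ? "" : m.group(2);
        return new String[] {m.group(1), mf2, m.group(3)};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            parse("3/3/2010 11:32:38 AM 39, Eldrin           BASEMENT-IN          ")));
        System.out.println(Arrays.toString(
            parse("3/3/2010 11:32:47 AM                      BASEMENT-IN")));
    }
}
```

MF2 comes back as one string ("39, Eldrin"); splitting it on the first comma gives the number and the name separately.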

extraneon 2010-03-04 08:57:56
