What is the best way to extract each field from each line when there is no clear separator (delimiter) between the fields?

Here is a sample of the lines whose fields I need to extract:

3/3/2010 11:00:46 AM                      BASEMENT-IN          
3/3/2010 11:04:04 AM 2, YaserAlNaqeb      BASEMENT-OUT         
3/3/2010 11:04:06 AM                      BASEMENT-IN          
3/3/2010 11:04:18 AM                      BASEMENT-IN          
3/3/2010 11:14:32 AM 4, Dhileep              BASEMENT-OUT         
3/3/2010 11:14:34 AM                      BASEMENT-IN          
3/3/2010 11:14:41 AM                      BASEMENT-IN          
3/3/2010 11:15:33 AM 4, Dhileep           BASEMENT-IN          
3/3/2010 11:15:42 AM                      BASEMENT-IN          
3/3/2010 11:15:42 AM                      BASEMENT-IN          
3/3/2010 11:30:22 AM 34, KumarRaju        BASEMENT-IN          
3/3/2010 11:31:28 AM 39, Eldrin           BASEMENT-OUT         
3/3/2010 11:31:31 AM                      BASEMENT-IN          
3/3/2010 11:31:39 AM                      BASEMENT-IN          
3/3/2010 11:32:38 AM 39, Eldrin           BASEMENT-IN          
3/3/2010 11:32:47 AM                      BASEMENT-IN          
3/3/2010 11:32:47 AM                      BASEMENT-IN          
3/3/2010 11:33:26 AM 34, KumarRaju        BASEMENT-OUT         
3/3/2010 11:33:28 AM                      BASEMENT-IN    

There are 6 fields in each line and some of them can be empty. What is the best way to approach this problem?

  • I'm using Java

Edit 01

  • Field 5 can be empty (however its existence should be recognized in all cases)
  • The number of spaces can vary
  • Last word can change
A: 

You can use StrTokenizer from Commons Lang and specify multiple delimiters to split on.

There are a number of built-in matcher types that it supports via StrMatcher.

StrTokenizer(char[] input, StrMatcher delim) 

e.g.

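// Commons Lang 2.x package names (Lang 3 uses org.apache.commons.lang3.text)
import org.apache.commons.lang.text.StrMatcher;
import org.apache.commons.lang.text.StrTokenizer;

// 'match' is assumed to hold the text to tokenize; space, comma and newline all act as delimiters
StrMatcher delims = StrMatcher.charSetMatcher(new char[] {' ', ',', '\n'});
StrTokenizer str = new StrTokenizer(match.toString(), delims);
while (str.hasNext()) {
    System.out.println("Token:[" + str.nextToken() + "]");
}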

will give (from the example above):

Token:[3/3/2010]
Token:[11:00:46]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:04:04]
Token:[AM]
Token:[2]
Token:[YaserAlNaqeb]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:04:06]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:04:18]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:14:32]
Token:[AM]
Token:[4]
Token:[Dhileep]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:14:34]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:14:41]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:15:33]
Token:[AM]
Token:[4]
Token:[Dhileep]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:15:42]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:15:42]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:30:22]
Token:[AM]
Token:[34]
Token:[KumarRaju]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:31:28]
Token:[AM]
Token:[39]
Token:[Eldrin]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:31:31]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:31:39]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:32:38]
Token:[AM]
Token:[39]
Token:[Eldrin]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:32:47]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:32:47]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:33:26]
Token:[AM]
Token:[34]
Token:[KumarRaju]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:33:28]
Token:[AM]
Token:[BASEMENT-IN]
Jon
But what are my multiple delimiters? The number of spaces can vary.
MAK
The number of spaces is not fixed, and field 5 can be empty in some cases.
MAK
Yep, that's fine, it'll work as per the example above. I dumped your fragment into a sample program, ran it, and it tokenizes fine...
Jon
It'll also cope with field 5 being empty...
Jon
Nice! But I need an empty (or one-space) token to be returned for the empty field, to keep the order of the fields.
MAK
Maybe you should update your question with the extra requirement...
Jon
Thanks Jon, I did already.
MAK
+2  A: 

Well you can strip off the date and the BASEMENT-FOO data by column number, since they always appear at the same point in the line. Then you can split the remainder based on commas. Whether you need to handle escaped commas \, or commas in quotes "foo, bar" is up to you and your business requirements.
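
A minimal sketch of that idea, assuming the timestamp always ends at column 20 and the last field always starts at column 42 (both offsets are hypothetical, measured from the sample, and the class and method names are made up for illustration). It does not handle escaped or quoted commas, and it throws on lines shorter than 42 characters.

```java
import java.util.Arrays;

class FixedColumnSketch {
    // Hypothetical offsets taken from the sample lines; they are assumptions
    // and break as soon as the timestamp or the name changes width.
    static final int TIMESTAMP_END = 20;
    static final int LOCATION_START = 42;

    /** Returns {timestamp, id, name, location}; id and name are "" when absent. */
    static String[] parse(String line) {
        String timestamp = line.substring(0, TIMESTAMP_END).trim();
        String middle = line.substring(TIMESTAMP_END, LOCATION_START).trim();
        String location = line.substring(LOCATION_START).trim();
        String id = "";
        String name = "";
        if (!middle.isEmpty()) {
            // Split the remainder on the first comma only: "2, YaserAlNaqeb"
            String[] parts = middle.split(",", 2);
            id = parts[0].trim();
            name = parts.length > 1 ? parts[1].trim() : "";
        }
        return new String[] {timestamp, id, name, location};
    }

    public static void main(String[] args) {
        // Build a line with the same column layout as the sample
        String line = String.format("%-20s %-21s%s",
                "3/3/2010 11:04:04 AM", "2, YaserAlNaqeb", "BASEMENT-OUT");
        System.out.println(Arrays.toString(parse(line)));
    }
}
```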

Philip Potter
Exactly what I was about to answer. It looks like a fixed format file to me.
Software Monkey
The number of spaces after the "name" field (field 5) can vary, so I can't count on the column number.
MAK
@MAK If that's the case, your example would have been clearer if it showed a name so large it pushed BASEMENT-FOO to the right. Because you've made it look as if BASEMENT-FOO will always be in the same column.
Philip Potter
@Philip Potter: you are right, I just updated my question. Sorry for the confusion.
MAK
To downvoter: please comment when you downvote, so I can learn what you think I did wrong.
Philip Potter
+1  A: 

You can do:

  • read an entire line as string.
  • split the line on whitespace (\s+). For the sample above you should get 4 or 6 pieces.
  • piece0, piece1 and piece2 will be date, time and AM/PM.
  • check whether piece3 starts with a number (e.g. "2,"): if yes, then read the next piece as the name
  • last piece is that Basement thing.
  • convert the pieces from string to say date,time,int as needed.
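
The steps above could be sketched roughly as follows. This is only an illustration: it assumes the last field is always a single word without spaces, treats everything between the number and the last piece as the name, and returns empty strings for a missing number or name so that the field order is preserved. The class and method names are made up.

```java
import java.util.Arrays;

class WhitespaceSplitSketch {
    /** Returns {date, time, amPm, id, name, location}; id and name are "" when absent. */
    static String[] parse(String line) {
        String[] p = line.trim().split("\\s+");
        if (p.length < 4) {
            throw new IllegalArgumentException("Too few pieces: " + line);
        }
        String id = "";
        int nameFrom = 3;
        // piece3 is the id only if something sits between AM/PM and the last piece
        if (p.length > 4 && p[3].matches("\\d+,?")) {
            id = p[3].replace(",", "");
            nameFrom = 4;
        }
        // Assumption: the last piece is the location, anything before it is the name
        String name = String.join(" ", Arrays.copyOfRange(p, nameFrom, p.length - 1));
        return new String[] {p[0], p[1], p[2], id, name, p[p.length - 1]};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                parse("3/3/2010 11:04:04 AM 2, YaserAlNaqeb      BASEMENT-OUT         ")));
        System.out.println(Arrays.toString(
                parse("3/3/2010 11:00:46 AM                      BASEMENT-IN          ")));
    }
}
```

Because the position of the last piece decides what is the location, this variant does not depend on the last word being uppercase or starting with a fixed prefix.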
codaddict
I'm not sure this will work when field 5 is empty. Will it?
MAK
@MAK: you'll have to modify it a bit. If you are sure that the last piece will have "BASEMENT" as a prefix and you'll not have a name starting with "BASEMENT" :) then if you find a number in piece3 you can check whether the next piece is actually a name or not.
codaddict
I wish it were that easy :) There is no guarantee that the last field starts with a constant expression.
MAK
@MAK: I see. But you'll have to find a way to differentiate the name from the last field. For example: the name will never be all uppercase, while the last field always will be.
codaddict
Hmmm... it seems the last word is always in capital letters, so I think this would work. But isn't there another way that would not rely on the case of the last word's letters?
MAK
A: 

Find the columns in each line where blank characters are adjacent to non-blank ones, then do a statistical analysis on those numbers: those which occur in every line or almost every line are very probably the field boundaries.

Similarly for punctuation adjacent to letters, but in general it is impossible to guess whether a - or a , is meant to delimit a field or not. If it occurs in the same position in every line, it might be a delimiter, but in lists of things such as D-FL R-TX D-NY it probably isn't. So there can be no fully automatic solution for arbitrary data.
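
A rough sketch of the counting step for blank-to-non-blank transitions only, assuming plain spaces as blanks. The class name, the method name and the minShare threshold are made up for illustration; punctuation is not considered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BoundaryStatsSketch {
    /** Columns where a space is followed by a non-space in at least minShare of the lines. */
    static List<Integer> likelyFieldStarts(List<String> lines, double minShare) {
        int width = 0;
        for (String l : lines) {
            width = Math.max(width, l.length());
        }
        int[] hits = new int[width];
        for (String l : lines) {
            for (int i = 1; i < l.length(); i++) {
                if (l.charAt(i - 1) == ' ' && l.charAt(i) != ' ') {
                    hits[i]++; // a field may start at column i in this line
                }
            }
        }
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < width; i++) {
            if (hits[i] >= minShare * lines.size()) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                String.format("%-20s %-21s%s", "3/3/2010 11:04:04 AM", "2, YaserAlNaqeb", "BASEMENT-OUT"),
                String.format("%-20s %-21s%s", "3/3/2010 11:00:46 AM", "", "BASEMENT-IN"));
        // Columns that start a field in every line
        System.out.println(likelyFieldStarts(lines, 1.0));
    }
}
```

Column 0 is never reported, since the first field has no blank before it. A threshold below 1.0 also reports columns that occur in only some lines, such as the start of an optional name.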

Kilian Foth
A: 

Since each field is very distinct (at least in the example you pasted above) you can do this:

  1. Split the string into tokens.
  2. Run each element of the tokenized array through a Regex Pattern.
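
As an illustration of the second step, a sketch that labels each token by the first pattern it fully matches. The patterns are guesses based on the sample lines, and the class and method names are made up. Note that a pattern alone cannot tell the name from the last field here, so both come out as "text".

```java
import java.util.regex.Pattern;

class TokenClassifierSketch {
    // Hypothetical patterns guessed from the sample lines in the question
    private static final Pattern DATE = Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{4}");
    private static final Pattern TIME = Pattern.compile("\\d{1,2}:\\d{2}:\\d{2}");
    private static final Pattern AM_PM = Pattern.compile("AM|PM");
    private static final Pattern ID = Pattern.compile("\\d+,?");

    /** Labels one whitespace-separated token by the first pattern it fully matches. */
    static String classify(String token) {
        if (DATE.matcher(token).matches()) return "date";
        if (TIME.matcher(token).matches()) return "time";
        if (AM_PM.matcher(token).matches()) return "am/pm";
        if (ID.matcher(token).matches()) return "id";
        return "text"; // name or last field: the patterns cannot tell them apart
    }

    public static void main(String[] args) {
        String line = "3/3/2010 11:04:04 AM 2, YaserAlNaqeb      BASEMENT-OUT";
        for (String token : line.split("\\s+")) {
            System.out.println(token + " -> " + classify(token));
        }
    }
}
```
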
Mihir Mathuria
What about field 5, where the data can be empty?
MAK
+1  A: 

To me there seem to be 3 meta-fields:

3/3/2010 11:32:38 AM 39, Eldrin           BASEMENT-IN          
3/3/2010 11:32:47 AM                      BASEMENT-IN 

MF1: 3/3/2010 11:32:38 AM

MF2: 39, Eldrin

MF3: BASEMENT-IN

of which MF2 is optional. My delimiters then would be:

MF1 up to and including [AM|PM]

MF2 number,anything except BASEMENT-*

MF3 BASEMENT-*

I'm not all that good at regexes but I would extract those 3 groups as something like

(anything)(AM|PM)(number,anything)?(BASEMENT-anything)

where the ? means optional group.
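
A sketch of such a pattern in Java, under a few assumptions of mine: date and time contain no spaces, the optional group is a number followed by a comma, and the last field is a single word (matched as \S+ rather than BASEMENT-*, since the question says the last word can change). The class and method names are made up.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaFieldRegexSketch {
    // date, time, AM/PM, optional "number, name", then one last word
    private static final Pattern LINE = Pattern.compile(
            "\\s*(\\S+)\\s+(\\S+)\\s+(AM|PM)\\s*(?:(\\d+)\\s*,\\s*(.*?))?\\s*(\\S+)\\s*");

    /** Returns {date, time, amPm, id, name, location}; id and name are "" when absent. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized line: " + line);
        }
        String[] fields = new String[6];
        for (int i = 0; i < fields.length; i++) {
            String group = m.group(i + 1); // null when the optional group did not match
            fields[i] = group == null ? "" : group;
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                parse("3/3/2010 11:32:38 AM 39, Eldrin           BASEMENT-IN          ")));
        System.out.println(Arrays.toString(
                parse("3/3/2010 11:32:47 AM                      BASEMENT-IN ")));
    }
}
```

The array always has six entries, so an empty field 5 still keeps its position.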

extraneon