My question is almost identical to an earlier entry I found here but not quite.

I need to parse a text file where the data is structured this way: each item in the file begins with a # followed by the label. The fields in each record are separated by one or more whitespace characters.

Here is the part I'm having problems with: each field may or may not be enclosed in quotation marks; they are only required if the data contains spaces.

So what I'm after is a regex that splits by whitespace but not if that whitespace is inside a quotation.

At the moment I'm using a separate regex for each label, but it would be much more efficient to split each line immediately when reading from the file. For the account example below, the regex is `(^#[A-z]+)\s([0-9]+)\s(.+)`

Example of data

#ACCOUNT 7059 "Misc. travelexpenses"
#ADRESS "M. Jackson" "somewhere over the rainbow" WI53233-1704 555-12345
A: 

You can use an "OR" construct (alternation) to define the possible forms of a field. For example,

([A-z]+|"[^"]+") 

matches both Kring and "Mr. Kring".
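As a quick check of the alternation above, here is a minimal Python sketch. The language choice and the helper name `is_field` are assumptions made for illustration; the pattern itself is the one from the answer.

```python
import re

# Alternation: a bare word OR a quoted string (pattern taken from the answer).
FIELD = re.compile(r'([A-z]+|"[^"]+")')

def is_field(text):
    """Return True if the whole text is exactly one field under FIELD."""
    return FIELD.fullmatch(text) is not None

print(is_field('Kring'))        # bare word
print(is_field('"Mr. Kring"'))  # quoted, so it may contain spaces
print(is_field('Mr. Kring'))    # unquoted with a space: not a single field
```

Note that `[A-z]` also covers the ASCII punctuation that sits between `Z` and `a` (such as `_` and `^`); `[A-Za-z]` is the stricter form if only letters are wanted.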

Edit: So, to get all your fields and the label in the above records you could use

(?:^#|\s+)([^"#\s]+|"[^"]+")
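A minimal sketch of how that pattern could be applied to the sample records, assuming Python's `re` module. The function name `split_record` and the step that strips the surrounding quotes are illustrative additions, not part of the original answer.

```python
import re

# Pattern from the answer: a field is preceded by the leading '#' or by
# whitespace, and is either a run of non-quote, non-space characters or a
# quoted string.
FIELD_RE = re.compile(r'(?:^#|\s+)([^"#\s]+|"[^"]+")')

def split_record(line):
    """Split one record into its label and fields (hypothetical helper)."""
    fields = FIELD_RE.findall(line)
    # Assumption: callers want the values without the surrounding quotes.
    return [f[1:-1] if f.startswith('"') else f for f in fields]

records = [
    '#ACCOUNT 7059 "Misc. travelexpenses"',
    '#ADRESS "M. Jackson" "somewhere over the rainbow" WI53233-1704 555-12345',
]
for record in records:
    print(split_record(record))
```

With a single capture group, `findall` returns just the captured field text, so the leading `#` and the separating whitespace are dropped automatically.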

http://gskinner.com/RegExr/ is a good way to test Regular Expressions.

Jens
Use `\S+` instead of `[A-z]+`
Amarghosh
`\S` matches non-whitespace characters, doesn't it? Well, looking at the data given above, that might be what Kring was looking for. I just wanted to show how to use the OR construct.
Jens
But what if I want to exclude a match? In this case, find all whitespace except that within quotation marks.
This is a question : Splits to 4 parts
This is "a question" : Splits to 3 parts
Kring
Just switch it around: `("[^"]+"|\S+)`
Tim Pietzcker
Excuse me for being so dense, but isn't that the exact opposite of what I was asking? But maybe it is close enough :) Thank you.
Kring
I wasn't initially sure if there could be more than two fields per record. I'll update my answer, to match your records above.
Jens
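
The pattern suggested in the comments, `("[^"]+"|\S+)`, can be checked against Kring's two examples with a short Python sketch. The helper name `tokens` is an assumption for illustration. Instead of splitting on the whitespace, it matches the tokens between the whitespace, which gives the same parts.

```python
import re

# Quoted strings are tried first; otherwise \S+ would grab '"a' and
# 'question"' as two separate tokens.
TOKEN_RE = re.compile(r'("[^"]+"|\S+)')

def tokens(text):
    """Return the whitespace-separated parts, keeping quoted parts whole."""
    return TOKEN_RE.findall(text)

print(len(tokens('This is a question')))    # 4
print(len(tokens('This is "a question"')))  # 3
```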