views: 4101
answers: 9

Hi all,

I'm interested in parsing a fairly large text file in Java (1.6.x) and was wondering what approach(es) would be considered best practice?

The file will probably be about 1MB in size, and will consist of thousands of entries along the lines of:

Entry
{
    property1=value1
    property2=value2
    ...
}

etc.

My first instinct is to use regular expressions, but I have no prior experience of using Java in a production environment, and so am unsure how powerful the java.util.regex classes are.

To clarify a bit, my application is going to be a web app (JSP) which parses the file in question and displays the various values it retrieves. There is only ever the one file which gets parsed (it resides in a 3rd party directory on the host).

The app will have a fairly low usage (maybe only a handful of users using it a couple of times a day), but it is vital that when they do use it, the information is retrieved as quickly as possible.

Also, are there any precautions to take around loading the file into memory every time it is parsed?

Can anyone recommend an approach to take here?

Thanks

+5  A: 

If it is a proper grammar, use a parser builder such as the GOLD Parsing System. This allows you to specify the format and use an efficient parser to get the tokens you need, getting error-handling almost for free.

Lucero
+4  A: 

I'm wondering why this isn't in XML; then you could leverage the available XML tooling. I'm thinking particularly of SAX, in which case you could easily parse/process this without holding it all in memory.

So, can you convert this to XML?

If you can't, and you need a parser, then take a look at JavaCC.

Brian Agnew
It's a 3rd party log file, I have no control over the format unfortunately.
C.McAtackney
+3  A: 

Use the Scanner class and process your file a line at a time. I'm not sure why you mentioned regex. Regex is almost never the right answer to any parsing question, because of the ambiguity and lack of semantic control over what's happening in what context.
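A minimal sketch of the Scanner approach, assuming the Entry { key=value } layout from the question (class and method names here are illustrative, not part of any library):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

// Line-at-a-time parsing with java.util.Scanner, assuming the
// Entry { key=value } block layout shown in the question.
public class EntryScanner {
    public static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> entries = new ArrayList<Map<String, String>>();
        Scanner scanner = new Scanner(text);
        Map<String, String> current = null;
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine().trim();
            if (line.equals("{")) {
                current = new HashMap<String, String>();   // start of an entry body
            } else if (line.equals("}")) {
                entries.add(current);                      // entry finished
                current = null;
            } else if (current != null && line.contains("=")) {
                int eq = line.indexOf('=');
                current.put(line.substring(0, eq), line.substring(eq + 1));
            }
            // "Entry" header lines and blanks fall through and are skipped
        }
        scanner.close();
        return entries;
    }
}
```

For a file on disk you would construct the Scanner from a java.io.File instead of a String; the loop body is the same.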

mP
Please, tell us how regular expressions are ambiguous. Yes, different flavors behave differently, but they are all (more-or-less) documented and consistent. Every expression, for a given flavor, has a precise and unambiguous meaning.
Matthew Flaschen
When they (regexes) get complicated, they don't do what people believe they are doing. Real parsing problems and their solutions never use regexes. Are there any compilers written with regexes?
mP
+2  A: 

You can use the Antlr parser generator to build a parser capable of parsing your files.

lewap
+1  A: 

Not answering the question about parsing ... but you could parse the files and generate static pages as soon as new files arrive. So you would have no performance problems... (And I think 1Mb isn't a big file so you can load it in memory, as long as you don't load too many files concurrently...)

pgras
It's the same file that is getting parsed all the time - edited the post to clarify that.
C.McAtackney
+1  A: 

This seems like a simple enough file format, so you may consider using a recursive descent parser. Compared to JavaCC and Antlr, its pros are that you can write a few simple methods, get the data you need, and you do not need to learn a parser generator formalism. Its con is that it may be less efficient. A recursive descent parser is in principle more powerful than regular expressions. If you can come up with a grammar for this file type, it will serve you for whatever solution you choose.
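A hand-rolled recursive descent parser for this format could look something like the sketch below, with one method per grammar rule (the grammar and class are my own reading of the question's example, not anything from the original post):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal recursive descent sketch for the assumed grammar:
//   file     := entry*
//   entry    := "Entry" "{" property* "}"
//   property := key "=" value        (one per line)
public class DescentParser {
    private final String[] lines;
    private int pos = 0;

    public DescentParser(String text) {
        this.lines = text.split("\n");
    }

    // Skip blank lines and return the next trimmed token line, or null at EOF.
    private String peek() {
        while (pos < lines.length && lines[pos].trim().isEmpty()) pos++;
        return pos < lines.length ? lines[pos].trim() : null;
    }

    // file := entry*
    public List<Map<String, String>> parseFile() {
        List<Map<String, String>> entries = new ArrayList<Map<String, String>>();
        while ("Entry".equals(peek())) {
            entries.add(parseEntry());
        }
        return entries;
    }

    // entry := "Entry" "{" property* "}"
    private Map<String, String> parseEntry() {
        expect("Entry");
        expect("{");
        Map<String, String> props = new HashMap<String, String>();
        while (peek() != null && !peek().equals("}")) {
            String line = lines[pos++].trim();
            int eq = line.indexOf('=');
            props.put(line.substring(0, eq), line.substring(eq + 1));
        }
        expect("}");
        return props;
    }

    private void expect(String token) {
        if (!token.equals(peek())) {
            throw new IllegalStateException("Expected " + token + " at line " + pos);
        }
        pos++;
    }
}
```

The expect() failure path is where you get the cheap, precise error reporting that regexes struggle to provide.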

Yuval F
+4  A: 

If it's going to be about 1MB and literally in the format you state, then it sounds like you're overengineering things.

Unless your server is a ZX Spectrum or something, just use regular expressions to parse it, whack the data in a hash map (and keep it there), and don't worry about it. It'll take up a few megabytes in memory, but so what...?

Update: just to give you a concrete idea of performance, some measurements I took of the performance of String.split() (which uses regular expressions) show that on a 2GHz machine, it takes milliseconds to split 10,000 100-character strings (in other words, about 1 megabyte of data -- actually nearer 2MB in pure volume of bytes, since Strings are 2 bytes per char). Obviously, that's not quite the operation you're performing, but you get my point: things aren't that bad...
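The regex-and-hash-map approach might look like the following sketch; the patterns and key scheme are assumptions based on the question's example format, not anything the answerer specified:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of "regex the file, whack the data in a hash map": read the
// whole text once, match each Entry { ... } block, and keep the
// key/value pairs in memory for later requests.
public class RegexLoader {
    // One group per entry body: everything between the braces.
    private static final Pattern ENTRY = Pattern.compile("Entry\\s*\\{([^}]*)\\}");
    private static final Pattern PROPERTY = Pattern.compile("(\\w+)=([^\\r\\n]*)");

    public static Map<String, String> load(String text) {
        Map<String, String> all = new HashMap<String, String>();
        Matcher entries = ENTRY.matcher(text);
        int index = 0;
        while (entries.find()) {
            Matcher props = PROPERTY.matcher(entries.group(1));
            while (props.find()) {
                // Prefix keys with an entry index so identical
                // property names in different entries don't collide.
                all.put(index + "." + props.group(1), props.group(2));
            }
            index++;
        }
        return all;
    }
}
```

Compile the Patterns once (as statics, above) rather than per request; re-parsing the 1MB file only when its timestamp changes keeps the common case to a single map lookup.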

Neil Coffey
Fair enough - that's actually something I was wondering as well, whether I had over-egged this problem in my head. I think I'll do as you say and see how I get on. If performance turns out to be an issue, I can come back and look at the options suggested by other answers. Cheers.
C.McAtackney
I honestly don't think it will be -- 1MB is really not a lot of data.
Neil Coffey
+1  A: 

If it's the limitations of Java regexes you're wondering about, don't worry about it. Assuming you're reasonably competent at crafting regexes, performance shouldn't be a problem. The feature set is satisfyingly rich, too--including my favorite, possessive quantifiers.
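For anyone who hasn't met possessive quantifiers: they match like greedy ones but never give characters back through backtracking. A tiny illustration (my example, not the answerer's):

```java
public class PossessiveDemo {
    public static void main(String[] args) {
        // ".*+" possessively swallows all of "abc" and refuses to
        // backtrack, so the trailing 'c' can never match.
        System.out.println("abc".matches(".*+c"));  // false
        // Plain greedy ".*" backtracks to "ab", letting 'c' match.
        System.out.println("abc".matches(".*c"));   // true
    }
}
```

That no-backtracking property is what makes possessive quantifiers useful for failing fast on malformed input instead of thrashing through backtracking attempts.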

Alan Moore
+1  A: 

The other solution is to do some form of preprocessing (done offline, or as a cron job) which produces a very optimized data structure, which is then used to serve the many web requests (without having to re-parse the file).

Though, looking at the scenario in question, that doesn't seem to be needed.

Chii