views: 4101
answers: 9

Hi all,

I'm interested in parsing a fairly large text file in Java (1.6.x) and was wondering what approach(es) would be considered best practice?

The file will probably be about 1MB in size, and will consist of thousands of entries along the lines of:

Entry
{
    property1=value1
    property2=value2
    ...
}

etc.

My first instinct is to use regular expressions, but I have no prior experience of using Java in a production environment, and so am unsure how powerful the java.util.regex classes are.

To clarify a bit, my application is going to be a web app (JSP) which parses the file in question and displays the various values it retrieves. There is only ever the one file which gets parsed (it resides in a 3rd party directory on the host).

The app will have a fairly low usage (maybe only a handful of users using it a couple of times a day), but it is vital that when they do use it, the information is retrieved as quickly as possible.

Also, are there any precautions to take around loading the file into memory every time it is parsed?

Can anyone recommend an approach to take here?

Thanks

+5  A: 

If it is a proper grammar, use a parser builder such as the GOLD Parsing System. This allows you to specify the format and use an efficient parser to get the tokens you need, getting error-handling almost for free.

Lucero
+4  A: 

I'm wondering why this isn't in XML; then you could leverage the available XML tooling. I'm thinking particularly of SAX, in which case you could easily parse/process this without holding it all in memory.

So, can you convert this to XML?

If you can't, and you need a parser, then take a look at JavaCC.

Brian Agnew
It's a 3rd party log file, I have no control over the format unfortunately.
C.McAtackney
+3  A: 

Use the Scanner class and process your file a line at a time. I'm not sure why you mentioned regex. Regex is almost never the right answer to any parsing question, because of the ambiguity and lack of semantic control over what's happening in what context.
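A minimal sketch of the Scanner approach, assuming the Entry { key=value } layout from the question (class and method names here are illustrative, not part of any library):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

// Line-at-a-time parsing with java.util.Scanner, assuming the
// Entry { key=value } block layout shown in the question.
public class EntryScanner {
    public static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> entries = new ArrayList<Map<String, String>>();
        Scanner scanner = new Scanner(text);
        Map<String, String> current = null;
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine().trim();
            if (line.equals("{")) {
                current = new HashMap<String, String>();   // start of an entry body
            } else if (line.equals("}")) {
                entries.add(current);                      // entry finished
                current = null;
            } else if (current != null && line.contains("=")) {
                int eq = line.indexOf('=');
                current.put(line.substring(0, eq), line.substring(eq + 1));
            }
            // "Entry" header lines and blanks fall through and are skipped
        }
        scanner.close();
        return entries;
    }
}
```

For a file on disk you would construct the Scanner from a java.io.File instead of a String; the loop body is the same.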

mP
Please, tell us how regular expressions are ambiguous. Yes, different flavors behave differently, but they are all (more-or-less) documented and consistent. Every expression, for a given flavor, has a precise and unambiguous meaning.
Matthew Flaschen
When they (regexes) get complicated, they don't do what people believe they are doing. Real parsing problems and their solutions never use regexes. Are there any compilers written with regexes?
mP
+2  A: 

You can use the Antlr parser generator to build a parser capable of parsing your files.

lewap
+1  A: 

Not answering the question about parsing ... but you could parse the files and generate static pages as soon as new files arrive. So you would have no performance problems... (And I think 1Mb isn't a big file so you can load it in memory, as long as you don't load too many files concurrently...)

pgras
It's the same file that is getting parsed all the time - edited the post to clarify that.
C.McAtackney
+1  A: 

This seems like a simple enough file format, so you may consider using a recursive descent parser. Compared to JavaCC and Antlr, its pros are that you can write a few simple methods, get the data you need, and you do not need to learn a parser generator formalism. Its con is that it may be less efficient. A recursive descent parser is in principle more powerful than regular expressions. If you can come up with a grammar for this file type, it will serve you for whatever solution you choose.
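A hand-rolled recursive descent parser for this format could look something like the sketch below, with one method per grammar rule (the grammar and class are my own reading of the question's example, not anything from the original post):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal recursive descent sketch for the assumed grammar:
//   file     := entry*
//   entry    := "Entry" "{" property* "}"
//   property := key "=" value        (one per line)
public class DescentParser {
    private final String[] lines;
    private int pos = 0;

    public DescentParser(String text) {
        this.lines = text.split("\n");
    }

    // Skip blank lines and return the next trimmed token line, or null at EOF.
    private String peek() {
        while (pos < lines.length && lines[pos].trim().isEmpty()) pos++;
        return pos < lines.length ? lines[pos].trim() : null;
    }

    // file := entry*
    public List<Map<String, String>> parseFile() {
        List<Map<String, String>> entries = new ArrayList<Map<String, String>>();
        while ("Entry".equals(peek())) {
            entries.add(parseEntry());
        }
        return entries;
    }

    // entry := "Entry" "{" property* "}"
    private Map<String, String> parseEntry() {
        expect("Entry");
        expect("{");
        Map<String, String> props = new HashMap<String, String>();
        while (peek() != null && !peek().equals("}")) {
            String line = lines[pos++].trim();
            int eq = line.indexOf('=');
            props.put(line.substring(0, eq), line.substring(eq + 1));
        }
        expect("}");
        return props;
    }

    private void expect(String token) {
        if (!token.equals(peek())) {
            throw new IllegalStateException("Expected " + token + " at line " + pos);
        }
        pos++;
    }
}
```

The expect() failure path is where you get the cheap, precise error reporting that regexes struggle to provide.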

Yuval F
+4  A: 

If it's going to be about 1MB and literally in the format you state, then it sounds like you're overengineering things.

Unless your server is a ZX Spectrum or something, just use regular expressions to parse it, whack the data in a hash map (and keep it there), and don't worry about it. It'll take up a few megabytes in memory, but so what...?

Update: just to give you a concrete idea of performance, some measurements I took of the performance of String.split() (which uses regular expressions) show that on a 2GHz machine, it takes milliseconds to split 10,000 100-character strings (in other words, about 1 megabyte of data -- actually nearer 2MB in pure volume of bytes, since Strings are 2 bytes per char). Obviously, that's not quite the operation you're performing, but you get my point: things aren't that bad...
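The regex-and-hash-map approach might look like the following sketch; the patterns and key scheme are assumptions based on the question's example format, not anything the answerer specified:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of "regex the file, whack the data in a hash map": read the
// whole text once, match each Entry { ... } block, and keep the
// key/value pairs in memory for later requests.
public class RegexLoader {
    // One group per entry body: everything between the braces.
    private static final Pattern ENTRY = Pattern.compile("Entry\\s*\\{([^}]*)\\}");
    private static final Pattern PROPERTY = Pattern.compile("(\\w+)=([^\\r\\n]*)");

    public static Map<String, String> load(String text) {
        Map<String, String> all = new HashMap<String, String>();
        Matcher entries = ENTRY.matcher(text);
        int index = 0;
        while (entries.find()) {
            Matcher props = PROPERTY.matcher(entries.group(1));
            while (props.find()) {
                // Prefix keys with an entry index so identical
                // property names in different entries don't collide.
                all.put(index + "." + props.group(1), props.group(2));
            }
            index++;
        }
        return all;
    }
}
```

Compile the Patterns once (as statics, above) rather than per request; re-parsing the 1MB file only when its timestamp changes keeps the common case to a single map lookup.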

Neil Coffey
Fair enough - that's actually something I was wondering as well, whether I had over-egged this problem in my head. I think I'll do as you say and see how I get on. If performance turns out to be an issue, I can come back and look at the options suggested by other answers. Cheers.
C.McAtackney
I honestly don't think it will be -- 1MB is really not a lot of data.
Neil Coffey
+1  A: 

If it's the limitations of Java regexes you're wondering about, don't worry about it. Assuming you're reasonably competent at crafting regexes, performance shouldn't be a problem. The feature set is satisfyingly rich, too--including my favorite, possessive quantifiers.
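For anyone who hasn't met possessive quantifiers: they match like greedy ones but never give characters back through backtracking. A tiny illustration (my example, not the answerer's):

```java
public class PossessiveDemo {
    public static void main(String[] args) {
        // ".*+" possessively swallows all of "abc" and refuses to
        // backtrack, so the trailing 'c' can never match.
        System.out.println("abc".matches(".*+c"));  // false
        // Plain greedy ".*" backtracks to "ab", letting 'c' match.
        System.out.println("abc".matches(".*c"));   // true
    }
}
```

That no-backtracking property is what makes possessive quantifiers useful for failing fast on malformed input instead of thrashing through backtracking attempts.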

Alan Moore
+1  A: 

The other solution is to do some form of preprocessing (done offline, or as a cron job) which produces a very optimized data structure, which is then used to serve the many web requests (without having to re-parse the file).

Though, looking at the scenario in question, that doesn't seem to be needed.

Chii