ansaurus

Question

Tricky file parsing. Inconsistent Delimiters

Answer 1

A:

You need to analyze your data by hand and find out what the year, edition and publisher look like. For example, if you find that the year is always two digits and the publisher always comes from some limited list, that is something you can start with.

stereofrog 2010-03-16 15:12:23

Good suggestion, thanks. Getting there. The year is two digits. The edition is always the rank (1st, 3rd, 9th, etc.). The publisher is also tricky: I just found some that are two words. I had the thought of attacking the string from both ends, grabbing what I can get out of it, then going from there.
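
A minimal Python sketch of working inward from the end of the string, assuming the publisher is the last field of the text that remains and comes from a known list. The list contents and the name split_publisher are hypothetical, chosen only for illustration; the real list would come from inspecting the data.

```python
# Hypothetical publisher list; a real one would be built from the data itself.
KNOWN_PUBLISHERS = ["RANDOM", "HARPER ROW", "PENGUIN"]


def split_publisher(text, publishers=KNOWN_PUBLISHERS):
    """Return (rest, publisher) if text ends with a known publisher,
    otherwise (text, None)."""
    stripped = text.rstrip()
    # Try longer names first so a two-word publisher is not cut short
    # by a shorter name that happens to be its last word.
    for name in sorted(publishers, key=len, reverse=True):
        if stripped == name or stripped.endswith(" " + name):
            rest = stripped[: len(stripped) - len(name)].rstrip()
            return rest, name
    return text, None
```

Records that come back with None are the ones left over for manual inspection or for extending the list.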

Ben Truby 2010-03-16 15:22:44

Answer 2

A:

While I don't see any way other than guessing a bit, I'd go about it something like this:

I'd strip off what I know I can parse out reliably, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM.

From there I'd try to locate the edition, store and remove it, and split the string in two at that position, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) and 64 RANDOM. Another option is to try with the year, but of course titles such as 1984 might present a problem. (Guessing the edition of course assumes it looks like 7th, 51st, etc. for all editions.)

Finally, I'd assume I could somewhat reliably guess the year (64) at the start of the second string, which further narrows down the Publisher(/Comment) part.

The rest is pure guesswork unless you have a list of authors/publishers somewhere to match against, as I'd assume there are not only comments with spaces but also publishers with spaces in their names. But at least you should be down to two strings, containing Author/Title in one and Publisher(/Comments) in the other.
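
The steps above could be sketched in Python roughly as follows. This assumes every edition is an ordinal such as 1st or 51st and that a two-digit year directly follows it; the function name and dictionary keys are made up for the example. Taking the last ordinal in the string is a choice of this sketch (so that an ordinal inside a title is less likely to be picked), not something the data guarantees.

```python
import re

# Ordinal edition such as 1st, 2nd, 3rd, 9th, 51st (assumed to be the only form).
EDITION_RE = re.compile(r"\b(\d+(?:st|nd|rd|th))\b")
# Two-digit year at the very start of the right-hand part.
YEAR_RE = re.compile(r"(\d{2})\b\s*")


def split_on_edition(middle):
    """Split 'AUTHOR TITLE EDITION YEAR PUBLISHER' around the edition token.

    Returns None when no ordinal is found.
    """
    matches = list(EDITION_RE.finditer(middle))
    if not matches:
        return None
    edition = matches[-1]  # last ordinal in the string
    left = middle[: edition.start()].strip()
    right = middle[edition.end():].strip()
    year = None
    year_match = YEAR_RE.match(right)
    if year_match:
        year = year_match.group(1)
        right = right[year_match.end():]
    return {
        "author_title": left,
        "edition": edition.group(1),
        "year": year,
        "publisher_comment": right,
    }
```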

All in all it should limit the manual part a bit.

Once done, I'd also save it in a better format somewhere so I don't have to go about parsing it again ;)
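
For instance, the parsed records could be written out as CSV with Python's standard csv module. A small sketch; the field names are placeholders for whatever fields the parse actually produces.

```python
import csv

# Placeholder field names; adjust to the fields actually extracted.
FIELDS = ["author_title", "edition", "year", "publisher_comment"]


def write_records(records, out_file):
    """Write an iterable of dicts to an open text file as CSV with a header row."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
```

When writing to a real file, open it with newline="" as the csv module documentation recommends, e.g. open("books.csv", "w", newline="").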

Don 2010-03-16 15:33:37

Answer 3

A:

I don't know whether the PCRE engine allows multiple groups from within a selection, therefore:

([A-Z0-1]{7})\ (\d-\d{3}-\d{5}-\d)\ (.+)\ (\d+(?:st|nd|rd|th))\ \d{2}\ ([^\d.]+)\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d{1})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d)\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\w{3})

It does look quite ugly and doesn't fix your author-title problem, but it matches quite well for the rest of it. Concerning your problem, I don't see any solution other than having a lookup table for authors or using other services to look up the title and author via the ISBN.

That's if, unlike in your example above, the authors are not just represented by their first name. Also double-check all exceptions that might occur with the above regex, as titles may contain 1st or the like.
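
One way to find such exceptions before trusting the regex is to flag every record that contains more than one ordinal token, so those lines can be reviewed by hand. A rough Python sketch; the function name is made up, and it assumes one record per line.

```python
import re

# Any ordinal-looking token: 1st, 2nd, 3rd, 9th, 51st, ...
ORDINAL_RE = re.compile(r"\b\d+(?:st|nd|rd|th)\b", re.IGNORECASE)


def ambiguous_edition_lines(lines):
    """Yield (line_number, line) for records with more than one ordinal token."""
    for number, line in enumerate(lines, start=1):
        if len(ORDINAL_RE.findall(line)) > 1:
            yield number, line
```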

lhw 2010-03-16 15:59:48

Looking up some info is an option, and will be used (Amazon AWS). I just wanted to get as much as I could from the file before resorting to that.

Ben Truby 2010-03-16 16:03:03

Well, if you're considering using additional services anyway, you might as well take only the ISBN and the additional system information from the beginning and end of the string, and take the rest from AWS or the like. That would make the job much easier.
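
Pulling out just the ISBN could look something like this in Python. The sketch assumes the d-ddd-ddddd-d grouping used in the regex above (with X also allowed as the final character) and uses the standard ISBN-10 checksum to reject tokens that merely look like an ISBN. The function names are illustrative.

```python
import re

# ISBN-10 in the 1-3-5-1 grouping assumed from the regex above; the check
# character may be X, which stands for the value 10.
ISBN_RE = re.compile(r"\b(\d-\d{3}-\d{5}-[\dX])\b")


def isbn10_is_valid(isbn):
    """Check the ISBN-10 checksum: the weighted sum (weights 10..1) must be divisible by 11."""
    chars = isbn.replace("-", "")
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


def extract_isbn(record):
    """Return the first checksum-valid ISBN found in the record, or None."""
    for match in ISBN_RE.finditer(record):
        if isbn10_is_valid(match.group(1)):
            return match.group(1)
    return None
```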

lhw 2010-03-16 20:12:46

Answer 4

A:

I would also ask myself 'How good does this have to be?' and 'How many records are there?'

If, for example, you are parsing this list to put up a catalog of books to sell on a website, you probably want to be as good as you can, but expect that you will miss some titles, and build in a feedback mechanism so your users can help you fix the issue (and make it easy for you to fix it in your new format).

On the other hand, if you absolutely have to get it right because you will lose lots of money for each wrong parse, and there are only a few thousand books, you should plan on getting close and then doing a human review of the entire file.

(In my first job, we spent six weeks on a data conversion project to convert 150 records. Not a good use of time.)

John Chenault 2010-03-16 16:09:11

Answer 5

+1 A:

Find the title and publisher of the book by ISBN (in some online database) and parse only the rest :)

BTW, are you sure that what looks like a space actually is a space? There are more "invisible" characters (like the non-breaking space). I know, not a good idea, but apparently the author of that format was pretty creative...
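
A quick way to check is to list every whitespace-like character in a line that is not a plain ASCII space. A small Python sketch using the standard unicodedata module; the function name is made up for the example.

```python
import unicodedata


def describe_odd_whitespace(line):
    """Return (index, code point, Unicode name) for each whitespace-like
    character in line that is not a plain ASCII space."""
    found = []
    for index, char in enumerate(line):
        if char == " ":
            continue
        # Zs = space separators (e.g. no-break space),
        # Cf = invisible format characters (e.g. zero width space).
        if char.isspace() or unicodedata.category(char) in ("Zs", "Cf"):
            name = unicodedata.name(char, "UNKNOWN")
            found.append((index, f"U+{ord(char):04X}", name))
    return found
```

Control characters such as tab have no Unicode name, so they are reported as UNKNOWN along with their code point.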

Messa 2010-03-16 16:16:36
