I've tried various methods to strip the license from Project Gutenberg texts, so I can use them as a corpus for a language learning project, but I can't come up with an unsupervised, reliable approach. The best heuristic I've found so far is stripping the first 28 lines and the last 398, which worked for a large number of the texts. The license text is very similar across many of the texts, but with slight differences in each case, and there are a few different templates as well. Any suggestions for ways to strip it automatically, and for how to verify that a text has been stripped accurately, would be very useful.

+1  A: 

You weren't kidding. It's almost as if they were trying to make the job AI-complete. I can think of only two approaches, neither of them perfect.

1) Set up a script in, say, Perl to tackle the most common patterns (e.g., look for the phrase "produced by", keep going down to the next blank line, and cut there), but put in lots of assertions about what's expected (e.g., the next text should be the title or author). That way, when a pattern fails, you'll know it. The first time a pattern fails, do it by hand. The second time, modify the script.

2) Try Amazon's Mechanical Turk.
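A rough sketch of approach 1, written in Python rather than Perl. The "produced by" rule comes from the description above; the footer pattern, the 100-line header window, and the names `strip_license` and `StripError` are assumptions made for illustration, not anything Project Gutenberg guarantees. The point is that every unmet expectation raises an error instead of silently producing a badly cut text.

```python
import re


class StripError(Exception):
    """Raised when a text does not match the template this script expects."""


# Assumed patterns: real texts vary, so extend these as failures turn up.
HEADER_RE = re.compile(r"produced by", re.IGNORECASE)
FOOTER_RE = re.compile(r"^\W*end of (the |this )?project gutenberg", re.IGNORECASE)
LEFTOVER_RE = re.compile(r"project gutenberg", re.IGNORECASE)


def strip_license(text, max_header_lines=100):
    """Return the body of `text`, or raise StripError if a pattern fails."""
    lines = text.splitlines()

    # Header: find the "produced by" credit near the top of the file.
    credit = next(
        (i for i, line in enumerate(lines[:max_header_lines])
         if HEADER_RE.search(line)),
        None,
    )
    if credit is None:
        raise StripError("no 'produced by' line in the header region")

    # Keep going down to the next blank line and cut there.
    start = next(
        (i for i in range(credit + 1, len(lines)) if not lines[i].strip()),
        None,
    )
    if start is None:
        raise StripError("no blank line after the credit line")

    # Footer: the first line after the cut that looks like an end marker.
    end = next(
        (i for i in range(start, len(lines)) if FOOTER_RE.search(lines[i])),
        None,
    )
    if end is None:
        raise StripError("no end-of-text marker found")

    body = "\n".join(lines[start + 1:end]).strip()

    # Assertions about the result: fail loudly rather than guess.
    if not body:
        raise StripError("nothing left after stripping")
    if LEFTOVER_RE.search(body):
        raise StripError("license wording still present in the body")
    return body
```

Run over the whole corpus, this would let you collect the files that raise `StripError`, handle those by hand the first time, and add a new pattern when the same failure shows up again. The leftover check also doubles as a crude verification step, though it will reject any book whose own text mentions Project Gutenberg.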

Beta
I wish it didn't come down to methods like this, but I think you're probably right. I'll update this question if I find a better way.
tehgeekmeister
