ansaurus

Question

Answer 1

A:

TemplateMaker does seem to do what you need, at least according to its documentation. Instead of receiving a template as input, it infers ("learns") it from a few documents. It then provides an extract method that pulls the data out of other documents created with the same template.

The example shows:

# Now that we have a template, let's extract some data.
>>> t.extract('<b>red and green</b>')
('red', 'green')
>>> t.extract('<b>django and stephane</b>')
('django', 'stephane')

# The extract() method is very literal. It doesn't magically trim
# whitespace, nor does it have any knowledge of markup languages such as
# HTML.
>>> t.extract('<b>  spacy  and <u>underlined</u></b>')
('  spacy ', '<u>underlined</u>')

# The extract() method will raise the NoMatch exception if the data
# doesn't match the template. In this example, the data doesn't have the
# leading and trailing "<b>" tags.
>>> t.extract('this and that')
Traceback (most recent call last):
...

So, to achieve the task you require, I think you should:

Give it a few documents rendered from your template - it will have no trouble inferring the template from them.
Use the inferred template to extract data from new documents.
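
To make the two steps concrete without depending on TemplateMaker itself, here is a rough standard-library sketch of the same idea. It is not TemplateMaker's implementation or API: infer_template, extract, and min_literal are hypothetical names chosen for this example. It learns from exactly two samples and uses difflib to find the shared text, which is a much cruder approach than a real wrapper-induction library.

```python
import difflib
import re


def infer_template(first, second, min_literal=2):
    """Infer a template from two rendered samples.

    Returns a list of literal strings, with None marking each hole
    (a position where the two samples differ).
    """
    matcher = difflib.SequenceMatcher(None, first, second, autojunk=False)
    parts = []
    pos_first = pos_second = 0
    for start_first, start_second, size in matcher.get_matching_blocks():
        # Very short matches are usually coincidences inside the data,
        # so fold them into the surrounding hole. The final block always
        # has size 0 and only marks the end of both strings.
        if 0 < size < min_literal:
            continue
        if start_first > pos_first or start_second > pos_second:
            parts.append(None)  # something varies here: a hole
        if size:
            parts.append(first[start_first:start_first + size])
        pos_first = start_first + size
        pos_second = start_second + size
    return parts


def extract(template, text):
    """Return the text found in each hole, or raise ValueError on no match."""
    pattern = "".join(
        "(.*?)" if part is None else re.escape(part) for part in template
    )
    match = re.fullmatch(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError("text does not match the inferred template")
    return match.groups()


# Step 1: learn the template from two rendered documents.
template = infer_template('<b>this and that</b>', '<b>alex and sue</b>')

# Step 2: use it on a new document.
print(extract(template, '<b>red and green</b>'))
```

Like the real extract method, this sketch is purely literal: it does not trim whitespace and knows nothing about HTML.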

Come to think of it, it's even more useful than Perl's Template::Extract, as it doesn't expect you to provide a clean template: it learns the template on its own from sample text.

Eli Bendersky 2010-01-28 07:00:09

Sorry, I should have been more specific in my original question about what I am trying to accomplish. I'm looking to extract named variables from a statically structured document, ideally pass them through optional filter functions (something I don't think Template::Extract is even capable of), and then return the result as a dictionary object. You’re right though... TemplateMaker does a great job of detecting the differences between documents and returning those differences as variables, but it's not quite what I need.

Kyle Derkacz 2010-01-28 07:19:31

@Kyle: Maybe you need to rephrase your question, then.

Eli Bendersky 2010-01-28 07:56:21

Answer 2

A:

Here is an interesting discussion from Adrian Holovaty, the author of TemplateMaker: http://www.holovaty.com/writing/templatemaker/

It seems to be a lot like what I would call a wrapper induction library.

If you're looking for something more configurable (and less scraping-specific), take a look at lxml.html and BeautifulSoup, both also for Python.

brianray 2010-01-28 07:14:27

Answer 3

+1 A:

After digging around some more, I found a solution that does exactly what I was looking for. filippo posted a list of Python solutions for screen scraping in this post: http://stackoverflow.com/questions/2861/options-for-html-scraping/1970411#1970411 among which is a package called scrapemark ( http://arshaw.com/scrapemark/ ).
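
For illustration, the kind of extraction described in this thread (named placeholders, optional filter functions, a dictionary as the result) can be sketched with the standard library alone. This is not scrapemark's API. The {{ name|filter }} placeholder syntax, the FILTERS table, and the function names below are all assumptions made up for this example.

```python
import re

# Hypothetical filter table: maps a filter name used in the template
# to the function applied to the captured text.
FILTERS = {
    "int": int,
    "float": float,
    "strip": str.strip,
    "upper": str.upper,
}

# Matches "{{ name }}" or "{{ name|filter }}".
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_]\w*)\s*(?:\|\s*(\w+)\s*)?\}\}")


def compile_template(template):
    """Turn a template into a regex with named groups plus a name -> filter map."""
    pattern_parts = []
    filters = {}
    position = 0
    for placeholder in PLACEHOLDER.finditer(template):
        # Literal text before the placeholder must match exactly.
        pattern_parts.append(re.escape(template[position:placeholder.start()]))
        name, filter_name = placeholder.groups()
        pattern_parts.append("(?P<%s>.*?)" % name)
        if filter_name is not None:
            filters[name] = FILTERS[filter_name]
        position = placeholder.end()
    pattern_parts.append(re.escape(template[position:]))
    return re.compile("".join(pattern_parts), re.DOTALL), filters


def extract_named(template, document):
    """Return a dict of filtered values, or raise ValueError on no match."""
    pattern, filters = compile_template(template)
    match = pattern.fullmatch(document)
    if match is None:
        raise ValueError("document does not match the template")
    return {
        name: filters.get(name, str)(value)
        for name, value in match.groupdict().items()
    }


template = "<h1>{{ title|strip }}</h1><p>Price: {{ price|float }} USD</p>"
document = "<h1> Blue Widget </h1><p>Price: 19.99 USD</p>"
print(extract_named(template, document))
```

Each placeholder name may appear only once in a template here, because it becomes a named group in the regex. Unlike TemplateMaker, nothing is learned: the template has to be written by hand.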

Hope this helps anyone else who is looking for the same solution.

Kyle Derkacz 2010-03-09 18:07:29

Template extraction in python/php