Are there existing template extraction libraries for either Python or PHP? Perl has Template::Extract (http://search.cpan.org/%7Eautrijus/Template-Extract-0.40/lib/Template/Extract.pm), but I haven't been able to find a similar implementation in either Python or PHP.

The closest thing I could find in Python is TemplateMaker (http://code.google.com/p/templatemaker/), but that's not really a template extraction library.

A: 

TemplateMaker does seem to do what you need, at least according to its documentation. Instead of receiving a template as input, it infers ("learns") it from a few documents. It then provides an extract method that pulls the data out of other documents created with the same template.

The example shows:

# Now that we have a template, let's extract some data.
>>> t.extract('<b>red and green</b>')
('red', 'green')
>>> t.extract('<b>django and stephane</b>')
('django', 'stephane')

# The extract() method is very literal. It doesn't magically trim
# whitespace, nor does it have any knowledge of markup languages such as
# HTML.
>>> t.extract('<b>  spacy  and <u>underlined</u></b>')
('  spacy ', '<u>underlined</u>')

# The extract() method will raise the NoMatch exception if the data
# doesn't match the template. In this example, the data doesn't have the
# leading and trailing "<b>" tags.
>>> t.extract('this and that')
Traceback (most recent call last):
...

So, to achieve the task you require, I think you should:

  • Give it a few documents rendered from your template - it will have no trouble inferring the template from them.
  • Use the inferred template to extract data from new documents.
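
To make the two steps above concrete, here is a minimal sketch of the infer-then-extract idea using only the Python standard library. It is not TemplateMaker's algorithm or API: the function names `infer_template` and `extract_holes` and the `min_len` parameter are made up for illustration, it learns from exactly two samples, and it assumes the variable parts of the samples share no long runs of text and that the template starts and ends with fixed text.

```python
import re
from difflib import SequenceMatcher


def infer_template(sample_a, sample_b, min_len=2):
    """Return the fixed text chunks that both samples share, in order.

    Hypothetical helper: anything between two shared chunks is treated
    as a "hole" (a variable). Chunks shorter than min_len are ignored so
    that stray single-character matches do not become fixed text.
    """
    matcher = SequenceMatcher(None, sample_a, sample_b, autojunk=False)
    return [
        sample_a[m.a:m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= min_len
    ]


def extract_holes(chunks, text):
    """Extract the hole values from text, or raise ValueError on mismatch."""
    # Fixed chunks are matched literally; each gap becomes a lazy group.
    pattern = "(.*?)".join(re.escape(chunk) for chunk in chunks)
    match = re.fullmatch(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError("text does not match the inferred template")
    return match.groups()


chunks = infer_template('<b>this and that</b>', '<b>alex and sue</b>')
print(chunks)
print(extract_holes(chunks, '<b>red and green</b>'))
```

Like the extract() method quoted above, this sketch is literal: it does not trim whitespace and knows nothing about HTML.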

Come to think of it, it's even more useful than Perl's Template::Extract, since it doesn't expect you to provide a clean template: it learns the template on its own from sample text.

Eli Bendersky
Sorry, I should have been more specific in my original question about what I am trying to accomplish. I'm looking to extract named variables from a statically structured document, ideally pass them through optional filter functions (something I don't think Template::Extract is even capable of), and then return the result as a dictionary object. You're right though... TemplateMaker does a great job of detecting the differences between documents and returning those differences as variables, but it's not quite what I need it to do.
Kyle Derkacz
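
For readers with the same requirement as in the comment above (named variables, optional filter functions, a dictionary result), the idea can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the API of Template::Extract, TemplateMaker, or any other library mentioned here: the `{{ name }}` placeholder syntax and the names `compile_template` and `extract_named` are invented for the example, and each placeholder name may appear only once per template.

```python
import re

# Hypothetical placeholder syntax: {{ name }}
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_]\w*)\s*\}\}")


def compile_template(template):
    """Turn a template into a regex with one named group per placeholder."""
    parts = []
    pos = 0
    for m in PLACEHOLDER.finditer(template):
        # Literal text before the placeholder must match exactly.
        parts.append(re.escape(template[pos:m.start()]))
        # The placeholder itself becomes a lazy named group.
        parts.append("(?P<%s>.+?)" % m.group(1))
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts), re.DOTALL)


def extract_named(template, document, filters=None):
    """Return {variable: value}, applying an optional filter per variable."""
    match = compile_template(template).fullmatch(document)
    if match is None:
        raise ValueError("document does not match the template")
    filters = filters or {}
    return {
        name: filters[name](value) if name in filters else value
        for name, value in match.groupdict().items()
    }


template = "<h1>{{ title }}</h1><p>Price: {{ price }} USD</p>"
document = "<h1>Blue Widget</h1><p>Price: 19.99 USD</p>"
print(extract_named(template, document, {"title": str.upper, "price": float}))
```

Variables without a filter come back as the raw matched string, so filters stay optional.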
@Kyle: maybe you need to rephrase your question then
Eli Bendersky
A: 

Here is an interesting discussion by Adrian Holovaty, the author of TemplateMaker: http://www.holovaty.com/writing/templatemaker/

It seems to be a lot like what I would call a wrapper induction library.

If you're looking for something more configurable (and less scraping-specific), take a look at lxml.html and BeautifulSoup, which are also for Python.

brianray
A: 

After digging around some more, I found exactly the solution I was looking for. filippo posted a list of Python solutions for screen scraping in this post: http://stackoverflow.com/questions/2861/options-for-html-scraping/1970411#1970411. Among them is a package called scrapemark (http://arshaw.com/scrapemark/).

Hope this helps anyone else who is looking for the same solution.

Kyle Derkacz