TemplateMaker
does seem to do what you need, at least according to its documentation. Instead of receiving a template as input, it infers ("learns") it from a few sample documents. It then provides the extract
method to pull the data out of other documents that were created with the same template.
The example shows:
# Now that we have a template, let's extract some data.
>>> t.extract('<b>red and green</b>')
('red', 'green')
>>> t.extract('<b>django and stephane</b>')
('django', 'stephane')
# The extract() method is very literal. It doesn't magically trim
# whitespace, nor does it have any knowledge of markup languages such as
# HTML.
>>> t.extract('<b> spacy and <u>underlined</u></b>')
(' spacy ', '<u>underlined</u>')
# The extract() method will raise the NoMatch exception if the data
# doesn't match the template. In this example, the data doesn't have the
# leading and trailing "<b>" tags.
>>> t.extract('this and that')
Traceback (most recent call last):
...
So, to achieve the task you require, I think you should:
- Give it a few documents rendered from your template, so that it can infer the template from them.
- Use the inferred template to extract data from new documents.
Come to think of it, it's even more useful than Perl's Template::Extract
since it doesn't expect you to provide a clean template: it learns one on its own from sample text.
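To make the learn-then-extract idea concrete, here is a minimal, self-contained sketch using only the Python standard library. It is not TemplateMaker's implementation or API: the ToyTemplate class, its learn method, and the sample strings are hypothetical names chosen for illustration, and the NoMatch exception here is a local stand-in. It also assumes every sample starts and ends with text that all samples share.

```python
import re
from difflib import SequenceMatcher


class NoMatch(Exception):
    """Local stand-in: raised when a document does not fit the learned template."""


class ToyTemplate:
    """Hypothetical toy learner; NOT TemplateMaker's actual algorithm."""

    def __init__(self):
        # Ordered text chunks shared by every sample seen so far.
        # A variable "hole" is implied between each pair of chunks.
        self.literals = None

    def learn(self, sample):
        """Keep only the template text that also occurs, in order, in sample."""
        if self.literals is None:
            self.literals = [sample]  # first sample: everything is literal
            return
        kept, pos = [], 0
        for chunk in self.literals:
            matcher = SequenceMatcher(None, chunk, sample[pos:], autojunk=False)
            end = pos
            for a, b, size in matcher.get_matching_blocks():
                if size:  # skip the zero-length terminator block
                    kept.append(chunk[a:a + size])
                    end = pos + b + size
            pos = end  # later chunks must match after this point
        self.literals = kept

    def extract(self, document):
        """Return the text filling each hole, or raise NoMatch."""
        pattern = "(.*?)".join(re.escape(chunk) for chunk in self.literals)
        match = re.fullmatch(pattern, document, flags=re.DOTALL)
        if match is None:
            raise NoMatch(document)
        return match.groups()


template = ToyTemplate()
template.learn('<b>this and that</b>')
template.learn('<b>alex and sue</b>')
print(template.literals)                         # ['<b>', ' and ', '</b>']
print(template.extract('<b>red and green</b>'))  # ('red', 'green')
```

Like the real extract method described above, this sketch is literal: it neither trims whitespace nor understands markup. A real learner also has to cope with samples whose variable parts happen to share characters, which this toy handles poorly.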