views:

195

answers:

3

Hi. Does anyone know a good parser for document metadata in Python for Unix-like systems? In Java, Apache Tika is great.

No COM ... please :)

Thanks

A: 

If you like Tika, you could always use Jython so you can reference Tika directly.

Hank Gay
Sure, but I was looking for a plain Python package.
locojay
A: 

hachoir_metadata works great with Excel documents: http://bitbucket.org/haypo/hachoir/wiki/Home

locojay
A: 

You don't have to use Jython to use Tika. You can call Java from Python using JCC. You can find decent instructions for this here.

When installing JCC you'll have to use one of two provided patches for setuptools, so it can build shared objects. The c7 version worked for me on Ubuntu 10.04.

Another option would be to use Python's subprocess module to run Tika as an external command and capture its stdout.
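A rough sketch of the subprocess route, in case it helps. It assumes the Tika command-line jar is saved as tika-app.jar (a placeholder path, adjust to your setup), that java is on your PATH, and that the --metadata option prints one "key: value" pair per line. The function names are just made up for this example.

```python
import subprocess

# Placeholder location of the Tika command-line jar; adjust to your install.
TIKA_JAR = "tika-app.jar"


def build_tika_command(path, jar=TIKA_JAR):
    """Return the argument list that runs tika-app in metadata-only mode."""
    return ["java", "-jar", jar, "--metadata", path]


def parse_metadata(output):
    """Turn 'key: value' lines into a dict, skipping lines without a pair."""
    metadata = {}
    for line in output.splitlines():
        # Split on the first ": " so keys such as "dcterms:created" stay whole.
        key, sep, value = line.partition(": ")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata


def extract_metadata(path, jar=TIKA_JAR):
    """Run Tika on one file and return its metadata as a dict."""
    result = subprocess.run(
        build_tika_command(path, jar),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if java/Tika exits non-zero
    )
    return parse_metadata(result.stdout.decode("utf-8", errors="replace"))
```

Calling extract_metadata("report.xls") should then give you a plain dict to work with. Starting a JVM for every file is slow, so for many documents JCC is probably the better fit.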

Kevin