My problem:

I have a PDF containing many Roman characters with complex diacritical marks (e.g., ṣ, ś, ṝ, ǎ). To make the PDF easier to search, I would like to add an additional text layer, much as one does with hOCR, in which the same text is present without the diacritics.

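When using full-text search engines I can index multiple terms at the same position (vector); I would like to achieve the same effect here, so that a search for either the accented or the unaccented form finds the same place in the document.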

I have read a lot about adding an hOCR layer to scanned images, but I really just want to duplicate the existing text layer, pass it through a script that strips the diacritics (straightforward enough), and then add the result back in as a hidden but searchable layer.
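To make the stripping step concrete, here is a minimal sketch in Python of what I have in mind, using only the standard-library `unicodedata` module. The function name `strip_diacritics` is just my own placeholder, and this covers only the text transformation, not extracting the text from the PDF or writing the hidden layer back.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining diacritical marks removed."""
    # NFD splits each precomposed character into its base letter
    # followed by one or more combining marks (e.g. "ṣ" -> "s" + U+0323).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks,
    # so keeping only the zero-class characters drops the diacritics.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose anything that is left, so the output is in NFC form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("ṣ ś ṝ ǎ"))
```

This only handles marks that Unicode decomposes into base letter plus combining character; letters such as ø or ł have no such decomposition and would need an explicit mapping table.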

Does anyone have any suggestions? (Solutions involving any platform, language, library or toolchain would be useful!)

Thanks :)

Edit: please let me know if the question is unclear.