I have PDF documents containing several images/pages of scanned documents. Their (OCR-produced) text content comes in separate XML files.
Is it possible to somehow attach or link the text content from the XML files to my PDF files? (Ideally there would be no additional files left in the repository to confuse unaware users.)
I've been told there's a 65k limit on a text property, so I can't simply put the text content into a property on the PDF, as the text might easily exceed that limit.
It has been suggested that I pass a stream with the text content to the cm:content property of my PDF file. I'm a bit lost here, as in my opinion that means I'm either providing a reference or assigning a huge string again. The former would mean the text content has to be preserved somewhere as a separate document. The latter sounds like I would hit the 65k limit again.
Also, I think setting cm:content would probably delete the PDF content itself. I need the PDF binary data to remain untouched.
This is where the suggestion is being discussed. I'm currently trying it anyway.