ansaurus

Question

How to link scanned document with its text content to make it searchable?

Answer 1

+1 A:

So, it is actually quite easy. What needs to be done is to define a property of type "d:content" on your document; I do that via an aspect:

model.xml:

<aspects>
    <aspect name="mm:my_aspect">
...
            <property name="mm:myTextContentProperty">
                <type>d:content</type>
            </property>
        </properties>
    </aspect>
</aspects>

Then, when I have both the PDF and its text representation in the repository, I link the two by adding the aspect and populating the property:

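// myAspect is the QName of mm:my_aspect; no extra aspect properties are passed (null)
getNodeService().addAspect(pdfNodeRef, myAspect, null);
// point the d:content property at the stored text file (content URL, mimetype, size, encoding)
getNodeService().setProperty(pdfNodeRef, MyModel.MY_TEXT_CONTENT_PROPERTY, new ContentData("store://....bin", "text/plain", size, "UTF-8"));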

Now the PDF can be found via both of the following queries, even though it does not contain any text data itself:

"@\\{http\\://mymodel.ns/content/1.0\\}myTextContentProperty:\"" + string + "\""
"TEXT:\"" + string + "\""
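The two query strings above are easy to get wrong because of the double escaping (Java string literal plus Lucene syntax). A minimal sketch of a helper that builds them is shown here; the class and method names are hypothetical, and escaping quotes and backslashes inside the search term is my own addition, not something the original snippet does.

```java
// Hypothetical helper, not part of Alfresco: builds the two Lucene query strings shown above.
final class LuceneQueries {
    private LuceneQueries() {}

    /** Escapes the braces and colons of a {namespace}localName pair for use after '@'. */
    static String escapeQName(String namespaceUri, String localName) {
        String escapedUri = namespaceUri.replace(":", "\\:");
        return "\\{" + escapedUri + "\\}" + localName;
    }

    /** Escapes backslashes and double quotes so the term stays inside its quoted phrase (assumption). */
    static String escapePhrase(String term) {
        return term.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Query against one specific property, e.g. mm:myTextContentProperty. */
    static String propertyQuery(String namespaceUri, String localName, String term) {
        return "@" + escapeQName(namespaceUri, localName) + ":\"" + escapePhrase(term) + "\"";
    }

    /** Query against the general full-text field. */
    static String fullTextQuery(String term) {
        return "TEXT:\"" + escapePhrase(term) + "\"";
    }
}
```

Calling LuceneQueries.propertyQuery("http://mymodel.ns/content/1.0", "myTextContentProperty", string) should yield the same string as the first hand-written query, as long as the term contains no quotes or backslashes.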

The latter is also hinted at here, and I guess that is how the regular search in Alfresco Web Client works, because the PDF is now reachable using the regular search input.
There is one issue though: the search returns both the PDF document and the text document I link via the property. So now I need to hide the latter from the search results...

(When searching using the first query only the PDF is found, as expected; but that approach is of little use to me.)
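One possible way to hide the linked text documents from the TEXT: results is to filter them out after the search. The sketch below is only an illustration under assumptions: node identifiers are modelled as plain strings rather than Alfresco NodeRef objects, the class and method names are hypothetical, and the caller is assumed to already know which nodes are the linked text files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical post-filter, not an Alfresco API: drops linked text nodes from a result list.
final class LinkedTextFilter {
    private LinkedTextFilter() {}

    /**
     * Returns the hits in their original order, minus every id that belongs
     * to a text document linked from a PDF.
     */
    static List<String> hideLinkedTextNodes(List<String> hits, Set<String> linkedTextNodeIds) {
        List<String> visible = new ArrayList<>();
        for (String hit : hits) {
            if (!linkedTextNodeIds.contains(hit)) {
                visible.add(hit);
            }
        }
        return visible;
    }
}
```

The input list is left unmodified, so the same search result can still be used elsewhere.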

Hopefully this saves other Alfresco newbies some time. :)

Jaroslav Záruba 2010-10-13 17:24:12

I'll definitely add this to my bookmarks for future use!

chiccodoro 2010-10-14 07:23:27

Answer 2

A:

Another way to achieve what I need is to set MY_TEXT_CONTENT_PROPERTY using the ContentService:

ContentWriter writer = getContentService().getWriter(pdfNodeRef, MyModel.MY_TEXT_CONTENT_PROPERTY, true);
writer.setMimetype("text/plain");
writer.setEncoding("UTF-8");
writer.putContent(stringFromXmlDescription); // the source XML gets thrown away

(The important thing seems to be to put the content only after the mimetype and encoding have been set. Otherwise the content/property is not searchable.)

With this approach there is no need to hide the linked text documents, because there aren't any.

Jaroslav Záruba 2010-10-20 14:04:22
