In my current project I need to index all e-mails and their attachments from multiple mailboxes.

I will use Solr, and I don't know the best way to structure my index. My first approach was:

<fields>
  <field name="id" required="true"/>
  <field name="uid" required="true"/>
  <!-- a lot of other fields -->
  <dynamicField name="attachmentName_*" required="false"/>
  <dynamicField name="attachmentBody_*" required="false"/>
</fields>

But now I am not sure this is the best structure. I don't think I can search for a single term (e.g. stackoverflow) and learn which field contained it (e.g. attachmentBody_1, attachmentBody_2, attachmentBody_3, etc.) with a single query.

Does anyone have a better suggestion for my index's structure?

A: 

I found one possible solution. All I need to do is set attachmentBody as stored.

This solution is not ideal because the index size will increase dramatically, but in my case that is not a problem: I will implement the highlighting feature too, and highlighting needs those fields to be stored anyway.
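A minimal sketch of what this could look like. The field type names (string, text_general) and the /select handler defaults are assumptions and not part of my real schema. With the values stored and highlighting enabled on the dynamic fields, the highlighting section of the response is keyed by field name, so it shows whether the match was in attachmentBody_1, attachmentBody_2, and so on.

```xml
<!-- schema.xml: store the attachment text so it can be returned and highlighted.
     The type names are assumptions; use whatever types your schema defines. -->
<dynamicField name="attachmentName_*" type="string" indexed="true" stored="true"/>
<dynamicField name="attachmentBody_*" type="text_general" indexed="true" stored="true"/>

<!-- solrconfig.xml: highlight every attachmentBody_N field by default -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">attachmentBody_*</str>
  </lst>
</requestHandler>
```

The query itself still has to reach those fields, for example through a catch-all field populated with copyField.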

Rui Carneiro
+2  A: 

You can use multiValued fields for attachmentName and attachmentBody, so you would have two regular fields instead of the dynamic fields. You can then use highlighting to bring back the specific values that match, with surrounding context.
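As a rough sketch of the first option, the two fields might be declared as below. The type names (string, text_general) are assumptions; substitute the types your schema actually defines.

```xml
<!-- schema.xml: one multiValued pair replaces attachmentName_* / attachmentBody_* -->
<field name="attachmentName" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- stored="true" is needed here because highlighting reads the stored text -->
<field name="attachmentBody" type="text_general" indexed="true" stored="true" multiValued="true"/>
```

Nothing in the schema links a name to its body. The two lists line up only if you add the values for each attachment in the same order when indexing.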

Another option would be to make each attachment a separate document, and store something to identify which email it belongs to. The downside of this approach is that you may need to index any data from the email itself several times. But this is really only a problem if most of the email messages have more than one attachment.
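A sketch of the second option could look like the following. The docType and emailId field names are hypothetical, chosen only for illustration, and the type names are assumptions as well.

```xml
<!-- schema.xml: every email and every attachment is indexed as its own document -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<!-- hypothetical discriminator: "email" or "attachment" -->
<field name="docType" type="string" indexed="true" stored="true" required="true"/>
<!-- hypothetical back-reference: id of the email an attachment belongs to -->
<field name="emailId" type="string" indexed="true" stored="true"/>
<field name="attachmentName" type="string" indexed="true" stored="true"/>
<!-- the body can stay unstored if you only need to know which file matched -->
<field name="attachmentBody" type="text_general" indexed="true" stored="false"/>
```

With this layout each hit is a single attachment, so attachmentName and emailId on the hit already identify the file. The body only needs to be stored if you also want highlighted snippets.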

KenE
That way I would never be able to know which file contained the match. Anyway, highlighting needs fields with stored="true", so we still have the space problem.
Rui Carneiro
Your second option is a good one, and I had already thought of that =)
Rui Carneiro