Hey guys, I just installed Solr, edited the schema.xml, and am now trying to index some test data and search on it.

In the XML file I'm sending to Solr, one of my fields looks like this:

<field name="PageContent"><![CDATA[<p>some text in a paragrah tag</p>]]></field>

There's HTML there, so I've wrapped it in CDATA.

In my SOLR schema.xml, the definition for that field looks like this:

<field name="PageContent" type="text" indexed="true" stored="true"/>

When I ran the POSTing tool, everything went ok, but when I search for content which I know is inside the PageContent field, I get no results.

However, when I set the node to PageContent, it works. But if I set it to any other field, it doesn't search in PageContent.

Am I doing something wrong? What's the issue?

thanks very much for any help

cheers!

UPDATE

Just to clarify on the error.

I've uploaded a "doc" with the following data:

<field name="PageID">928</field>
<field name="PageName">some name</field>
<field name="PageContent"><![CDATA[<p>html content</p>]]></field>

In my schema I've defined the fields as such:

<field name="PageID" type="integer" indexed="true" stored="true" required="true"/>
<field name="PageName" type="text" indexed="true" stored="true"/>
<field name="PageContent" type="text" indexed="true" stored="true"/>

And:

<uniqueKey>PageID</uniqueKey>
<defaultSearchField>PageName</defaultSearchField>

Now, when I use the Solr admin tool and search for "some name", I get a result. But if I search for "html content", "html", "content", or "928", I get no results.

why?

cool, thanks!

+1  A: 

You are making sure that your data has been committed before you attempt to search on it, right?
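
For reference, with the XML update format a commit is its own message, posted to the update handler after the documents have been added. A minimal sketch, assuming you are posting to the stock /update handler:

```xml
<!-- Sent as a separate POST to /update after the <add> message;
     added documents are not visible to searches until this has run -->
<commit/>
```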

Also, if you want to index raw HTML, it's probably best to strip the markup first. You can do this in your application or with Solr's solr.HTMLStripWhitespaceTokenizerFactory, like:

<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>

You declare this in the fieldtype definition for "text". You might want to create a new field type just for your HTML, maybe something like text_html, defined like so:

<fieldtype name="text_html" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldtype>
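
To actually use a type like this, the field definition has to point at it. A small sketch, assuming the PageContent field from the question and the text_html name suggested above:

```xml
<!-- In schema.xml, inside <fields>: same field as in the question,
     with type switched from "text" to the hypothetical "text_html" -->
<field name="PageContent" type="text_html" indexed="true" stored="true"/>
```

You would need to reindex after changing the type, since analysis happens at index time.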

I am not sure what you mean by:

However, when I set the node to PageContent, it works. But if I set it to any other field, it doesn't search in PageContent.

Can you please elaborate?

Cody Caughlan
cool cody, the code above is really useful, I'll definitely strip out the html. As for the PageContent problem, I've updated my question above. Thanks so much.
andy
+2  A: 

You mentioned that your default search field is set to PageName, so I wouldn't expect a search for "content" to return anything.

You probably meant to put "PageContent:content" in the search box to find data in that field. If you want to search against multiple fields, check out http://wiki.apache.org/solr/DisMaxRequestHandler. The Solr admin console is not a great tool for playing around with all the DisMax search options; you'll want to manipulate the URL directly for that.

Regardless, I agree with the previous poster: if your analysis isn't set up properly to deal with HTML, you are likely to get all sorts of unexpected search results. Strip the HTML out and index text only.

If you want your standard query handler to search against all your fields, you can change it in your solrconfig.xml (I always add a second query handler instead of modifying "standard"). The fl field is the list of fields you want to search against. It's a comma-separated list or *.

<requestHandler name="standard" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
        <str name="echoParams">all</str>
        <str name="hl">true</str>
        <str name="fl">*</str>
    </lst>
</requestHandler>
Trey
cool, thanks Trey. So let me get this straight. I'm a bit confused. So, if I just send a search query, i.e. "solr/?q=hi i live in the content node" SOLR will only do lookup against a single field? When I run the example vanilla SOLR setup I feel like a single simple query searches all fields? Am I wrong?
andy
Because there is no syntax highlighting in comments, I clarified above with a suggestion.
Trey
+1 sweet, thanks dude, I'll try that out and get back to you
andy
A: 

Regarding Trey's answer: fl is the list of fields returned by the query. qf is the parameter you meant to refer to, and it doesn't support wildcards.
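
As a rough sketch of that difference, a DisMax handler for the schema in the question might keep the two parameters apart like this. The handler name "dismax" and the choice of fields are assumptions for illustration; qf takes a space-separated list of field names.

```xml
<!-- solrconfig.xml: hypothetical second handler, leaving "standard" untouched -->
<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
        <!-- qf: fields the query is matched against (each must be named explicitly) -->
        <str name="qf">PageName PageContent</str>
        <!-- fl: fields returned in each result -->
        <str name="fl">*</str>
    </lst>
</requestHandler>
```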

As far as I know, the only way to search all fields without listing them is to have a copyField that catches all values (indexed but not stored), then mimic searching against all fields by searching against it.
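
For the schema in the question, that could look something like the sketch below. The catch-all field name "text" is an assumption; it has to be multiValued because more than one source field is copied into it.

```xml
<!-- schema.xml, inside <fields>: catch-all field, indexed but not stored -->
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

<!-- schema.xml, at the top level (outside <fields>) -->
<copyField source="PageName" dest="text"/>
<copyField source="PageContent" dest="text"/>
<defaultSearchField>text</defaultSearchField>
```

With this in place, a plain query with no field prefix is matched against the combined contents of both fields.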

A: 

In my schema.xml I have something like the following, which copies the value of each field ending in _t into the text field.

<defaultSearchField>text</defaultSearchField>
<copyField source="*_t" dest="text" maxChars="3000"/>
Kurt Harriger