We're having problems with UTF-8 in Solr, and need to debug the documents that are sent for indexing. Can we do this somehow?

I've searched all the logs I could find and enabled debug="1" in the app XML in the tomcat6 / Catalina directory. I even tried Wireshark, but no dice. Please please!

Everything looks good on the PHP side, and this has been working fine until now. But international characters turn into ?, the classic headache.

+1  A: 

You could use Tcpmon.

I use it a lot, as it lets me see the HTTP headers and payload when sending to Solr (or any web app).

Pascal Dimassimo
I can't get it to pass data back from tomcat to php, but I can see the request. Are you using it with both the web app and Solr running on the same machine?
Znarkus
Hm. It seems tcpmon can't show post data, which makes it useless for me :(
Znarkus
Tcpmon can show POST data. I just used it a couple of minutes ago to debug a posted update to Solr...
Pascal Dimassimo
Yes, I am using it with everything on the same machine.
Pascal Dimassimo
Okay, not working for me. What OS are you on?
Znarkus
Ubuntu 10.04 with OpenJDK 1.6.0_18. I'm also using it on a Windows XP machine without problems. What behavior are you seeing?
Pascal Dimassimo
+2  A: 

Make sure the PHP side is perfect. Did you open the XML file in an editor with the encoding explicitly set to UTF-8? What is your default system encoding? I bet converting the file from that encoding to UTF-8 would solve the problem (e.g. with iconv).
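To illustrate the conversion idea, here is a minimal Python sketch of what an iconv-style fix does: check whether the bytes already decode as UTF-8 and, if not, re-encode them from an assumed source encoding. The function name `to_utf8` and the ISO-8859-1 default are assumptions for the example, not something taken from this thread; substitute your actual system encoding.

```python
def to_utf8(raw: bytes, assumed_encoding: str = "iso-8859-1") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else re-encode it.

    assumed_encoding is a guess at the legacy encoding (hypothetical
    default); a wrong guess produces mojibake rather than an error.
    """
    try:
        raw.decode("utf-8")  # strict decode: raises on any invalid byte sequence
        return raw
    except UnicodeDecodeError:
        # Not UTF-8: interpret the bytes in the assumed legacy encoding
        # and write them back out as UTF-8.
        return raw.decode(assumed_encoding).encode("utf-8")


if __name__ == "__main__":
    latin1_bytes = "caf\u00e9".encode("iso-8859-1")  # b'caf\xe9'
    print(to_utf8(latin1_bytes))
```

Note that short legacy-encoded inputs can occasionally happen to be valid UTF-8 as well, so a successful decode is a strong hint rather than proof.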

Solr only accepts UTF-8, and because of the nature of XML, only a subset of those characters is even allowed. You can also scan the XML generated by PHP for invalid XML characters.
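The snippet from the original answer is not reproduced here, so the following is only a sketch of such a scan, written in Python. It assumes the payload is the raw bytes PHP would post to Solr, and it uses the character ranges permitted by XML 1.0; the names `is_valid_xml_char` and `find_invalid_xml_chars` are made up for this example.

```python
def is_valid_xml_char(cp: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_xml_chars(payload: bytes) -> list:
    """Return (character index, hex code point) for each disallowed character.

    Raises ValueError if the payload is not valid UTF-8 in the first place.
    """
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"payload is not valid UTF-8 at byte offset {exc.start}"
        ) from exc
    return [
        (i, hex(ord(ch)))
        for i, ch in enumerate(text)
        if not is_valid_xml_char(ord(ch))
    ]


if __name__ == "__main__":
    # A vertical tab (0x0B) is valid UTF-8 but not a legal XML 1.0 character.
    sample = "<doc>caf\u00e9\x0b</doc>".encode("utf-8")
    print(find_invalid_xml_chars(sample))
```

An empty list means the payload is valid UTF-8 and contains only characters XML 1.0 allows; a `ValueError` points at the first byte that is not UTF-8 at all, which is the more likely cause when characters show up as ?.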

Karussell
Wow, really wish I would've investigated the XML encoding more. Someone had snuck in a method that broke the encoding. Thanks!
Znarkus