I'm using Sphider as a search engine for my website. It's really easy to work with, but I'm having some major issues with localized characters.

All of my HTML/PHP pages declare the charset as UTF-8, but the search and results pages that ship with Sphider had charset=ISO-8859-1. When I first used the Sphider spider to crawl my website, it turned all of my localized characters into some encoding I don't recognize:

"ç" became "Ã§", and so on with "ã", "á", etc.
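
For illustration, here is a minimal PHP sketch of how this kind of garbling can arise: the UTF-8 bytes of "ç" are read as if they were ISO-8859-1 and then re-encoded as UTF-8. It assumes the mbstring extension is available and is not taken from Sphider's code.

```php
<?php
// "ç" is stored in UTF-8 as the two bytes 0xC3 0xA7.
$utf8 = "\xC3\xA7";

// Treat those bytes as ISO-8859-1 and re-encode them as UTF-8:
// each byte becomes its own character, "Ã" (0xC3) and "§" (0xA7).
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";    // c3a7
echo $garbled, "\n";          // Ã§
echo bin2hex($garbled), "\n"; // c383c2a7
```

If the database shows two-character sequences like this in place of each accented letter, the text was probably decoded as ISO-8859-1 somewhere between the page and the table.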

When I created the database in MySQL I gave it the utf8_general_ci collation. My other database settings are: MySQL charset: UTF-8 Unicode (utf8); MySQL connection collation: utf8_unicode_ci.

This is a real problem because the search won't work properly. If I search for "diferença", for instance, the URL shows "?query=diferença&search=1", which is correct, but the search produces no results. In the "suggested search" it appears as "diferen�a"; in case it's not visible, the "ç" has become a black square with a white question mark on it.
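
One possible explanation for the black square, sketched in PHP under the assumption that the suggestion is emitted as ISO-8859-1 bytes into a page the browser reads as UTF-8: the single byte for "ç" is not valid UTF-8, so it is shown as the replacement character U+FFFD. The sketch assumes the mbstring extension.

```php
<?php
// "diferença" with "ç" as the single ISO-8859-1 byte 0xE7.
$latin1 = "diferen\xE7a";

// 0xE7 followed by "a" is not a valid UTF-8 sequence.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// ENT_SUBSTITUTE replaces the invalid byte with U+FFFD, which is
// roughly what a browser does when it renders the page as UTF-8.
$shown = htmlspecialchars($latin1, ENT_SUBSTITUTE, 'UTF-8');

// Converting from the real source encoding gives the intended word.
$fixed = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo $fixed, "\n"; // diferença
```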

I believe the spider might be working in a different charset, but I can't figure out where that would be set, if that is the case. Since it was developed primarily for English, it's not surprising that it has some hiccups along the way.

Does anyone have any experience with this, or know what I should try in order to solve it?

What's really bugging me is not understanding why I get strange symbols in the DB.

A: 

Quickly browsing through some Sphider source code files revealed that the application works only with Latin1 charset. You should switch to some other search engine, like Lucene. You'll need to do a bit more search-related coding though. If you don't feel like doing it, and your site is public, just integrate Google search.

jmz
Thank you. Though it's limited, I'm going to keep it for now. I don't want to use Google, as I would have no way to control the spidering or the results layout to integrate it into my website. I originally looked at Lucene, but it is over my head.
Joel
If you can, you could use output buffering to capture the whole page you're generating, then if it was requested by the spider, convert it to ISO-8859-1//IGNORE with iconv.
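
A rough sketch of that idea in PHP, to be included at the top of each page. The function names are made up for illustration, and the check for "Sphider" in the User-Agent header is an assumption; use whatever user agent string your spider is actually configured to send.

```php
<?php
// Hypothetical check: treat the request as coming from the spider when
// its User-Agent contains "Sphider" (assumed; the real string may differ).
function is_spider_request(string $userAgent): bool
{
    return stripos($userAgent, 'Sphider') !== false;
}

// Output-buffer callback: re-encode the finished UTF-8 page as
// ISO-8859-1, dropping characters that have no ISO-8859-1 equivalent.
function to_latin1(string $html): string
{
    $converted = iconv('UTF-8', 'ISO-8859-1//IGNORE', $html);
    // iconv() returns false on failure; fall back to the original page.
    return $converted === false ? $html : $converted;
}

if (is_spider_request($_SERVER['HTTP_USER_AGENT'] ?? '')) {
    header('Content-Type: text/html; charset=ISO-8859-1');
    ob_start('to_latin1'); // the callback runs when the buffer is flushed
}

// ... the rest of the page is generated as usual ...
```

If the page also declares UTF-8 in a meta tag, the callback would need to rewrite that declaration too, otherwise the spider may still be told the content is UTF-8.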
jmz
A: 

You should have EVERYTHING in utf-8.

  • The forms that edit any given page
  • The physical files
  • The output HTML files
  • The headers
  • The connection to the database
  • The table definition

Miss one and you will have problems (I'm talking from personal experience)
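
As a rough PHP illustration of the header, connection and table items (the function names, credentials and the `pages` table are placeholders, not anything from Sphider):

```php
<?php
// Header: tell the browser the response is UTF-8.
function send_utf8_header(): void
{
    header('Content-Type: text/html; charset=utf-8');
}

// Connection: make the MySQL client connection use UTF-8 as well.
// Host, user, password and database name are supplied by the caller.
function open_utf8_connection(string $host, string $user, string $pass, string $name): mysqli
{
    $db = new mysqli($host, $user, $pass, $name);
    $db->set_charset('utf8'); // 'utf8mb4' is the usual choice on newer MySQL
    return $db;
}

// Table definition: hypothetical table declared with a UTF-8 charset.
const CREATE_TABLE_SQL =
    'CREATE TABLE IF NOT EXISTS pages (
        id INT AUTO_INCREMENT PRIMARY KEY,
        body TEXT
    ) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci';

// Forms and output HTML: declare UTF-8 in the markup too.
const HTML_HEAD = '<meta charset="utf-8">';
const FORM_OPEN = '<form method="post" accept-charset="UTF-8">';
```

The physical files are the one item code cannot enforce: they have to be saved as UTF-8 in the editor.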

The Disintegrator
I believe I do; it's just the Sphider app that must not be built to use UTF-8, and I'm having trouble adapting it.
Joel