I'm trying to index Wikipedia dumps. My SAX parser makes Article objects from the XML with only the fields I care about, then sends them to my ArticleSink, which produces Lucene Documents.

I want to filter out special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle().startsWith(prefix). In English, everything works fine: I get a Lucene index with all the pages except those with the matching prefixes.
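
Roughly, the check in my ArticleSink looks like this (a simplified sketch, names are illustrative; the prefix array is shown further down):

boolean isIgnored(Article article) {
    String title = article.getTitle();
    for (String prefix : ignoredPrefix) { // the array of filtered prefixes below
        if (title.startsWith(prefix)) {
            return true;  // special/meta page, don't index it
        }
    }
    return false;         // regular article, index it
}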

In French, the prefixes with no accent also work (i.e. they filter the corresponding pages), some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:), yet I cannot see any difference between the corresponding lines (in less).

I can't really inspect all the differences in the file because of its size (5 GB), but it looks like correct UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the offending pages, the <title>Catégorie:something</title> is correctly displayed by grep). On the other hand, when I recreate a wiki XML by tail/head-cutting the original file, the same page (here Catégorie:Rock par ville) gets filtered in the small file but not in the original…

Any idea?

Alternatives I tried:

Getting the file (commented lines were tried without success*):

FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8");
//(custom function opening the stream,
//reading it as UTF-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource(fis); is.setEncoding("UTF-8");
parser.parse(fis, handler);

Filtered prefixes:

ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
    "Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
    "Catégorie:", "Modèle:", "Wikipédia:", // UTF-8 as ISO-8859-1
    "Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work

* ERRATUM

Actually, my bad: that one does work, I had tested the wrong index:

InputSource is = new InputSource(fis);
is.setEncoding("UTF-8"); // force UTF-8 interpretation
parser.parse(is, handler);
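
For completeness, the whole working setup looks roughly like this (parser and handler construction simplified; WikiDumpHandler is a placeholder for my actual SAX handler):

import java.io.File;
import java.io.FileInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
DefaultHandler handler = new WikiDumpHandler(articleSink); // placeholder for my handler
FileInputStream fis = new FileInputStream(new File(xmlFileName));
InputSource is = new InputSource(fis);
is.setEncoding("UTF-8"); // force UTF-8, don't rely on the platform default
parser.parse(is, handler);
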
+1  A: 

Since you write the prefixes as plain strings in your source file, you want to make sure that you save that .java file in UTF-8, too (or any other encoding that supports the special characters you're using). Then, however, you have to tell the compiler which encoding the file is in with the -encoding flag:

javac -encoding utf-8 *.java
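
If you can't control the compiler flags everywhere, another option (just an alternative, your code doesn't require it) is to write the accented characters as Unicode escapes, which the compiler resolves regardless of the source-file encoding:

// "é" is U+00E9 and "è" is U+00E8, so these literals survive any source encoding
ignoredPrefix = new String[] {"Cat\u00e9gorie:", "Mod\u00e8le:", "Wikip\u00e9dia:"};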

For the XML source, you could try

Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");

InputStreams do not deal with encodings since they are byte-based, not character-based. So here we create a Reader from a FileInputStream: the latter (the stream) doesn't know about encodings, but the former (the reader) does, because we give the encoding in the constructor.

Thomas
My sources are already encoded and compiled in UTF-8. As for your suggestion, that is what ReaderInputStream.forceEncodingInputStream does, except that it converts the result back to an InputStream, because the SAXParser only supports binary input.
streetpc
SAXParser also takes an `InputSource`, to which you can pass a `Reader`: `parser.parse(new InputSource(r), handler);`
Thomas
Yes, I already tried that (see the commented code). Turns out I must have missed something; it worked before I even asked here. Still, I'm accepting your answer because, well, it works, and to thank you for your help.
streetpc