Hi, I'm trying to analyze the content of a Drupal database for collective-intelligence purposes.
So far I've worked out a simple example that tokenizes the various contents (mainly forum posts) and counts the tokens after removing stop words.
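Roughly, what I have looks like the minimal sketch below (recent Lucene API; the class name, the field name and the tiny stop list are placeholders for my real setup):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class TokenCounter {
    public static Map<String, Integer> countTokens(String text) throws IOException {
        // Tiny stand-in stop list; the real one is a full Italian set.
        CharArraySet stopWords = new CharArraySet(
                Arrays.asList("la", "di", "per", "i", "e", "lo", "a", "questo", "sta", "sulla"), true);
        Map<String, Integer> counts = new HashMap<>();
        try (Analyzer analyzer = new StandardAnalyzer(stopWords);
             TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                          // mandatory before incrementToken()
            while (ts.incrementToken()) {
                counts.merge(term.toString(), 1, Integer::sum);
            }
            ts.end();
        }
        return counts;
    }
}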
The StandardTokenizer supplied with Lucene should be able to tokenize hostnames and emails, but the content can also contain embedded HTML, e.g.:
Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi
Linux, UNIX e Windows.\r\n\r\nQuesto documento sta sulla piattaforma KM e lo potete
scaricare a questo <a href=\'https://sfkm.griffon.local/sites/BSF%20KM/BSF/CC%20T/Specifiche/Eventi2008/IBM%20DB2%20for%20Linux,%20UNIX%20e%20Windows.pdf\' target=blank>link</a>.
(In English: "We are publishing IBM's presentation on DB2 for the various operating systems Linux, UNIX and Windows. This document is on the KM platform and you can download it from this link.")
This gets tokenized badly, like so:
pubblichiamo -> 1
presentazione -> 1
ibm -> 1
riguardante -> 1
db2 -> 1
vari -> 1
sistemi -> 1
operativi -> 1
linux -> 1
unix -> 1
windows -> 1
documento -> 1
piattaforma -> 1
km -> 1
potete -> 1
scaricare -> 1
href -> 1
https -> 1
sfkm.griffon.local -> 1
sites -> 1
bsf -> 1
20km/bsf -> 1
cc -> 1
20t/specifiche/eventi2008/ibm -> 1
20db2 -> 1
20for -> 1
20linux -> 1
20unix -> 1
20e -> 1
20windows.pdf -> 1
target -> 1
blank -> 1
link -> 1
What I would like is to keep links together as single tokens and to strip HTML tags (like <pre> or <strong>) that are useless.
Should I write a Filter or a different Tokenizer? Should the Tokenizer replace the standard one, or can I mix them together? The hardest way would be to take StandardTokenizerImpl, copy it into a new file and add custom behaviour, but I'd rather not go too deep into Lucene's implementation for now (I'm learning gradually).
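If I understand the analysis chain correctly, a Filter sits downstream of the Tokenizer and only sees tokens that were already split, so it could at best drop noise like href or target, but it could never re-join a URL that StandardTokenizer has already broken apart. This skeleton is the kind of Filter I mean (the class name and word list are made up):

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;

// Drops tokens that look like leftover HTML attribute names. Only meant to
// illustrate where a Filter sits in the chain; the word list is invented.
final class DropMarkupNoiseFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    DropMarkupNoiseFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            String term = termAtt.toString();
            if (!term.equals("href") && !term.equals("target")) {
                return true;  // keep this token
            }
            // otherwise skip it and pull the next token from upstream
        }
        return false;
    }
}

Which would mean that keeping links together has to happen at (or before) the Tokenizer, not in a Filter.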
Maybe something like this is already implemented, but I've been unable to find it.
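The closest thing I've spotted so far is HTMLStripCharFilter from the analysis-common module of recent Lucene versions: being a CharFilter, it removes markup from the character stream before the Tokenizer ever sees it. Something like this, if I've read the API right (the analyzer class name is mine):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

import java.io.Reader;

// StandardTokenizer, but fed HTML-stripped text: tags like <a ...> and
// <strong> disappear before tokenization, so "href", "target" etc. never
// become tokens in the first place.
public class HtmlAwareAnalyzer extends Analyzer {
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Strips tags and decodes entities on the raw character stream.
        return new HTMLStripCharFilter(reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new StandardTokenizer());
    }
}

If that's right, the stripping part is solved; for keeping whole URLs together, UAX29URLEmailTokenizer (also shipped with Lucene) apparently emits URLs and e-mail addresses as single tokens and could replace StandardTokenizer above.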
EDIT:
Looking at StandardTokenizerImpl makes me think that if I have to extend it by modifying the actual implementation, it's not much more convenient than using lex or flex and doing it myself (StandardTokenizerImpl is generated from a JFlex grammar anyway).