I am a Computer Science student working on a project based on the Nutch search engine. I want to develop Java algorithms to better index and search Arabic websites. How can I optimize for this purpose? Any ideas?

A: 

The Arabic alphabet has 28 letters (29 if hamza is counted separately). Some of these letters have variant forms, like the alif (ا), which can also be written as أ, إ or آ.

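If you manage to be tolerant of these variants, i.e. to allow spelling mistakes on these characters, you can return words that differ only in the variant as close results.

E.g. أحمد, احمد, إحمد and آحمد have different Unicode code points (and therefore different UTF-8 encodings), but you can treat them as matches for each other.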
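A minimal sketch of that idea in plain Java, assuming you apply the same normalization both when indexing and when parsing the query. The class and method names are made up for illustration and are not part of Nutch. It folds the three hamza/madda forms of alif into the bare alif, so all four spellings above end up as the same term:

```java
final class AlifNormalizer {
    private AlifNormalizer() {}

    /** Maps the hamza/madda variants of alif to bare alif (U+0627). */
    static String normalizeAlif(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case 0x0622: // alif with madda above
                case 0x0623: // alif with hamza above
                case 0x0625: // alif with hamza below
                    out.append((char) 0x0627); // bare alif
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The four spellings of "Ahmad" from the example, written as escapes
        // so the file compiles regardless of the source encoding.
        String[] variants = {
            "\u0623\u062D\u0645\u062F", // alif with hamza above
            "\u0627\u062D\u0645\u062F", // bare alif
            "\u0625\u062D\u0645\u062F", // alif with hamza below
            "\u0622\u062D\u0645\u062F"  // alif with madda
        };
        String expected = "\u0627\u062D\u0645\u062F";
        for (String v : variants) {
            System.out.println(normalizeAlif(v).equals(expected)); // prints true
        }
    }
}
```

If you still want exact spellings to rank higher, you could keep the original form in a second field and only use the normalized one as a fallback.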

Moreover, you can derive roots from words, to allow searching across singulars, plurals, verbs, nouns, etc.

So if someone types قال (said), you can include in the search terms the words قول (saying), يقول (he says) and مقال (a saying), etc. It takes a complicated engine to do this properly.
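To show only the query-expansion side of this, here is a rough sketch that assumes the hard part (finding the root of a word) is already solved. It uses a tiny hand-built lookup table in place of a real morphological analyzer. The class name, the table and its single entry are all hypothetical; a real system would need a proper root extractor or stemmer behind `expand`.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class RootExpander {
    // Hypothetical lookup tables standing in for a real root-extraction engine.
    private static final Map<String, String> WORD_TO_ROOT = new HashMap<>();
    private static final Map<String, Set<String>> ROOT_TO_WORDS = new HashMap<>();

    static void register(String root, String... words) {
        for (String w : words) {
            WORD_TO_ROOT.put(w, root);
            ROOT_TO_WORDS.computeIfAbsent(root, k -> new LinkedHashSet<>()).add(w);
        }
    }

    static {
        // Root q-w-l with the four words from the example:
        // qala (said), qawl (saying), yaqul (he says), maqal (a saying).
        register("\u0642\u0648\u0644",
                 "\u0642\u0627\u0644",
                 "\u0642\u0648\u0644",
                 "\u064A\u0642\u0648\u0644",
                 "\u0645\u0642\u0627\u0644");
    }

    /** Returns every known word sharing the term's root, or just the term itself. */
    static Set<String> expand(String term) {
        String root = WORD_TO_ROOT.get(term);
        if (root == null) {
            return Collections.singleton(term);
        }
        return Collections.unmodifiableSet(ROOT_TO_WORDS.get(root));
    }

    public static void main(String[] args) {
        // Expanding "qala" yields all four registered forms.
        System.out.println(expand("\u0642\u0627\u0644").size()); // prints 4
    }
}
```

The expanded set can then be OR-ed together in the search query, ideally with the originally typed word weighted highest.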

Finally, consider tashkeel (the diacritic vowel marks), which are optional when typing. A query typed with tashkeel can be taken as a more specific search, while a query typed without it should ignore the tashkeel in the indexed text.

E.g. رجل could match رَجُلٌ (a man), رَجَلَ (walked on foot) or رِجْل (leg).
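A small sketch of the tashkeel handling, again in plain Java with made-up names. It assumes the marks of interest are the code points U+064B to U+0652 (the tanween forms, fatha, damma, kasra, shadda and sukun) plus the superscript alif U+0670. The `matches` rule is a simplification: a query without tashkeel matches any vocalisation, while a query with tashkeel must match exactly.

```java
import java.util.regex.Pattern;

final class TashkeelStripper {
    // Tanween, short vowels, shadda, sukun, and superscript alif.
    private static final Pattern TASHKEEL =
            Pattern.compile("[\\u064B-\\u0652\\u0670]");

    private TashkeelStripper() {}

    /** Removes all tashkeel marks, leaving only the base letters. */
    static String stripTashkeel(String text) {
        return TASHKEEL.matcher(text).replaceAll("");
    }

    static boolean hasTashkeel(String text) {
        return TASHKEEL.matcher(text).find();
    }

    /** Bare queries ignore tashkeel; vocalised queries require an exact match. */
    static boolean matches(String query, String indexed) {
        if (hasTashkeel(query)) {
            return query.equals(indexed);
        }
        return query.equals(stripTashkeel(indexed));
    }

    public static void main(String[] args) {
        String bare = "\u0631\u062C\u0644";                   // r-j-l without marks
        String man = "\u0631\u064E\u062C\u064F\u0644\u064C";  // rajulun (a man)
        String walked = "\u0631\u064E\u062C\u064E\u0644\u064E"; // rajala (walked on foot)
        System.out.println(matches(bare, man));    // prints true
        System.out.println(matches(bare, walked)); // prints true
        System.out.println(matches(walked, man));  // prints false
    }
}
```

In an index you would typically store the stripped form for matching and keep the original text around so that exact vocalised hits can be ranked first.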

I hope this helps.

A.Rashad