In my project I work with text in general. I found that preprocessing can be very slow. So I would like to ask you if you know how to optimize my code. The flow is like this:
get HTML page -> (To plain text -> stemming -> remove stop words) -> further text processing
In brackets there are preprocessing steps. The application runs in about 10.265s, but preprocessing takes 9.18s! This is time for preprocessing 50 HTML pages (excluding downloading).
I use HtmlAgilityPack library to convert HTML into plain text. This is quite fast. It takes 2.5ms to convert 1 document, so it's relatively OK.
Problem comes later. Stemming one document takes up to 120ms. Unfortunately those HTML pages are in Polish. There no exists stemmer for Polish language written in C#. I know only 2 free to use written in Java: stempel and morfologic. I precompiled stempel.jar to stempel.dll with help of IKVM software. So there is nothing more to do.
Eliminating stop words takes also a lot of time (~70ms for 1 doc). It is done like this:
result = Regex.Replace(text.ToLower(), @"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])", " ");
while (stopwords.MoveNext())
{
string stopword = stopwords.Current.ToString();
result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");
}
return result;
First i remove all numbers, special characters, 1- and 2-letter words. Then in loop I remove stop words. There are about 270 stop words.
Is it possible to make it faster?
Edit:
What I want to do is remove everything which is not a word longer than 2 letters. So I want to get out all special chars (including '.', ',', '?', '!', etc.) numbers, stop words. I need only pure words that I can use for Data Mining.