Hi,

In my project, I adopted the Aho-Corasick algorithm to implement a message-filtering module on the server side; the messages the server receives are multibyte-character strings. After several tests I found that the bottleneck is the conversion between multibyte string and Unicode wstring. What I use now is the pair mbstowcs_s and wcstombs_s, which takes nearly 95% of the module's whole running time. I have also tried MultiByteToWideChar/WideCharToMultiByte, with exactly the same result. So I wonder whether there is a more efficient way to do this job? My project is built in VS2005, and the strings to be converted will contain Chinese characters. Many thanks.

A: 

Deprecated (I believe), but you could always use the non-safe versions (mbstowcs and wcstombs). I'm not sure whether this will bring a marked improvement, though. Alternatively, if your character set is limited (a-z, 0-9, for instance), you could always do it manually with a lookup table..?

acron
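The lookup-table idea might be sketched like this (a minimal illustration only; the identity mapping below is a placeholder for whatever single-byte code page actually applies, and it cannot cover a multibyte charset like Chinese):

```cpp
#include <string>

// Hypothetical table for a single-byte code page: one wchar_t per byte value.
// A real table would fill the upper 128 entries with the code page's
// actual characters; the identity mapping here is just a placeholder.
static wchar_t g_table[256];

void init_table() {
    for (int i = 0; i < 256; ++i)
        g_table[i] = static_cast<wchar_t>(i);  // placeholder mapping
}

// Convert by plain table lookup: one array index per input byte,
// no code-page machinery involved.
std::wstring to_wide(const std::string& s) {
    std::wstring out;
    out.reserve(s.size());
    for (size_t i = 0; i < s.size(); ++i)
        out.push_back(g_table[static_cast<unsigned char>(s[i])]);
    return out;
}
```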
Unfortunately, the character set must support many other characters, such as Chinese. I also tried mbstowcs; the result is just the same.
Avalon
A: 

Perhaps you can reduce the amount of calls to MultiByteToWideChar?

Alex
When checking one message, it calls MultiByteToWideChar just once, so the number of calls cannot be reduced any further.
Avalon
Can you combine multiple messages into one, larger buffer, then call it on that?
Alex
The logic requires each message to be filtered as soon as possible.
Avalon
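If some batching were acceptable, Alex's suggestion might look like the sketch below (the '\n' delimiter is an assumption for illustration, and the portable std::mbstowcs stands in for MultiByteToWideChar so the idea stays platform-neutral):

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Combine several pending messages into one buffer (separated by an
// assumed '\n' delimiter), then pay the conversion overhead once for
// the whole batch instead of once per message.
std::wstring convert_batch(const std::vector<std::string>& msgs) {
    std::string joined;
    for (size_t i = 0; i < msgs.size(); ++i) {
        joined += msgs[i];
        if (i + 1 < msgs.size()) joined += '\n';
    }
    if (joined.empty()) return std::wstring();

    std::wstring wide(joined.size(), L'\0');   // worst-case output size
    size_t n = std::mbstowcs(&wide[0], joined.c_str(), wide.size());
    if (n == static_cast<size_t>(-1))
        return std::wstring();                 // invalid multibyte sequence
    wide.resize(n);
    return wide;
}
```

Whether this helps depends on whether the per-call overhead (rather than the per-byte work) dominates; as the thread notes, it also conflicts with filtering each message immediately.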
A: 

You could also probably adapt Aho-Corasick to work directly on multibyte strings.

Avi
The original version of AC is for ASCII; I then extended it to support wide characters. Because a multibyte character may occupy one byte or two, and AC depends on a state machine, I'm not sure how to change the algorithm to work on that.
Avalon
Right. I meant that you could build knowledge of each character's width into the state machine, so that the internal trie and the state machine know how many bytes to consume for each character. Admittedly, this isn't trivial.
Avi
Thanks, I have modified the original code to support multibyte characters with a trick, and now it works well.
Avalon
Thanks, @Avalon. It's good to know for sure that it is possible to modify AC like this.
Avi
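One way the byte-oriented automaton discussed above could look is sketched below (this is an illustration, not Avalon's actual code: the alphabet is raw bytes, so encoded patterns match directly on the byte stream with no conversion; the flat 256-way child arrays trade memory for speed, and for a DBCS encoding such as GBK a reported match could in principle start mid-character, so a real filter might need an extra alignment check):

```cpp
#include <queue>
#include <string>
#include <vector>

// Aho-Corasick over raw bytes (alphabet 0..255). Patterns are inserted
// as their encoded byte sequences; matching never needs to know where
// character boundaries fall.
struct ByteAC {
    struct Node {
        int next[256];   // goto function (becomes total after build())
        int fail;        // failure link
        bool hit;        // some pattern ends at or above this state
        Node() : fail(0), hit(false) {
            for (int i = 0; i < 256; ++i) next[i] = -1;
        }
    };
    std::vector<Node> t;
    ByteAC() : t(1) {}

    void add(const std::string& pat) {
        int cur = 0;
        for (size_t i = 0; i < pat.size(); ++i) {
            unsigned char c = pat[i];
            if (t[cur].next[c] == -1) {
                t[cur].next[c] = static_cast<int>(t.size());
                t.push_back(Node());
            }
            cur = t[cur].next[c];
        }
        t[cur].hit = true;
    }

    // BFS over the trie, filling failure links and making 'next' total.
    void build() {
        std::queue<int> q;
        for (int c = 0; c < 256; ++c) {
            int v = t[0].next[c];
            if (v == -1) t[0].next[c] = 0;
            else { t[v].fail = 0; q.push(v); }
        }
        while (!q.empty()) {
            int u = q.front(); q.pop();
            if (t[t[u].fail].hit) t[u].hit = true;  // propagate matches
            for (int c = 0; c < 256; ++c) {
                int v = t[u].next[c];
                if (v == -1) t[u].next[c] = t[t[u].fail].next[c];
                else { t[v].fail = t[t[u].fail].next[c]; q.push(v); }
            }
        }
    }

    bool contains(const std::string& text) const {
        int cur = 0;
        for (size_t i = 0; i < text.size(); ++i) {
            cur = t[cur].next[static_cast<unsigned char>(text[i])];
            if (t[cur].hit) return true;
        }
        return false;
    }
};
```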
A: 

There are a number of possibilities.

Firstly, what do you mean by "multi-byte character"? Do you mean UTF8 or an ISO DBCS system?

If you look at the definitions of UTF8 and UTF16 there is scope to do a highly optimised conversion, ripping out the "x" bits and reformatting them. See for example http://www.faqs.org/rfcs/rfc2044.html, which talks about UTF8<==>UTF32. Adjusting for UTF16 would be simple.

The second option might be to work entirely in UTF16. Render your Web page (or UI Dialog or whatever) in UTF16 and get the user input that way.

If all else fails, there are other string-matching algorithms than Aho-Corasick. Possibly look for an algorithm that works with your original encoding.

[Added 29-Jan-2010] See http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt for more on conversions, including two C implementations of mbtowc() and wctomb(). These are designed to work with arbitrarily large wchar_ts. If you just have 16-bit wchar_ts then you can simplify it a lot.

These would be much faster than the generic (code-page-sensitive) versions in the standard library.

Michael J
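The "rip out the x bits" conversion Michael describes might be sketched as follows (a BMP-only illustration, which covers Chinese; 4-byte sequences would need a surrogate pair, and no validation of continuation bytes is done here):

```cpp
#include <stdint.h>
#include <string>
#include <vector>

// Minimal UTF-8 -> UTF-16 decoder: mask off each byte's payload ("x")
// bits and shift them together. Handles 1-3 byte sequences only (the
// BMP) and assumes well-formed input -- a sketch, not production code.
std::vector<uint16_t> utf8_to_utf16(const std::string& in) {
    std::vector<uint16_t> out;
    size_t i = 0;
    while (i < in.size()) {
        unsigned char b = in[i];
        if (b < 0x80) {                        // 0xxxxxxx
            out.push_back(b);
            i += 1;
        } else if ((b & 0xE0) == 0xC0) {       // 110xxxxx 10xxxxxx
            out.push_back(((b & 0x1F) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {                               // 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(((b & 0x0F) << 12)
                          | ((in[i + 1] & 0x3F) << 6)
                          | (in[i + 2] & 0x3F));
            i += 3;
        }
    }
    return out;
}
```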
Thanks Michael, I'll give this mbtowc/wctomb a try.
Avalon