I want to match all individual words in a given string (the string is UTF-8 encoded) and then spellcheck each word. Everything works with my code as long as the text is English-only, but if there are some, say, German characters, my words are split in two at those characters. How can I match single words in text that contains both Latin and non-Latin characters?

What I do now is:

text.gsub(/[\w\']+/) { |word| "replacement" }

but for text containing "oooäuuu" this ends up as "replacementäreplacement", i.e. the German character is not treated as part of the word.
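
A minimal repro of what I'm seeing (assuming Ruby 1.9+ with a UTF-8 source encoding, where \w only matches ASCII word characters):

text = "oooäuuu"
puts text.gsub(/[\w\']+/) { |word| "replacement" }
# => "replacementäreplacement"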

+2  A: 

According to the Pickaxe, the \w character class is exactly equivalent to [A-Za-z0-9_], which obviously won't include accented characters. Depending on your locale, you may find the POSIX class [:alpha:] to be what you want (I think you would use /[[:alpha:]']+/, but I may be wrong about the exact formatting of the regexp there).
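
For instance, a quick sketch (the sample text is made up; assumes Ruby 1.9+ with a UTF-8 source encoding, where [[:alpha:]] is Unicode-aware):

text = "oooäuuu and don't"
puts text.gsub(/[[:alpha:]']+/) { |word| "<#{word}>" }
# => "<oooäuuu> <and> <don't>"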

Chris
Looks OK; you don't need to escape the apostrophe, though.
Tim Pietzcker
D'oh, of course I don't. Thanks for the sanity-check.
Chris
A: 

What you need is an English/German/... tokenizer. Tokenization in natural language is not as simple as looking for whitespace. For example, in the sentence "Los Angeles is a beautiful city", "Los Angeles" should be treated as one word, not two, if you want to find it in a dictionary.

You also need to deal with punctuation (.;?!:), abbreviations, separators, quotes, clitic contractions, etc.

Tokenization in languages like Chinese or Japanese is a lot harder.

There's a simple English tokenization Perl script in "Speech and Language Processing" by Jurafsky and Martin, chapter 3.9.1.

anno
A: 

It looks like this works pretty well:

/[[:word:]]+/
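
A quick check on the example from the question (assumes Ruby 1.9+, where [[:word:]] is the Unicode-aware counterpart of \w):

text = "oooäuuu"
puts text.gsub(/[[:word:]]+/) { |word| "replacement" }
# => "replacement"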

That was just too easy ;)

Hubert Łępicki