I want to match all individual words in a given string (the string is UTF-8 encoded) and then spellcheck each word. Everything works with my code as long as the text is English-only, but if there are some, say, German characters, my words are split in two at those characters. How can I match single words in text that contains both Latin and non-Latin characters?

What I do now is:

text.gsub(/[\w\']+/) { |word| "replacement" }

but for text containing "oooäuuu" this ends up as "replacementäreplacement", i.e. the German character is not treated as part of the word.
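
A minimal repro of what I'm seeing (assuming Ruby 1.9+ with a UTF-8 source encoding, where \w only matches ASCII word characters):

text = "oooäuuu"
puts text.gsub(/[\w\']+/) { |word| "replacement" }
# => "replacementäreplacement"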

+2  A: 

According to the Pickaxe, the \w character class is exactly equivalent to [A-Za-z0-9_], which obviously won't include accented characters. Depending on your locale, you may find the POSIX class [:alpha:] to be what you want (I think you would use /[[:alpha:]']+/, but I may be wrong about the exact formatting of the regexp there).
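
For instance, a quick sketch (the sample text is made up; assumes Ruby 1.9+ with a UTF-8 source encoding, where [[:alpha:]] is Unicode-aware):

text = "oooäuuu and don't"
puts text.gsub(/[[:alpha:]']+/) { |word| "<#{word}>" }
# => "<oooäuuu> <and> <don't>"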

Chris
Looks OK; you don't need to escape the apostrophe, though.
Tim Pietzcker
D'oh, of course I don't. Thanks for the sanity-check.
Chris
A: 

What you need is an English/German/... tokenizer. Tokenization in natural language is not as simple as looking for whitespace. For example, in the sentence "Los Angeles is a beautiful city", "Los Angeles" should be treated as one word, not two, if you want to find it in a dictionary.

You also need to deal with punctuation (.;?!:), abbreviations, separators, quotes, clitic contractions, etc.

Tokenization in languages like Chinese or Japanese is a lot harder.

There's a simple English tokenization Perl script in "Speech and Language Processing" by Jurafsky and Martin, chapter 3.9.1.

anno
A: 

It looks like this works pretty well:

/[[:word:]]+/
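
A quick check on the example from the question (assumes Ruby 1.9+, where [[:word:]] is the Unicode-aware counterpart of \w):

text = "oooäuuu"
puts text.gsub(/[[:word:]]+/) { |word| "replacement" }
# => "replacement"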

That was just too easy ;)

Hubert Łępicki