ansaurus

Question

Regular expression to strip everything but words

Answer 1

A:

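Try \b\w+\b to match whole words. (With \w* the pattern can also match the empty string at a word boundary, so + is the safer quantifier.)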

ennuikiller 2010-08-21 19:48:46
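A minimal sketch of the word-boundary approach, assuming Ruby (the language used elsewhere in this thread) and reusing the sample string from the second answer. Note that \w also matches digits and underscores, so this is slightly broader than "letters only".

```ruby
# Sample input borrowed from the thread; any string works.
text = "asd, < er >w , we., wZr,fq."

# \w+ requires at least one word character, so scan never yields "".
words = text.scan(/\b\w+\b/)
p words
```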

Answer 2

+1 A:

I think what fits you best is splitting the string into words. In that case, the String#split method would be the better option. It accepts a regexp that matches the separators, and splits the source string into array elements at each match.

In your case, the separator should be "one or more non-alphabetic characters". The alphabetic character class is denoted by [:alpha:], so here's an example of what you need:

irb(main):001:0> "asd, < er >w , we., wZr,fq.".split(/[^[:alpha:]]+/)
=> ["asd", "er", "w", "we", "wZr", "fq"]

You may further filter the result by intersecting the resulting array with an array that contains only English words:

irb(main):001:0> ["asd", "er", "w", "we", "wZr", "fq"] & ["we","you","me"]
=> ["we"]

Pavel Shved 2010-08-21 19:53:01

wow, this looks cool. working on it now

Sam 2010-08-21 19:54:24

OK, that worked quite well, but I'm getting a ton of empty strings in the array

Sam 2010-08-21 19:56:00

@Sam, perhaps you could find helpful information in the `split` documentation? It should explain the situations in which empty strings appear.

Pavel Shved 2010-08-21 19:59:21
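For what it's worth, here is a rough sketch of one common way empty strings show up, and two ways around it. The input string is a made-up example, and whether this matches Sam's actual data is an assumption: when the string starts with a separator, split returns an empty first element (trailing empty elements are dropped by default).

```ruby
# Hypothetical input that begins with non-alphabetic characters.
text = "  hello, world!"

# The leading separator produces an empty first element.
parts = text.split(/[^[:alpha:]]+/)
p parts

# Option 1: drop the empty strings after splitting.
cleaned = parts.reject(&:empty?)
p cleaned

# Option 2: match the words directly instead of splitting on separators.
scanned = text.scan(/[[:alpha:]]+/)
p scanned
```

Dropping the + from the pattern is another possible cause, since every single separator character then splits on its own and adjacent separators leave empty strings between them.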

Thanks Pavel, I will

Sam 2010-08-21 19:59:41

If you want to retrieve words in other languages (with characters like 'ä', 'ß', 'ǹ', etc.), use this regex in your split call: /[^\p{Alpha}]+/

Ragmaanir 2010-08-21 22:13:44
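A small sketch of that pattern on non-English text, assuming a UTF-8 source file and string (the default in current Ruby versions). The sample string is invented for illustration.

```ruby
# Hypothetical sample mixing German, Chinese and English.
text = "Straße, über; 你好 world"

# \p{Alpha} matches any Unicode alphabetic character, so only
# punctuation and whitespace act as separators here.
unicode_words = text.split(/[^\p{Alpha}]+/)
p unicode_words
```

Keep in mind that Chinese text is usually written without spaces between words, so this gives runs of characters between punctuation rather than individual words.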

Really nice! I will see if that works for Chinese too!

Sam 2010-08-22 02:49:51
