I'm not sure how to use regular expressions in a function so that I could grab all the words in a sentence starting with a particular letter. I know that I can do:

word =~ /^#{letter}/ 

to check if the word starts with the letter, but how do I go from word to word? Do I need to convert the string to an array and then iterate through each word, or is there a faster way using regex? I'm using Ruby, so that would look like:

matching_words = Array.new
sentence.split(" ").each do |word|
  matching_words.push(word) if word =~ /^#{letter}/
end
A: 
/\ba[a-z]*\b/i

will match any word starting with 'a'.

The \b indicates a word boundary - we want to only match starting from the beginning of a word, after all.

Then there's the character we want our word to start with.

Then we match as many letter characters as possible, followed by another word boundary.
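In Ruby, a rough way to try this pattern is String#scan, which returns every match in one call. The sample sentence below is made up for illustration:

```ruby
# Sketch: collect every word that starts with "a", ignoring case.
# The sentence is an invented example, not from the question.
sentence = "An apple a day keeps anxiety at bay"

# \b anchors the match at the start of a word, [a-z]* takes the remaining letters.
a_words = sentence.scan(/\ba[a-z]*\b/i)

p a_words  # => ["An", "apple", "a", "anxiety", "at"]
```

Note that "day" and "bay" are skipped because their "a" is not at a word boundary.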

Anon.
A: 

To match all words starting with t, use:

\bt\w+

That will match test but not footest; \b means "word boundary".
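One detail worth checking before relying on this: \w+ needs at least one more word character after the t, so a lone "t" is not matched, and \w also accepts digits and underscores. A small Ruby comparison, using an invented sample string:

```ruby
# Sketch comparing three variants of the pattern on a made-up string.
text = "t test footest t_001 tea"

plus_matches = text.scan(/\bt\w+/)        # needs at least two characters
star_matches = text.scan(/\bt\w*/)        # also accepts the one-letter word "t"
letters_only = text.scan(/\bt[a-z]*\b/)   # letters only, so "t_001" is rejected

p plus_matches   # => ["test", "t_001", "tea"]
p star_matches   # => ["t", "test", "t_001", "tea"]
p letters_only   # => ["t", "test", "tea"]
```

Which variant is right depends on what should count as a word.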

Rubens Farias
This depends on whether you class something like "t_001" as a 'word' or not.
Anon.
Yeah, I was about to write the same to you =) The OP needs to be more precise.
Rubens Farias
A: 

You can use \b. It matches word boundaries--the invisible spot just before and after a word. (You can't see them, but oh they're there!) Here's the regex:

/\b(a\w*)\b/

The \w matches a word character: a letter, a digit, or an underscore.
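To see what \w does and does not accept, here is a quick Ruby sketch. The sample strings are invented, and String#[] with a regex and a group number returns that capture, or nil when nothing matches:

```ruby
# Sketch: what the capture group picks up for a few made-up inputs.
pattern = /\b(a\w*)\b/
samples = ["apple", "a1", "a_b", "a-b", "banana"]

# For each sample, take the text captured by group 1 (nil if there is no match).
captured = samples.map { |s| s[pattern, 1] }

p captured  # => ["apple", "a1", "a_b", "a", nil]
```

The hyphen in "a-b" ends the word, while the digit and underscore do not. "banana" yields nil because none of its a's sit at a word boundary.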

You can see me testing it here: http://rubular.com/regexes/13347

yjerem
interesting site...
Rubens Farias
A: 

Similar to Anon.'s answer:

/\b(a\w*)/g

and then read the results with (usually) $n, where n is the number of the parenthesized group. Many libraries return /g results as an array for each set of parentheses, so in this case $1 would give an array of all the matching words. Double-check how your library returns matches like this; sadly, there's a lot of variation in what global searches return.

As for \w vs. [a-zA-Z]: you can sometimes get faster execution from the built-in classes, since the engine may have an optimized path for them. Keep in mind that \w also matches digits and underscores, so the two are not equivalent.

The /g at the end makes it a "global" search, so it'll find more than one match. It's still restricted by line in some languages / libraries, though, so if you wish to check an entire file you'll sometimes need /gm to make it multi-line.
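For Ruby specifically, regex literals have no g flag, and String#scan plays the role of the global search. Here is a minimal sketch of how the capture group affects what comes back (the sample text is invented):

```ruby
# Sketch: "global" matching in Ruby with a capture group.
text = "an apple\nand a pear"
pattern = /\b(a\w*)/

first_only = text[pattern, 1]             # only the first match's group
all_nested = text.scan(pattern)           # one inner array per match
all_flat   = text.scan(pattern).flatten   # plain list of words

p first_only  # => "an"
p all_nested  # => [["an"], ["apple"], ["and"], ["a"]]
p all_flat    # => ["an", "apple", "and", "a"]
```

scan walks the whole string, including past the newline, so no extra multi-line flag is needed for this pattern. Dropping the parentheses makes scan return the flat list directly.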

If you want to remove results, like your title (but not question) suggests, try:

    /\ba\w*//g

which does a search-and-replace in languages with sed- or Perl-style substitution syntax (/<search>/<replacement>/). Sometimes you need an "s" at the front. Depends on the language / library. In Ruby's case, use:

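string.gsub(/\ba\w*\b/, "")

to delete each matching word while leaving the surrounding non-word characters (spaces, punctuation) in place; pass replacement text instead of "" to substitute something else. Since \b is zero-width, there is nothing to capture and put back. Use gsub for all matches and sub for only the first.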
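A short sketch of the gsub / sub difference, on an invented sample string. It uses the pattern without capture groups, because \b is zero-width and a group wrapped around it only ever captures an empty string:

```ruby
# Sketch: removing or replacing words that start with "a".
text = "an apple a day"
pattern = /\ba\w*\b/

all_removed   = text.gsub(pattern, "")    # every match deleted, spaces kept
first_removed = text.sub(pattern, "")     # only the first match deleted
replaced      = text.gsub(pattern, "X")   # every match swapped for "X"

# A group around \b matches, but what it captures is empty.
boundary_capture = text[/(\b)a\w*(\b)/, 1]

p all_removed       # => "   day"
p first_removed     # => " apple a day"
p replaced          # => "X X X day"
p boundary_capture  # => ""
```

If the leftover spaces are unwanted, something like all_removed.squeeze(" ").strip tidies them up.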

Groxx
A: 

Scan may be a good tool for this:

#!/usr/bin/ruby1.8

s = "I think Paris in the spring is a beautiful place"
p s.scan(/\b[it][[:alpha:]]*/i)
# => ["I", "think", "in", "the", "is"]
  • \b means "word boundary".
  • [:alpha:] means any upper- or lowercase letter (a-z).
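To tie this back to the question's #{letter} interpolation, the scan call could be wrapped in a small method. This is only a sketch: the method name words_starting_with is made up, and Regexp.escape is there in case the caller passes a character that is special in a regex:

```ruby
# Hypothetical helper: return the words in a sentence that start with
# the given letter, ignoring case.
def words_starting_with(sentence, letter)
  # Escape the letter so input like "." is treated literally.
  pattern = /\b#{Regexp.escape(letter)}[[:alpha:]]*/i
  sentence.scan(pattern)
end

s = "I think Paris in the spring is a beautiful place"
p words_starting_with(s, "t")  # => ["think", "the"]
p words_starting_with(s, "p")  # => ["Paris", "place"]
```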
Wayne Conrad
I think I'd prefer `s.scan(/\b[it]\w*\b/)`, but that's a minor difference. However, shouldn't the output array be `["I", "think", "in", "the", "is"]`?
kejadlen
@kejadlen, Thanks. I had changed the code but then forgot to paste the new output into it. I think it may be better your way, too.
Wayne Conrad
A: 

Personally, I think regex is overkill for this application; a simple select is more than capable of solving this particular problem.

"this is a test".split(' ').select{ |word| word[0,1] == 't' } 

result => ["this", "test"]

Or, if you are determined to use regex, go with grep:

"this is a test".split(' ').grep(/^t/)

result => ["this", "test"]
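A couple of variations on the same idea, as a sketch on an invented sample sentence. String#start_with? reads a little more directly than word[0,1], and the match can be made case-insensitive either by downcasing or with the i flag:

```ruby
# Sketch: split-based filtering, with and without case sensitivity.
words = "This is a test".split

exact    = words.select { |w| w.start_with?("t") }           # case-sensitive
any_case = words.select { |w| w.downcase.start_with?("t") }  # ignores case
grepped  = words.grep(/\At/i)                                # regex equivalent

p exact     # => ["test"]
p any_case  # => ["This", "test"]
p grepped   # => ["This", "test"]
```

One caveat: split only breaks on whitespace, so punctuation stays attached to the words, and a word with a leading quote or bracket would be missed.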

Hope this helps.

roja