All the solutions have problems with catching the bad words if the case does not match. The regex solution is easiest to fix by adding the ignore-case flag:
badex = /\b(#{bad.split.join('|')})\b/i
In addition, using "String".include?(" String ") will lead to boundary problems with the first and last words in the string or strings where the target words have punctuation or are hyphenated. Testing for those situations will result in a lot of other code being needed. Because of that I think the regex solution is the best one. It is not the fastest but it is going to be more flexible right out of the box, and, if the other algorithms are tweaked to handle case folding and compound-words the regex solution might pull ahead.
#!/usr/bin/ruby
require 'benchmark'
bad = 'foo bar baz comparison'
badex = /\b(#{bad.split.join('|')})\b/i
str = "What's the fasted way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?" * 10
n = 10_000
Benchmark.bm(20) do |x|
x.report('regex:') do
n.times { str.gsub(badex,'').gsub(' ',' ') }
end
x.report('regex with squeeze:') do
n.times{ str.gsub(badex,'').squeeze(' ') }
end
x.report('array subtraction') do
n.times { (str.split(' ') - bad.split(' ')).join(' ') }
end
end
I made the str variable a lot longer, to make the routines work a bit harder.
user system total real
regex: 0.740000 0.010000 0.750000 ( 0.752846)
regex with squeeze: 0.570000 0.000000 0.570000 ( 0.581304)
array subtraction 1.430000 0.010000 1.440000 ( 1.449578)
Doh!, I'm too used to how other languages handle their benchmarks. Now I got it working and looking better!
Just a little comment about what it looks like the OP is trying to do: Black-listed word removal is easy to fool, and a pain to keep maintained. L33t-sp34k makes it trivial to sneek words through. Depending on the application, people will consider it a game to find ways to push offensive words past the filtering. The best solution I found when I was asked to work on this, was to create a generator that would create all the variations on a word and dump them into a database where some process could check as soon as possible, rather than in real time. A million small strings being checked can take a while if you are searching through a long list of offensive words; I'm sure we could come up with quite a list of things that someone would find offensive, but that's an exercise for a different day.
I haven't seen anything similar in Ruby to Perl's regex-assembler (http://search.cpan.org/dist/Regexp-Assemble/Assemble.pm), but that was a good way to go after this sort of problem. You can pass an array of words, plus options for case-folding and word-boundaries, and it will spit out a regex pattern that will match all the words, with their commonalities considered to result in the smallest pattern that will match all words in the list. The problem after that is locating which word in the original string matched the hits found by the pattern, so they can be removed. Differences in word case and hits within compound-words makes that replacement more interesting.
And we won't even go into words that are benign or offensive depending on the context.