There is a list of banned words (or strings, to be more general) and another list with, let's say, users' mails. I would like to remove all banned words from all mails.

trivial example:

foreach (string word in wordsList)
{
   // Strings are immutable: Replace returns a new string, so the result
   // must be assigned back (and a foreach variable can't be reassigned).
   for (int i = 0; i < mailList.Count; i++)
   {
      mailList[i] = mailList[i].Replace(word, String.Empty);
   }
}

How can I improve this algorithm?

Edit:

Thanks for the advice. I voted a few answers up, but I didn't mark any as the answer since this was more of a discussion than a single solution. BTW, some people confused banned words with bad words. In my case I don't have to bother with recognizing 'sh1t' or anything like that ;-)

+5  A: 

Simple approaches to profanity filtering won't work, and complex approaches don't work either, for the most part.

What happens when you get a word like 'password' and you want to filter out 'ass'? What happens when some clever person writes 'a$$' instead? The intent is still clear, right?

See http://stackoverflow.com/questions/273516/how-do-you-implement-a-good-profanity-filter for extensive discussion.

Steve Townsend
"What happens when you get a work like 'password' and you want to filter out 'ass'?" - Then your algorithm sucks.
Brian R. Bondy
"What happens when some clever person writes 'a$$' instead - the intent is still clear, right?" - Very often reducing a problem has value, a 100% fix to a problem is not always needed.
Brian R. Bondy
@Brian - agreed, I am reading between the lines here. If OP just wants to build 'best effort' code, then tweaks to string replacement are fine. If he/she's signed up to build a reliable profanity filter, then scope of effort needs to be clear, or he/she could be in trouble when it takes a while longer than expected.
Steve Townsend
+2  A: 

You'll get the best performance by drawing up a finite state machine (FSM) (or generating one) and then parsing your input one character at a time, walking through the states.

You can do this pretty easily with a function that takes your next input character and your current state and returns the next state; you produce output as you walk through the mail message's characters. You can draw the FSM out on paper.

Alternatively you could look into the Windows Workflow Foundation: State Machine Workflows.

In that way you only need to walk each message a single time.
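A minimal sketch of that single-pass idea using a plain trie (the class and method names here are invented for illustration, and a full Aho-Corasick automaton would add failure links so the scan never backtracks). Note it matches substrings, just like the original Replace loop:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// A trie of banned words; walking it lets each message be scanned once.
class TrieNode
{
    public Dictionary<char, TrieNode> Next = new Dictionary<char, TrieNode>();
    public bool Terminal;   // true if a banned word ends at this node
}

class BannedWordScanner
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            TrieNode child;
            if (!node.Next.TryGetValue(c, out child))
                node.Next[c] = child = new TrieNode();
            node = child;
        }
        node.Terminal = true;
    }

    // Masks every banned occurrence with asterisks (substring semantics).
    public string Censor(string text)
    {
        var sb = new StringBuilder(text);
        for (int i = 0; i < text.Length; i++)
        {
            TrieNode node = root;
            int j = i, lastMatchEnd = -1;
            while (j < text.Length && node.Next.TryGetValue(text[j], out node))
            {
                j++;
                if (node.Terminal) lastMatchEnd = j;   // longest match so far
            }
            if (lastMatchEnd > 0)
                for (int k = i; k < lastMatchEnd; k++) sb[k] = '*';
        }
        return sb.ToString();
    }
}

class Program
{
    static void Main()
    {
        var scanner = new BannedWordScanner();
        scanner.Add("bad");
        Console.WriteLine(scanner.Censor("this is some bad text")); // this is some *** text
    }
}
```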

Brian R. Bondy
Unless I'm misunderstanding your suggestion, I feel like using a Windows Workflow State Machine on this problem to parse a string character by character is a bit overkill.
Michael Petito
That depends on what the software is. If the person is trying to build a profanity filtering software, then I wouldn't think so.
Brian R. Bondy
+2  A: 

You could use RegEx to make things a little cleaner:

var bannedWords = @"\b(this|is|the|list|of|banned|words)\b";

foreach (string mail in mailList)
{
    var clean = Regex.Replace(mail, bannedWords, "", RegexOptions.IgnoreCase);
}

Even that, though, is far from perfect since people will always figure out a way around any type of filter.

Justin Niessner
This isn't removing banned words, it's removing banned substrings. For example, this would change the word "often" in a string to "ten".
Michael Petito
@Michael - Obviously my RegEx-Fu isn't up to snuff. I added what I thought was the right way to limit word boundaries. Any corrections?
Justin Niessner
That looks better, thanks. Though I'll mention again (as below) that it probably isn't ideal to make a Regex like this if your list is more than a few tens of words.
Michael Petito
A: 

You might consider using Regex instead of simple string matches, to avoid replacing partial content within words. A Regex lets you ensure that you only match complete words. You could use a pattern like this:

"\bBADWORD\b"

Also, you may want to iterate over the mailList on the outside, and the word list on the inner loop.

Andrew Barber
+1  A: 

Constructing a regular expression from the words (word1|word2|word3|...) and using this instead of the outer loop might be faster, since then, every e-mail only needs to be parsed once. In addition, using regular expressions would enable you to remove only "complete words" by using the word boundary markers (\b(word1|word2|word3|...)\b).

In general, I don't think you will find a solution which is orders of magnitude faster than your current one: You will have to loop through all mails and you will have to search for all the words, there's no easy way around that.
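A hedged sketch of that construction (the word list and the `BuildFilter` helper are invented for illustration): `Regex.Escape` guards against metacharacters in the banned words, and `\b` anchors restrict matches to complete words:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    public static Regex BuildFilter(string[] bannedWords)
    {
        // Escape each word in case it contains regex metacharacters,
        // join with '|', and wrap in \b anchors for whole-word matching.
        string pattern = @"\b(?:"
            + string.Join("|", bannedWords.Select(Regex.Escape))
            + @")\b";
        return new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }

    static void Main()
    {
        Regex filter = BuildFilter(new[] { "list", "banned" });
        Console.WriteLine(filter.Replace("A list of banned words.", "")); // A  of  words.
    }
}
```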

Heinzi
+1  A: 

A general algorithm would be to:

  1. Generate a list of tokens from the input string (i.e. by treating whitespace as a token separator)
  2. Compare each token against a list of banned words
  3. Replace matched tokens

A regular expression is convenient for identifying tokens, and a HashSet would provide quick lookups for your list of banned words. There is an overloaded Replace method on the Regex class that takes a function, where you could control the replace behavior based on your lookup.

HashSet<string> BannedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase)
{
    "bad",
};

string Input = "this is some bad text.";

string Output = Regex.Replace(Input, @"\b\w+\b",
    (Match m) => BannedWords.Contains(m.Value) ? new string('x', m.Value.Length) : m.Value);
Michael Petito
This doesn't make use of the power of Regex though. It just abstracts the replace loop away. See [Justin's answer](http://stackoverflow.com/questions/3864678/how-to-cut-specified-words-from-string/3864743#3864743) for what I mean.
Ahmad Mageed
@Ahmad Mageed: I'm using a simple (and fast) regular expression to produce a stream of tokens from a string -- what more power do I need? I also don't think it's ideal (or performant) to take hundreds of banned words and build a large regular expression as in Justin's solution.
Michael Petito
A: 

Wouldn't it be easier (and more efficient) to simply redact them by changing all their characters to * or something? That way no large string needs to be resized or moved around, and the recipients are made more aware of what happened, rather than getting nonsensical sentences with missing words.
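The in-place idea can be sketched with a char buffer (the `Redact` helper and its case-insensitive `IndexOf` loop are illustrative choices, not from the original answer):

```csharp
using System;

class Program
{
    // Mask a word in place in a char buffer: the length never changes,
    // so no characters after the match have to shift.
    public static string Redact(string text, string word)
    {
        char[] buf = text.ToCharArray();
        int idx = 0;
        while ((idx = text.IndexOf(word, idx, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            for (int i = 0; i < word.Length; i++)
                buf[idx + i] = '*';
            idx += word.Length;
        }
        return new string(buf);
    }

    static void Main()
    {
        Console.WriteLine(Redact("a bad, BAD idea", "bad")); // a ***, *** idea
    }
}
```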

T.E.D.
Why would this be more efficient?
Heinzi
@Heinzi - Edited to include that info. Basically, Replace will have to move the data after the replaced string around, unless the replacement has exactly the same number of characters.
T.E.D.
@T.E.D.: `Replace` will create a completely new String instance anyway, since Strings are immutable. I do agree with your usability point, though!
Heinzi
@Heinzi - Well, OK. I'm not a C# expert, but you'd probably want to use something *mutable* here and just change the one word in question so you don't have to copy or move a ton of data around.
T.E.D.
+1  A: 

Replacing it with * is annoying, but less annoying than something that removes the context of your intention by removing the word and leaving a malformed sentence. In discussing the Battle of Hastings, I'd be irritated if I saw William given the title "Grand ******* of Normandy", but at least I'd know I was playing in the small-kids playground, while his having the title of "Grand of Normandy" just looks like a mistake, or (worse) I might think that was actually his title.

Don't try replacing words with more innocuous words unless it's funny. People get the joke on 4chan, but Yahoo groups about history confused people because discussions referred to the "medireview" and "mediareview" periods: "eval" (not profanity, but used in some XSS attacks that Yahoo had been hit by) was being replaced with "review" inside "medieval" and "mediaeval" (apparently, "medireview" is the American spelling of "mediareview"!).

Jon Hanna
This is pretty much the same as my answer, and was submitted at roughly the same time. Whenever that happens, my general policy is that the submitter is clearly a genius, and deserves a +1. :-)
T.E.D.
A: 

Well, you certainly don't want to make the clbuttic mistake of a naive string.Replace() to do it. The regex solution could work, although you'd either be iterating or using the pipe alternator (and I don't know if/how much that would slow your operation down, particularly for a large list of banned words). You could always just...not do it, since it's entirely futile no matter what: there are ways to make your intended words quite clear even without using the exact letters.

That, and it's ridiculous to have a list of words that "people find offensive" in the first place. There's someone who will be offended by pretty much any word.

/censorship is bullshit rant

+1  A: 

In some circumstances it is possible to improve it. Just for fun:

You can use a SortedList. If your mailing list item really is a mailing list (i.e. it has a delimiter like ";"), you can do as below:

First, calculate the running time of your current algorithm. Words: n items (each item has O(1) length). Mailing list: K items, each item of average length Z, each sub-item of average length Y, so the average number of sub-items per mailing list item is m = Z/Y.

Your algorithm takes O(n*K*Z), even in the best case with the Knuth (KMP) string-matching algorithm.

1. Sort the words list: O(n log n).

2.1. Use mailingListItem.Split(";".ToCharArray()) on each mailing list item: O(Z).
2.2. Sort the sub-items of each mailing list item: O(m * log m); total sorting takes O(K * Z) in the worst case, given that m log m << Z.

3. Use a merge algorithm to match the sorted banned words against each sorted mailing list item: O((m + n) * K).

The total time is O((m+n)*K + m*Z + n^2). Given that m << n, the total running time is O(n^2 + Z*K) in the worst case, which is smaller than O(n*K*Z) if n < K * Z (I think so).

So if performance is very, very important, you can do this.

SaeedAlg
A: 

I assume that you want to detect only complete words (separated by non-letter characters) and ignore banned words that appear as substrings of other words (like 'ass' inside 'password'). In that case you should build a HashSet of filter words, scan the text for words, and check each word for existence in the HashSet. If it is a filter word, build the resulting StringBuilder object without it (or with an equal number of asterisks).
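A rough sketch of that scan, assuming letters delimit words and masking matches with asterisks (the `Filter` and `Flush` helpers are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class Program
{
    public static string Filter(string text, HashSet<string> banned)
    {
        var result = new StringBuilder(text.Length);
        var word = new StringBuilder();
        // Scan once: letters accumulate into the current word; anything
        // else flushes it, checking the word against the banned set.
        foreach (char c in text)
        {
            if (char.IsLetter(c)) { word.Append(c); continue; }
            Flush(result, word, banned);
            result.Append(c);
        }
        Flush(result, word, banned);
        return result.ToString();
    }

    static void Flush(StringBuilder result, StringBuilder word, HashSet<string> banned)
    {
        if (word.Length == 0) return;
        string w = word.ToString();
        result.Append(banned.Contains(w) ? new string('*', w.Length) : w);
        word.Clear();
    }

    static void Main()
    {
        var banned = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "ass" };
        // "password" survives because only complete words are checked.
        Console.WriteLine(Filter("my password is ass", banned)); // my password is ***
    }
}
```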

Dialecticus
A: 

I had great results using this algorithm on codeproject.com; it worked better than brute-force text replacements.

kenny