I have an ASP.NET app that accepts user comments and stores them in a SQL database. I want to make sure that I weed out any "naughty" words so I can keep my app respectable. The problem is that I'm finding there are LOTS of these words. ;>

My question is, what's the most efficient way to do this processing? Should I have a table in SQL and write a stored proc that does the work? Should I do it with C# and Regex in memory on the web server? Are there other options? Has anyone else successfully done this kind of text scanning at scale? If so, what worked?

+7  A: 

It's a futile task. If people want to swear then they will start typing things like f uck and sh*t.

There's no substitute for effective moderation. Anything else is likely to leave you with clbuttic errors on your page.

I remember a quote from somewhere about technical solutions to social problems, but I can't source it right now.

Gareth
You make a great point, Gareth. I was naive to assume that people would just enter nice, regex-friendly naughty words... Thanks for the link to the site as well; it made me chuckle.
will
A: 

There are already some Perl modules out there to do all of that for you.

http://search.cpan.org/~abigail/Regexp-Common-2010010201/lib/Regexp/Common/profanity.pm
http://search.cpan.org/~tbone/Regexp-Common-profanity_us-2.2/lib/Regexp/Profanity/US.pm
http://search.cpan.org/~miyagawa/Plagger-0.7.17/lib/Plagger/Plugin/Filter/Profanity.pm

Eamorr
A: 

There are some things to consider here:

  • Do you want to be able to add or remove words from the blacklist later? If so, it makes sense to filter only when displaying the message and to store the original message unchanged.
  • Do you want to have a copy of the message later on (e.g. for legal reasons or customer support)? Then it also makes sense to keep the message unchanged in the database.

So I would keep the message in the database and filter it only when rendering it. To me it looks like the most efficient way to do that would be either to:

  1. Keep the blacklist in an indexed column (lowercase) in the database and return the comments through a stored procedure that filters them.
  2. Keep the blacklist, lowercased, in an in-memory data structure on the middle layer that allows efficient lookup (e.g. a Dictionary).

In both cases you would simply run through each comment and filter it. The latter method is easier to implement but means you have to keep the list in memory, which stops making sense when the blacklist is very large.
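As a rough illustration of option 2, here is a minimal sketch of filtering at render time against an in-memory lowercase set. It is written in Python only to keep it short; the same shape should map onto a C# `HashSet<string>`. The names `BLACKLIST` and `filter_comment`, the sample words, and the masking character are all made up for the example, and the stored comment is never modified.

```python
from itertools import groupby

# Hypothetical blacklist, kept lowercase so lookups are case-insensitive.
BLACKLIST = frozenset({"darn", "heck"})


def filter_comment(text, blacklist=BLACKLIST, mask="*"):
    """Return a copy of text with blacklisted whole words masked.

    Intended to run just before rendering; the original comment
    stays untouched in the database.
    """
    pieces = []
    # Split the text into alternating runs of letters and non-letters
    # without using a regex.
    for is_word, run in groupby(text, key=str.isalpha):
        chunk = "".join(run)
        if is_word and chunk.lower() in blacklist:
            chunk = mask * len(chunk)
        pieces.append(chunk)
    return "".join(pieces)


print(filter_comment("Well, darn it. Heck!"))
```

Because only whole words are looked up, a word such as "darned" passes through unchanged, and so does any deliberate misspelling, which is the limitation the other answers point out.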

(I actually see no point in using regex.)

steinar
Then again, I also agree with Gareth: you could just ignore this aspect and go for moderation.
steinar
+1  A: 

Scunthorpe Problem

One should be embar***ed to try to solve this in code.
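To make the failure mode concrete, here is a small sketch of what naive substring replacement does to innocent words. It is an illustration in Python, not anyone's production filter; `naive_censor` is a hypothetical helper.

```python
def naive_censor(text, bad_word, replacement):
    # Plain substring replacement: it also matches inside harmless words.
    return text.replace(bad_word, replacement)


print(naive_censor("a classic assassin", "ass", "butt"))
# prints "a clbuttic buttbuttin"
```

Matching whole words only avoids this particular mangling, but it does nothing about spaced-out or starred spellings.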

Loren Pechtel