For an enterprise application research project another person and I are working on, we are looking to remove certain content from the page to keep the posted messages universal (meaning not offensive and essentially anonymous). Right now we want to take a message that a user has posted to a message board and remove any type of name, name of a college or institution, and profanity (and, if possible, we would later like to remove business names as well).

Is there some database we can connect to and scrub our messages against, checking values in the database in order to recognize these?

+4  A: 

The question seems to imply an online database which would be queried during the processing of messages. Operational issues (reliability of such services, lag in response time, etc.) as well as completeness issues (the need to query multiple databases, because no single one will cover 100% of the project's lexical needs) render this online/real-time approach impractical. There are, however, many databases available for download that would allow you to build your own local database of "hot words".

A good place to start could be WordNet, where you'd likely use all of the "instance" words as words that should typically be removed from messages as you anonymize/cleanse them. (Maybe you'll also want to keep the "non-instance" words in a separate table/list of words "more likely to be OK".) This list alone could honorably support a "0.9" version of your application.
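Once you have exported such a list into a local database, the scrubbing itself can be quite simple. Here is a minimal sketch; the `hot_words` set is a hypothetical stand-in for the names you would compile from WordNet's instance entries:

```python
import re

# Hypothetical stand-in for names compiled from WordNet "instance" entries;
# in practice you would load these from your local database.
hot_words = {"mit", "george bush", "celtics"}

def scrub(message, hot_words, placeholder="[removed]"):
    """Replace any hot word/phrase in the message with a placeholder.

    Matching is case-insensitive and bounded at word edges, so "MIT"
    is caught but the "mit" inside "admit" is not.
    """
    # Longest phrases first, so multi-word names are replaced whole.
    for phrase in sorted(hot_words, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        message = re.sub(pattern, placeholder, message, flags=re.IGNORECASE)
    return message

print(scrub("I admit MIT beat the Celtics", hot_words))
# -> "I admit [removed] beat the [removed]"
```

For a large word list you would eventually want something faster than a loop of regexes (e.g. tokenizing the message and testing each token against a set), but this is enough to validate the approach.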

You'll eventually want to extend this lexical database of "bad words", however, for example to include all university acronyms (CMU, UCSD, DU, MIT, UNC and such), sports team names (Celtics, Bruins, Red Sox...) and, depending on the domain of your messages, additional names of public figures (WordNet has several, such as George Bush or Robert De Niro, but it lacks less famous people or people who came to fame more recently, e.g. Barack Obama).

To complement Wordnet, two distinct types of sources come to mind:

  • traditional online databases
  • ontologies and folksonomies

Examples of the former are, say, "Cities/States by ZIP code" at the USPS. Examples of the latter are various "lists" compiled by scholars, organizations or individuals. It is impossible to provide an exhaustive list of either of these source types, but the following should help:

  • DAML.ORG Catalog of ontologies
  • US Regions and States example of an ontology DAML format
  • Open Directory Project: open-source directory (warning: gets messy quickly)
  • SourceWatch.org example of a "list of lists : folks in journalism/politics"
  • Search engine keywords: "List of Lists", or use three or four of the words you'd expect to find in the list you seek.

In simpler cases, one can merely download the lists as such, or even cut-and-paste. The ontologies will be "encumbered" with additional attributes that you'll need to parse out (in the future you may actually want these attributes and use the ontologies in a more traditional fashion; for now, grabbing the lexical entities is all that is needed).
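Parsing the lexical entities out of an ontology can look like the following sketch. The inline `sample` is a tiny hypothetical stand-in; real DAML/RDF dumps vary in structure, so the tag names here are an assumption you would adjust to your actual source:

```python
import xml.etree.ElementTree as ET

# Tiny inline stand-in for an ontology file; real DAML/RDF dumps vary,
# so adjust the tag/attribute names to whatever your source actually uses.
sample = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="#NewEngland">
    <rdfs:label>New England</rdfs:label>
  </rdf:Description>
  <rdf:Description rdf:about="#Massachusetts">
    <rdfs:label>Massachusetts</rdfs:label>
  </rdf:Description>
</rdf:RDF>
"""

RDFS = "{http://www.w3.org/2000/01/rdf-schema#}"

def extract_labels(xml_text):
    """Grab just the lexical entities (rdfs:label values),
    discarding the other attributes the ontology carries."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(RDFS + "label")]

print(extract_labels(sample))   # ['New England', 'Massachusetts']
```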

This lexical database compilation task may seem daunting, but the 80-20 rule states that 20% of the "hot words" will account for 80% of the citations in the messages, and therefore with a relatively small effort you should be able to produce a system that covers 90%+ of your use cases.

Looking ahead: Beyond the "hot words" database
There are many ways of approaching this task, using various techniques and concepts from Natural Language Processing (NLP). As your project gains in sophistication, you may want to learn about some of these concepts and possibly implement them. For example, a simple POS tagger comes to mind, as it may help [in part] in discriminating between the various usages of a token like "screw" as your application discards offensive words ("The board of directors wants to screw the students" vs. "The board should be fastened with a minimum of 4 screws per yard").

Before even needing these formal NLP techniques, you may use a few pattern-based rules to handle common cases associated with the domain(s) relevant to the type of messages the project targets. For example, you may consider the following:
- (word) State University
- Senator (Word starting with a capital letter)
- Words that mix letters and numbers (these are often used to misspell names and circumvent the type of filters your project wishes to implement)
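The rules above translate directly into regular expressions. A minimal sketch (the exact patterns are illustrative; you would tune them to your corpus):

```python
import re

# Hypothetical pattern-based rules matching the cases listed above.
rules = [
    re.compile(r"\b[A-Z][a-z]+ State University\b"),    # e.g. "Ohio State University"
    re.compile(r"\bSenator [A-Z][a-z]+\b"),             # e.g. "Senator Smith"
    # A token containing at least one digit AND at least one letter,
    # e.g. "scr3w" used to slip past word-list filters.
    re.compile(r"\b(?=\w*[0-9])(?=\w*[A-Za-z])\w+\b"),
]

def apply_rules(message, placeholder="[removed]"):
    """Run each pattern rule over the message, masking any match."""
    for rule in rules:
        message = rule.sub(placeholder, message)
    return message

print(apply_rules("Senator Smith went to Ohio State University"))
# -> "[removed] went to [removed]"
```

These rules run before (or alongside) the hot-word lookup and catch shapes of names that no finite word list can enumerate.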

Another tool that may be useful, particularly in the beginning, is a system that collects statistical info about the message corpus: word frequency, most common words, most common bigrams (two consecutive words), etc.

mjv
mjv - Wow, thank you, this is more than enough to get me going on the right track. I will try some of these tools out!
CitadelCSAlum