I have a fairly small corpus of structured records sitting in a database. Given a tiny fraction of the information contained in a single record, submitted via a web form (and therefore structured the same way as the table schema), which I will call the test record, I need to quickly draw up a list of the records that are the most likely matches for it, as well as provide a confidence estimate of how closely the search terms match a record. The primary purpose of this search is to discover whether someone is attempting to input a record that is a duplicate of one already in the corpus. There is a reasonable chance that the test record will be a dupe, and a reasonable chance that it will not be.

The records are about 12000 bytes wide and the total count of records is about 150,000. There are 110 columns in the table schema and 95% of searches will be on the top 5% most commonly searched columns.

The data is stuff like names, addresses, telephone numbers, and other industry-specific numbers. In both the corpus and the test record it is entered by hand and is semi-structured within an individual field. You might at first blush say "weight the columns by hand and match word tokens within them", but it's not so easy. I thought so too: I assumed that a matching telephone number would indicate a perfect match. The problem is that there isn't a single field in the form whose token frequency does not vary by orders of magnitude. A telephone number might appear 100 times in the corpus or just once. The same goes for every other field. This makes weighting at the field level impractical; I need a more fine-grained approach to get decent matching.

My initial plan was to create a hash of hashes, with the field name as the top-level key. Then I would select all of the information from the corpus for a given field, attempt to clean up the data contained in it, tokenize the sanitized data, and hash the tokens at the second level, with the tokens as keys and their frequencies as values.

I would use the frequency count as a weight: the higher the frequency of a token in the reference corpus, the less weight I attach to that token if it is found in the test record.
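
Roughly, the structure I have in mind looks like the minimal Python sketch below. It is purely illustrative: the field names are made up (the real schema has ~110 columns), the tokenizer is deliberately crude, and the per-field weights are just the "weight the columns by hand" instinct mentioned above:

    from collections import defaultdict
    import re

    # Hypothetical field names -- the real table has ~110 columns.
    SEARCHED_FIELDS = ["name", "address", "phone", "industry_number"]

    def tokenize(value):
        """Crude sanitize + tokenize: lowercase, strip punctuation, split on whitespace."""
        return re.sub(r"[^\w\s]", " ", value.lower()).split()

    def build_frequency_index(records):
        """records: iterable of dicts keyed by field name, one per corpus row.
        Returns the hash of hashes: {field: {token: frequency}}."""
        index = {field: defaultdict(int) for field in SEARCHED_FIELDS}
        for rec in records:
            for field in SEARCHED_FIELDS:
                for tok in tokenize(rec.get(field) or ""):
                    index[field][tok] += 1
        return index

    def score_candidate(test_record, corpus_record, index, field_weights):
        """Score a corpus record against the test record: each shared token
        contributes field_weight / f(t), so common tokens count for little."""
        score = 0.0
        for field in SEARCHED_FIELDS:
            corpus_tokens = set(tokenize(corpus_record.get(field) or ""))
            for tok in set(tokenize(test_record.get(field) or "")):
                if tok in corpus_tokens:
                    f_t = index[field].get(tok, 1)  # token frequency in the corpus
                    score += field_weights.get(field, 1.0) / f_t
        return score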

My first question is for the statisticians in the room: how would I use the frequency as a weight? Is there a precise mathematical relationship between n, the number of records; f(t), the frequency with which a token t appears in the corpus; the probability o that a record is an original and not a duplicate; and the probability p that the test record really is a record x, given that the test record and x contain the same t in the same field? What about the relationship for multiple token matches across multiple fields?

Since I sincerely doubt that there is, is there anything that gets me close but is better than a completely arbitrary hack full of magic factors?
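
For concreteness, the kind of relationship I am imagining would look something like the naive-Bayes sketch below. This is just a guess on my part; it assumes token occurrences are independent, that a duplicate copies its source field verbatim, and that a duplicate is equally likely to copy any of the n records:

    \frac{P(\text{test}=x \mid \text{both contain } t)}{P(\text{test}\neq x \mid \text{both contain } t)}
    \;\approx\;
    \underbrace{\frac{(1-o)/n}{\,1-(1-o)/n\,}}_{\text{prior odds}}
    \times
    \underbrace{\frac{n}{f(t)}}_{\text{likelihood ratio}}

    \text{and, for independently matching tokens } t_1,\dots,t_k \text{ across fields:}\qquad
    \text{odds} \;\approx\; \frac{(1-o)/n}{1-(1-o)/n}\;\prod_{i=1}^{k}\frac{n}{f(t_i)}

That would make a token's contribution grow as f(t) shrinks, which is the inverse-frequency intuition above, but I have no idea whether it survives contact with real, hand-entered data. Hence the question.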

Barring that, has anyone got a way to do this?

I'm especially keen on other suggestions that do not involve maintaining another table in the database, such as a token frequency lookup table :).

This is my first post on StackOverflow, thanks in advance for any replies you may see fit to give.

A: 

You can probably get some ideas from this different but similar SO question: calculating-context-sensitive-text-correlation.

More specific to the problem at hand, here are a few thoughts and ideas:

First off, acknowledging the very skewed usage (only 6 to 10 attributes cover 95% of the use), you can and should apply asymmetric effort to the attributes, i.e. invest more, both in terms of programming time and in terms of run-time CPU allotment, in dealing with these few attributes than with the 100-odd additional ones.

The relatively small amount of data supplied as input for matching possible duplicates in the database, the relatively small set of attributes typically used, and the apparently common semantics of these (phone number, address, name...) suggest a hand-crafted solution rather than one completely based on machine learning.

Note: many of the suggestions that follow needn't be applied to all the attributes; since fewer than a dozen of them cover virtually all the usage, there is no point, at least at first, in investing much effort in the others.

  • Normalize the data
    If you are not allowed to alter the original field values, duplicate the corresponding columns into "norm_xxx" columns, where xxx is the original name.
    What and how to normalize may vary with each attribute. For "free text"-like data, ensure that there are no leading or trailing spaces, only one space between words, and no tabs or non-printable characters. Use either all uppercase or all lowercase (even though the original / for-display text may include a mix, your processing will go faster if it can assume uniform casing). More specifically for addresses and/or company names, you can convert common terms to a standard form (e.g. ST for STREET, ST., and ST). Be sure to keep this list, for it will be applied to the user's search criteria as well. Part of the normalization may also be to drop some noise words altogether (say CO, INC, GMBH at the end of company names). A sketch of such a normalizer follows this list.
  • Create a few calculated columns
    For example, ones with the text in reverse: a search that would need a leading wildcard on the original value can then be run as an index-friendly trailing wildcard on the reversed column.
  • Consider using a Soundex-like conversion for some attributes.
  • Full-text index, individually, all the text-like columns
  • Create plain (SQL) indexes on all of the 6 to 10 most-used columns
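
For illustration, here is a minimal normalizer along the lines of the first bullet (Python; the abbreviation and noise-word lists, and the function names, are placeholders to be replaced after inspecting the actual data):

    import re

    # Placeholder lists -- build the real ones by eyeballing the corpus.
    ABBREVIATIONS = {"STREET": "ST", "ST.": "ST", "AVENUE": "AVE", "AVE.": "AVE"}
    NOISE_WORDS = {"CO", "CO.", "INC", "INC.", "GMBH"}

    def normalize_text(value):
        """Uppercase, strip non-printable characters, collapse whitespace,
        standardize common abbreviations, and drop trailing noise words."""
        value = re.sub(r"[^\x20-\x7E]", " ", value.upper())  # tabs, control chars -> space
        tokens = value.split()                               # also collapses runs of spaces
        tokens = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
        while tokens and tokens[-1] in NOISE_WORDS:          # e.g. "ACME WIDGETS INC"
            tokens.pop()
        return " ".join(tokens)

    def normalize_phone(value):
        """Digits only, so '+1 (555) 123-4567' and '5551234567' compare equal."""
        return re.sub(r"\D", "", value)

The key point is that exactly the same function runs over the corpus (to fill the norm_xxx columns) and over the incoming form values, so the comparison is apples to apples.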

All of the above are mere off-line preparations for actually performing matches. Now the user inputs his/her query; here are a few ideas on how to deal with it:

  • Normalize the search criteria that warrant it (using the same rules applied to the corpus)
  • Run several searches...
    This is a bit tricky; there are several, partially conflicting, goals for performing these searches. We want to significantly reduce the number of "potential matches": it is effectively impractical to do a full one-on-one comparison of all 150,000 records against the user-supplied criteria; for example, some of the matching logic may imply computing the edit distance between a field of a given database record and a search criterion. We also want to ensure we do not exclude records from the "potential matches" list because of a typo in, say, the company name. Finally, we want to provide the list of potential matches in a ranked fashion.
    The way to perform these searches follows some pre-defined heuristics (I found the strategy design pattern works well for this, allowing flexibility in the way the searches are run depending on the user-supplied input). In a nutshell, we search for the most selective words in the most selective/relevant attributes, and based on the number of "hits" found we either "OR" (union) or "AND" (intersect) with other search results, until we have a few hundred records. A rough sketch of this narrow-then-score loop follows this list.
  • Compute a similarity value between each attribute of the "potential matches" records and the corresponding search criterion. Possibly apply a coefficient to this value (allowing you to put more weight on, say, a [partial] company-name match than on a city match)
  • Tally the overall similarity value for a complete record (vs. the complete search criteria)
  • Show the records exceeding a particular threshold of the similarity value to the end-user, for review

    Finally, and here comes a partially automated process, you can alter some of the parameters based on feedback supplied by the end-user. (This is very tricky to do; I'll keep it for some other post ;-) )
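
To make the last few bullets concrete, here is a very rough sketch of the narrow-then-score loop (Python; the table and column names, the query() helper, the weights, and the threshold are all invented placeholders, and the per-attribute similarity is deliberately simplistic):

    # query(conn, sql, params) -> list of dict-like rows; it stands in for
    # whatever DB access layer is actually in use.

    def find_candidates(conn, criteria, max_candidates=300):
        """criteria: {norm_column: normalized_value}. Run the strict (AND) pass
        first; if it leaves room, union in the loose (OR) pass so a typo in one
        field does not eliminate a genuine duplicate."""
        fields = [f for f in ("norm_phone", "norm_name", "norm_address") if f in criteria]
        params = [criteria[f] for f in fields]
        strict = query(conn, "SELECT * FROM records WHERE "
                       + " AND ".join(f"{f} = %s" for f in fields), params)
        if len(strict) >= max_candidates:
            return strict[:max_candidates]
        loose = query(conn, "SELECT * FROM records WHERE "
                      + " OR ".join(f"{f} = %s" for f in fields)
                      + f" LIMIT {int(max_candidates)}", params)
        seen = {r["id"] for r in strict}
        return strict + [r for r in loose if r["id"] not in seen]

    # Per-attribute coefficients: a company-name hit outweighs a city hit.
    WEIGHTS = {"norm_phone": 5.0, "norm_name": 3.0, "norm_address": 2.0, "norm_city": 1.0}

    def similarity(a, b):
        """Placeholder per-attribute similarity in [0, 1]; swap in edit distance,
        token overlap, or Soundex equality per attribute as appropriate."""
        if not a or not b:
            return 0.0
        return 1.0 if a == b else 0.0

    def rank_candidates(candidates, criteria, threshold=4.0):
        """Tally the weighted per-attribute similarities and keep the records
        that clear the review threshold, best first."""
        scored = []
        for rec in candidates:
            total = sum(WEIGHTS.get(f, 1.0) * similarity(v, rec.get(f))
                        for f, v in criteria.items())
            if total >= threshold:
                scored.append((total, rec))
        return sorted(scored, key=lambda pair: pair[0], reverse=True)

The WEIGHTS and threshold values above are exactly the kind of parameters that the end-user feedback loop in the last point would adjust over time.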

mjv
Thanks for your thoughts and the links to more research on this. I might end up computing an edit distance, or not; I can't be sure yet. But I think I will score token matches as 1/f(t) * fieldweight and see how close that gets me before computing edit distances.
masonk