I'm looking for references on splitting a name such as "John A. Doe" into parts: first=John, middle=A., last=Doe. In Mexico we have paternal and maternal surnames plus first and second given names, and these can be written in different permutations, so the problem is quite complex.

Since the answer depends on the data, we are working with matching software that calculates a score for every word so we can make decisions (it is based on a big database). The input data is not clean: it is imported from some government web pages and is human-filtered, so it may contain junk that has to be recognized as well. Any suggestions?

[Edit] Examples:

name: Javier Abdul Córdoba Gándara
common permutations (or as it may appear in government data referring to the same person):
   Córdoba Gándara Javier Abdul
   Javier A. Córdoba Gándara
   Javier Abdul Córdoba G.
paternal: Córdoba
maternal: Gándara
first given: Javier
second given: Abdul

name: María de la Luz Sánchez Martínez
paternal: Sánchez
maternal: Martínez
first given: María de la Luz

name: Paloma Viridiana Alin Arias Medina
paternal: Arias
maternal: Medina
first given: Paloma
second given: Viridiana Alin
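To make the permutation problem concrete, here is a minimal sketch of a candidate filter that decides whether two strings could refer to the same person. It ignores word order, strips accents, and lets a one-letter initial stand in for a full word. The function names (`normalize`, `is_initial_of`, `could_be_same_person`) are made up for illustration, and the approach is only an assumption about how one might pre-filter pairs before scoring them; because it ignores order, it would also accept swapped paternal and maternal surnames.

```python
import unicodedata


def normalize(token: str) -> str:
    """Lowercase a token, strip accents and drop a trailing period."""
    decomposed = unicodedata.normalize("NFKD", token)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_accents.lower().rstrip(".")


def is_initial_of(short: str, full: str) -> bool:
    """True if `short` is a one-letter initial of the longer word `full`."""
    return len(short) == 1 and len(full) > 1 and full.startswith(short)


def could_be_same_person(a: str, b: str) -> bool:
    """Order-insensitive comparison where initials may replace full words."""
    left = [normalize(t) for t in a.split()]
    right = [normalize(t) for t in b.split()]
    if len(left) != len(right):
        return False

    # Pass 1: cancel out words that appear identically on both sides.
    for tok in list(left):
        if tok in right:
            left.remove(tok)
            right.remove(tok)

    # Pass 2: every leftover word must pair with an initial (or vice versa).
    for tok in left:
        partner = next(
            (r for r in right if is_initial_of(tok, r) or is_initial_of(r, tok)),
            None,
        )
        if partner is None:
            return False
        right.remove(partner)
    return True
```

For example, `could_be_same_person("Javier Abdul Córdoba Gándara", "Javier A. Córdoba Gándara")` accepts the pair, while a different initial such as "Javier B. Córdoba Gándara" is rejected. Multi-word given names such as "María de la Luz" are compared word by word here, which is a simplification.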

As I said, the meaning of each word depends on the score. One has no way of knowing that Viridiana and Alin are given names other than from the score.

We have a very strong database (80 million records or so), so we can get some use out of the scoring system. I am designing an algorithm that uses it, but I am also looking for other references.
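A rough sketch of the kind of scoring I have in mind, in case it helps the discussion. Everything here is hypothetical: `ROLE_COUNTS` stands in for counts that would come from the real database (the numbers are invented), the words are assumed to be already lowercased and accent-stripped, and the sketch assumes exactly two single-word surnames, so names like "María de la Luz" would need their particles merged first.

```python
from typing import Dict, List, Tuple

# Hypothetical counts of how often each word was seen as a given name vs. a surname.
# Invented numbers; real values would come from the reference database.
ROLE_COUNTS: Dict[str, Dict[str, int]] = {
    "javier": {"given": 9000, "surname": 40},
    "abdul": {"given": 300, "surname": 15},
    "cordoba": {"given": 5, "surname": 7000},
    "gandara": {"given": 1, "surname": 2500},
}


def role_score(word: str, role: str) -> float:
    """Share of sightings of `word` in `role`, with add-one smoothing."""
    counts = ROLE_COUNTS.get(word, {})
    given = counts.get("given", 0) + 1
    surname = counts.get("surname", 0) + 1
    wanted = given if role == "given" else surname
    return wanted / (given + surname)


def best_layout(tokens: List[str]) -> Tuple[str, float]:
    """Pick between 'given names first' and 'surnames first' by mean role score."""
    if len(tokens) < 3:
        raise ValueError("need at least one given name and two surnames")
    n_given = len(tokens) - 2
    layouts = {
        "given_first": ["given"] * n_given + ["surname"] * 2,
        "surname_first": ["surname"] * 2 + ["given"] * n_given,
    }
    scored = {
        name: sum(role_score(t, r) for t, r in zip(tokens, roles)) / len(tokens)
        for name, roles in layouts.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]
```

With the invented counts above, `best_layout(["cordoba", "gandara", "javier", "abdul"])` picks the `surname_first` layout. An unseen word scores 0.5 for either role, which is one way to flag tokens that need a human look.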

A: 

Before I go into regular expressions to the nth degree, have a look at http://www.ultrapico.com/expresso.htm which is a great tool for doing this sort of thing.

What language are you planning to use, and do you want to automate this?

Do you have some examples for us to start with?

RE

Reallyethical
A: 

You may need to add some natural language processing or machine learning as a check. The problem of identifying author names (e.g. in scientific papers) is difficult, as they can be reported with differing orders, degrees of abbreviation, elisions, etc. If your database is dirty, you will end up with ambiguity whatever you do.

peter.murray.rust
+1  A: 

Unfortunately - and having done quite a bit of this work myself - your ideal algorithm will be very data specific, and you will need to work this out for your particular situation.

Of the total time and effort to develop this algorithm, I'd say the time will be split roughly as follows:

  1. 10% for general string manipulation
  2. 30% for the specific nature of the data (Mexican name formats, data input quirks)
  3. 60% to cater for data quality / lack of quality

And I believe that's quite generous towards the general string manipulation. Of course it depends whether you need quality results for all records, or only the 'clean' records etc, and if you are able to ignore the 'difficult' records it makes it a lot simpler.

Some general tips

  • If they are not required, remove non alphanumeric / whitespace characters
  • Split on spaces
  • Use hyphens / punctuation to identify surnames or family names
  • Initials (which are generally single letters) are not surnames; i.e. they must be first / middle
  • Determine the level of confidence that you have programmatically identified each name (and test this thoroughly). You may find there are subsets of data that contain similar patterns that need to be catered for individually (they may come from different sources etc)
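The tips above can be sketched roughly as follows. This is a minimal illustration, not a finished parser: the function names and the `hint` labels are made up, periods and hyphens are kept during cleaning because the later tips rely on them, and the confidence figure is only a placeholder (share of tokens that do not look like junk). Note also that the question's own data contains "Javier Abdul Córdoba G.", so for this data set "initials are not surnames" is a heuristic rather than a rule.

```python
import re
from typing import Dict, List


def tokenize(raw: str) -> List[str]:
    """Replace anything but word characters, whitespace, '.' and '-' with a space, then split."""
    cleaned = re.sub(r"[^\w\s.\-]", " ", raw)
    return cleaned.split()


def tag_token(token: str) -> Dict[str, object]:
    """Attach a coarse hint to one token."""
    stripped = token.rstrip(".")
    if len(stripped) == 1 and stripped.isalpha():
        # Single letters are usually initials of given names (heuristic only).
        return {"token": token, "hint": "initial", "surname_likely": False}
    if not stripped.replace("-", "").isalpha():
        # Digits, stray punctuation, underscores: treat as junk to be reviewed.
        return {"token": token, "hint": "junk", "surname_likely": False}
    if "-" in stripped:
        # Hyphenated words tend to be surnames or family names.
        return {"token": token, "hint": "hyphenated", "surname_likely": True}
    return {"token": token, "hint": "word", "surname_likely": True}


def tag_name(raw: str) -> List[Dict[str, object]]:
    """Tokenize a raw name string and tag every token."""
    return [tag_token(t) for t in tokenize(raw)]


def confidence(tags: List[Dict[str, object]]) -> float:
    """Placeholder confidence: fraction of tokens not flagged as junk."""
    if not tags:
        return 0.0
    junk = sum(1 for t in tags if t["hint"] == "junk")
    return 1.0 - junk / len(tags)
```

Running `tag_name` over a sample from each data source and grouping records by their sequence of hints is one cheap way to spot the source-specific patterns mentioned in the last tip.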
Kirk Broadhurst