This might be a hard one (if not impossible), but can anyone think of a regular expression that will find a person's name in, say, a resume? I know this won't be 100% accurate, but I can't come up with anything.

Let's assume the name only shows up once in the document.

+2  A: 

Forget it - seriously.

Or expect to get a lot of applications from a Mr C Vitae.

Martin Beckett
+2  A: 

No, you can't use regular expressions for this. The only chance you have is if the document is always in the same format and you can find the name based on the context surrounding it. But this probably isn't the case for you.

If you are asking your applicants to submit their résumé online you could provide a separate field for them to enter their name and any other information you need instead of trying to automatically parse résumés.

Mark Byers
+1  A: 

Unless you wanted to build an expression that contained every possible name, or-ed together, the expression you are referring to is not "Regular," with a capital R. A good guess might be to go looking for the largest-font words in the document. If they follow a pattern that looks like firstname-lastname, name-initial-name, etc., you could call it a good guess...

+2  A: 

In my experience, having written something very similar (but a very long time ago), about 95% of resumes have the person's name as the very first line. You could probably have a pretty loose regex checking for alpha, hyphens, periods, and assume that's the name.

Obviously there's no way to do this 100% accurately, as you said, but this would be close.
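
A minimal sketch of that first-line heuristic, in Python. The function name `guess_name`, the limit of two to four words, and the ASCII-only character class are all assumptions made for illustration; names with accented letters would need a wider pattern.

```python
import re

# Loose pattern: two to four words, each starting with a letter and made of
# ASCII letters, periods, apostrophes, and hyphens (an illustrative choice).
NAME_LINE = re.compile(r"[A-Za-z][A-Za-z.'-]*(?: [A-Za-z][A-Za-z.'-]*){1,3}")

def guess_name(resume_text):
    """Return the first non-blank line if it looks like a name, else None."""
    for line in resume_text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue
        # Only the first non-blank line is ever considered.
        return candidate if NAME_LINE.fullmatch(candidate) else None
    return None

print(guess_name("\n  Mary-Anne O'Neil Jr.\n123 Main St"))  # Mary-Anne O'Neil Jr.
print(guess_name("123 Main Street\nJohn Smith"))            # None
```

Note that `guess_name("Curriculum Vitae\nJohn Smith")` returns `"Curriculum Vitae"`, so a heading on the first line fools it, and the result still needs a sanity check.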

Dan Breen
You could use formatting clues: whatever is the biggest element is probably the person's name. I believe Google does this on some level with documents: stuff in html / head / title and h1 fields gets weighted more heavily.
Jared Updike
+1  A: 

That's a really hairy problem to tackle. The regex has to match two words that could be someone's name. The problem is that some people, for example those of Hispanic origin, might have a name that's more than two words long. Also, how would you decide which two words count as a name? Would you use a database of common first and last names? That might work unless someone has an uncommon name.

I'm reminded of a story a COBOL teacher in college told me about an individual of Asian origin whose name would break every rule the programmers defined for a bank's internal system. His first name was "O": just the letter O.

The only remotely dependable way to nail down the regex would be if you had something to set off your search with; maybe if a line of text in the resume began with "Name: " then you'd know where to start looking.
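
As a rough sketch of that anchored search in Python, assuming the resume really does contain a line starting with a `Name:` label (the function name `name_from_label` is made up for this example):

```python
import re

# Match a line such as "Name: Jane Doe" and capture whatever follows the label.
# [ \t]* is used instead of \s* so the match never spills onto the next line.
NAME_LABEL = re.compile(
    r"^[ \t]*Name[ \t]*:[ \t]*(\S.*?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def name_from_label(resume_text):
    """Return the text after the first 'Name:' label, or None if there is none."""
    match = NAME_LABEL.search(resume_text)
    return match.group(1) if match else None

print(name_from_label("Name: O.\nPhone: 555-0100"))  # O.
print(name_from_label("Objective: find a job"))      # None
```

Because the pattern captures everything after the label rather than trying to describe what a name looks like, a one-letter or five-word name comes through unchanged.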

tl;dr: People's names and individual resumes are too heavily varied for a regular expression to pick apart.

Micah
We had a worse one: some Indonesian women don't have a surname until they are married. So student admissions system + no surname = chaos.
Martin Beckett
Oh man that would be brutal to try to fix.
Micah
A: 

You could do something like what Amazon does for book overviews: SIPs (statistically improbable phrases). This would require some after-the-fact double-checking by humans, but you might find the person's name(s) in there.
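
A toy sketch of the idea in Python, not Amazon's actual method: score each two-word phrase in one resume by how often it appears there compared with a background set of other resumes. The function names, the bigram choice, and the scoring formula are all assumptions made for illustration.

```python
import re
from collections import Counter

def bigrams(text):
    """Lowercased adjacent word pairs; words may contain inner ' or -."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)]
    return list(zip(words, words[1:]))

def improbable_phrases(document, background_docs, top_n=5):
    """Rank bigrams that are frequent in `document` but rare in the background."""
    doc_counts = Counter(bigrams(document))
    background = Counter()
    for other in background_docs:
        background.update(bigrams(other))
    # Score: count in this document divided by (1 + count everywhere else).
    scored = {bg: n / (1 + background[bg]) for bg, n in doc_counts.items()}
    ranked = sorted(scored, key=lambda bg: (-scored[bg], bg))
    return [" ".join(bg) for bg in ranked[:top_n]]
```

A name that is repeated in the resume and absent from the background set tends to rank near the top, but so does any other unusual phrase, which is why a human still has to check the candidates.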

Jared Updike