To make matters more specific:

  1. How to detect people's names (this seems like a simple case of named entity extraction?)
  2. How to detect addresses: my best guess is to find postcodes (with regexes) plus country and town names, then take some text around them.
  3. Phones and emails could probably be caught by various regexes plus preprocessing.
  4. I don't care about education/work experience at this point.

Reasoning: in order to build a full-text index on resumes, all sensitive information should be stripped out of them.

P.S. Third-party APIs/services won't do as a solution.

A: 

I feel it can't be done by a machine.

Every other resume will have a different format and layout. The best you can do is to design an internal format and manually copy every resume's content into it. Or ask candidates to fill out your form (not many will bother).

Developer Art
http://adlab.msn.com/vnext/People-Name-Detection/ could be an example of a name detector, but in my case I need an algorithm, not a service, or at least a reference to some research material. Syntax analysis is too wide a subject for me to investigate.
bushed
You haven't applied to any company that uses BrassRing for their HR/recruiting. They do this, and it's rather nice.
monksy
+2  A: 

The problem you're interested in is information extraction from semi-structured sources: http://en.wikipedia.org/wiki/Information_extraction

I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.

carlosdc
A: 

I think that the problem should be broken up into two search domains:

  1. Finding information relating to proper names
  2. Finding information that is formulaic

First, information relating to proper names is probably best found by searching for words that are grammatically significant. For example, English capitalizes only the first word of a sentence and proper nouns. Using that grammatical rule, you could look for all words whose first letter is capitalized and check each one against a database that maps the word to its type [e.g. Bob - Name, Elon - Place, England - Place].
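A minimal sketch of that capitalized-word lookup, assuming a tiny in-memory dictionary stands in for the database (the `GAZETTEER` contents and the `find_proper_nouns` name are made up for illustration; a real list of names and places would be far larger):

```python
import re

# Hypothetical stand-in for the word -> type database described above.
GAZETTEER = {
    "Bob": "Name",
    "Elon": "Place",
    "England": "Place",
}

# A word that starts with an uppercase letter followed by lowercase letters.
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def find_proper_nouns(text):
    """Return (word, type) pairs for capitalized words found in the gazetteer."""
    hits = []
    for match in CAPITALIZED.finditer(text):
        word = match.group()
        kind = GAZETTEER.get(word)
        if kind is not None:
            hits.append((word, kind))
    return hits


print(find_proper_nouns("My name is Bob and I live in England."))
```

Sentence-initial words such as "My" are capitalized too, which is why the lookup, and not capitalization alone, decides what counts as a hit.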

Second, information that is formulaic: email addresses, phone numbers, and physical addresses. All of these have specific formats that don't change. Use a regex to find candidates and an algorithm to judge the quality of the matches.
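A rough sketch of the regex-plus-quality-check idea for emails and phone numbers. The patterns, the `[EMAIL]`/`[PHONE]` placeholders, and the 7 to 15 digit rule are my own simplifying assumptions, not a complete validator:

```python
import re

# Deliberately simple patterns; real-world formats vary more than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{5,}\d")


def looks_like_phone(candidate):
    """Crude quality check: accept candidates with 7 to 15 digits."""
    digits = sum(ch.isdigit() for ch in candidate)
    return 7 <= digits <= 15


def redact(text):
    """Replace emails and plausible phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)

    def replace_phone(match):
        # Keep the original text when the candidate fails the quality check.
        return "[PHONE]" if looks_like_phone(match.group()) else match.group()

    return PHONE.sub(replace_phone, text)


print(redact("Contact: bob@example.com, +1 (555) 123-4567."))
```

A digit count is a weak filter: a date range such as "2005 - 2009" would still pass it, so a real quality check would need more context (nearby labels like "Tel", position in the document, and so on).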

Watch out: the grammatical rules change based on language. German capitalizes EVERY noun. It might be best to detect the language of the document before applying your rules. Another issue [one my own resume sometimes has] is how the document was designed. If the resume was made with something other than a text editor [e.g. designer tools], the text may not line up, or it may be in a bitmap format.

TL;DR Version: NLP techniques can help you a lot.

monksy