I need to extract information from hundreds of résumés. The ideal would be .doc, .docx, .pdf, .rtf --> HR-XML, but since more than 90% of the résumés are .doc, the other formats are not a must-have.

I'm looking to buy a third-party tool or a component.

Do you have any good/bad experience solving a similar problem?

Clarification: I'm not looking to use MS Indexing Services or Lucene or any other search indexing engine. It's not that straightforward. The biggest challenge is that the layout/format of the résumés is not the same, so simple indexing won't do.

+1  A: 

Microsoft Indexing Service should be able to do what you want. There is a plug-in for PDF and it supports Word documents.

Jon
+1  A: 

You could try Lucene.NET, the .NET port of the open source Lucene indexing library.

sgwill
Thanks, but see my clarification above.
vitule
+1  A: 

You don't say exactly what info you are looking for, and that makes a big difference. If, for example, you wish to list email addresses and phone numbers for all candidates, it should be a straightforward case of dumping the text ("wv" should do what you need for Word docs) and constructing or googling for a suitable regex. Searching for specific strings (e.g. tomcat, .NET) is even easier.
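As a rough illustration of the regex approach, here is a minimal Python sketch that assumes the résumé has already been dumped to plain text. The patterns, the 10-digit threshold for phone numbers, and the name extract_contacts are all assumptions made for this example, not a tested production parser; expect false positives and misses on real résumés.

```python
import re

# Deliberately simple patterns (assumptions for illustration only).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then digits/spaces/brackets/dots/dashes on the same line, then a digit.
PHONE_RE = re.compile(r"\+?\d[\d \t().-]{7,}\d")


def extract_contacts(text, keywords=()):
    """Return emails, phone-like strings, and matched keywords found in text."""
    emails = sorted(set(EMAIL_RE.findall(text)))

    phones = []
    for match in PHONE_RE.finditer(text):
        candidate = match.group().strip()
        # Assumption: require at least 10 digits so that date ranges
        # such as "2005 - 2008" are not reported as phone numbers.
        if len(re.sub(r"\D", "", candidate)) >= 10:
            phones.append(candidate)

    # Case-insensitive substring search for specific strings.
    lowered = text.lower()
    found = [k for k in keywords if k.lower() in lowered]

    return {"emails": emails, "phones": phones, "keywords": found}
```

Calling extract_contacts(text, keywords=["tomcat", ".NET"]) on each dumped résumé gives one dictionary per candidate, which can then be written to a spreadsheet or database for review.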

If you need to parse out candidate names, street addresses etc., then you're into a whole different world of pain. There's someone trying to do street addresses here:

In general, anything other than the most standardised data formats is probably going to require hand verification and correction.

Colin Pickard
Sad but true. We tried a couple of third-party tools, but with the success rate below 60%, we ended up hiring an intern for a couple of weeks and let her be our parser...
vitule
+1  A: 

Surely Google must have automated this by now ;-)

Some sites claiming to solve this:

http://www.sovren.com/

http://www.resumate.com/

If you are screening new applicants for a job, why not offload the effort onto them by designing your own web form for them to fill out? Then you will have all the applications in your own standard format.

Bork Blatt
I hate when companies do that. It duplicates effort and is very impersonal---it feels like filling out a job-app form at a grocery store. I've had companies want me to fill that kind of stuff out *after* the interview. What, did you not read my resume before the interview? How demeaning.
Adam Jaskiewicz
+1 for the two links, not for the web-based form idea...
Aardvark
Sure - this isn't pleasant for the applicant. But when there is 1 interviewer, and 1000 resumes, all in different formats, with different names for the same skills, it can be a bit overwhelming. If someone has found a way to automate the screening of resumes in any format, fantastic.
Bork Blatt
A: 

In college I had a part-time job reading resumes and entering that data into a database. So the solution for that company was to hire students majoring in subjects related to the positions they were filling. This was important so the reader would have some understanding of the terms used in the resumes.

For example, Computer Science students for programming jobs would understand that C != C++ != C# BUT C# should indicate OO experience.

So that's my non-programming suggestion!

My technical idea would be to write some Microsoft Word VBA macro code to speed up data entry. A human would still be involved, highlighting sections of the resume and pressing toolbar buttons that invoke VBA code. For example, have a parse_address macro. It could try to parse the selected text as an address and present a preview of the result in a VBA form; the user would then tweak it manually as needed and press a button on the form to add this data to a database (or whatever...).
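To make the parse-then-review flow concrete, here is a small sketch in Python rather than VBA. It is only an illustration: the pattern handles a single assumed US-style layout ("123 Main St, Springfield, IL 62704"), and the names ADDRESS_RE and format_preview are hypothetical. Anything that does not match returns None, so the human falls back to typing the fields by hand.

```python
import re
from typing import Optional

# Hypothetical pattern for one US-style layout only:
# "<number> <street>, <city>, <2-letter state> <ZIP or ZIP+4>"
ADDRESS_RE = re.compile(
    r"^(?P<street>\d+\s+[^,]+),\s*"
    r"(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)$"
)


def parse_address(selected_text: str) -> Optional[dict]:
    """Try to split the highlighted text into address fields."""
    # Join the selected lines with ", " so multi-line addresses match too.
    parts = [p.strip(" ,\t") for p in selected_text.splitlines()]
    flat = ", ".join(p for p in parts if p)
    match = ADDRESS_RE.match(flat)
    return match.groupdict() if match else None


def format_preview(fields: Optional[dict]) -> str:
    """Build the text a reviewer would check before saving."""
    if fields is None:
        return "Could not parse selection; please enter the address manually."
    return "\n".join(f"{name}: {value}" for name, value in fields.items())
```

The point of the design is that the code never writes to the database on its own: it only proposes field values, and the person doing data entry confirms or corrects them.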

Aardvark
+2  A: 

I've used both Sovren and Daxtra to do this as part of my current project and have found Sovren to be easier to use and to produce better results. It comes as a .NET library and is a real no-brainer to develop against.

Andrew Hancox
A: 

I'm using http://www.alt-soft.com Xml2PDF Server for a similar purpose.