Hello,

I have a large database of resumes (CVs), and a table skills that groups all users' skills.

Inside that table there's a field skill_text that describes the skill in free text.

I'm looking for an algorithm/software/method to extract significant terms/phrases from that table in order to build a new table with standardized skills.

Here are some example skills extracted from the DB:

  • Sectoral and competitive analysis
  • Business Development (incl. in international settings)
  • Specific structure and road design software - Microstation, Macao, AutoCAD (basic knowledge)
  • Creative work (Photoshop, In-Design, Illustrator)
  • checking and reporting back on campaign progress
  • organising and attending events and exhibitions
  • Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX
  • Discipline: One to one marketing, E-marketing (SEO & SEA, display, emailing, affiliate program) Mix marketing, Viral Marketing, Social network marketing.

The output should be something like:

  • Sectoral and competitive analysis
  • Business Development
  • Specific structure and road design software -
  • Macao
  • AutoCAD
  • Photoshop
  • In-Design
  • Illustrator
  • organising events
  • Development
  • Aptana Studio
  • PHP
  • HTML
  • CSS
  • JavaScript
  • SQL
  • AJAX
  • Mix marketing
  • Viral Marketing
  • Social network marketing
  • emailing
  • SEO
  • One to one marketing

As you can see, only skills remain; no other descriptive text.

I know this is possible using text mining techniques, but how do I do it? The database is really large, which is a good thing because we can calculate term frequencies and decide whether something is a real skill or just meaningless text. The big problem is: how do we determine that "blablabla" is a skill?

Edit: please don't tell me to use standard things like a text tokenizer or regexes, because users input skills in a very arbitrary way!

thanks

A: 

Many databases will do this for you via their full-text search functionality. I know that PostgreSQL's full-text search can do this easily with the aid of a custom dictionary.

Alternatively, you can use PHP's strtok or an equivalent to index your text. Once indexed, you can compare the tokens against a dictionary, or simply use occurrence counts to build a list for yourself. Word clouds are made in a similar fashion.
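A minimal sketch of that tokenize-and-count idea (in Python rather than PHP; the sample rows and the tiny stop-word set are illustrative stand-ins for real data and a real dictionary):

```python
import re
from collections import Counter

# Rows as they might come back from the skills table (sample data,
# not real query results).
rows = [
    "Creative work (Photoshop, In-Design, Illustrator)",
    "organising and attending events and exhibitions",
]

# Tiny stand-in for a real stop-word dictionary.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on", "to"}

counts = Counter()
for text in rows:
    # Keep hyphenated tokens like "In-Design" together.
    for word in re.findall(r"[A-Za-z][\w+-]*", text):
        if word.lower() not in STOPWORDS:
            counts[word] += 1

print(counts.most_common(5))
```

Sorting the resulting counts descending gives you exactly the frequency sheet a word cloud is built from.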

asnyder
Thanks for your fast reply; the problem is that I have no dictionary to compare against.
youssef azari
There are default dictionaries; for example, PostgreSQL has one that eliminates stop words like "the", "a", etc. Otherwise you're just going by parsed words and their occurrence counts. You need to create a table of skills to accurately parse them out. You can create this table by doing the counts mentioned above and cutting out what you feel are the skills you're looking for.
asnyder
+3  A: 

If I was doing this programmatically I would:

Extract all punctuation-delimited data (or perhaps just brackets and commas) into a new table (with no primary key, just the skill), so that Creative work (Photoshop, In-Design, Illustrator) becomes

 Skill            
 -------------
 Creative work    
 Photoshop        
 In-Design        
 Illustrator      

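That extraction step could be sketched like this (splitting on brackets and commas; the function name is my own):

```python
import re

def extract_candidates(skill_text):
    """Split one skill_text value on brackets and commas and return
    the non-empty fragments as candidate skill rows."""
    fragments = re.split(r"[(),]", skill_text)
    return [f.strip() for f in fragments if f.strip()]

print(extract_candidates("Creative work (Photoshop, In-Design, Illustrator)"))
```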
Then, after you've processed all CVs, query for the most common skills (this is MySQL):

SELECT skill, COUNT(1) cnt FROM newTable GROUP BY skill ORDER BY cnt DESC;

That might look like this contrived example:

 Skill            Cnt
 ---------------------
 Photoshop        3293
 Illustrator      2134
 Creative work     932
 In-Design         123

Then you decide, from the top X skills, which you want to capture, which must map to other skills (Indesign and In-design should map to the same skill, for example), and which to discard; then script the process using a data map.

Use the data map to write a new word frequency table (this time skill_id, skill, frequency), and on the second parsing pass also write to a lookup table (cv_id, skill_id). Your data will then be in a state where each CV is mapped to a number of skills, and each skill to a number of CVs. You can query for the most popular skills, CVs matching certain criteria, etc.
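A sketch of applying such a data map while building the skill and lookup tables (the map entries, names, and in-memory tables are illustrative, not from the answer):

```python
# Hand-built data map: raw spelling (lowercased) -> canonical skill.
DATA_MAP = {
    "indesign": "In-Design",
    "in-design": "In-Design",
    "photoshop": "Photoshop",
}

def normalise(raw_skill):
    # Unmapped strings return None, i.e. discard or queue for review.
    return DATA_MAP.get(raw_skill.strip().lower())

skill_ids = {}     # canonical skill -> skill_id
cv_skills = set()  # (cv_id, skill_id) lookup rows

def record(cv_id, raw_skill):
    skill = normalise(raw_skill)
    if skill is None:
        return
    skill_id = skill_ids.setdefault(skill, len(skill_ids) + 1)
    cv_skills.add((cv_id, skill_id))

record(1, "Indesign")
record(2, "In-design")
record(3, "blablabla")  # not in the map, so it is dropped
print(skill_ids, cv_skills)
```

Both variant spellings end up under the same skill_id, and unmapped noise never reaches the lookup table.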

Andy
Yes, programmatically this is the best practice, but here is some data that makes it impossible to use your algorithm: "Sensitive to cultural differences and used to working in international environments", "Spanish : Reading and conversation", "Coordination and distribution of marketing tools per countries", "Guidance of sale forces in the sale strategy and the elaboration of the product sales argument". Those are skills extracted from the DB.
youssef azari
Map those sentences to "International/Cultural experience", "Spanish (Reading)", "Spanish (Spoken)", "International/Cultural marketing", "Sales experience (Non-marketing)", etc. Your heterogeneous data makes it difficult to do this any other way, unless you can find a "CV skillset map" or some other fantastic document. This is a labour-intensive task; once you've defined your basic set of skills (as in the examples above), you could farm the map-creation work out to Mechanical Turk or similar.
Andy
A: 

For mining text data you can use Automation Anywhere, which is easy because of its point-and-click data extraction. But your case is to determine which part is a skill and then extract only that part. I don't know how easy that is, but I think it can be done, because I had a somewhat similar problem and this software's automation expert built a task for me; I just had to run it and the desired result was in front of me. I would suggest you download and try this data extraction software, and also ask for their consulting services to help with this; they will build it for you.

Hope it helps you.

Bob
A: 

Doing this well requires knowledge; otherwise, what's to tell that "organising events" is a 'skill' while "creative work" isn't? But a simple program can take a first cut at it by analyzing collocation statistics: see the answers to "How to extract common / significant phrases from a series of text entries" and "Algorithms to detect phrases and keywords from text".
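A rough sketch of the collocation idea: bigrams that occur far more often than their individual words would predict (high pointwise mutual information) are likely fixed phrases such as "viral marketing". The toy corpus below is illustrative only:

```python
import math
from collections import Counter

# Toy corpus of tokenized skill entries (made-up sample data).
corpus = [
    ["viral", "marketing"],
    ["viral", "marketing"],
    ["social", "network", "marketing"],
    ["creative", "work"],
]

unigrams, bigrams = Counter(), Counter()
for tokens in corpus:
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

total = sum(unigrams.values())

def pmi(a, b):
    # log2( P(a,b) / (P(a) * P(b)) ): higher means the pair co-occurs
    # more often than chance, suggesting a fixed phrase.
    p_ab = bigrams[(a, b)] / total
    return math.log2(p_ab / ((unigrams[a] / total) * (unigrams[b] / total)))

print(pmi("viral", "marketing"))
```

On a real corpus you would rank all bigrams by PMI (usually with a minimum frequency cutoff) and treat the top-scoring pairs as candidate phrases.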

Darius Bacon