Hello,

I am getting started with text mining. I have two database tables with thousands of rows:

a table for "skills" and a table for "skill categories".

  • Every "skill" belongs to a skill category.
  • A "skill" is, physically, a varchar(200) field in the database containing some text that describes the skill.

Here are some skills extracted from the skills table:

"PHP (good level), Java (intermediaite), C++" "PHP5" "project management and quality management" "begining Javascript" "water engineering" "dfsdf zerze rzer" "cibling customers"

What I want to do is extract knowledge from those fields, i.e. keep only the real skills and ignore the rest of the text. For the example above, I want to get an array containing only:

"PHP" "Java" "C++" "PHP5" "project management" "quality management" "Javascript" "water engineering" "cibling customers"

How can I extract the skills from such a large amount of data? Do you know of specific algorithms for this, e.g. k-means?

Thanks in advance.

A: 

I would use regular expressions to parse each row of data: first split on commas (,), then remove any text held within brackets, along with the spaces leading up to those brackets. As for removing junk phrases, perhaps compare against an accepted word list?

I also notice that the keyword 'and' separates two skills, going by your desired output. Results from this kind of processing may be a bit sketchy, since the data is not necessarily all in the same format.
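
A minimal sketch of the steps described above, not a definitive implementation. The function name split_skill_field and the contents of $accepted are assumptions made for illustration; a real accepted-word list would have to come from your own data.

```php
<?php
// Hypothetical helper: splits one raw skill field into candidate skills.
function split_skill_field(string $field): array
{
    // Drop anything in parentheses, plus the spaces leading up to it.
    $field = preg_replace('/\s*\([^)]*\)/', '', $field);
    // Split on commas and on the standalone word "and" (case-insensitive).
    $parts = preg_split('/\s*,\s*|\s+and\s+/i', $field);
    // Trim each piece and discard empty ones.
    $parts = array_filter(array_map('trim', $parts), function ($p) {
        return $p !== '';
    });
    return array_values($parts);
}

// Assumed accepted-word list (lowercase); illustrative values only.
$accepted = array('php', 'java', 'c++', 'project management', 'quality management');

$rows = array(
    'PHP (good level), Java (intermediaite), C++',
    'project management and quality management',
    'dfsdf zerze rzer',
);

$skills = array();
foreach ($rows as $row) {
    foreach (split_skill_field($row) as $candidate) {
        // Keep a candidate only if it appears in the accepted list.
        if (in_array(strtolower($candidate), $accepted, true)) {
            $skills[] = $candidate;
        }
    }
}

print_r($skills);
```

Junk rows such as "dfsdf zerze rzer" are dropped only because they are missing from the accepted list, so the quality of the result depends entirely on that list.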

Seidr
A: 

It would be very hard to start from scratch.

I'd get a list of known skills from somewhere, load it into a table, and use that as a reference table to match the data against. Otherwise you have no way to determine whether the words or phrases are meaningful or not.

For each phrase, I'd use the following algorithm.

Say you have a phrase of 5 words

 "one two three four five"

First I'd check whether the whole phrase exists in my table. If so, keep it and go on to the next phrase; if not, check

 "one two three four" and "two three four five"

and if they don't match either, check

  "one two three", "two three four", "three four five"

etc...
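
The shrinking-window lookup described above could be sketched roughly as follows. The function name longest_known_subphrase and the $reference contents are assumptions for illustration; in practice the reference set would be loaded from the reference table.

```php
<?php
// Hypothetical helper: returns the longest run of consecutive words from
// $phrase that exists in $reference, or null when nothing matches.
function longest_known_subphrase(string $phrase, array $reference): ?string
{
    $words = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
    $count = count($words);
    // Try the full phrase first, then every shorter window, left to right.
    for ($len = $count; $len >= 1; $len--) {
        for ($start = 0; $start + $len <= $count; $start++) {
            $candidate = implode(' ', array_slice($words, $start, $len));
            // $reference is keyed by lowercase phrase, so isset() is a cheap lookup.
            if (isset($reference[strtolower($candidate)])) {
                return $candidate;
            }
        }
    }
    return null;
}

// Assumed reference data (lowercase), flipped so the phrases become keys.
$reference = array_flip(array('javascript', 'water engineering', 'project management'));

var_dump(longest_known_subphrase('begining Javascript', $reference));
var_dump(longest_known_subphrase('dfsdf zerze rzer', $reference));
```

This sketch stops at the first (longest) match, so a field that holds several skills would need to be split beforehand or scanned again on the remaining words.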

I know it is a bit messy and long-winded, but it is the first thing that came to mind.

Hope it helps

marvin
A: 
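<?php
// Words accepted as skills, e.g. "PHP" or "Java"; fill from a reference list.
$white_list = array();
// Substrings to strip from every token before it is checked.
$black_list = array();

$s = '"PHP (good level), Java (intermediaite), C++" "PHP5" "project management and quality management" "begining Javascript" "water engineering" "dfsdf zerze rzer" "cibling customers"';

// Split on spaces; multi-word skills would need phrase matching instead.
$words = explode(" ", $s);

$primary = array();   // tokens found in the white list
$secondary = array(); // everything else, kept for later review
foreach ($words as $word) {
    // Remove black-listed substrings, then surrounding quotes, commas and whitespace.
    $new_word = trim(str_replace($black_list, "", $word), " \t\n\r\"',");
    if ($new_word === "") {
        continue; // nothing left after cleaning
    }
    if (in_array($new_word, $white_list, true)) {
        $primary[] = $new_word;
    } else {
        $secondary[] = $new_word;
    }
}

// Quote the accepted tokens in the same style as the input.
$collected = '"' . implode('" "', $primary) . '"';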

You could use something like this to build up white and black lists in a table. In the long run you'll have better control over what counts as a valid skill and what does not.
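
One possible way to grow the lists from the unmatched tokens, as a rough sketch: count how often each token in $secondary occurs and review the frequent ones by hand. The sample $secondary values and the $min_hits threshold are assumptions made for this example.

```php
<?php
// Assumed sample: tokens that did not match the white list across several rows.
$secondary = array('PHP5', 'good', 'level', 'PHP5', 'begining', 'good', 'PHP5');

// Count how often each unmatched token occurs, most frequent first.
$counts = array_count_values($secondary);
arsort($counts);

// Tokens seen at least $min_hits times are candidates for manual review.
$min_hits = 2; // arbitrary threshold chosen for this example
$to_review = array_keys(array_filter($counts, function ($n) use ($min_hits) {
    return $n >= $min_hits;
}));

print_r($to_review);
```

Each reviewed token can then be added to the white list (a real skill such as "PHP5") or to the black list (noise such as "good").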

Brant