tags:

views:

870

answers:

4

How can I extract proper nouns and numeric values from a string using PHP or JavaScript? For example, given a string like

Xyz visited this page 53 mins ago.

I want to be able to recognize "Xyz" and "53" as a proper noun and a number, respectively.

A: 

The one obvious way is to have a dictionary of proper nouns and some good indexing to search through it quickly, if such a thing exists.

But I get the feeling you are looking for a way to grammatically infer that a word is a proper noun.

I can't think of any perfect way to do this, but if you created a series of rules, you could use these to parse a passage.

Rules might include:

* Words that end with "ly" are not proper nouns.
* Noise words such as "and", "to", "but", etc. are not proper nouns.
* Words that have capital letters but don't start a sentence are proper nouns.
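
As a rough illustration, the three rules above could be sketched in JavaScript like this. The function name `guessProperNouns` and the short `NOISE_WORDS` list are made up for the example, and the sentence splitting is deliberately naive, so treat it as a starting point rather than a finished parser.

```javascript
// Hypothetical noise-word list; a real one would be much longer.
const NOISE_WORDS = new Set(["and", "to", "but", "the", "a", "of", "in"]);

// Apply the three rules to every word and return the likely proper nouns.
function guessProperNouns(text) {
  const found = [];
  // Naive sentence split, so we know which word starts each sentence.
  const sentences = text.split(/[.!?]+\s*/).filter(Boolean);
  for (const sentence of sentences) {
    const words = sentence.trim().split(/\s+/);
    words.forEach((rawWord, index) => {
      const word = rawWord.replace(/[^A-Za-z]/g, ""); // strip digits and punctuation
      if (word === "") return;
      if (/ly$/i.test(word)) return;                    // rule 1: "-ly" words
      if (NOISE_WORDS.has(word.toLowerCase())) return;  // rule 2: noise words
      // rule 3: capitalized, but not the first word of the sentence
      if (index > 0 && /^[A-Z]/.test(word)) found.push(word);
    });
  }
  return found;
}

console.log(guessProperNouns("Yesterday we met Alice and Bob. Sadly Carol was late."));
// → [ 'Alice', 'Bob', 'Carol' ]
```

Note that a name at the very start of a sentence (like "Xyz" in the question) is not picked up by these rules alone, because rule 3 skips sentence-initial words.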

To improve it, you could use these rules to build a dictionary of proper nouns. Every time a word matches one of these rules, it either gets added to or deleted from the proper nouns dictionary.
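
One possible reading of that idea, as a small JavaScript sketch: words seen capitalized in the middle of a sentence are added to the dictionary, and words seen in lowercase are deleted from it. The names `properNouns`, `learnFrom` and `isKnownProperNoun` are invented for this example, and a plain in-memory `Set` stands in for whatever storage you would really use.

```javascript
// Hypothetical in-memory dictionary of lowercased proper nouns.
const properNouns = new Set();

// Update the dictionary from a single sentence.
function learnFrom(sentence) {
  const words = sentence.replace(/[^A-Za-z\s]/g, "").trim().split(/\s+/);
  words.forEach((word, index) => {
    // The first word is always capitalized, so it tells us nothing.
    if (index === 0 || word === "") return;
    if (/^[A-Z]/.test(word)) {
      properNouns.add(word.toLowerCase());    // capitalized mid-sentence: add
    } else {
      properNouns.delete(word.toLowerCase()); // seen in lowercase: delete
    }
  });
}

// Look a word up, regardless of how it is capitalized here.
function isKnownProperNoun(word) {
  return properNouns.has(word.toLowerCase());
}

learnFrom("We saw Mark and Rose.");
learnFrom("She picked a rose.");
console.log(isKnownProperNoun("Mark")); // → true
console.log(isKnownProperNoun("Rose")); // → false
```

The lookup can then be used for words that start a sentence, where capitalization alone is not enough to decide.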

This is very rough. If this is on the right track, then perhaps I can be more specific.

Ankur
I was hoping to achieve this with a regex or something, e.g. /([^.])(\s)+([A-Z]{1}[a-z]+)/. But this regular expression doesn't match two consecutive proper nouns, e.g. "name is Abb Bayer".
Annibigi
A: 

If there is always exactly one proper noun in the sentence, you could find it by looking for the word that begins with a capital letter. If there is none besides the first word, then the first word is it. Problems arise if Xyz is named Bim de Verdier, or if the name isn't actually capitalized.

// Get the number with JavaScript and RegExp
// Use a regex literal: inside a string, "\d+" loses its backslash and becomes "d+".
var regex = /\d+/;
var match = regex.exec("Xyz visited this page 53 mins ago.");
if (match == null) {
  alert("No match");
} else {
  var s = "";
  // match[0] is the full match; any capture groups would follow it
  for (var i = 0; i < match.length; i++) {
    s = s + match[i] + "\n";
  }
  alert(s);
}

A capitalized word can be matched with "[A-Z][a-z]+[ ]".
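
To also catch two consecutive capitalized words such as "Abb Bayer", the pattern can be extended to match a run of capitalized words. This is only a sketch: `CAPITALIZED_RUN` and `extractNamesAndNumbers` are names made up for the example, and it assumes plain ASCII names separated by single spaces.

```javascript
// One capitalized word, optionally followed by more capitalized words.
const CAPITALIZED_RUN = /\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b/g;

// Return every run of capitalized words and every integer in the text.
function extractNamesAndNumbers(text) {
  return {
    names: text.match(CAPITALIZED_RUN) || [],
    numbers: (text.match(/\d+/g) || []).map(Number),
  };
}

console.log(extractNamesAndNumbers("my name is Abb Bayer and I am 53"));
// → { names: [ 'Abb Bayer' ], numbers: [ 53 ] }
```

It still treats the first word of any sentence as a name, and it splits "Bim de Verdier" into "Bim" and "Verdier", so the caveats above apply here too.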

Jonas Elfström
A: 

The PHP functions is_numeric and ucfirst may help recognize the words:

function parse_name_and_number($sentence) {
    $words = explode(' ', $sentence);
    $name = array();
    $number = null; // stays null when the sentence contains no number
    foreach ($words as $word) {
        if (is_numeric($word)) {
            $number = $word;
        } elseif ($word == ucfirst($word)) {
            // the word is unchanged by ucfirst, so it is already capitalized
            $name[] = $word;
        }
    }
    $name = implode(' ', $name);
    return array('name' => $name, 'number' => $number);
}

print_r(parse_name_and_number('Xyz visited this page 53 minutes ago'));
// output:  Array ( [name] => Xyz [number] => 53 )

print_r(parse_name_and_number('we thought Bim de Verdier visited the page 5 seconds ago'));
// output:  Array ( [name] => Bim Verdier [number] => 5 )

print_r(parse_name_and_number('Weirder input messes up the results'));
// output:  Array ( [name] => Weirder [number] => )
A: 

The best option is to use a link grammar parser: parse the sentence and extract the proper nouns.

www.link.cs.cmu.edu/link