How can I extract proper nouns and numeric values from a string using PHP or JavaScript? For example, given a string like
Xyz visited this page 53 mins ago.
I want to be able to recognize "Xyz" and "53" as a proper noun and a number, respectively.
The one obvious way is to have a dictionary of proper nouns and some good indexing to search through it quickly, if such a thing exists.
But I get the feeling you are looking for a way to grammatically infer that a word is a proper noun.
I can't think of any perfect way to do this, but if you created a series of rules, you could use these to parse a passage.
Rules might include:
* Words that end with "ly" are not proper nouns.
* Noise words such as "and", "to", "but", etc. are not proper nouns.
* Words that have capital letters but don't start a sentence are proper nouns.
To improve it, you could use these rules to build a dictionary of proper nouns. Every time a word matches one of these rules, it either gets added to or deleted from the proper nouns dictionary.
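The rules above, together with the self-updating dictionary, could be sketched in JavaScript roughly like this. The function name learnProperNouns and the short NOISE_WORDS list are made up for illustration, and the "delete" step (dropping a word once it shows up in lower case mid-sentence) is only one possible reading of the idea.

```javascript
// Hypothetical noise-word list; a real one would be much longer.
const NOISE_WORDS = new Set(["and", "to", "but", "the", "a", "an", "of", "in", "on"]);

// Scans text and updates `dictionary` (a Set of capitalized words) in place.
function learnProperNouns(text, dictionary = new Set()) {
  // Naive sentence split on ., ! and ? (an assumption; abbreviations will confuse it).
  const sentences = text.split(/[.!?]+\s*/).filter(Boolean);
  for (const sentence of sentences) {
    const words = sentence.split(/\s+/).filter(Boolean);
    words.forEach((raw, index) => {
      const word = raw.replace(/[^A-Za-z]/g, ""); // drop punctuation and digits
      // A capital at the start of a sentence carries no information.
      if (word === "" || index === 0) return;
      const excluded = NOISE_WORDS.has(word.toLowerCase()) || /ly$/i.test(word);
      if (/^[A-Z]/.test(word) && !excluded) {
        dictionary.add(word); // capitalized mid-sentence: likely a proper noun
      } else {
        // Seen in lower case (or excluded by a rule): drop any earlier guess.
        dictionary.delete(word.charAt(0).toUpperCase() + word.slice(1));
      }
    });
  }
  return dictionary;
}
```

Note that a name which only ever appears as the first word of a sentence (like "Xyz" in the question) is never learned by these rules, and the "ly" rule throws away names such as "Sally", so this is a starting point at best.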
This is very rough - if this is on the right track, then perhaps I can be more specific.
If there is always exactly one proper noun in the sentence, you could find it by looking for the word that begins with a capital letter. If no word is capitalized except the first one, then the first word is it. Problems arise if Xyz is named Bim de Verdier, or if the name isn't actually capitalized.
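A minimal sketch of that single-name heuristic, assuming one name per sentence. findSingleName is a hypothetical helper, not part of any library.

```javascript
// Hypothetical helper: assumes the sentence contains at most one name.
function findSingleName(sentence) {
  const words = sentence.trim().split(/\s+/).filter(Boolean);
  const isCapitalized = (w) => /^[A-Z]/.test(w);
  const clean = (w) => w.replace(/[^A-Za-z]+$/, ""); // strip trailing punctuation
  // Prefer a capitalized word that does not start the sentence.
  const later = words.slice(1).find(isCapitalized);
  if (later !== undefined) return clean(later);
  // Otherwise fall back to the first word, if it is capitalized.
  if (words.length > 0 && isCapitalized(words[0])) return clean(words[0]);
  return null;
}

console.log(findSingleName("Xyz visited this page 53 mins ago."));  // "Xyz"
console.log(findSingleName("we thought Bim de Verdier visited"));   // "Bim"
```

The second call shows the multi-word problem mentioned above: only "Bim" comes back, not "Bim de Verdier".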
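// Get the number with JavaScript and a regular expression.
// A regex literal is used here: in new RegExp("\d+") the string "\d+"
// collapses to "d+", which would match the letter "d" instead of digits.
var regex = /\d+/;
var match = regex.exec("Xyz visited this page 53 mins ago.");
if (match == null) {
    alert("No match");
} else {
    // match[0] is the whole match; any further entries are capture groups
    var s = "";
    for (var i = 0; i < match.length; i++) {
        s = s + match[i] + "\n";
    }
    alert(s);
}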
A capitalized word can be matched with "[A-Z][a-z]+[ ]".
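To pull out every capitalized word and every number in one go, something like the following might work. extractTokens is a made-up name, and I've swapped the trailing "[ ]" for word boundaries (\b) so that a name at the end of the string or before punctuation is still matched.

```javascript
// Hypothetical helper: collects capitalized words and integers from a string.
function extractTokens(text) {
  // String.prototype.match with the g flag returns all matches, or null if none.
  const names = text.match(/\b[A-Z][a-z]+\b/g) || [];
  const numbers = (text.match(/\d+/g) || []).map(Number);
  return { names, numbers };
}

console.log(extractTokens("Xyz visited this page 53 mins ago."));
// { names: [ 'Xyz' ], numbers: [ 53 ] }
```

This still treats any capitalized sentence-initial word as a name, so it probably needs to be combined with a noise-word check.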
The PHP functions is_numeric and ucfirst may help recognize the words:
function parse_name_and_number($sentence) {
    $words = explode(' ', $sentence);
    $name = array();
    $number = null; // stays null when the sentence contains no number
    foreach ($words as $word) {
        if (is_numeric($word)) {
            $number = $word;
        } elseif ($word == ucfirst($word)) {
            // the word is unchanged by ucfirst, so it already starts with a capital
            $name[] = $word;
        }
    }
    $name = implode(' ', $name);
    return array('name' => $name, 'number' => $number);
}
print_r(parse_name_and_number('Xyz visited this page 53 minutes ago'));
// output: Array ( [name] => Xyz [number] => 53 )
print_r(parse_name_and_number('we thought Bim de Verdier visited the page 5 seconds ago'));
// output: Array ( [name] => Bim Verdier [number] => 5 )
print_r(parse_name_and_number('Weirder input messes up the results'));
// output: Array ( [name] => Weirder [number] => )