Hello!
I would like to extract all quotations from a text, together with the name of the person being quoted. DayLife does this very well.
Example:
“They think it’s ‘game over,’ ” one senior administration official said.
The phrase "They think it's 'game over'" and the cited person "one senior administration official" should be extracted.
Do you think that's possible? You can only distinguish between quotations and other words in quotation marks if you check whether a cited person is mentioned.
Example:
“I think it is serious and it is deteriorating,” Admiral Mullen said Sunday on CNN’s “State of the Union” program.
The passage "State of the Union" is not a quotation. But how do you detect this?
a) You check whether a cited person is mentioned.
b) You count the blank spaces in the supposed quotation. If there are no more than 3 blank spaces, it is probably not a quotation, right?
I would prefer b), since the cited person is not always named.
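For comparison, approach a) could look roughly like the sketch below. It assumes the text already uses straight double quotes and that the speaker follows the quotation directly, in front of a verb such as "said". The function name find_attributed_quotations and the verb list are my own assumptions, not something DayLife is known to use.

```php
<?php
// Hypothetical helper: returns quotations that are directly followed by
// "<speaker> said" (or a similar verb). Expects straight double quotes.
function find_attributed_quotations($text) {
    // Assumed list of attribution verbs; extend as needed.
    $verbs = 'said|says|added|told';
    // 1: quoted phrase, 2: speaker (no quote mark or full stop), 3: verb
    $pattern = '/"([^"]+)"\s*,?\s*([^".]+?)\s+(' . $verbs . ')\b/u';
    $results = array();
    if (preg_match_all($pattern, $text, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $results[] = array(
                'quote'   => trim($m[1]),
                'speaker' => trim($m[2]),
            );
        }
    }
    return $results;
}

$example = '"I think it is serious," Admiral Mullen said Sunday on CNN\'s "State of the Union" program.';
print_r(find_attributed_quotations($example));
```

In this sketch "State of the Union" is skipped because no speaker and verb follow it, but any quotation without an attribution right after it is missed as well.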
How should I start?
I would first replace all types of quotation marks with a single type, so that only one kind of quote mark has to be checked later.
<?php
$text = '';
$quote_marks = array('“', '”', '„', '»', '«');
$text = str_replace($quote_marks, '"', $text);
?>
Then I would extract all phrases between quotation marks which contain more than 3 blank spaces:
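<?php
function extract_quotations($text) {
    $quotations = array();
    // Capture everything between two straight double quotes
    if (preg_match_all('/"([^"]+)"/', $text, $matches)) {
        foreach ($matches[1] as $candidate) {
            $candidate = trim($candidate);
            // Keep only phrases with more than 3 blank spaces
            if (substr_count($candidate, ' ') > 3) {
                $quotations[] = $candidate;
            }
        }
    }
    return $quotations;
}
?>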
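To show how the two steps might fit together, here is a rough, self-contained sketch run on the second example sentence. The helper names normalize_quotes and extract_long_quotations and the $max_spaces parameter are hypothetical, and the threshold of 3 is only my guess at a useful value.

```php
<?php
// Hypothetical helper: map typographic double quotes to a straight quote.
function normalize_quotes($text) {
    $quote_marks = array('“', '”', '„', '»', '«');
    return str_replace($quote_marks, '"', $text);
}

// Hypothetical helper: keep quoted phrases that contain more than
// $max_spaces blank spaces (assumed threshold, not a proven rule).
function extract_long_quotations($text, $max_spaces = 3) {
    $quotations = array();
    if (preg_match_all('/"([^"]+)"/', normalize_quotes($text), $matches)) {
        foreach ($matches[1] as $candidate) {
            $candidate = trim($candidate);
            if (substr_count($candidate, ' ') > $max_spaces) {
                $quotations[] = $candidate;
            }
        }
    }
    return $quotations;
}

$text = '“I think it is serious and it is deteriorating,” Admiral Mullen said Sunday on CNN’s “State of the Union” program.';
print_r(extract_long_quotations($text));
```

Here "State of the Union" has exactly 3 blank spaces and is dropped, while the longer phrase is kept. A short real quotation such as "No comment" would be dropped too, which is the obvious weakness of the blank-space rule.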
How could you improve this?
I hope you can help me. Thank you very much in advance!