Hello!

I would like to extract all citations from a text. Additionally, the name of the cited person should be extracted. DayLife does this very well.

Example:

“They think it’s ‘game over,’ ” one senior administration official said.

The phrase "They think it's 'game over'" and the cited person "one senior administration official" should be extracted.

Do you think that's possible? You can only distinguish between citations and words in quotes if you check whether there's a cited person mentioned.

Example:

“I think it is serious and it is deteriorating,” Admiral Mullen said Sunday on CNN’s “State of the Union” program.

The passage "State of the Union" is not a quotation. But how do you detect this? a) You check if there's a cited person mentioned. b) You count the blank spaces in the supposed quotation. If there are fewer than 3 blank spaces it won't be a quotation, right? I would prefer b) since there's not always a cited person named.

How to start?

I would first replace all types of quotes by a single type so that you'll have to check for only one quote mark later.

<?php
$text = '';
$quote_marks = array('“', '”', '„', '»', '«');
$text = str_replace($quote_marks, '"', $text);
?>

Then I would extract all phrases between quotation marks which contain more than 3 blank spaces:

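<?php
function extract_quotations($text) {
   // $found_quotations[1] holds the text captured between each pair of quote marks
   if (!preg_match_all('/"([^"]+)"/', $text, $found_quotations)) {
      return array();
   }
   // keep only the phrases that contain more than 3 blank spaces
   return array_values(array_filter($found_quotations[1], function ($phrase) {
      return substr_count($phrase, ' ') > 3;
   }));
}
?>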

How could you improve this?

I hope you can help me. Thank you very much in advance!

+3  A: 

If there are fewer than 3 blank spaces it won't be a quotation, right?

"Not necessarily," said ceejayoz.

The passage "State of the Union" is not a quotation. But how do you detect this? a) You check if there's a cited person mentioned. b) You count the blank spaces in the supposed quotation. If there are fewer than 3 blank spaces it won't be a quotation, right? I would prefer b) since there's not always a cited person named.

b) doesn't even work for this very example - there are 3 blank spaces in "State of the Union".

ceejayoz
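To make the objection concrete, here is a minimal sketch that counts the blank spaces in every quoted phrase of a sample text. The helper name `count_blanks_in_quoted_phrases` and the sample text are made up for illustration, and the sketch assumes all quote marks have already been normalized to `"`.

```php
<?php
// Hypothetical helper: maps each quoted phrase to its number of blank spaces.
// Assumes quote marks were already normalized to the plain " character.
function count_blanks_in_quoted_phrases($text) {
    $counts = array();
    if (preg_match_all('/"([^"]+)"/', $text, $matches)) {
        foreach ($matches[1] as $phrase) {
            $counts[$phrase] = substr_count($phrase, ' ');
        }
    }
    return $counts;
}

// Sample text assembled from the examples in this thread.
$text = '"I think it is serious and it is deteriorating," Admiral Mullen said '
      . 'Sunday on CNN\'s "State of the Union" program. '
      . '"Not necessarily," said ceejayoz.';

$counts = count_blanks_in_quoted_phrases($text);
foreach ($counts as $phrase => $blanks) {
    echo $blanks, "\t", $phrase, "\n";
}
```

With this input the counts are 8, 3 and 1. A cutoff of "fewer than 3" therefore keeps "State of the Union" and discards the genuine quotation "Not necessarily,".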
"Not necessarily" Oh, yes of course, you're right. :) But USUALLY it won't be one. And if it is one, it USUALLY won't be important, will it? b) could be increased to 4!?
@marco92w and "#LK$#@^" USUALLY won't be found in memory, so why don't we use it to delimit blocks in a cache?
Lucas Oman
I know that there will be some exceptions. But I needn't find ALL quotations. I would be glad if I could find 90% of them.
A: 

A quotation will always have punctuation--either a comma at the end, to signify that the speaker's name or title is to follow, or the end of the sentence (.!?).

Lucas Oman
So will many non-quotations. `The President's annual address to Congress is called the "State of the Union".`
ceejayoz
@ceejayoz: your quoted string didn't end in punctuation. The sentence containing it did. Quotations will have punctuation INSIDE the quotes.
Lucas Oman
Yes, I think that could help finding the quotations.
@Lucas Oman - In the United States, yes. In the Queen's English, punctuation only goes within the quotes if it makes logical sense to do so - if the punctuation doesn't apply to the quotation, it goes outside.
ceejayoz
It's a pity. It wouldn't work for other languages, either. But the punctuation is just a part of it. You could easily implement the punctuation analysis for each language.
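The punctuation idea from this answer could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `quoted_phrases_ending_in_punctuation` is invented, quote marks are assumed to be normalized to `"` already, and the rule assumes the American convention of placing punctuation inside the quotes. Each quote pair is matched first and only then tested, so that a closing mark is never mistaken for an opening one.

```php
<?php
// Hypothetical helper: returns the quoted phrases whose last character,
// inside the quote marks, is one of , . ! ?
function quoted_phrases_ending_in_punctuation($text) {
    // Step 1: pair up the quote marks.
    if (!preg_match_all('/"([^"]+)"/', $text, $matches)) {
        return array();
    }
    // Step 2: keep only the phrases that end in punctuation.
    $kept = array();
    foreach ($matches[1] as $phrase) {
        $phrase = rtrim($phrase);
        if (preg_match('/[,.!?]$/', $phrase)) {
            $kept[] = $phrase;
        }
    }
    return $kept;
}

$text = 'The President\'s annual address to Congress is called the '
      . '"State of the Union". "Not necessarily," said ceejayoz. '
      . '"I think it is serious and it is deteriorating," Admiral Mullen said.';

$kept = quoted_phrases_ending_in_punctuation($text);
print_r($kept);
```

As the comments point out, the rule is language dependent: a title written American-style as "State of the Union." would pass the test, and a quotation punctuated outside the quote marks would be missed.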
+2  A: 

As ceejayoz already pointed out, this won't fit into a single function. What you're describing in your question (detecting grammatical function of a quote-escaped part of a sentence - i.e. “I think it is serious and it is deteriorating,” vs "State of the Union") would be best solved with a library that can break down natural language into tokens. I am not aware of any such library in PHP, but you can have a look at the project size of something you would use in Python: http://www.nltk.org/

I think the best you can do is define a set of syntax rules that you verify manually. What about something like this:

abstract class QuotationExtractor {

    protected static $instances = array();

    public static function getAllPossibleQuotations($string) {
        $possibleQuotations = array();
        foreach (self::$instances as $instance) {
            $possibleQuotations = array_merge(
                $possibleQuotations,
                $instance->extractQuotations($string)
            );
        }
        return $possibleQuotations;
    }

    public function __construct() {
        self::$instances[] = $this;
    }

    public abstract function extractQuotations($string);

}

class RegexExtractor extends QuotationExtractor {

    protected $rules = array();

    public function extractQuotations($string) {
        $quotes = array();
        foreach ($this->rules as $rule) {
            preg_match_all($rule[0], $string, $matches, PREG_SET_ORDER);
            foreach ($matches as $match) {
                $quotes[] = array(
                    'quote' => trim($match[$rule[1]]),
                    'cited' => trim($match[$rule[2]])
                );
            }
        }
        return $quotes;
    }

    public function addRule($regex, $quoteIndex, $authorIndex) {
        $this->rules[] = array($regex, $quoteIndex, $authorIndex);
    }

}

$regexExtractor = new RegexExtractor();
$regexExtractor->addRule('/"(.*?)[,.]?\h*"\h*said\h*(.*?)\./', 1, 2);
$regexExtractor->addRule('/"(.*?)\h*"(.*)said/', 1, 2);
$regexExtractor->addRule('/\.\h*(.*)(once)?\h*said[\-]*"(.*?)"/', 3, 1);

class AnotherExtractor extends Quot...

If you have a structure like the above you can run the same text through any/all of them and list the possible quotations to select the correct ones. I've run the code with this thread as input for testing and the result was:

array(4) {
  [0]=>
  array(2) {
    ["quote"]=>
    string(15) "Not necessarily"
    ["cited"]=>
    string(8) "ceejayoz"
  }
  [1]=>
  array(2) {
    ["quote"]=>
    string(28) "They think it's `game over,'"
    ["cited"]=>
    string(34) "one senior administration official"
  }
  [2]=>
  array(2) {
    ["quote"]=>
    string(46) "I think it is serious and it is deteriorating,"
    ["cited"]=>
    string(14) "Admiral Mullen"
  }
  [3]=>
  array(2) {
    ["quote"]=>
    string(16) "Not necessarily,"
    ["cited"]=>
    string(0) ""
  }
}
soulmerge
Thank you! Does your code use the NLTK?
No, it's written in PHP. I added the reference to nltk to demonstrate the complexity of doing it right.
soulmerge
Perfect! :) So I could possibly use it. How do I give an input to the function? And how do I call the function? And: Can I simply add my regular expressions for finding quotations in the addRule section?
You can copy-paste the code and add your own regular expressions with addRule(). But if you don't intend to add more complex extracting algorithms than regexes, you can just use the 3 regular expressions in the code with `preg_match_all()`. The rest is a nice OO-API that allows you to create other extractors - like one that does some parsing, for example.
soulmerge
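For the simpler route mentioned in the comment above, a minimal sketch could apply the three regular expressions from the answer directly with `preg_match_all()`, without the class hierarchy. The `$rules` array layout and the one-line sample input are assumptions made for this illustration; the patterns and group indexes are copied from the `addRule()` calls.

```php
<?php
// Each rule: the pattern, the capture group holding the quote, and the
// capture group holding the cited person (taken from the addRule() calls).
$rules = array(
    array('/"(.*?)[,.]?\h*"\h*said\h*(.*?)\./', 1, 2),
    array('/"(.*?)\h*"(.*)said/', 1, 2),
    array('/\.\h*(.*)(once)?\h*said[\-]*"(.*?)"/', 3, 1),
);

// Sample input; quote marks are assumed to be normalized to " already.
$text = '"Not necessarily," said ceejayoz.';

$quotes = array();
foreach ($rules as $rule) {
    list($regex, $quoteIndex, $citedIndex) = $rule;
    // PREG_SET_ORDER yields one array per match, indexed by capture group.
    preg_match_all($regex, $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $quotes[] = array(
            'quote' => trim($match[$quoteIndex]),
            'cited' => trim($match[$citedIndex]),
        );
    }
}
print_r($quotes);
```

On this input the first rule yields the quote with its cited person, and the second rule yields the same quote with an empty `cited` value, which mirrors the duplicate entry in the output shown in the answer. Candidates like that still need to be filtered or picked by hand.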
Thank you very much! Now I've understood it. I'll ask for the "perfect regexes" in another question here. :)