Hello,

I want to implement the SRX segmentation rules in JavaScript to extract sentences from text.

To do this correctly I will have to follow the SRX rules.

e.g. http://www.lisa.org/fileadmin/standards/srx20.html#refTR29

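There are two types of rules:

  1. break rules: if the pattern is found, the sentence should break, e.g. at ". "
  2. no-break rules: if the pattern is found, the sentence should not break, e.g. after an abbreviation such as U.K. or Mr.

Each rule in turn has two parts:

  1. beforebreak: what must match before the break position
  2. afterbreak: what must match after the break position

For example, if the rule is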

<rule break="no">

    <beforebreak>\s*[0-9]+\.</beforebreak>
    <afterbreak>\s</afterbreak>

</rule>

This says that if the pattern "\s*[0-9]+\.\s" is found, the segment should not break between the "." and the whitespace that follows it.

How do I implement this in JavaScript? Maybe the split function is not enough?

A:

You may want to try something like this:

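function segment(text, rules) {
    if (!text) return text;
    if (!rules) return [text];

    // Matches one <rule> element and captures the break="no" attribute (if present)
    // plus the contents of <beforebreak> and <afterbreak>.
    var rulePattern = /<rule(?:(\s+break="no")|\s+[^>]+|\s*)>(?:<beforebreak>([^<]+)<\/beforebreak>)?(?:<afterbreak>([^<]+)<\/afterbreak>)?<\/rule>/g;
    // replace() is used here only to visit the rules in document order.
    cleanXml(rules).replace(rulePattern,
        function(whole, nobreak, before, after) {
            // Match beforebreak only where no marker has been placed yet
            // and where afterbreak (if given) follows.
            var r = new RegExp((before||'')+'(?![\uE000\uE001])'+(after?'(?='+after+')':''), 'mg');
            // Mark the position: \uE000 = no break, \uE001 = break.
            text = text.replace(r, nobreak ? '$&\uE000' : '$&\uE001');
            return '';
        }
    );

    // Drop the no-break markers, then split at the break markers.
    var sentences = text.replace(/\uE000/g, '').split(/\uE001/g);

    return sentences;
}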

function cleanXml(s) {
    return s && s.replace(/<!--[\s\S]*?-->/g,'').replace(/>\s+</g,'><');
}

To run this simply call segment() with the text to split, and the rules XML as a string. For example:

// Backslashes are doubled because the patterns are written inside JavaScript
// string literals; a single '\s' in a string literal would become a plain 's'.
segment('The U.K. Prime Minister, Mr. Blair, was seen out with his family today.',
        '<rule break="no">' +
            '<beforebreak>\\sMr\\.</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>' +
        '<rule break="no">' +
            '<beforebreak>\\sU\\.K\\.</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>' +
        '<rule break="yes">' +
            '<beforebreak>[\\.\\?!]+</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>'
);

The call to segment() will return an array of sentences, so you can simply do something like alert(segment(...).join('\n')) to see the result.
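One thing to watch when writing the rules as a string: inside a JavaScript string literal an unrecognized escape such as \s simply loses its backslash, so the patterns either need doubled backslashes or have to be built some other way. A minimal sketch of one alternative, assuming you are happy to describe the rules as regex literals: rulesToXml and the breakHere/before/after property names are made up for this illustration and are not part of SRX or of the code above.

```javascript
// Hypothetical helper (not part of the answer's code): serializes rule objects
// into the flat <rule> XML string that segment() takes as its second argument.
// Using RegExp.prototype.source keeps the backslashes of the regex literals intact.
function rulesToXml(ruleList) {
    return ruleList.map(function (rule) {
        return '<rule break="' + (rule.breakHere ? 'yes' : 'no') + '">' +
            (rule.before ? '<beforebreak>' + rule.before.source + '</beforebreak>' : '') +
            (rule.after ? '<afterbreak>' + rule.after.source + '</afterbreak>' : '') +
            '</rule>';
    }).join('');
}

// Rule order matters: the no-break exception comes before the general break rule.
var xml = rulesToXml([
    { breakHere: false, before: /\sMr\./, after: /\s/ },
    { breakHere: true, before: /[.?!]+/, after: /\s/ }
]);
```

The resulting string can be passed as the rules argument of segment(). Patterns containing a literal "<" would still need XML escaping, which this sketch does not do.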

Known Limitations:

  1. It expects the rules it is given to be the result of the cascading process for the specific language, i.e. a flat list of <rule> elements.
  2. It expects the regular expressions used by the rules to conform to JavaScript regexp syntax.
  3. It does not handle internal markup.

All of these limitations seem quite easy to overcome.
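As an illustration of the first limitation, the cascading step could be sketched roughly like this. It assumes the SRX file has already been parsed into a plain object; the shape of that object (cascade, languageMaps, languageRules), the rule property names and the function name selectRules are all assumptions of this sketch. It also assumes that a languagepattern has to match the whole language code.

```javascript
// Hypothetical sketch: pick the rules that apply to one language code.
// `srx` is an assumed, already-parsed representation of an SRX file.
function selectRules(srx, languageCode) {
    var selected = [];
    for (var i = 0; i < srx.languageMaps.length; i++) {
        var map = srx.languageMaps[i];
        // Assumption: the pattern must match the entire language code.
        var pattern = new RegExp('^(?:' + map.languagePattern + ')$');
        if (pattern.test(languageCode)) {
            selected = selected.concat(srx.languageRules[map.languageRuleName] || []);
            // Without cascading, only the first matching language map is used.
            if (!srx.cascade) break;
        }
    }
    return selected;
}

// Made-up sample data for the sketch.
var srx = {
    cascade: true,
    languageMaps: [
        { languagePattern: 'en.*', languageRuleName: 'English' },
        { languagePattern: '.*', languageRuleName: 'Default' }
    ],
    languageRules: {
        English: [{ breakHere: false, before: '\\sMr\\.', after: '\\s' }],
        Default: [{ breakHere: true, before: '[.?!]+', after: '\\s' }]
    }
};

var englishRules = selectRules(srx, 'en-GB'); // English rules first, then Default
var frenchRules = selectRules(srx, 'fr');     // Default rules only
```

The selected list would then have to be serialized into the flat rule XML before being handed to segment().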

How does this work?

The segment function uses rulePattern to extract each rule, determine whether it is a breaking or non-breaking rule, and build a regexp from the rule's beforebreak and afterbreak clauses. It then scans the text and marks each matching position by inserting a Unicode character (taken from the Unicode private use area) that indicates whether it is a break (\uE001) or a non-break (\uE000). If a marker is already present at a position, later rules do not match there, so rules that appear earlier take priority.

Then it simply removes the non-break marks, and splits the text according to the break marks.
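To make the marking step concrete, here is a minimal sketch that applies two rules by hand and makes the markers visible. markRule, NO_BREAK, BREAK and the [N]/[B] display tags are names made up for this illustration; the regexp construction mirrors the one in segment().

```javascript
// Private-use characters used as markers, as in segment().
var NO_BREAK = '\uE000';
var BREAK = '\uE001';

// Apply a single rule: append `marker` after every match of `before`
// that is followed by `after` and not already followed by a marker.
function markRule(text, before, after, marker) {
    var r = new RegExp(before + '(?![\uE000\uE001])' + (after ? '(?=' + after + ')' : ''), 'mg');
    return text.replace(r, '$&' + marker);
}

var text = 'Mr. Smith left. He was late.';
// The non-breaking rule runs first and claims the position after "Mr."
text = markRule(text, 'Mr\\.', '\\s', NO_BREAK);
// The breaking rule then skips that position because a marker is already there.
text = markRule(text, '[.?!]+', '\\s', BREAK);

// Show the markers as [N] and [B] (display only).
var visible = text.replace(/\uE000/g, '[N]').replace(/\uE001/g, '[B]');
// visible is 'Mr.[N] Smith left.[B] He was late.'

// Final step: drop the no-break markers and split at the break markers.
var sentences = text.replace(/\uE000/g, '').split(/\uE001/);
// sentences is ['Mr. Smith left.', ' He was late.']
```

Note that the whitespace matched by afterbreak stays at the start of the next sentence, and that the final "." produces no break because no whitespace follows it.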

@Sourabh: I hope this is still relevant for you.

Roy Sharon
I made some edits to my original answer, making the segment() code simpler and shorter. So if anyone had seen/used the code in its previous version, you may want to revisit it.
Roy Sharon
Hi Roy, yes, it's still relevant. Thanks a lot.
Sourabh
Cool. I enjoyed the exercise. Thanks!
Roy Sharon