ansaurus

Question

Implementing SRX Segmentation Rules in JavaScript

Answer 1


You may want to try something like this:

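function segment(text, rules) {
    if (!text) return [];        // nothing to split; always return an array
    if (!rules) return [text];   // no rules: the whole text is one segment

    // Captures: (1) the break="no" attribute, (2) the beforebreak pattern, (3) the afterbreak pattern.
    var rulePattern = /<rule(?:(\s+break="no")|\s+[^>]+|\s*)>(?:<beforebreak>([^<]+)<\/beforebreak>)?(?:<afterbreak>([^<]+)<\/afterbreak>)?<\/rule>/g;
    // replace() is used here only to visit each rule in document order.
    cleanXml(rules).replace(rulePattern, 
        function(whole, nobreak, before, after) {
            // Match the beforebreak text (grouped so that alternations stay contained),
            // skip positions that already carry a marker, and require the afterbreak text to follow.
            var r = new RegExp('(?:'+(before||'')+')(?![\uE000\uE001])'+(after?'(?='+after+')':''), 'mg');
            text = text.replace(r, nobreak ? '$&\uE000' : '$&\uE001');
            return '';
        }
    );

    // Drop the no-break markers, then split on the break markers.
    var sentences = text.replace(/\uE000/g, '').split(/\uE001/g);

    return sentences;
}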

function cleanXml(s) {
    return s && s.replace(/<!--[\s\S]*?-->/g,'').replace(/>\s+</g,'><');
}

To run this, call segment() with the text to split and the rules XML as a string. For example:

// Backslashes are doubled because the rules are written as JavaScript string literals;
// a single '\s' inside quotes would reach the regexp as a plain 's'.
segment('The U.K. Prime Minister, Mr. Blair, was seen out with his family today.',
        '<rule break="no">' +
            '<beforebreak>\\sMr\\.</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>' +
        '<rule break="no">' +
            '<beforebreak>\\sU\\.K\\.</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>' +
        '<rule break="yes">' +
            '<beforebreak>[\\.\\?!]+</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>'
);

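segment() returns an array of sentences, so you can do something like alert(segment(...).join('\n')) to see the result. For the example above the array has a single element: neither abbreviation triggers a break, and the final period is not followed by whitespace.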
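To see breaks actually happen, here is a small sketch with a multi-sentence input. The sample text and the variable names rules and sentences are made up for illustration, and the block carries a compact copy of segment() and cleanXml() only so that it can be pasted and run on its own.

```javascript
// Compact copy of cleanXml() and segment() so the snippet is self-contained.
function cleanXml(s) {
    return s && s.replace(/<!--[\s\S]*?-->/g, '').replace(/>\s+</g, '><');
}

function segment(text, rules) {
    if (!text) return [];
    if (!rules) return [text];
    var rulePattern = /<rule(?:(\s+break="no")|\s+[^>]+|\s*)>(?:<beforebreak>([^<]+)<\/beforebreak>)?(?:<afterbreak>([^<]+)<\/afterbreak>)?<\/rule>/g;
    cleanXml(rules).replace(rulePattern, function (whole, nobreak, before, after) {
        var r = new RegExp((before || '') + '(?![\uE000\uE001])' + (after ? '(?=' + after + ')' : ''), 'mg');
        text = text.replace(r, nobreak ? '$&\uE000' : '$&\uE001');
        return '';
    });
    return text.replace(/\uE000/g, '').split(/\uE001/g);
}

// Illustrative rules; backslashes are doubled because these are string literals.
var rules =
    '<rule break="no"><beforebreak>\\sMr\\.</beforebreak><afterbreak>\\s</afterbreak></rule>' +
    '<rule break="no"><beforebreak>\\sU\\.K\\.</beforebreak><afterbreak>\\s</afterbreak></rule>' +
    '<rule break="yes"><beforebreak>[.?!]+</beforebreak><afterbreak>\\s</afterbreak></rule>';

// Hypothetical sample text with three real sentence boundaries.
var sentences = segment(
    'Today Mr. Blair visited the U.K. embassy. He left at noon! Was it useful? Nobody knows.',
    rules
);

console.log(sentences.length);     // 4
console.log(sentences.join('\n'));
```

Note that each break lands between the beforebreak and afterbreak matches, so the whitespace that follows a sentence stays at the start of the next segment (' He left at noon!'). Trim the segments if that is not wanted.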

Known Limitations:

It expects rules that have already gone through the cascading process for the specific language, that is, the final ordered list of rules for that language.
It expects the regular expressions in the rules to use JavaScript regexp syntax, rather than the ICU syntax that SRX 2.0 specifies.
It does not handle internal markup.

All of these limitations seem quite easy to overcome.
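As an illustration of the first limitation, the cascading step could be sketched roughly as below. This is not part of the code above: rulesForLanguage is a hypothetical helper, and it assumes an SRX 2.0 layout with <languagerule languagerulename="..."> groups and <languagemap languagepattern="..." languagerulename="..."/> entries, attributes in exactly that order, cascade="yes" behaviour (every matching map contributes its rules, in map order), and language patterns that happen to be valid JavaScript regexps.

```javascript
// Hypothetical helper: collect the rule XML that applies to one language code.
function rulesForLanguage(srx, lang) {
    // Index each <languagerule> body by its name.
    var groups = {};
    var groupPattern = /<languagerule\s+languagerulename="([^"]*)"\s*>([\s\S]*?)<\/languagerule>/g;
    srx.replace(groupPattern, function (whole, name, body) {
        groups[name] = body;
        return '';
    });

    // Walk the <languagemap> entries in order and append every matching group.
    var rules = '';
    var mapPattern = /<languagemap\s+languagepattern="([^"]*)"\s+languagerulename="([^"]*)"\s*\/>/g;
    srx.replace(mapPattern, function (whole, pattern, name) {
        var matches = new RegExp('^(?:' + pattern + ')$').test(lang);
        if (matches && Object.prototype.hasOwnProperty.call(groups, name)) {
            rules += groups[name];
        }
        return '';
    });
    return rules;
}

// Made-up SRX fragment for illustration only.
var srx =
    '<languagerules>' +
        '<languagerule languagerulename="English">' +
            '<rule break="no"><beforebreak>\\sMr\\.</beforebreak><afterbreak>\\s</afterbreak></rule>' +
        '</languagerule>' +
        '<languagerule languagerulename="Default">' +
            '<rule break="yes"><beforebreak>[.?!]+</beforebreak><afterbreak>\\s</afterbreak></rule>' +
        '</languagerule>' +
    '</languagerules>' +
    '<maprules>' +
        '<languagemap languagepattern="[Ee][Nn].*" languagerulename="English"/>' +
        '<languagemap languagepattern=".*" languagerulename="Default"/>' +
    '</maprules>';

var english = rulesForLanguage(srx, 'en-GB'); // English rules followed by Default rules
var french = rulesForLanguage(srx, 'fr');     // Default rules only
```

The returned string is a flat run of <rule> elements, so it could be passed to segment() as its rules argument.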

How does this work?

The segment function uses rulePattern to extract each rule, determine whether it is a breaking or a non-breaking rule, and build a regexp from the rule's beforebreak and afterbreak clauses. It then scans the text and marks each match by inserting a character from the Unicode private use area: \uE001 for a break, \uE000 for a non-break. If a marker already sits at that position, the rule does not match there, so earlier rules take priority over later ones.

Finally, it removes the non-break marks and splits the text at the break marks.
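The marking steps described above can be followed on a tiny input. In this sketch the helper mark() and the sample sentence are made up for illustration; it applies one non-break rule and then one break rule, and prints the intermediate text with the private-use markers swapped for visible tags.

```javascript
var NO_BREAK = '\uE000';
var BREAK = '\uE001';

// Hypothetical helper: apply a single rule by appending a marker after each match
// of `before` that is followed by `after` and does not already carry a marker.
function mark(text, before, after, marker) {
    var r = new RegExp('(?:' + before + ')(?![\uE000\uE001])(?=' + after + ')', 'g');
    return text.replace(r, '$&' + marker);
}

var text = 'He met Mr. Blair. They talked.';

// Step 1: the non-break rule claims the position after "Mr.".
var step1 = mark(text, '\\sMr\\.', '\\s', NO_BREAK);

// Step 2: the break rule skips that position because a marker is already there.
var step2 = mark(step1, '[.?!]+', '\\s', BREAK);

// Make the markers visible for inspection.
var visible = step2.replace(/\uE000/g, '<no-break>').replace(/\uE001/g, '<break>');
console.log(visible); // He met Mr.<no-break> Blair.<break> They talked.

// Step 3: drop the non-break markers and split at the break markers.
var sentences = step2.replace(/\uE000/g, '').split(BREAK);
console.log(sentences); // [ 'He met Mr. Blair.', ' They talked.' ]
```

Swapping the two mark() calls would let the break rule claim the position after "Mr." first, which is why rule order matters.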

@Sourabh: I hope this is still relevant for you.

Roy Sharon 2010-08-15 20:37:36

I made some edits to my original answer, making the segment() code simpler and shorter. If you saw or used the previous version, you may want to revisit it.

Roy Sharon 2010-08-16 08:55:34

Hi Roy, yes, it's still relevant. Thanks a lot.

Sourabh 2010-09-13 11:02:38

Cool. I enjoyed the exercise. Thanks!

Roy Sharon 2010-09-13 18:29:55
