You may want to try something like this:
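
    function segment(text, rules) {
        if (!text) return [];
        if (!rules) return [text];
        // Captures: (1) the break="no" attribute, (2) the beforebreak regexp, (3) the afterbreak regexp.
        var rulePattern = /<rule(?:(\s+break="no")|\s+[^>]+|\s*)>(?:<beforebreak>([^<]+)<\/beforebreak>)?(?:<afterbreak>([^<]+)<\/afterbreak>)?<\/rule>/g;
        // replace() is used here only to iterate over the rules in document order.
        cleanXml(rules).replace(rulePattern,
            function(whole, nobreak, before, after) {
                // Match "before", skip positions that already carry a marker, and require "after" to follow.
                var r = new RegExp((before||'')+'(?![\uE000\uE001])'+(after?'(?='+after+')':''), 'mg');
                text = text.replace(r, nobreak ? '$&\uE000' : '$&\uE001');
                return '';
            }
        );
        // Drop the no-break markers, then split on the break markers.
        var sentences = text.replace(/\uE000/g, '').split(/\uE001/g);
        return sentences;
    }

    // Strips XML comments and the whitespace between tags.
    function cleanXml(s) {
        return s && s.replace(/<!--[\s\S]*?-->/g,'').replace(/>\s+</g,'><');
    }
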
To run this, simply call segment() with the text to split and the rules XML as a string. For example:

    // Backslashes are doubled because the rules are written as JavaScript string literals;
    // a single '\s' in a string literal would collapse to a plain 's'.
    segment('The U.K. Prime Minister, Mr. Blair, was seen out with his family today.',
        '<rule break="no">' +
            '<beforebreak>\\sMr\\.</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>' +
        '<rule break="no">' +
            '<beforebreak>\\sU\\.K\\.</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>' +
        '<rule break="yes">' +
            '<beforebreak>[\\.\\?!]+</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>'
    );

The call to segment() will return an array of sentences, so you can do something like alert(segment(...).join('\n')) to see the result.
Known Limitations:
- It expects rules that have already gone through the cascading process for the specific language.
- It expects the regular expressions used by the rules to conform to JavaScript regexp syntax.
- It does not handle internal markup.
All of these limitations seem quite easy to overcome.
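As one example, the regexp-syntax limitation can at least be detected up front. The sketch below is not part of the code above: compilesInJs is a hypothetical helper that simply checks whether a rule's pattern is accepted by the JavaScript RegExp constructor, so that rules using constructs JavaScript rejects (possessive quantifiers, for instance) can be skipped or reported instead of throwing in the middle of segmentation.

```javascript
// Hypothetical helper (not part of the original answer):
// returns true when `source` compiles as a JavaScript regular expression.
function compilesInJs(source) {
  try {
    new RegExp(source);
    return true;
  } catch (e) {
    // The RegExp constructor throws a SyntaxError for patterns it cannot parse.
    return false;
  }
}

// Example: keep only the patterns that JavaScript accepts.
var candidatePatterns = ['\\sMr\\.', '[.?!]+', '('];
var usablePatterns = candidatePatterns.filter(compilesInJs);
```

A check like this only tells you that a pattern compiles, not that it means the same thing in JavaScript as in the engine the rules were written for.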
How does this work?
The segment function uses the rulePattern to extract each rule, identify whether it is a breaking or non-breaking rule, and create a regexp based on the beforebreak and afterbreak clauses of the rule. It then scans the text and marks each matching position by inserting a Unicode character (taken from a Unicode private use area) that indicates whether it is a break (\uE001) or a non-break (\uE000). If a marker is already present at that position, the rule does not match there, so earlier rules take priority over later ones.
Then it simply removes the non-break marks and splits the text at the break marks.
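To make the marking step concrete, here is a minimal sketch that applies two hard-coded rules without any XML parsing. The names applyRule, NO_BREAK and BREAK are made up for illustration; the regexp construction mirrors the one used in segment().

```javascript
// Illustrative names only; the marker characters are the same private-use
// code points that segment() uses.
var NO_BREAK = '\uE000';
var BREAK = '\uE001';

// Apply a single rule: append a marker after every match of `before` that is
// followed by `after` and does not already have a marker right behind it.
function applyRule(text, before, after, isBreak) {
  var r = new RegExp(before + '(?![' + NO_BREAK + BREAK + '])(?=' + after + ')', 'mg');
  return text.replace(r, '$&' + (isBreak ? BREAK : NO_BREAK));
}

var marked = 'Mr. Smith left. He was late.';
marked = applyRule(marked, 'Mr\\.', '\\s', false);  // exception rule goes first
marked = applyRule(marked, '[.?!]+', '\\s', true);  // generic break rule goes second

// Remove the no-break markers, then split on the break markers.
var sentences = marked.replace(/\uE000/g, '').split(/\uE001/g);
// sentences is ['Mr. Smith left.', ' He was late.']
```

The period after "Mr" already carries a no-break marker when the generic rule runs, so the generic rule cannot match there. Note that the whitespace matched by the lookahead stays at the start of the next sentence, so you may want to trim the results.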
@Sourabh: I hope this is still relevant for you.