I'd like to highlight long sentences (say, 50 words or more) contained in a set of paragraph elements on a page, i.e. $("#content p"). I'm not sure how to tackle this.

I originally tried to highlight all sentences, but ran into trouble when they contained HTML tags (the example highlighting code on the net seems to be for individual words only, so it doesn't take child nodes into account). I'm aware that splitting sentences is difficult; I'd like to use .!? followed either by a space and then a capital letter, or by nothing at all (i.e. the end of the paragraph).

Thanks in advance for any help/advice.

+2  A: 

As you've said, it's going to be tricky to get right. Given that you're not going to catch them all, I'd stick with something simple like:

var regex = /[^.!?]{50,}[.!?]/g; // 50 or more non-terminator characters (characters, not words), then . ! or ?

Get too clever and you will end up spending more time coding for edge cases than I guess you would reasonably want to.
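A minimal sketch of how that pattern might be applied to plain text, for example the result of $(p).text(). The helper name findLongSentences and the minChars parameter are made up for illustration, and the threshold is in characters rather than words:

```javascript
// Hypothetical helper: returns the sentences in a plain-text string that
// the simple pattern flags as long. It measures characters, not words.
function findLongSentences(text, minChars) {
  // Same pattern as above, with a configurable minimum length.
  var regex = new RegExp("[^.!?]{" + minChars + ",}[.!?]", "g");
  var matches = text.match(regex) || [];
  // Matches start right after the previous terminator, so trim the
  // leading space.
  return matches.map(function (s) { return s.trim(); });
}

var sample = "Short one. This sentence is clearly a good deal longer " +
             "than the first one was. Tiny!";
console.log(findLongSentences(sample, 50));
```

This only tells you which sentences are long; it works on text with the markup already stripped, so it does not say where to put the highlight in the original HTML.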

Justin Wignall
I've assumed here you've got your jQuery to do the basic highlighting?
Justin Wignall
If these paragraphs have links then it will not produce what you expect, and there is no way to extend it later to deal with HTML that spans sentences. But if the paragraphs are simple and just text, this is perfect.
Jeff Beck
A: 

I'm not sure the best thing to do is to do this on the client side. I would consider sending the paragraphs back to the server to do the work. But the work should be the same either way.

First, take all the content of a paragraph. Make sure to get all of it, since it could be spread over several nodes in the DOM. (Read This) Then you will need to write a parser that looks for your split characters while ignoring them when they appear inside HTML markup.

As an example, the . in an href attribute should be ignored and not treated as a split point. While parsing you can keep a word count as well, breaking words on the spaces. Make each sentence an object that contains the whole sentence and its word count, and push those objects into an array that represents the paragraph. Once done, you can iterate through the array and wrap any sentence whose word count reaches your threshold in a span for highlighting with CSS.

The major problem is tags that may be part of two sentences, such as the following.

I'm typing <b> in bold. NOW!</b>

What I've described doesn't deal with that, but you could make the parser more complex later to support it.

So, a quick overview of my rambling: parse through all the characters with a state machine that counts words and splits in the correct spot. On each split, add the data you collected to an array. When done, iterate through the array, outputting the newly wrapped sentences.
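A rough sketch of that state machine, working on the paragraph's HTML as a string. The names parseSentences, highlightLong and the long-sentence class are invented for this example. It assumes a sentence ends at . ! or ? followed by whitespace or the end of the string, and, as noted above, it does not handle a tag that straddles two sentences:

```javascript
// Hypothetical sketch: split an HTML string into sentence objects.
// Punctuation inside tags is ignored and tag text is not counted as words.
function parseSentences(html) {
  var sentences = [];
  var current = "";    // raw HTML of the sentence being collected
  var wordCount = 0;
  var inTag = false;   // state: between "<" and ">"
  var inWord = false;  // state: inside a run of non-space text

  for (var i = 0; i < html.length; i++) {
    var ch = html.charAt(i);
    current += ch;

    if (inTag) {                       // skip everything inside a tag
      if (ch === ">") inTag = false;
      continue;
    }
    if (ch === "<") { inTag = true; continue; }
    if (/\s/.test(ch)) { inWord = false; continue; }

    if (!inWord) { inWord = true; wordCount++; }   // a new word starts

    if (/[.!?]/.test(ch)) {
      var next = html.charAt(i + 1);
      if (next === "" || /\s/.test(next)) {        // sentence boundary
        sentences.push({ html: current, wordCount: wordCount });
        current = "";
        wordCount = 0;
        inWord = false;
      }
    }
  }
  // Keep any trailing text that has no terminator.
  if (current.trim() !== "") {
    sentences.push({ html: current, wordCount: wordCount });
  }
  return sentences;
}

// Wrap every sentence that reaches the word threshold in a span.
function highlightLong(html, minWords) {
  return parseSentences(html).map(function (s) {
    if (s.wordCount < minWords) return s.html;
    var parts = /^(\s*)([\s\S]*)$/.exec(s.html);   // keep leading space outside
    return parts[1] + '<span class="long-sentence">' + parts[2] + "</span>";
  }).join("");
}
```

With jQuery, the result could be written back with something like $(p).html(highlightLong($(p).html(), 50)), though that replaces the paragraph's child nodes and drops any event handlers bound to them.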

Jeff Beck
A: 

This is probably a rather slow solution, and ugly too, but it should be pretty simple to code:

Read all the text into a string, and then parse through it, counting characters and finding every .!? character. In the parsing loop, you also look for < and >, where < means "ignore all .!? until the next >". Then every time you find a .!? character, you check the length since the last one, and if it's long enough you save the start and end indexes in an array or something.

When the whole thing is done, make another loop, that moves substrings from the first string into a new string, prepending every "long sentence" with a highlight-tag, and appending an end-highlight-tag to the end of it, before moving on.

When finished, put the new string back where you got it from...
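The two loops could look roughly like this. It is only a sketch: findLongRanges and wrapRanges are hypothetical names, the threshold is in visible characters (tag contents are not counted), and a tag that crosses a sentence boundary will still produce badly nested markup:

```javascript
// First loop: record [start, end) index pairs of long sentences.
function findLongRanges(html, minChars) {
  var ranges = [];
  var inTag = false;
  var start = 0;    // index where the current sentence begins
  var length = 0;   // visible characters since the last . ! or ?

  for (var i = 0; i < html.length; i++) {
    var ch = html.charAt(i);
    if (inTag) {                      // ignore everything up to the next ">"
      if (ch === ">") inTag = false;
      continue;
    }
    if (ch === "<") { inTag = true; continue; }

    length++;
    if (ch === "." || ch === "!" || ch === "?") {
      if (length >= minChars) ranges.push([start, i + 1]);
      start = i + 1;
      length = 0;
    }
  }
  return ranges;
}

// Second loop: copy the string piece by piece, wrapping each saved range.
function wrapRanges(html, ranges, openTag, closeTag) {
  var out = "";
  var pos = 0;
  ranges.forEach(function (r) {
    out += html.slice(pos, r[0]) + openTag + html.slice(r[0], r[1]) + closeTag;
    pos = r[1];
  });
  return out + html.slice(pos);   // whatever is left after the last range
}

var html = "Hi. A <i>much</i> longer sentence here.";
console.log(wrapRanges(html, findLongRanges(html, 20), "<mark>", "</mark>"));
```

Because the second loop builds a new string instead of inserting into the old one, the saved indexes stay valid throughout.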

Adrian Schmidt
A: 

To do this you need to get the HTML of each paragraph (node.html()) and then replace all of the HTML tags with the same number of spaces. This should be fairly straightforward, as you can just look for an opening angle bracket and the first closing bracket after it. You need to do this firstly to prevent any full stops and words inside the tag from confusing the rest of the algorithm, but also to prevent a tag itself being counted as a word.

Split the text based on a full stop followed by nothing or any amount of whitespace to get your sentences. You need to perform this split manually using a matching regular expression so you can keep track of the start and end positions of the sentence in the original string.

Next, split each sentence on whitespace and remove any 'words' from the array which consist only of whitespace. The number of remaining words gives you the length of the sentence. If it's over your limit, insert the appropriate HTML at the start and end positions of the sentence in your original HTML string. You'll need to keep track of how much extra HTML you've added so you can find the right start and end positions of subsequent long sentences.
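Put together, the steps above might be sketched as follows. The function names, the sentence regular expression and the long class are assumptions made for this example; it splits on full stops only, as described, and it can still mis-nest markup when a tag opens in one sentence and closes in another:

```javascript
// Replace each tag with the same number of spaces so that positions in
// the masked string line up with positions in the original HTML.
function maskTags(html) {
  return html.replace(/<[^>]*>/g, function (tag) {
    return " ".repeat(tag.length);
  });
}

function highlightLongSentences(html, minWords) {
  var masked = maskTags(html);
  // A sentence: starts at a non-space character and runs to the first full
  // stop followed by whitespace or end of string (or to the end of string).
  var sentenceRe = /\S[\s\S]*?(?:\.(?=\s|$)|$)/g;
  var open = '<span class="long">';
  var close = "</span>";
  var out = html;
  var offset = 0;   // how many extra characters have been inserted so far
  var m;

  while ((m = sentenceRe.exec(masked)) !== null) {
    // Split on whitespace and drop empty entries to count real words.
    var words = m[0].split(/\s+/).filter(function (w) { return w.length > 0; });
    if (words.length >= minWords) {
      var start = m.index + offset;
      var end = m.index + m[0].length + offset;
      out = out.slice(0, start) + open + out.slice(start, end) + close + out.slice(end);
      offset += open.length + close.length;
    }
  }
  return out;
}

console.log(highlightLongSentences("One two. Three <em>four</em> five six.", 4));
```

Using exec in a loop rather than String.prototype.split is what gives the match index needed to map each sentence back to the original string.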

Andrew Wilkinson