I'd like to HTML-escape a phrase automatically and logically. The phrase is a statement in which some words are set off with quotation marks. Within the statement, the same mark could also be used as an inch mark to describe a distance.

The phrase could be:

Paul said "It missed us by about a foot". In fact it was only about 9".

Escaped correctly, this phrase should really be:

<pre>Paul said &ldquo;It missed us by about a foot&rdquo;.  
In fact it was only about 9&Prime;.</pre>

Which gives

<pre>Paul said “It missed us by about a foot”. 
     In fact it was only about 9″.</pre>

I can't think of a sample phrase that would also need a &quot; escape, but that could be there too!

I'm looking for some help on how to identify, at runtime, which of the escape values each " character should be replaced with. The phrase was just an example; it could be anything, but it should be correctly formed, i.e. an opening and a closing quote would both be present if we are to escape the text correctly.

Would I use a regular expression to find a quoted phrase in the text, i.e. two " characters before a full stop, and then replace the first and then the second with

&ldquo;

then

&rdquo;

If I found a single " I would replace it with &quot;
unless it came after a number, in which case I would replace it with

&Prime;

How would I deal with multiple quotes within a sentence?

"It just missed" Paul said "by a foot".  

This would really stump me.....

<pre>"It just missed" Paul said "by 9" almost".</pre>

When escaped correctly, the above should read as follows (I'm showing the actual characters this time):

“It just missed” Paul said “by 9″ almost”.

Obviously an edge case, but I wondered if it's possible to escape this at runtime without an understanding of the content? If not, help with the more obvious phrases would be appreciated.

A: 

You could try something like this. First replace the quotations with this regular expression:

"((?:[^"\d]+|\d"?)*)"

And then the inch sign:

(\d+)"

Here’s an example in JavaScript:

'"It just missed" Paul said "by 9" almost"'.replace(/"((?:[^"\d]*|\d["']?)+)"/g, "&ldquo;$1&rdquo;").replace(/(\d+)"/g, "$1&Prime;");
Gumbo
+1  A: 

what you've described is basically a hidden Markov model,

http://en.wikipedia.org/wiki/Hidden_Markov_model

you have a set of input symbols (your original text and ambiguous punctuation), and a set of output symbols (original text and more fine-grained punctuation) but no good way of really observing the connection between the two in a programmatic way. you could write some rules to cover some of the edge cases, but that will basically never work for the multiple quotes situation. in this case you can't really use a regex for the same reason, but with an HMM and a bunch of training text you could probably make some pretty good guesses.

sorry that's probably not very helpful if you're trying to get something ready for deployment, but the input has greater ambiguity than the output, so your only option is to consider the context, and that basically means either a very lengthy set of rules, or some kind of machine learning approach.

interesting question though - it would be neat to see what kind of performance you could get. maybe someone's already written a paper on it?

blackkettle
+1  A: 

I wondered if it's possible to escape this at runtime without an understanding of the content?

Considering that you're adding semantic meaning to the punctuation which is currently encoded in the other text... no, not really.

Regular expressions would be the easiest tool for at least part of it. I'd suggest looking for /\d+"/ for the inch number cases. But for quote delimiters, after you've looked for any other special cases or phrases, it may be easier to use an algorithm for matching pairs, like with parentheses and brackets: tokenize and count. Then test on real-world input and refine.
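A rough sketch of the tokenize-and-count idea in JavaScript, for illustration only. The function name pairQuotes and the shape of its return value are hypothetical, and the only special case handled is the digits-then-quote inch rule mentioned above:

```javascript
// Hypothetical helper: tokenize, then count quote delimiters as they are paired.
function pairQuotes(text) {
  // Tokens: digits followed by a quote (inch), a lone quote,
  // a run of other text, or digits not followed by a quote.
  const tokens = text.match(/\d+"|"|[^"\d]+|\d+/g) || [];
  let open = false; // true while inside an unclosed quotation
  const html = tokens
    .map((tok) => {
      if (/^\d+"$/.test(tok)) {
        // Assumption: a quote right after digits is always an inch mark.
        return tok.slice(0, -1) + "&Prime;";
      }
      if (tok === '"') {
        open = !open;
        return open ? "&ldquo;" : "&rdquo;";
      }
      return tok; // plain text passes through unchanged
    })
    .join("");
  // An odd number of delimiters leaves a quotation open.
  return { html, balanced: !open };
}
```

The balanced flag makes it possible to detect input that is not correctly formed and fall back to leaving it alone or flagging it for review. A closing quote that happens to follow a number would still be misread as an inch mark.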

But I really have to ask: why?

Anonymous
+3  A: 

I would do this in two passes:

The first pass searches for any "s which are immediately preceded by numbers and does that replacement:

s/([0-9])"/\1&Prime;/g

Depending on the text you're dealing with, you may want/need to extend this regex to also recognize numbers that are spelled out as words; I've only checked for digits for the sake of simplicity.

With all of those taken care of, a second pass can then easily convert pairs of "s as you've described:

s/"([^"]*)"/&ldquo;\1&rdquo;/g

Note the use of [^"]* rather than .* - we want to find two sets of double-quotes with any number of non-double-quote characters between them. By adding that restriction, there won't be any problems handling strings with multiple quoted sections. (This could also be accomplished using the non-greedy .*?, but a negated character class more clearly states your intent and, in most regex implementations, is more efficient.)

A stray, mismatched " somewhere in the string, or an inch marker which is missed by the first pass, can still cause problems, of course, but there's no way to avoid that possibility without implementing understanding of the content.
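For illustration, the two passes could be sketched in JavaScript like this. It is only a minimal sketch: escapeQuotes is a hypothetical helper name, and it assumes a " directly after a digit is always an inch mark.

```javascript
// Hypothetical helper implementing the two-pass idea described above.
function escapeQuotes(text) {
  // Pass 1: a double quote immediately preceded by a digit becomes a double prime.
  const withPrimes = text.replace(/(\d)"/g, "$1&Prime;");
  // Pass 2: pair up the remaining quotes. [^"]* keeps each match
  // to a single quoted section, so multiple sections are handled separately.
  return withPrimes.replace(/"([^"]*)"/g, "&ldquo;$1&rdquo;");
}

console.log(escapeQuotes('"It just missed" Paul said "by 9" almost".'));
// &ldquo;It just missed&rdquo; Paul said &ldquo;by 9&Prime; almost&rdquo;.
```

Note that JavaScript writes the backreference as $1 rather than \1. If the s/// expressions are run through sed instead, the literal & in the replacement would need to be escaped as \&, because sed treats a bare & as the whole match.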

Dave Sherohman
+1 for doing the Prime symbols first. That correctly handles the "by 9" almost" case.
Alan Moore
+1  A: 

I am not sure if it is possible at all to do that without understanding the meaning of the sentence. I tend to doubt it.

My first attempt would be the following.

  • go from left to right through the string
  • alternate replacing straight double quotes (") with left and right double quotes, but replace with a double prime if there is a number to the left
  • if the quotation marks are unbalanced at the end of the string, go back until you find a number followed by a double prime and change that double prime into a left or right double quote, depending on the preceding double quotes.

I am quite sure that it is easy to make this strategy fail. And this is still the easy case: the hard work starts when you have to deal with nested quotation marks.
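One possible reading of the strategy above, sketched in JavaScript. The name smartQuotes is hypothetical, and the repair step is an assumption about what "go back" means: if the quotes come out unbalanced, the last inch mark is reinterpreted as a quotation mark and the left/right alternation is recomputed.

```javascript
// Hypothetical sketch of the left-to-right strategy with a repair step.
function smartQuotes(text) {
  // Step 1: classify each straight quote, "prime" after a digit, else "quote".
  const marks = [];
  for (let i = 0; i < text.length; i++) {
    if (text[i] === '"') {
      const afterDigit = i > 0 && /\d/.test(text[i - 1]);
      marks.push({ index: i, kind: afterDigit ? "prime" : "quote" });
    }
  }
  // Step 2: if the quotes are unbalanced, turn the last prime into a quote.
  const quoteCount = marks.filter((m) => m.kind === "quote").length;
  if (quoteCount % 2 === 1) {
    const lastPrime = [...marks].reverse().find((m) => m.kind === "prime");
    if (lastPrime) lastPrime.kind = "quote";
  }
  // Step 3: rebuild the string, alternating left and right quotes.
  let open = false;
  let out = "";
  let pos = 0;
  for (const m of marks) {
    out += text.slice(pos, m.index);
    if (m.kind === "prime") {
      out += "&Prime;";
    } else {
      open = !open;
      out += open ? "&ldquo;" : "&rdquo;";
    }
    pos = m.index + 1;
  }
  return out + text.slice(pos);
}
```

With this sketch, 'He said "it was 9"' comes out as He said &ldquo;it was 9&rdquo;, because the only way to balance the quotes is to treat the mark after the 9 as a closing quote. Nested quotations are not handled at all.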

Daniel Brückner
+1  A: 

I know this is off the wall, but have you considered Mechanical Turk? This is the sort of problem humans excel at, and computers, currently, are terrible at. Choosing the correct punctuation requires understanding of the meaning of the sentence, so a regex is bound to fail for edge cases.

Chas. Owens