I have a bunch of posts written in markdown and I need to remove the periods from the end of every paragraph in each of them

The end of a paragraph in markdown is delimited by:

  • Two or more consecutive newlines (\n), or
  • The end of the string

However, there are these edge cases:

  1. Ellipses
  2. Acronyms (e.g., I don't want to drop the final period in "Notorious B.I.G." when it falls at the end of a paragraph). I think you can handle this case with the rule "don't remove the final period if it's preceded by a capital letter which is itself preceded by another period".
  3. Special cases: e.g., i.e., etc.

Here's a regular expression that matches posts that have offending periods, but it doesn't account for (2) and (3) above:

/[^.]\.(\n{2,}|\z)/
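To see concretely what that pattern catches and what it gets wrong, here is a small sketch that runs it against one sample per case. The constant name, the helper `offending?`, and the sample strings are made up for illustration.

```ruby
# The pattern from the question: a non-period character, a period,
# then either two or more newlines or the end of the string.
OFFENDING = /[^.]\.(\n{2,}|\z)/

# Hypothetical helper: true when the text has a period the pattern would flag.
def offending?(text)
  !(text =~ OFFENDING).nil?
end

# Hypothetical sample paragraphs, one per case.
[
  "A plain sentence.",    # should be flagged
  "Trailing off...",      # ellipsis: should be left alone
  "Notorious B.I.G.",     # acronym: should be left alone
  "Apples, pears, etc."   # special case: should be left alone
].each do |text|
  puts "#{offending?(text) ? 'flagged' : 'ok'}: #{text}"
end
```

Only the ellipsis escapes. The acronym and "etc." are still flagged, which is the gap described in (2) and (3).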

A: 
(?<!\.[a-zA-Z]|etc|\.\.)\.(?=\n{2,}|\Z)
  • (?<!\.[a-zA-Z]|etc|\.\.) - negative lookbehind: the period must not be preceded by sequences like .T, etc, or .. (the last one covers ellipses).
  • \. - the period itself.
  • (?=\n{2,}|\Z) - lookahead: the period must be followed by the end of a markdown paragraph (two or more newlines, or the end of the string).
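The same three parts can also be laid out with Ruby's free-spacing flag (/x), which ignores whitespace inside the pattern and allows inline comments. This is only a sketch: the constant and method names are made up, and, like the one-line version, it needs a Ruby with lookbehind support (1.9 or later).

```ruby
# Free-spacing (/x) layout of the same pattern, one part per line.
TRAILING_PERIOD = /
  (?<! \.[a-zA-Z] | etc | \.\. )   # not preceded by ".T", "etc", or ".."
  \.                               # the period to remove
  (?= \n{2,} | \Z )                # then a blank line or the end of the string
/x

# Hypothetical helper: strip the matching periods from a whole post.
def strip_trailing_periods(markdown)
  markdown.gsub(TRAILING_PERIOD, "")
end

puts strip_trailing_periods("First paragraph.\n\nSee B.I.G.\n\nAnd so on...")
```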

Test:

s = """this is a paragraph.

this ends with an ellipsis...

this ends with etc.

this ends with B.I.G.

this ends with e.g.

this should be replaced.

this is end of text."""
print s.gsub(/(?<!\.[a-zA-Z]|etc|\.\.)\.(?=[\n]{2,}|\Z)/, "") 
print "\n"

Output:

this is a paragraph

this ends with an ellipsis...

this ends with etc.

this ends with B.I.G.

this ends with e.g.

this should be replaced

this is end of text
Amarghosh
Perfect! (Only my version of Ruby (1.8.7) doesn't support lookbehinds! Argh!)
Horace Loeb
@Horace 1.9.1p129 does.
Amarghosh
Is there any way to do this without a lookbehind? Even with more than 1 regular expression (I can't upgrade Ruby right now)?
Horace Loeb
@Horace I haven't tested this, but you can replace `(\.[a-zA-Z]|etc|\.\.)\.(?=\n{2,}|\Z)` with `"\\1"`
Amarghosh
Close, but it does the *opposite* of what we want (i.e., it removes periods when they're part of ellipses, acronyms, etc.). See http://pastie.org/1056316 (what does `"\\1"` mean?)
Horace Loeb
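On the `"\\1"` question: in a replacement string, `\\1` is a backreference that stands for whatever the first parenthesised group matched. Replacing the whole match with `"\\1"` therefore keeps the captured prefix and drops the period after it, which is why the exceptions lose their periods. One way to flip the logic without a lookbehind is to make the group optional and decide inside a block. This is a minimal sketch that assumes the same exception list as the answer above. The constant and method names are made up, and although it uses only a lookahead and the block form of gsub, it has not been run on 1.8.7 itself.

```ruby
# Lookbehind-free variant: the exception prefix is captured by an optional
# group instead of being asserted with (?<!...).
PERIOD_AT_PARAGRAPH_END = /(\.[a-zA-Z]|etc|\.\.)?\.(?=\n{2,}|\z)/

# Hypothetical helper name.
def drop_paragraph_periods(markdown)
  markdown.gsub(PERIOD_AT_PARAGRAPH_END) do |match|
    # $1 is set when the period follows ".T", "etc", or "..": keep the text.
    # Otherwise the match is just the lone period: remove it.
    $1 ? match : ""
  end
end

puts drop_paragraph_periods("this should be replaced.\n\nthis ends with B.I.G.")
```

Because the regex engine tries the optional group first at each position, an exception such as ".G." or "etc." is consumed as a whole and put back unchanged, while a bare trailing period matches on its own and is deleted.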
A: 

A Ruby 1.8.7 compatible algorithm:

s = %{this is a paragraph.

this ends with an ellipsis...

this ends with etc.

this ends with B.I.G.

this ends with e.g.

this should be replaced.

this is end of text.}.strip

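# Split into paragraphs, chop the trailing period where appropriate, and rejoin.
a = s.split(/\n{2,}/).each do |paragraph|
  next unless paragraph.match(/\.\Z/)                    # no trailing period
  next if paragraph.match(/(\.[a-zA-Z]|etc|\.\.)\.\Z/)   # acronym, "etc.", or ellipsis
  paragraph.chop!                                        # remove the final period in place
end.join("\n\n")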

>> puts a
this is a paragraph

this ends with an ellipsis...

this ends with etc.

this ends with B.I.G.

this ends with e.g.

this should be replaced

this is end of text
Horace Loeb