ansaurus

Question

Replace patterns that are inside delimiters using a regular expression call

Answer 1

+3 A:

This cannot be done with regular expressions, because you need to maintain state on whether you're inside single quotes or outside, and regex is inherently stateless. (Also, as far as I understand, single quotes can be escaped without terminating the "inside" region).

Your best bet is to iterate through the string character by character, keeping a boolean flag on whether or not you're inside a quoted region - and remove the --'s that way.
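A minimal Python sketch of that loop, assuming every single quote toggles the inside/outside state (escaped quotes are not handled). The function name strip_dashes_in_quotes is made up for illustration:

```python
def strip_dashes_in_quotes(text):
    """Remove '--' only while inside a single-quoted region."""
    out = []
    inside = False  # True while we are between an opening and a closing quote
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "'":
            # Assumption: quotes are never escaped, so each one flips the flag.
            inside = not inside
            out.append(ch)
            i += 1
        elif inside and text.startswith("--", i):
            i += 2  # skip the two dashes instead of copying them
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(strip_dashes_in_quotes("a -- 'b--c' -- d"))
```

Dashes outside the quotes are copied through untouched; only the pair inside 'b--c' is dropped.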

levik 2008-10-07 23:16:08

Answer 2

A:

Hm. There might be a way in Python if there are no quoted apostrophes, given that there is the (?(id/name)yes-pattern|no-pattern) construct in regular expressions, but it goes way over my head currently.

Does this help?

def remove_double_dashes_in_apostrophes(text):
    return "'".join(
        part.replace("--", "") if (ix & 1) else part
        for ix, part in enumerate(text.split("'")))

Seems to work for me. It splits the input text into parts on apostrophes and replaces the "--" only when the part is odd-numbered (i.e. there has been an odd number of apostrophes before the part). Note about "odd-numbered": part numbering starts from zero!
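To see why the odd-numbered parts are the quoted ones, it may help to look at what the split produces for a small, made-up input (assuming balanced, unescaped apostrophes):

```python
text = "a -- 'b--c' -- d"

# Splitting on the apostrophe alternates between outside and inside text.
parts = text.split("'")

# Index 0 is before the first apostrophe, index 1 is inside the first
# quoted region, index 2 is outside again, and so on.
labelled = [
    (ix, "inside" if ix & 1 else "outside", part)
    for ix, part in enumerate(parts)
]

for row in labelled:
    print(row)
```

Only the entries labelled "inside" would have their "--" removed before the parts are joined back together.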

ΤΖΩΤΖΙΟΥ 2008-10-07 23:33:39

Answer 3

+1 A:

If bending the rules a little is allowed, this could work:

import re
p = re.compile(r"((?:^[^']*')?[^']*?(?:'[^']*'[^']*?)*?)(-{2,})")
txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(re.sub(p, r'\1-', txt))

Output:

xxxx rt / $ 'dfdf-fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '-ggh-' vcbcvb

The regex:

(               # Group 1
  (?:^[^']*')?  # Start of string, up till the first single quote
  [^']*?        # Inside the single quotes, as few characters as possible
  (?:
    '[^']*'     # No double dashes inside these single quotes, jump to the next.
    [^']*?
  )*?           # as few as possible
)
(-{2,})         # The dashes themselves (Group 2)

If there were different delimiters for start and end, you could use something like this:

-{2,}(?=[^'`]*`)
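A quick sketch of that variant in Python, assuming regions open with ' and close with a backtick, and collapsing each run of dashes to a single dash as in the example above. The sample string and the function name are invented for illustration:

```python
import re

# Runs of two or more dashes that are followed by a closing backtick,
# with no other delimiter in between.
INSIDE_DASHES = re.compile(r"-{2,}(?=[^'`]*`)")


def collapse_dashes(text):
    """Collapse dash runs inside '...` regions to a single dash."""
    return INSIDE_DASHES.sub("-", text)


sample = "'a--b` c -- d 'e---f`"
print(collapse_dashes(sample))
```

The dashes between the two regions are left alone because the next delimiter after them is an opening quote, not a closing backtick.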

Edit: I realized that if the string does not contain any quotes, it will match all double dashes in the string. One way of fixing it would be to change

(?:^[^']*')?

in the beginning to

(?:^[^']*'|(?!^))

Updated regex:

((?:^[^']*'|(?!^))[^']*?(?:'[^']*'[^']*?)*?)(-{2,})

MizardX 2008-10-07 23:41:41

Answer 4

A:

You can use the following sed script, I believe:

:again
s/'\(.*\)--\(.*\)'/'\1\2'/g
t again

Store that in a file (rmdashdash.sed) and use whatever exec mechanism your scripting language offers to run the equivalent of this shell command:

sed -f rmdashdash.sed < file containing your input data

What the script does is:

:again <-- just a label

s/'\(.*\)--\(.*\)'/'\1\2'/g

substitute, for the pattern ' followed by anything followed by -- followed by anything followed by ', just the two anythings within quotes.

t again <-- if the substitution changed anything, jump back to the label and run it again.

Note that this script will convert '----' into '', since it is a sequence of two --'s within quotes. However, '---' will be converted into '-'.

Ain't no school like old school.

bog 2008-10-08 00:28:46

"foo 'bar' -- 'baz'" -> "foo 'bar' 'baz'"

MizardX 2008-10-08 00:46:24
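The counterexample in the comment above can be reproduced with a rough Python transliteration of the sed loop. This is only a sketch: the function name sed_like is hypothetical, and Python's re is assumed to behave like sed's greedy matching for these inputs:

```python
import re

# Same shape as the sed substitution: greedy "anything" on both sides of "--".
QUOTED_DASHES = re.compile(r"'(.*)--(.*)'")


def sed_like(line):
    """Repeat the substitution until it no longer changes anything."""
    while True:
        line, count = QUOTED_DASHES.subn(r"'\1\2'", line)
        if count == 0:  # like `t again`: loop only if something was replaced
            return line


print(sed_like("'----'"))
print(sed_like("'---'"))
print(sed_like("foo 'bar' -- 'baz'"))
```

Because .* is greedy and may span quote characters, the pattern treats the text from the first quote to the last quote as one region, so the dashes between 'bar' and 'baz' are removed as well.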

Answer 5

+1 A:

I found another way to do this in an answer by Greg Hewgill at Qn138522.
It is based on this regex (adapted to contain the pattern I was looking for):

--(?=[^\']*'([^']|'[^']*')*$)

Greg explains:

"What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string. This relies on your assumption that the quotes are always balanced. This is also not very efficient."

Usage examples:

JavaScript: input.replace(/--(?=[^']*'([^']|'[^']*')*$)/g, "")
PHP: preg_replace('/--(?=[^\']*\'([^\']|\'[^\']*\')*$)/', "", $input)
Python: re.sub(r"--(?=[^']*'([^']|'[^']*')*$)", "", input)
Ruby: input.gsub(/--(?=[^\']*'([^']|'[^']*')*$)/, "")

I have tested this for Ruby and it provides the desired result.
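For a quick check in Python as well, here is a small sketch using the same pattern. It assumes balanced, unescaped single quotes, as Greg's explanation does; the function name and the sample string are made up:

```python
import re

# "--" followed by an odd number of quotes before the end of the string,
# which (with balanced quotes) means the "--" sits inside a quoted region.
INSIDE_QUOTES = re.compile(r"--(?=[^']*'([^']|'[^']*')*$)")


def remove_quoted_dashes(text):
    """Delete every '--' that appears inside single quotes."""
    return INSIDE_QUOTES.sub("", text)


print(remove_quoted_dashes("xxxx 'a--b' c -- d '--e--'"))
```

The "--" between c and d is kept because an even number of quotes follows it, so the lookahead fails there.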

Mike Berrow 2008-10-08 03:01:39
