I need to clip out all the occurrences of the pattern '--' that are inside single quotes in a long string (leaving intact the ones that are outside single quotes).

Is there a regular expression way of doing this? (using it with an iterator from the language is OK).

For example, starting with

"xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"

I should end up with:

"xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb"

So I am looking for a regex that could be run from the following languages as shown

  • JavaScript input.replace(/someregex/g, "")
  • PHP preg_replace('/someregex/', "", input)
  • Python re.sub(r'someregex', "", input)
  • Ruby input.gsub(/someregex/, "")
+3  A: 

This cannot be done with regular expressions, because you need to maintain state on whether you're inside single quotes or outside, and regex is inherently stateless. (Also, as far as I understand, single quotes can be escaped without terminating the "inside" region).

Your best bet is to iterate through the string character by character, keeping a boolean flag on whether or not you're inside a quoted region - and remove the --'s that way.
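
A minimal Python sketch of that loop, assuming single quotes are only ever used as delimiters and are never escaped (the function name strip_dashes_in_quotes is made up for illustration):

```python
def strip_dashes_in_quotes(text, pattern="--"):
    """Remove `pattern` only while inside single quotes (no escape handling)."""
    out = []
    in_quotes = False              # the boolean flag described above
    i = 0
    while i < len(text):
        if text[i] == "'":
            in_quotes = not in_quotes   # every apostrophe toggles the state
            out.append("'")
            i += 1
        elif in_quotes and text.startswith(pattern, i):
            i += len(pattern)           # drop the pattern inside quotes
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


sample = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(strip_dashes_in_quotes(sample))
# xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb
```

If escaped quotes do need to be supported, the toggle branch is where that extra check would probably go.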

levik
A: 

Hm. There might be a way in Python if there are no quoted apostrophes, given that there is the (?(id/name)yes-pattern|no-pattern) construct in regular expressions, but it goes way over my head currently.

Does this help?

def remove_double_dashes_in_apostrophes(text):
    # Splitting on "'" puts the quoted text at the odd indexes (0-based).
    return "'".join(
        part.replace("--", "") if (ix & 1) else part
        for ix, part in enumerate(text.split("'")))

Seems to work for me. It splits the input text into parts on apostrophes and replaces the "--" only in the odd-numbered parts (i.e. parts preceded by an odd number of apostrophes). Note about "odd-numbered": part numbering starts from zero!
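
In the same spirit, and since the question allows pairing a regex with the language's own machinery, one possible variant is to let re.sub find each quoted segment and clean it in a callback. This is only a sketch: it assumes balanced, unescaped single quotes, and remove_dashes_with_callback is a hypothetical name.

```python
import re


def remove_dashes_with_callback(text):
    # Match each '...' segment and strip "--" from that segment only;
    # everything outside the quotes is never touched by the callback.
    return re.sub(r"'[^']*'", lambda m: m.group(0).replace("--", ""), text)


sample = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(remove_dashes_with_callback(sample))
# xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb
```

One difference from the split version: with an unmatched trailing apostrophe, this variant leaves the tail alone, whereas splitting treats everything after the last apostrophe as quoted.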

ΤΖΩΤΖΙΟΥ
+1  A: 

If bending the rules a little is allowed, this could work:

import re
p = re.compile(r"((?:^[^']*')?[^']*?(?:'[^']*'[^']*?)*?)(-{2,})")
txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(re.sub(p, r'\1-', txt))

Output:

xxxx rt / $ 'dfdf-fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '-ggh-' vcbcvb

The regex:

(               # Group 1
  (?:^[^']*')?  # Start of string, up till the first single quote
  [^']*?        # Inside the single quotes, as few characters as possible
  (?:
    '[^']*'     # No double dashes inside these single quotes, jump to the next.
    [^']*?
  )*?           # as few as possible
)
(-{2,})         # The dashes themselves (Group 2)

If there were different delimiters for start and end, you could use something like this:

-{2,}(?=[^'`]*`)


Edit: I realized that if the string does not contain any quotes, it will match all double dashes in the string. One way of fixing it would be to change

(?:^[^']*')?

in the beginning to

(?:^[^']*'|(?!^))

Updated regex:

((?:^[^']*'|(?!^))[^']*?(?:'[^']*'[^']*?)*?)(-{2,})
MizardX
A: 

You can use the following sed script, I believe:

:again
s/'\(.*\)--\(.*\)'/'\1\2'/g
t again

Store that in a file (rmdashdash.sed) and use whatever exec mechanism your scripting language offers to run the equivalent of this shell command:

sed -f rmdashdash.sed < file containing your input data

What the script does is:

:again <-- just a label

s/'\(.*\)--\(.*\)'/'\1\2'/g

substitute, for the pattern ' followed by anything followed by -- followed by anything followed by ', just the two anythings within quotes.

t again <-- if the substitution changed anything, jump back to the label and try again.

Note that this script will convert '----' into '', since it is a sequence of two --'s within quotes. However, '---' will be converted into '-'.

Ain't no school like old school.

bog
"foo 'bar' -- 'baz'" -> "foo 'bar' 'baz'"
MizardX
+1  A: 

I found another way to do this from an answer by Greg Hewgill at Qn138522
It is based on using this regex (adapted to contain the pattern I was looking for):

--(?=[^\']*'([^']|'[^']*')*$)

Greg explains:

"What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string. This relies on your assumption that the quotes are always balanced. This is also not very efficient."

The usage examples would be :

  • JavaScript input.replace(/--(?=[^']*'([^']|'[^']*')*$)/g, "")
  • PHP preg_replace('/--(?=[^\']*\'([^\']|\'[^\']*\')*$)/', "", input)
  • Python re.sub(r"--(?=[^']*'([^']|'[^']*')*$)", "", input)
  • Ruby input.gsub(/--(?=[^']*'([^']|'[^']*')*$)/, "")

I have tested this for Ruby and it provides the desired result.
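
For anyone who wants to check the pattern outside Ruby, here is a small Python sketch run against the sample string from the question. The constant name QUOTED_DASHES is made up for this example, and like the regex itself it assumes the quotes are balanced and never escaped.

```python
import re

# "--" followed by: non-quote chars, one quote (the one closing the current
# quoted region), then only plain chars or complete '...' groups up to the
# end. That holds exactly when an odd number of quotes remain, i.e. when
# the "--" sits inside a quoted region.
QUOTED_DASHES = re.compile(r"--(?=[^']*'([^']|'[^']*')*$)")

sample = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
result = QUOTED_DASHES.sub("", sample)
print(result)
# xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb
```

Each candidate "--" makes the lookahead scan to the end of the string, so on very long inputs a single-pass loop is likely to be faster.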

Mike Berrow