Ok... I have an unsatisfactory solution to a problem.

The problem is I have input like so:

{sup 19}F({sup 3}He,t){sup 19}Ne(p){sup 18}F

and need output like so:

¹⁹F(³He,t)¹⁹Ne(p)¹⁸F

I first use a series of replacements to split each {sup xx} section into {sup x}{sup x}, then use a regex to match each of those and replace the digit with its single-character Unicode superscript equivalent. The "problem" is that the {sup} sections can contain numbers 1, 2 or 3 digits long (maybe more, I don't know), and I want to "expand" them into separate {sup} sections with one digit each. (I also have the same problem with {sub} for subscripts...)

My current solution looks like this (in Java):

retval = retval.replaceAll("\\{sup ([1-9])([0-9])\\}", "{sup $1}{sup $2}");
retval = retval.replaceAll("\\{sup ([1-9])([0-9])([0-9])\\}", "{sup $1}{sup $2}{sup $3}");

My question: is there a way to do this in a single pass no matter how many digits there are (or at least for some reasonable number of digits)?

+2  A: 

Yes, but it may be a bit of a hack, and you'll have to be careful it doesn't overmatch!

Regex:

(?:\{sup\s)?(\d)(?=\d*})}?

Replacement String:

{sup $1}

A short explanation:

(?:                            | start non-capturing group
  \{                           |   match the character '{'
  sup                          |   match the substring: "sup"
  \s                           |   match any white space character
)                              | end non-capturing group
?                              | ...and repeat it once or not at all
(                              | start group 1
  \d                           |   match any character in the range 0..9
)                              | end group 1
(?=                            | start positive look ahead
  \d                           |   match any character in the range 0..9
  *                            |   ...and repeat it zero or more times
  }                            |   match the substring: "}"
)                              | end positive look ahead
}                              | match the substring: "}"
?                              | ...and repeat it once or not at all

In plain English: it matches a single digit, only when looking ahead there's a } with optional digits in between. If possible, the substrings {sup and } are also replaced.
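To make the overmatching concrete, here is a minimal sketch that runs the regex above through Java's replaceAll. The class and method names (FirstAttemptDemo, expand) are made up for illustration, and the closing brace is escaped in the pattern only for readability.

```java
import java.util.regex.Pattern;

final class FirstAttemptDemo {
    // The regex from above, written as a Java string literal.
    private static final Pattern DIGIT_BEFORE_BRACE =
        Pattern.compile("(?:\\{sup\\s)?(\\d)(?=\\d*\\})\\}?");

    // Hypothetical helper: wraps every matched digit in its own {sup ...} tag.
    static String expand(String input) {
        return DIGIT_BEFORE_BRACE.matcher(input).replaceAll("{sup $1}");
    }

    public static void main(String[] args) {
        // Intended case: prints {sup 1}{sup 9}F({sup 3}He,t)
        System.out.println(expand("{sup 19}F({sup 3}He,t)"));
        // Overmatch: prints {sub {sup 1}{sup 9}Ne
        System.out.println(expand("{sub 19}Ne"));
        // Overmatch: prints set={{sup 1}{sup 2}{sup 3}
        System.out.println(expand("set={123}"));
    }
}
```

Any digit that is followed by optional digits and a } gets rewritten, whether or not it sits inside a {sup ...} section.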

EDIT:

A better one is this:

(?:\{sup\s|\G)(\d)(?=\d*})}?

That way, digits like in the string "set={123}" won't be replaced. The \G in my second regex matches the spot where the previous match ended.
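A minimal sketch of the \G version in Java, assuming the same {sup $1} replacement string. The names SupSplitter and splitSup are only illustrative, and the closing brace is escaped for readability.

```java
import java.util.regex.Pattern;

final class SupSplitter {
    // Either start a new "{sup " section, or continue exactly where the
    // previous match ended (that is what \G asserts).
    private static final Pattern SUP_DIGIT =
        Pattern.compile("(?:\\{sup\\s|\\G)(\\d)(?=\\d*\\})\\}?");

    // Hypothetical helper: one {sup d} tag per digit, in a single pass.
    static String splitSup(String input) {
        return SUP_DIGIT.matcher(input).replaceAll("{sup $1}");
    }

    public static void main(String[] args) {
        // prints {sup 1}{sup 9}F({sup 3}He,t){sup 1}{sup 9}Ne(p){sup 1}{sup 8}F
        System.out.println(splitSup("{sup 19}F({sup 3}He,t){sup 19}Ne(p){sup 18}F"));
        // prints set={123} (left alone)
        System.out.println(splitSup("set={123}"));
        // prints {sub 19}Ne (left alone)
        System.out.println(splitSup("{sub 19}Ne"));
    }
}
```

One caveat: before the first match, \G matches at the very start of the input, so a string that begins with digits followed by } (for example "12}") would still be rewritten.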

Bart Kiers
Why did you mark the `{sup ` part as optional? It looks like it will match "1}".
Mike D.
@Mike: the OP wants to replace `{sup 123}` with `{sup 1}{sup 2}{sup 3}`. Only the first digit has `{sup ` in front of it and the last digit has `}` after it: that's why it's optional.
Bart Kiers
@Mike: ah, I see what you mean. Hence my remark "you'll have to be careful it doesn't *overmatch*!". See my second solution, the one with the `\G` in it, which accounts for that.
Bart Kiers
That second edited one is the right one. The first one incorrectly makes replacements on other inputs like {sub 1} instead of {sup 1}. There are a lot of replacements in these documents.
darelf
You're in luck then: the `\G` is not implemented in many regex implementations (I only know of Java).
Bart Kiers
@Bart K: You prolly already know this, but you are a genius.
darelf
`\G` is not that rare, really: http://www.regular-expressions.info/continue.html . It's just that, outside of Perl (where it originated--of course!), people don't seem to think of it very often. At least, I don't; this isn't the first time you've managed to blindside me with it. :)
Alan Moore
Aha, I always thought it originated in Java's java.util.regex (don't know where I got that idea from...) and that Perl either adopted it from Java, or was going to do so. Thanks for the info.
Bart Kiers
A: 

Sure, this is a standard Regular Expression construct. You can find out about all the metacharacters in the Pattern Javadoc, but for your purposes, you probably want the "+" metacharacter, or the {1,3} greedy quantifier. Details in the link.

Adrian Petrescu
No, you misunderstood, the OP is not looking how to match one or more digits.
Bart Kiers
+1  A: 

The easiest way to do this kind of thing is with something like PHP's preg_replace_callback or .NET's MatchEvaluator delegates. Java versions before 9 don't have anything like that built in, but the Matcher class exposes the lower-level API (find, appendReplacement, appendTail) that lets you implement it yourself. Here's one way to do it:

import java.util.regex.*;

public class Test
{
  static String sepsup(String orig)
  {
    Pattern p = Pattern.compile("(\\{su[bp] )(\\d+)\\}");
    Matcher m = p.matcher(orig);
    StringBuffer sb = new StringBuffer();
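    while (m.find())
    {
      // Copy the text before this match and drop the matched tag itself.
      m.appendReplacement(sb, "");
      // Emit one tag per digit, reusing the captured "{sup " or "{sub " prefix.
      for (char ch : m.group(2).toCharArray())
      {
        sb.append(m.group(1)).append(ch).append("}");
      }
    }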
    m.appendTail(sb);
    return sb.toString();
  }

  public static void main (String[] args)
  {
    String s = "{sup 19}F({sup 3}He,t){sub 19}Ne(p){sup 18}F";
    System.out.println(s);
    System.out.println(sepsup(s));
  }
}

result:

{sup 19}F({sup 3}He,t){sub 19}Ne(p){sup 18}F
{sup 1}{sup 9}F({sup 3}He,t){sub 1}{sub 9}Ne(p){sup 1}{sup 8}F
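For what it's worth, on Java 9 or later the same thing can be written as a callback, because Matcher.replaceAll also accepts a Function<MatchResult, String>. This is a sketch of that variant rather than the code above; the class name SepSupCallback is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SepSupCallback {
    // Same pattern as in the answer: group 1 is the "{sup " or "{sub " prefix,
    // group 2 is the run of digits.
    private static final Pattern TAG = Pattern.compile("(\\{su[bp] )(\\d+)\\}");

    // Requires Java 9+: replaceAll calls the lambda once per match.
    static String sepsup(String orig) {
        return TAG.matcher(orig).replaceAll(mr -> {
            StringBuilder out = new StringBuilder();
            for (char ch : mr.group(2).toCharArray()) {
                out.append(mr.group(1)).append(ch).append('}');
            }
            // The returned text is treated as a replacement string, so quote it.
            return Matcher.quoteReplacement(out.toString());
        });
    }

    public static void main(String[] args) {
        System.out.println(sepsup("{sup 19}F({sup 3}He,t){sub 19}Ne(p){sup 18}F"));
    }
}
```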

If you wanted, you could go ahead and generate the superscript and subscript characters and insert those instead.
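As a rough sketch of that last step, the loop can look each digit up in a table of Unicode superscript or subscript characters instead of re-emitting the tag. The names ScriptDigits and toUnicode are hypothetical, and the tables are written as Unicode escapes so the source file encoding doesn't matter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptDigits {
    // Index i holds the superscript / subscript form of digit i.
    private static final String SUPERSCRIPTS =
        "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079";
    private static final String SUBSCRIPTS =
        "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089";

    // Group 1 is "sup" or "sub", group 2 is the run of ASCII digits.
    private static final Pattern TAG = Pattern.compile("\\{(su[bp]) ([0-9]+)\\}");

    static String toUnicode(String input) {
        Matcher m = TAG.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String table = m.group(1).equals("sup") ? SUPERSCRIPTS : SUBSCRIPTS;
            StringBuilder digits = new StringBuilder();
            for (char ch : m.group(2).toCharArray()) {
                digits.append(table.charAt(ch - '0'));
            }
            // Replace the whole tag with the converted digits.
            m.appendReplacement(sb, Matcher.quoteReplacement(digits.toString()));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // What the console shows depends on its encoding and font.
        System.out.println(toUnicode("{sup 19}F({sup 3}He,t){sup 19}Ne(p){sup 18}F"));
    }
}
```

This skips the intermediate one-digit-per-tag form entirely, which may or may not suit the rest of the replacement pipeline.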

Alan Moore
Nice one, Alan!
Bart Kiers