views:

73

answers:

5

I have a very complex string, such as:

<p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
<p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
<p>ccc <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
....

Now I want to get the aaa, bbb, ccc parts. I don't want to use a regular expression here, because it's too complicated to turn the <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> part into a regex.

I'm hoping there is a method (say, substrings_between) that I could use like this:

substrings = text.substrings_between('<p>', ' <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>');
substrings # -> [aaa, bbb, ccc]

Is there such a method? If not, what's the best way to do this?

+1  A: 

Use strip_tags (a Rails view helper, not part of plain Ruby):

string = '<span id="span_is"><br><br><u><i>Hi</i></u></span>'
strip_tags(string)  # returns 'Hi'
Salil
@Salil, thanks. My task is much harder than this: what I want is not just to strip tags, but to get the substrings between some keywords.
Freewind
+1  A: 

I think you'll have to build the function yourself. Something like:

def substrings_between str, opening, ending
  i_opening = str.index opening
  i_ending = str.index ending
  res = []
  while i_opening && i_ending
    res << str[i_opening+opening.length .. i_ending]
    str = str[i_ending+ending.length .. -1]
    i_opening = str.index opening
    i_ending = str.index ending
  end
  res
end

(This code isn't very Ruby-like, but it works well.)

paradoja
@paradoja, thanks. There is a minor mistake: `i_ending = str.index ending` should be `i_ending = str.index ending, i_opening+opening.length`
Freewind
@Freewind, what do you mean? The code as it is seems to work for me, and changing i_ending = str.index ending for str.index ending, i_opening+opening.length gives an error (and I don't understand what you intend).
paradoja
@paradoja, please try `substrings_between "abcba", "b", "a"`, the result is `["", "cba"]`. I think the correct result should be `["cb"]`
Freewind
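The point raised in the comments above (the ending delimiter has to be searched for only after the opening one) can be illustrated with a small sketch. This is not paradoja's code but a hypothetical variant: the names `pos` and `content_start` are made up for illustration, and it assumes both delimiters are non-empty strings. It also uses an exclusive range (`...`) so that the first character of the ending delimiter is not included in the result.

```ruby
# Hypothetical variant of substrings_between (a sketch, not the original answer).
# Assumes opening and ending are non-empty strings.
def substrings_between(str, opening, ending)
  res = []
  pos = 0 # where the next search for an opening delimiter starts
  while (i_opening = str.index(opening, pos))
    content_start = i_opening + opening.length
    # Search for the ending delimiter only after the opening delimiter.
    i_ending = str.index(ending, content_start)
    break unless i_ending
    # Exclusive range: stop just before the ending delimiter.
    res << str[content_start...i_ending]
    # Continue scanning after the ending delimiter; str itself is not modified.
    pos = i_ending + ending.length
  end
  res
end

substrings_between("abcba", "b", "a") # => ["cb"]
```

Because the search position is tracked in `pos`, the input string is never sliced or reassigned inside the loop.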
+1  A: 

I think the function you're looking for is probably too specific to be in the Ruby distribution.

We can probably assemble it using

String#index(string, offset)

Then we could write something like this (extending String):

class String
  def delimited_strings(start_delim, end_delim)
    strings = []
    starts_at = index(start_delim) 
    return strings unless starts_at
    ends_at = index(end_delim, starts_at + start_delim.size)
    while starts_at && ends_at do
      strings << self[starts_at+start_delim.size...ends_at]
      # Resume the search after the end delimiter that was just matched.
      starts_at = index(start_delim, ends_at + end_delim.size)
      ends_at = index(end_delim, starts_at + start_delim.size) if starts_at
    end
    strings
  end
end

s = "<p>aaa<font>xxx</font></p><p>bbb<font>xxx</font></p><p>ccc<font>xxx</font></p>"
s.delimited_strings("<p>", "<font") #=> ["aaa", "bbb", "ccc"]
Mike Woodhouse
@Mike, thank you! Is there a minor mistake? Shouldn't `ends_at = index(end_delim, starts_at+1)` be `ends_at = index(end_delim, starts_at + start_delim.length)`? I think `+1` is not enough if you consider `start_delim=abc, end_delim=bc`.
Freewind
@Freewind - eek, you're right. Inadequate test conditions. Code changed to (hopefully) be more resilient...
Mike Woodhouse
+2  A: 
Steve Weet
@Steve, thanks for your detailed answer, you taught me some useful techniques. What I want here is "substring**s**_between", but thank you all the same.
Freewind
+3  A: 

Ideally you should parse HTML using a proper parser, like Nokogiri.

That said, if you know for certain that what you need is located between two hard-coded strings, you could use scan and a regular expression:

string = '<p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
          <p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
          <p>ccc <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>'

before = Regexp.escape '<p>'
after  = Regexp.escape ' <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>'

substrings = string.scan(/#{before}(.*?)#{after}/).flatten
# => ["aaa", "bbb", "ccc"]
Lars Haugseth
D'oh! I forgot about `Regexp.escape`!
Mike Woodhouse
@Lars, your code looks simple and nice, thank you:)
Freewind
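For reference, the scan approach above could be wrapped into the kind of helper the question asks for. This is only a sketch: the method name `substrings_between` is taken from the question, not from any Ruby library, and the `/m` flag is an added assumption so that `.` also matches newlines when the wanted text spans several lines.

```ruby
# Hypothetical helper built on String#scan and Regexp.escape (a sketch).
def substrings_between(text, before, after)
  # Escape both delimiters so they are matched literally, not as regex syntax.
  pattern = /#{Regexp.escape(before)}(.*?)#{Regexp.escape(after)}/m
  # scan returns one single-element array per match; flatten unwraps them.
  text.scan(pattern).flatten
end

substrings_between("abcba", "b", "a") # => ["cb"]
```

The lazy quantifier `.*?` makes each match stop at the first occurrence of the closing delimiter rather than the last one.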