Another regexp question

I have input text like this:

test start first end start second end start third end

and I need matches like this:

test first
test second 
test third

I've tried something like this:

start(.*?)end

but how do I add "test" to each match?

Thanks for any suggestions

Lennyd

(edited: there was a mistake in the input text)


Using another programming language is not an option; it has to be done with a regexp alone. I need this for parsing a web page that contains (in part) a structure like this:

Season 1
    Episode 1
    Episode 2
    Episode 3
Season 2
    Episode 1
    Episode 2
...etc

and with this regexp I need output like this:


<episodeslist>
  <episode season="1" episode="1">
  <episode season="1" episode="2">
.. etc

In more detail: it is for an XBMC (xbmc.org) media scraper.

A: 

A very primitive regex would be:

echo "test start first end start second end test third end" |
     perl -ne 'print "$1 -> $2\n" while (/(\w+).*?(\w+) end/g);'
test -> first
start -> second
test -> third

but I agree with Alan Moore that your sample output is a bit weird.
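For comparison, here is a minimal Python sketch that produces the output the question asks for ("test first", "test second", "test third"). It is not a single-regex solution: it assumes the first word of the input is the shared prefix and pairs it with every start ... end match in a second step. The function name `prefixed_matches` is made up for this illustration.

```python
import re


def prefixed_matches(text):
    """Pair the leading word of the input with every start ... end match."""
    # Assumption: the first word of the input is the shared prefix ("test").
    prefix = re.match(r"\s*(\w+)", text).group(1)
    # The lazy group is the asker's start(.*?)end, with surrounding spaces trimmed.
    items = re.findall(r"\bstart\s+(.*?)\s+end\b", text)
    return [f"{prefix} {item}" for item in items]


print(prefixed_matches("test start first end start second end start third end"))
# ['test first', 'test second', 'test third']
```

If the input does not start with a word, `re.match` returns `None` and the sketch raises an `AttributeError`; real code would need to check for that.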

dma_k
FYI, I withdrew my answer. See the edited question and my comment underneath it for the reason.
Alan Moore
In this case you don't need a regex. Regular expressions are good at what they are meant for: parsing strings. In your case you have a list of strings that you would like to convert to XML with minimal parsing. I would use a programming language construct to do the job (iterate through the list and generate the XML). A regex only makes your code hard to read if you misuse it.
dma_k
+1  A: 

Am I the only one who didn't understand what lennyd wants in the first example?

Now for the second one:

input

Season 1
  Episode 1
  Episode 2
  Episode 3

output

<episodeslist>
  <episode season="1" episode="1">
  <episode season="1" episode="2">

Assuming you're using a regex tool with multiline support:

catch
/Season[^0-9]*([0-9]+)[^\n]*[\s]+Episode[^0-9]*([0-9]+)\n/gs
Append as many copies of [\s]+Episode[^0-9]*([0-9]+)\n as needed, one per episode.

return

<episodeslist>
<episode season="$1" episode="$2">
<episode season="$1" episode="$3">
<episode season="$1" episode="$4">
<episode season="$1" episode="$5">

I'm just not sure about [^\n]; use [^E] instead if the input is really that clean.
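As a rough illustration of the catch/return idea above, here is a Python sketch that builds the pattern for a fixed number of episodes and substitutes the captured groups into the output lines. The function name `to_xml` and its `episode_count` parameter are assumptions made for this example; Python writes the backreferences as `\g<1>`, `\g<2>`, ... where the answer uses `$1`, `$2`, ...

```python
import re


def to_xml(text, episode_count):
    """Rewrite each 'Season N' block that has exactly episode_count episodes."""
    season = r"Season[^0-9]*([0-9]+)[^\n]*"
    episode = r"[\s]+Episode[^0-9]*([0-9]+)\n"
    # One copy of the episode sub-pattern per expected episode line.
    pattern = re.compile(season + episode * episode_count)
    # Group 1 is the season number; groups 2, 3, ... are the episode numbers.
    template = "".join(
        '  <episode season="\\g<1>" episode="\\g<%d>">\n' % group
        for group in range(2, episode_count + 2)
    )
    return pattern.sub(template, text)


sample = "Season 1\n  Episode 1\n  Episode 2\n  Episode 3\n"
print(to_xml(sample, 3), end="")
```

The pattern matches at least `episode_count` episode lines, so a season with fewer episodes is left untouched and any extra lines in a longer season are left unconverted. That is why the episode count has to be known in advance with this approach.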

If the number of episodes varies between 24 and 26, just run 3 regexes, one per episode count.

If you want something more flexible, you'll need a more powerful tool, such as grep on Linux or one of its GUI clones for other operating systems, that can run a "regex inside a regex".

If it's a scripting language running the regex functions, you could easily wrap the following in a loop that runs until the input no longer matches anything:
{

1 - Match only `Season[^0-9]*([0-9]+)`, strip it off the input, and store the season number in a variable,  
2 - Match a block of episodes `([\s]+Episode[^0-9]*[0-9]+\n)+`  
3 - Then inside that block match single lines `[\s]+Episode[^0-9]*[0-9]+`  
4 - Using the season variable, output the appropriate XML  

}
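The loop above could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the function name `episodes_to_xml` is made up, the trailing `\n` is made optional so a last line without a newline still matches, a capture group is added around the episode number in step 3, and the closing `</episodeslist>` tag is a guess, since the question only shows the opening tag.

```python
import re

SEASON = re.compile(r"Season[^0-9]*([0-9]+)")
# Step 2: a run of consecutive episode lines (the trailing newline is optional here).
BLOCK = re.compile(r"(?:[\s]+Episode[^0-9]*[0-9]+\n?)+")
# Step 3: a single episode line, with the number captured.
EPISODE = re.compile(r"[\s]+Episode[^0-9]*([0-9]+)")


def episodes_to_xml(text):
    lines = ["<episodeslist>"]
    rest = text
    while True:
        # Step 1: find the next season header and strip it off the input.
        season = SEASON.search(rest)
        if season is None:
            break  # nothing left to match
        rest = rest[season.end():]
        # Step 2: match the block of episodes that follows the header.
        block = BLOCK.match(rest)
        if block is None:
            continue  # a season without episodes
        # Steps 3 and 4: one output line per episode in the block.
        for number in EPISODE.findall(block.group(0)):
            lines.append(
                '  <episode season="%s" episode="%s">' % (season.group(1), number)
            )
        rest = rest[block.end():]
    lines.append("</episodeslist>")  # assumed closing tag
    return "\n".join(lines)


page = (
    "Season 1\n    Episode 1\n    Episode 2\n    Episode 3\n"
    "Season 2\n    Episode 1\n    Episode 2\n"
)
print(episodes_to_xml(page))
```

Unlike the fixed-count pattern, this handles any number of episodes per season, at the cost of needing a host language around the regexes.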

Luxvero