Sorry, I know this is probably a duplicate but having searched for 'python regular expression match between' I haven't found anything that answers my question!

The document I'm searching (a long HTML page, to be clear) contains a whole bunch of strings, inside a JavaScript function, that look like this:

link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};

I want to extract the links (i.e. everything between quotes within these strings) - e.g. /Hidden/SidebySideYellow/dei1=1204970159862

To get the links, I know I need to start with:

re.findall(regexp, doc_string)

But what should regexp be?

+1  A: 

I'd start with:

regexp = "'([^']+)'"

Then check whether it works on your document. If the only requirement is that the link sits between single quotes, this should be good as it is.

raceCh-
+3  A: 

The answer to your question depends on what the rest of the string looks like. If every string has the form link: '<URL>'}; then plain string slicing is enough:

myString = "link: '/Hidden/SidebySideGreen/dei1=1204970159862'};"
print( myString[7:-3] )

(If you have one string containing several such lines, you can split it into lines first and slice each line.)
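
For the multi-line case, a minimal sketch might look like the following. It assumes every non-empty line has exactly the form link: '<URL>'}; once surrounding whitespace is stripped; the names doc and links are only illustrative.

```python
# Assumed input: one "link: '<URL>'};" entry per line and nothing else.
doc = """link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""

links = []
for line in doc.splitlines():
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # Drop the 7-character prefix "link: '" and the 3-character suffix "'};"
    links.append(line[7:-3])

print(links)
```

Because the slice positions are hard-coded, this breaks as soon as a line deviates from that form, which is where the regular expression becomes the safer choice.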

If it is a bit more complex, though, using a regular expression is fine. One example that just looks for the URL inside the quotes would be:

import re

myDoc = """link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""

print( re.findall( "'([^']+)'", myDoc ) )

Depending on how the whole string looks, you might have to include the link: as well:

print( re.findall( "link: '([^']+)'", myDoc ) )
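
To illustrate why the link: prefix can matter, here is a small sketch. The title lines are invented for the example (the question does not show what else the JavaScript function contains), and the \s* is an extra assumption that the whitespace after the colon may vary.

```python
import re

# Hypothetical snippet: the "title" lines are made up for illustration.
my_doc = """title: 'Green',
link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
title: 'Yellow',
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""

# Matches every single-quoted string, wanted or not.
all_quoted = re.findall(r"'([^']+)'", my_doc)

# Matches only quoted strings that directly follow "link:";
# \s* tolerates any amount of whitespace after the colon.
links_only = re.findall(r"link:\s*'([^']+)'", my_doc)

print(all_quoted)
print(links_only)
```

Here all_quoted also picks up the two title values, while links_only contains just the two paths.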
poke
A: 

Use a few simple splits

>>> s="link: '/Hidden/SidebySideGreen/dei1=1204970159862'};"
>>> s.split("'")
['link: ', '/Hidden/SidebySideGreen/dei1=1204970159862', '};']
>>> for i in s.split("'"):
...     if "/" in i:
...         print(i)
...
/Hidden/SidebySideGreen/dei1=1204970159862
>>>
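
The same idea can be stretched to a string holding several lines. This is only a sketch, assuming the quotes are balanced so that the pieces at odd positions are the quoted ones; the names doc, pieces and links are illustrative.

```python
doc = """link: '/Hidden/SidebySideGreen/dei1=1204970159862'};
link: '/Hidden/SidebySideYellow/dei1=1204970159862'};"""

# Splitting on the quote character alternates between text outside
# quotes (even indexes) and text inside quotes (odd indexes).
pieces = doc.split("'")

# Keep the quoted pieces that look like paths.
links = [piece for piece in pieces[1::2] if "/" in piece]
print(links)
```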
ghostdog74