Is there a more succinct/correct/pythonic way to do the following:

url = "http://0.0.0.0:3000/authenticate/login"
re_token = re.compile("<[^>]*authenticity_token[^>]*value=\"([^\"]*)")
for line in urllib2.urlopen(url):
    if re_token.match(line):
        token = re_token.findall(line)[0]
        break

I want to get the value of the input tag named "authenticity_token" from an HTML page:

<input name="authenticity_token" type="hidden" value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4=" />
+6  A: 

Could you use Beautiful Soup for this? The code would essentially look something like so:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://0.0.0.0:3000/authenticate/login"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
# find() returns the whole <input> tag; index it to read the value attribute
token = soup.find("input", {'name': 'authenticity_token'})['value']

Something like that should work. I didn't test this but you can read the documentation to get it exact.

Bartek
+1  A: 

You don't need the findall call. Instead use:

m = re_token.match(line)
if m:
    token = m.group(1)
    ....

I second the recommendation of BeautifulSoup over regular expressions though.
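
A minimal sketch of that suggestion, run against the sample tag from the question rather than a live page. It is written for Python 3, and the `extract_token` helper name is hypothetical. It also swaps `match` for `search`: `match` only succeeds when the pattern matches at the very start of the string, so an indented tag would be missed.

```python
import re

# Same pattern as in the question, written as a raw string.
re_token = re.compile(r'<[^>]*authenticity_token[^>]*value="([^"]*)')

def extract_token(lines):
    """Return the first captured token value, or None if no line matches."""
    for line in lines:
        m = re_token.search(line)  # search scans the whole line
        if m:
            return m.group(1)      # group(1) is the captured value
    return None

# Stand-in for the lines of the fetched page (assumed sample data).
sample = [
    "<form>",
    '    <input name="authenticity_token" type="hidden" value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4=" />',
    "</form>",
]
token = extract_token(sample)
```

Returning from the helper replaces the `break`, and a page without the tag yields `None` instead of leaving `token` undefined.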

interjay
+1  A: 

There's nothing "pythonic" about using a regex. If you don't want to use BeautifulSoup (which ideally you should), just use Python's excellent string manipulation capabilities:

for line in open("file"):
    line=line.strip()
    if "<input name" in line and "value=" in line:
        item=line.split()
        for i in item:
            if "value" in i:
                print i

output

$ more file
<input name="authenticity_token" type="hidden" value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4=" />
$ python script.py
value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4="
ghostdog74
This code is terrible... worse than the original IMHO (though of course an actual parser like BS is the way to go). You should almost never have quad nested statements like this. The original had two, and you doubled it.
Andrew Johnson
And you introduced a bunch of random string literals.
Andrew Johnson
You should take a look at my output before you comment. I am doing it on a file with only the sample line the OP posted, just to show that you can use Python's built-in string capabilities without much regex. What quad nested statements and random string literals are you talking about? If you have a better solution, then please post it.
ghostdog74
Your code nests for->if->for->if, and is indented four times. The string literals are "<input name", "value=", and "value"... I read this whole thread, and the accepted answer is a good solution. No reason to be mucking around with string manipulation on this. The code in your answer is both hard to interpret and fragile... I learned this myself the hard way.
Andrew Johnson
So what if it's indented four times? The first if tests for the "almost" exact line to get. Then, once the line is grabbed, it is split into items, which are iterated over to find "value" (because we don't know where value might be). There's no use of regex in this case. What's wrong with that? Like I already said, the OP should use BS if possible, but my solution also applies when one doesn't want to use BS.
ghostdog74
A: 

As to why you shouldn't use regular expressions to search HTML, there are two main reasons.

The first is that HTML is defined recursively, and regular expressions, which compile into stackless state machines, don't do recursion. You can't write a regular expression that can tell, when it encounters an end tag, which of the start tags it passed on the way there that end tag belongs to; there's nowhere to save that information.
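
A small sketch of what that looks like in practice, using Python's `re` module on two made-up snippets (the `div` markup is an assumption chosen for illustration). Neither the lazy nor the greedy pattern can pair an end tag with its own start tag:

```python
import re

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>one</div><div>two</div>"

# Non-greedy: stops at the first </div>, which actually closes the inner div,
# so the captured "content" of the outer div is cut short.
lazy = re.search(r"<div>(.*?)</div>", nested).group(1)

# Greedy: runs to the last </div>, so two separate elements are
# swallowed as if they were one.
greedy = re.search(r"<div>(.*)</div>", siblings).group(1)
```

Here `lazy` is `"outer <div>inner"` and `greedy` is `"one</div><div>two"`; the pattern has no way to count how many tags are still open.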

The second is that parsing HTML (which BeautifulSoup does) normalizes all kinds of things that are allowable in HTML and that you're probably not going to ever consider in your regular expressions. To pick a trivial example, what you're trying to parse:

<input name="authenticity_token" type="hidden" value="xxx"/>

could just as easily be:

<input name='authenticity_token' type="hidden" value="xxx"/>

or

<input type = "hidden" value = "xxx" name = 'authenticity_token' />

or any one of a hundred other permutations that I'm not thinking about right now.
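
As a rough illustration, here is a sketch that feeds those three variants to a real parser. It uses `html.parser` from the Python 3 standard library instead of BeautifulSoup so that it runs without extra packages; the `TokenFinder` class and `find_token` helper are hypothetical names. The question's regex is run over the same input for comparison.

```python
import re
from html.parser import HTMLParser

class TokenFinder(HTMLParser):
    """Records the value of the first input named authenticity_token."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes removed.
        attributes = dict(attrs)
        if (self.token is None and tag == "input"
                and attributes.get("name") == "authenticity_token"):
            self.token = attributes.get("value")

def find_token(html):
    finder = TokenFinder()
    finder.feed(html)
    finder.close()
    return finder.token

variants = [
    '<input name="authenticity_token" type="hidden" value="xxx"/>',
    "<input name='authenticity_token' type=\"hidden\" value=\"xxx\"/>",
    "<input type = \"hidden\" value = \"xxx\" name = 'authenticity_token' />",
]
tokens = [find_token(v) for v in variants]

# The pattern from the question, applied to the same variants.
re_token = re.compile(r'<[^>]*authenticity_token[^>]*value="([^"]*)')
regex_hits = [bool(re_token.search(v)) for v in variants]
```

The parser returns `"xxx"` for all three variants, while the regex finds nothing in the third one, where `value` comes before `name` and has spaces around the `=`.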

Robert Rossney