views:

6409

answers:

7

I want to grab the value of a hidden input field in HTML.

<input type="hidden" name="fooId" value="12-3456789-1111111111" />

I want to write a regular expression in Python that will return the value of fooId, given that I know the line in the HTML follows the format

<input type="hidden" name="fooId" value="**[id is here]**" />

Can someone provide an example in Python to parse the HTML for the value?

A: 
/<input type="hidden" name="fooId" value="([\d-]+)" \/>/
Pat
+5  A: 

Parsing is one of those areas where you really don't want to roll your own if you can avoid it, as you'll be chasing down the edge-cases and bugs for years go come

I'd recommend using BeautifulSoup. It has a very good reputation and looks from the docs like it's pretty easy to use.

Orion Edwards
I agree for a general case, but if you're doing a one-off script to parse one or two very specific things out, a regex can just make life easier. Obviously more fragile, but if maintainability is a non-issue then it's not a concern. That said, BeautifulSoup is fantastic.
Cody Brocious
I love regex, but have to agree with Orion on this one. This is one of the time when the famous quote from Jamie Zawinski comes to mind: "Now you have two problems"
Justin Standard
+5  A: 
import re
reg = re.compile('<input type="hidden" name="([^"]*)" value="<id>" />')
value = reg.search(inputHTML).group(1)
print 'Value is', value
Cody Brocious
+14  A: 

For this particular case, BeautifulSoup is harder to write than a regex, but it is much more robust... I'm just contributing with the BeautifulSoup example, given that you already know which regexp to use :-)

from BeautifulSoup import BeautifulSoup

#Or retrieve it from the web, etc. 
html_data = open('/yourwebsite/page.html','r').read()

#Create the soup object from the HTML data
soup = new BeautifulSoup(html_data)
fooId = soup.find('input',name='fooId',type='hidden') #Find the proper tag
value = fooId.attrs[2][1] #The value of the third attribute of the desired tag
Vinko Vrsalovic
A: 
/<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>/

>>> import re
>>> s = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
>>> re.match('<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>', s).groups()
('fooId', '12-3456789-1111111111')
Will Boyce
+8  A: 

I agree with Vinko BeautifulSoup is the way to go. However I suggest using fooId['value'] to get the attribute rather than relying on value being the third attribute.

from BeautifulSoup import BeautifulSoup
#Or retrieve it from the web, etc.
html_data = open('/yourwebsite/page.html','r').read()
#Create the soup object from the HTML data
soup = BeautifulSoup(html_data)
fooId = soup.find('input',name='fooId',type='hidden') #Find the proper tag
value = fooId['value'] #The value attribute
A Nony Mouse
'new' ? That's not python!
Aaron Gallagher
+1  A: 

Pyparsing is a good interim step between BeautifulSoup and regex. It is more robust than just regexes, since its HTML tag parsing comprehends variations in case, whitespace, attribute presence/absence/order, but simpler to do this kind of basic tag extraction than using BS.

Your example is especially simple, since everything you are looking for is in the attributes of the opening "input" tag. Here is a pyparsing example showing several variations on your input tag that would give regexes fits, and also shows how NOT to match a tag if it is within a comment:

html = """<html><body>
<input type="hidden" name="fooId" value="**[id is here]**" />
<blah>
<input name="fooId" type="hidden" value="**[id is here too]**" />
<input NAME="fooId" type="hidden" value="**[id is HERE too]**" />
<INPUT NAME="fooId" type="hidden" value="**[and id is even here TOO]**" />
<!--
<input type="hidden" name="fooId" value="**[don't report this id]**" />
-->
<foo>
</body></html>"""

from pyparsing import makeHTMLTags, withAttribute, htmlComment

# use makeHTMLTags to create tag expression - makeHTMLTags returns expressions for
# opening and closing tags, we're only interested in the opening tag
inputTag = makeHTMLTags("input")[0]

# only want input tags with special attributes
inputTag.setParseAction(withAttribute(type="hidden", name="fooId"))

# don't report tags that are commented out
inputTag.ignore(htmlComment)

# use searchString to skip through the input 
foundTags = inputTag.searchString(html)

# dump out first result to show all returned tags and attributes
print foundTags[0].dump()
print

# print out the value attribute for all matched tags
for inpTag in foundTags:
    print inpTag.value

Prints:

['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
- empty: True
- name: fooId
- startInput: ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
  - empty: True
  - name: fooId
  - type: hidden
  - value: **[id is here]**
- type: hidden
- value: **[id is here]**

**[id is here]**
**[id is here too]**
**[id is HERE too]**
**[and id is even here TOO]**

You can see that not only does pyparsing match these unpredictable variations, it returns the data in an object that makes it easy to read out the individual tag attributes and their values.

Paul McGuire