I am trying to parse the keywords from Google Suggest. This is the URL:

http://google.com/complete/search?output=toolbar&q=test

I've done it with PHP using:

'|<CompleteSuggestion><suggestion data="(.*?)"/><num_queries int="(.*?)"/></CompleteSuggestion>|is'

But that won't work with Python's re.match(pattern, string). I tried a few variations, but some raise an error and some return None.

How can I parse that info? I don't want to use minidom because I think a regex will be less code.
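
A note on the likely cause of the None results: re.match only matches at the very start of the string, and the response begins with an XML declaration rather than with <CompleteSuggestion>. The PHP `|...|is` wrapper is also not part of the pattern in Python; the delimiters are dropped and the `i` and `s` modifiers become flags. A minimal sketch, assuming the response looks like the made-up `sample` string below (it is not real output from the URL):

```python
import re

# Hypothetical sample response, written by hand for illustration only.
sample = ('<?xml version="1.0"?><toplevel>'
          '<CompleteSuggestion><suggestion data="test"/>'
          '<num_queries int="686000000"/></CompleteSuggestion>'
          '<CompleteSuggestion><suggestion data="test internet speed"/>'
          '<num_queries int="31800000"/></CompleteSuggestion>'
          '</toplevel>')

# The PHP delimiters (|...|) are dropped; "i" and "s" become flags.
pattern = re.compile(
    r'<CompleteSuggestion><suggestion data="(.*?)"/>'
    r'<num_queries int="(.*?)"/></CompleteSuggestion>',
    re.IGNORECASE | re.DOTALL)

# match() is anchored at position 0, where the string starts with "<?xml".
first = pattern.match(sample)
print(first)  # None

# findall() scans the whole string and returns one tuple per match.
pairs = pattern.findall(sample)
print(pairs)
```

Note that the captured values stay as raw strings, so entities such as `&quot;` are not unescaped and the counts are not converted to integers.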

+2  A: 

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

This is an XML document. Please reconsider using an XML parser. It will be more robust and will probably take you less time in the end, even if it is more code.

Borealid
Could you provide an example of how to use an XML parser in Python? I am in the same situation as with regex.
jahmax
@jahmax: I think Marcelo Cantos, above, has done a solid job of showing a DOM-style XML parser operating in Python.
Borealid
Yeah, I used that one.
jahmax
+5  A: 

You could use etree:

>>> from xml.etree.ElementTree import XMLParser
>>> x = XMLParser()
>>> x.feed('<toplevel><CompleteSuggestion><suggestion data=...')
>>> tree = x.close()
>>> [(e.find('suggestion').get('data'), int(e.find('num_queries').get('int')))
...  for e in tree.findall('CompleteSuggestion')]
[('test internet speed', 31800000), ('test', 686000000), ...]

It is more code than a regex, but it also does more. Specifically, it fetches the entire list of matches in one go and unescapes things like escaped double quotes in the data attribute. It also won't get confused if additional elements start appearing in the XML.
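
For anyone who wants something to paste and run, here is a self-contained sketch of the same idea. It uses ElementTree's fromstring helper instead of feeding an XMLParser by hand. The `sample` string and the `parse_suggestions` name are made up for illustration, and the sketch assumes every CompleteSuggestion element has both a suggestion and a num_queries child:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response, written by hand for illustration only.
# The first suggestion contains escaped double quotes (&quot;).
sample = ('<?xml version="1.0"?><toplevel>'
          '<CompleteSuggestion><suggestion data="test &quot;speed&quot;"/>'
          '<num_queries int="31800000"/></CompleteSuggestion>'
          '<CompleteSuggestion><suggestion data="test"/>'
          '<num_queries int="686000000"/></CompleteSuggestion>'
          '</toplevel>')


def parse_suggestions(xml_text):
    """Return a list of (suggestion, query_count) tuples."""
    root = ET.fromstring(xml_text)  # root is the <toplevel> element
    results = []
    for e in root.findall('CompleteSuggestion'):
        # Assumes both child elements are present on every entry.
        data = e.find('suggestion').get('data')
        count = int(e.find('num_queries').get('int'))
        results.append((data, count))
    return results


suggestions = parse_suggestions(sample)
print(suggestions)
# [('test "speed"', 31800000), ('test', 686000000)]
```

The parser turns `&quot;` back into a plain double quote, which a regex over the raw text would not do on its own.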

Marcelo Cantos