Hey, I have a fairly basic question about regular expressions. I want to return just the text inside (and including) the body tags. I know the following isn't right, because it also matches all the characters before the opening body tag. How would you go about skipping those?

x = re.match('(.*<body).*?(</body>)', fileString)

Thanks!

+1  A: 
x = re.match('.*(<body>.*?</body>)', fileString)

Consider minidom for parsing, but note that it is an XML parser, so it only works on well-formed (XHTML-style) markup.
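A minimal sketch of the minidom route, assuming the input is well-formed XHTML (minidom raises an error on typical tag-soup HTML). The sample document and the variable names are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical, well-formed sample document (not from the question).
file_string = "<html><head><title>t</title></head><body><p>hello</p></body></html>"

doc = minidom.parseString(file_string)
# getElementsByTagName returns every <body> element; take the first one.
body = doc.getElementsByTagName("body")[0]
# toxml() serializes the element, including its own opening and closing tags.
body_xml = body.toxml()
print(body_xml)  # <body><p>hello</p></body>
```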

Ewan Todd
+2  A: 

Here is some example code which uses regex to find all the text between <body>...</body> tags. Although this demonstrates some features of Python's re module, note that the Beautiful Soup module is very easy to use and is a better tool if you plan on parsing HTML or XML. (See below for an example of how you could parse this using BeautifulSoup.)

#!/usr/bin/env python
import re

# Here we have a string with a multiline <body>...</body>
fileString = '''baz<body>foo
baby foo
baby foo
baby foo
</body><body>bar</body>'''

# re.DOTALL tells re that '.' should match any character, including newlines.
x = re.search('(<body>.*?</body>)', fileString, re.DOTALL)
for match in x.groups():
    print(match)
# <body>foo
# baby foo
# baby foo
# baby foo
# </body>

If you wish to collect all matches, you could use re.findall:

print(re.findall('(<body>.*?</body>)', fileString, re.DOTALL))
# ['<body>foo\nbaby foo\nbaby foo\nbaby foo\n</body>', '<body>bar</body>']

and if you plan to use this pattern more than once, you can pre-compile it:

pat = re.compile('(<body>.*?</body>)', re.DOTALL)
print(pat.findall(fileString))
# ['<body>foo\nbaby foo\nbaby foo\nbaby foo\n</body>', '<body>bar</body>']

And here is how you could do it with BeautifulSoup:

#!/usr/bin/env python
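from bs4 import BeautifulSoup  # Beautiful Soup 4 (pip install beautifulsoup4)

fileString = '''baz<body>foo
baby foo
baby foo
baby foo
</body><body>bar</body>'''
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(fileString, 'html.parser')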
print(soup.body)
# <body>foo
# baby foo
# baby foo
# baby foo
# </body>

print(soup.findAll('body'))
# [<body>foo
# baby foo
# baby foo
# baby foo
# </body>, <body>bar</body>]
unutbu
+1 for findall (which I often find easier to use than search), and because I can't figure out why someone downvoted
foosion
+1 findall is a great convenience if regex was the correct solution, but as you say, BeautifulSoup is the better solution for parsing HTML
gnibbler
By default, the DOT does not match newline characters: something worth mentioning, since body tags almost always span more than one line. But, as you already said, an HTML parser would be the way to go.
Bart Kiers
Thanks for the comments. I've revised my code to handle multi-line strings and added some BeautifulSoup demonstration code.
unutbu
+1  A: 
>>> import re
>>> fileString = '12345<body>dwdwdwdw</body>12345'
>>> x = re.match('.*?(<body>.*</body>)', fileString)
>>> x.group(1)
'<body>dwdwdwdw</body>'

This regex matches as few characters as possible before the body (only the match between the body tags is greedy). It assumes that the header does not contain the string "<body>" and, because re.DOTALL is not set, that everything up to the closing body tag sits on a single line.
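To make the greedy versus lazy distinction concrete, here is a small sketch. The sample string with two body blocks is made up for illustration; with only one body block, both patterns return the same text.

```python
import re

# Hypothetical sample with two body blocks, to make the difference visible.
file_string = "head<body>one</body>middle<body>two</body>tail"

# Greedy: '.*' runs on to the LAST '</body>' in the string.
greedy = re.search(r"<body>.*</body>", file_string, re.DOTALL).group(0)
# Lazy: '.*?' stops at the FIRST '</body>' after the opening tag.
lazy = re.search(r"<body>.*?</body>", file_string, re.DOTALL).group(0)

print(greedy)  # <body>one</body>middle<body>two</body>
print(lazy)    # <body>one</body>
```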

A more robust solution without relying on regular expressions involves using e.g. BeautifulSoup, which will even handle slightly malformed HTML.

Stephan202
+1  A: 
x = re.search('(<body>.*</body>)', fileString)
x.group(1)

Less typing than the match answers: re.search scans the string for the first match, so no leading .* is needed.
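A short sketch of why search needs no leading wildcard, reusing the sample string from the interactive session earlier in the thread. The variable names are illustrative only.

```python
import re

file_string = "12345<body>dwdwdwdw</body>12345"

pattern = r"(<body>.*</body>)"
# re.match only tries position 0, so it fails: the string starts with '12345'.
anchored = re.match(pattern, file_string)
# re.search scans forward until the pattern matches somewhere in the string.
scanned = re.search(pattern, file_string)

print(anchored)          # None
print(scanned.group(1))  # <body>dwdwdwdw</body>
```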

foosion
+6  A: 

I don't know Python, but here's a quick example thrown together using Beautiful Soup, which I often see recommended for Python HTML parsing.

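from bs4 import BeautifulSoup  # the class must be imported, not just the module

soup = BeautifulSoup(fileString, 'html.parser')

bodyTag = soup.body  # the whole <body> element, tags included; None if there is no body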

That will (in theory) deal with all the complexities of HTML, which is very difficult with pure regex-based answers, because it's not what regex was designed for.

Peter Boughton
This is probably a lot more useful than working with Regex. Thanks!
Simon
+1  A: 

Does your fileString contain multiple lines? By default '.' does not match newlines, so in that case you need to match them explicitly:

x = re.match(r"(?:.|\n)*(<body>(?:.|\n)*</body>)", fileString)

or, more simply, pass the re.DOTALL flag:

x = re.match(r".*(<body>.*</body>)", fileString, re.DOTALL)

x.groups()[0] should contain your string if x is not None.
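A minimal sketch that combines the re.DOTALL flag with the None check. The helper name extract_body and the sample strings are hypothetical, and it uses re.search with a lazy quantifier rather than the exact pattern above.

```python
import re

def extract_body(text):
    """Return the first <body>...</body> span, tags included, or None."""
    m = re.search(r"<body>.*?</body>", text, re.DOTALL)
    return m.group(0) if m else None

# Hypothetical multi-line input.
multi = "<html>\n<body>\nline one\nline two\n</body>\n</html>"

# Without re.DOTALL, '.' cannot cross the newlines, so nothing matches.
without_flag = re.search(r"<body>.*?</body>", multi)

print(without_flag)                  # None
print(repr(extract_body(multi)))     # '<body>\nline one\nline two\n</body>'
print(extract_body("no body here"))  # None
```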

RedGlyph
May I enquire why the -1? This does what is asked.
RedGlyph
+1 for re.DOTALL
foosion
No idea why the -1, since the DOTALL is a valid point (even if the whole idea of using regex here is questionable), but you're back to zero now. :)
Peter Boughton
Thanks for that, must have been another reason or totally random I guess ;-) As to the use of a regex, there was simply not enough context to suggest another approach, and the OP seemed to want that. BeautifulSoup is certainly a more interesting idea if installing modules is allowed.
RedGlyph
Heh, I find it a good policy to assume the OP rarely knows what they want. :) Didn't realise Python modules might require permissions though. Seems like BS is common/useful enough that it should be provided by any decent host.
Peter Boughton
You may have a point there. And you're right, of course. But while I wouldn't hesitate to install such a package for myself, I have often been in situations where the scripts are shipped to clients as part of a package, and asking them to modify their setup is hard and probably risky. I must be suffering a bit from this "don't-touch-anything" syndrome :-)
RedGlyph
A: 

You cannot parse HTML with regex. HTML is not a regular language. Use an HTML parser like lxml instead.
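A rough sketch of the parser-based approach. It swaps in the standard library's html.parser instead of lxml, purely to avoid a third-party install; the class name BodyCollector and the sample document are hypothetical. The sample hides a fake body inside a comment, which fools the regex but not the parser. Comments inside the body are dropped by this sketch.

```python
import re
from html.parser import HTMLParser

class BodyCollector(HTMLParser):
    """Rebuild the markup from <body> to </body>, tags included."""

    def __init__(self):
        # Keep entity references such as &amp; as written in the source.
        super().__init__(convert_charrefs=False)
        self.inside = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.inside = True
        if self.inside:
            # get_starttag_text() returns the raw text of the tag just parsed.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/>: emit the raw tag once.
        if self.inside:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.inside:
            self.parts.append(f"</{tag}>")
        if tag == "body":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.inside:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.inside:
            self.parts.append(f"&#{name};")

# Hypothetical sample: the comment in <head> contains a decoy body.
html = ('<html><head><!-- <body>fake</body> --></head>'
        '<body class="x"><p>real &amp; true</p></body></html>')

collector = BodyCollector()
collector.feed(html)
collector.close()
parsed_body = "".join(collector.parts)
regex_body = re.search(r"<body.*?</body>", html, re.DOTALL).group(0)

print(parsed_body)  # <body class="x"><p>real &amp; true</p></body>
print(regex_body)   # <body>fake</body>
```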

Mike Graham