ansaurus

Question

Python RegEx skipping the first few characters?

Answer 1

+1 A:

 x = re.match('.*(<body>.*?</body>)', fileString)

Consider minidom for HTML parsing.

Ewan Todd 2009-10-25 13:18:22

Answer 2

+2 A:

Here is some example code which uses regex to find all the text between <body>...</body> tags. Although this demonstrates some features of Python's re module, note that the Beautiful Soup module is very easy to use and is a better tool if you plan on parsing HTML or XML. (See below for an example of how you could parse this using BeautifulSoup.)

#!/usr/bin/env python
import re

# Here we have a string with a multiline <body>...</body>
fileString='''baz<body>foo
baby foo
baby foo
baby foo
</body><body>bar</body>'''

# re.DOTALL tells re that '.' should match any character, including newlines.
x = re.search('(<body>.*?</body>)', fileString, re.DOTALL)
for match in x.groups():
    print(match)
# <body>foo
# baby foo
# baby foo
# baby foo
# </body>

If you wish to collect all matches, you could use re.findall:

print(re.findall('(<body>.*?</body>)', fileString, re.DOTALL))
# ['<body>foo\nbaby foo\nbaby foo\nbaby foo\n</body>', '<body>bar</body>']

and if you plan to use this pattern more than once, you can pre-compile it:

pat=re.compile('(<body>.*?</body>)', re.DOTALL)
print(pat.findall(fileString))
# ['<body>foo\nbaby foo\nbaby foo\nbaby foo\n</body>', '<body>bar</body>']

And here is how you could do it with BeautifulSoup:

#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup

fileString='''baz<body>foo
baby foo
baby foo
baby foo
</body><body>bar</body>'''
soup = BeautifulSoup(fileString)
print(soup.body)
# <body>foo
# baby foo
# baby foo
# baby foo
# </body>

print(soup.findAll('body'))
# [<body>foo
# baby foo
# baby foo
# baby foo
# </body>, <body>bar</body>]

unutbu 2009-10-25 13:18:43

+1 for findall (which I often find easier to use than search), and because I can't figure out why someone downvoted

foosion 2009-10-25 15:27:45

+1 findall is a great convenience if regex was the correct solution, but as you say, BeautifulSoup is the better solution for parsing HTML

gnibbler 2009-10-25 17:36:23

By default, the DOT does not match newline characters: something worth mentioning, since body tags almost always span more than one line. But, as you already said: an HTML parser would be the way to go.

Bart Kiers 2009-10-25 18:36:21

Thanks for the comments. I've revised my code to handle multi-line strings and added some BeautifulSoup demonstration code.

unutbu 2009-10-25 19:05:39

Answer 3

+1 A:

>>> import re
>>> fileString = '12345<body>dwdwdwdw</body>12345'
>>> x = re.match('.*?(<body>.*</body>)', fileString)
>>> x.group(1)
'<body>dwdwdwdw</body>'

The leading .*? is non-greedy, so this regex skips as few characters as possible before the first "<body>"; only the match between the body tags is greedy, so it extends to the last "</body>" in the string. This is based on the assumption that the header does not contain the string "<body>".
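To make the greedy versus non-greedy distinction concrete, here is a minimal sketch. The input string with two body blocks is a made-up example (it is not from the question), chosen only because the two quantifiers give different results on it:

```python
import re

# Hypothetical input with two body blocks, so the difference is visible.
text = 'head<body>one</body>middle<body>two</body>tail'

# Greedy inner match: runs from the first <body> to the LAST </body>.
greedy = re.match(r'.*?(<body>.*</body>)', text).group(1)

# Non-greedy inner match: stops at the FIRST </body>.
lazy = re.match(r'.*?(<body>.*?</body>)', text).group(1)

print(greedy)  # <body>one</body>middle<body>two</body>
print(lazy)    # <body>one</body>
```

With a single body block, as in the session above, both patterns return the same string.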

A more robust solution that does not rely on regular expressions is to use a parser such as BeautifulSoup, which will even handle slightly malformed HTML.

Stephan202 2009-10-25 13:19:53

Answer 4

+1 A:

x = re.search('(<body>.*</body>)', fileString)
x.group(1)

Less typing than the match answers: re.search scans the whole string, so no leading .* is needed.
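A short sketch of why the leading .* can be dropped, reusing the sample string from Answer 3 (the variable names anchored and scanned are only illustrative). re.match tries the pattern at the start of the string only, while re.search looks for it at every position:

```python
import re

fileString = '12345<body>dwdwdwdw</body>12345'
pattern = r'(<body>.*</body>)'

# re.match only tries position 0, where the string starts with '12345'.
anchored = re.match(pattern, fileString)

# re.search scans forward until the pattern fits somewhere.
scanned = re.search(pattern, fileString)

print(anchored)          # None
print(scanned.group(1))  # <body>dwdwdwdw</body>
```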

foosion 2009-10-25 13:25:40

Answer 5

+6 A:

I don't know Python, but here's a quick example thrown together using Beautiful Soup, which I often see recommended for Python HTML parsing.

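# Import the class, not just the module, so that BeautifulSoup(...) is callable.
from BeautifulSoup import BeautifulSoup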

soup = BeautifulSoup(fileString)

bodyTag = soup.html.body.string

That will (in theory) deal with all the complexities of HTML, which is very difficult with pure regex-based answers, because it's not what regex was designed for.

Peter Boughton 2009-10-25 13:32:09

This is probably a lot more useful than working with Regex. Thanks!

Simon 2009-10-26 11:28:55

Answer 6

+1 A:

Does your fileString contain multiple lines? If so, '.' will not match the newlines by default, so you may need to match them explicitly:

x = re.match(r"(?:.|\n)*(<body>(?:.|\n)*</body>)", fileString)

or, more simply, pass the re.DOTALL flag so that '.' matches newlines too:

x = re.match(r".*(<body>.*</body>)", fileString, re.DOTALL)

x.groups()[0] should contain your string if x is not None.
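As a rough illustration of both points (the flag and the None check), here is a small sketch. The helper name extract_body and the sample string are assumptions made up for this example; the pattern is the re.DOTALL one from above:

```python
import re

def extract_body(html):
    """Return the <body>...</body> span, or None if the pattern does not match."""
    m = re.match(r".*(<body>.*</body>)", html, re.DOTALL)
    # re.match returns None on failure, so check before calling .group().
    return m.group(1) if m is not None else None

# Hypothetical multi-line input.
multi = 'header\n<body>line one\nline two\n</body>\nfooter'

# Without re.DOTALL, '.' stops at the first newline and the match fails.
without_flag = re.match(r".*(<body>.*</body>)", multi)

print(without_flag)           # None
print(extract_body(multi))    # the three-line <body>...</body> span
print(extract_body('plain'))  # None
```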

RedGlyph 2009-10-25 13:41:02

May I enquire why the -1? This does what is asked.

RedGlyph 2009-10-25 15:12:25

+1 for re.DOTALL

foosion 2009-10-25 15:23:04

No idea why the -1, since the DOTALL is a valid point (even if the whole idea of using regex here is questionable), but you're back to zero now. :)

Peter Boughton 2009-10-25 15:23:18

Thanks for that, must have been another reason or totally random I guess ;-) As to the use of a regex, there was simply not enough context to suggest another approach, and the OP seemed to want that. BeautifulSoup is certainly a more interesting idea if installing modules is allowed.

RedGlyph 2009-10-25 15:27:32

Heh, I find it a good policy to assume the OP rarely knows what they want. :) Didn't realise Python modules might require permissions though. Seems like BS is common/useful enough that it should be provided by any decent host.

Peter Boughton 2009-10-25 16:01:43

You may have a point there. And you're right of course, but while I wouldn't hesitate to install such a package for myself, I have often been in the situation where those scripts are provided to clients as part of a package (hard and probably risky to make them modify their setup). I must be suffering a bit from this "don't-touch-anything" syndrome :-)

RedGlyph 2009-10-25 17:18:56

Answer 7

A:

You cannot parse HTML with regex. HTML is not a regular language. Use an HTML parser like lxml instead.
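For readers who cannot install lxml, here is a minimal sketch of the parser approach that swaps in Python 3's standard-library html.parser instead. The class name BodyTextCollector, the helper body_text, and the sample markup are all made up for illustration. It collects the text inside the body element rather than the raw markup:

```python
from html.parser import HTMLParser

class BodyTextCollector(HTMLParser):
    """Collect the text that appears inside <body>...</body>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside a body element
        self.chunks = []    # text fragments seen inside the body

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BODY> arrives as 'body'.
        if tag == 'body':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'body' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def body_text(html):
    parser = BodyTextCollector()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)

# Hypothetical input: upper-case tag with an attribute, which a literal
# '<body>' regex would not match.
sample = '<html><head><title>t</title></head><BODY class="x">foo\n<b>bar</b></BODY></html>'
print(repr(body_text(sample)))  # 'foo\nbar'
```

A dedicated library such as lxml or BeautifulSoup will generally cope better with badly broken markup; this sketch only shows the general shape of the approach.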

Mike Graham 2009-10-25 15:50:23
