tags:

views:

78

answers:

2

SO, I am trying create a simple regex that matches the following string:

<PRE>><A HREF="../cgi-bin/hgTracks?hgsid=160564920&db=hg18&position=chrX:33267175-33267784&hgPcrResult=pack">chrX:33267175-33267784</A> 610bp TGATGTTTGGCGAGGAACTC GCAGAGTTTGAAGAGCTCGG
TGATGTTTGGCGAGGAACTCtactattgttacacttaggaaaataatcta
atccaaaggctttgcatctgtacagaagagcgagtagatactgaaagaga
tttgcagatccactgttttttaggcaggaagaatgctcgttaaatgcaaa
cgctgctctggctcatgtgtttgctccgaggtataggttttgttcgactg
acgtatcagatagtcagagtggttaccacaccgacgttgtagcagctgca
taataaatgactgaaagaatcatgttaggcatgcccacctaacctaactt
gaatcatgcgaaaggggagctgttggaattcaaatagactttctggttcc
cagcagtcggcagtaatagaatgctttcaggaagatgacagaatcaggag
aaagatgctgttttgcactatcttgatttgttacagcagccaacttattg
gcatgatggagtgacaggaaaaacagctggcatggaaggtaggattatta
aagctattacatcattacaaatacaattagaagctggccatgacaaagca
tatgtttgaacaagcagctgttggtagctggggtttgttgCCGAGCTCTT
CAAACTCTGC
</PRE>

I have created the following regex:

<PRE>[.|[\n]]*</PRE>

yet it won't match the string above. Does anyone have a solution to this conundrum and perhaps a reasoning as toward why this doesn't work.

Sorry about the formatting of this question.

+2  A: 

Stop trying to parse HTML using regexes. You can't do it (robustly). There's a reason there's this famous SO answer. Use lxml instead.

Hank Gay
+1  A: 

If you're going to parse HTML, please use lxml, as Hank proposed.

But for this regex to work, you need to change the [] to (). A | inside square brackets is interpreted as the symbol '|' and not as an OR operator.

Another option is to use the flag that's called DOTALL, which makes the dot operator match anything, including a newline. This way the regex becomes very simple:

m = re.match(r'<PRE>(.*)</PRE>', input_string, re.DOTALL)
m.group(1)

outputs the string inside the PRE, without the < PRE >and< /PRE > themselves.

Ofri Raviv