ansaurus

Question

How do I parse only foreign characters from the text in an HTML file with regular expressions?

Answer 1

+2 A:

I wouldn't use just regular expressions for this. Down that path lies an angry Tony the Pony.

I'd use an HTML parser in conjunction with regular expressions, though. That way you can distinguish the markup from the non-markup.
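A minimal sketch of that split, assuming Python 3 and the standard library's html.parser rather than any particular third-party parser. The class name TextExtractor, the helper extract_text, the sample markup, and the choice of kana as the "foreign" characters are all made up for illustration:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes only, skipping tags and script/style bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside <script> or <style>
        if not self._skip_depth:
            self.chunks.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Hiragana and Katakana blocks, as one example of "foreign" characters
KANA_RE = re.compile("[\u3040-\u309F\u30A0-\u30FF]+")

sample = '<p class="x">Hello <b>こんにちは</b> and カタカナ</p>'
matches = KANA_RE.findall(extract_text(sample))
print(matches)
```

The parser deals with the markup, so the regular expression only ever sees text content and never has to understand tags or attributes.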

John at CashCommons 2010-08-18 16:46:19

You linked to the question. The answer is [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)

reemrevnivek 2010-08-18 16:54:27

But that takes out some of the fun of finding Tony! ;)

John at CashCommons 2010-08-18 17:45:56

Answer 2

+1 A:

Use BeautifulSoup to get the content that you need, then use a variation on this code to match your characters.

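import re

# Python 3 version. range() excludes its end point, so add 1 to include
# the last code point of each Unicode block.
kataLetters = list(range(0x30A0, 0x30FF + 1))      # Katakana
hiraLetters = list(range(0x3040, 0x309F + 1))      # Hiragana
kataPunctuation = list(range(0x31F0, 0x31FF + 1))  # Katakana Phonetic Extensions

myLetters = kataLetters + kataPunctuation + hiraLetters

# chr() replaces Python 2's unichr()
myLetters = ''.join(chr(aLetter) for aLetter in myLetters)

# Character class matching one or more consecutive kana characters
myRe = re.compile('[' + myLetters + ']+')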

Use the Unicode code charts to get the ranges for your characters.
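If listing every character feels heavy, the same ranges can also be written directly inside the character class. This is only a sketch, assuming Python 3; the helper name build_range_pattern, the BLOCKS list, and the sample string are hypothetical and not part of the code above:

```python
import re

# Inclusive (first, last) code points taken from the Unicode code charts
BLOCKS = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x31F0, 0x31FF),  # Katakana Phonetic Extensions
]


def build_range_pattern(blocks):
    """Build a regex matching runs of characters inside the given blocks."""
    parts = [
        "{}-{}".format(re.escape(chr(first)), re.escape(chr(last)))
        for first, last in blocks
    ]
    return re.compile("[" + "".join(parts) + "]+")


kana_re = build_range_pattern(BLOCKS)
found = kana_re.findall("abc ひらがな 123 カタカナ xyz")
print(found)
```

To target a different script, swap in the first and last code points of its blocks.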

Chinmay Kanchi 2010-08-18 17:15:42
