+5  A: 

You probably want to approach the problem in reverse, i.e. match every run of characters that is not whitespace:

[^ \t\n]*

Or you could add the extra characters to the character class:

[a-zA-Z0-9&;]*

In case you want to match HTML entities, you should try something like:

(\w+|&\w+;)*
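
For what it's worth, here is a rough sketch of how the three patterns behave in Python. The sample string and variable names are made up for illustration, and I've swapped * for + and the capturing group for a non-capturing (?:...) one, so that re.findall returns whole, non-empty matches instead of empty strings and group contents.

```python
import re

# Made-up sample text containing one HTML entity.
sample = "fish &amp; chips, please!"

# 1) Everything that is not whitespace (punctuation stays attached).
non_space = re.findall(r"[^ \t\n]+", sample)

# 2) Letters and digits plus the extra characters '&' and ';'.
extended_class = re.findall(r"[a-zA-Z0-9&;]+", sample)

# 3) Runs of words and/or complete named entities.
words_or_entities = re.findall(r"(?:\w+|&\w+;)+", sample)

print(non_space)          # ['fish', '&amp;', 'chips,', 'please!']
print(extended_class)     # ['fish', '&amp;', 'chips', 'please']
print(words_or_entities)  # ['fish', '&amp;', 'chips', 'please']
```

Note that the second pattern would also accept a stray '&' or ';' on its own, while the third only accepts '&' as part of a complete entity.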
PierreBdR
Sklivvz
Well, that's not what the question says ...
PierreBdR
The last one almost worked, so I worked from there. (\\W+)* did the trick for me.
kari.patila
+2  A: 

You should make a character class that includes the extra characters. For example:

split = re.compile(r'[\w&;]+')

This should do the trick. For your information:

  • \w (lowercase 'w') matches word characters (letters, digits, and the underscore)
  • \W (capital 'W') is the negated class: it matches any character that \w does not match
  • * matches zero or more times and + matches one or more times, so a pattern ending in * also matches the empty string (even where there are no characters to match).
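
A small sketch of the points above, in case it helps. The sample strings and variable names are only for illustration.

```python
import re

sample = "a&b"

# \w picks out the word characters, \W the characters in between.
word_runs = re.findall(r"\w+", sample)
non_word_runs = re.findall(r"\W+", sample)
print(word_runs)      # ['a', 'b']
print(non_word_runs)  # ['&']

# '*' is satisfied by zero characters, so it "matches" even where
# there is nothing to match; '+' needs at least one character.
star_match = re.match(r"\w*", "&")
plus_match = re.match(r"\w+", "&")
print(repr(star_match.group(0)))  # ''
print(plus_match)                 # None
```

This is why a pattern built only from * can appear to succeed on any input: it is happy with an empty match.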
Steven Oxley
+6  A: 

I would treat the entities as a unit (since they also can contain numerical character codes), resulting in the following regular expression:

(\w|&(#(x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+

This matches

  • either a word character (including “_”), or
  • an HTML entity consisting of
    • the character “&”,
      • the character “#”,
        • the character “x” followed by at least one hexadecimal digit, or
        • at least one decimal digit, or
      • at least one letter (= named entity),
    • a semicolon
  • at least once.
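
As a rough illustration, this is how the expression could be tried out in Python. The name ENTITY_AWARE and the sample text are made up, and the groups are written as non-capturing (?:...) so that findall returns the whole match; otherwise the pattern is the same as above.

```python
import re

# Same expression as above, with non-capturing groups.
ENTITY_AWARE = re.compile(r"(?:\w|&(?:#(?:x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+")

# Made-up sample: a named entity, a decimal and a hexadecimal
# character reference, and a bare ampersand.
text = "caf&eacute; &#233; &#xE9; R&D"

tokens = ENTITY_AWARE.findall(text)
print(tokens)  # ['caf&eacute;', '&#233;', '&#xE9;', 'R', 'D']
```

The bare "&" in "R&D" is not part of any token, because it is not followed by a complete entity. Named entities containing uppercase letters or digits would need a wider class than [a-z].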

/EDIT: Thanks to ΤΖΩΤΖΙΟΥ for pointing out an error.

Konrad Rudolph
ΤΖΩΤΖΙΟΥ
&#1234; &#xCAFE; &auml;
MizardX
A: 

Looks like this did the trick:

split=re.compile('(\\W+&\\W+;)*')

Thanks for the suggestions. Most of them worked fine on Reggy, but I don't quite understand why they failed with re.compile.

kari.patila
That doesn't do what you want. Try matching "<title>". 1) use |; and use r' strings. r'(\W+|)*'
S.Lott
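
A quick sketch of the "<title>" test suggested in the comment above, assuming Python's re module. The variable names are made up. It also shows that the doubled backslashes and a raw string spell the same pattern, so the escaping itself is probably not what changes the result.

```python
import re

# '\\W+' in a normal string and r'\W+' in a raw string are identical.
same_spelling = ("\\W+" == r"\W+")
print(same_spelling)  # True

# The pattern from the answer, written as a raw string.
asker_pattern = re.compile(r"(\W+&\W+;)*")

# On "<title>" it only "succeeds" with an empty match at position 0,
# because the trailing * allows zero repetitions.
match = asker_pattern.match("<title>")
print(repr(match.group(0)))  # ''

# Uppercase \W selects the non-word characters, lowercase \w the word itself.
print(re.findall(r"\W+", "<title>"))  # ['<', '>']
print(re.findall(r"\w+", "<title>"))  # ['title']
```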