at first i thought i’d recommend trying to build a custom letter class from python’s `unicodedata.category` function, which, given a character, tells you which class that codepoint is assigned to according to the unicode character categories; this would tell you whether a codepoint is e.g. an uppercase letter, a lowercase letter, a digit, or something else.
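for reference, this is roughly what such a check could look like (`is_letter` is just an illustrative name, not anything in the stdlib):

```python
import unicodedata

# general categories starting with 'L' are letters: Lu (uppercase),
# Ll (lowercase), Lt (titlecase), Lm (modifier), Lo (other)
def is_letter(ch):
    return unicodedata.category(ch).startswith('L')

print(unicodedata.category('A'))       # Lu
print(unicodedata.category('ß'))       # Ll
print(unicodedata.category('7'))       # Nd
print(is_letter('Б'), is_letter('7'))  # True False -- cyrillic works out of the box
```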
on second thought, and reminiscent of an answer i gave the other day, let me suggest another approach. there are many implicit assumptions we have to get rid of when going from national to global; one of them is certainly that ‘a character equals a byte’, and another is that ‘a person’s name is made up of letters, and i know what the possible letters are’. unicode is vast, and the eu currently has 23 official languages written in three alphabets; figuring out exactly which characters are used for each language will involve quite a bit of work. greek uses those fancy apostrophes and is distributed across at least 367 codepoints; bulgarian uses the cyrillic alphabet, with a slew of extra characters unique to the language.
so why not simply turn the tables and take advantage of the larger context those names appear in? i browsed through some sample data, and it looks like the general pattern for MEP names is `LASTNAME, Firstname`, with (1) the last name in (almost) all upper case, (2) a comma and a space, and (3) the given names in ordinary case. this even holds in more ‘deviant’ examples like `GERINGER de OEDENBERG, Lidia Joanna`, `GALLAGHER, Pat the Cope` (wow), or `McGUINNESS, Mairead`. it would take some work to recover the ordinary case from the last names (maybe leave all the lower case letters in place, and lower-case any capital letter that is preceded by another capital letter; see the sketch below), but extracting the names is, in fact, simple:
fullname := lastname ", " firstname
lastname := character+
firstname := character+
that’s right: since the EUP was nice enough to present each name enclosed in its own HTML tag, you already know the name’s maximum extent, so you can just cut out that whole span and split it into two parts. as i see it, all you have to look for is the first occurrence of the sequence comma, space: everything before it is the last name, everything after it the given names of the person. i call that the ‘silhouette approach’, since it’s like looking at the negative, the outline, rather than the positive, what the form is made up from.
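here’s a minimal sketch of both steps, assuming the name text has already been pulled out of its tag; `split_mep_name` and `recover_case` are just my names for them, and the case-recovery rule is the untested heuristic from above:

```python
def split_mep_name(fullname):
    # partition() splits at the first occurrence of ', ': everything
    # before it is the last name, everything after it the given names
    lastname, _, firstname = fullname.partition(', ')
    return lastname, firstname

def recover_case(lastname):
    # heuristic: keep lower case letters as they are, and lower-case
    # any capital letter that is preceded by another capital letter
    out = []
    for prev, ch in zip(' ' + lastname, lastname):
        out.append(ch.lower() if prev.isupper() and ch.isupper() else ch)
    return ''.join(out)

lastname, firstname = split_mep_name('GERINGER de OEDENBERG, Lidia Joanna')
print(lastname, '/', firstname)    # GERINGER de OEDENBERG / Lidia Joanna
print(recover_case(lastname))      # Geringer de Oedenberg
print(recover_case('McGUINNESS'))  # McGuinness
```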
as has been noted earlier, some names use hyphens, and there are several codepoints in unicode that look like hyphens; let’s hope the typists over there in brussels were consistent in their usage. ah, and there are many surnames using apostrophes, like `d'Hondt` or `d'Alembert`. happy hunting: possible incarnations include U+0060, U+00B4, U+0027, U+02BC, and a fair number of look-alikes. most of these codepoints would be ‘wrong’ to use in surnames, but when was the last time you saw those characters used correctly?
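if you want to defend against that, a small normalization table is one way to do it; the sets of look-alikes below are a starting point i picked, not an exhaustive list:

```python
# map apostrophe and hyphen look-alikes to one canonical codepoint each;
# extend these sets as you meet more variants in the data
APOSTROPHE_LOOKALIKES = '\u0060\u00b4\u02bc\u2018\u2019'  # ` ´ ʼ ‘ ’
HYPHEN_LOOKALIKES = '\u2010\u2011\u2013\u2212'            # ‐ ‑ – −

TRANSLATION = {ord(c): "'" for c in APOSTROPHE_LOOKALIKES}
TRANSLATION.update({ord(c): '-' for c in HYPHEN_LOOKALIKES})

def normalize_name(name):
    return name.translate(TRANSLATION)

print(normalize_name('d\u2019Hondt'))  # d'Hondt, now with a plain U+0027
```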
i somewhat distrust that `alphanums + alphas8bit + extended_chars + special_chars` pattern; at least the `alphanums` part is a tad bogus, as it seems to include digits (which ones? unicode defines a few hundred digit characters), and that `alphas8bit` thingy reeks of a solution made for another time. unicode conceptually works in a 32-bit space; what is 8-bit supposed to mean here? letters found in codepage 852? c’mon, this is 2010.
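to illustrate the digit point: python’s `unicodedata` recognizes decimal digits well beyond ASCII, so any character class built on `0-9` alone is making a silent assumption:

```python
import unicodedata

# all of these carry the general category 'Nd' (decimal digit),
# yet only the first would match an ASCII-only digit class
for ch in '3\u0663\u09e9\u0f23':  # '3', arabic-indic, bengali, tibetan
    print(hex(ord(ch)), unicodedata.category(ch), unicodedata.digit(ch))
```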
ah, and looking back, i see you seem to be parsing the HTML with pyparsing. don’t do that; use e.g. beautiful soup for sorting out the markup. it’s quite good at dealing even with faulty HTML (most HTML in the wild does not validate), and once you get your head around its admittedly wonderlandish API (all you ever need is probably the `find()` method), it will be simple to fish out exactly those snippets of text you’re looking for.
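i don’t know the actual markup of the EUP pages, so the tag name and class below are made up; but assuming the `bs4` package, fishing the names out would look something like this:

```python
from bs4 import BeautifulSoup

html = '<li class="mep_name">GALLAGHER, Pat the Cope</li>'  # hypothetical markup

soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('li', class_='mep_name'):  # find() would return only the first hit
    fullname = tag.get_text(strip=True)
    lastname, _, firstname = fullname.partition(', ')
    print(lastname, '/', firstname)  # GALLAGHER / Pat the Cope
```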