Dear Experts

EDIT: Thanks a lot for all the answers and points raised. As a novice I am a bit overwhelmed, but it is a great motivation to keep learning Python!!

I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians; however, due to the many Eastern European names and the accents they use, I get a lot of missing entries. Here is an example of what is giving me trouble (notice the accented character at the end of the family name):

<td class="listcontentlight_left">
<a href="/members/expert/alphaOrder/view.do?language=EN&amp;id=28276" title="ANDRIKIENĖ, Laima Liucija">ANDRIKIENĖ, Laima Liucija</a>
<br/>
Group of the European People's Party (Christian Democrats)
<br/>
</td>

So far I have been using pyparsing and the following code:

#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

for name in names.searchString(page):
    print(name)

However this does not catch the name from the HTML above. Any advice on how to proceed?

Best, Thomas

P.S.: Here is all the code I have so far:

# -*- coding: utf-8 -*-

import urllib.request
from pyparsing_py3 import *

page = urllib.request.urlopen("http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN")
page = page.read().decode("utf8")


#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

for name in names.searchString(page):
    print(name)
+1  A: 

Looks like you've got some kind of encoding problem if you are getting western European names OK (they have lots of accents etc. also!). Show us all of your code plus the URL of a typical page that you are trying to scrape and that has the East-only problem. Displaying the piece of HTML that you have is not much use; we have no idea what transformations it has been through; at the very least, use the result of the repr() function.

Update The offending character in that MEP's name is U+0116 (LATIN CAPITAL LETTER E WITH DOT ABOVE), so it is not included in pyparsing's "alphanums + alphas8bit". The Westies (latin-1) will all fit in what you've got already. I know little about pyparsing; you'll need to find a pyparsing expression that includes ALL Unicode alphabetics ... not just Latin-n, in case they start using Cyrillic for the Bulgarian MEPs instead of the current transcription into ASCII :-)
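For instance, a minimal illustration of the repr() trick on the name in question (in Python 3, ascii() shows the escape sequences that Python 2's repr() used to show; the snippet variable is hypothetical):

snippet = "ANDRIKIENĖ, Laima Liucija"  # a piece of the decoded page
print(repr(snippet))   # Python 3 repr() shows the character itself
print(ascii(snippet))  # 'ANDRIKIEN\u0116, Laima Liucija' -- exposes U+0116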

Other observations:

(1) alphaNUMs ... digits in a name?
(2) names may include apostrophes and hyphens, e.g. O'Reilly, Foughbarre-Smith

John Machin
Hi John, I have added all the code I have written so far to my original question. The example is copy-pasted from the URL listed in the code, which gives me problems. You are right that I get all the German, French and Spanish accents...
Thomas Jensen
Thanks a lot for the help John, I am a novice with this so I really appreciate the hints!!
Thomas Jensen
+2  A: 

I was able to show 31 names starting with A with this code:

extended_chars = srange(r"[\0x80-\0x7FF]")
special_chars = " -'"  # space, hyphen, apostrophe
name = Word(alphanums + alphas8bit + extended_chars + special_chars)

As John noticed, you need more Unicode characters (extended_chars), and some names contain hyphens etc. (special_chars). Count how many names you get, and check whether the page shows the same count as I got for 'A'.

The range 0x80-0x87F covers the 2-byte sequences in UTF-8 of probably all European languages. Among the pyparsing examples there is greetingInGreek.py for Greek, and another example parses Korean text.

If 2 bytes are not enough then try:

extended_chars = ''.join(chr(c) for c in range(127, 65536))  # Python 3; the original used unichr/xrange from Python 2
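
Putting the pieces together, a minimal sketch of the whole parser with the widened character set might look like this (the URL is the one from the question; on current Python 3 the package is simply pyparsing rather than pyparsing_py3):

# -*- coding: utf-8 -*-
import urllib.request
from pyparsing import Word, ZeroOrMore, Suppress, srange, alphanums, alphas8bit

page = urllib.request.urlopen(
    "http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=A&language=EN"
).read().decode("utf8")

extended_chars = srange(r"[\0x80-\0x7FF]")  # 2-byte UTF-8 range, see discussion below
special_chars = " -'"                       # space, hyphen, apostrophe
name = Word(alphanums + alphas8bit + extended_chars + special_chars)

begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

for tokens in names.searchString(page):
    print(tokens)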
Michał Niklas
Brilliant, thanks a lot Michal, this was just what I was looking for!
Thomas Jensen
@Michal: (1) They are Unicode characters, not UTF-8. (2) Is `\0x` followed by a variable-length chunk of hex a typo, or does pyparsing have idiosyncratic syntax? (3) U+0080 to U+009F inclusive are C1 control characters, i.e. not alphabetic. (4) If `\0x87F` means the same as U+087F, it's a very weird/arbitrary choice; the range would include all the Latin characters, IPA extensions, spacing modifier letters, combining diacritical marks, Greek, and more; besides, there are no chars defined in the range U+07B1 to U+08FF!
John Machin
@Michal: about "the range 0x80-0x87F covers the 2-byte sequences in UTF-8": 2-byte UTF-8 sequences cover only up to U+07FF (not 087F), and in any case, what is the relevance of how many UTF-8 bytes are involved? `.decode('utf8')` is done once at the start, and the original encoding is then irrelevant.
John Machin
John, you are right. (1) Changed. (2) In the examples there is `koreanChars = srange(r"[\0xac00-\0xd7a3]")`, so I think this is how pyparsing works. (3) and (4) I think the asker should check which alphabets are used and then set `extended_chars` accordingly. Unfortunately there is no pyparsing support for various alphabets.
Michał Niklas
+2  A: 

Are you sure that writing your own parser to pick bits out of HTML is the best option? You might find it easier to use a dedicated HTML parser such as Beautiful Soup, which lets you specify the location you're interested in via the DOM, so pulling the text from the first link inside a table cell with class "listcontentlight_left" is quite easy:

soup = BeautifulSoup(htmlDocument)
cells = soup.findAll("td", "listcontentlight_left")
for cell in cells:
    print(cell.a.string)
Andrew Aylett
They make some cells light and some dark, so `cells = soup.findAll("td", "listcontentlight_left") + soup.findAll("td", "listcontentdark_left")` will get all the names. I also think it is better to use Beautiful Soup than pyparsing here. +1
Michał Niklas
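For reference, a complete sketch along these lines might look as follows (assuming the bs4 package on Python 3 and the URL from the question; the class names are the ones visible in the page's markup):

import urllib.request
from bs4 import BeautifulSoup  # pip install beautifulsoup4

url = "http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN"
html = urllib.request.urlopen(url).read().decode("utf8")

soup = BeautifulSoup(html, "html.parser")
# the listing alternates between "light" and "dark" rows, so query both classes
cells = soup.find_all("td", "listcontentlight_left") + \
        soup.find_all("td", "listcontentdark_left")
for cell in cells:
    print(cell.a.string)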
+1  A: 

at first i thought i’d recommend to try and build a custom letter class from python’s unicodedata.category method which, when given a character, will tell you which class that codepoint is assigned to according to the unicode character categories; this would tell you whether a codepoint is e.g. an uppercase or lowercase letter, a digit, or whatever.
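
as a sketch, that letter test could be as simple as this (the unicode general categories for letters all start with 'L'):

import unicodedata

def is_letter(ch):
    # Lu, Ll, Lt, Lm, Lo -- upper, lower, titlecase, modifier and other letters
    return unicodedata.category(ch).startswith("L")

print(is_letter("Ė"))  # True -- U+0116, the character that broke the parser
print(is_letter("4"))  # False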

on second thought, and reminiscent of an answer i gave the other day, let me suggest another approach. there are many implicit assumptions we have to get rid of when going from national to global; one of them is certainly that ‘a character equals a byte’, and another is that ‘a person’s name is made up of letters, and i know what the possible letters are’. unicode is vast, and the eu currently has 23 official languages written in three alphabets; exactly which characters are used for each language will take quite a bit of work to figure out. greek uses those fancy apostrophes and is distributed across at least 367 codepoints; bulgarian uses the cyrillic alphabet with a slew of extra characters unique to the language.

so why not simply turn the tables and take advantage of the larger context those names appear in? i browsed through some sample data and it looks like the general pattern for MEP names is LASTNAME, Firstname, with (1) the last name in (almost) upper case; (2) a comma and a space; (3) the given names in ordinary case. this even holds in more ‘deviant’ examples like GERINGER de OEDENBERG, Lidia Joanna; GALLAGHER, Pat the Cope (wow); McGUINNESS, Mairead. it would take some work to recover the ordinary case from the last names (maybe leave all the lower-case letters in place, and lower-case any capital letter that is preceded by another capital letter), but to extract the names is, in fact, simple:

fullname  := lastname ", " firstname
lastname  := character+
firstname := character+

that’s right—since the EUP was so nice to present names enclosed in an HTML tag, you already know the maximum extent of it, so you can just cut out that maximum extent and split it into two parts. as i see it, all you have to look for is the first occurrence of a comma followed by a space—everything before that is the last name, everything after it the given names of the person. i call that the ‘silhouette approach’ since it’s like looking at the negative, the outline, rather than the positive, what the form is made up from.
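
in python, that split is a one-liner with str.partition, and the case-recovery heuristic from above might look like this (just a sketch, not battle-tested):

def split_mep_name(fullname):
    # everything before the first ", " is the last name, the rest the given names
    lastname, _, firstname = fullname.partition(", ")
    return lastname, firstname

def recover_case(lastname):
    # lower-case any capital letter that follows another capital letter, so
    # GERINGER de OEDENBERG -> Geringer de Oedenberg, McGUINNESS -> McGuinness
    out, prev_upper = [], False
    for ch in lastname:
        out.append(ch.lower() if prev_upper and ch.isupper() else ch)
        prev_upper = ch.isupper()
    return "".join(out)

last, first = split_mep_name("GERINGER de OEDENBERG, Lidia Joanna")
print(recover_case(last), "|", first)  # Geringer de Oedenberg | Lidia Joanna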

as has been noted earlier, some names use hyphens; now, there are several codepoints in unicode that look like hyphens. let’s hope the typists over there in brussels were consistent in their usage. ah, and there are many surnames using apostrophes, like d'Hondt, d'Alembert. happy hunting: possible incarnations include U+0060, U+00B4, U+0027, U+02BC and a fair number of look-alikes. most of these codepoints would be ‘wrong’ to use in surnames, but when was the last time you saw those marks used correctly?

i somewhat distrust that alphanums + alphas8bit + extended_chars + special_chars pattern; at least that alphanums part is a tad bogus, as it seems to include digits (which ones? unicode defines a few hundred digit characters), and that alphas8bit thingy does reek of a solvent made for another time. unicode conceptually works in a 32-bit space. what’s 8bit intended to mean? letters found in codepage 852? c’mon, this is 2010.

ah, and looking back i see you seem to be parsing the HTML with pyparsing. don’t do that. use e.g. beautiful soup for sorting out the markup; it’s quite good at dealing even with faulty HTML (most HTML in the wild does not validate), and once you get your head around its admittedly wonderlandish API (all you ever need is probably the find() method), it will be simple to fish out exactly those snippets of text you’re looking for.

flow
A: 

Even though BeautifulSoup is the de facto standard for HTML parsing, pyparsing has some alternative approaches that lend themselves to HTML too (certainly a leg up over brute-force regexes). One function in particular is makeHTMLTags, which takes a single string argument (the base tag) and returns a 2-tuple of pyparsing expressions, one for the opening tag and one for the closing tag. Note that the opening-tag expression does far more than just return the equivalent of "<" + tag + ">". It also:

  • handles upper/lower casing of the tag itself

  • handles embedded attributes (returning them as named results)

  • handles attribute names that have namespaces

  • handles attribute values in single, double, or no quotes

  • handles empty tags, as indicated by a trailing '/' before the closing '>'

  • can be filtered for specific attributes using the withAttribute parse action

So instead of trying to match the specific name content, I suggest you try matching the surrounding <a> tag, and then accessing the title attribute. Something like this:

aTag, aEnd = makeHTMLTags("a")
for t, _, _ in aTag.scanString(page):
    if ";id=" in t.href:
        print(t.title)

Now you get whatever is in the title attribute, regardless of character set.
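
Spelled out as a runnable sketch (imports and the page fetch added; assuming Python 3 and the URL from the question):

import urllib.request
from pyparsing import makeHTMLTags

url = "http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN"
page = urllib.request.urlopen(url).read().decode("utf8")

aTag, aEnd = makeHTMLTags("a")
for t, _, _ in aTag.scanString(page):
    # only the parliamentarian links carry ";id=" in their href (see comments below)
    if ";id=" in t.href:
        print(t.title)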

Paul McGuire
as i insinuated before, x/html parsing is such an unwieldy beast it’s imho better done by those specialized libraries out there. just throw a well-known package at your x/html and see how you can pry generic data structures out of the cold hands of that html. yeah, your parser handles empty tags, fine. sure you can do that. of course, if writing another x/html parser is what you want to do, just do it. if all you want is to get data out of some html pages, leave it. simple as that. then again, i never tried pyparsing for that purpose. what does it do with non-validating html?
flow
In the example I posted, pyparsing won't care much whether the HTML validates or not. Unlike BS, which reads all the HTML and must comprehend it all in order to get anything, pyparsing takes more of a regex-y scanning approach (I would never encourage anyone to try to write a full HTML parser using pyparsing). Unlike regexen, pyparsing's makeHTMLTags and makeXMLTags built-ins create mini-parsers that can handle the variations in HTML source, most notably attributes, spelling, and upper/lower case. Imagine trying to write a regex that can handle all of the bulleted exceptions in my answer.
Paul McGuire
Wauw, thanks a lot. I feel like I have enough material here for a week of study. If you don't mind, could you explain what the `t,_,_` and the `";id="` do? (It works perfectly, I would just like to know why it works.)
Thomas Jensen
scanString is a generator method that yields a 3-tuple for each match. The elements of the tuple are the matched tokens, the starting location of the tokens, and the ending location of the tokens. Since you are not interested in the location of the tags within the HTML, I just throw those values away by using '_' dummy variables. If it makes it clearer, replace this with `for tagTokens, tagStart, tagEnd in aTag.scanString(...` and replace references to 't' with 'tagTokens'.
Paul McGuire
The use of `";id=" in t.href` is there to try to filter out some of the spurious `<a>` tags that have title attributes but are not really entries for parliamentarians.
Paul McGuire
Thanks for the answers Paul!
Thomas Jensen