Dear Experts

EDIT: Thanks a lot for all the answers and points raised. As a novice I am a bit overwhelmed, but it is a great motivation to keep learning Python!!

I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians; however, due to the many Eastern European names and the accents they use, I get a lot of missing entries. Here is an example of what is giving me trouble (notice the accented character at the end of the family name):

<td class="listcontentlight_left">
<a href="/members/expert/alphaOrder/view.do?language=EN&amp;id=28276" title="ANDRIKIENĖ, Laima Liucija">ANDRIKIENĖ, Laima Liucija</a>
<br/>
Group of the European People's Party (Christian Democrats)
<br/>
</td>

So far I have been using pyparsing and the following code:

#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

for name in names.searchString(page):
    print(name)

However this does not catch the name from the HTML above. Any advice on how to proceed?

Best, Thomas

P.S.: Here is all the code I have so far:

# -*- coding: utf-8 -*-

import urllib.request
from pyparsing_py3 import *

page = urllib.request.urlopen("http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN")
page = page.read().decode("utf8")


#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

for name in names.searchString(page):
    print(name)
+1  A: 

Looks like you've got some kind of encoding problem if you are getting western European names OK (they have lots of accents etc. also!). Show us all of your code plus the URL of a typical page that you are trying to scrape and that has the East-only problem. Displaying the piece of HTML that you have is not much use; we have no idea what transformations it has been through; at the very least, use the result of the repr() function.

Update The offending character in that MEP's name is U+0116 (LATIN CAPITAL LETTER E WITH DOT ABOVE), so it is not included in pyparsing's "alphanums + alphas8bit". The Westies (latin-1) will all fit in what you've got already. I know little about pyparsing; you'll need to find a pyparsing expression that includes ALL Unicode alphabetics ... not just Latin-n, in case they start using Cyrillic for the Bulgarian MEPs instead of the current transcription into ASCII :-)
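For instance, a minimal illustration of the repr() trick on the name in question (in Python 3, ascii() shows the escape sequences that Python 2's repr() used to show; the snippet variable is hypothetical):

snippet = "ANDRIKIENĖ, Laima Liucija"  # a piece of the decoded page
print(repr(snippet))   # Python 3 repr() shows the character itself
print(ascii(snippet))  # 'ANDRIKIEN\u0116, Laima Liucija' -- exposes U+0116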

Other observations:

(1) alphaNUMs ... digits in a name?
(2) names may include apostrophes and hyphens, e.g. O'Reilly, Foughbarre-Smith

John Machin
Hi John, I have added all the code I have written so far to my original question. The example is copy-pasted from the URL listed in the code, which gives me problems. You are right that I get all the German, French and Spanish accents...
Thomas Jensen
Thanks a lot for the help John, I am a novice with this so I really appreciate the hints!!
Thomas Jensen
+2  A: 

I was able to show 31 names starting with A with this code:

extended_chars = srange(r"[\0x80-\0x7FF]")
special_chars = " -'"  # space, hyphen, apostrophe
name = Word(alphanums + alphas8bit + extended_chars + special_chars)

As John noticed, you need more Unicode characters (extended_chars), and some names contain hyphens etc. (special_chars). Count how many names you get, and check whether the page shows the same count as I got for 'A'.

The range 0x80-0x87F covers the 2-byte sequences in UTF-8 of probably all European languages. Among the pyparsing examples there is greetingInGreek.py for Greek, and another example parses Korean text.

If 2 bytes are not enough then try:

extended_chars = ''.join(chr(c) for c in range(127, 65536))  # Python 3; the original used unichr/xrange from Python 2
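
Putting the pieces together, a minimal sketch of the whole parser with the widened character set might look like this (the URL is the one from the question; on current Python 3 the package is simply pyparsing rather than pyparsing_py3):

# -*- coding: utf-8 -*-
import urllib.request
from pyparsing import Word, ZeroOrMore, Suppress, srange, alphanums, alphas8bit

page = urllib.request.urlopen(
    "http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=A&language=EN"
).read().decode("utf8")

extended_chars = srange(r"[\0x80-\0x7FF]")  # 2-byte UTF-8 range, see discussion below
special_chars = " -'"                       # space, hyphen, apostrophe
name = Word(alphanums + alphas8bit + extended_chars + special_chars)

begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

for tokens in names.searchString(page):
    print(tokens)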
Michał Niklas
Brilliant, thanks a lot Michal, this was just what I was looking for!
Thomas Jensen
@Michal: (1) They are Unicode characters, not UTF-8. (2) Is `\0x` followed by a variable-length chunk of hex a typo, or does pyparsing have idiosyncratic syntax? (3) U+0080 to U+009F inclusive are C1 control characters, i.e. not alphabetic. (4) If `\0x87F` means the same as U+087F, it's a very weird/arbitrary choice; the range would include all the Latin characters, IPA extensions, spacing modifier letters, combining diacritical marks, Greek, and more; besides, there are no chars defined in the range U+07B1 to U+08FF!
John Machin
@Michal: about "the range 0x80-0x87F covers the 2-byte sequences in UTF-8": 2-byte UTF-8 sequences cover only up to U+07FF (not 087F), and in any case, what is the relevance of how many UTF-8 bytes are involved? `.decode('utf8')` is done once at the start, and the original encoding is then irrelevant.
John Machin
John, you are right. (1) Changed. (2) In the examples there is `koreanChars = srange(r"[\0xac00-\0xd7a3]")`, so I think this is how pyparsing works. (3) and (4) I think the asker should check which alphabets are used and then set `extended_chars` accordingly. Unfortunately there is no pyparsing support for various alphabets.
Michał Niklas
+2  A: 

Are you sure that writing your own parser to pick bits out of HTML is the best option? You might find it easier to use a dedicated HTML parser such as Beautiful Soup, which lets you specify the location you're interested in via the DOM, so pulling the text from the first link inside a table cell with class "listcontentlight_left" is quite easy:

soup = BeautifulSoup(htmlDocument)
cells = soup.findAll("td", "listcontentlight_left")
for cell in cells:
    print(cell.a.string)
Andrew Aylett
They make some cells light and some dark, so `cells = soup.findAll("td", "listcontentlight_left") + soup.findAll("td", "listcontentdark_left")` will get all the names. I also think it is better to use Beautiful Soup than pyparsing here. +1
Michał Niklas
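For reference, a complete sketch along these lines might look as follows (assuming the bs4 package on Python 3 and the URL from the question; the class names are the ones visible in the page's markup):

import urllib.request
from bs4 import BeautifulSoup  # pip install beautifulsoup4

url = "http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN"
html = urllib.request.urlopen(url).read().decode("utf8")

soup = BeautifulSoup(html, "html.parser")
# the listing alternates between "light" and "dark" rows, so query both classes
cells = soup.find_all("td", "listcontentlight_left") + \
        soup.find_all("td", "listcontentdark_left")
for cell in cells:
    print(cell.a.string)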
+1  A: 

at first i thought i’d recommend to try and build a custom letter class from python’s unicodedata.category method which, when given a character, will tell you which class that codepoint is assigned to according to the unicode character categories; this would tell you whether a codepoint is e.g. an uppercase or lowercase letter, a digit, or whatever.
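
as a sketch, that letter test could be as simple as this (the unicode general categories for letters all start with 'L'):

import unicodedata

def is_letter(ch):
    # Lu, Ll, Lt, Lm, Lo -- upper, lower, titlecase, modifier and other letters
    return unicodedata.category(ch).startswith("L")

print(is_letter("Ė"))  # True -- U+0116, the character that broke the parser
print(is_letter("4"))  # False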

on second thought, and reminiscent of an answer i gave the other day, let me suggest another approach. there are many implicit assumptions we have to get rid of when going from national to global; one of them is certainly that ‘a character equals a byte’, and another is that ‘a person’s name is made up of letters, and i know what the possible letters are’. unicode is vast, and the eu currently has 23 official languages written in three alphabets; exactly which characters are used for each language will take quite a bit of work to figure out. greek uses those fancy apostrophes and is distributed across at least 367 codepoints; bulgarian uses the cyrillic alphabet with a slew of extra characters unique to the language.

so why not simply turn the tables and take advantage of the larger context those names appear in? i browsed through some sample data and it looks like the general pattern for MEP names is LASTNAME, Firstname, with (1) the last name in (almost) upper case; (2) a comma and a space; (3) the given names in ordinary case. this even holds in more ‘deviant’ examples like GERINGER de OEDENBERG, Lidia Joanna; GALLAGHER, Pat the Cope (wow); McGUINNESS, Mairead. it would take some work to recover the ordinary case from the last names (maybe leave all the lower-case letters in place, and lower-case any capital letter that is preceded by another capital letter), but to extract the names is, in fact, simple:

fullname  := lastname ", " firstname
lastname  := character+
firstname := character+

that’s right—since the EUP was so nice to present names enclosed in an HTML tag, you already know the maximum extent of it, so you can just cut out that maximum extent and split it into two parts. as i see it, all you have to look for is the first occurrence of a comma followed by a space—everything before that is the last name, everything after it the given names of the person. i call that the ‘silhouette approach’ since it’s like looking at the negative, the outline, rather than the positive, what the form is made up from.
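
in python, that split is a one-liner with str.partition, and the case-recovery heuristic from above might look like this (just a sketch, not battle-tested):

def split_mep_name(fullname):
    # everything before the first ", " is the last name, the rest the given names
    lastname, _, firstname = fullname.partition(", ")
    return lastname, firstname

def recover_case(lastname):
    # lower-case any capital letter that follows another capital letter, so
    # GERINGER de OEDENBERG -> Geringer de Oedenberg, McGUINNESS -> McGuinness
    out, prev_upper = [], False
    for ch in lastname:
        out.append(ch.lower() if prev_upper and ch.isupper() else ch)
        prev_upper = ch.isupper()
    return "".join(out)

last, first = split_mep_name("GERINGER de OEDENBERG, Lidia Joanna")
print(recover_case(last), "|", first)  # Geringer de Oedenberg | Lidia Joanna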

as has been noted earlier, some names use hyphens; now, there are several codepoints in unicode that look like hyphens. let’s hope the typists over there in brussels were consistent in their usage. ah, and there are many surnames using apostrophes, like d'Hondt, d'Alembert. happy hunting: possible incarnations include U+0060, U+00B4, U+0027, U+02BC and a fair number of look-alikes. most of these codepoints would be ‘wrong’ to use in surnames, but when was the last time you saw those marks used correctly?

i somewhat distrust that alphanums + alphas8bit + extended_chars + special_chars pattern; at least that alphanums part is a tad bogus, as it seems to include digits (which ones? unicode defines a few hundred digit characters), and that alphas8bit thingy does reek of a solvent made for another time. unicode conceptually works in a 32-bit space. what’s 8bit intended to mean? letters found in codepage 852? c’mon, this is 2010.

ah, and looking back i see you seem to be parsing the HTML with pyparsing. don’t do that. use e.g. beautiful soup for sorting out the markup; it’s quite good at dealing even with faulty HTML (most HTML in the wild does not validate), and once you get your head around its admittedly wonderlandish API (all you ever need is probably the find() method), it will be simple to fish out exactly those snippets of text you’re looking for.

flow
A: 

Even though BeautifulSoup is the de facto standard for HTML parsing, pyparsing has some alternative approaches that lend themselves to HTML too (certainly a leg up over brute-force regexes). One function in particular is makeHTMLTags, which takes a single string argument (the base tag) and returns a 2-tuple of pyparsing expressions, one for the opening tag and one for the closing tag. Note that the opening-tag expression does far more than just return the equivalent of "<" + tag + ">". It also:

  • handles upper/lower casing of the tag itself

  • handles embedded attributes (returning them as named results)

  • handles attribute names that have namespaces

  • handles attribute values in single, double, or no quotes

  • handles empty tags, as indicated by a trailing '/' before the closing '>'

  • can be filtered for specific attributes using the withAttribute parse action

So instead of trying to match the specific name content, I suggest you try matching the surrounding <a> tag, and then accessing the title attribute. Something like this:

aTag, aEnd = makeHTMLTags("a")
for t, _, _ in aTag.scanString(page):
    if ";id=" in t.href:
        print(t.title)

Now you get whatever is in the title attribute, regardless of character set.
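
Spelled out as a runnable sketch (imports and the page fetch added; assuming Python 3 and the URL from the question):

import urllib.request
from pyparsing import makeHTMLTags

url = "http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN"
page = urllib.request.urlopen(url).read().decode("utf8")

aTag, aEnd = makeHTMLTags("a")
for t, _, _ in aTag.scanString(page):
    # only the parliamentarian links carry ";id=" in their href (see comments below)
    if ";id=" in t.href:
        print(t.title)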

Paul McGuire
as i insinuated before, x/html parsing is such an unwieldy beast it’s imho better done by those specialized libraries out there. just throw a well-known package at your x/html and see how you can pry generic data structures out of the cold hands of that html. yeah, your parser handles empty tags, fine. sure you can do that. of course, if writing another x/html parser is what you want to do, just do it. if all you want is to get data out of some html pages, leave it. simple as that. then again, i never tried pyparsing for that purpose. what does it do with non-validating html?
flow
In the example I posted, pyparsing won't care much whether the HTML validates or not. Unlike BS, which reads all the HTML and must comprehend it all in order to get anything, pyparsing takes more of a regex-y scanning approach (I would never encourage anyone to try to write a full HTML parser using pyparsing). Unlike regexen, pyparsing's makeHTMLTags and makeXMLTags built-ins create mini-parsers that can handle the variations in HTML source, most notably attributes, spelling, and upper/lower case. Imagine trying to write a regex that can handle all of the bulleted exceptions in my answer.
Paul McGuire
Wauw, thanks a lot. I feel like I have enough material here for a week of study. If you don't mind, could you explain what the `t,_,_` and the `";id="` do? (It works perfectly, I would just like to know why it works.)
Thomas Jensen
scanString is a generator method that yields a 3-tuple for each match. The elements of the tuple are the matched tokens, the starting location of the tokens, and the ending location of the tokens. Since you are not interested in the location of the tags within the HTML, I just throw those values away by using '_' dummy variables. If it makes it clearer, replace this with `for tagTokens, tagStart, tagEnd in aTag.scanString(...` and replace references to 't' with 'tagTokens'.
Paul McGuire
The use of `";id=" in t.href` is there to try to filter out some of the spurious `<a>` tags that have title attributes but are not really entries for parliamentarians.
Paul McGuire
Thanks for the answers Paul!
Thomas Jensen