Dear Coding Experts,

Edit: Just to clarify, I am using Python and would like to do this within Python.

I am in the middle of collecting data for a research project at our university. Basically, I need to scrape a lot of information from a website that monitors the European Parliament. Here is an example of what the URL of one page looks like:

http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-0190&language=EN

The parts after "reference" in the address mean the following: A7 = the parliament in session (previous parliaments are A6 etc.), 2010 = the year, 0190 = the number of the file.

What I want to do is create a variable that holds all the URLs for the different parliaments, so that I can loop over this variable and scrape the information from each page.

Any help is much appreciated!

With kind regards,

Thomas Jensen

P.S: I have tried this:

number = range(1,190,1)

for i in number:
    search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-" + str(number[i]) +"&language=EN"
    results = search_url
    print results

but this gives me the following error:

Traceback (most recent call last):
  File "", line 7, in
IndexError: list index out of range

+1  A: 

Can you use Python and wget? You could loop through the sessions that exist and build a string to pass to wget. Or is that overkill?
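If wget is unfamiliar, the same idea can be sketched with the standard library alone. This is only a rough outline, assuming Python 3 (where `urlopen` lives in `urllib.request`; on Python 2 it is in `urllib2`). The names `download_all` and `fetch` are made up for illustration, and there is no error handling for documents that do not exist.

```python
from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen


def download_all(urls, fetch=None):
    """Return a dict mapping each URL to whatever fetch() returns for it.

    download_all and fetch are hypothetical names for this sketch.
    """
    if fetch is None:
        # Default: a plain HTTP request that returns the raw bytes of the page.
        def fetch(url):
            return urlopen(url).read()

    pages = {}
    for url in urls:
        pages[url] = fetch(url)
    return pages
```

Calling `download_all(list_of_urls)` would then request each page in turn; passing your own `fetch` function lets you try the loop out without touching the network.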

MJB
I am using Python (still learning), but I have no idea what wget is. Preferably, I would like to keep it as simple as possible...
Thomas Jensen
A: 

Use Selenium. Since it controls a real browser, it can handle sites that use complex JavaScript. Many language bindings are available, including Python.

James Roth
Thanks, but I have no clue about selenium, and I would like to do this from within Python (as it is the only language I am somewhat familiar with).
Thomas Jensen
It is well documented and I do use it from within python.
James Roth
+1  A: 

If I understand correctly, you just want to be able to loop over the parliaments?

i.e. you want A7, A6, A5...?

If that's what you want, a simple loop could handle it:

for p in xrange(7, 0, -1):
    parliment = "A%d" % p
    print parliment

for the other values similar loops would work just as well:

for year in xrange(2010, 2000, -1):
    print year

for filenum in xrange(100,200):
    fnum = "%.4d" % filenum
    print fnum

You could easily nest your loops in the proper order to generate the combination(s) you need. HTH!
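As a rough sketch of that nesting (written with `range` and `print()` so it also runs on Python 3): the helper name `build_urls` and the ranges passed to it are only illustrative assumptions, not a list of documents known to exist.

```python
# Illustrative sketch: nest the three loops to build every combination.
# The template follows the example URL from the question; build_urls is a
# hypothetical helper name and the ranges used below are placeholders.
URL_TEMPLATE = (
    "http://www.europarl.europa.eu/sides/getDoc.do"
    "?type=REPORT&mode=XML&reference=A%d-%d-%.4d&language=EN"
)


def build_urls(parliaments, years, file_numbers):
    """Return one URL per (parliament, year, file number) combination."""
    urls = []
    for parliament in parliaments:
        for year in years:
            for file_number in file_numbers:
                urls.append(URL_TEMPLATE % (parliament, year, file_number))
    return urls


urls = build_urls(range(7, 5, -1), range(2010, 2008, -1), range(1, 4))
print(len(urls))  # 2 parliaments * 2 years * 3 file numbers = 12
print(urls[0])
```

Not every combination will correspond to a real document, so whatever does the scraping would need to cope with missing pages.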

Edit:

String formatting is super useful, and here's how you can do it with your example:

# Just create a string with the format specifier in it: %.4d - a [d]ecimal with a
# precision/width of 4 - so instead of 3 you'll get 0003
search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN"

# xrange gives you a lazy sequence object: it produces the numbers on demand
# instead of building a full list, and you can iterate over it just like a
# collection. 1 is the default step, so no need for it in this case
for number in xrange(1,190):   
    print search_url % number

String formatting takes a string containing one or more specifiers (you'll recognize them because they start with %), followed by the % operator and the arguments to fill in: a single value on its own, or a tuple when there is more than one specifier.
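A few small cases, just to illustrate the rule (the values are arbitrary, and the variable names are made up for this example):

```python
# One specifier: the value can follow % directly.
padded = "%.4d" % 3
print(padded)       # 0003

# Several specifiers: the values go in a tuple, in order.
reference = "A%d-%d-%.4d" % (7, 2010, 190)
print(reference)    # A7-2010-0190

# The 4 is a minimum number of digits, so longer numbers are never cut off.
long_number = "%.4d" % 12345
print(long_number)  # 12345
```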

If you want to add the year and parliament, change the string to this:

search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A%d-%d-%.4d&language=EN"

where the important changes are here: reference=A%d-%d-%.4d&language=EN

That means you'll need to pass three integers, like so:

print search_url % (parliment, year, number)
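One thing to watch, sketched here with example values (7, 2010 and 190 are just the numbers from the question's URL): because the string uses %d for the parliament, you need to pass the integer 7, not a ready-made string like "A7".

```python
search_url = (
    "http://www.europarl.europa.eu/sides/getDoc.do"
    "?type=REPORT&mode=XML&reference=A%d-%d-%.4d&language=EN"
)

# Example values taken from the URL in the question.
parliament, year, number = 7, 2010, 190
url = search_url % (parliament, year, number)
print(url)

# %d expects a number, so a string such as "A7" raises a TypeError.
string_rejected = False
try:
    search_url % ("A7", year, number)
except TypeError:
    string_rejected = True
print(string_rejected)
```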

Wayne Werner
Add the backtick (`) character around it and you can write `code stuff`. I'll edit my answer to respond to that particular issue.
Wayne Werner
Thomas Jensen
Thanks a lot, Wayne. I could not figure out the code part for the comments, so I added it to the question. But I see you were faster than me :) This is exactly what I was looking for!
Thomas Jensen
+1  A: 

Sorry I can't give this as a comment, but I don't have a high enough score yet.

Looking at the code you quoted in the comment above, your problem is that you are trying to add a string and an integer. While some languages convert on the fly (useful when it works, but confusing when it doesn't), in Python you have to convert explicitly with str().

It should be something like:

"http://firstpartofurl" + str(i) + "restofurl"

or you can use string formatting (using % etc., as in Wayne's answer).
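Here is a small sketch of both options side by side, written so it also runs on Python 3. The variable names are made up for the example, and `zfill(4)` is my addition to get the leading zeros that the 0190 in your URL suggests. It also shows where the IndexError in the question comes from: `for i in number` already gives you the values 1 to 189, so `number[i]` looks one position too far ahead and runs off the end at the last step.

```python
first_part = ("http://www.europarl.europa.eu/sides/getDoc.do"
              "?type=REPORT&mode=XML&reference=A7-2010-")
last_part = "&language=EN"

numbers = range(1, 190)  # 189 values, valid indexes 0..188

# Option 1: explicit conversion with str(), padded to four digits.
concatenated = [first_part + str(n).zfill(4) + last_part for n in numbers]

# Option 2: string formatting with %.
formatted = [(first_part + "%.4d" + last_part) % n for n in numbers]

print(concatenated == formatted)

# Why the original loop failed: the last value (189) is not a valid index.
index_failed = False
try:
    numbers[189]
except IndexError:
    index_failed = True
print(index_failed)
```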

neil
Thanks Neil! After some googling I realized the mistake, I am still learning the basics :)
Thomas Jensen