I want to download some Yahoo Groups (files, photos, messages, memberlist) and I've found these scripts:

I've downloaded ActivePerl and the needed modules from CPAN (nothing fancy; they're very easy to find). I've managed to install them, but when I run the script I get an error right after it tells me that I've successfully logged in: "Use of uninitialized value $cells in pattern match (m//) at yahoogroups_files.pl line 244, <STDIN> line 2."

I'm guessing that Yahoo changed the layout of the page or something, but I'm not able to update the script myself. I'm a newbie when it comes to Perl and to understanding how Yahoo generates its pages; I only know some basic C++. I want to mention that I'm not lazy: I'll try to fix it myself, but I need your help. Hints, advice, anything.

PS: I've contacted the author, but he isn't willing to update the scripts.

Regards, Nick

+2  A: 

You would need knowledge in the following fields:

  • use of an HTML parser

  • HTTP knowledge (GET/POST/HEAD)

  • web scraping

I suggest you focus on WWW::Mechanize, since it's capable of all these things (and more).

EDIT: Another solution (one that doesn't need programming) is this: log in to Yahoo Groups with your browser, store the cookie, and then run wget, passing the stored cookie as a parameter. This way you'll get the task accomplished very quickly.

Find your browser's cookies.txt file on your hard drive, and then call wget like this (if I remember the commands correctly):

wget --load-cookies path_to_cookie_file -r -w 60 website

The full man page can be found here

EDIT2: Another option is to use WebDriver to automate Firefox. You can use this article as a guide on how to accomplish this.

Geo
I've checked wget, but it isn't that simple to use. I've searched on Google for wget, login based, cookies, authentication... but wasn't able to find anything close. If you've tried something similar, please tell me the command that you used for wget. I've also tried Offline Explorer with login and password, and it didn't work.
Expansion
Couldn't make it work, it only downloads the login page and it stops. I also had to use "--no-check-certificate" otherwise it doesn't connect to the server.
Expansion
+1  A: 

By the filename I'm assuming you're using Yahoo Group archiver found here: http://sourceforge.net/projects/grabyahoogroup/

I ran the files script against the SubEthaEdit group and it works great. All of the files downloaded without incident.

Looking at the code it seems to barf while processing an html table in a while loop if $cells is empty.

Considering the code did work when I tested it, it's possible there's something going on with the listing of that group's files. You'll want to try outputting $content and figure out where and why the regular expression on line 243 isn't able to process that HTML.
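A minimal sketch of that debugging step, assuming nothing about the real script beyond the $content and $cells names. In list context a failed match returns an empty list, so $cells stays undefined and any later pattern match on it warns. The sample pages, the marker comments, and the helper name extract_cells are hypothetical stand-ins, not the script's actual patterns:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between two marker comments,
# or undef when the page does not contain them (e.g. a login page).
sub extract_cells {
    my ($content) = @_;
    # In list context a failed match returns an empty list, so $cells is undef.
    my ($cells) = $content =~ /<!-- begin -->\s*(.+?)\s*<!-- end -->/s;
    return $cells;
}

# Made-up sample pages for illustration only.
my $files_page = "<html><!-- begin --><tr><td>a.txt</td></tr><!-- end --></html>";
my $login_page = "<html><form>Sign in</form></html>";

for my $content ($files_page, $login_page) {
    my $cells = extract_cells($content);
    if (defined $cells) {
        print "matched: $cells\n";
    } else {
        # Dump what was actually received before giving up.
        print "no match; page starts with: ", substr($content, 0, 40), "\n";
    }
}
```

Checking defined($cells) before the loop turns the cryptic warning into a message that shows which page was really returned.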

EDIT: If you don't mind posting the group this is happening with, I'm sure I or someone else here can try it out and troubleshoot it. It's tough to pinpoint what's up when the issue can't be duplicated. Also, try the same group I did and see if it works for you. If it does, there's certainly something up with the group you're trying.

Nate
I've tried various groups and got the same results, and I'm not the only one who got this error (I've read on SourceForge that someone else got it too). I'm going to try it on another PC; maybe it's something with my "machine".
Expansion
@Expansion -- just a wild guess .. are the groups you are fetching fairly low-traffic, with 0 messages in some months?
SquareCog
@Nate: I'm using the latest ActivePerl and I installed the needed modules, Getopt-Long and libwww-perl, with PPM via this link: http://ppm4.activestate.com/MSWin32-x86/5.10/1000/ I forgot to mention that I run the script in the command prompt like this: perl -T D:\workspace\yahoogroups_files.pl <grpnam>
Expansion
Expansion
Would you please explain what you did? I have a feeling that I'm doing something wrong. (I'm using the command prompt on WinXP SP2.)
Expansion
Ran it off my Linux box. I already had the needed modules installed. Add this above the while loop on line 244: 'print "$content\n";' and check the HTML. Modify the regular expression above it if needed. You are going to have to play with the code here. Great opportunity to learn something new!
Nate
I still get errors: 500 No Host option provided. And what HTML? I think I'll install Linux on a VM. What distribution should I choose? What do you use to run Perl? I hope I'll manage to install the modules.
Expansion
The script you're using parses the HTML pages of Yahoo Groups. The HTML that's being returned may not match what the script is expecting. I recommend Ubuntu, but it doesn't matter for a task like this. /usr/bin/perl is standard on a Linux system and should be included by default.
Nate
I don't understand why the returned HTML doesn't match in my case but does in yours. Does it not like WinXP, ActivePerl, or the modules I installed?
Expansion
@SquareCog - Nope. They're active groups.
Expansion
You should crack open the code and try to debug this. This is a forum to help with programming, not generic troubleshooting of downloaded code/your machine. Open the file, read it, try to understand it, and play till it works. If you have a specific question about the code feel free to post it.
Nate
Did it work for you? Did all of the scripts work? If they did, there's nothing to debug. I'm trying to see where the problem is coming from: my settings/software/PC or the actual code. If it's the actual code, I'll get into it.
Expansion
yahoogroups_files works. I only ran that one since it's the one you mentioned. You have two options: edit the code or use another box. Debugging the code will help you find out whether it's a code issue or an ActivePerl issue. Again, this is a programming forum. If you reply without trying any code, I can't and won't help.
Nate
I've added an answer below, because here I have a limited number of characters.
Expansion
Please tell me which Linux distribution and which version of Perl you use. Thank you.
Expansion
A: 

@Nate, I'm struggling to make it work. First I rechecked the modules, then some "header" commands/arguments. Yesterday it worked and downloaded a few files from a group, but then it stopped and would not start again. Today I'm getting the same results every time, like this:

1st time: perl -wT d:\workspace\yahoogroups_files.pl groupname

Successfully logged in as .....

Yahoo error : nonexistant group

I go to "Documents and Settings", delete the created cookie, and start over.

2nd time: perl -wT d:\workspace\yahoogroups_files.pl groupname

Successfully logged in as .....

Use of uninitialized value $cells in pattern match (m//) at d:\workspace\yahoogroups_files.pl line 242, <STDIN> line 2.

I've modified the script as you said, to show me $content. It shows the contents of an HTML file, or part of it, since it's very large.

I've checked the contents of $cells (print "$cells\n";) after this line: my ($cells) = $content =~ /\s+(.+?)\s+/s;

And it seems that $cells has the same content as $content.
But at the next line I get that warning about "uninitialized value $cells": while ($cells =~ /< tr>.+?\s+< a href="(.+?)">(.+?)<\/a>\s+<\/span>.+?<\/tr>/sg) I've pasted 'print "$cells\n";' inside the "while" loop, but I get no output, so I can't see what it contains. It seems that the script stops at the "while" loop.

The problem seems to be here (between these 2 lines, $cells doesn't keep its content by the time the "while" condition runs):

my ($cells) = $content =~ /\s+(.+?)\s+/s;

while ($cells =~ /.+?\s+< a href="(.+?)">(.+?)<\/a>\s+<\/span>.+?<\/tr>/sg)

Expansion
A: 

OK, here are the new things/tests that I've done and want to mention:

-the cookie file that the script creates, called yahoogroups.cookies contains only this line: #LWP-Cookies-1.0

-I've used open (MYFILE, '>>cells.html'); #print MYFILE $cells; #close (MYFILE); #open (MYFILE, '>>content.html');#print MYFILE $content; #close (MYFILE); to check the contents of $cells and $content right before that "while" line that I've posted in the above answer. $content contained the "yahoo login page" but $cells was empty.

-Whatever I wrote, I couldn't make that while loop run even once. (I inserted a line in the while loop to print something, and that something never appeared, so I assume that the program stops at the while condition.)

Expansion
A: 

Now I get another error:

Use of uninitialized value $group_domain in concatenation (.) or string at d:\workspace\yahoogroups_files2.pl line 223, <STDIN> line 2.
Use of uninitialized value $group_domain in concatenation (.) or string at d:\workspace\yahoogroups_files2.pl line 225, <STDIN> line 2.

500 No Host option provided
Expansion
A: 

Dunno if it will help you, but here's what I did to get the message-download working:

http://sourceforge.net/forum/forum.php?thread_id=3283915&forum_id=209170

(I only used message-download, I didn't look at file-download)

The message-download is different from the rest of the scripts. As the author states: "Inspired by Ravi Ramkissoon's fetchyahoo utility." Thank you for your interest!
Expansion
A: 

I have the FILES script working, and had started debugging the photos script when Yahoo blocked me for 'unusual activity'. I've since written to their tech support (ironically, while watching a Yahoo ad on TV at the same time).

I've asked them to stop this nonsense and either provide an alternative way to bulk download from group file stores, or stop blocking me until I'm done; I'll be working locally again once the download has completed.

I haven't worked out exactly why this happens or how to detect it, but one of the groups I frequent is on tech.groups.yahoo.com and not groups.yahoo.com.

Manually changing the URL fixed most of the issues for the file grabbing script.

I believe the fix is similar for the photos script. I will continue once I'm unblocked.

A: 

I was tinkering with this a while ago to back up my girlfriend's group messages and files from uni. While debugging the latest scripts, I found that there seems to be a bug in the group_domain declaration (there's also a group declaration bug that I found in yahoo2maildir.pl from the same project, see $request):

($group_domain) = $url =~ /\/\/(.*?groups.yahoo.com)\//;

In this case, I changed the $request variable inside sub download_folder()

from
$request = GET "http://$group_domain/group/$group/files$sub_folder/";
to
$request = GET "http://groups.yahoo.com/group/$user_group/files$sub_folder/";
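A minimal sketch of a gentler variant of the workaround above: keep the domain extraction, but fall back to a default when the pattern does not match, so the request URL never ends up with an empty host. The helper name group_domain_for, the sample URLs, and the choice of groups.yahoo.com as the fallback are assumptions for illustration, not part of the original scripts:

```perl
use strict;
use warnings;

# Hypothetical helper: pull the groups host out of a URL, falling back to
# the main domain when the pattern does not match (assumed default).
sub group_domain_for {
    my ($url) = @_;
    my ($domain) = $url =~ m{//(.*?groups\.yahoo\.com)/};
    return defined $domain ? $domain : 'groups.yahoo.com';
}

# Made-up URLs for illustration.
for my $url ('http://tech.groups.yahoo.com/group/example/files/',
             'http://login.example.invalid/') {
    my $domain = group_domain_for($url);
    print "http://$domain/group/example/files/\n";
}
```

With a guard like this, a group hosted on a subdomain such as tech.groups.yahoo.com keeps its own host, while an unexpected URL no longer leaves $group_domain undefined.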

Hope this helps.

Cheers,
Dev

Dev