I'd like to scrape a website to programmatically collect any external links inside any Flash elements on the page. I'd also like to collect any other text, if possible, but the links are the important part. Is this possible? A freeware library or service would be preferable, but if none exists, how can I accomplish the task on my own? Is it possible to get the source code and pull the links from that?

+1  A: 

As a very crude first step, you could use Google to get a text snippet out of the SWF, provided that Google has indexed it and that you know its URL. For example:

http://www.google.com/search?q=site%3Awww.michaelgraves.com%2Fmga.swf

cherouvim
+2  A: 

Yanking "external links" out of a Flash file can be as simple as, for instance:

curl -s http://hostname/path/to/file.swf | strings | grep http

Of course, this will fail if the author has made any attempt to hide the URL.

YMMV a lot. Good luck!

MikeyB
curl's output just looks like a bunch of random characters, nothing as coherent as http. I used `curl www.michaelgraves.com/mga.swf -o test.txt`. Does strings do something to convert it to readable text?
Mike Pateras
The `strings` program pulls anything that may be a human-readable string out of a binary data stream. The `grep` then keeps only the strings containing `http`. You can also adjust the options to `strings` to get more useful output (`strings -10`: only output strings of at least 10 characters).
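
For anyone who wants the same idea inside a script, here is a rough Python sketch of what `strings | grep http` does. It is only an approximation of the real `strings` tool (it looks at printable ASCII runs only), and the function names are made up for illustration.

```python
import re


def printable_strings(data: bytes, min_len: int = 4):
    """Yield runs of at least min_len printable ASCII bytes, roughly like `strings`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.group().decode("ascii")


def strings_containing(data: bytes, needle: str = "http", min_len: int = 4):
    """Keep only the extracted strings that contain needle, like `grep http`."""
    return [s for s in printable_strings(data, min_len) if needle in s]


if __name__ == "__main__":
    # Made-up bytes standing in for a downloaded .swf file.
    sample = b"\x00\x01junk\x00http://example.com/page\x00\xff\xfeab\x00"
    print(strings_containing(sample))
```

Raising `min_len` plays the same role as `strings -10`: short runs of accidental printable bytes are dropped.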
MikeyB
So if the file doesn't contain an "http" string, strings isn't going to give it to me, right?
Mike Pateras
@Mike: That's right, exactly.
MikeyB
So what are my options if that output is entirely garbage? Is that just a reality for some sites?
Mike Pateras
I would say that your next step would be to find some application that actually understands the .swf format to parse it. A quick Google search (parsing .swf) leads me to http://flashpanoramas.com/blog/2007/07/02/swf-parser-air-application/ which looks promising.
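
One possible reason for all-garbage output, short of a full parser: SWF files may be zlib-compressed, in which case the file starts with `CWS` instead of `FWS` and everything after the 8-byte header has to be inflated before any text becomes visible. The sketch below assumes that is what is going on with your file, which may not be the case. It is not a real SWF parser, and the function names are hypothetical.

```python
import re
import zlib

# Printable, non-space ASCII following http:// or https://
URL_RE = re.compile(rb"https?://[\x21-\x7e]+")


def swf_body(data: bytes) -> bytes:
    """Return the SWF payload, inflating it if the header marks it as zlib-compressed."""
    signature = data[:3]
    if signature == b"FWS":  # uncompressed
        return data[8:]
    if signature == b"CWS":  # zlib-compressed after the 8-byte header
        return zlib.decompress(data[8:])
    raise ValueError("unsupported or non-SWF signature: %r" % signature)


def find_urls(data: bytes):
    """Return the sorted, de-duplicated URLs found in the (inflated) payload."""
    return sorted({m.group().decode("ascii") for m in URL_RE.finditer(swf_body(data))})
```

Usage would be something like `find_urls(open("test.swf", "rb").read())`. URLs that the author builds at runtime or obfuscates still won't show up this way.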
MikeyB
+3  A: 

Decompiling the Flash source would let you see the ActionScript part of the Flash file, which I've found to often contain info like links.

A free decompiler is Flare. It's command-line only and works fine, but it won't decode some of the info in newer Flash formats (later than CS3, I think). It dumps all the ActionScript into one file.

Sothink SWF Decompiler is a more sophisticated commercial program. It has worked fine with every Flash file I've tried, and the results are quite thorough and well organized. It's GUI-based, and I don't know whether it is easily automated.

With Flare, since it's a command line tool, one could easily write a script to obtain the SWF, decompile it, grep for 'http://', and log the results.
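
A minimal sketch of such a script, in Python. It assumes that `flare` is on your PATH and that running `flare movie.swf` writes the ActionScript to `movie.flr` next to the input; check that against your copy of Flare. The function names and the `movie.swf` file name are placeholders I made up.

```python
import re
import subprocess
import urllib.request
from pathlib import Path

# Stop at whitespace, quotes, angle brackets, or a closing parenthesis.
LINK_RE = re.compile(r"https?://[^\s'\"<>)]+")


def extract_links(actionscript: str):
    """Return the sorted, de-duplicated links found in decompiled ActionScript text."""
    return sorted(set(LINK_RE.findall(actionscript)))


def decompile_and_scan(swf_url: str, workdir: str = ".", decompiler: str = "flare"):
    """Download a SWF, run the decompiler on it, and return the links it contains."""
    swf_path = Path(workdir) / "movie.swf"
    urllib.request.urlretrieve(swf_url, str(swf_path))
    # Assumption: the decompiler takes the SWF path as its only argument.
    subprocess.run([decompiler, str(swf_path)], check=True)
    # Assumption: the output lands in a .flr file beside the input.
    flr_path = swf_path.with_suffix(".flr")
    return extract_links(flr_path.read_text(errors="replace"))
```

Calling `decompile_and_scan` with the SWF's URL returns the list of links, which you can then print or write to a log file.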

Alex JL