+3  Q: 

MP3 link Crawler

I have been looking for a good way to implement this. I am working on a simple website crawler that visits a specific set of websites and stores all the MP3 links it finds in a database. I don't want to download the files, just collect the links, index them, and be able to search them. So far I have been successful with some of the sites, but others use URL redirects and similar tricks that confuse the crawler.

Any ideas? How does beemp3.com index all these links?

Thanks

+1  A: 

You can send an HTTP HEAD request to each link and check the MIME type in the Content-Type response header. If it is audio/mpeg, chances are you are looking at an MP3 link.
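A minimal sketch of that check in Python, using only the standard library. The function names and the extra MIME types besides audio/mpeg are my own assumptions for illustration (some servers send non-standard variants), so adjust them to what your target sites actually return.

```python
import urllib.request

# audio/mpeg is the registered type; the others are non-standard
# variants that some servers send (an assumption, tune as needed).
MP3_MIME_TYPES = {"audio/mpeg", "audio/mp3", "audio/mpeg3", "audio/x-mpeg-3"}


def is_mp3_content_type(header_value):
    """Return True if a Content-Type header value names an MP3 MIME type."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=..." and normalize case.
    mime = header_value.split(";", 1)[0].strip().lower()
    return mime in MP3_MIME_TYPES


def looks_like_mp3(url, timeout=10):
    """Send a HEAD request and inspect the Content-Type of the response."""
    request = urllib.request.Request(url, method="HEAD")
    # urlopen follows HTTP redirects by default. The follow-up request
    # may be sent as GET on some Python versions; the body is never
    # read here, only the headers.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return is_mp3_content_type(response.headers.get("Content-Type"))
```

Calling `looks_like_mp3(url)` on each candidate link lets you filter without downloading the file. Some servers reject HEAD or report a generic type such as application/octet-stream, so treat a negative result as "unknown" rather than "not an MP3".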

klez
A: 

Here's something similar to what you're asking for (friends at college use it all the time). Given a QUERY_TEXT, this search generates a Google query of the following format:

QUERY_TEXT intitle:"index.of" "parent directory" "size" "last modified" "description"
[snd] (mp4|mp3|avi)
-inurl:(jsp|php|html|aspx|htm|cf|shtml|lyrics|mp3s|mp3|index)
-gallery
-intitle:"last modified"
-intitle:(intitle|mp3)
pianoman
This won't search for MP3s themselves, only for pages containing directory listings that include MP3 files.
klez
Yeah, and that's not really crawling either. I want to see if a script can go around and index X number of sites for MP3 files only. Thanks for the answer though :)
+1  A: 

What programming languages do you prefer?

Python:
There is a very promising crawling framework called Scrapy (written in Python) which is structured much like the Django framework. I haven't used it myself yet, but I've been looking at crawlers and Scrapy is the best candidate. IIRC it doesn't work out of the box and requires a small amount of coding, but it's designed around the DRY principle and is very customizable (much as Django doesn't give you a turn-key website right after installation).

There are many different methods of URL redirection, and your crawler needs to be able to follow these redirects or, in the worst case, ignore them so that it doesn't malfunction.

The site being redirected to must also be in your site whitelist.
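As a rough sketch of the two points above, here is one way to follow redirects hop by hop while enforcing the whitelist. It only covers HTTP-style redirects that supply a target location; meta-refresh or JavaScript redirects would need extra handling. The names `host_allowed`, `resolve_redirects`, and the `get_location` callback are hypothetical: `get_location` stands for whatever your crawler uses to fetch a URL without auto-following and return the redirect target (or `None` if there is none).

```python
from urllib.parse import urljoin, urlsplit


def host_allowed(url, allowed_hosts):
    """Return True if the URL's host is in the whitelist (exact match)."""
    host = (urlsplit(url).hostname or "").lower()
    return host in allowed_hosts


def resolve_redirects(url, allowed_hosts, get_location, max_hops=5):
    """Follow redirects one hop at a time.

    Returns the final URL, or None if the chain leaves the whitelist,
    loops back on itself, or exceeds max_hops redirects.
    """
    seen = set()
    for _ in range(max_hops + 1):
        if url in seen or not host_allowed(url, allowed_hosts):
            return None
        seen.add(url)
        location = get_location(url)  # hypothetical callback, see above
        if location is None:
            return url  # no further redirect: this is the final URL
        # Location values may be relative, so resolve against the current URL.
        url = urljoin(url, location)
    return None
```

Returning `None` instead of raising lets the crawler simply skip a link whose redirect chain misbehaves, which is the "ignore them" fallback.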

Could you perhaps edit your question and add details about your crawler? Is it written from scratch, is it a turn-key solution, etc.?

Hannson