views:

168

answers:

3

Hi,

I want to use the database of URLs in the DMOZ ODP for my application (an array of URL strings, or a file containing the same). Is there any way of obtaining it other than manual copy-paste?

EDIT:

Is there any script or code to parse the RDF file?

+4  A: 

Take a look at http://rdf.dmoz.org/. You'll need to find a way to parse the RDF into your database.

I did this the other day using the odp2db scripts from Steve's Software. They're old, but the format hasn't changed significantly so they work fine.

I found I didn't need to do the iconv and xmlclean.pl steps suggested in the readme; I just uncompressed the dumps and ran the structure2db.pl and content2db.pl scripts. You'll need to create the database tables manually (see the SQL at the top of each script for that) and modify the connection details in the scripts before you start.

The mid-January 2009 dump I used contains 756,962 categories and 4,436,796 websites. It took a while to run through them all, but not excessively long, though I did dispense with the site descriptions as I didn't need them. It may also be worth adding database indices after creating the tables to speed up access later. The raw structure and content files were 75MB and 300MB compressed respectively, and 848MB and 2GB uncompressed.

Mat
Yes, I actually downloaded it and tried to extract the data with Extreme DMOZ Extractor, but I could get only 1000 URLs as it was an evaluation version. Is there any other (freeware) extractor for extracting the complete DMOZ directory?
trinity
Or is there any script to parse it? I'm not that familiar with RDF files. Please help, I need it badly.
trinity
I did this the other day and have modified my answer accordingly. Hope it helps!
Mat
Oh, I'll try that then, thanks!
trinity
I have a few questions: 1. What will the size of the content (the list of URLs) be? How much disk space will it require? 2. I have MySQL. How should the Perl scripts be modified for MySQL instead of PostgreSQL?
trinity
I have added some stats on the raw file sizes to my answer above. I can't tell you how big the database will be, as I dispensed with the descriptions (although now I find I might need them, so I will have to do it again). The database could well be smaller than the RDF files, as there's a lot of XML and RDF cruft. I can't see any problem with using MySQL instead of PostgreSQL. Just modify the DBI->connect statement, which you'd have to change anyway to supply your login details.
Mat
A: 

You could always pay one of the corrupt editors there and they will help you out :)

zinc
A: 

I've actually done this in Java. I just used the SAX API to read through the RDF files. It was pretty straightforward. In my case I wanted to pull out every URL that was in a topic with "Weblogs" in the topic name.

Basically, what I did was implement a subclass of org.xml.sax.helpers.DefaultHandler.

Then, to set up the code, you do:

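       import java.io.FileInputStream;
       import org.xml.sax.InputSource;
       import org.xml.sax.XMLReader;
       import org.xml.sax.helpers.XMLReaderFactory;
       InputSource is = new InputSource(new FileInputStream("filename.rdf"));
       XMLReader r = XMLReaderFactory.createXMLReader();
       r.setContentHandler(new MyHandlerClass());
       r.parse(is);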

and that's pretty much it. In my handler class I had to implement:

  • startElement(String uri, String localName, String qName, Attributes attributes), where an if statement checked whether the tag was "ExternalPage"; if it was, I switched to another state to look for "topic", "Title" and "Description".

  • characters(char[] ch, int start, int length), where I read in the topic, title, or description text, depending on which one had most recently been passed to startElement.

  • endElement(String uri, String localName, String qName), where I checked which element was ending; if it was ExternalPage, that meant the end of the current entry.
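For illustration, here is a minimal sketch of a handler along those lines. It is not my original code: the class name TopicFilterHandler and its helper methods are made up for this example, it assumes the page URL sits in the "about" attribute of each ExternalPage element, and it only tracks the topic text (Title and Description would follow the same pattern). It also uses SAXParserFactory rather than XMLReaderFactory, purely to keep the sketch short.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects URLs of ExternalPage entries whose topic contains a keyword. */
class TopicFilterHandler extends DefaultHandler {
    private final String keyword;
    private final List<String> urls = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String currentUrl;   // non-null only while inside an ExternalPage
    private String currentTopic; // topic text of the current ExternalPage, if seen

    TopicFilterHandler(String keyword) { this.keyword = keyword; }

    List<String> getUrls() { return urls; }

    // Drop any namespace prefix so that "d:Title" and "Title" compare equal.
    private static String bare(String qName) {
        return qName.substring(qName.indexOf(':') + 1);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("ExternalPage".equals(bare(qName))) {
            currentUrl = attrs.getValue("about"); // assumed attribute name
            currentTopic = null;
        }
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append rather than assign.
        if (currentUrl != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentUrl == null) return;
        String name = bare(qName);
        if ("topic".equals(name)) {
            currentTopic = text.toString().trim();
        } else if ("ExternalPage".equals(name)) {
            if (currentTopic != null && currentTopic.contains(keyword)) urls.add(currentUrl);
            currentUrl = null; // end of the current entry
        }
    }

    /** Parses the stream and returns the matching URLs. */
    static List<String> parse(InputStream in, String keyword) {
        TopicFilterHandler handler = new TopicFilterHandler(keyword);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse RDF input", e);
        }
        return handler.getUrls();
    }
}
```

Calling TopicFilterHandler.parse(new FileInputStream("filename.rdf"), "Weblogs") would then return the list of matching URLs, assuming the dump has the layout described above.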

The whole thing was 80-90 lines of code for the basic parsing, so it was pretty easy to write. It was able to chew through the multi-gigabyte files in, as far as I remember, maybe a minute or two. If you just want to query out some specific data, it might be easier to write the code to do that in your handler, rather than trying to load it into a DB.
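As a rough sketch of the "do it in your handler" approach for the original question (a plain file of URLs), something like the following could work. The class name UrlDumper and the method dumpUrls are hypothetical, and the code again assumes each ExternalPage element carries its URL in an "about" attribute.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class UrlDumper {
    /** Writes the "about" URL of every ExternalPage element to out, one per line; returns the count. */
    static int dumpUrls(InputStream rdf, Writer out) {
        int[] count = {0}; // array so the anonymous handler can update it
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (!qName.endsWith("ExternalPage")) return;
                String url = attrs.getValue("about"); // assumed attribute name
                if (url == null) return;
                try {
                    out.write(url);
                    out.write('\n');
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                count[0]++;
            }
        };
        try {
            // SAX streams the input, so memory use stays flat however large the dump is.
            SAXParserFactory.newInstance().newSAXParser().parse(rdf, handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse RDF dump", e);
        }
        return count[0];
    }
}
```

Because the method takes any InputStream, you could probably skip the uncompress step by wrapping the compressed dump in a java.util.zip.GZIPInputStream before passing it in, and pass a buffered FileWriter as the output.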

If you find a tool that works well, that's obviously better than writing your own code. But writing your own code isn't hard! RDF is just an XML format, and it's not nested or anything. A simple SAX parser is easily doable in a day or so.

Chad Okere