I'm trying to work on a project about page ranking. I want to make an index (dictionary) which looks like this:

file1.html -> [[cat, ate, food, drank, milk], [file2.html, file3.html]]
file2.html -> [[dog, barked, ran, away], [file1.html, file4.html]]

Fetching links is easy - look for anchor tags. My question is - how do I fetch text? The text in the html files is not enclosed within any tags like <p>.

Here's an example of one of the input HTML files:

d_9.html
d_3.html

bedote charlatanism nondecision pudsey Antaean haec euphoniously Bixa bacteriologically hesitantly Hobbist petrosa emendable counterembattled noble hornlessness chemolyze spittoon flatiron formalith wreathingly hematospermatocele theosophically sarking nontruth possessionist gravimetry matico unlawly abator hyetological Microconodon supermuscan

Maybe the text above is not really HTML, but then how do I fetch and parse it? Any ideas?

A: 

One way to go about this is to simply ignore all the tags and treat whatever is left as text. A regex that covers every kind of tag will get large, though.
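A rough sketch of the ignore-the-tags idea, assuming a deliberately crude pattern that treats anything between `<` and `>` as a tag. The names `TAG_RE` and `strip_tags` are made up for illustration, and the pattern will misbehave on comments, scripts, or a stray `<` in the text:

```python
import re

# Crude assumption: a "tag" is anything from "<" up to the next ">".
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(html):
    """Replace every <...> span with a space and return what is left."""
    return TAG_RE.sub(" ", html)

text = strip_tags('<a href="d_9.html">d_9.html</a> cat ate food')
words = text.split()  # anchor text survives, since only the tags are dropped
print(words)
```

Note that the link text is kept as a word here; filtering it out would need an extra step.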

dutt
A: 

I wouldn't use regex. I would use something like lxml; that way you can get the tags, the text, and also the structure of the document as needed.
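A minimal sketch of the parser-based approach. It uses the standard library's `html.parser` rather than lxml, purely to avoid a third-party dependency, so it is a substitute and not the lxml API. The class name `PageParser` and the sample input are assumptions, and it presumes the links are ordinary `<a href="...">` tags:

```python
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collects href targets of <a> tags and the words outside of anchors."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Text inside an anchor is the link label, not page content.
        if not self._in_anchor:
            self.words.extend(data.split())

parser = PageParser()
parser.feed('<a href="d_9.html">d_9.html</a>\n<a href="d_3.html">d_3.html</a>\n\nbedote charlatanism')
parser.close()  # flushes any text still buffered at end of input

index = {"file1.html": [parser.words, parser.links]}
print(index)
```

Text that sits outside any tag still arrives through `handle_data`, so the missing `<p>` tags are not a problem for this approach.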

knitti
A: 

You say the text is "not HTML," and "is not enclosed within any tags." So it's just plain text, and there's nothing to parse. Fetch the URL, and the contents returned to you are a string full of words. Split the words with .split(), and you have a list of words.
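As a small illustration of the `.split()` idea, assuming the page contents are already in a string and that links show up as bare tokens ending in `.html`, as in the sample from the question. The helper name `parse_page` and the key `file1.html` are hypothetical:

```python
def parse_page(contents):
    """Split raw page text into (words, links)."""
    words, links = [], []
    for token in contents.split():
        # Assumption: anything ending in ".html" is a link, the rest are words.
        if token.endswith(".html"):
            links.append(token)
        else:
            words.append(token)
    return words, links

sample = "d_9.html\nd_3.html\n\nbedote charlatanism nondecision"
words, links = parse_page(sample)
index = {"file1.html": [words, links]}
print(index)
```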

Ned Batchelder
A: 

I think what you want is to extract data (links, keywords, ...) from an HTML file, but your problem is that part of the file has no tags to parse it by. Or is it the whole file that has no tags? If so, you can clean up the HTML with tidy first, which can make it easier to parse.

If I were you, I would just use a regex to match the links, something like:

import re
links = re.finditer(r"\S+\.html", text)  # by the way, a real regex must be more complicated than that

As for the keywords "[cat, ate, food, drank, milk]", I don't know exactly what you are looking for.

Hope this helps.

singularity