I'm working on a project about page ranking. I want to build an index (a dictionary) that looks like this:
file1.html -> [[cat, ate, food, drank, milk], [file2.html, file3.html]]
file2.html -> [[dog, barked, ran, away], [file1.html, file4.html]]
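In Python terms, I picture the index roughly like this. This is only a sketch of the shape: the name `index` is just for illustration, and the entries are the two example rows above.

```python
# Each file name maps to a two-element list: [words on the page, outgoing links].
index = {
    "file1.html": [
        ["cat", "ate", "food", "drank", "milk"],
        ["file2.html", "file3.html"],
    ],
    "file2.html": [
        ["dog", "barked", "ran", "away"],
        ["file1.html", "file4.html"],
    ],
}

# Unpack one entry into its word list and its link list.
words, links = index["file1.html"]
```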
Fetching links is easy: look for anchor tags. My question is: how do I fetch the text? The text in the HTML files is not enclosed in any tags such as <p>.
Here's an example of one of the input HTML files:
d_9.html
d_3.htmlbedote charlatanism nondecision pudsey Antaean haec euphoniously Bixa bacteriologically hesitantly Hobbist petrosa emendable counterembattled noble hornlessness chemolyze spittoon flatiron formalith wreathingly hematospermatocele theosophically sarking nontruth possessionist gravimetry matico unlawly abator hyetological Microconodon supermuscan
Maybe the text above is not HTML at all, but then how do I fetch and parse it? Any ideas?
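For what it's worth, here is a minimal sketch of the direction I have in mind, using Python's standard `html.parser` module. It assumes the links are ordinary `<a href="...">` tags and that everything outside them is plain text. The `sample` string is my guess at what the underlying markup might look like, not the real file, and `PageParser` and `parse_page` are made-up names.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects href values from <a> tags and words from text outside them."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Text that is not inside an anchor counts as page content,
        # even when no tag such as <p> wraps it.
        if not self._in_anchor:
            self.words.extend(data.split())


def parse_page(html):
    """Return [words, links] for one page, matching the index layout."""
    parser = PageParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return [parser.words, parser.links]


# Assumed markup, shortened; the real file may differ.
sample = (
    '<a href="d_9.html">d_9.html</a>\n'
    '<a href="d_3.html">d_3.html</a>bedote charlatanism nondecision'
)
entry = parse_page(sample)
```

If the files turn out not to contain anchor tags at all, this would put the file names into the word list as well, so the assumption about the markup matters.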