The big mission: I am trying to get a few lines of summary of a webpage. That is, I want a function that takes a URL and returns the most informative paragraph from that page. (This would usually be the first paragraph of actual content text, as opposed to "junk text" like the navigation bar.)
So I managed to reduce an HTML page to a bunch of text by cutting out the tags, throwing out the <HEAD> and all the scripts. But some of the text is still "junk text". I want to know where the actual paragraphs of text begin. (Ideally it should be human-language-agnostic, but if you have a solution only for English, that might help too.)
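For concreteness, the stripping step described above might look roughly like the sketch below. It uses Python's standard-library `html.parser` as a stand-in (not Beautiful Soup), and the names `TextExtractor` and `html_to_text` are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text chunks, skipping anything inside <head>, <script>, or <style>."""

    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the non-empty text chunks of an HTML string, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


page = (
    "<html><head><title>T</title><script>var x=1;</script></head>"
    "<body><div>Home | About</div><p>First real paragraph.</p>"
    "<script>alert(1)</script></body></html>"
)
print(html_to_text(page))
```

Note that the result still contains the navigation text ("Home | About") alongside the real paragraph, which is exactly the problem: the tags are gone, but nothing distinguishes junk from content.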
How can I figure out which of the text is "junk text" and which is actual content?
UPDATE: I see some people have suggested using an HTML parsing library. I am using Beautiful Soup. My problem isn't parsing HTML; I have already gotten rid of all the HTML tags. I just have a bunch of text, and I want to separate the content text from the junk text.