EDITORIAL NOTE
The following is my attempt to summarize the question. I'm not replacing the original content because I'm not 100% sure that I'm right.
In many 'scraping' applications, the goal is to find the 'payload' text of a web page. Consider a typical, oh, CNN web page. It has a news article. And then it has all sorts of scraps of text for navigation, advertising, and other things that are more or less noise. If you want to use the article as raw material for NLP, for example, you need to separate it from the rest.
How can this be done?
ORIGINAL QUESTION:
When we look at a webpage's source code, there are many things in it: HTML tags, links, text, etc. My question is: can we define a set (possibly a partial set) of HTML tags (or tags from another web markup language) that can be used to identify the location of the text in a given webpage? By text I mean the main content of the webpage that we see in a browser. We are allowed to see only the HTML tags in the webpage, not its content. For example, given only the complete sequence of tags from a webpage (with the webpage text removed), can we say that the main text of the webpage lies between or after certain tags? I am not very familiar with HTML programming, so I thought that someone with good HTML experience could help. Thank you. Regards
Idea behind this: I want to define some features of webpages using this set of tags, so that I can train part of a machine-learning-based system to extract text from a webpage.
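To make the idea concrete, here is a rough sketch of what "tags only" features might look like. It uses Python's standard html.parser module to record the tag sequence while ignoring all text, then counts a few tags and measures nesting depth. The names TagSequenceParser, tag_features, FEATURE_TAGS and VOID_TAGS, and the particular choice of tags, are my own assumptions for illustration; nothing here is a known-good feature set.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical choice of tags to count; a real feature set would need tuning.
FEATURE_TAGS = ("p", "a", "div", "li")

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagSequenceParser(HTMLParser):
    """Records the sequence of tags in a page and ignores all text."""

    def __init__(self):
        super().__init__()
        self.events = []  # (kind, tag, depth) tuples in document order
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag, self.depth))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # e.g. "<br/>" also triggers an end-tag callback
        self.depth = max(0, self.depth - 1)
        self.events.append(("close", tag, self.depth))


def tag_features(html):
    """Return a small dict of tag-only features for one HTML document."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    opens = [(tag, depth) for kind, tag, depth in parser.events
             if kind == "open"]
    counts = Counter(tag for tag, _ in opens)
    features = {f"count_{tag}": counts[tag] for tag in FEATURE_TAGS}
    features["max_depth"] = max((depth for _, depth in opens), default=0)
    return features


if __name__ == "__main__":
    sample = ("<html><body><div><a href='#'>Home</a></div>"
              "<div><p>Article text</p><p>More text</p></div></body></html>")
    print(tag_features(sample))
```

In practice these features would probably be computed per block (for example, per div) rather than per page, so that a classifier can label each block as main content or noise. That part is left out of the sketch.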