EDITORIAL NOTE

The following is my attempt to summarize the question. I'm not replacing the original content because I'm not 100% sure that I'm right.

In many 'scraping' applications, the goal is to find the 'payload' text of a web page. Consider a typical CNN web page. It has a news article, and then it has all sorts of scraps of text for navigation, advertising, and other noise. If you want to use the article as raw material for NLP, for example, you need to separate it from the rest.

How can this be done?

ORIGINAL QUESTION:

When we look at a webpage's source code there are many things in it: HTML tags, links, text, etc. My question is: can we define a set (maybe a partial set) of HTML tags (or tags from another webpage markup language) that can be used to identify the location of the text in a given webpage? (Text here means the main content of the webpage that we see in a browser.) We are allowed to see only the HTML tags in the webpage and not its content. E.g., given only the complete sequence of tags from a webpage (webpage text removed), can we say that the main text of the webpage exists in between or after certain tags? I am not very familiar with HTML programming, so I thought that someone with good HTML experience could help. Thank you. Regards

Idea behind this: I want to define some features of webpages using this set of tags so that I can train part of a machine-learning-based system to extract text from webpages.

A: 

Not sure what you're asking really, but you can use a DOM parser and grab all text nodes through this xpath: //text() and extract them if that's what you're after. Then do what you want, enumerate, whatever.
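
As a rough illustration of "grab all text nodes", here is a minimal sketch that uses Python's standard-library `html.parser` as a stand-in for a DOM parser with the `//text()` XPath. The class name `TextNodeCollector` and the function `extract_text_nodes` are made up for this example, and unlike a plain `//text()` query it deliberately skips the contents of `<script>` and `<style>`, which is an assumption about what you would want.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects non-empty text nodes, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}  # assumption: these are never wanted as text

    def __init__(self):
        super().__init__()
        self.text_nodes = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is visible in principle and not pure whitespace.
        if self._skip_depth == 0 and data.strip():
            self.text_nodes.append(data.strip())


def extract_text_nodes(html):
    """Return the stripped text nodes of an HTML string, in document order."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.text_nodes


if __name__ == "__main__":
    page = "<html><body><h1>Head</h1><p>Hello <b>world</b></p></body></html>"
    for node in extract_text_nodes(page):
        print(node)
```

Note that this returns every text node, including navigation and advertising text, so it does not by itself tell you which text is the main content.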

meder
Thanks for your reply. I looked into DOM and am still trying to understand it completely. But I have taken a different approach, and for that I am defining some features for webpages. As I don't have much knowledge of HTML programming, I thought that people with more experience could suggest something about the inner structure of HTML that says something about the position and location of the content. Based on these features I was planning to train a part of the system.
ravi
A: 

Are you asking whether there is a particular HTML tag that is always used to contain any text that is on the page? In other words, whenever you see text on a web page, is it always contained in some particular tag? If that's the question, the answer is no. Text can appear in any tag. (Well, anywhere between <body> and </body>, but the same is true of all other non-text content.)

David Zaslavsky
Thank you for your answer, yes I am asking this. So <"body"> and <"/body"> tags are like the main tag and everything is defined in between this.
ravi
It's `<body>` and `</body>`, no quotes, but yes, all the text does appear between those two tags. There is also a `<head>` tag (and closing tag `</head>`), but things between those tags don't directly appear in the web page.
David Zaslavsky
Thank you very much for the reply. I am now studying the definitions of some of the tags to see if I can utilize their properties. Also, thanks for suggesting that I update the question.
ravi
A: 

If you are making the page then you can put all text between <span> tags (note that span tags can contain any other content too). If it isn't your page then good luck - the text can be nearly anywhere.

slugster
Was this a random down vote? Or is there actually something wrong with this answer, especially compared to other answers when the OP's intent is a little unclear?
slugster
thank you for your answer.
ravi
+1 for "the text can be nearly anywhere"
Stevko
A: 

HTML5 offers some content-area-specific tags... ?

Eh, revisiting this and reading your response above, it sounds like XSLT could be a possibility...

Imagine this situation: you have an XML document with custom tags, defined by you, which contain chunks of information, i.e.:

<Item>
<GenericText>Hello, </GenericText>
<AdminText>Admin, check the <a href="#">latest logs here</a></AdminText>
<UserText>User, please continue to look at my web page.</UserText>
</Item>

With XSLT, transformed on your server using a technology like PHP, you can write logic to accommodate which tags are displayed when. You could also insert valid, standard HTML tags inside your custom XML tags. Written correctly, your XSLT will just parse it as a chunk of XHTML.

This would form the basis of a formal approach to developing pages, templates and such, most likely - so if you don't have access to do this, or lack the prowess, then it will be of little help.

Danjah
Thank you for your reply. Not necessarily HTML5, as I will be working with any kind of webpage. But your suggestion will be very helpful when an HTML5 webpage is encountered. Thanks
ravi
+1  A: 

It's not a question of a '<text>' tag or anything--the question is what kind of text is it? If it's a header, it's <h1> or <h2> or whatever--if it's a paragraph it's <p>, if it's a list it's either <ul> or <ol> with <li>s in between.
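
Since the question is about building features from tags alone, here is a minimal sketch of how a page could be reduced to its tag sequence plus per-tag counts, using only Python's standard library. The names `TagSequenceCollector` and `tag_features` are hypothetical, and which of these features are actually useful for locating the main text is an open assumption, not something this answer establishes.

```python
from collections import Counter
from html.parser import HTMLParser


class TagSequenceCollector(HTMLParser):
    """Records opening and closing tags in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        self.sequence.append(tag)

    def handle_endtag(self, tag):
        # Closing tags are marked with a leading slash, e.g. "/p".
        self.sequence.append("/" + tag)


def tag_features(html):
    """Return (tag sequence, Counter of opening tags) for an HTML string."""
    collector = TagSequenceCollector()
    collector.feed(html)
    collector.close()
    counts = Counter(t for t in collector.sequence if not t.startswith("/"))
    return collector.sequence, counts


if __name__ == "__main__":
    page = "<body><h1>Title</h1><p>One</p><ul><li>a</li><li>b</li></ul></body>"
    sequence, counts = tag_features(page)
    print(sequence)
    print(counts.most_common())
```

A run of `<p>` tags, for example, might hint at article text while a run of `<li>` tags might hint at a menu, but that is only a heuristic and real pages break it often.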

D_N
Thank you very much for your answer. I will look into what the common HTML tags are and what they are used for, following the examples you have given.
ravi
+1  A: 

The answer to your question is, 'NO'. You can't. See boilerpipe for one example of the level of complexity involved in trying to find 'the main text' in a web page.
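
To give a feel for why tags alone are not enough, here is a deliberately crude sketch of a content heuristic. It is not boilerpipe's algorithm; it only splits the page into blocks at common block-level tags and keeps blocks that are long and contain few link words. The names `BlockCollector` and `likely_content_blocks`, the tag list, and the thresholds `min_words=10` and `max_link_density=0.3` are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumption: a hand-picked list of tags that start or end a text block.
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "body",
              "h1", "h2", "h3", "h4", "h5", "h6",
              "article", "section", "nav", "header", "footer"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # (text, word_count, link_word_count)
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        if self._words:
            text = " ".join(self._words)
            self.blocks.append((text, len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text after the last tag


def likely_content_blocks(html, min_words=10, max_link_density=0.3):
    """Return blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return [text for text, n, linked in collector.blocks
            if n >= min_words and linked / n <= max_link_density]
```

Even this toy version needs the text (word counts, link text) and not just the tags, and it will still misfire on short articles, long footers, and comment sections.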

bmargulies
A: 

Have you looked at the Semantic Web?

It may help you understand the limitations of interpreting meaning from html tags.

Stevko