views:

794

answers:

14

I know that Google’s search algorithm is mainly based on PageRank. However, it also analyses the structure of the document (title, H1, H2, and other HTML tags) to enhance the search results.

My question is:

What is the name of this technique "using the document structure to enhance the search results"?

And are there any academic papers to help me study this area?

The fact that Google takes the HTML structure into account is well covered in SEO articles; however, I could not find it in academic papers.

Many thanks; this will help me a lot in my research.

+10  A: 

SEO has become almost a religion to some people where they obsess about minutiae. Frankly, I'm not convinced that all this effort is justified.

My advice? Ignore what so-called pundits say and just follow Google's guidelines.

You might be looking for an academic answer but honestly, this isn't an academic question beyond the very basics of how Web indexing works. The reality of a modern page indexing and ranking algorithm is far more complex.

You may want to look at one of the earlier works on search engines. Note the authors' names. You may also want to read Google Patent application 20050071741.

These general principles aside, Google's search algorithm is constantly tweaked based on actual and desired results. The exact workings are a closely guarded secret just to make it harder for people to game the system. Much of the "advice" or descriptions on how Google's search algorithm works is pure supposition.

So, apart from having a title and having well-formed and valid HTML, I don't think you're going to find what you're looking for.

cletus
OP is looking specifically for academic work on the topic, not necessarily just how to get better Page Rank.
Chris
-1: While I agree with the opinion, this answer doesn't address the OP's question.
Joel Potter
Thanks for your advice, but I am looking specifically for academic work on the topic, as Chris said. Thanks for your contribution, and thanks to Chris for explaining my question better.
ahmed
+1  A: 

As cletus said, follow the Google guidelines.

I did a few tests and came to the conclusion that the title, image alt attributes, and h tags are the most important. Also worth mentioning is Google AdSense: I had the feeling that if you implement these, the rank of your site increases.

Richard
As Chris said, "I am looking specifically for academic work on the topic, not necessarily just how to get better PageRank". Thanks for your advice.
ahmed
+15  A: 

I think it's called "Semantic Markup"

[...] semantic markup is markup that is descriptive enough to allow us and the machines we program to recognize it and make decisions about it. In other words, markup means something when we can identify it and do useful things with it. In this way, semantic markup becomes more than merely descriptive. It becomes a brilliant mechanism that allows both humans and machines to “understand” the same information. http://www.digital-web.com/articles/writing_semantic_markup/

A more practical article here http://robertnyman.com/2007/10/29/explaining-semantic-mark-up/
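As a minimal sketch of why semantic markup helps machines: Python's standard-library `HTMLParser` can recover a document outline from semantic headings with almost no guesswork. The `OutlineParser` class here is invented for illustration; it is not anything a search engine actually ships.

```python
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    """Collects (tag, text) pairs for heading elements -- the structure
    that semantic markup makes directly machine-readable."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []     # list of (tag, heading text)
        self._current = None  # heading tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.outline.append((self._current, data.strip()))

page = "<h1>Contact Us</h1><p>intro</p><h2>Telephone</h2><p>555-1234</p>"
parser = OutlineParser()
parser.feed(page)
print(parser.outline)  # [('h1', 'Contact Us'), ('h2', 'Telephone')]
```

A page built from styled `<div>`s instead of `<h1>`…`<h6>` would yield an empty outline here, which is exactly the certainty gap NickFitz describes below.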

Philippe
I fail to see the relevance of semantic markup to the OP's question unless you can also show this has some relevance to search engines.
cletus
@cletus: using semantic markup, such as <h1> for the main heading, allows a search engine to have greater certainty about the structure of the page, which influences its ranking of that page for the relevant search terms. Although search engines are good at using heuristics to guess at the structure of pages that _don't_ use semantic markup, they definitely take note of semantic markup when they find it. Google's SEO Starter Guide http://googlewebmastercentral.blogspot.com/2008/11/googles-seo-starter-guide.html includes a section entitled "Use heading tags appropriately".
NickFitz
+1  A: 

I believe what you are interested in is called structural fingerprinting, and it is often used to determine the similarity of two structures. In Google's case, this would mean applying a weight to different tags and feeding the result into a secret algorithm that (probably) uses the frequencies of the different elements in the fingerprint. This is deeply rooted in information theory; if you are looking for academic papers on information theory, I would start with "A Mathematical Theory of Communication" by Claude Shannon.
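A toy illustration of the tag-weighting idea, with entirely made-up weights (Google's real weighting scheme is secret, so everything numeric here is an assumption):

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights -- invented for this sketch, not Google's values.
TAG_WEIGHT = {"title": 10.0, "h1": 5.0, "h2": 3.0, "b": 1.5}
DEFAULT_WEIGHT = 1.0

class WeightedTermParser(HTMLParser):
    """Scores each term by the weight of the heaviest enclosing tag."""
    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self._stack = []  # currently open tags, innermost last

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # pop back to the matching open tag, tolerating sloppy nesting
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max((TAG_WEIGHT.get(t, DEFAULT_WEIGHT) for t in self._stack),
                     default=DEFAULT_WEIGHT)
        for term in data.lower().split():
            self.scores[term] += weight

p = WeightedTermParser()
p.feed("<title>python tips</title><h1>python</h1><p>tips and tricks</p>")
print(p.scores["python"])  # 15.0 (10.0 from <title> + 5.0 from <h1>)
```

The resulting `Counter` is one crude form of structural fingerprint: two pages with similar weighted term distributions look "similar" to such an algorithm.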

Robert
+2  A: 

I have found this paper:

A New Study on Using HTML Structures to Improve Retrieval

However, it is an old paper (1999).

I am still looking for more recent papers.

ahmed
Have you searched for papers citing this one? ACM Portal lists 2, and it's possible that Citeseer or Google Scholar might know of more.
Novelocrat
A: 

I suggest trying Google Scholar as one of your avenues when looking for academic articles; a query such as "semantic search" is a good starting point.

Zac Thompson
+1  A: 

I would also suggest looking at Microformats and RDF. Both are used to enhance searching. These are mostly search-engine agnostic, but there are some engine-specific things as well. For Google-specific guidelines for HTML content, read this link.

Ritesh M Nayak
+3  A: 

Google very deliberately doesn't give away too much information about its search algorithm, so it's unlikely you will find a definitive answer or academic paper that confirms this. If you're interested from an SEO point of view, just write your pages so they are good for humans and the robots will like them too.

To make a page good for humans, you SHOULD use tags such as h1, h2 and so on to create a hierarchical page layout, a bit like this:

h1 "Contact Us"
    h2 "Contact Details"
        h3 "Telephone Numbers"
        h3 "Email Addresses"
    h2 "How To Find Us"
        h3 "By Car"
        h3 "By Train"

The difficulty with your question is that if you put something in your h1 tag hoping it would increase your position in Google, but it didn't match up with the other content on your page, you could look like you are spamming. Similarly, if your page is made up of too many headings and not enough actual content, you could look like you are spamming. It's not as simple as adding an h1 and an h2 tag and watching your ranking go up! That's why you need to write websites for humans, not robots.
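The "too many headings, not enough content" signal can be sketched as a crude ratio. Both the heuristic and the function name `heading_text_ratio` are illustrative assumptions made up for this sketch, not a known Google rule:

```python
import re

def heading_text_ratio(html: str) -> float:
    """Fraction of the page's words that sit inside <h1>..<h6> tags.
    A page that is mostly heading text looks keyword-stuffed."""
    heading_words = sum(
        len(re.sub(r"<[^>]+>", " ", h).split())
        for h in re.findall(r"<h[1-6][^>]*>(.*?)</h[1-6]\s*>", html, re.I | re.S)
    )
    total_words = len(re.sub(r"<[^>]+>", " ", html).split())
    return heading_words / total_words if total_words else 0.0

lean = "<h1>Contact</h1><p>Write to us at the address below any time.</p>"
stuffed = "<h1>cheap flights</h1><h2>cheap hotels</h2><p>buy now</p>"
print(round(heading_text_ratio(lean), 2))     # 0.1
print(round(heading_text_ratio(stuffed), 2))  # 0.67
```

A ranking system could treat a high ratio as one weak spam signal among many; the point of the sketch is only that the imbalance Sohnee describes is trivially measurable.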

Sohnee
+2  A: 

Check out:

http://jcmc.indiana.edu/vol12/issue3/pan.html

http://www.springerlink.com/content/l22811484243r261/

Some time spent on scholar.google.com might help you find what you are looking for

Amit Wadhwa
+2  A: 

You can also try searching the 'Computer Science' section of arXiv: http://arxiv.org for "search engine" and the various terms that others have suggested.

It contains many academic papers, all freely available... hopefully some of them will be relevant to your research. (Of course the caveat of validating any paper's content applies.)

A: 

I found it interesting that, with no meta keywords or description provided, in a scenario like this:

<p>Some introduction</p>
<h1>headline 1</h1>
<p>text for section one</p>

the "text for section one" is always what is shown on the search result page.
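The behaviour described above can be mimicked with a small parser that returns the text of the first `<p>` following an `<h1>`. This is an assumption made for illustration, not Google's documented snippet logic:

```python
from html.parser import HTMLParser

class SnippetParser(HTMLParser):
    """Picks the text of the first <p> after an <h1> -- a guess at one
    way a snippet could be chosen when no meta description exists."""
    def __init__(self):
        super().__init__()
        self.snippet = None
        self._seen_h1 = False
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._seen_h1 = True
        elif tag == "p" and self._seen_h1 and self.snippet is None:
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.snippet = " ".join("".join(self._buf).split())
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

page = ("<p>Some introduction</p>"
        "<h1>headline 1</h1>"
        "<p>text for section one</p>")
p = SnippetParser()
p.feed(page)
print(p.snippet)  # text for section one
```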

bb
+1  A: 

In short: very carefully. At length:

Quote from The Anatomy of a Large-Scale Hypertextual Web Search Engine:

[...] This gives us some limited phrase searching as long as there are not that many anchors for a particular word. We expect to update the way that anchor hits are stored to allow for greater resolution in the position and docIDhash fields. We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font. [...]

It goes on:

[...] Another big difference between the web and traditional well controlled collections is that there is virtually no control over what people can put on the web. Couple this flexibility to publish anything with the enormous influence of search engines to route traffic, and companies which deliberately manipulate search engines for profit become a serious problem. This is a problem that has not been addressed in traditional closed information retrieval systems. Also, it is interesting to note that metadata efforts have largely failed with web search engines, because any text on the page which is not directly represented to the user is abused to manipulate search engines. [...]

The paper Challenges in Web Search Engines addresses these issues in a more modern fashion:

[...] Web pages in HTML fall into the middle of this continuum of structure in documents, being neither close to free text nor to well-structured data. Instead HTML markup provides limited structural information, typically used to control layout but providing clues about semantic information. Layout information in HTML may seem of limited utility, especially compared to information contained in languages like XML that can be used to tag content, but in fact it is a particularly valuable source of meta-data in unreliable corpora such as the web. The value in layout information stems from the fact that it is visible to the user [...]

And adds:

[...] HTML tags can be analyzed for what semantic information can be inferred. In addition to the header tags mentioned above, there are tags that control the font face (bold, italic), size, and color. These can be analyzed to determine which words in the document the author thinks are particularly important. One advantage of HTML, or any markup language that maps very closely to how the content is displayed, is that there is less opportunity for abuse: it is difficult to use HTML markup in a way that encourages search engines to think the marked text is important, while to users it appears unimportant. For instance, the fixed meaning of the <h1> tag means that any text in an <h1> context will appear prominently on the rendered web page, so it is safe for search engines to weigh this text highly. However, the reliability of HTML markup is decreased by Cascading Style Sheets, which separate the names of tags from their representation. There has been research in extracting information from what structure HTML does possess. For instance, [Chakrabarti et al., 2001; Chakrabarti, 2001] created a DOM tree of an HTML page and used this information to increase the accuracy of topic distillation, a link-based analysis technique.

There are a number of issues a modern search engine needs to combat, for example web spam and black-hat SEO schemes.

But even in a perfect world, e.g. after eliminating the bad apples from the index, the web is still an utter mess because no two sites share an identical structure. There are maps, games, videos, photos (Flickr) and lots and lots of user-generated content. In other words, the web is still very unpredictable.

Hannson
+1  A: 

To keep it painfully simple. Make your information architecture logical. If the most important elements for user comprehension are highlighted with headings and grouped logically, then the document is easier to interpret using information processing algorithms. Magically, it will also be easier for users to interpret. Remember the search engine algorithms were written by people trying to interpret language.

The basic process is: write well-structured HTML, using header tags to indicate the most critical elements on the page. Use logical tags based on the structure of your information: lists for lists, headers for major topics.

Supply relevant alt text and names for any visual elements, and then use simple CSS to arrange these elements.

If the site works well for users and contains relevant information, you don't risk becoming a blacklisted spammer, and search engine algorithms will favor your page.
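The "supply relevant alt text" step can even be machine-checked. A throwaway sketch using the standard library (`AltChecker` is a name invented here, not an existing tool):

```python
from html.parser import HTMLParser

class AltChecker(HTMLParser):
    """Flags <img> elements that lack the alt text the answer recommends."""
    def __init__(self):
        super().__init__()
        self.missing = []  # src values of images with no (or empty) alt

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            if not attrs.get("alt"):
                self.missing.append(attrs.get("src", "<no src>"))

checker = AltChecker()
checker.feed('<img src="logo.png" alt="Company logo"><img src="deco.png">')
print(checker.missing)  # ['deco.png']
```

The same pattern extends to other lint-style checks (one `<h1>` per page, lists marked up as lists, and so on).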

I really enjoyed the book Transcending CSS for a clean explanation of properly structured HTML.

jkelley
A: 

There is also a new element you can use: the rel="canonical" link element, now supported by Google.

Etienne