How does Google find relevant content when it's parsing the web?

Let's say, for instance, that Google uses PHP's native DOM library to parse content. What methods would it use to find the most relevant content on a web page?

My thought is that it would find all paragraphs, order them by length, and then use possible search strings and query params to work out the percentage relevance of each paragraph.

Let's say we had this URL:

http://domain.tld/posts/stackoverflow-dominates-the-world-wide-web.html

Now, from that URL I would work out that the HTML file name is highly relevant, so I would then see how closely that string compares with all the paragraphs in the page!
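
Something like this minimal sketch is what I have in mind (untested; PHP's similar_text() does the comparison, and the URL would be whatever gets submitted):

    <?php
    // Sketch: score every paragraph against the URL slug.
    $url  = 'http://domain.tld/posts/stackoverflow-dominates-the-world-wide-web.html';
    $slug = str_replace('-', ' ', basename(parse_url($url, PHP_URL_PATH), '.html'));

    $dom = new DOMDocument();
    @$dom->loadHTML(file_get_contents($url)); // @ hides warnings on sloppy HTML

    $scores = [];
    foreach ($dom->getElementsByTagName('p') as $i => $p) {
        similar_text($slug, strtolower($p->textContent), $percent);
        $scores[$i] = $percent; // % resemblance between slug and paragraph
    }
    arsort($scores); // most relevant paragraph first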

A really good example of this is Facebook Share: when you share a page, Facebook quickly crawls the link and brings back images, content, etc.

I was thinking that some sort of scoring method would be best, working out the % relevancy of each piece depending on surrounding elements and metadata.

Are there any books / information on best practices for content parsing that cover how to get the best content from a site? Any algorithms that may be worth reading about, or any in-depth reply, would be appreciated.


Some ideas that I have in mind are:

  • Find all paragraphs and order by plain-text length (see the sketch after this list)
  • Somehow find the width and height of div containers and order by (W+H) - @Benoit
  • Check the meta keywords, title, and description, and check relevancy within the paragraphs
  • Find all image tags and order by size and by distance (in nodes) from the main paragraph
  • Check for object data, such as videos, and count the nodes from the largest paragraph / content div
  • Work out resemblances to previously parsed pages
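
A trivial sketch of the first idea, reusing the $dom loaded above:

    <?php
    // Sketch: collect all paragraphs and order them by plain-text length.
    $paras = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        $paras[] = trim($p->textContent);
    }
    usort($paras, function ($a, $b) {
        return strlen($b) - strlen($a); // longest paragraph first
    });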

The reason why I need this information:

I'm building a website where webmasters send us links and we then list their pages. I want the webmaster to submit a link, and then I go and crawl that page, finding the following information:

  • An image (if applicable)
  • A < 255 character paragraph from the best slice of text
  • Keywords that would be used for our search engine (Stack Overflow style)
  • Metadata: keywords, description, all images, change-log (for moderation and administration purposes)

I hope you guys can understand that this is not for a search engine, but that the way search engines tackle content discovery is in the same context as what I need it for.

I'm not asking for trade secrets, I'm asking what your personal approach to this would be.

Regards

A: 

Google for 'web crawlers, robots, spiders, and intelligent agents'; you might try the terms separately as well to get individual results.

What I think you're looking for is screen scraping (with the DOM), which Stack Overflow has a ton of Q&A on.

Phill Pafford
I really don't see how anything above is relevant. I fully understand what those entities are to a search engine; I am specifically asking about the algorithms used to find relevant content without specific selectors.
RobertPitt
@Robert if you find the algorithm that Google or other search engines keep secret, please start your own search engine company, as they do not share this information; it would be considered a "trade secret".
Phill Pafford
I'm not looking for Google's cluster bots' source code here; I'm looking for community-based methods for programmatically finding relevant data by following trends in content layout. Read my examples, please. And this is not for a search engine; it's for a **content sharing network**.
RobertPitt
I think you should re-title your question from "How do search engines find relevant content?" to something like "Algorithm/logic used for a content sharing network". Just my 2 cents.
Phill Pafford
Well, I specifically wish to know about the search engine side of it, because that's exactly what is needed; a **content sharing network** is basically a search engine, as it's a network that shares content.
RobertPitt
+1  A: 

Most search engines look for the title and meta description in the head of the document, then the first heading and the text content in the body. Image alt attributes and link titles are also considered. Last I read, Yahoo was still using the meta keywords tag, but most engines don't.
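
A minimal sketch of grabbing those fields with PHP's DOM (assuming $html already holds the fetched markup):

    <?php
    // Sketch: extract the fields most engines look at first.
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ suppresses warnings on invalid markup

    $xpath = new DOMXPath($dom);
    $title = $xpath->evaluate('string(//title)');
    $desc  = $xpath->evaluate('string(//meta[@name="description"]/@content)');
    $h1    = $xpath->evaluate('string((//h1)[1])');

    $alts = [];
    foreach ($xpath->query('//img[@alt]') as $img) {
        $alts[] = $img->getAttribute('alt'); // image alt texts
    }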

You might want to download the open source files from The Search Engine Project (TSEP) on SourceForge (https://sourceforge.net/projects/tsep/) and have a look at how they do it.

Chaoley
Plus 1 for the link, but let's say there are certain elements, such as a copyright overlay on every page that becomes visible via JavaScript; obviously that copyright div would be contained in the body. What I'm searching for is a way to separate such elements from the actual content. Is there a way with PHP DOM to compile the CSS as well, so you can see which elements have a high z-index and which are visible?
RobertPitt
There's no way to turn CSS into a DOM representation that I know of; you'd need to use the file functions for that. If you're searching a single site with a consistent code structure, the entire exercise is easy; if you want to search multiple sites, it's a lot harder. Another link: check out http://www.webmasterworld.com/perl/3460556.htm for some more ideas.
Chaoley
-1 The big search engines (i.e. Google and no one else :D) do **not** use meta description and meta keywords for site ranking...
nikic
To be fair, he didn't say that they use metadata to rank the page, but to extract the content, which is almost correct!
Mouhannad
I don't think we're talking about ranking here; the question was about finding relevant content.
kovshenin
Thank you kovshenin, +1
RobertPitt
@Mouhannad, kovshenin: Still the meta information isn't used for that either ^^
nikic
@nikic But they do use them. When I Google any of my websites, I can see its description coming from the description meta tag. Obviously, if you search for something specific, the snippet shows the exact quote you are looking for from that page, but if you search for something general, Google often shows you the meta description. Note: I'm not talking about the keywords meta tag; that is definitely not used anymore.
Mouhannad
What nikic was trying to say is that Google does not use the keywords / description as factors in finding which chunk of text within the page is most relevant.
RobertPitt
@Mouhannad: They only show the description but don't use it for ranking.
nikic
Yes, I know; that is what I've been trying to say! @Chaoley is talking about extracting content and not a ranking system.
Mouhannad
Actually, there is a library called Cobra which can get you the CSS properties (it's part of a browser's rendering engine); a pity that it's in Java, not PHP :-) http://lobobrowser.org/cobra.jsp
giraff
+7  A: 

I don't work at Google, but around a year ago I read that they had over 200 factors for ranking their search results. Of course the top one would be relevance, so your question is quite interesting in that sense.

What is relevance, and how do you calculate it? There are several algorithms, and I bet Google has its own, but the ones I'm aware of are Pearson correlation and Euclidean distance.
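
For example, here is a minimal sketch of Pearson correlation over word counts; the helper names are mine, not from any library:

    <?php
    // Sketch: measure how similar two texts are via Pearson correlation
    // over their combined word counts.
    function wordCounts($text) {
        $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        return array_count_values($words);
    }

    function pearson(array $a, array $b) {
        $keys = array_unique(array_merge(array_keys($a), array_keys($b)));
        $n = count($keys);
        if ($n === 0) return 0.0;
        $sx = $sy = $sxy = $sx2 = $sy2 = 0;
        foreach ($keys as $k) {
            $x = $a[$k] ?? 0;
            $y = $b[$k] ?? 0;
            $sx += $x; $sy += $y; $sxy += $x * $y;
            $sx2 += $x * $x; $sy2 += $y * $y;
        }
        $num = $sxy - $sx * $sy / $n;
        $den = sqrt(($sx2 - $sx * $sx / $n) * ($sy2 - $sy * $sy / $n));
        return $den == 0 ? 0.0 : $num / $den; // 1.0 = perfectly correlated
    }

    echo pearson(wordCounts('the cat sat on the mat'),
                 wordCounts('the cat lay on the mat'));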

A good book I'd suggest on this topic (not necessarily about search engines) is Programming Collective Intelligence by Toby Segaran (O'Reilly). A few samples from the book show how to fetch data from third-party websites via APIs or screen scraping, and how to find similar entries, which is quite nice.

Anyway, back to Google. Another relevance technique is of course full-text search, and you may want to get a good book on MySQL or Sphinx for that. TSEP, suggested by @Chaoley, is also quite interesting.

But really, I know people from a Russian search engine called Yandex here, and everything they do is under NDA, so I guess you can get close, but you cannot get perfect, unless you work at Google ;)

Cheers.

kovshenin
In layman's terms: I'm not talking about ranking, I am talking about scraping a page and finding the best bits.
RobertPitt
In that case it's information extraction and data mining, I guess, not relevancy.
kovshenin
What I want to know is how relevant paragraph A is to paragraph B, using keywords found in links, meta, and title, plus a title provided when the link is submitted to me. +1 for the book, very nice; the title looks very promising.
RobertPitt
So the first step is to find the most important piece of information and extract it from both websites; the next step is to calculate their relevance. Once again, Toby's book has a good sample of fetching data from a bunch of RSS feeds and grouping relevant sources, which is nice, but simpler, because RSS is short and supports tags and categories. You will have to look somewhere else for extraction techniques, though. I suggest starting with "Mining the Social Web" by Matthew Russell. (Yeah, I'm crazy about books.)
kovshenin
+12  A: 

Tricky, but I'll take a stab:

An image (If applicable)

  • the first image on the page
  • the image with a name that includes the letters "logo"
  • the image that renders closest to the top-left (or top-right)
  • the image that appears most often on other pages of the site
  • an image smaller than some maximum dimensions (see the sketch after this list)
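
A rough PHP sketch of the first, second, and last heuristics; the 600px limits are arbitrary assumptions:

    <?php
    // Sketch: prefer an image named "logo", else the first image within
    // sane dimensions, else just the first image found.
    function pickImage(DOMDocument $dom) {
        $srcs = [];
        foreach ($dom->getElementsByTagName('img') as $img) {
            $src = $img->getAttribute('src');
            if ($src === '') continue;
            if (stripos($src, 'logo') !== false) return $src; // "logo" wins
            $srcs[] = $src;
        }
        foreach ($srcs as $src) {
            $size = @getimagesize($src); // needs an absolute URL; slow
            if ($size && $size[0] <= 600 && $size[1] <= 600) return $src;
        }
        return $srcs[0] ?? null; // fall back to the very first image
    }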

A < 255 paragraph from the best slice of text

  • contents of the title tag
  • contents of the meta content description tag
  • contents of the first h1 tag
  • contents of the first p tag

Keywords that would be used for our search engine, (stack overflow style)

  • substring of the domain name
  • substring of the URL
  • substring of the title tag
  • proximity of the term to the most common word on the page, and to the top of the page

Meta data Keywords,Description, all images, change-log (for moderation and administration purposes)

  • ak! gag! Syntax Error.
John Mee
+1 for actually giving answers that are somewhat relevant to my question. What is the reasoning for the first h1 and first p tags?
RobertPitt
thx. The first 'h1' should be the biggest and most important heading of the page; if it contains the search term then the page is more likely to be relevant. Similarly for the 'p'; the first paragraph on the page is more likely to contain words reflective of the rest of the page, like an introduction, or summary of what follows; so if it mentions the search term once or twice then the whole page is perhaps relevant.
John Mee
@John: I think you should edit your answer and include what you wrote in your comment. :)
musicfreak
Great answers, g'luck for the bounty ;)
kovshenin
+1  A: 

Hi Robert,

I'd just grab the first "paragraph" of text. The way most people write stories/problems/whatever is that they first state the most important thing and then elaborate. If you look at almost any random text, you can see that it holds most of the time.

For example, you do it yourself in your original question: take its first three sentences and you have a pretty good summary of what you are trying to do.

And, I just did it myself too: the gist of my comment is summarized in the first paragraph. The rest is just examples and elaborations. If you're not convinced, take a look at a few recent articles I semi-randomly picked from Google News. Ok, that last one was not semi-random, I admit ;)

Anyway, I think that this is a really simple approach that works most of the time. You can always look at meta-descriptions, titles and keywords, but if they aren't there, this might be an option.

Hope this helps.

Edward
A: 

Google also uses a system called PageRank, where it examines how many links to a site there are. Let's say that you're looking for a C++ tutorial, and you search Google for one. You find one as the top result, and it's a great tutorial. Google knows this because it searched through its cache of the web and saw that everyone was linking to this tutorial while raving about how good it was. Google decides that it's a good tutorial and puts it as the top result.

It actually does this for everything it caches, giving each page a PageRank, as said before, based on the links to it.

Hope this helps!

Super_ness
The question wasn't about ranking pages but rather about finding relevance to a search term.
musicfreak
Incorrect. Not finding relevance to a search term, but finding relevant content within any site that our system is given; basically creating an engine that finds content the way the visual eye would.
RobertPitt
A: 

To answer one of your questions, I am reading the following book right now, and I recommend it: Google's PageRank and Beyond, by Amy Langville and Carl Meyer.

Mildly mathematical. Uses some linear algebra in a graph theoretic context, eigenanalysis, Markov models, etc. I enjoyed the parts that talk about iterative methods for solving linear equations. I had no idea Google employed these iterative methods.
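
To make the iterative idea concrete, here is a minimal power-iteration sketch in PHP over a made-up four-page link graph; this is the textbook algorithm, not Google's actual implementation:

    <?php
    // Sketch: PageRank by power iteration with a damping factor.
    $links = [0 => [1, 2], 1 => [2], 2 => [0], 3 => [0, 2]]; // page => outlinks
    $n = count($links);
    $rank = array_fill(0, $n, 1 / $n);
    $d = 0.85; // the usual damping factor

    for ($iter = 0; $iter < 50; $iter++) {
        $next = array_fill(0, $n, (1 - $d) / $n);
        foreach ($links as $page => $out) {
            foreach ($out as $target) {
                $next[$target] += $d * $rank[$page] / count($out);
            }
        }
        $rank = $next; // converges to the dominant eigenvector
    }
    arsort($rank); // highest PageRank first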

Short book, just 200 pages. Contains "asides" that diverge from the main flow of the text, plus historical perspective. Also points to other recent ranking systems.

Steve
This is not what my question is asking!
RobertPitt
@RobertPitt: It sure looks like it is, to me. "How does Google find relevant content when its parsing the web?" This is how Google does it. "Are there any books..." This is a book.
Merlyn Morgan-Graham
Thank you, Merlyn Morgan-Graham. Re-reading the question, I admit my answer may have missed the point. However, RobertPitt, (1) is it necessary to chide people volunteering their knowledge to help you solve your problem, and (2) if five answerers misunderstood your question in the same way, could it be possible that, perhaps, just perhaps, the question itself could be improved?
Steve
+12  A: 

Hello,

This is a very general question but a very nice topic! Definitely upvoted :) However, I am not satisfied with the answers provided so far, so I decided to write a rather lengthy answer on this.

The reason I am not satisfied is that the answers are basically all true (I especially like the answer of kovshenin (+1), which is very graph-theory related...), but they are all either too specific about certain factors or too general.

It's like asking how to bake a cake and getting the following answers:

  • You make a cake and you put it in the oven.
  • You definitely need sugar in it!
  • What is a cake?
  • The cake is a lie!

You won't be satisfied, because you want to know what makes a good cake. And of course there are a lot of recipes.

Of course Google is the most important player, but, depending on the use case, a search engine might include very different factors or weight them differently.

For example, a search engine for discovering new independent music artists may put a malus on artists' websites with a lot of inbound external links.

A mainstream search engine will probably do the exact opposite to provide you with "relevant results".

There are (as already said) over 200 factors that are published by Google, so webmasters know how to optimize their websites. There are very likely many, many more that the public is not aware of (in Google's case).

But within the very broad and abstract field of SEO optimization, you can generally break the important ones into two groups:

  1. How well does the answer match the question? Or: how well does the page's content match the search terms?

  2. How popular/good is the answer? Or: what's the PageRank?

In both cases the important thing is that I am not talking about whole websites or domains; I am talking about single pages with a unique URL.

It's also important to note that PageRank doesn't represent all factors, only the ones that Google categorizes as popularity. And by "good" I mean other factors that just have nothing to do with popularity.

In Google's case, the official statement is that they want to give relevant results to the user, meaning that all algorithms will be optimized towards what the user wants.

So after this long introduction (glad you are still with me...) I will give you a list of factors that I consider to be very important (at the moment):

Category 1 (how well does the answer match the question?)

You will notice that a lot comes down to the structure of the document!

  • The page primarily deals with the exact question.

Meaning: the question words appear in the page's title text or in headings and paragraphs. The same goes for the position of these keywords: the earlier in the page, the better. Repetition helps too (as long as it's not too much, which goes under the name of keyword stuffing). A small scoring sketch follows this list.

  • The whole website deals with the topic (keywords appear in the domain/subdomain).

  • The words are an important topic on this page (internal links' anchor texts jump to positions of the keyword, or anchor/link texts contain the keyword).

  • The same goes if external links use the keywords in their link text to link to this page.
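
A small sketch of the first factor: reward a term that appears early and is repeated moderately, with a crude stuffing penalty. All thresholds are made-up illustration, not Google's weights:

    <?php
    // Sketch: position + repetition scoring for one query term.
    function termScore($text, $term) {
        $text = strtolower($text);
        $term = strtolower($term);
        $first = strpos($text, $term);
        if ($first === false) return 0.0;
        $positionBoost = 1 - $first / max(1, strlen($text)); // earlier = better
        $density = substr_count($text, $term) / max(1, str_word_count($text));
        if ($density > 0.05) {                 // smells like keyword stuffing,
            $density = max(0, 0.1 - $density); // so start taking points away
        }
        return $positionBoost + 10 * $density;
    }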

Category 2 (how important/popular is the page?)

You will notice that not all factors point towards this exact goal. Some are included (especially by Google) just to give a boost to pages that... well... just deserved/earned it.

  • Content is king!

The existence of unique content that can't be found elsewhere on the web, or only in very few places, gives a boost. This is mostly measured by unordered combinations of words on a website that are generally used very little (important words). But there are much more sophisticated methods as well.

  • Recency - newer is better

  • Historical change (how often the page has been updated in the past; changing is good).

  • External link popularity (how many links in?)

If one page links to another, the link is worth more if the linking page itself has a high PageRank.

  • External link diversity

Basically links from different root domains, but other factors play a role too, even factors like how geographically separated the web servers of the linking sites are (according to their IP addresses).

  • Trust Rank

For example, if big, trusted, established sites with editorial content link to you, you get a trust rank. That's why a link from the NY Times is worth much more than one from some strange new website, even if its PageRank is higher!

  • Domain trust

Your whole website gives a boost to your content if your domain is trusted. Different factors count here: of course, links from trusted sites to your domain, but it will even do you good to be in the same datacenter as important websites.

  • Topic-specific links in.

If websites that can be resolved to a topic link to you, and the query can be resolved to this topic as well, it's good.

  • Distribution of inbound links over time.

If you earn a lot of inbound links in a short period of time, this will do you good at that time and in the near future afterwards, but not so much later on. If you earn links slowly and steadily, it will do you good for content that is "timeless".

  • Links from restricted domains

A link from a .gov domain is worth a lot.

  • User click behaviour

What's the click-through rate of your search result?

  • Time spent on site

Google Analytics tracking, etc. It's also tracked whether the user clicks back or clicks another result after opening yours.

  • Collected user data

Votes, ratings, etc., references in Gmail, and so on.

Now I will introduce a third category; one or two points from above would actually go into this category, but I hadn't thought of that at the time... The category is:

**How important/good is your website in general?**

All your pages will be ranked up a bit depending on the quality of your website.

Factors include:

  • Good site architecture (easy to navigate, structured; sitemaps, etc.)

  • How established the site is (long-existing domains are worth more).

  • Hoster info (what other websites are hosted near you?)

  • Search frequency of your exact name.

Last but not least, I want to say that a lot of these factors can be enriched by semantic technology, and new ones can be introduced.

For example, someone may search for Titanic while you have a website about icebergs... the two can be set into correlation, which may be reflected in the results.

Newly introduced semantic identifiers, for example OWL tags, may have a huge impact in the future.

For example, a blog about the movie Titanic could put a marker on its page indicating that it has the same content as the Wikipedia article about the same movie.

This kind of linking is currently under heavy development, and nobody knows how it will be used.

Maybe duplicate content will be filtered and only the most important copy displayed? Or maybe the other way round, so that you get presented with a lot of pages that match your query, even if they don't contain your keywords?

Google even weights factors differently depending on the topic of your search query!

Joe Hopfgartner
Thank you for taking the time to create such a wealthy answer, but please read my question thoroughly: I'm not talking about SEO. This has nothing to do with search engines, apart from the fact that I would like to adopt some of their techniques for a different usage.
RobertPitt
@RobertPitt - This is how to find *relevant* content. Google implements all of this to find *relevant* content (as you ask in the very first line). I think he answered your question quite well, just in more detail than you may have wanted. Besides, SEO is nothing more than a web designer presenting *relevant* content to Google to increase page rank. Pretty links, h1 tags, page linking, etc. are all methods Google uses to find *relevant* content.
Jason
+1  A: 

There are lots of highly sophisticated algorithms for extracting the relevant content from a tag soup. If you're looking to build something usable yourself, you could take a look at the source code for Readability and port it over to PHP. I did something similar recently (can't share the code, unfortunately).

The basic logic of Readability is to find all block-level tags and count the length of the text in them, not counting children. Then each parent node is awarded a fraction (half) of the weight of each of its children. This is used to find the block-level tag that has the largest amount of plain text. From there, the content is further cleaned up.
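
A minimal sketch of that scoring pass with PHP's DOM; this is my rough reconstruction of the logic, not Readability's actual code:

    <?php
    // Sketch: score each block element by its own text length plus half
    // of each child's score, then keep the highest-scoring block.
    function scoreNode(DOMNode $node, array &$best) {
        $own = 0;
        $fromChildren = 0;
        foreach ($node->childNodes as $child) {
            if ($child instanceof DOMText) {
                $own += strlen(trim($child->textContent));
            } else {
                $fromChildren += scoreNode($child, $best);
            }
        }
        $score = $own + $fromChildren / 2; // half of each child's weight
        $blocks = ['p', 'div', 'article', 'section', 'td', 'li'];
        if ($node instanceof DOMElement && in_array($node->nodeName, $blocks)
            && (!isset($best['score']) || $score > $best['score'])) {
            $best = ['score' => $score, 'node' => $node];
        }
        return $score;
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html); // assumes $html holds the page markup
    $best = [];
    scoreNode($dom->documentElement, $best);
    $content = isset($best['node']) ? $best['node']->textContent : '';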

It's not bulletproof by any means, but it works well in the majority of cases.

troelskn
+2  A: 

Actually answering your question (and not just generally about search engines):

I believe going about it a bit like Instapaper does would be the best option.

The logic behind Instapaper (I didn't create it, so I certainly don't know its inner workings, but it's pretty easy to predict how it works):

  1. Find the biggest bunch of text in text-like elements. Relying on paragraph tags, while very elegant, won't work with those crappy sites that use divs instead of p's. Basically, you need to find a good balance between block elements (divs, ps, etc.) and the amount of text. Come up with some threshold: if X number of words stays undivided by markup, that text belongs to the main body text. Then expand to siblings, keeping a text/markup threshold of some sort.

  2. Once you've done the most difficult part (finding which text belongs to the actual article), it becomes pretty easy. You can find the first image around that text and use it as your thumbnail. This way you will avoid ads, because they will not be that close to the body text markup-wise.

  3. Finally, coming up with keywords is the fun part. You can do tons of things: order words by frequency, remove noise (ands, ors, and so on) and you have something nice (see the sketch after this list). Mix that with a "prominent short text element above the detected body text area" (i.e. your article's heading), the page title, and meta tags, and you have something pretty tasty.
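
A tiny sketch of that frequency-plus-noise-removal idea; the stop-word list here is a deliberately small sample:

    <?php
    // Sketch: crude keyword extraction by word frequency. Real stop-word
    // lists are much longer than this.
    function keywords($text, $limit = 10) {
        $stop = ['and', 'or', 'the', 'a', 'an', 'of', 'to', 'in', 'is', 'it'];
        $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        $counts = array_count_values(array_diff($words, $stop));
        arsort($counts); // most frequent first
        return array_slice(array_keys($counts), 0, $limit);
    }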

All these ideas, if implemented properly, will be quite bulletproof, because they do not rely on semantic markup; by making your code more thorough, you ensure that even very sloppily coded websites will be detected properly.

Of course, it comes with the downside of poor performance, but I guess it shouldn't be that poor.

Tip: for large-scale websites that people link to very often, you can set the HTML element that contains the body text (the one I was describing in point #1) manually. This will ensure correctness and speed things up.

Hope this helps a bit.

flixic
A: 

I would consider these while building the code:

  • Check for synonyms and acronyms
  • Apply OCR to images to search them as text (ABBYY FineReader and Recostar are nice; Tesseract is free and fine, though not as fine as FineReader :) )
  • Weight fonts as well (size, boldness, underline, color)
  • Weight content depending on its place on the page (e.g., content in the upper part of the page is more relevant), as sketched below
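
A hedged sketch of the last two points; the tag weights and the position decay are made-up numbers:

    <?php
    // Sketch: weight text by the tag it sits in and by how far down the
    // source it appears. Text inside nested tags may be counted twice;
    // good enough for illustration.
    function weightedBlocks(DOMDocument $dom) {
        $tagWeights = ['h1' => 3.0, 'h2' => 2.5, 'b' => 1.5,
                       'strong' => 1.5, 'u' => 1.2, 'p' => 1.0];
        $scores = [];
        foreach ($tagWeights as $tag => $weight) {
            foreach ($dom->getElementsByTagName($tag) as $node) {
                $text = trim($node->textContent);
                if ($text === '') continue;
                $line = $node->getLineNo(); // source position as a proxy
                $scores[] = [
                    'text'  => $text,
                    'score' => $weight * strlen($text) / (1 + $line / 100),
                ];
            }
        }
        usort($scores, function ($a, $b) {
            return $b['score'] <=> $a['score'];
        });
        return $scores; // highest-weighted text first
    }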

Also:

  • An optional text, requested from the webmaster, to describe the page

You can also check whether you can find anything useful in the Google Search API: http://code.google.com/intl/tr/apis/ajaxsearch/

honibis