I am writing a search engine that searches all kinds of blogs for specific kinds of content.

I'm focusing in particular on these blog engines:

  • Blogger
  • LiveJournal
  • MySpace
  • Open Diary
  • Tumblr
  • TypePad
  • Windows Live Spaces
  • WordPress.com
  • Xanga

Now my question is: is there any way I can use Google to find links that come from a specific type of engine, for example WordPress?

    +2  A: 

    There is no generic way to recognize links or page structures per se.

    If I were you, I would look at the head section of typical blogs from your list. WordPress, for example, outputs a generator META tag. That can of course be switched off, but it rarely is. I'm sure there are similar telltale signs for most other blogging engines.
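The generator check described above could be sketched roughly as follows in Python. This is only a minimal sketch: `GeneratorFinder` and `find_generators` are names made up for this illustration, and the page is assumed to have been downloaded into a string already.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collects the content of every <meta name="generator"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "generator":
            self.generators.append(attributes.get("content", ""))


def find_generators(html):
    """Return the generator strings found in an HTML document, in page order."""
    finder = GeneratorFinder()
    finder.feed(html)
    return finder.generators
```

An empty result does not prove anything, since the tag can be switched off; it only means this particular hint is missing.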

    For WordPress, an additional check would be to quietly test whether a wp-admin directory exists. It is usually protected by an .htaccess password dialog, which you would have to recognize and ignore.
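A rough Python sketch of the wp-admin probe might look like this. The function names are hypothetical, and the set of status codes treated as "the directory exists" is an assumption on my part: a password dialog normally shows up as 401, a refusal as 403, and a "not found" as 404.

```python
import urllib.error
import urllib.request

# Assumption: any of these answers suggests that /wp-admin/ exists,
# even when access to it is denied or password protected.
EXISTS_STATUSES = {200, 301, 302, 401, 403}


def status_suggests_wp_admin(status):
    """Interpret an HTTP status code returned for the /wp-admin/ path."""
    return status in EXISTS_STATUSES


def probe_wp_admin(base_url, timeout=5.0):
    """Request <base_url>/wp-admin/ and report whether it appears to exist."""
    url = base_url.rstrip("/") + "/wp-admin/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return status_suggests_wp_admin(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx answers; the code is still informative.
        return status_suggests_wp_admin(err.code)
    except OSError:
        # Covers URLError, timeouts and connection failures.
        return False
```

Something like `probe_wp_admin("http://example.com")` would then give one more hint, to be combined with the other signs rather than trusted on its own.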

    Pekka
    +1  A: 

    I ran across a site called sucuri.net that came up with a very interesting idea for fingerprinting web applications. The basic idea is to

    download the packages for different versions and perform a diff between each of them. After that, we compare the diffs looking for unique patterns present on each version.

    It worked pretty well on a couple of sites I tried and may give you some inspiration. You can read more about the methodology at: http://sucuri.net/?page=docs&title=fingerprinting-web-apps#v6

    Live test demo: http://sucuri.net/?page=docs&title=fingerprinting-web-apps#v5
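To illustrate the "diff the releases and keep what is unique" idea quoted above, here is a small Python sketch. It is not sucuri.net's actual implementation: the function names and the in-memory layout (version mapped to a dict of file path and file bytes) are assumptions made for the example.

```python
import hashlib


def digest_release(files):
    """Map each file path of one release to the SHA-1 digest of its content."""
    return {path: hashlib.sha1(content).hexdigest() for path, content in files.items()}


def unique_signatures(releases):
    """For each version, keep the (path, digest) pairs that no other version has.

    `releases` maps a version string to a dict of {path: bytes}.
    """
    digests = {version: set(digest_release(files).items())
               for version, files in releases.items()}
    result = {}
    for version, own in digests.items():
        others = set()
        for other_version, other in digests.items():
            if other_version != version:
                others |= other
        result[version] = own - others
    return result
```

A file whose digest appears in only one release can then be fetched from a live site and hashed to guess which version is running.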

    Jay Zeng
    +1  A: 

    There is a Firefox add-on called Backend Software Information. You might want to take a look at how it works; some of its source code is available too.

    Excerpt from backendinfo.com:

    How it works
    BackendInfo basically works by detecting URLs, strings and regular expressions. This ranges from typical html and css codes to certain directory and file-structures. It is easy to track changes in css, js and html files between releases that can be used to identify the backends exact version with tools like Meld. Furthermore, documents like CHANGELOG, UPDATES, etc are often left readable after a standard backend installation, providing some more insight as well.

    PS: The add-on has not been updated for use in FF3.6.
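The string and regular-expression matching described in the excerpt could be approximated like this in Python. The rule table is purely illustrative and not taken from the add-on: the patterns are my own assumptions about typical markup, and a real crawler would need a much larger, verified list.

```python
import re

# Illustrative rules only (assumptions, not BackendInfo's actual signatures).
SIGNATURES = {
    "WordPress": [re.compile(r"/wp-content/"), re.compile(r"/wp-includes/")],
    "Blogger": [re.compile(r"blogger\.com/", re.IGNORECASE)],
    "Tumblr": [re.compile(r"assets\.tumblr\.com", re.IGNORECASE)],
}


def match_engines(html):
    """Return {engine: number of matching rules} for every engine with a hit."""
    scores = {}
    for engine, patterns in SIGNATURES.items():
        hits = sum(1 for pattern in patterns if pattern.search(html))
        if hits:
            scores[engine] = hits
    return scores
```

Counting hits instead of returning a plain yes or no makes it easy to require two or more independent signs before labelling a blog.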

    o.k.w
    Ah, head, generator, urls with specific directories/structures, and specific css and js files/contents. Like it, sounds logical ;-))
    Quandary