Hi,
I need to scrape/parse some search-engine-related data for a given domain name (site).
I need:
Google PageRank (only for the domain name, not each page).
Number of indexed results/pages (Google, Bing).
Number of backlinks (Google, Bing, Yahoo).
Traffic rank (Alexa).
Site thumbnail.
Could you provide me with some pointers on where...
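For the Alexa traffic rank part, here is a minimal PHP sketch, assuming the old data.alexa.com XML endpoint (an unofficial, historically available interface) still responds; the domain is a placeholder:

<?php
// Sketch: fetch the Alexa traffic rank for a domain.
// Assumes the unofficial data.alexa.com XML endpoint is still reachable.
$domain = 'example.com';
$xml = @simplexml_load_file('http://data.alexa.com/data?cli=10&url=' . urlencode($domain));

if ($xml !== false && isset($xml->SD->POPULARITY)) {
    // The POPULARITY element carries the rank in its TEXT attribute
    echo 'Alexa rank: ' . (string) $xml->SD->POPULARITY['TEXT'] . "\n";
} else {
    echo "Rank not available\n";
}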
I have launched a new movie-based web portal; I have completed the programming part and made the site dynamic.
My question: how can I get data about movies in different languages?
For example, www.imdb.com has a huge database collection.
Is there any web crawling methodology by which we can get it?
Or a dirty method of complete data ...
I'm not talking about HTML tags, but tags used to describe blog posts, YouTube videos, or questions on this site.
If I were crawling just a single website, I'd use an XPath expression to extract the tag, or even a regex if it's simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed.
I...
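One generic approach is to try a handful of common conventions in order. A sketch in PHP (the conventions covered here are only the two most common; a real implementation would add more):

<?php
// Sketch of a generic tag extractor: tries common tag conventions in order.
function extract_tags($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);   // suppress warnings from messy real-world markup
    $xpath = new DOMXPath($doc);
    $tags = array();

    // Convention 1: <a rel="tag"> links, used by many blog platforms
    foreach ($xpath->query('//a[@rel="tag"]') as $a) {
        $tags[] = trim($a->textContent);
    }

    // Convention 2: <meta name="keywords" content="...">
    foreach ($xpath->query('//meta[@name="keywords"]/@content') as $meta) {
        foreach (explode(',', $meta->value) as $kw) {
            $tags[] = trim($kw);
        }
    }
    return array_unique(array_filter($tags));
}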
I have the following code in my .htaccess:
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
My pretty-link conversion is done in my index.php. If the user is looking for something that doesn't exist, a 404 header is produc...
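For reference, a minimal sketch of what the 404 branch in index.php might look like; lookup_route() is a hypothetical stand-in for the poster's own router:

<?php
// In index.php: after trying to resolve the pretty link.
// lookup_route() is a hypothetical function standing in for your own routing code.
$page = lookup_route($_SERVER['REQUEST_URI']);

if ($page === null) {
    // Send a real 404 status so crawlers don't index the error page
    header('HTTP/1.0 404 Not Found');
    include '404.php';   // your custom error template
    exit;
}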
I'm crawling an SNS with a crawler written in Python.
It worked for a long time, but a few days ago the web pages fetched by my servers started coming back as ERROR 403 FORBIDDEN.
I tried changing the cookie, changing the browser, and changing the account, but all failed.
And it seems that the forbidden servers are all in the same network segment.
What can I do? Steal...
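One thing worth trying before anything drastic: slow down and send browser-like headers, in case the block keys on request fingerprints rather than IP addresses. A sketch in PHP (the question's crawler is Python, but the idea carries over; the URL, headers, and delay are assumptions):

<?php
// Sketch: fetch a page with browser-like headers and a polite delay.
// Whether this helps depends on how the site detects crawlers.
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0 (Windows NT 6.1)\r\n" .
                    "Accept: text/html\r\n",
    ),
));

$html = file_get_contents('http://example.com/page', false, $context);
sleep(2); // throttle requests so the traffic looks less automated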
What will this line block when the search engine crawls the website?
Disallow: /?s
...
Is there a module out there that can give me links to all the pages a website has?
Why I need it: I want to crawl some sites and search for tags in them; searching only the main page is not enough.
Thanks,
...
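Lacking a ready-made module, a rough sketch of extracting all links from one page with PHP's DOM extension (recursion and relative-URL resolution are left out):

<?php
// Sketch: collect all href targets from a single page.
// A real crawler would also resolve relative URLs and recurse.
function page_links($url) {
    $doc = new DOMDocument();
    @$doc->loadHTMLFile($url);   // @ hides warnings from invalid markup
    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }
    return array_unique($links);
}

print_r(page_links('http://example.com/'));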
How do you prevent email addresses from being harvested from HTML pages by email spiders? Does linking them with mailto: increase the likelihood of them being picked up? I think I read somewhere about a built-in Ruby function that confuses email spiders by decimal-encoding the email address - can anyone link me to some documentation or tell me how effective...
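For reference, the decimal-encoding trick mentioned works roughly like this: each character is rewritten as a decimal HTML entity, which browsers render normally but naive regex-based harvesters often miss. A PHP sketch of the same idea (how effective it is against modern spiders is an open question):

<?php
// Encode every character of an address as a decimal HTML entity.
// Browsers render it as a normal mailto: link; simple regex harvesters miss it.
function obfuscate_email($email) {
    $out = '';
    for ($i = 0; $i < strlen($email); $i++) {
        $out .= '&#' . ord($email[$i]) . ';';
    }
    return $out;
}

$enc = obfuscate_email('user@example.com');
echo '<a href="mailto:' . $enc . '">' . $enc . '</a>';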
I have substantial PHP experience, although I realize that PHP probably isn't the best language for a large-scale web crawler because a process can't run indefinitely. What languages do people suggest?
...
Hi,
I have used robots.txt to restrict one of the folders on my site. The folder contains sites that are under construction. Google has indexed all those sites which are in the testing phase, so I used robots.txt. I first submitted the site, and robots.txt is enabled. Now the status is success for www.mysite.com/robots.txt. But the goog...
So I'm creating a web crawler and everything works; I've only got one problem.
With file_get_contents($page_data["url"]); I get the content of a web page. This web page is scanned to check whether one of my keywords exists on it.
$find = $keywords;
$str = file_get_contents($page_data["url"]);
// strpos() returns a position (possibly 0) or false, so compare strictly
if (strpos($str, $find) !== false)
When I want to insert...
Currently I'm using Mechanize and the get() method to fetch each site, and checking each main page for something with the content() method.
I have a very fast computer and a 10 Mbit connection, and still it took 9 hours to check 11K sites, which is not acceptable. The problem is the speed of the get() function, which, obviously, needs to get the pag...
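The bottleneck here is almost certainly sequential network I/O rather than CPU, so fetching pages in parallel usually helps. A sketch of the parallel-fetch idea in PHP with curl_multi (the question uses Mechanize, so this only illustrates the concept; the URLs are placeholders):

<?php
// Sketch: fetch a batch of URLs concurrently with curl_multi.
$urls = array('http://example.com/', 'http://example.org/'); // placeholders
$multi = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($multi, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until every handle has finished
do {
    curl_multi_exec($multi, $running);
    curl_multi_select($multi);
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // ... check $html for your pattern here ...
    curl_multi_remove_handle($multi, $ch);
    curl_close($ch);
}
curl_multi_close($multi);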
Disallow: /*“http:
is what I've been using - my guess is I may need to escape the quotation mark somehow. In Google Webmaster Tools, it's not even reading that quotation mark (where it allows you to see the robots.txt file and test it on a few URLs).
On Google Webmaster Tools, it displays the robots.txt file without the quotes for ...
.. and how does the web crawler infer the semantics of information on the website?
Please list the ranking signals in a separate answer.
...
I would like to start parsing large numbers of raw HTML pages into semantic data structures.
I'm just interested in the community's opinion on the various tools available for such a task, particularly useful libraries in any language.
So far I'm planning on using Hadoop to manage a lot of the processing, but I'm curious about alter...
If I want to build a complex website like Google News, which gathers data from other websites
(data mining, crawling), in which language should I build it?
Currently I know only PHP. Can I do that in PHP?
...
I have to build a website where I need to crawl, or you could say read and filter, 50 websites.
After reading those websites I need to filter the news, e.g. news related to Mercedes-Benz, and then display it on my website with a reference to the original source.
Basically, what Google News is doing.
Currently I know PHP and can b...
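PHP can handle this kind of task; most news sites expose RSS feeds, which are far easier to read than raw HTML. A minimal sketch (the feed URL and keyword are placeholders):

<?php
// Sketch: read an RSS feed and keep only items matching a keyword.
$feed_url = 'http://example.com/news/rss';   // placeholder feed
$keyword  = 'Mercedes-Benz';

$rss = simplexml_load_file($feed_url);
foreach ($rss->channel->item as $item) {
    if (stripos($item->title . ' ' . $item->description, $keyword) !== false) {
        // Display the headline with a link back to the original source
        echo '<a href="' . htmlspecialchars($item->link) . '">'
           . htmlspecialchars($item->title) . "</a>\n";
    }
}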
I am interested to know, in a very general situation (a home-brew amateur web crawler), what its performance will be. More specifically, how many pages can a crawler process?
When I say home-brew, take that in all senses: a 2.4 GHz Core 2 processor, written in Java, a 50 Mbit internet connection, etc.
Any resources you may share in this r...
Good morning Stack Overflow,
I'm still busy with my web crawler and I just need some final help.
Because crawling the web can take a lot of time, I want to let pcntl_fork() help me create multiple children to split my code into parts; a sketch of the fork pattern follows below.
Master - crawls the domain
Child - when receiving a link, the child must crawl the links found on the domain
...
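For reference, the basic master/child fork pattern looks like this; get_domain_links() and crawl_link() are hypothetical stand-ins for the poster's own code:

<?php
// Sketch of the master/child split with pcntl_fork().
// get_domain_links() and crawl_link() are hypothetical placeholders.
$links = get_domain_links();   // master collects the links first

foreach ($links as $link) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die('Could not fork');
    } elseif ($pid == 0) {
        // Child process: crawl one link, then exit so it isn't reused
        crawl_link($link);
        exit(0);
    }
    // Parent (master) continues the loop and forks the next child.
    // A real crawler would cap the number of concurrent children here.
}

// Master waits for all children to finish
while (pcntl_wait($status) != -1);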
Is there a web spider that can grab the contents of forums?
My company does not provide an internet connection, so I want to grab the threads of a forum so I can read the contents at the office.
I have tried WebLech, but it can only grab static pages.
...