scraper

HTML scraper to remove and modify HTML pages?

I need an HTML scraper or a DOM editor. I know this question has been asked many times, and the usual answer is HTML Agility Pack, but it doesn't look good to me. I tried to remove a simple form element, but it removed only the <form> tag and left all the tags inside it, as well as the </form> tag. I used the PHP Simple HTML DOM Par...

Scrape data from HTML pages using Java, output to database

I need to know how to create a scraper (in Java) that gathers data from HTML pages and writes it to a database. I don't have a clue where to start, so any information you can give me would be great. Also, you can't be too basic or simple here. Thanks :) ...

How to rebuild Safari Web Clip functionality in PHP

Hi there, is there a way to rebuild Mac OS X Snow Leopard's Dashboard widget 'Web Clip' on a PHP website? Something like a crawler or scraper. I thought about using file_get_contents to get the page content into the page, but how do I select a section of the external page? And does this work with session/login content as well? I'm ...

BeautifulSoup and mechanize to get AJAX call result

Hi, I'm building a scraper using Python 2.5 and BeautifulSoup, but I've stumbled upon a problem: part of the web page is generated after the user clicks a button, which starts an AJAX request by calling a specific JavaScript function with the proper parameters. Is there a way to simulate the user interaction and get this result? I came across a mec...

Anyone have a good solution for scraping the HTML source of a page with content (in this case, HTML tables) generated with JavaScript?

Anyone have a good solution for scraping the HTML source of a page with content (in this case, HTML tables) generated with JavaScript? An embarrassingly simple, though workable, solution using Crowbar: <?php function get_html($url) // $url must be urlencode(d) { $context = stream_context_create(array( 'http' => array('timeout' => 12...

Facebook-like on-demand meta content scraper

Have you ever noticed that Facebook scrapes a link you post (in a status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumbnail of the image, various images from the linked page, or a video thumbnail for a video link (like YouTube)? Any ideas how one would copy this function? I'm thinkin...

Getting all PDF files from a domain (for example *.adomain.com)

I need to download all PDF files from a certain domain. There are about 6000 PDFs on that domain and most of them don't have an HTML link (either the link has been removed or there never was one in the first place). I know there are about 6000 files because I'm googling: filetype:pdf site:*.adomain.com However, Google lists only the fi...

crawler vs scraper

Can somebody distinguish between a crawler and a scraper in terms of scope and functionality? Thanks, Nayn ...

Blocking Web Scrapers

What are ways that websites can block web scrapers? How can you identify if your server is being accessed by a bot? ...

scrape email addresses

fff.html is an email with email addresses in it. Some have href mailto links and some don't. I want to scrape them and output them in the following format: [email protected],[email protected],[email protected] I have a simple scraper to get the ones that are href-linked, but something is weird <?php $url = "fff.html"; $raw = f...

How do I create an HTML scraper in PHP and get it working properly?

Please HELP! :( I am looking to develop a PHP script that does the following: (1) scrapes a remote HTML page and extracts selected data (e.g. a particular table/div); (2) saves the extracted data into a database (e.g. MySQL). Can anyone help? Thanks, I'd appreciate your prompt feedback. ...

Facebook stream API error: works in browser but not server-side

If I enter this URL in a browser, it returns the valid XML data that I am interested in scraping. http://www.facebook.com/ajax/stream/profile.php?__a=1&profile_id=36343869811&filter=2&max_time=0&try_scroll_load=false&_log_clicktype=Filter%20Stories%20or%20Pagination&ajax_log=0 However, if I do it from the...

Is there a way to find all of a site's page links from a URL?

If I have a link, say http://yahoo.com/, can I get the links inside Yahoo? For example, I have a website http://umair.com/ and I know there are just 5 pages (Home, About, Portfolio, FAQ, Contact). Can I get the links as follows programmatically? http://umair.com/index.html http://umair.com/about.html http://umair.com/portfolio.html http://...

How to stop scraping of links from my PHP page

Hello, I have a home page with some links and mail IDs. I need to stop scraping of my URLs and mail IDs from that web page. I have used robots.txt, but most of the bad crawlers won't respect that. ...

How can I write an automated script to log into, navigate, and save a website from a headless server?

Currently, I have a script that calls Firefox and runs a macro, but this is very buggy and rarely works the way I want it to. ...

Is it possible to scrape content and generate an RSS feed from a membership site?

Is it possible to scrape content from a membership site so that I can create an RSS feed for import into my inbox? You see, I'm a member of several sites that provide casting calls for the performing arts industry (some paid, some free), but most of them don't provide RSS feeds of the newest casting call updates, which means that I have t...

Scrubyt "next_page" not working with relative links?

Hello all. I'm trying to scrape the Yellow Pages website. Specifically, this link http://www.yellowpages.com/santa-barbara-ca/restaurants. My code works perfectly except for one small problem. Because the "Next" link to go to the next page of restaurants is a relative link, Scrubyt's "next_page" function doesn't work...apparently...

Import XML data via HTTPS

Hi, is it possible to get/scrape data from HTTPS links using PHP? The HTTPS page asks for a user name and password and has data in XML format. Is it possible to get this data using PHP? Can anyone suggest a procedure? ...