I want to be able to grab content from web pages, especially the tags and the content within them. I have tried XQuery and XPath, but they don't seem to work on malformed XHTML, and regular expressions are just a pain.
Is there a better solution? Ideally I would like to be able to ask for all the links and get back an array of URLs, or ask for the text...
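One possible approach to the "ask for all the links" part is a tolerant, event-based parser rather than XPath or regular expressions. The sketch below uses Python's standard `html.parser`, which does not require well-formed markup. The names `LinkCollector` and `extract_links` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, even in badly formed markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return a list of all link targets found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Malformed sample: unclosed <td>, <b>, and <a> tags, one unquoted attribute.
sample = '<table><tr><td><a href="/one">One<td><b><a href=two.html>Two</table>'
print(extract_links(sample))
```

Because the parser only reacts to start tags, missing end tags do not matter for this task. Relative URLs are returned as written; `urllib.parse.urljoin` can resolve them against the page address if needed.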
I'm using PHP and libtidy to attempt to screen scrape what might possibly be the most horrendous and malformed use of HTML tables in history. The site closes few table, tr, td, font, or bold tags and consistently nests many different layers of tables within tables.
Example snippet:
<center>
<table border="1" bordercolor="#000000" cells...
I'm currently developing an IE plugin using SpicIE.
This plugin does some web scraping similar to the example posted on MSDN:
WebRequest request = WebRequest.Create ("http://www.contoso.com/default.html");
request.Credentials = CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse ();
S...
I am working on a project in which I need to log in to a site and scrape the web page contents. I tried the following code:
protected void Page_Load(object sender, EventArgs e)
{
WebClient webClient = new WebClient();
string strUrl = "http://www.mail.yahoo.com?username=sakthivel123&password=operator&login=1";
byte[] reqH...
Hi, I want to create a desktop app (probably in C#) that scrapes or manipulates a form on a third-party web page. Basically, I enter my data in a form in the desktop app; the app then goes to the third-party website and, using a script or whatever in the background, enters my data there (including my login) and clicks the submit button for me. I just want t...
Hi all,
I am a newbie; I try different things every day and always come here when I am stuck on something.
I want to write a script using cURL and PHP that goes to this link: http://tools.cisco.com/WWChannels/LOCATR/openBasicSearch.do and then goes through each page for each country, capturing a list of every partner in every country a...
I think it'd be fun to build a little mini-fantasy baseball game, but after a bit of Googling, I'm getting the impression that there's no easy and reasonably-priced (or free!) way to get that data. Have any of you done something like this? Should I be thinking about screen-scraping?
...
Using Python, I'm trying to read the values on http://utahcritseries.com/RawResults.aspx. I can read the page just fine, but am having difficulty changing the value of the year combo box, to view data from other years. How can I read the data for years other than the default of 2002?
The page appears to be doing an HTTP Post once the ...
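If the page is an ASP.NET Web Forms page, as the `.aspx` extension suggests, changing the combo box normally triggers a postback that must echo the page's hidden fields such as `__VIEWSTATE` and `__EVENTVALIDATION`. The following is a minimal sketch of building that POST body. The control name `ddlYear` and the sample markup are hypothetical, so the real name has to be read from the page source.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, dropdown_name, year):
    """Return URL-encoded POST data that selects `year` in the dropdown."""
    parser = HiddenFieldCollector()
    parser.feed(html)
    parser.close()
    form = dict(parser.fields)
    form["__EVENTTARGET"] = dropdown_name  # control that triggers the postback
    form["__EVENTARGUMENT"] = ""
    form[dropdown_name] = str(year)
    return urlencode(form).encode("ascii")


# Hypothetical stand-in for the HTML returned by the first GET request.
page = """
<form method="post" action="RawResults.aspx">
<input type="hidden" name="__VIEWSTATE" value="abc123" />
<input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
<select name="ddlYear"><option>2002</option><option>2005</option></select>
</form>
"""
data = build_postback(page, "ddlYear", 2005)
print(data.decode("ascii"))
```

The resulting bytes would then be passed as the `data` argument of `urllib.request.urlopen` (Python 3) against the same URL, which turns the request into a POST.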
I want to get show information from MySpace artists. One way I could do this is to ask an artist to input their MySpace URL, and I could try to scrape the page.
What I would really like to do is ask the artist for their MySpace credentials and use the MySpace API to get their show data. I cannot find how to do this on the MySpace develop...
I want to write a program that analyzes your fantasy baseball team and notifies you of recommended actions, possibly multiple times per day. The problem is, you aren't playing fantasy baseball on my site; you're playing on Yahoo, CBS, ESPN, etc.
On the majority of these sites, fantasy teams and leagues are not public, so you must...
I am trying to write a Perl script to connect to my YouTube account, but it doesn't seem to work. Basically I just want to connect to my account, but apparently it is not working. I don't even have an idea of how I could debug this! Maybe it is something related to the HTTPS protocol?
Please enlighten me! Thanks in advance.
use HTTP::Request:...
I am "screen scraping" my own pages as a temporary hack, using .NET's WebRequest.
This works well, but accented characters and other diacritics do not translate correctly.
I am wondering if there is a way to make them translate correctly using .NET's many built-in properties and methods.
Here is the code I am using to gra...
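Garbled accents usually mean the response bytes were decoded with a different character set than the one the server used. The general idea, shown here as a Python sketch rather than in .NET, is to read the charset from the `Content-Type` header and decode with it. `decode_body` is a hypothetical helper name and the sample bytes are made up.

```python
from email.message import Message


def decode_body(raw_bytes, content_type, fallback="utf-8"):
    """Decode a response body using the charset named in Content-Type."""
    header = Message()
    header["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is given.
    charset = header.get_content_charset() or fallback
    return raw_bytes.decode(charset, errors="replace")


# The same bytes read very differently depending on the charset assumed.
raw = "café à la crème".encode("iso-8859-1")
good = decode_body(raw, "text/html; charset=ISO-8859-1")
bad = decode_body(raw, "text/html")  # falls back to UTF-8, accents are lost
print(good)
print(bad)
```

In the second call the accented bytes are not valid UTF-8, so they come out as replacement characters, which is the kind of damage described above.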
I want to get an HTML page with python and then print out all the IPs from it.
I will define an IP as the following:
x.x.x.x:y
Where:
x = a number between 0 and 256.
y = a number with < 7 digits.
Thanks.
...
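A regular expression fits the `x.x.x.x:y` definition above reasonably well. The sketch below assumes each `x` is an octet in the usual 0 to 255 range and `y` has one to six digits. The function name `find_ip_ports` and the sample text are made up for illustration.

```python
import re

# One octet: 0-255 without accepting values such as 256 or 999.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Four octets joined by dots, a colon, then a port of 1 to 6 digits.
# The lookbehind and lookahead stop matches that start or end mid-number.
IP_PORT = re.compile(
    r"(?<![\d.])" + r"\.".join([OCTET] * 4) + r":\d{1,6}(?!\d)"
)


def find_ip_ports(text):
    """Return every x.x.x.x:y occurrence found in the text, in order."""
    return IP_PORT.findall(text)


sample = (
    "Proxies: 10.0.0.1:8080, 192.168.1.254:3128, "
    "bad 999.1.1.1:80, too long 1.2.3.4:1234567"
)
print(find_ip_ports(sample))
```

For a live page, the text could come from `urllib.request.urlopen(url).read().decode()` in Python 3 before being passed to `find_ip_ports`.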
I am writing a cURL script to collect information about some sex offenders. I have developed a script that picks up links like the one given below:
http://criminaljustice.state.ny.us/cgi/internet/nsor/... (snipped URL)
Now, when I follow this link, I want to get the information under all the fields on the page, like Offender Id:, last nam...
I have two machines, each running, to the best of my knowledge, Python 2.5 and BeautifulSoup 3.1.0.1.
I'm trying to scrape http://utahcritseries.com/RawResults.aspx, using:
from BeautifulSoup import BeautifulSoup
import urllib2
base_url = "http://www.utahcritseries.com/RawResults.aspx"
data=urllib2.urlopen(base_url)
soup=BeautifulSo...
Hi all,
I am using a cURL script to go to a link and get its content for further manipulation. The link and the cURL script follow:
<?php
$url = 'http://criminaljustice.state.ny.us/cgi/internet/nsor/fortecgi?serviceName=WebNSOR&amp;templateName=detail.htm&amp;requestingHandler=WebNSORDetailHandler&amp;ID=368343543';
//cur...
I need to do some HTML parsing with Python. After some research, lxml seems to be my best choice, but I am having a hard time finding examples that help me with what I am trying to do. This is why I am here. I need to scrape a page for all of its viewable text, stripping out all tags and JavaScript. I need it to leave me with what text is vi...
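As a rough illustration of "viewable text only", the sketch below drops tags and skips the contents of `<script>` and `<style>` elements. It uses the standard-library `html.parser` instead of lxml, so it is a stand-in for the idea rather than an lxml recipe. The class name, the list of block-level tags, and the sample page are assumptions.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    SKIPPED = {"script", "style"}  # contents are never shown to the reader
    BLOCK = {"p", "div", "br", "li", "tr", "td", "th",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")  # keep neighbouring blocks apart

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the whitespace left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def visible_text(html):
    """Return the text of an HTML string without tags, scripts, or styles."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = """<html><head><style>p {color: red}</style>
<script>var x = "<b>not text</b>";</script></head>
<body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>"""
print(visible_text(sample))
```

This does not account for text hidden through CSS or for content that JavaScript adds at run time; both would need a real rendering engine.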
I'm trying to build an iPhone application that gathers content from real estate websites and displays it in a mashed-up, structured manner (mapping, price averages, etc.).
I've stumbled upon many sites whose "Terms and Conditions" only allow downloading/re-using the data for personal purposes but not commercial ones. My intent is to hav...
I want to be able to log into a website programmatically and periodically obtain some information from the site. Which tool or tools would make this as simple as possible? I'd prefer a Python library of some type because I want to become more proficient in Python, but I'm open to any suggestions.
...
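For a site with a plain HTML login form, the standard library may already be enough: an opener with a cookie jar keeps the session cookie between requests. This is a sketch under assumptions. The form field names `username` and `password` and the `example.com` URLs are placeholders, and sites that rely on JavaScript or anti-forgery tokens will need more work.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Build an opener that stores cookies and sends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_request(login_url, username, password):
    """Build the POST request that submits the login form."""
    # "username" and "password" are hypothetical field names; read the real
    # ones from the site's login form.
    body = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=body.encode("utf-8"))


opener, jar = make_session()
request = login_request("https://example.com/login", "alice", "s3cret")
print(request.get_method(), request.full_url)

# To actually log in and then fetch a protected page with the same cookies:
#   opener.open(request)
#   html = opener.open("https://example.com/account").read()
```

Supplying `data` is what makes the request a POST. Reusing the same `opener` for every later request is what keeps the login alive, and the periodic part can be handled by running the script from a scheduler such as cron.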
I'm looking for a good C++ library for web scraping.
It has to be C/C++ and nothing else, so please do not direct me to "Options for HTML scraping" or other SO questions/answers where C++ is not even mentioned.
...