html-content-extraction

Beautifulsoup get value in table

I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "Owner Name(s)". What I have works, but it is really ugly and surely not the best approach, so I am looking for a better way. Here is what I have: soup = BeautifulSoup(url_opener.open(url)) x = soup('table', text = re.compile("Owner Name"))...

how to extract data from a raw html file

Is there a way to extract desired data from raw HTML that has been written unsemantically, with no IDs or classes? Suppose there is a saved HTML file of a webpage (a profile) and I want to extract data such as 'hobbies'. Is it possible to do this using PHP? ...

Extracting html elements in a given region?

Given a region defined by a rectangle and a URL, is there any way to determine which elements lie within the given rectangle on the page at that URL? EDIT: Screen resolution, font size, etc. can all be set to reasonable defaults. ...

How can I read and parse the contents of a webpage in R

I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. I am wondering how I can do it ...

Screen scraping HTTPS using C#

How to screen scrape HTTPS using C#? ...

How to retrieve google pages

Dear all, I am now using a web tool, http://fiddesktop.cs.northwestern.edu/mmp/scrape?url= , to parse a webpage. For example, to parse a New York Times page, we enter http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=http%3A//www.nytimes.com/pages/world/index.html in the address bar of our browser, and it parses things nicely for us. ...

BeautifulSoup Grab Visible Webpage Text

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, my test case is http://www.nytimes.com/2009/12/21/us/21storm.html and I mainly want to get the body text (the article) and maybe even a few tab names here and there. However, after trying this suggestion http://stack...
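
One possible approach, as a minimal sketch: it uses Python's standard-library html.parser rather than BeautifulSoup, and treats text as "visible" when it is not nested inside a tag such as script or style. The HIDDEN_TAGS list and the names VisibleTextParser and visible_text are assumptions made for illustration; text hidden by CSS is not detected.

```python
from html.parser import HTMLParser

# Tags whose text is not rendered as page content (an assumed, non-exhaustive list).
HIDDEN_TAGS = {"script", "style", "head", "title", "noscript", "template"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes that are not nested inside a hidden tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hidden_depth = 0  # > 0 while inside any hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # HTML comments arrive through handle_comment, so they never reach here.
        text = data.strip()
        if text and self.hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling visible_text on a page source returns one string of text with scripts, styles, and comments left out; navigation text is still included, so picking out only the article body needs a further heuristic.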

YQL scrape entire website/domain

Hi, I'm trying to scrape a set of links and content from a domain. The query in Google would be "site:www.newswebsite.com search_term". I've seen some things that come close to getting this working, but I can't quite get a search working across a whole website and then filter by the search term. Is this possible without a custom d...

Any ideas about the jQuery equivalent of the READABILITY code? (Or: building the best heuristic to find the main text using jQuery)

http://lab.arc90.com/experiments/readability/ is a very handy tool for viewing cluttered newspaper, journal and blog pages in a very readable manner. It does this by using some heuristics to find the relevant main text of a web page. Its source code is also available at http://lab.arc90.com/experiments/readability/js/readability.js ...

Looking for an alternative to Webzinc .NET, a screen scraping and web automation library for .NET

I came across this .NET library, http://www.webzinc.com/online/faq.aspx ; however, I was wondering if there is a free alternative out there? ...

What is the state of the art in HTML content extraction?

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice? Pointers to goo...

PHP: Data from cURL, HTML Scan

How can I scan an HTML page for text within a certain div? ...

What algorithms could I use to identify content on a web page

I have a web page loaded in the browser (i.e., its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of such elements) that likely contains the most content (as in a continuous block of text). The goal is to exclude things like menus, headers, footers and such. ...
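
One simple heuristic for this, as a rough sketch only: credit every run of text to its nearest enclosing container element, ignore text inside links (menus tend to be link-heavy), and rank containers by the amount of text they hold. The sketch works on an HTML string with Python's standard-library html.parser instead of a live browser DOM, so element positioning is not used. CONTAINER_TAGS, IGNORED_TAGS, BlockScorer and rank_blocks are names and choices assumed for illustration.

```python
from html.parser import HTMLParser

# Elements treated as candidate content blocks (an assumed list).
CONTAINER_TAGS = {"body", "div", "article", "section", "main", "td"}
# Text inside these tags is not counted toward any block.
IGNORED_TAGS = {"a", "script", "style"}


class BlockScorer(HTMLParser):
    """Credits each text run to its innermost open container element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []         # open containers, innermost last
        self.blocks = []        # every container seen, in document order
        self.ignored_depth = 0  # > 0 while inside a link, script or style

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self.ignored_depth += 1
        elif tag in CONTAINER_TAGS:
            block = {"tag": tag, "id": dict(attrs).get("id"), "text_len": 0}
            self.stack.append(block)
            self.blocks.append(block)

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS:
            self.ignored_depth = max(0, self.ignored_depth - 1)
        elif tag in CONTAINER_TAGS:
            # Pop back to the matching open tag, tolerating sloppy nesting.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i]["tag"] == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        if self.stack and self.ignored_depth == 0:
            self.stack[-1]["text_len"] += len(data.strip())


def rank_blocks(html):
    """Return all containers sorted by directly held text length, largest first."""
    scorer = BlockScorer()
    scorer.feed(html)
    scorer.close()
    return sorted(scorer.blocks, key=lambda b: b["text_len"], reverse=True)
```

The first entry of rank_blocks(page_source) is the container holding the most non-link text. Because text is credited only to the innermost container, an article split across many nested div elements would need the scores summed up the tree, which this sketch does not do.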

Using Beautiful Soup Python module to replace tags with plain text

Hi All, I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before and they were all pointed to Beautiful Soup and that's how I got started with it. I was able to successfully get most of the content but I am running into some challenges with tags that are part of the content. (I ...

Extracting data using screenscrapers

I am looking for recommendations for a screenscraper. I need to extract "Contact Us" information from certain web sites. Any ideas where I can get a good (preferably free) screenscraper? ...

Get content from table with id. Regex

I need to sort through an HTML string to get the content I need. I need to loop through the tr's in a table that has an ID. I could really use some help getting this regex going. I appreciate all the help I can get ...
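
As an alternative to a regex, here is a minimal sketch that walks the rows of a table with a given id using a parser. It is written in Python with the standard-library html.parser, which swaps in a different technique from the regex the question asks about. TableRowParser, rows_of_table and the table id "data" in the usage note are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the cell text of each row in the table with the given id."""

    def __init__(self, table_id):
        super().__init__(convert_charrefs=True)
        self.table_id = table_id
        self.table_depth = 0  # 1 inside the target table, > 1 inside nested tables
        self.rows = []        # one list of cell strings per <tr>
        self.cell = None      # text chunks of the cell being read, or None

    def _flush_cell(self):
        # Store the current cell, if any; also covers cells with no closing tag.
        if self.cell is not None:
            self.rows[-1].append(" ".join(self.cell))
            self.cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.table_depth > 0:
                self.table_depth += 1
            elif dict(attrs).get("id") == self.table_id:
                self.table_depth = 1
        elif self.table_depth == 1:
            if tag == "tr":
                self._flush_cell()
                self.rows.append([])
            elif tag in ("td", "th") and self.rows:
                self._flush_cell()
                self.cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            if self.table_depth == 1:
                self._flush_cell()
            self.table_depth -= 1
        elif tag in ("td", "th", "tr") and self.table_depth == 1:
            self._flush_cell()

    def handle_data(self, data):
        # Text of a nested table is folded into the enclosing outer cell.
        if self.cell is not None and data.strip():
            self.cell.append(data.strip())


def rows_of_table(html, table_id):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRowParser(table_id)
    parser.feed(html)
    parser.close()
    return parser.rows
```

Usage would look like: for row in rows_of_table(html_string, "data"), followed by whatever per-row handling is needed. Unlike a regex, this copes with attribute order, nested tags inside cells, and omitted closing tags.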

Using jQuery to Grab Content

Hello, I'm trying to pull a couple variables from the following block of html. If you wouldn't mind helping, it would be greatly appreciated! <div id="services"> <div id="service1"> <div class="left"> <img alt="Service Picture" src="/images/service-picture.jpg" /> <h2 class="serviceHeading">A Beautiful Header...

Recommended library for scraping html data.

I need to process quite a bit of [fairly] arbitrary html data. The data thankfully can be broken into about twelve different templates. My current plan is to build a filter for each of the templates that allows me to extract the required data sans irrelevant content. Problem is I'm not sure what the ideal tool for the job is. I was h...

PHP Session Variables

A user will click on a link which will open a new page (code below). My problem is that when this new page is opened, it creates a NEW session ID. How do I stop this from happening? require_once('../../config.php'); //Database connection details require_once('../../connect.php'); //Connect to database session_start(); <--------...

How to automatically update a site with another site's contents?

How can I update a site with the contents of another site that is refreshed often (maybe twice a minute)? ...