html-content-extraction

Options for HTML scraping?

I'm thinking of trying Beautiful Soup, a python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement, I'm actually interested in hearing about other languages as well. The story so far: Python Beautiful Soup lxml Ruby Hpricot scrAPI scRUBYt! .NET Html Agility ...

What HTML parsing libraries do you recommend in Java

I want to parse some HTML in order to find the values of some attributes/tags etc. What HTML parsers do you recommend? Any pros and cons? ...

What is the best way to parse html in C#?

I'm looking for a library/method to parse an html file with more html specific features than generic xml parsing libraries. ...

How do you grab a text from webpage (Java)?

I'm planning to write a simple J2SE application to aggregate information from multiple web sources. The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need...

How do screen scrapers work?

I hear people writing these programs all the time and I know what they do, but how do they actually do it? I'm looking for general concepts. ...

regular expression to extract text from HTML

I would like to extract from a general HTML page, all the text (displayed or not). I would like to remove any HTML tags Any javascript Any CSS styles Is there a regular expression (one or more) that will achieve that? ...

Strip HTML from a web page and calculate word frequency?

In Groovy, how do I grab a web page and remove HTML tags, etc., leaving only the document's text? I'd like the results dumped into a collection so I can build a word frequency counter. Finally, let me mention again that I'd like to do this in Groovy. ...

C# - Best Approach to Parsing Webpage?

I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this? I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't nav...

Extracting Information from websites

Not every website exposes their data well, with XML feeds, APIs, etc How could I go about extracting information from a website? For example: ... <div> <div> <span id="important-data">information here</span> </div> </div> ... I come from a background of Java programming and coding with Apache XMLBeans. Is there anything simil...

Extracting text from HTML file using Python

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad. I'd like something more robust than using regular expressions that may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a ...

RegEx for extracting HTML Image properties

I need a RegEx pattern for extracting all the properties of an image tag. As we all know, there are lots of malformed HTML out there, so the pattern has to cover those possibilities. I was looking at this solution http://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php but it didn't quite get ...

How to extract element id attribute values from HTML

I am trying to work out the overhead of the ASP.NET auto-naming of server controls. I have a page which contains 7,000 lines of HTML rendered from hundreds of nested ASP.NET controls, many of which have id / name attributes that are hundreds of characters in length. What I would ideally like is something that would extract every HTML a...

Map RSS entries to HTML body w. non-exact search

How would you solve this problem? You're scraping HTML of blogs. Some of the HTML of a blog is blog posts, some of it is formatting, sidebars, etc. You want to be able to tell what text in the HTML belongs to which post (i.e. a permalink) if any. I know what you're thinking: You could just look at the RSS and ignore the HTML altogether...

parsing HTML on the iPhone

Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate. Does such a library exist, or am I better off just trying to use regular expressions? ...

Parsing fixed-format data embedded in HTML in python

I am using google's appengine api from google.appengine.api import urlfetch to fetch a webpage. The result of result = urlfetch.fetch("http://www.example.com/index.html") is a string of the html content (in result.content). The problem is the data that I want to parse is not really in HTML form, so I don't think using a python HT...

How do you parse an HTML in vb.net

I would like to know if there is a simple way to parse HTML in vb.net. I know that HTML is not sctrict subset of XML, but it would be nice if it could be treated that way. Is there anything out there that would let me parse HTML in an XML-like way in VB.net? ...

Parsing an HTML file with selectorgadget.com

How can I use beautiful soup and selectorgadget to scrape a website. For example I have a website - (a newegg product) and I would like my script to return all of the specifications of that product (click on SPECIFICATIONS) by this I mean - Intel, Desktop, ......, 2.4GHz, 1066Mhz, ...... , 3 years limited. After using selectorgadget I ...

How do you parse a poorly formatted HTML file?

I have to parse a series of web pages in order to import data into an application. Each type of web page provides the same kind of data. The problem is that the HTML of each page is different, so the location of the data varies. Another problem is that the HTML code is poorly formatted, making it impossible to use a XML-like parser. So...

Need help writing regular expression (html parsing)

Hi I'm trying to write a regular expression for my html parser. I want to match a html tag with given attribute (eg. <div> with class="tab news selected" ) that contains one or more <a href> tags. The regexp should match the entire tag (from <div> to </div>). I always seem to get "memory exhausted" errors - my program probably takes ev...

python method to extract content (excluding navigation) from an HTML page

Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc. I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum...