I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm actually interested in hearing about other languages as well.
The story so far:
Python
Beautiful Soup
lxml
Ruby
Hpricot
scrAPI
scRUBYt!
.NET
Html Agility ...
I want to parse some HTML in order to find the values of some attributes/tags etc.
What HTML parsers do you recommend? Any pros and cons?
...
I'm looking for a library/method to parse an HTML file with more HTML-specific features than generic XML parsing libraries.
...
I'm planning to write a simple J2SE application to aggregate information from multiple web sources.
The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need...
I hear of people writing these programs all the time and I know what they do, but how do they actually do it? I'm looking for general concepts.
...
I would like to extract from a general HTML page, all the text (displayed or not).
I would like to remove
any HTML tags
any JavaScript
any CSS styles
Is there a regular expression (one or more) that will achieve that?
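For what it's worth, a parser tends to be more dependable than a regular expression for this. Below is a minimal sketch using only Python's standard-library html.parser. The names TextExtractor and extract_text are made up for illustration, and it assumes the page is already available as a string. It keeps all text, displayed or not, and drops tags along with the contents of script and style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>.

    Hypothetical helper for illustration; not part of any library.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text of an HTML string with tags, JS and CSS removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling extract_text(page_source) returns one space-joined string; text hidden by CSS is still included, since the parser does not evaluate styles.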
...
In Groovy, how do I grab a web page and remove HTML tags, etc., leaving only the document's text? I'd like the results dumped into a collection so I can build a word frequency counter.
Finally, let me mention again that I'd like to do this in Groovy.
...
I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?
I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't nav...
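One way to illustrate the idea, sketched in Python rather than the asker's platform, is to let a tolerant HTML parser walk the string and collect each href as its own list entry. LinkCollector and extract_hrefs are hypothetical names; only the standard-library html.parser is assumed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Gather the href value of every <a> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from the parser.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return a list with one string per href found in the HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs
```

Because the result is an ordinary list, each link can later be stored in its own variable or filtered further.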
Not every website exposes its data well, with XML feeds, APIs, etc.
How could I go about extracting information from a website? For example:
...
<div>
<div>
<span id="important-data">information here</span>
</div>
</div>
...
I come from a background of Java programming and coding with Apache XMLBeans. Is there anything simil...
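As a rough sketch of pulling out the span in the snippet above, here is a Python version built on the standard-library html.parser. The class ElementTextById and the function text_by_id are hypothetical names, and the short VOID set of tags without closing tags is a simplifying assumption, not a complete list.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; a partial list, assumed for this sketch.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class ElementTextById(HTMLParser):
    """Collect the text inside the first element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self._depth = 0  # > 0 while inside the target element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self._depth:
            self._depth += 1  # nested element inside the target
        elif not self.parts and dict(attrs).get("id") == self.target_id:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.parts.append(data)


def text_by_id(html, target_id):
    """Return the text content of the element with id == target_id."""
    parser = ElementTextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

For the example markup, text_by_id(page, "important-data") would give back the span's text. The depth counter relies on reasonably balanced tags, so badly broken markup may need a more forgiving library.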
I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.
I'd like something more robust than using regular expressions that may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a ...
I need a RegEx pattern for extracting all the properties of an image tag.
As we all know, there is a lot of malformed HTML out there, so the pattern has to cover those possibilities.
I was looking at this solution http://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php but it didn't quite get ...
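A single pattern that copes with every malformed img tag is hard to get right, so here is an alternative sketch that swaps the regular expression for a tolerant parser. It is written in Python with the standard-library html.parser (not the PHP of the linked question), and ImgCollector and extract_img_attributes are hypothetical names.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Record the attributes of every <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            self.images.append(dict(attrs))


def extract_img_attributes(html):
    """Return one dict of attribute name -> value per <img> tag."""
    collector = ImgCollector()
    collector.feed(html)
    collector.close()
    return collector.images
```

Unquoted values, mixed-case attribute names and single-quoted values are normalised by the parser, which covers several of the malformed cases a regex would have to enumerate by hand.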
I am trying to work out the overhead of the ASP.NET auto-naming of server controls. I have a page which contains 7,000 lines of HTML rendered from hundreds of nested ASP.NET controls, many of which have id / name attributes that are hundreds of characters in length.
What I would ideally like is something that would extract every HTML a...
How would you solve this problem?
You're scraping HTML of blogs. Some of the HTML of a blog is blog posts, some of it is formatting, sidebars, etc. You want to be able to tell what text in the HTML belongs to which post (i.e. a permalink) if any.
I know what you're thinking: You could just look at the RSS and ignore the HTML altogether...
Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate.
Does such a library exist, or am I better off just trying to use regular expressions?
...
I am using Google's App Engine API
from google.appengine.api import urlfetch
to fetch a webpage. The result of
result = urlfetch.fetch("http://www.example.com/index.html")
is a string of the HTML content (in result.content). The problem is that the data I want to parse is not really in HTML form, so I don't think using a Python HT...
I would like to know if there is a simple way to parse HTML in VB.NET.
I know that HTML is not a strict subset of XML, but it would be nice if it could be treated that way. Is there anything out there that would let me parse HTML in an XML-like way in VB.NET?
...
How can I use Beautiful Soup and SelectorGadget to scrape a website? For example, I have a website (a Newegg product page) and I would like my script to return all of the specifications of that product (click on SPECIFICATIONS). By this I mean: Intel, Desktop, ......, 2.4GHz, 1066Mhz, ...... , 3 years limited.
After using SelectorGadget I ...
I have to parse a series of web pages in order to import data into an application. Each type of web page provides the same kind of data. The problem is that the HTML of each page is different, so the location of the data varies. Another problem is that the HTML code is poorly formatted, making it impossible to use an XML-like parser.
So...
Hi
I'm trying to write a regular expression for my HTML parser.
I want to match an HTML tag with a given attribute (e.g. <div> with class="tab news selected") that contains one or more <a href> tags. The regexp should match the entire tag (from <div> to </div>). I always seem to get "memory exhausted" errors - my program probably takes ev...
Of course an HTML page can be parsed using any number of Python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.
I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum...