html-parsing

Options for HTML scraping?

I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm actually interested in hearing about other languages as well. The story so far: Python: Beautiful Soup, lxml. Ruby: Hpricot, scrAPI, scRUBYt!. .NET: Html Agility ...

Problem with HTML Parser in IE

I am trying to create a dialog box that will appear only if the browser selected is IE (any version); however, I get this error: Message: HTML Parsing Error: Unable to modify the parent container element before the child element is closed (KB927917). That's all in "Line/Char/Code" 0, so I do not know where the error is. The code I'm us...

Convert > to HTML entity equivalent within HTML string

I'm trying to convert all instances of the > character to its HTML entity equivalent, &gt;, within a string of HTML that contains HTML tags. The furthest I've been able to get with a solution for this is using a regex. Here's what I have so far: public static readonly Regex HtmlAngleBracketNotPartOfTag = new Regex("(?:<[^>]*(?:>|$...

Library Recommendation: C++ HTML Parser

Preferably a lightweight HTML parser; I'm not exactly creating a browser or looking to modulate JS or any HTTP connections. ...

Extracting meaningful content from web pages

I am doing some analysis by mining web content using my crawlers. Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article that distracts a user from actual content. To extract the sensible content is a difficult problem as I understand it, considering the fact that there is no...

What language/tool should I use for HTML parsing?

Hello all, I have a couple of websites that I want to extract data from and, based on previous experience, this isn't as easy as it sounds. Why? Simply because the HTML pages I have to parse aren't properly formatted (missing closing tags, etc.). Considering that I have no constraints regarding the technology, language or tool that I can...

HTML Agility pack - parsing tables

Hello, I want to use the HTML Agility Pack to parse tables from complex web pages, but I am somehow lost in the object model. I looked at the link example, but did not find any table data this way. Can I use XPath to get the tables? After loading the data, I am basically lost on how to get the tables. I have done this in Perl before and ...

Library to generate .NET XmlDocument from HTML tag soup

I'm looking for a .NET library that can generate a clean XML tree, ideally System.Xml.XmlDocument, from invalid HTML code. I.e., it should make the kind of best-effort guesses, repairs, and substitutions browsers do when confronted with this situation, and generate a pretend XmlDocument. The library should also be well-maintained. :) I...

Parsing HTML in Python

What's my best bet for parsing HTML if I can't use BeautifulSoup or lxml? I've got some code that uses sgmllib, but it's a bit low-level and it's now deprecated. I would prefer it if it could stomach a bit of malformed HTML, although I'm pretty sure most of the input will be pretty clean. ...
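
One option that fits the constraints in this question is the standard library's `html.parser` module, which is event-based and generally tolerates unclosed tags. The following is only a minimal sketch of that approach, not a recommendation from the original thread; the class name `LinkCollector` and the helper `collect_links` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gather href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_links(markup):
    """Feed a (possibly malformed) HTML string and return the hrefs found."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

Because the parser only reports start tags, end tags and text as it sees them, missing closing tags do not stop it; the trade-off is that any tree structure has to be tracked by hand.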

Parsing HTML data with Nutch 1.0 and a custom plugin

I am currently trying to write a custom plugin for Nutch 1.0. This plugin is supposed to parse HTML data and filter out relevant information from documents. I have a basic plugin working; it extends the HtmlParserResult object and is executed each time I do a parse. My problems are twofold at the moment: I do not understand the wor...

Html Agility Pack - Parsing <li>

I want to scrape a list of facts from a simple website. Each one of the facts is enclosed in a <li> tag. How would I do this using Html Agility Pack? Is there a better approach? The only things enclosed in <li> tags are the facts and nothing else. ...

How can I clean HTML tags out of a ColdFusion string?

I am looking for a quick way to parse HTML tags out of a ColdFusion string. We are pulling in an RSS feed that could potentially have anything in it. We are then doing some manipulation of the information and then spitting it back out to another place. Currently we are doing this with a regular expression. Is there a better way to d...

How to parse HTML and CSS to understand the layout of the page (Java)

Hello all, I need to find a way to parse HTML and CSS layout so that I can transform it into a proprietary language that understands simple HTML with inline CSS on each HTML element. How should I approach such a task? ...

What regular expression would match this data?

I have the following within an XHTML document: <script type="text/javascript" id="JSBALLOONS"> function() { this.init = function() { this.wAPI = new widgetAPI('__BALLOONS__'); this.getRssFeed(); }; } </script> I'm trying to select everything in between the two script tags. The id will al...

What is the best practice for parsing remote content with jQuery?

Following a jQuery ajax call to retrieve an entire XHTML document, what is the best way to select specific elements from the resulting string? Perhaps there is a library or plugin that solves this issue? jQuery can only select XHTML elements that exist in a string if they're normally allowed in a div in the W3C specification; therefore,...

Java: parse HTML + CSS and convert the output to a different language

Hello all, I need to understand HTML + CSS files and convert them to something like an RTF layout in Java. I understand that I need some kind of HTML parser, but what do I need to do from there? How can I implement an HTML-CSS converter? Is there some kind of pattern or method for such jobs? ...

Advantages of XSLT or Linq to XML

What advantages are there for using either XSLT or Linq to XML for HTML parsing in C#? This is under the assumption that the HTML has been cleaned so it is valid XHTML. These values will eventually go into a C# object to be validated and processed. Please let me know if these are valid and if there are other things to consider. XSLT...

PHP regex for HTML

Hey guys, I'm trying to make a regex for taking some data out of a table. The code I've got now is: <table> <tr> <td>quote1</td> <td>have you trying it off and on again ?</td> </tr> <tr> <td>quote65</td> <td>You wouldn't steal a helmet of a policeman</td> </tr> </table> This I want to replace by: quot...
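
As an alternative to a regex, table data like this can be pulled out with an event-based parser. The sketch below is in Python rather than PHP and uses only the standard library's `html.parser`; the names `TableRows` and `table_rows` are hypothetical. Since the desired output format is cut off in the excerpt, the sketch stops at collecting the cell text per row.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Hypothetical helper: collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cell strings per <tr>
        self._cell = None   # text chunks of the <td> currently open, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        # Only keep text that appears inside an open <td>.
        if self._cell is not None:
            self._cell.append(data)


def table_rows(markup):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(markup)
    parser.close()
    return parser.rows
```

Applied to the table in the question, `table_rows` should give two rows of two cells each, which can then be joined into whatever output format is needed.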

lxml retrieving odd items with cssselector

In my test document I have a few elements with the class "item". Currently I'm using the following to select everything in the HTML file with this class: Selection = html.cssselect(".item") I'd like it to select all the odd items, like this in JavaScript using jQuery: Selection = $(".item:odd"); Trying that verbatim I get the following e...

Script to build HTML page from extracted DIVs from other HTML pages

I have a set of HTML reports that each contain two DIV elements with specific IDs that I need to strip out and compile into an overall summary report (again, an HTML file). My initial thoughts are that this is an ideal job for a Perl script, however we have no up-to-date in-house Perl skills (we're a .NET C# shop). Thoughts and suggest...