html-parsing

Parsing HTML in Python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason other than finding the syntax a bit easier to learn and understand. But I see a lot of people seem to favour lxml and I've heard that lxml is f...

How to get the text from XML with position in the XML file?

I want to parse HTML (you can assume it is XML, converted via Tidy) and get all the text nodes (that is, the visible nodes inside the body tag) along with their location in the XML file. Location means the text position in the flat XML file. ...

HTML Agility Pack vs jquery

Do you know of any extension for HTML Agility Pack that allows querying an HtmlDocument object (created by HAP) in jQuery style (instead of XPath)? ...

Remove text from within a database text field

I recently tried to import a bunch of blog posts from an old blog (SharePoint) to my current blog (WordPress). When the import completed, a lot of nasty <div> tags and other HTML made it into the content of the posts, which broke the way my site was rendering. I'm able to view the offending rows in the MySQL database and want to k...

Could the value of an html anchor tag be fetched using xpath?

If I have HTML that looks like: <td class="blah">&nbsp;<a href="http://.....">????</a>&nbsp;</td> Could I get the ???? value using XPath? What would it look like? ...

Parsing HTML to get content using C#

I am writing an application that crawls a group of my web pages. Rather than take the entire source code of the page I'd like to take all of the content and store that and be able to store the page as plain text within a database. The content will be used in other applications and not read by users so there's no need for it to be perfect...

BeautifulSoup HTML table parsing

I am trying to parse information (HTML tables) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1 Currently I am using BeautifulSoup, and the code I have looks like this: from mechanize import Browser from BeautifulSoup import BeautifulSoup mech = Browser() url = "http://www.511virginia.org/RoadConditions.aspx...

replacing xml tag with html value

Hi! I'm working with C# .NET. I have a question: I'm loading an XML file with XDocument.xDoc.Load(file), but it fails because my content also contains XML tags. Example: <root><abc><deg></abc></root> My problem is that the Load function treats <deg> as an XML tag without a matching "</deg>"... My question is, how can I replace th...

What regex can I use to extract URLs from a Google search?

I'm using Delphi with the JCLRegEx and want to capture all the result URLs from a Google search. I looked at HackingSearch.com and they have an example RegEx that looks right, but I cannot get any results when I try it. I'm using it like this: Var re:JVCLRegEx; I:Integer; Begin re := TJclRegEx.Create; With re do try ...

Adding whitespace to web page source so that I can read it.

I'm curious about the web page I'm viewing. I use "view--page source" and get a window with the HTML. I cut and paste this into Notepad++, then go through it manually, adding whitespace to make it readable. Is there a better way to do the last step? I'm hoping something has been written which automates this process, giving the...

Java library for HTML analysis

Hi, (I've seen similar questions, but I think none of them cater to my specific needs, hence...) I would like to know if there is a Java library for analysis of real-world (read: incomplete, ill-formed) HTML. By analysis, I mean things like: figuring out the most prominent color in an HTML chunk; changing that color to some other colo...

PHP Regex: Get info between groups of HTML tags?

I have been programming a word-unscrambler. I need to parse the information between a group of tags and another, and put all matches into an array. The beginning tag is: <tr> <td></td><td><li> and the ending tag is: </li></td> </tr> I know some regular expressions, but I am unfamiliar with PHP. ...

Indexing text content of html

I want to pull the text out of html files for indexing purposes, and do so as fast as possible. Rather than create something from scratch, I want to see how much I can find already done for me. Currently I'm just piping the output of html2text, which works, but between being python and trying to prettify the text, I'm sure the speed cou...

HTML parser for GAE

Generally I use lxml for my HTML parsing needs, but that isn't available on Google App Engine. The obvious alternative is BeautifulSoup, but I find it chokes too easily on malformed HTML. Currently I am testing libxml2dom and have been getting better results. Which pure Python HTML parser have you found performs best? My priority is th...

Which Html Parser is best?

I code a lot of parsers. Up till now, I was using HtmlUnit headless browser for parsing and browser automation. Now, I want to separate both the tasks. As 80% of my work involves just parsing, I want to use a light html parser because it takes much time in HtmlUnit to first load a page, then get the source and then parse it. I want to...

How would I get the inputs from a certain form with HtmlAgility Pack? Lang: C#.net

Code can explain this problem much better than I can. I have also included alternate ways I've tried to do this. If possible, please explain why these other methods didn't work. I've run out of ideas, and sadly there aren't many examples for HtmlAgilityPack. I'm currently going through the documentation looking for more ideas thou...

Selecting the next link using XPath

I have to write an XPath expression to get the href attribute of the anchor tag in the HTML below that comes right after the one that is marked as "current-page" (in the example #notimportant/2). <dd> <a href="#notimportant/1" class="current-page">1</a> <a href="#notimportant/2">2</a> <a href="#notimportant/3">3</a> <a ...

Load local .html file's strings into table view cells

iPhone OS development: I need to set the names of UITableView cells to strings I get from a local "file.html" file. I know I will need to parse the HTML but I'm not worried about that at the moment. If someone could show me some quick code that would read the first line of text in the HTML file into an NSString variable, I think...

Html Parser for PHP like Java

I have been developing Java programs that parse the HTML source of web pages using various HTML parsers like Jericho, NekoHtml etc... Now I want to develop parsers in PHP. So before starting, I want to know whether there are any HTML parsers available that I can use with PHP to parse HTML code ...

PHP DOMDocument, finding specific tags

I'm looking to find a specific attribute of a specific tag in an HTML document using PHP DOMDocument. Specifically, there is a div with a unique class set, and only a single span inside of it. I need to retrieve the style attribute of that span tag. Example: <div class="uniqueClass"><span style="text-align: center;" /></div> For th...