html-parsing

Regular Expressions - Match records in HTML

I have to match a large number of records in HTML. I want each record matched with a regular expression (using .NET Regex Match). Each record is formatted like this (the full HTML consists of normal HTML and ~100 records like the following): <tr onclick="window.location.href='Vareauktion.asp?VISSER=Ja&funk=detaljedata&ID=14457'" sty...
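A minimal sketch of the record-matching idea, written in Python rather than .NET. It relies only on the part of the record that is visible in the excerpt (the `onclick` attribute with its `ID=` parameter). The helper name `record_ids`, the table cells, and the second record's ID are hypothetical, added only so the example runs.

```python
import re

# Captures the digits after "ID=" inside the onclick attribute of a <tr> tag.
# [^>]* keeps the match inside one tag; [^"]* keeps it inside one attribute value.
RECORD_ID = re.compile(r"""<tr\b[^>]*\bonclick="[^"]*[?&]ID=(\d+)'""")


def record_ids(html):
    """Return the ID parameter of every matching <tr> record, in document order."""
    return RECORD_ID.findall(html)


# Hypothetical sample: only the onclick value comes from the question excerpt.
sample = (
    "<table>"
    "<tr onclick=\"window.location.href='Vareauktion.asp"
    "?VISSER=Ja&funk=detaljedata&ID=14457'\"><td>first</td></tr>"
    "<tr onclick=\"window.location.href='Vareauktion.asp"
    "?VISSER=Ja&funk=detaljedata&ID=14458'\"><td>second</td></tr>"
    "<tr><td>not a record</td></tr>"
    "</table>"
)

print(record_ids(sample))
```

The pattern uses only character classes, `\b`, and `\d`, which most regex flavors share, but it has not been checked against the real page, whose full record layout is cut off in the excerpt.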

How can I make multiple calls to initWithContentsOfURL without it eventually returning the wrong stuff?

I'm doing multiple levels of parsing of web pages where I use information from one page to drill down and grab a "lower" page to parse. When I get to the lowest level of my hierarchy, I no longer hit a new page; I basically hit the same one (with different parameters) and make SQL database entries. If I don't slow things down (by puttin...

How can I load an HTML string into Webkit.net so I can access its "DOM"?

I'd like to use Webkit.net to load an (X)HTML string and then analyze the DOM in order to "compress" it (remove whitespace and newlines, convert <input></input> and <input /> to <input>; basically an XHTML to HTML conversion, doctype allowing). Is there any way to get the "DOM tree" in webkit.net? If not, are there any .net HTML parsers ...

HTML Regex Composition

I am trying to capture img tags in HTML using Regex... So these must be captured: <img/> < img id = "f" /> I have used: "<\s*img(\s.*?)?/>" But this goes wrong: < img id = "/>" /> Any idea how to properly capture the img tag? Thanks ...
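One possible approach to the failing case, sketched in Python: consume quoted attribute values as whole units so that a `/>` inside quotes cannot end the match early. The names `IMG_TAG` and `find_img_tags` are hypothetical, and the pattern accepts both `<img ...>` and `<img ... />`, which is slightly looser than the original expression.

```python
import re

# Quoted attribute values are matched as whole units, so a "/>" or ">"
# inside quotes cannot terminate the tag.
IMG_TAG = re.compile(
    r"""<\s*img\b            # "<", optional whitespace, the tag name
        (?:
            "[^"]*"          # a double-quoted attribute value
          | '[^']*'          # a single-quoted attribute value
          | [^>"']           # any other character except ">" and quotes
        )*
        >                    # the ">" that closes the tag
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_img_tags(html):
    """Return every img tag found in html, exactly as written in the source."""
    return [m.group(0) for m in IMG_TAG.finditer(html)]


# The three cases from the question, embedded in a little surrounding markup.
sample = '<p><img/> < img id = "f" /> < img id = "/>" /> <b>text</b></p>'
for tag in find_img_tags(sample):
    print(tag)
```

This is still a regex, not a parser: a tag with an unbalanced quote will not match at all.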

Parsing an HTML frame

Hi all, what is a good way to parse an HTML page using QWebKit (in C++)? I need to view a frame trimmed to <div id="chat_wrapper"> *any_html_data* </div>. ...

Regular expression to match a certain HTML element

I'm trying to write a regular expression for matching the following HTML. <span class="hidden_text">Some text here.</span> I'm struggling to write out the condition to match it and have tried the following, but in some cases it selects everything after the span as well. $condition = "/<span class=\"hidden_text\">(.*)<\/span>/"; If ...
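The symptom described (everything after the span is selected too) is what a greedy `.*` produces when more than one `</span>` follows. A minimal Python sketch of the difference between `(.*)` and the lazy `(.*?)` is below. The sample markup and the helper name `hidden_texts` are hypothetical. PCRE, which PHP's preg functions use, supports the same `.*?` syntax.

```python
import re

# Hypothetical markup with a second </span> after the one we care about.
html = (
    '<p><span class="hidden_text">Some text here.</span> visible '
    '<span>other</span></p>'
)

# Greedy: .* runs to the LAST </span> in the input.
greedy = re.search(r'<span class="hidden_text">(.*)</span>', html)

# Lazy: .*? stops at the FIRST </span> after the opening tag.
lazy = re.search(r'<span class="hidden_text">(.*?)</span>', html)


def hidden_texts(source):
    """Return the contents of every hidden_text span (non-nested spans only)."""
    return re.findall(
        r'<span class="hidden_text">(.*?)</span>', source, flags=re.DOTALL
    )


print(greedy.group(1))
print(lazy.group(1))
print(hidden_texts(html))
```

The lazy version still breaks if a hidden_text span contains another span, which is where a real HTML parser becomes the safer choice.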

HTML::TreeBuilder - Parse between parents

Folks, there is so much info out there on HTML::TreeBuilder that I'm surprised I can't find the answer; hopefully I'm not just missing it. What I'm trying to do is simply parse between parent nodes, so given an HTML doc like this <html> <body> <a id="111" name="111"></a> <p>something</p> <p>something</p> <p>something</p> ...

HTML Parsing fails with accented letters (eg: é)

I'm using this library: http://benreeves.co.uk/objective-c-hmtl-parser/ to parse HTML for a little iPhone app I'm making. I have got the code working so far, but it fails when presented with an accented letter (so far I have only seen this with é). This is the code I'm using: NSError * error = nil; HTMLParser * parser = [[HTMLParser alloc] initWithContent...

Parsing HTML with Hpricot & Ruby - getting the innermost html?

I'm looking to parse some old HTML that has plenty of extraneous tags that could be done with CSS now - <b>, <font>, etc. I'm using Hpricot to parse it, but I want to get the innermost "inner_html" - how does one do that with Hpricot? For example, let's say I use Hpricot to grab all the <table> elements which I loop through to get the r...

PHP Simple HTML DOM Parser: Extract entire DOM tree

How can I use the SimpleHTMLDOM Parser to get the entire DOM tree snapshot? Any pointers would help. ...

How can I modify HTML files in Perl?

I have a bunch of HTML files, and what I want to do is to look in each HTML file for the keyword 'From Argumbay' and replace it with some href that I have. I thought it was very simple at first, so what I did is I opened each HTML file and loaded its content into an array (list), then I looked for each keyword and replaced it with s///, a...

Parse iframe source in Android WebView

I have pages that users will be accessing that contain iframes. I would like to be able to parse out the source URL for sharing. ...
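The parsing half of this can be sketched independently of Android. The Python example below uses the standard library's `html.parser` to collect the `src` of every iframe in a page that has already been fetched as a string. How the markup is obtained from the WebView is not covered. The class name, the helper name, and the example URL are hypothetical.

```python
from html.parser import HTMLParser


class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every <iframe> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_sources(html):
    """Return the src URLs of all iframes in html, in document order."""
    collector = IframeSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


# Hypothetical page: one iframe with a src, one without.
page = (
    "<html><body>"
    '<iframe src="https://example.com/embed/1"></iframe>'
    "<iframe></iframe>"
    "</body></html>"
)
print(iframe_sources(page))
```

Relative `src` values would still need resolving against the page URL before sharing, for example with `urllib.parse.urljoin`.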

Needed C++ HTML parser + regular expression support

I'm working on a C++ project and I need to find an external library which provides an HTML parser and regular expression support. The project targets two OSes: iOS and Android. I was thinking of using libxml2, which has an HTML parser module and XML regular expressions. Can I use the XML regular expression module on an HTML page? In addition, I need...

How to parse and modify an HTML file in Java

Hello everyone, I am doing a project wherein I need to read an HTML file, identify specific tags, modify the contents of those tags, and create a new HTML file. Is there a library that parses HTML tags and is capable of writing the tags back to a new file? Cheers! Chaitannya ...

IE error with parsing HTML

I get this error: HTML Parsing Error: Unable to modify the parent container element before the child element is closed (KB927917) when I try to run my project in Visual Studio 2010, but ONLY when it runs in a virtual machine! Otherwise the same source code doesn't yield the error. NOTE: IE 8 Advanced settings are the same for both configurations! help...

Java HTML parser doesn't read the whole page

Hi everybody, I'm parsing HTML pages to get specific information, but for some pages I can't get all the information displayed on the web page; for example, in this page I can't get the reviews information. By the way, if you look at the source code of the page there are a great many empty lines, and the reviews information doesn't appear...

Parsing PHP with FluentPHP, are there hidden pitfalls?

Hi all, after reading some posts on parsing HTML with PHP (see http://stackoverflow.com/questions/3650125/how-to-parse-html-with-php-closed), I decided to stay with the FluentPHP library since it is still alive, and Simple HTML DOM Parser was abandoned in 2008 (no activity at SourceForge). What are the known pitfalls here that may kill the ...

libxml2 HTMLparser module guideline

I'm trying to parse an HTML file saved in memory. I'm fetching the HTML with libcurl and saving it in memory as a string. I'm having problems parsing this HTML with the HTMLparser module. I'm looking for a short guideline on how to parse and walk this parsed HTML using the libxml2 HTMLparser module with C++. Thanks. EDIT: I'm getting this e...

C# parse HTML using XPathDocument

Hi all! I'm trying to parse an HTML page with XPathDocument, but it gives an error because the HTML is not XML... Is there a way to do this or not? ...

Python - HTML Parsing with Tidy

This code takes a bit of bad html, uses the Tidy library to clean it up and then passes it to an HtmlLib.Reader(). import tidy options = dict(output_xhtml=1, add_xml_decl=1, indent=1, tidy_mark=0) from xml.dom.ext.reader import HtmlLib reader = HtmlLib.Reader() doc = reader.fromString...