html-parsing

Non-destructive parsing and modifying of HTML elements in C++

I need to make some simple modifications to HTML in C++, preferably without completely rewriting the HTML, as happens when I use libxml2 or MSHTML. In particular, I need to be able to read, and then (potentially) modify, the "src" attribute of all "img" elements. I need it to be robust enough to be able to do this with any...
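A minimal sketch of the "touch only the attribute value" idea, written in Python rather than C++ purely for illustration. The function name `rewrite_img_src` and the callback `fn` are hypothetical. It assumes quoted src values and no ">" inside earlier attribute values, so it is not a substitute for a real tokenizer on arbitrary markup.

```python
import re

# Group 1: everything from "<img" up to and including "src=".
# Group 2: the quote character. Group 3: the attribute value.
# \s before "src" keeps attributes such as data-src from matching.
IMG_SRC = re.compile(
    r'''(<img\b[^>]*?\ssrc\s*=\s*)(["'])(.*?)\2''',
    re.IGNORECASE | re.DOTALL,
)

def rewrite_img_src(html, fn):
    """Return html with each quoted img src value replaced by fn(value).

    Every character outside the matched value is copied through unchanged,
    so case, spacing, and attribute order in the source are preserved.
    """
    def repl(m):
        return m.group(1) + m.group(2) + fn(m.group(3)) + m.group(2)
    return IMG_SRC.sub(repl, html)
```

Calling `rewrite_img_src(page, lambda s: "/cdn/" + s)` would prefix each matched value and leave the rest of the document byte-for-byte as it was.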

.NET Html Parser

This must be the 20th duplicate or so; here is one: Looking for C# HTML parser. I'm looking for an open source, fast, W3C-equivalent HTML/XHTML parser for C# without native DLLs. Thanks. ...

Regex to Match HTML Style Properties.

In need of a regex master here! <img src="\img.gif" style="float:left; border:0" /> <img src="\img.gif" style="border:0; float:right" /> Given the above HTML, I need a regex pattern that will match "float:right" or "float:left" but only on an img tag. Thanks in advance! ...
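One way to approach this, sketched in Python, is to match in two steps instead of one large pattern: find each img tag first, then look inside its style attribute. The function name `img_floats` is hypothetical, and the patterns assume double-quoted style attributes as in the example markup.

```python
import re

IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)
STYLE_ATTR = re.compile(r'\bstyle\s*=\s*"([^"]*)"', re.IGNORECASE)
FLOAT_DECL = re.compile(r'\bfloat\s*:\s*(left|right)\b', re.IGNORECASE)

def img_floats(html):
    """Return the float value ('left' or 'right') of each img tag that has one."""
    results = []
    for tag in IMG_TAG.finditer(html):
        # Only the text of this one img tag is searched for a style attribute.
        style = STYLE_ATTR.search(tag.group(0))
        if style:
            decl = FLOAT_DECL.search(style.group(1))
            if decl:
                results.append(decl.group(1).lower())
    return results
```

Because the float pattern only ever sees the style value of an img tag, the same declaration on a div or span is ignored.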

PHP regular expression to remove tags in HTML document

Say I have the following text: ..(content)............. <A HREF="http://foo.com/content" >blah blah blah </A> ...(continue content)... I want to delete the link and I want to delete the tag (while keeping the text in between). How do I do this with a regular expression (since the URLs will all be different)? Many thanks. ...
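A minimal sketch of the substitution in Python rather than PHP. The function name `strip_links` is hypothetical, and the pattern assumes anchors are not nested and that no attribute value contains ">".

```python
import re

# Match an opening <a ...> tag, capture the text up to the nearest </a>,
# and tolerate upper-case tags and whitespace before the closing ">".
ANCHOR = re.compile(r'<a\b[^>]*>(.*?)</a\s*>', re.IGNORECASE | re.DOTALL)

def strip_links(html):
    """Remove anchor tags but keep the text between them."""
    return ANCHOR.sub(r'\1', html)
```

The href value never appears in the pattern, so it does not matter that every URL is different.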

Converting web page into UITableView

Hi! I have a UITableView and I want to populate it with data from this page: http://tvgids.mobi/gids/ned1.php I have this code: NSURL *urlll = [NSURL URLWithString:[NSString stringWithFormat:url]]; NSString *test = [NSString stringWithContentsOfURL:urlll]; UIAlertView *av = [[UIAlertView alloc] initWithTitle:@"LOL" message:t...

Loading a webpage for parsing in Rails

Assume I want to fetch a page from the web into my application and parse it in some way. How do I do that? Where should I start? Are any plugins/gems required? What is your usual practice for this type of task? ...

What regex would match a nested table with identifiable text in the table cell?

What regex would match a nested table with identifiable text in the table cell? I've tried but failed to come up with a regular expression to extract the specific table I want without grabbing the beginning and end of both tables in the example. Here is something to get started: "<table>.*?</table>" <table> <tr> <td> <ta...

Why can I only get the HTML for the homepage of websites and not others?

I am writing a Java program that connects to a website and returns the HTML, but for some reason I am having problems with it. Right now I am only able to access the website if I do //example String host = "www.google.com" but if I want to access a URL that is any more complicated than that, I get an UnknownHostException. At first I tho...
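The excerpt is cut off before it gives the cause, so the following is only a guess: an UnknownHostException is typical when a full URL (host plus path) is handed to code that expects a bare host name. Under that assumption, the split looks like this in Python; `split_host_and_path` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

def split_host_and_path(url):
    """Split a URL into (host, path-with-query).

    Only the host part should go to DNS lookup or a socket connect;
    the path belongs in the HTTP request line.
    """
    # urlsplit only recognises the host when the string has a "//" prefix,
    # so add one for inputs like "www.example.com/a/b".
    parts = urlsplit(url if "://" in url else "//" + url, scheme="http")
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.hostname, path
```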

Trimming whitespace from HTML content?

I have a CRUD maintenance screen with a custom rich text editor control (FCKEditor actually) and the program extracts the formatted text as HTML from the control for saving to the database. However, part of our standards is that leading and trailing whitespace needs to be stripped from the content before saving, so I have to remove extra...

Problem in parsing

I have a page, say abc.html, that has a small form with some fields. <form name="form" method="post" action="abc.html">.......................</form> When we submit the form, it comes back to abc.html with some data posted and shows the resulting names on the page after processing the posted data. in the whole pro...

Select specific child elements with BeautifulSoup

I'm reading up on BeautifulSoup to screen-scrape some pretty heavy HTML pages. Going through the documentation of BeautifulSoup, I can't seem to find an easy way to select child elements. Given the HTML: <div id="top"> <div>Content</div> <div> <div>Content I Want</div> </div> </div> I want an easy way to get the "Content I ...

Why is Swing Parser's handleText not handling nested tags?

I need to transform some HTML text that has nested tags to decorate 'matches' with a CSS attribute to highlight them (like Firefox search). I can't just do a simple replace (think if the user searched for "img", for example), so I'm trying to do the replace only within the body text (not on tag attributes). I have a pretty straightforward HTML...

Create array from the contents of <div> tags in php

I have the contents of a web page assigned to a variable $html. Here's an example of the contents of $html: <div class="content">something here</div> <span>something random thrown in <strong>here</strong></span> <div class="content">more stuff</div> How, using PHP, can I create an array from that that finds the contents of <div class="...
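As an illustration of the parser-based route (in Python rather than PHP, using only the standard library), the sketch below collects the text inside each div whose class list includes "content". The names `ContentDivs` and `content_divs` are hypothetical, and it gathers text only, not inner markup.

```python
from html.parser import HTMLParser

class ContentDivs(HTMLParser):
    """Collect the text of every <div class="content"> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0   # div nesting depth inside the current match
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Already inside a match: track nested divs so the right
            # closing tag ends the capture.
            if tag == "div":
                self._depth += 1
        elif tag == "div" and "content" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if self._depth and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                self.items.append("".join(self._buf))

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)

def content_divs(html):
    parser = ContentDivs()
    parser.feed(html)
    parser.close()
    return parser.items
```

The depth counter is what a single regex struggles with: a div nested inside a matching div does not end the capture early.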

Python RegEx skipping the first few characters?

Hey, I have a fairly basic question about regular expressions. I want to return just the text inside (and including) the body tags, and I know the following isn't right because it'll also match all the characters before the opening body tag. I was wondering how you would go about skipping those? x = re.match('(.*<body).*?(</body>)', file...
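A sketch of one common fix: `re.match` only matches at the start of the string, which is why a leading `.*` ends up in the pattern, whereas `re.search` scans forward to the first place the pattern fits. The function name `extract_body` is hypothetical, and the pattern assumes a single body element.

```python
import re

# DOTALL lets "." cross newlines; IGNORECASE accepts <BODY> as well.
BODY = re.compile(r'<body\b.*?</body\s*>', re.IGNORECASE | re.DOTALL)

def extract_body(html):
    """Return the text from <body ...> through </body>, or None if absent."""
    m = BODY.search(html)   # search, not match: skips everything before <body
    return m.group(0) if m else None
```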

Regex PHP, pattern matching

Hi, I would like to have a regex in PHP which matches a word in a string, but not if the word is already a link. The problem is that I replace words with links, for example: "text" => < a href = "mylink">text< /a>. But sometimes the word gets replaced twice, and I want to avoid that. My pattern now is /text/i. Eg. This is my...

Parsing vCards on web pages into a MySQL DB

I have a client who is using a separate vCard on a separate page. These are being pasted into a wordpress text field. (Not the most efficient way to maintain a list of people, but I won't editorialize after the fact.) My mission is to write something to parse through all the addresses in the vCards and to dump the information into a c...

Regex to move image in markup with PHP when publishing content with TinyMCE

Hi guys, I am using TinyMCE to publish content to my site. My problem is that I can only insert an image inside another element, e.g. a paragraph, even if I place the cursor at the end of the content. So, when I publish the content, I currently end up with markup like: <p>Text content <img src="blah" /></p><p>Another paragraph</...

Regex PHP, Match all links with specific text.

Hi, I am looking for a regular expression in PHP which would match an anchor with specific text in it. E.g., I would like to get anchors with the text mylink, like: <a href="blabla" ... >mylink</a> So it should match all anchors, but only if they contain the specific text. It should match these strings: <a href="blabla" ... >mylink</a> <a ...

How do you escape regex strings in Freemarker

I am using the matches string builtin and need to run a regex pattern (Views:).*?(span>)(.*?)(<\/div) However, Freemarker freaks out because of the ">" character which is a special character in Freemarker. Any ideas how to get round this? ...

Easiest way to fetch all href contents on page in Ruby?

I'm writing a simple web crawler in Ruby and I need to fetch all href contents on the page. What is the best way to do this, or any other web page source parsing, given that some pages might not be valid but I still want to be able to parse them? Are there any good Ruby HTML parsers that allow validity-agnostic parsing, or is the best way j...
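To illustrate what validity-agnostic parsing looks like, here is a sketch in Python (not Ruby) using the standard library's `html.parser`, which keeps going over unclosed tags and unquoted attributes. The names `HrefCollector` and `extract_hrefs` are hypothetical.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Gather the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # value is None for a bare attribute such as <a href>.
                if name == "href" and value is not None:
                    self.hrefs.append(value)

def extract_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs
```

A crawler would typically resolve each result against the page URL before queueing it, since many of the values will be relative.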