html-parsing

How to find all JavaScript on a page with PHP, given a URL?

In PHP, how would I grab all JavaScript from a page given its URL? Is there a good regular expression to get the src of all JavaScript script tags or the script inside of them? ...

Accessing HTML generated by JavaScript with HtmlUnit (Java)

I am trying to test a website that uses JavaScript to render most of its HTML. With the HtmlUnit browser, how can I access the HTML generated by the JavaScript? I was looking through the documentation but wasn't sure what the best approach might be. WebClient webClient = new WebClient(); HtmlPage currentPage ...

Help with Java Swing HTML parsing

I am parsing a collection of HTML documents with the Java Swing HTML parsing libraries, and I am trying to isolate the text between <title> tags so that I can use it to identify the documents. I am having a hard time accomplishing that, since the handleStartTag method doesn't have access to the text inside of the tags ...

libxml2 on iPhone

I'm trying to parse an HTML file with libxml2. Usually this works fine, but not in this case: <p> <b>Titles</b> (Some Text) <table> <tr> <td valign="top"> …Something1... </td> <td align="right" valign="top"> …Something2... </td> </tr...

Get the rendered text from HTML (Delphi)

I have some HTML and I need to extract the actual written text from the page. So far I have tried using a web browser and rendering the page, then going to the document property and grabbing the text. This works, but only where the browser is supported (IE COM object). The problem is I want this to be able to run under Wine also, so...

Problem with Eastern European characters when scraping data from the European Parliament website

Dear Experts, EDIT: thanks a lot for all the answers and points raised. As a novice I am a bit overwhelmed, but it is a great motivation to continue learning Python! I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians, however due ...

Simple libxml2 HTML parsing example, using Objective-c, Xcode, and HTMLparser.h

Please can somebody show me a simple example of parsing some HTML using libxml. #import <libxml2/libxml/HTMLparser.h> NSString *html = @"<ul><li><input type=\"image\" name=\"input1\" value=\"string1value\" /></li><li><input type=\"image\" name=\"input2\" value=\"string2value\" /></li></ul><span class=\"spantext\"><b>Hello World 1</b></...

Is it kosher for me to use HTMLAgilityPack in my free open source C# library?

I'm going to make a movie site scraping library that's free and open source. I want to use HTMLAgilityPack to easily parse web information from HTML source code, but I'm not sure if I legally can? Can I use this library in this way? Thank you. ...

How could I parse this HTML file?

<div id="main"> <style type="text/css"> </style> <script language="JavaScript"> </script> <p style="margin: 0pt 0pt 0.5em;"><b>Media from&nbsp;<a onclick="(new Image()).src='/rg/find-media-title/media_strip/images/b.gif?link=/title/tt0087538/';" href="/title/tt0087538/">The Karate Kid</a> (1984)</b></p> <style type="text/css"> ...

Parse text from an HTML page

I know it is possible to get information (text) from another page. For example, on the page at http://www.page.com/ is a div named news. How can I get the text from this div? ...

SimpleTest assertTags - loose matching? (for CakePHP)

I'd like to use SimpleTest to set up some functionality tests for our project - in particular, we have a very busy page which has some random components and some static components, and I'd like to be able to write a simple test which only confirms the static bits (preferably only the one or two most important ones). In other words, I wa...

Problem parsing children of a node with HtmlAgilityPack

Hi All, I'm having a problem parsing the input tag children of a form in html. I can parse them from the root using //input[@type] but not as children of a specific node. Here's some code that illustrates the problem: private const string HTML_CONTENT = "<html>" + "<head>" + "<title>Test Page</title>" + ...

Replace Links Location (href='...')

Hello, I would like to replace the link location (of anchor tag) of a page as follows. Sample Input: text text text <a href='http://test1.com/'> click </a> text text other text <a class='links' href="gallery.html" title='Look at the gallery'> Gallery</a> more text Sample Output text text text <a href='http://example.com/p.php?q=...

Converting anchor tag with relative URL to absolute URL in HTML content using Java

The situation: On server A we want to display content from server B in line on server A. The problem: Some of the hyperlinks in the content on server B are relative to server B which makes them invalid when displayed on server A. Given a block of HTML code that contains anchor tags like the following <a href="/something/somwhere.h...

HTML Agility Pack strip tags NOT IN whitelist

I'm trying to create a function which removes HTML tags and attributes which are not in a whitelist. I have the following HTML: <b>first text </b> <b>second text here <a>some text here</a> <a>some text here</a> </b> <a>some twxt here</a> I am using HTML Agility Pack and the code I have so far is: static List<string> Whit...

Xquery parsing text with <a> tags

I am using XQuery to extract content from html pages. The html body structure is of this kind: <td> <a href ="hw1">xyz </a> Hello world 1 <a href="hw2">Helloworld 2</a> Helloworld 3 </td> My XQuery expression for extracting the text is as follows: //a[starts-with(@href,'hw1')]/following...

Parse XHTML in the iPhone SDK?

Hello all, I want to parse an XHTML file and display it in a UITableView. What is the best way to parse an XHTML file so that I can display it as it is shown in a browser? Here is a sample XHTML source: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"...

Get contents of <a> tags using Python

Assuming I have html read into my program like this: <p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T &amp; P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p> <p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - ...
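
One possible approach to the question above, offered as a minimal sketch rather than a definitive answer: Python's standard-library `html.parser` can collect the `href` and the text of each `<a>` element. The names `AnchorCollector` and `extract_links` are hypothetical helpers invented for this illustration, and the sketch assumes anchors are not nested.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> element (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain text.
        super().__init__(convert_charrefs=True)
        self.links = []          # finished (href, text) pairs
        self._href = None        # href of the <a> currently open, if any
        self._in_anchor = False  # True while we are between <a> and </a>
        self._chunks = []        # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._in_anchor:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            text = "".join(self._chunks).strip()
            self.links.append((self._href, text))
            self._in_anchor = False


def extract_links(html):
    """Return a list of (href, text) tuples for the anchors in `html`."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `extract_links(page_source)` on markup like the sample in the question would yield one tuple per anchor, leaving the trailing `<font>` text out because it sits outside the `<a>` element.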

detect and parse embedded video in html?

I am working on a project which requires me to detect and extract the embed code of videos on a web page. I know the tag is used to embed videos, however, the specification says that it can also be used for other things like images. So how do I deterministically know that an tag contains a video within? Or is there some other way to...

How does a parser (for example, HTML) work?

For argument's sake, let's assume an HTML parser. I've read that it tokenizes everything first, and then parses it. What does tokenize mean? Does the parser read each character, building up a multidimensional array to store the structure? For example, does it read a < and then begin to capture the element, and then once it meets ...
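
To make the tokenize-then-parse idea in this question concrete, here is a deliberately simplified sketch. It is not how real HTML parsers are implemented (they follow a far more detailed state machine and handle attributes, comments, and malformed markup); the regular expression and the function names `tokenize` and `parse` are assumptions made up for this illustration. The first step turns the character stream into a flat list of tokens, and the second uses a stack to turn that list into a tree.

```python
import re

# One token is either a tag (<name ...> or </name>) or a run of text.
TOKEN_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")


def tokenize(html):
    """Split markup into ("start", name), ("end", name) and ("text", value) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(html):
        closing, name, text = match.groups()
        if text is not None:
            if text.strip():  # skip whitespace-only text runs
                tokens.append(("text", text.strip()))
        elif closing:
            tokens.append(("end", name.lower()))
        else:
            tokens.append(("start", name.lower()))
    return tokens


def parse(tokens):
    """Build a nested dict tree from the token list using a stack of open elements."""
    root = {"tag": "#root", "children": []}
    stack = [root]
    for kind, value in tokens:
        if kind == "start":
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)  # this element is now the innermost open one
        elif kind == "end":
            # Only close the element if it matches the innermost open tag.
            if len(stack) > 1 and stack[-1]["tag"] == value:
                stack.pop()
        else:
            stack[-1]["children"].append(value)
    return root
```

For example, `parse(tokenize("<p>Hi <b>there</b></p>"))` produces a root node whose single child is the `p` element, which in turn holds the text `"Hi"` and a nested `b` element. The structure is a tree rather than a multidimensional array.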