html-parsing

J2ME: reading HTML differs between WTK and device

I have built a mobile application in J2ME that reads data from a website. In the WTK (Wireless Toolkit) everything works, but when I test the same app on my mobile (Nokia) device, it behaves differently: the HTML it gets back is different, containing a <hr/> tag instead of a <hr> tag. There is a possibility that the remote website ...

How do I remove some (or all) HTML elements and/or attributes using HTML Agility Pack?

Using the HTML Agility Pack, how can I remove all HTML attributes, elements, etc. from a blob of HTML, so that the result is as if I had pasted it into Notepad? Additionally, I need to remove all formatting but keep UL/LI and B tags. ...
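A minimal sketch of the whitelist idea described above, written in Python with the standard library's `html.parser` rather than the HTML Agility Pack the question asks about. The class and function names (`TagWhitelistStripper`, `strip_html`) are hypothetical, and the sketch assumes that dropping a tag should keep its inner text. Note that text inside `<script>` or `<style>` would also be kept as plain text here.

```python
from html import escape
from html.parser import HTMLParser


class TagWhitelistStripper(HTMLParser):
    """Drop every tag and attribute except a small whitelist of bare tags."""

    def __init__(self, keep=("ul", "li", "b")):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Whitelisted tags are re-emitted without any attributes.
        if tag in self.keep:
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives already unescaped, so escape it again for HTML output.
        self.parts.append(escape(data, quote=False))


def strip_html(source, keep=("ul", "li", "b")):
    """Return source with all markup removed except the whitelisted tags."""
    parser = TagWhitelistStripper(keep)
    parser.feed(source)
    parser.close()
    return "".join(parser.parts)
```

Passing an empty whitelist, as in `strip_html(source, keep=())`, would remove every tag and leave only the text, which is close to the "pasted into Notepad" result.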

Grabbing meta-tags and comments using HTML Agility Pack

I've looked for tutorials on using the HTML Agility Pack, as it seems to do everything I want, but for such a powerful tool there is surprisingly little written about it on the Internet. I am writing a simple method that will retrieve any given tag based on name: public string[] GetTagsByName(string TagName, string Source) { ... ...
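As an illustration of collecting meta tags and comments in one pass, here is a hedged Python sketch using the standard library's `html.parser`. It is not the HTML Agility Pack API, and the names `MetaAndCommentCollector` and `collect` are hypothetical.

```python
from html.parser import HTMLParser


class MetaAndCommentCollector(HTMLParser):
    """Collect the attributes of <meta> tags and the text of HTML comments."""

    def __init__(self):
        super().__init__()
        self.meta = []      # one dict of attributes per <meta> tag
        self.comments = []  # comment bodies, stripped of surrounding spaces

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed through this method.
        if tag == "meta":
            self.meta.append(dict(attrs))

    def handle_comment(self, data):
        self.comments.append(data.strip())


def collect(source):
    """Return (meta_attribute_dicts, comment_strings) found in source."""
    collector = MetaAndCommentCollector()
    collector.feed(source)
    collector.close()
    return collector.meta, collector.comments
```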

Loading an HTML page as XML

I use this to load an HTML page as XML: Dim xmlDoc As New XmlDocument() xmlDoc.Load(Server.MapPath("index.htm")) or Dim xmldoc As XDocument xmldoc = XDocument.Load(Server.MapPath("index.htm")) but I get errors like: Expecting an internal subset or the end of the DOCTYPE declaration. Line 2, position 14. '>' is an unexpected to...

Parsing Malformed HTML with PHP Dom

I've got a client who wants their videos (provided by a third party) displayed on their web site. The web site uses swfobject to display the video, so I thought that it would be easiest to grab that and slightly modify it so that it works on the client's web site. Using PHP DOMDocument seems the way to go, but unfortunately the HTML tha...

Create a dictionary or list from a string (HTML tags included) in C#

Hello, I have a string like this: string s = @" <tr> <td>11</td><td>12</td> </tr> <tr> <td>21</td><td>22</td> </tr> <tr> <td>31</td><td>32</td> </tr>"; How do I create Dictionary<int, int> d = new Dictionary<int, int>(); from string s to get the same result as: d.Add(11, 12); d.Add(21, 22); d.Add(31, ...
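The row-to-dictionary mapping described above can be sketched as follows. This is an illustration in Python using the standard library's `html.parser`, not the C# the question asks for. The names `RowCollector` and `rows_to_dict` are hypothetical, and the sketch assumes every row has at least two integer cells, with the first used as the key.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each <td>, grouped by enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data


def rows_to_dict(source):
    """Map the first cell of each row to the second, both parsed as int."""
    parser = RowCollector()
    parser.feed(source)
    parser.close()
    return {int(row[0]): int(row[1]) for row in parser.rows if len(row) >= 2}
```

Rows with fewer than two cells are skipped silently; a non-numeric cell would raise `ValueError`, which may or may not be the desired behaviour.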

How to parse an HTML file at a URL?

I am new to iPhone development. I am able to parse an XML file at a URL and retrieve its contents from particular nodes. For parsing at the URL: NSString * path = @"xxxxxxxxxxxxxxxxxxxxxx"; [self parseXMLFileAtURL:path]; For retrieving the data I use NSXMLParser. How can I achieve the same thing if I have an HTML file at my URL (Source ...

HTML to XML conversion or a good HTML Parser Suggestion written in .Net

Hi all, I need to parse HTML for a project and am looking for a good HTML parser or an API that converts HTML to XML. Waiting for suggestions... Thanks all... ...

If you're not supposed to use Regular Expressions to parse HTML, then how are HTML parsers written?

I see questions every day asking how to parse or extract something from some HTML string and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted). This is rather confusing to me: I always thought that, in general, the best way to parse any complicated string i...

How to parse the content of a tag with an attribute on iPhone?

I am new to iPhone development. I want to parse and retrieve particular content from the HTML file at the URL. I have sample code from this link: http://blog.objectgraph.com/index.php/2010/02/24/parsing-html-iphone-development/ NSData *htmlData = [[NSString stringWithContentsOfURL:[NSURL URLWithString: @"http://www.objectgraph.com/...

Best way to get back to using the power of lxml after having to use a regex to find something in an html document

I am trying to rip some text out of a large number of html documents (numbers in the hundreds of thousands). The documents are really forms but they are prepared by a very large group of different organizations so there is significant variation in how they create the document. For example, the documents are divided into chapters. I mi...

HTML Agility Pack

I want to parse an HTML table using the HTML Agility Pack and extract only some predefined columns from it. I am new to parsing and to the HTML Agility Pack, and although I have tried, I don't know how to use it for this. If anybody knows, please give me an example if possible. EDIT: Is it possible to parse htm...

HTML Agility Pack

I have html tables in one webpage like <table border=1> <tr><td>sno</td><td>sname</td></tr> <tr><td>111</td><td>abcde</td></tr> <tr><td>213</td><td>ejkll</td></tr> </table> <table border=1> <tr><td>adress</td><td>phoneno</td><td>note</td></tr> <tr><td>asdlkj</td><td>121510</td><td>none</td></tr> <tr><td>asdlkj<...

Check string for link

I have rather long entries being submitted to a database. How can I create a function to see if this entry has a link within it? Can someone get me started? Pretty much, I want the function to find any <a, <a href or any other related link instances within a string. I'd prefer not to throw the entry into an array. Are there an...
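One way to approach this check, sketched in Python with the standard library's `html.parser` (the question does not say which language the function should be in, so that choice is an assumption). The names `_LinkFinder` and `contains_link` are hypothetical, and "link" is taken to mean an `<a>` tag that carries an `href` attribute.

```python
from html.parser import HTMLParser


class _LinkFinder(HTMLParser):
    """Set found to True once an <a> tag with an href attribute is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names are lowercased by the parser.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.found = True


def contains_link(text):
    """Return True if text contains at least one <a href=...> tag."""
    finder = _LinkFinder()
    finder.feed(text)
    finder.close()
    return finder.found
```

This works directly on the string and does not split the entry into an array. Bare URLs written as plain text are not detected by this sketch.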

Which is the best HTML tidy pack? Is there any option in the HTML Agility Pack to tidy an HTML web page?

I am using the HTML Agility Pack to parse tabular HTML. Some of the HTML content has missing end tags, and on such pages the HTML Agility Pack does not parse the information properly. So I want to insert end tags wherever they are missing so that the HTML Agility Pack parses the information properly...

MODX: Snippet strips and hangs string when parsing the vars.

Hey all, I have a snippet call like this: [!mysnippet?&content=`[*content*]` !] What happens is that, if I send some HTML like this: [!mysnippet?&content=`<p color='red'>Yeah</p>` !] it will return this: <p colo The [test only] snippet code (mysnippet) is: <?php return $content; ?> Why is this happening? My actual snippet is c...

HTML Purifier: Removing an element conditionally based on its attributes

As per the HTML Purifier smoketest, 'malformed' URIs are occasionally discarded to leave behind an attribute-less anchor tag, e.g. <a href="javascript:document.location='http://www.google.com/'">XSS</a> becomes <a>XSS</a> ...as well as occasionally being stripped down to the protocol, e.g. <a href="http://1113982867/">XSS&...

Modifying the contents of a HTML webpage on the fly in PHP

I would like to load an HTML document and modify its text in PHP. For example, if I have a document like this: <html> <head><title>Test - Example.com</title></head> <body> <p><a href="http://www.example.com">Link number 1: Example.com</a></p> <p>Link number 2: Example.com - some random text</p> </body> </html> I would like to add a...

HTML Parser to extract text out of the body (in java)

Hi all, I am working on a project that requires me to carry out some text manipulation on the text that I obtain from web pages. The first step would be to find a parser that extracts the required body text while ignoring the redundant information. I am not sure how I would do this, since I am extreme...

need help working with the Jericho Html Parser

Hi all, I've simply used the following program from the URL below: http://jericho.htmlparser.net/samples/console/src/ExtractText.java My goal is to extract the main body text, summarize it, and present the summarized text as output to the user. My problem is that I'm not sure how I'd modify the above program to on...