html-parsing

How to extract keywords from an HTML page in C#

Basically, I want to extract the keywords (words or tokens) that are present in a web page after removing the stop words. Please help if anybody knows how to do it; code in C# would be appreciated. ...

Ruby HTML scraper written in Hpricot having trouble with escaped HTML

I am trying to scrape this page: http://www.udel.edu/dining/menus/russell.html. I have written a scraper in Ruby using the Hpricot library. Problem: the HTML page is escaped and I need to display it unescaped. Example: "M&amp;M" should be "M&M". Example: "Entr&eacute;e" should be "Vegetarian Entrée". I have tried using the CGI library...
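
As an illustration of the unescaping step only, here is a minimal sketch in Python's standard library rather than Ruby/Hpricot (a swapped-in language, not the asker's setup). The sample strings are assumptions modelled on the examples in the question, not text taken from the scraped page.

```python
import html

# Hypothetical escaped strings, modelled on the examples in the question.
escaped = ["M&amp;M", "Vegetarian Entr&eacute;e"]

# html.unescape converts named and numeric character references to text.
unescaped = [html.unescape(s) for s in escaped]

for before, after in zip(escaped, unescaped):
    print(f"{before!r} -> {after!r}")
```

The same idea applies to any text node pulled out of a scraped document: decode character references once, after extraction, so that already-decoded text is not processed twice.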

The best way to parse HTML tags in JavaScript

Can anybody help or advise: is there any way to parse the HTML tags that appear inside the <body>...</body> tags? ...

Parse images from a web page in ColdFusion

I need to get the images from a web page's source. I can use cfhttp with method GET and htmleditformat() to read the HTML from that page; now I need to loop through the content to get all the image URLs (src). Can I use rematch() or refind() etc., and if yes, how? Please help! If I'm not clear I can try to clarify. ...
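
One possible approach, sketched here in Python's standard library rather than ColdFusion (a swapped-in language; it uses a parser instead of the regex functions the question asks about): feed the fetched HTML to an event-based parser and record the src of each <img> tag. The class name, function name, and sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attrs is a list of
        # (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(page_html):
    """Return the src values of all <img> tags, in document order."""
    collector = ImageSrcCollector()
    collector.feed(page_html)
    collector.close()
    return collector.sources


# Hypothetical sample markup with mixed case and quoting styles.
sample = '<p><img src="a.png"><IMG alt="x" SRC=\'http://example.com/b.gif\'></p>'
print(image_sources(sample))
```

A parser tolerates attribute order, quoting style, and letter case, which are the usual places where a pattern-based match on image tags goes wrong.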

How can I use "PHP Simple HTML DOM Parser" to get the contents of an <h1></h1> tag?

I'm new to PHP =) Right now I am using PHP includes for my site template. I have my header, containing all my <head></head> info. What I want to do is write code that will take the contents of the <h1></h1> tag from the page and echo it into the <title></title> tag in my header.php include. I got the PHP Simple HTML DOM Parser from h...

PHP regular expression to add quotation marks to attributes

Hey guys, I need to run a string of HTML through a regex function that checks whether the attribute values are enclosed in quotes and, if they aren't, adds the quotes. For example, I want <img src=http://www.domain.com/image.gif border=0> to turn into <img src='http://www.domain.com/image.gif' border='0'>. Can anyone help me? ...

Any suggestions for a way to parse headers and links from blog pages using C#?

I'm currently self-studying C# in my free time and thought of a "little" project to get me going (and one that I or others will actually find useful). It ended up being more complicated than I thought. Or maybe I'm just thinking it is? Anyway, this project would parse the homepages of the blogs (most of them are Wordpress blogs) I frequ...

Nokogiri vs Hpricot?

Which one would you choose? My important attributes are (not in order): support & future enhancements; community & general knowledge base (on the Internet); comprehensiveness (i.e. proven to parse a wide range of *.*ml pages); performance; memory footprint (runtime, not the code base) ...

Remove anchor from URL in C#

I'm trying to pull in an src value from an XML document, and in one that I'm testing it with, the src is: <content src="content/Orwell - 1984 - 0451524934_split_2.html#calibre_chapter_2"/> That creates a problem when trying to open the file. I'm not sure what that #(stuff) suffix is called, so I had no luck searching for an answer. I'd...
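
The part after the # is the URL's fragment identifier. As a minimal sketch of stripping it, shown in Python's standard library rather than C# (a swapped-in language, not the asker's environment), `urllib.parse.urldefrag` splits a URL into the part before the fragment and the fragment itself. The src value is the one quoted in the question; the variable names are illustrative.

```python
from urllib.parse import urldefrag

# The src value quoted in the question.
src = "content/Orwell - 1984 - 0451524934_split_2.html#calibre_chapter_2"

# urldefrag returns a (url, fragment) pair; the "#" itself is dropped.
path, fragment = urldefrag(src)

print(path)
print(fragment)
```

If the URL has no fragment, the first element is the input unchanged and the second is an empty string, so the call is safe to apply to every src value.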

urllib alternative for iPhone

Hi, I am trying to create an iPhone application which at some point connects to the internet, fills in an online form, fetches the resulting web page, parses it and returns a string to the user. I want this whole process to happen in the background. I know how to do this kind of thing with Python and urllib, but in Objective-C I can't find an altern...

UNIX Parse HTML Page Display Contents of a Tag - One Liner?

I have an HTML file and I am interested in the data enclosed by <pre> </pre> tags. Is there a one-liner that can achieve this? Sample file: <html> <title> Hello There! </title> <body> <pre> John Working Kathy Working Mary Working Kim N/A </pre> </body> </html> Output should be: John Kathy Mary Kim Much appreciat...
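
Not a one-liner, but as a sketch of the extraction itself in Python's standard library (a swapped-in tool, not the UNIX pipeline the question asks for): collect the text inside <pre> and keep the first whitespace-separated field of each line. It assumes each name/status pair sits on its own line in the real file, which the flattened sample above does not show; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collects the text that appears inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # how many <pre> elements are currently open
        self.chunks = []  # text fragments seen while inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def first_column_of_pre(page_html):
    """Return the first field of every non-blank line inside <pre>."""
    extractor = PreTextExtractor()
    extractor.feed(page_html)
    extractor.close()
    text = "".join(extractor.chunks)
    return [line.split()[0] for line in text.splitlines() if line.strip()]


# Assumed layout of the sample file: one name/status pair per line.
sample = """<html>
<title> Hello There! </title>
<body>
<pre>
John Working
Kathy Working
Mary Working
Kim N/A
</pre>
</body>
</html>"""

print("\n".join(first_column_of_pre(sample)))
```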

Fast, lightweight HTML parser for C++

I'm looking for a fast, lightweight open-source HTML parser -- something along the lines of a non-validating SAX parser (except, of course, for HTML). The answers to this question cover a parser that generates a DOM (don't want that), and these answers suggest conforming the HTML to XML before sending it to Xerces (can't do that in my c...

input URL, output contents of "view page source", i.e. after javascript / etc, library or command-line

I need a scalable, automated method of dumping the contents of "view page source", after manipulation, to a file. This non-interactive method would be (more or less) identical to an army of humans navigating my list of URLs and dumping "view page source" to a file. Programs such as wget or curl will non-interactively retrieve a set of ...

RegEx: h1 followed by h2 without p in between

Hey everyone, I need a regular expression to find out whether or not an h1 tag is followed by an h2 tag, without any paragraph elements in between. I tried to use a negative lookahead but it doesn't work: <h1(.+?)</h1>(\s|(?!<p))*<h2(.+?)</h2> ...

Tips for HTML parsing and web driving with Clojure?

I want to automate filling in data on a website using Clojure. For this I want to query elements of web pages and create HTTP requests. I have been looking at using HttpUnit and contrib.clojure.zip-filter.xml. So far neither approach feels right. Are there alternative libraries to aid with this task? Thanks ...

Is there any built-in support or native library in .NET for parsing HTML files?

Why is the HTML Agility Pack used to parse information from HTML files? Isn't there a built-in or native library in .NET for parsing information from HTML files? If there is, what is the problem with the built-in support? What are the benefits of using the HTML Agility Pack versus built-in support for parsing information from the html f...

Regular expressions in Java

String s= "(See <a href=\"/wiki/Grass_fed_beef\" title=\"Grass fed beef\" " + "class=\"mw-redirect\">grass fed beef.) They have been used for " + "<a href=\"/wiki/Paper\" title=\"Paper\">paper-making since " + "2400 BC or before."; In the string above I have inter-mixed html with text. Well the requiremen...

Weird CSS behavior... removing a 1px border makes <DIV> move about 20px

I have the following: CSS #pageBody { height: 500px; padding:0; margin:0; /*border: 1px solid #00ff00;*/ } #pageContent { height:460px; margin-left:35px; margin-right:35px; margin-top:30px; margin-bottom:30px; padding:0px 0 0 0; } HTML <div id="pageBody"> <div id="pageContent"> ...

PHP Simple_html_dom issue

The snippet below loops through some web pages, grabs the html and then looks for table.results and gets the plaintext out of the tags contained in each . $result is ok. Now I'm trying to get the href value of an <a> tag that is found in the second of each . I'd like to include this in the $results array, but I'm not sure how to do this....

PHP SAX parser for HTML?

Hi. I need an HTML SAX (not DOM!) parser for PHP that can process even invalid HTML code. The reason I need it is to filter user-entered HTML (remove all attributes and tags except the allowed ones) and truncate HTML content to a specified length. Any ideas? ...