What is the most convenient way to remove all the HTML tags when using the SAS URL access method to read web pages?

A: 

I think the approach is not to strip the HTML from the page, but to identify the standard patterns around the data you are trying to capture. This is the Perl-style regular expression approach.

An example might be some data or a table that always appears a fixed number of characters after the logo image. You could write a script that keeps only that data.

If you want to post some of the HTML, maybe we can help decode it.
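The pattern-matching idea above can be sketched in SAS itself. This is only a rough illustration, not a general solution: the `<title>` pattern is a stand-in for whatever marker surrounds the data you actually want, the dataset and variable names are made up for the example, and it assumes the opening and closing tags sit on the same input line.

```sas
/* URL taken from the thread; any page would do */
filename page url "http://www.zug.com/";

data titles;
    /* Read the page one line at a time into the _INFILE_ buffer */
    infile page lrecl=32767 truncover;
    input;

    length title $200;
    retain re;

    /* Compile the pattern once; the capture group holds the wanted text */
    if _n_ = 1 then re = prxparse('/<title>(.*?)<\/title>/i');

    /* Keep only lines where the pattern is found */
    if prxmatch(re, _infile_) then do;
        title = prxposn(re, 1, _infile_);
        output;
    end;

    keep title;
run;
```

The point of the design is that you keep the text the pattern captures and ignore everything else, rather than trying to clean the whole page.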

AFHood
I am looking for a purely SAS System solution. I know SAS supports regular expressions, and I just want the code so I don't have to write it myself, because I don't like reinventing wheels. The HTML gobbledygook could be anything that HTML allows. I want to read many different kinds of web pages and extract just the content, not the HTML gobbledygook.
Joe Whitehurst
+4  A: 

This should do what you want. It removes everything between the <> including the <> themselves, leaving just the text content (aka innerText).

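filename INDEXIN URL "http://www.zug.com/";

data HTMLData;
    /* Read the page one line at a time */
    infile INDEXIN lrecl=32767 truncover;
    input;

    length textline $32767;
    textline = _INFILE_;

    /*-- Clear out the HTML text --*/
    /* Compile the pattern once; tags split across lines are not matched */
    retain re1;
    if _n_ = 1 then re1 = prxparse("s/<(.|\n)*?>//");
    call prxchange(re1, -1, textline);

    drop re1;
run;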
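
If you want to try the tag-stripping substitution without fetching a live page, a small self-contained check along these lines may help. It is a sketch only: the sample markup and the dataset name `demo` are invented for illustration, and it uses `<[^>]*>`, which should behave the same as the pattern above when each tag sits on a single line.

```sas
data demo;
    length textline $200;
    infile datalines truncover;
    input textline $char200.;

    /* Replace every <...> tag on the line with nothing (-1 = all matches) */
    textline = prxchange('s/<[^>]*>//', -1, textline);

    /* Write the cleaned line to the log */
    put textline=;
    datalines;
<p>Hello <b>world</b></p>
;
run;
```

For this sample line the cleaned text should come out as `Hello world`. Character entities such as `&amp;` are left untouched, so real pages may need a further pass.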
Jay Stevens
Thank you very much Warpraptor!! I really like your elegantly simple solution totally within the confines of a professional programming environment--no need for any amateurish tools like Perl. With the HTML gobbledygook removed we are left with beauties like: Fanaticism consists in redoubling your effort when you have forgotten your aim. There is no cure for birth and death save to enjoy the interval. A man is morally free when, in full possession of his living humanity, he judges the world, and judges other men, with uncompromising sincerity.
Joe Whitehurst
Joe, please refrain from smoking that.
alamar