I have the following function defined via PROC FCMP. The intent is straightforward: it returns the value of an attribute from a line of XHTML. Here's the code:

proc fcmp outlib=library.funcs.crawl;
    function getAttr(htmline $, Attribute $) $;

        /*-- Find the position of the match --*/
        Pos = index( htmline , strip( Attribute )||"=" );

        /*-- Now do something about it --*/
        if Pos > 0 then do;
            Value = scan( substr( htmline, Pos + length( Attribute ) + 2), 1, '"');
        end;
        else Value = "";
        return( Value );
    endsub;
run;

No matter what I do with LENGTH or ATTRIB statements to explicitly declare the length of the returned value, the function ALWAYS returns at most 33 bytes of the requested string, regardless of how long the actual value is. This happens no matter which attribute I search for. The same logic hard-coded into a data step returns the correct results, so the problem is related to PROC FCMP.

Here is the data step I'm using to test it (where PageSource.html is any HTML file that has XHTML-compliant, fully quoted attributes):

data TEST;
length href $200;
infile "F:\PageSource.html";

input;

htmline = _INFILE_;

href = getAttr( htmline, "href");
x = length(href);

run;

UPDATE: This seems to work properly after upgrading to SAS 9.2, Release 2.

+2  A: 

I think the problem (though I don't know why) is in the scan function: it seems to be truncating the input from substr(). If you pull the substr function out of scan() and assign its result to a new variable that you then pass to scan, it seems to work.

Here is what I ran:

proc fcmp outlib=work.funcs.crawl;
    function getAttr(htmline $, Attribute $) $;
        length y $200;

        /*-- Find the position of the match --*/
        Pos = index( htmline , strip( Attribute )||"=" );

        /*-- Now do something about it --*/
        if Pos > 0 then do;
            y = substr( htmline, Pos + length( Attribute ) + 2);
            Value = scan( y, 1, '"');
        end;
        else Value = "";
        return( Value );
    endsub;
run;

options cmplib=work.funcs;

data TEST;
length href $200;
infile "PageSource.html";

input;

htmline = _INFILE_;
href = getAttr( htmline, "href");
x = length(href);
run;
cmjohns
Brilliant DUDE! Crazy nuts that it acts this way, but Brilliant that it works now. Accepting your answer based on my evaluation of your previous SAS answers on SO. I will test when I get home later tonight (don't have SAS on this machine). Thanks!
Jay Stevens
Maybe it actually has something to do with the fact that you can't dimension a variable if it gets "returned" from the function?
Jay Stevens
Sorry @cmjohns. It didn't work for me. Not only did it continue to return 33 bytes, but I got to the point where I could reproducibly hard crash sas.exe just by trying to access the function. I don't think PROC FCMP is release quality for Data Step use.
Jay Stevens
Which version of SAS are you using, and did you add the length statement for y? I'm on 9.2 and I swear it's working for me. Here's the first line from test.sas7bdat:
Obs href
1 http://www.w3.org/StyleSheets/TR/W3C-REC.css
Obs htmline
1 <?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="http://www.w3.org/StyleSheets/TR/W3C-REC.css" type="text/css"?>
Obs x
1 44
cmjohns
+1  A: 

I ended up backing out of using FCMP-defined data step functions. I don't think they're ready for prime time. Not only could I not solve the 33-byte return issue, but the function also started regularly crashing SAS.

So back to the good old (decades old) technology of macros. This works:

/*********************************/
/*= Macro to extract Attribute  =*/
/*= from XHTML string           =*/
/*********************************/
%macro getAttr( htmline, Attribute, NewVar );
   if index( &htmline , strip( &Attribute )||"=" ) > 0 then do;
      &NewVar = scan( substr( &htmline, index( &htmline , strip( &Attribute )||"=" ) + length( &Attribute ) + 2), 1, '"' );
   end;
%mend;
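
For illustration, the macro above could be called from the same test data step used in the question. This is a minimal, untested sketch: it assumes the %getAttr macro has already been compiled in the current session and that F:\PageSource.html exists, as in the original post.

```sas
/* Hypothetical usage of %getAttr, mirroring the question's test step */
data TEST;
   length href $200;              /* declare the target variable's length up front */
   infile "F:\PageSource.html";   /* assumed path, taken from the question */

   input;                         /* read one raw line into the input buffer */
   htmline = _INFILE_;

   /* Expands to an IF/THEN block that assigns href only when href= is found */
   %getAttr( htmline, "href", href );
   x = length(href);
run;
```

Because the macro generates plain data step statements, href is an ordinary data step variable whose length comes from the LENGTH statement rather than from a function return value. The macro has no ELSE branch, so href simply stays blank on lines where the attribute is not found.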
Jay Stevens
+2  A: 

In this case, an input pointer control should be enough. Hope this helps.

/* create a test input file */
data _null_;
  file "f:\pageSource.html";
  input;
  put _infile_;
cards4;
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="w3.org/StyleSheets/TR/W3C-REC.css" type="text/css"?>
;;;;
run;

/* extract the href attribute value, if any.                          */
/* assuming that the value and the attribute name occurs in one line. */
/* and max length is 200 chars.                                       */
data one;
  infile "f:\pageSource.html" missover;
  input @("href=") href :$200.;
  href = scan(href, 1, '"'); /* unquote */
run;

/* check */
proc print data=one;
run;
/* on lst
Obs                  href
 1
 2     w3.org/StyleSheets/TR/W3C-REC.css
*/
Chang Chung