I am trying to convert an HTML file to XML. It works for the most part, but links are a problem: the converter seems to completely ignore the link in my test file.

Here is the conversion code:

<?php
ini_set('display_errors', 1); 
ini_set('log_errors', 1); 
ini_set('error_log', dirname(__FILE__) . '/error_log.txt'); 
error_reporting(E_ALL);

function convertToXML()
{

    $titleLength = 35;
    $output = "";
    $date = date("D, j M Y G:i:s T");
    $fi = fopen( "../newsTEST.htm", "r" );
    $fo = fopen( "../newsfeed.xml", "w" );

    //This is the first part of the XML
    $output .= "<?xml version=\"1.0\"?>\n";
    $output .= "<rss version=\"2.0\">\n";
    $output .= "<channel>\n";
    $output .= "\t<title>Wiggle 100 News</title>\n";
    $output .= "\t<link>http://www.wiggle100.com/news.php</link>\n";
    $output .= "\t<description>Wiggle 100 Daily News</description>\n";
    $output .= "\t<language>en-us</language>\n";
    $output .= "\t<pubDate>". $date ."</pubDate>\n";
    $output .= "\t<managingEditor>[email protected]</managingEditor>\n";
    $output .= "\t<webMaster>[email protected]</webMaster>\n";

    $article = "";
    $skip = true; //if false will continue to put lines into output until </p>
    $newArticle = false;

    while( !feof($fi) )
    {
     $line = fgets($fi);
     $link = "";

     if( strpos( $line, "<p" ) !== false)
     {
      $pos = strpos( $line, "<p" );
      $line = substr( $line, $pos );

      $pos = strpos( $line, ">" );
      $line = substr( $line, $pos + 1 );

      $skip = false;   
     }

     if( strpos( $line, "</p>" ) !== false )
     {
      $pos = strpos( $line, "</p>" );
      $line = substr( $line, 0, $pos - 1 );

      $newArticle = true;
     }

     //This adds the line to the article
     if( !$skip )
     {
      $article .= $line;
     }

     //This mixes the article, title, link, and date with 
     // XML and puts it into the output
     if( $newArticle )
     {
      //This if is to get rid of stuff like <p>&nbsp;</p>
      if( (strlen($article) > 10) )
      {
       $link = findLink( $article );
       //$article = strip_tags($article);
       $title = substr( $article, 0, $titleLength ) . "...";

       $output .= "\t<item>\n";
       $output .= "\t\t<title>". $title ."</title>\n";
       $output .= "\t\t<link>". $link ."</link>\n";
       $output .= "\t\t<description>". $article . "</description>\n";
       $output .= "\t\t<pubDate>". $date . "</pubDate>\n";
       $output .= "\t</item>\n\n";
      }

      $article = "";
      $line = "";
      $skip = true;
     }
    }

    $output .= "</channel>\n";
    $output .= "</rss>\n";

    fwrite( $fo, $output );

    fclose($fi);
    fclose($fo);

    echo "<br /><br /> News converted to XML";
}

    //*****************************************************************************
    //*****************************************************************************

    //Find and return a link in the input.
    //Otherwise use a default
    function findLink( $input )
    { 
     $link = "http://www.wiggle100.com/news.php";

     if( strpos( $input, "<a" ) !== false )
     {
      $startpos = strpos( $input, "href" );
      $link = substr( $input, $startpos + 5 );
      $endpos = strpos( $link, ">" );
      $link = substr( $link, 0, $endpos - 2 );
     }
     return $link;
    }


?>

Here is the HTML test input:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html><head><title>Test Page</title> 
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812"> 
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head> 
<body bgcolor="#ffffff"> 
<p>&nbsp;</p> 
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font> 
<a href="http://www.thedailyreview.com/news/">
http://www.thedailyreview.com/news/</a></p>
</body> 
</html>

Here is the XML output:

<rss version="2.0"> 
<channel> 
    <title>Wiggle 100 News</title> 
    <link>http://www.wiggle100.com/news.php</link>
    <description>Wiggle 100 Daily News</description> 
    <language>en-us</language> 
    <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    <managingEditor>[email protected]</managingEditor> 
    <webMaster>[email protected]</webMaster> 
    <item> 
     <title>This is an article. Blah. Blah. Bla...</title> 
     <link>http://www.wiggle100.com/news.php</link>
     <description>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
     <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
     <title>This is another article. Blah. Blah...</title> 
     <link>http://www.wiggle100.com/news.php</link>
     <description>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
     <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
     <title>This is the 3rd article. Blah. Blah...</title> 
     <link>http://www.wiggle100.com/news.php</link>
     <description>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
     <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
     <title><font size="6">This is the news for...</title> 
     <link>http://www.wiggle100.com/news.php</link>
     <description><font size="6">This is the news for today. Blah Blah Blah!</font> 
</description> 
     <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

</channel> 
</rss>

The font tag disappears when I uncomment the strip_tags() call.
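
For illustration, here is a minimal sketch of what strip_tags() does to the text of the last article. The variable names are only for this example. Note that strip_tags() would also remove an <a> tag, so any link has to be pulled out first, which is why findLink() is called before the commented-out strip_tags() line in the code above.

```php
<?php
// Text of the last article as it appears in the <description> element above.
$article = '<font size="6">This is the news for today. Blah Blah Blah!</font>';

// strip_tags() removes the HTML tags and keeps the text between them.
$plain = strip_tags($article);

echo $plain, "\n";
```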

+1  A: 

I did a bit of testing, and found that it works fine on paragraphs that are all on a single line in the input file, as in the example below. (Except that it reads the opening quotation mark as part of the URL, but that's easily fixed.)

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html><head><title>Test Page</title> 
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812"> 
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head> 
<body bgcolor="#ffffff"> 
<p>&nbsp;</p> 
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font> <a href="http://www.thedailyreview.com/news/"> http://www.thedailyreview.com/news/</a></p>
</body> 
</html>
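
One possible way to deal with the stray quotation mark mentioned above is to capture only what sits between the quotes after href. This is a rough sketch, not the original findLink(): the name extractHref(), the $default parameter, and the regular expression are all assumptions made for this example, and it only handles quoted href values.

```php
<?php
// Hypothetical replacement for findLink(): returns the href value of the
// first <a> tag in $input, or $default when no quoted href is found.
function extractHref(string $input, string $default = "http://www.wiggle100.com/news.php"): string
{
    // Group 1 matches the opening quote (single or double); group 2 is
    // everything up to the matching closing quote, so no quote ends up
    // in the returned URL.
    if (preg_match('/<a\b[^>]*\bhref\s*=\s*(["\'])(.*?)\1/i', $input, $m)) {
        return $m[2];
    }
    return $default;
}

// Usage with the paragraph from the test file:
$article = '<font size="6">This is the news for today. Blah Blah Blah!</font> '
         . '<a href="http://www.thedailyreview.com/news/"> http://www.thedailyreview.com/news/</a>';
echo extractHref($article), "\n";
```

For anything beyond simple, well-formed input, an HTML parser would likely be more robust than string or regex matching.
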
David
Thanks. That helped me to find the problem.
Josh Curren
A: 

The problem ended up being that I never reset $newArticle to false after writing to the XML output. Once $newArticle was set to true (when the first </p> was found), every later line was written out as soon as it was read, so an article could never span more than one line. Setting $newArticle to false after writing to the output lets the program keep adding lines to the article until </p> is encountered.
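
Here is a sketch of the corrected loop, reduced to the part that collects paragraphs. collectArticles() is a hypothetical helper, not part of the original code: it takes the input lines as an array instead of reading a file, and returns the article strings instead of building XML. It also cuts at $pos rather than $pos - 1, an assumption on my part, since $pos - 1 drops the last character before </p>.

```php
<?php
// Hypothetical helper: returns the text of each <p>...</p> block,
// using the same $skip / $newArticle flags as convertToXML().
function collectArticles(array $lines): array
{
    $articles = [];
    $article = "";
    $skip = true;        // true while we are outside a paragraph
    $newArticle = false; // true once the closing </p> has been seen

    foreach ($lines as $line) {
        $pos = strpos($line, "<p");
        if ($pos !== false) {
            // Drop everything up to the end of the opening <p ...> tag
            $line = substr($line, $pos);
            $line = substr($line, strpos($line, ">") + 1);
            $skip = false;
        }

        $pos = strpos($line, "</p>");
        if ($pos !== false) {
            // Keep only the text before </p>
            $line = substr($line, 0, $pos);
            $newArticle = true;
        }

        if (!$skip) {
            $article .= $line;
        }

        if ($newArticle) {
            // Ignore near-empty paragraphs such as <p>&nbsp;</p>
            if (strlen($article) > 10) {
                $articles[] = $article;
            }
            $article = "";
            $skip = true;
            $newArticle = false; // the reset that was missing
        }
    }
    return $articles;
}

// Usage: a paragraph that spans three lines, as in the test file.
$lines = [
    "<p>&nbsp;</p>\n",
    "<p align=\"center\"><font size=\"6\">This is the news for today.</font>\n",
    "<a href=\"http://www.thedailyreview.com/news/\">\n",
    "http://www.thedailyreview.com/news/</a></p>\n",
];
$articles = collectArticles($lines);
```

With the reset in place, the three lines end up in a single article, so the <a> tag is still in the text when the link is looked up.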

Josh Curren