I have a text file that has data wrapped between tags. The tags are:

<title>
<url>
<pubDate>

So, the entries look like this:

<title>title 1</title>
<url>url 1</url> 
<pubDate>pubDate 1</pubDate>

<title>title 2</title>
<url>url 2</url> 
<pubDate>pubDate 2</pubDate>

<title>title 3</title>
<url>url 3</url> 
<pubDate>pubDate 3</pubDate>

I need a script that reads this text file and prepares each item to be inserted into a database. The query will look like this:

insert into table (title,url,pubdate) values ($title,$url,$pubdate)....
+2  A: 

Why are you using '&lt;' and not just '<'?

Just convert all of the '&lt;' and '&gt;' to '<' and '>' then throw it through something like XML::Simple in Perl.
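
The same decode-then-parse idea can be sketched in Python (chosen here only for illustration; the answer itself suggests Perl's XML::Simple). This assumes the file holds entity-escaped tags and has no single root element, so the sketch wraps the decoded text in a hypothetical `<entries>` root before parsing. The function name `parse_entries` is likewise made up for the example.

```python
import html
import xml.etree.ElementTree as ET

def parse_entries(text):
    """Decode entity-escaped tags once, then parse the result as XML."""
    decoded = html.unescape(text)  # '&lt;title&gt;' becomes '<title>'
    # Hypothetical wrapper element: XML needs exactly one root.
    root = ET.fromstring("<entries>" + decoded + "</entries>")
    titles = [e.text for e in root.findall("title")]
    urls = [e.text for e in root.findall("url")]
    dates = [e.text for e in root.findall("pubDate")]
    # Pair the fields up in document order: one tuple per entry.
    return list(zip(titles, urls, dates))

sample = (
    "&lt;title&gt;title 1&lt;/title&gt;\n"
    "&lt;url&gt;url 1&lt;/url&gt;\n"
    "&lt;pubDate&gt;pubDate 1&lt;/pubDate&gt;\n"
)
print(parse_entries(sample))  # [('title 1', 'url 1', 'pubDate 1')]
```

Pairing by position assumes every entry has all three tags; a file with missing fields would need a stricter grouping step.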

Weegee
rascher
+1  A: 

Or SimpleXML in PHP5 http://php.net/simplexml

@rascher there shouldn't be any problem with converting the XML entities to "XML literals".

<title>C &gt; Java</title>

Would be encoded as:

&lt;title&gt;C &amp;gt; Java&lt;/title&gt;

And decoding the XML entities would produce valid XML.
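
A quick way to check this claim, sketched in Python: assume a title whose content already contains an escaped `>`. In the entity-encoded file that inner escape is double-encoded as `&amp;gt;`, so a single decode pass restores the tags while leaving the content safely escaped.

```python
import html
import xml.etree.ElementTree as ET

# Entity-encoded form of: <title>C &gt; Java</title>
encoded = "&lt;title&gt;C &amp;gt; Java&lt;/title&gt;"

decoded = html.unescape(encoded)      # one decode pass only
print(decoded)                        # <title>C &gt; Java</title>

# The decoded text is well-formed XML, and the parser
# resolves the remaining &gt; inside the element content.
print(ET.fromstring(decoded).text)    # C > Java
```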

bucabay
+1  A: 
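
#!/usr/bin/perl

use strict;
use warnings;

# Fields collected for the record currently being read.
my %seen = ();

# True once title, url and pubDate have all been read.
sub seen_all {
    return defined $seen{title}
        && defined $seen{url}
        && defined $seen{pubDate};
}

while (<>) {
    # Capture the tag name and the text between <tag> and </tag>.
    if (/<(.+?)>(.+)<\/\1>/) {
        $seen{$1} = $2;
    }

    if (seen_all()) {
        # Double any single quotes so the values stay valid SQL string literals.
        my ($title, $url, $pubdate) =
            map { (my $v = $_) =~ s/'/''/g; $v } @seen{qw(title url pubDate)};
        print "insert into table (title,url,pubdate) "
            . "values ('$title','$url','$pubdate')\n";
        %seen = ();    # start collecting the next record
    }
}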
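
If the rows go straight into a database instead of being printed as SQL text, placeholders avoid quoting problems altogether. The following is a minimal sketch of the same line-by-line approach in Python, using an in-memory SQLite database purely for illustration. The table name `entries` and the function name `read_records` are assumptions, not part of the question.

```python
import re
import sqlite3

# Same pattern as the Perl script: <tag>text</tag> on a single line.
TAG_RE = re.compile(r"<(.+?)>(.+)</\1>")
FIELDS = ("title", "url", "pubDate")

def read_records(lines):
    """Yield one (title, url, pubDate) tuple per complete entry."""
    seen = {}
    for line in lines:
        match = TAG_RE.search(line)
        if match:
            seen[match.group(1)] = match.group(2)
        if all(field in seen for field in FIELDS):
            yield tuple(seen[field] for field in FIELDS)
            seen = {}  # start collecting the next entry

lines = [
    "<title>O'Reilly news</title>",
    "<url>url 1</url> ",
    "<pubDate>pubDate 1</pubDate>",
]

conn = sqlite3.connect(":memory:")  # illustrative in-memory database
conn.execute("CREATE TABLE entries (title TEXT, url TEXT, pubdate TEXT)")
# The driver binds each value to a ? placeholder, so quotes need no escaping.
conn.executemany(
    "INSERT INTO entries (title, url, pubdate) VALUES (?, ?, ?)",
    read_records(lines),
)
rows = conn.execute("SELECT title, url, pubdate FROM entries").fetchall()
print(rows)
```

With a real file, `read_records(open(path))` would replace the hard-coded `lines` list.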
dsm
+1  A: 

You might like to look at Text::Balanced. It has a function "extract_tagged" that solves exactly the problem you have outlined.

Gurunandan
A: 

You could use PHP's SimpleXML class, which can read RSS feeds:

$data = file_get_contents('http://www.example.com/path-to-feed.xml');
$xml = new SimpleXMLElement($data);

foreach($xml->feed as $feed){
 echo $feed->title;
 echo '<br />';
 echo $feed->url;
 echo '<br />';
 echo $feed->pubDate;
 echo '<br />';
}
Ben Shelock