What's the easiest way to strip HTML tags in Perl? I am using a regular expression to parse HTML from a URL, which works great, but how can I strip the HTML tags off?

Here is how I am pulling my HTML:

#!/usr/bin/perl -w
use strict;
use warnings;
use LWP::Simple;
my $now_string = localtime;

my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html")
    or die "Could not fetch NWS page.";
$html =~ s/<script.*?<\/script>//sg;   # remove script elements and their contents
$html =~ s/<.+?>//sg;                  # remove all remaining tags
$html =~ m{(Hail Reports.*)Wind Reports}s || die;
my @hail = $1;
A: 

If you just want to remove HTML tags:

s/<script.*?<\/script>//sg
s/<.+?>//sg

This will (most of the time) remove script tags and their contents, and all other HTML tags. You could also probably remove everything before the <body> tag safely with regex.
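
As a minimal sketch of the two substitutions applied in order: the `strip_tags` helper and the sample string below are made up for illustration and are not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): applies both substitutions in order.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<script.*?<\/script>//sg;   # drop script elements and their contents
    $html =~ s/<.+?>//sg;                  # drop every remaining tag
    return $html;
}

# Made-up sample input.
my $sample = '<p>Hail <b>Reports</b></p><script>var x = 1;</script>';
print strip_tags($sample), "\n";   # prints "Hail Reports"
```

The order matters: if the generic tag regex ran first, the script tags would be gone but the JavaScript between them would be left in the text.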

For anything more complex than that, though, regular expressions are not a suitable tool, and you really need to parse the HTML with an actual HTML parser and then manipulate that to remove the tags.

Anon.
@Downvoter: Yes, I realize that regular expressions for HTML are generally a bad idea. I even commented on that in the answer. However, for simple text manipulation, they will suffice. If you disagree, perhaps you should leave a comment explaining your position instead of a hit-and-run that leaves us both with less rep and doesn't improve anything.
Anon.
I know that you should use tokens more than regular expressions, but for what I need to do, regular expressions work well.
shinjuo
I am still very new to Perl and intend to rewrite my program the correct way, but I just want to get it going for now and then work on the other way after it is done. I know people don't like that idea and most say to just do it right the first time, but I just want to get it going quickly and will then adjust it correctly.
shinjuo
@Anon where would I put that code you gave me? I put a snippet of my code in my question for you to see.
shinjuo
You want to match the fetched HTML against the two regexes - after fetching the page, you can put `$html =~ s/<script.*?</script>//sg`, and similarly for the second one.
Anon.
Anon, `s/<script.*?</script>//sg` needs a `\` before the `/` in `</script>`.
Kinopiko
I have updated my code to what you gave me, which works very well. While we are at it, is there an existing HTML parser that I should use, or is it better to write your own? I have only read the Learning Perl book and I am now on Intermediate Perl.
shinjuo
@Kinopiko: Indeed it does. Thanks. @shinjuo: When it comes to Perl, if you ever find yourself asking "Is there an {X} already made", the answer is probably "Yes, and you'll find it on CPAN". The base module is known as HTML::Parser, and there are several other modules derived from/based on that which you may find useful.
Anon.
@hobbs: A couple of regex substitutions is hardly "overly complex".
Anon.
@hobbs ??? I don't know much about Perl, but why is this "overly complex"? It works well and was easy to implement.
shinjuo
@hobbs People come to this site for help, not to have other people treat them like they are stupid. If you don't like the way he is doing it, then explain a better way to do it. So far all you have done is put him down, and you have helped me out none.
shinjuo
@shinjuo I'd love to help you, but so far you've asked a question that only admits stupid answers.
hobbs
No, I have asked a question that you apparently can't answer. I asked for help. You did read that I said I was new, right? So how about instead of being an ass you try to help people out who ask for help. You don't like the way I am doing it? Then explain to me why my way is wrong and leave it at that. So far all you have done is get on here, tear down every answer I got and basically call me an idiot without ever explaining a better way to do anything. Why do you even come to this site if not to help?
shinjuo
Also, Anon's code worked for what I wanted. I also mentioned that I knew this way was not the best way to do it and said I was still learning but just wanted to get a simple program going to pull some information for me until I learned the correct way to do it. I am sorry I didn't wake up this morning knowing how to program Perl like you, hobbs.
shinjuo
+4  A: 

An attempt to answer your misguided question


Problems


Stripping HTML with regexes is a bad habit to get into: HTML has so many rules, and so many ways around them, that your code may eventually be open to attack. While you might have a legitimate need for something simple now, it is very easy to reuse code and forget why reusing it was a bad idea, especially when you don't add comments like `# This code is NOT secure and should not be used to parse HTML anywhere else!!!` or `# Christina Aguilera writes songs based on this code!!!`

Example of differences in HTML that require lots of regex rules:

<div>...</div>
<div style="blah">
<div style="background:url(../div)">
<div style=".." class='noticesinglequote'>

The list goes on and that's only for well-formed HTML. Some other examples of problems include:

  1. HTML elements closed improperly (eg <div><span></div></span>) or not at all
  2. Spelling errors (eg <dvi>..</div>)
  3. HTML designed with the intention to break your script
  4. Other issues: comments, whitespaces, charsets, etc
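
To make a couple of these concrete, here is a small sketch that runs the generic tag-stripping regex from the other answer over two made-up inputs. The `naive_strip` helper is a hypothetical name for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the generic tag-stripping regex.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.+?>//sg;
    return $html;
}

# A ">" inside a quoted attribute value ends the match too early,
# so part of the attribute leaks into the output.
my $attr = naive_strip('<a title="a > b">link</a>');

# A comment that contains ">" leaks its tail into the output as well.
my $comment = naive_strip('<!-- if x > 1 -->text');

print "[$attr]\n";      # prints [ b">link]
print "[$comment]\n";   # prints [ 1 -->text]
```

Both inputs are valid HTML, and both come out with markup debris still attached. A parser handles them because it tracks quoting and comment state instead of scanning for the next `>`.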

Solution


You may have accepted an answer, but you should look at XML::Parser and HTML::TreeBuilder.

Rather than stripping out parts of the HTML document, you are probably more interested in drilling down to the part of the document you want (e.g. everything in <body> or a certain div inside it), which is what the modules above provide. Parsers can also do their best to remove all HTML elements and return only text/CDATA.
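
A rough sketch of the drill-down idea with HTML::TreeBuilder (a CPAN module, so it has to be installed first). The `body_text` helper and the sample markup are made up for illustration; the real page will need its own `look_down` criteria.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text inside <body>, with tags removed.
sub body_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down returns the first matching element (or nothing).
    my $body = $tree->look_down(_tag => 'body');
    my $text = $body ? $body->as_trimmed_text : '';

    $tree->delete;    # release the tree explicitly
    return $text;
}

# Made-up sample document.
my $sample = '<html><head><title>Storm page</title></head>'
           . '<body><div class="x">Hail <b>Reports</b></div></body></html>';
print body_text($sample), "\n";
```

The same `look_down` call accepts attribute criteria (for example `_tag => 'div', class => 'x'`), which is how you would narrow the search to one table or div instead of the whole body.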

vol7ron
shinjuo
All I need is just the words between Hail damage and storm damage, and the regex works for that. It does not do everything I want it to do at the moment, but logging the information from the site was the most important thing, and this was just a simple workaround until I learn more. As for reusing it, I am working on upgrading it now. I will not reuse it.
shinjuo
**@shinjuo:** The NOAA page you have listed provides a dynamically generated CSV file for their data. It'd be better to just retrieve the data from that file; that way no HTML parsing is even needed. Otherwise, it seems the information you want is contained in a table, so again, all you have to do is traverse the HTML tree to the data that you need.
vol7ron
The reason I am pulling it off the page and not the file is that if a hail size is larger than 2.0, I want it to email me the information listed. They update that every 10 minutes.
shinjuo
**@shinjuo:** The hail size is in the CSV. If it's larger than 2.0, it can easily email all the information in the CSV.
vol7ron
But wouldn't that mean I would have to download the file every 10 minutes?
shinjuo
You need to download either the CSV or the HTML page every 10 minutes; one is no better than the other. The difference is that one (the CSV) has less data and is easier to pull information from.
vol7ron
Is there an easy way to download files from the internet? I have not read that far in my books
shinjuo
I think it's the exact same code with LWP::Simple. Instead of retrieving a `.html` file, you're just getting a `.csv`; both are text-based files.
vol7ron
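
A minimal sketch of the filtering step, assuming the CSV text has already been fetched (for example with `get($csv_url)` from LWP::Simple, where `$csv_url` is whatever the CSV link on the page turns out to be). The column position, the units, and the sample rows are all assumptions; check them against the real file first.

```perl
use strict;
use warnings;

# Hypothetical helper: return the rows whose size column exceeds $min.
# Assumes a header line, comma-separated fields with no quoted commas,
# and the size in column $size_col (0-based).
sub large_hail {
    my ($csv_text, $size_col, $min) = @_;
    my @lines = grep { /\S/ } split /\r?\n/, $csv_text;
    shift @lines;    # discard the header line
    my @hits;
    for my $line (@lines) {
        my @fields = split /,/, $line, -1;
        my $size = $fields[$size_col];
        # Skip rows where the size field is missing or not a plain number.
        next unless defined $size && $size =~ /^\d+(?:\.\d+)?$/;
        push @hits, $line if $size > $min;
    }
    return @hits;
}

# Made-up sample standing in for the downloaded CSV text.
my $csv = "Time,Size,Location\n1200,1.75,Somewhere\n1210,2.50,Elsewhere\n";
print "$_\n" for large_hail($csv, 1, 2.0);   # prints 1210,2.50,Elsewhere
```

If the real file quotes fields that contain commas, a plain `split` will break, and a CPAN module such as Text::CSV would be the safer choice.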
I think this should be the accepted answer.
Armando
+2  A: 

As mentioned, don't use regular expressions for this. There are simply too many exceptions.

One CPAN module which can help is HTML::Strip:

use HTML::Strip;

my $hs         = HTML::Strip->new();
my $clean_text = $hs->parse( $raw_html );
$hs->eof;

It's worth learning what's available on the CPAN and making use of it. It will save you a lot of work in the long run.

Ovid
Yeah I have been getting on there and looking around. Thanks for the help
shinjuo