
I'm attempting to use wget with the -p option to download specific documents and the images linked in the HTML.

The problem is that the site hosting the HTML prepends some non-HTML content before the HTML itself. Because of this, wget does not recognize the document as HTML and never searches it for images.

Is there a way to have wget strip the first X lines and/or force searching for images?

Example URL:

First Lines of Content:

<DOCUMENT>
<TYPE>S-4
<SEQUENCE>1
<FILENAME>ds4.htm
<DESCRIPTION>FORM S-4
<TEXT>
<HTML><HEAD>
<TITLE>Form S-4</TITLE>

Last Lines of Content:

</BODY></HTML>
</TEXT>
</DOCUMENT>

EDIT: Solutions in PHP are certainly accepted.

A: 

In PHP, you could use this function to strip out X lines:

function strip_toplines($string, $lines){
    $string = explode(PHP_EOL, $string);
    $output = ''; // initialise to avoid an undefined-variable notice
    foreach($string as $line_num => $line){
        // keep everything after the first $lines lines
        if($line_num > ($lines - 1)){
            $output .= $line . PHP_EOL;
        }
    }
    return trim($output);
}

and then this:

strip_toplines(file_get_contents($url),6);
Jamza
True, but I need to download all the images from the HTML as well.
St. John Johnson
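To address the images as well, one approach is to strip the wrapper and then pull the `<img>` sources out of the remaining HTML so they can be fetched one by one. A minimal sketch (the function name `extract_image_urls()` is my own invention; it assumes the `DOMDocument` extension is available):

```php
<?php
// Collect the src attributes of all <img> tags in an HTML string.
// The SEC-wrapped filings are not strict HTML, so parser warnings
// are suppressed with @.
function extract_image_urls($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $urls = array();
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $urls[] = $src;
        }
    }
    return $urls;
}
```

Each returned URL could then be resolved against the page's URL and saved locally with `file_get_contents()` / `file_put_contents()`.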
+1  A: 
Devon_C_Miller
Great find! I didn't even think to look at the robots file. Your alternate method gave me some issues (due to anchor links in the file), so instead I'm just bypassing the robots file with `-e robots=off`. Thank you!
St. John Johnson
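The workaround mentioned in the comment above might look something like this (the URL is a placeholder, since the real filing URL was not given in the question):

```shell
# -p fetches page requisites (images, CSS, etc.);
# -e robots=off tells wget to ignore the site's robots.txt.
EDGAR_URL="http://example.com/ds4.htm"   # placeholder URL
wget -p -e robots=off "$EDGAR_URL"
```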