I have GNU Wget 1.10.2 on both Windows and Linux, and the -k option behaves differently on the two.

-k, --convert-links make links in downloaded HTML point to local files.

On Windows it produces:

www.example.com/index.html
www.example.com/index.html@page=about
www.example.com/index.html@page=contact
www.example.com/index.html@page=sitemap

and on Linux it produces:

www.example.com/index.html
www.example.com/index.html?page=about
www.example.com/index.html?page=contact
www.example.com/index.html?page=sitemap

This is problematic on Linux because when I serve the mirror through Apache, it cannot distinguish between the four generated pages: everything after the question mark (?) is treated as the query string for the file.

Any ideas on how I can control this?

Thanks.

+3  A: 

You can't use a question mark (?) in a filename on NTFS or FAT32. This is why wget uses the at symbol (@) instead.

On Linux, most filesystems forbid only the slash (/) and the NUL byte in filenames, so wget keeps the question mark as it appears in the URI.

You can force either behaviour by using --restrict-file-names=unix or --restrict-file-names=windows.
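As a rough sketch of how that option fits into a recursive download: the helper name mirror_with_windows_names is made up for illustration, and the use of --recursive is an assumption, since the question does not say which recursion options were used.

```shell
# Hypothetical wrapper: download a site recursively, convert links for local
# viewing, and store files with Windows-style names
# (for example "index.html@page=about") even when running on Linux.
mirror_with_windows_names() {
    wget --recursive --convert-links --restrict-file-names=windows "$1"
}

# Usage (needs network access):
#   mirror_with_windows_names http://www.example.com/
```

With names like these, the files no longer contain a literal question mark, so a web server should not mistake part of the filename for a query string.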

From the wget documentation:

When mode is set to “unix”, Wget escapes the character ‘/’ and the control characters in the ranges 0–31 and 128–159. This is the default on Unix-like OS'es.

When mode is set to “windows”, Wget escapes the characters ‘\’, ‘|’, ‘/’, ‘:’, ‘?’, ‘"’, ‘*’, ‘<’, ‘>’, and the control characters in the ranges 0–31 and 128–159. In addition to this, Wget in Windows mode uses ‘+’ instead of ‘:’ to separate host and port in local file names, and uses ‘@’ instead of ‘?’ to separate the query portion of the file name from the rest. Therefore, a URL that would be saved as ‘www.xemacs.org:4300/search.pl?input=blah’ in Unix mode would be saved as ‘www.xemacs.org+4300/search.pl@input=blah’ in Windows mode. This mode is the default on Windows.
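To make the quoted example concrete, here is a small sketch that imitates only the two separator changes described above. The function name windows_mode_name is hypothetical, and this is not how Wget implements it: real Wget also percent-escapes the other listed characters, which the sketch ignores.

```shell
# Approximate the Windows-mode separator changes from the documentation:
# the first ':' (host/port separator) becomes '+',
# and the first '?' (query separator) becomes '@'.
windows_mode_name() {
    printf '%s\n' "$1" | sed -e 's/:/+/' -e 's/?/@/'
}

windows_mode_name 'www.xemacs.org:4300/search.pl?input=blah'
# prints: www.xemacs.org+4300/search.pl@input=blah
```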

Can Berk Güder
Thanks a lot for the info. I could mass-rename the files, but I'd also have to search and replace these references inside the HTML files themselves, right?
cherouvim
Yes, and that would be more work than necessary. It took me a moment but I found the command-line option. =)
Can Berk Güder
+2  A: 

--restrict-file-names=windows

ax
+1  A: 

This is problematic on Linux because when I serve the mirror through Apache, it cannot distinguish between the four generated pages: everything after the question mark (?) is treated as the query string for the file.

To include a question mark in a URL path part, you can escape it:

www.example.com/index.html%3Fpage=about

I'd think --convert-links should be doing this for you; it may be a bug if it doesn't.
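For illustration, the escaping could be sketched like this. The function name escape_query_separator is made up, and the sketch only handles the question mark, not other characters that may need percent-encoding in a link.

```shell
# Percent-encode every literal '?' so that it is treated as part of the
# path (the saved filename) instead of starting a query string.
escape_query_separator() {
    printf '%s\n' "$1" | sed 's/?/%3F/g'
}

escape_query_separator 'www.example.com/index.html?page=about'
# prints: www.example.com/index.html%3Fpage=about
```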

bobince