The question says it all: I've got a 500,000-line file that gets generated as part of an automated build process on a Windows box, and it's riddled with ^M's. When it goes out the door it needs to be *nix friendly. What's the best approach here? Is there a handy snippet of code that could do this for me, or do I need to write a little C# or Java app?

+1  A: 

FTP it from the DOS box to the Unix box as an ASCII file instead of a binary file. FTP will strip the CR-LF and insert an LF. Then transfer it back to the DOS box as a binary file, and the LF will be retained.

EvilTeach
I'm not such a fan of this one, seems like it would be a PITA as part of an automated build. Plus, if I don't have a local unix box on the network, I've either got to buy one, or transfer the file over the WAN, twice. Must be possible to do this locally, no?
ninesided
Neither am I. It requires at least one running FTP server, which is a little overkill for a file conversion.
Federico Ramponi
Good answer to get a laugh though!
Tom Leys
FTP in ascii mode can also translate between tabs and spaces, depending on the implementation, which would be undesirable.
paxdiablo
+3  A: 
tr -d '^M' < infile > outfile

To type ^M, press Ctrl+V, then Enter.

Edit: You can use '\r' instead of manually entering a carriage return, [thanks to @strager]

tr -d '\r' < infile > outfile

Edit 2: 'tr' is a Unix utility. You can download a native Windows version from http://unxutils.sourceforge.net [thanks to @Rob Kennedy] or use Cygwin's Unix emulation.

hayalci
This works nicely if you have tr on the DOS box. It's fast too.
EvilTeach
cygwin may be of help
hayalci
I don't have tr, where can I find it?
ninesided
Don't want to install Cygwin just for this.
ninesided
Native, non-Cygwin utilities: http://unxutils.sourceforge.net/
Rob Kennedy
You can also write: tr -d '\r' < in > out
strager
Rob, you should've put your own answer in!
ninesided
A: 

There's a few methods mentioned here...

Vincent Van Den Berghe
+9  A: 

Here is a Perl one-liner, taken from http://www.technocage.com/~caskey/dos2unix/

#!/usr/bin/perl -pi
s/\r\n/\n/;

You can run it as follows:

perl dos2unix.pl < file.dos > file.unix

Or you can run it like this (the conversion is done in place):

perl -pi dos2unix.pl file.dos

And here is my (naive) C version:

#include <stdio.h>

int main(void)
{
   int c;
   while( (c = fgetc(stdin)) != EOF )
      if(c != '\r')
         fputc(c, stdout);
   return 0;
}

You should run it with input and output redirection:

dos2unix.exe < file.dos > file.unix
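
One caveat when this is built against a Windows C runtime: stdin and stdout are opened in text mode there, so the runtime typically translates each '\n' it writes back into CR-LF, and the output can still end up with DOS line endings. A minimal sketch of a workaround follows, assuming the Microsoft CRT calls _setmode and _fileno (from <io.h>) and the _O_BINARY flag (from <fcntl.h>). The helper names use_binary_mode and strip_cr are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <fcntl.h>  /* _O_BINARY */
#include <io.h>     /* _setmode, _fileno */
#endif

/* Hypothetical helper: put a stream into binary mode so the Windows
   runtime does no CR-LF translation. Does nothing on other systems. */
static void use_binary_mode(FILE *stream)
{
    assert(stream != NULL);
#ifdef _WIN32
    _setmode(_fileno(stream), _O_BINARY);
#else
    (void)stream;
#endif
}

/* Same naive filter as above: drops every CR, whether or not an LF follows. */
static void strip_cr(FILE *in, FILE *out)
{
    int c;

    use_binary_mode(in);
    use_binary_mode(out);

    while ((c = fgetc(in)) != EOF)
        if (c != '\r')
            fputc(c, out);
}
```

Calling strip_cr(stdin, stdout) from main keeps the same command line as before.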
Federico Ramponi
Don't worry about performance until you must deal with terabytes :D The C version takes ~ 5 seconds to convert a 65 MB file with 500000 lines of text (on an old Pentium4 with a standard EIDE disk)
Federico Ramponi
@Federico, that (naive) C version will remove all CR characters, not just those in a CR-LF pair. But I guess that's why you called it naive. :-)
paxdiablo
@Pax: exactly :D
Federico Ramponi
+5  A: 

If you're on Windows and need something run in a batch script, you can compile a simple C program to do the trick.

#include <stdio.h>

int main() {
    while(1) {
        int c = fgetc(stdin);

        if(c == EOF)
            break;

        if(c == '\r')
            continue;

        fputc(c, stdout);
    }

    return 0;
}

Usage:

myprogram.exe < input > output

Editing in-place would be a bit more difficult. Besides, you may want to keep backups of the originals for some reason (in case you accidentally strip a binary file, for example).

That version removes all CR characters; if you only want to remove the ones that are in a CR-LF pair, you can use (this is the classic one-character-back method :-):

/* XXX Contains a bug -- see comments XXX */

#include <stdio.h>

int main() {
    int lastc = EOF;
    int c;
    while ((c = fgetc(stdin)) != EOF) {
        if ((lastc != '\r') || (c != '\n')) {
            fputc (lastc, stdout);
        }
        lastc = c;
    }
    fputc (lastc, stdout);
    return 0;
}
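
The bug flagged in that snippet appears to be that lastc is written while it still holds EOF: once on the first pass through the loop, and once after the loop when the input is empty. Here is a sketch of one possible fix that only writes lastc once it holds a real character. The function name strip_crlf is made up for illustration, and it takes the streams as parameters so it can be tried on any pair of files.

```c
#include <assert.h>
#include <stdio.h>

/* One-character-back filter: a CR is held back until the next character
   is known, and is dropped only when that character is an LF. */
static void strip_crlf(FILE *in, FILE *out)
{
    int lastc = EOF;  /* EOF means "nothing held back yet" */
    int c;

    assert(in != NULL && out != NULL);

    while ((c = fgetc(in)) != EOF) {
        /* Emit the held-back character unless it is the CR of a CR-LF pair. */
        if (lastc != EOF && (lastc != '\r' || c != '\n'))
            fputc(lastc, out);
        lastc = c;
    }

    /* Flush the final character; for empty input there is nothing to write. */
    if (lastc != EOF)
        fputc(lastc, out);
}
```

Calling strip_crlf(stdin, stdout) from main gives the same usage as the first program. A lone CR that is not followed by an LF passes through unchanged.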

You can edit the file in-place using mode "r+". Below is a general myd2u program, which accepts file names as arguments. NOTE: This program uses ftruncate to chop off extra characters at the end. If there's any better (standard) way to do this, please edit or comment. Thanks!

#include <stdio.h>
#include <unistd.h>  /* ftruncate(); POSIX, not standard C */

int main(int argc, char **argv) {
    FILE *file;

    if(argc < 2) {
        fprintf(stderr, "Usage: myd2u <files>\n");
        return 1;
    }

    file = fopen(argv[1], "rb+");

    if(!file) {
        perror(argv[1]);  /* Name the file that could not be opened. */
        return 2;
    }

    long readPos = 0, writePos = 0;
    int lastC = EOF;

    while(1) {
        fseek(file, readPos, SEEK_SET);
        int c = fgetc(file);
        readPos = ftell(file);  /* For good measure. */

        if(c == EOF)
            break;

        if(c == '\n' && lastC == '\r') {
            /* Move back so we override the \r with the \n. */
            --writePos;
        }

        fseek(file, writePos, SEEK_SET);
        fputc(c, file);
        writePos = ftell(file);

        lastC = c;
    }

    ftruncate(fileno(file), writePos); /* Not in C89/C99/ANSI! */

    fclose(file);

    /* 'cus I'm too lazy to make a loop. */
    if(argc > 2)
        main(argc - 1, argv + 1);  /* Advance so argv[1] is the next file. */

    return 0;
}
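
On the question of a standard way to avoid ftruncate: one option is to give up true in-place editing and go through a temporary copy, since reopening the original with mode "wb" truncates it using nothing beyond standard C. The sketch below assumes tmpfile() works in the build environment. The function name d2u_via_tempfile is made up for illustration. Note that it is not atomic: if the second pass is interrupted, the original is left partially written.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: rewrite `path` with CR-LF pairs turned into LF,
   using a temporary copy. Returns 0 on success, -1 on failure. */
static int d2u_via_tempfile(const char *path)
{
    FILE *in, *tmp, *out;
    int c, lastc = EOF;

    assert(path != NULL);

    in = fopen(path, "rb");
    if (!in)
        return -1;

    tmp = tmpfile();  /* binary temporary file, removed when closed */
    if (!tmp) {
        fclose(in);
        return -1;
    }

    /* Pass 1: write the converted text into the temporary file. */
    while ((c = fgetc(in)) != EOF) {
        if (lastc != EOF && (lastc != '\r' || c != '\n'))
            fputc(lastc, tmp);
        lastc = c;
    }
    if (lastc != EOF)
        fputc(lastc, tmp);
    fclose(in);

    /* Pass 2: "wb" truncates the original; copy the converted text back. */
    out = fopen(path, "wb");
    if (!out) {
        fclose(tmp);
        return -1;
    }
    rewind(tmp);
    while ((c = fgetc(tmp)) != EOF)
        fputc(c, out);
    fclose(tmp);

    return fclose(out) == 0 ? 0 : -1;
}
```

A main that loops over argv[1] through argv[argc - 1] and calls this once per file would also do away with the recursive call.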
strager
@strager, fixed to use ints (required for EOF) and added code to do CRs only in a CR-LF pair - hopefully this'll get you more rep. Oh yes, and upvoted.
paxdiablo
I noticed the correction using int; thanks! I'll leave the second one alone, even if it isn't my style. =]
strager
The second snippet fails on the empty file, although it's fairly trivial to fix that.
Adam Rosenfield
A: 

Some text editors, such as UltraEdit/UEStudio have this functionality built-in.

File > Conversions > DOS to UNIX

nickf
gVim can also do this: it loads the file automatically in DOS mode, then you type ":set fileformat=unix" without the quotes (from memory) and save.
paxdiablo
not useful for an automated process though...
ninesided
ah, true. UEStudio does actually have a rather good scripting and macro system built in, which would actually let you do this via the command line, but you're right, it's not the best tool for an automated process.
nickf
+4  A: 

If installing a base Cygwin is too heavy, there are a number of standalone, console-based dos2unix and unix2dos programs for Windows on the net, many with C/C++ source available. If I'm understanding the requirement correctly, either of these solutions would fit nicely into an automated build script.

Ken Gentle
A: 

If it is just one file, I use Notepad++. Nice because it is free. I have Cygwin installed and use a one-liner script I wrote for multiple files. If you're interested in the script, leave a comment. (I don't have it available to me at this moment.)

Paul