ansaurus

Question

Using stdio to read and sort piped data in Linux

Answer 1

+2 A:

fgets and fprintf deal with strings (which in C are char-arrays ending with the special character \0).

Use fread and fwrite for binary data. qsort is good, except your compare function needs to compare byte by byte for entry_size bytes, not just cast from void to int, which is unsafe and probably incorrect.

jakber 2009-02-09 21:50:10

Answer 2

+2 A:

A couple of points that might help (I'm assuming that this is homework, so this has a pedagogical structure):

You are going to need to stick the data somewhere while you're accumulating it. And since you don't know how much is coming, dynamic allocation is going to be necessary. Lots of choices here: dynamic arrays, linked lists, hash tables, etc. ad nauseum... Pick one that will work for you (see below).
Once you've got it, you're going to need to sort it. Or, you could store it sorted in the first place. Do you happen to know a data structure that can manage that?
Once it is sorted, you just spit it out the other end.

Easy, no?

Apropos the comment:

Do you have K&R? Or some other basic text on c or c++? You'll want one. And you want to pick one language or the other and stick with it. If you are used to OO programing in another context use c++. If you don't know either, use c++.
If you're using c++, the STL container will take care of the dynamic storage for you. The simplest thing would be to use a std::vector<char[2]> (using the size from the command line argument), and invoke the <algorithm> sort on it.
Other people have pointed out that you need to do binary IO. This is important, because neither your input nor your output are structured like text IO.

dmckee 2009-02-09 21:51:05

DV 2009-02-09 21:57:11

Answer 3

+2 A:

printf interprets \0 in your stream as terminating character.

Use read() and write() instead:

    int res;
    int count;
    while (1) {
            count = 0;
            while(count < entry_size) {
                    res = read(STDIN_FILENO, input + count, entry_size - count);
                    if (res <= 0)
                            exit(errno);
                    count += res;
            }
            count = 0;
            while(res) {
                    res = write(STDOUT_FILENO, input + count, entry_size - count);
                    if (res < 0)
                            exit(errno);
                    count += res;
            }
    }

Quassnoi 2009-02-09 21:56:31

Answer 4

+5 A:

Don't use scanf() and printf(). Those are supposed to be used with text data. Since you're dealing with binary data, you instead want to use the lower-level fread() and fwrite() functions. Since you don't know how much total data there is, you'll have to use a dynamic data structure (such as a resizeable array) to store the input. You can't process the data on-line -- you have to read it all in, sort it, then write it back out. Your sort function is also wrong -- you're only comparing the first 4 bytes of each record, which is wrong if the record size is anything other than 4 bytes. Use memcmp() instead.

This should work:

char *input = NULL;
size_t input_size = 0;
int num_items = 0;
int entry_size;

int compare_func(const void *e1, const void *e2)
{
  return memcmp(e1, e2, entry_size);
}

int main(int argc, char **argv)
{
   // ...
  char *datum = malloc(entry_size);  // check for NULL
  input_size = 4096;
  input = malloc(input_size);  // check for NULL

  while(1)
  {
    if(fread(datum, 1, entry_size, stdin) < entry_size)
      break;
    size_t new_size = (num_items + 1) * entry_size;
    if(new_size > input_size)
    {
      input = realloc(input, input_size * 2);  // check for NULL
      input_size *= 2;
    }
    memcpy(input + num_items * entry_size, datum, entry_size);
    num_items++;
  }

  qsort(input, num_items, entry_size, compare_func);

  fwrite(input, entry_size, num_items, stdout);

  return 0;
}

Adam Rosenfield 2009-02-09 22:08:11

Not doubling the buffer size on realloc may save a lengthy and unnecessary explanation, but it is making my hands itch.

dmckee 2009-02-09 22:26:21

elaborate please?

DV 2009-02-09 22:32:29

By doubling when you resize the buffer (i.e. input) you guarantee amortized O(n) coping costs. The worst case for extend-only-as-needed expansions is O(n^2). Probably not an issue here (because the buffer is the only thing using the heap), but important in other cases.

dmckee 2009-02-09 22:45:20

In the above you would write something like:int temp = input_size ? input_size*2 : entry_size; input = realloc(input, temp); input_size = temp;which is much less clear the first time you see it.

dmckee 2009-02-09 22:49:43

Fixed; I was thinking about that while writing it, but it slipped my mind.

Adam Rosenfield 2009-02-09 22:51:34

Answer 5

+2 A:

P.S. People asked if this is a homework assignment... It's not - developers at my company want to use this to debug output of their project.

If it's not homework, and this is under Linux, why not use the included "sort" program?

E.g.:

% printf 'D\x00C\x00\x00B\x00A' | sort -z | od -c
0000000  \0   A  \0   B  \0   C  \0   D  \0
0000011

FSF/GNU sort offers the -z option:

-z, --zero-terminated
        end lines with 0 byte, not newline

Amended to add:

Okay, seeing as of how you still want your own code...

And seeing as of how no one else has yet posted an STL-based approach.

Note the use of struct FUNCTOR for the comparison (via stl::sort()). And, if you want, you could always use ostream_iterator<string>(cout, "\n") instead to make things a little more human-readable..

#include <iostream>
#include <fstream>
#include <iomanip>
#include <string>
#include <vector>
#include <algorithm>
#include <iterator>

using namespace std;

   /* ifstream won't set EOF-state until we try to read past the end-of-file.*/
   /* (Makes for lots of grief trying to read in files...)                   */
inline bool
IsStreamStillGood( istream & theStream )
{
  return theStream && (theStream . peek() != EOF);
}

template<class TYPE> inline void DELETE( TYPE * x)
{
  delete x;
  x = (TYPE *) NULL;
}

struct FUNCTOR
{
  bool operator()(const string & x, const string & y) { return x < y; }
};


int
main(int argc, char **argv)
{
  istream *       stream;
  vector<string>  v;

  UASSERT( argc, >, 1 );

  const int recordSize = atoi( argv[1] );
  char      buffer [ recordSize + 1 ];

  UASSERT( recordSize, >, 0 );


  if ( argc > 2 )
    stream = new ifstream( argv[2] );
  else
    stream = & cin;


  while ( IsStreamStillGood( * stream ) )
  {
    stream-> read( buffer, recordSize );
    v.push_back( string( buffer, stream->gcount() ) );
  }

  UASSERT( v.back().size(), ==, size_t(recordSize) );


  FUNCTOR functor;
  sort( v.begin(), v.end(), functor );

  copy( v.begin(), v.end(), ostream_iterator<string>(cout) );


  if ( argc > 2 )
    DELETE(stream);
}

Amended (again) to add:

Question regarding usage of strings in your STL approach: since there are potentially null characters ('\0') in the input, wouldn't strings get confused because of that? Someone suggested using char[] but I suspect the same problem would arise

If I have char c[10];, I'm free to say c[0] = '\0';. Same for c[1] or c[9]. I can have as many null characters in that array as I want. (Subject to the size of the array, of course.)

Now, when using c as a c-string, the question arises as to how long it is. Is it 1 character long or 9? Normally, in C, this is decided by where the first NULL character appears.

So things like printf(%s), scanf(%s), strncat(), strncpy(), strncmp(), etc aren't really happy with NULL ('\0') characters embedded within our binary data array.

But C++ std::string keeps track of the length independently. At least it seems to, and it permits things like: myString.append( 10, '\0' );

So, by using stream->read(buffer,recordSize) we read in a set number of chars (bytes). We don't really care whether they are nulls ('\0') or not. It's all good. Just give me recordSize number of bytes.

By creating with v.push_back(string(buffer, stream->gcount())), we push back a new string containing stream->gcount() chars (bytes). And again we don't care whether they are nulls ('\0') or not. We just need all stream->gcount() bytes.

Sort uses functor, which uses operator<(const string &, const string &), which uses string::compare(), which again will use the string's length and doesn't care about what data is actually contained in the string. Nulls ('\0') are just fine.

Now if we tried to use v.back().c_str(), well, then we have no length so the nulls would confuse us. But as long as we use the string object (e.g. v.back()) containing both data and length, we're good.

Which brings us to output. And again we are outputting the string, not myString.c_str(), so all the chars in the string are printed. Nulls ('\0') included.

Mr.Ree 2009-02-09 22:33:57

Answer: because they want to have great flexibility in sorting. E.g., they will write their own compare_func() for qsort().

DV 2009-02-09 22:37:21

Thanks for the suggestions, though :) I could use these commands to test if my program works correctly.

DV 2009-02-09 22:38:01

Ahh, you know about -k2,2n and -t, right? Sort IS pretty generic...

Mr.Ree 2009-02-10 00:20:56

Question regarding usage of strings in your STL approach: since there are potentially null characters ('\0') in the input, wouldn't strings get confused because of that? Someone suggested using char[] but I suspect the same problem would arise.

DV 2009-02-10 20:41:07

Answered by amending original post. (Needed more than 300 chars.)

Mr.Ree 2009-02-11 00:05:07

Thanks! The answer couldn't be better!

DV 2009-02-13 00:56:07

You're welcome. ;-)

Mr.Ree 2009-02-13 14:52:00

Answer 6

+1 A:

Thanks to all for great suggestions! Just for the record, here's completed binsort.c that does what is expected.

#include <stdio.h>
#include <stdlib.h>
#include <cstring>
#include <limits.h>

using namespace std;


int num_items = 0;
size_t input_size = 0;
int entry_size = 0;


void print_usage()
{
        printf("%s\n", "Usage: <binary data> | ./binsort [RECORD_LENGTH]");
        printf("%s\n", "For example: printf 'D\\x00C\\x00\\x00B\\x00A' | ./binsort 2 | od -c");
}


int compare (const void * a, const void * b)
{
        return memcmp(a, b, entry_size);
}


int main(int argc, char *argv[])
{
        if (argc != 2 || stdin == NULL)
        {
                print_usage();
                exit(EXIT_FAILURE);
        }

        entry_size = atoi(argv[1]);

        if (entry_size <= 0 || entry_size >= INT_MAX)
        {
                print_usage();
                exit(EXIT_FAILURE);
        }

        char *input = NULL;
        char *datum = (char*) malloc(entry_size);
        if (datum == NULL)
                exit(EXIT_FAILURE);

        while(1)
        {
                int read_size = fread(datum, 1, entry_size, stdin);

                if (read_size == 0)
                        break;

                if(read_size < entry_size)
                {
                        while(read_size < entry_size)
                        {
                                memcpy(datum, '\0', 1);
                                read_size++;
                        }
                        break;
                }

                size_t new_size = (num_items + 1) * entry_size;
                if(new_size > input_size)
                {
                        input = (char*) realloc(input, new_size);
                        if (input == NULL)
                                exit(EXIT_FAILURE);
                        input_size = new_size;
                }
                memcpy(input + num_items * entry_size, datum, entry_size);
                num_items++;
        }

        qsort(input, num_items, entry_size, compare);

        fwrite(input, entry_size, num_items, stdout);

        exit(EXIT_SUCCESS);
}

DV 2009-02-09 22:46:58

I'm sure this will work. But you might look at my STL approach up above. While less efficient in terms of memory use, perhaps it might be easier to follow...

Mr.Ree 2009-02-10 01:53:15

Thanks! Yeah, I'll use STL approach next time :D Saves lots of effort until I learn all the memory management quirks of C.

DV 2009-02-10 20:38:06

Saves lots of time even after you learn all the memory management quirks of C... ;-)

Mr.Ree 2009-02-10 23:26:22

ansaurus

tags:

views:

answers:

Using stdio to read and sort piped data in Linux

Amended to add:

Amended (again) to add:

related questions