Hi all, here's a problem I've solved from a programming problem website (codechef.com, in case anyone would rather try it themselves before seeing this solution). My solution ran in about 5.43 seconds on the test data; others have solved the same problem with the same test data in 0.14 seconds, but with much more complex code. Can anyone point out specific areas of my code where I'm losing performance? I'm still learning C++, so I know there are a million ways I could solve this problem, but I'd like to know whether I can improve my own solution with some subtle changes rather than rewriting the whole thing. If there are any relatively simple solutions of comparable length that would perform better than mine, I'd be interested to see them too.

Please keep in mind I'm learning C++ so my goal here is to improve the code I understand, not just to be given a perfect solution.

Thanks

Problem:

The purpose of this problem is to verify whether the method you are using to read input data is sufficiently fast to handle problems branded with the enormous Input/Output warning. You are expected to be able to process at least 2.5MB of input data per second at runtime. Time limit to process the test data is 8 seconds.

The input begins with two positive integers n k (n, k <= 10^7). The next n lines of input each contain one positive integer ti, not greater than 10^9.

Output:

Write a single integer to output, denoting how many integers ti are divisible by k.

Example:

Input:

7 3
1
51
966369
7
9
999996
11

Output:

4

Solution:

#include <iostream>
#include <stdio.h>
using namespace std;

int main(){
  //n is number of integers to perform calculation on
  //k is the divisor
  //inputnum is the number to be divided by k
  //total is the total number of inputnums divisible by k

  int n,k,inputnum,total;

  //initialize total to zero
  total=0;

  //read in n and k from stdin
  scanf("%i%i",&n,&k);

  //loop n times and if k divides into n, increment total
  for (n; n>0; n--)
  {
    scanf("%i",&inputnum);
    if(inputnum % k==0) total += 1;
  }

 //output value of total
 printf("%i",total);
 return 0;
}
A: 

Dividing two large numbers is hard. Perhaps an improvement would be to first characterize k a little by looking at some of the smaller primes, say 2, 3, and 5 for now. If k is divisible by any of these, then inputnum also needs to be, or inputnum is not divisible by k. Of course there are more tricks to play (you could use a bitwise AND of inputnum with 1 to determine whether it is divisible by 2), but I think just removing the low-prime possibilities will give a reasonable speed improvement (worth a shot anyway).
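Here's a rough, untested sketch of what I mean, grafted onto your original loop (just to illustrate the pre-check idea, not a drop-in solution):

#include <stdio.h>

int main() {
  int n, k, inputnum, total = 0;
  scanf("%i%i", &n, &k);

  // Characterize k once, outside the loop.
  int k_even = (k % 2 == 0);
  int k_div3 = (k % 3 == 0);

  for (; n > 0; n--) {
    scanf("%i", &inputnum);
    // Cheap rejections first: if k has a small prime factor that
    // inputnum lacks, inputnum cannot be divisible by k.
    if (k_even && (inputnum & 1)) continue;      // bitwise AND test for divisibility by 2
    if (k_div3 && (inputnum % 3 != 0)) continue;
    if (inputnum % k == 0) total += 1;
  }

  printf("%i", total);
  return 0;
}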

RickNotFred
I'm not familiar with C++ I/O, but what about reading the whole input into a buffer first? Maybe calling scanf() produces a lot of unneeded overhead.
tur1ng
I just tested this with a little example; it didn't make any difference. This is dominated by the I/O.
ahans
"Dividing two large numbers is hard". That's why there's a bunch of CPU area dedicated to the task of doing division of 32 bit numbers. This should not be the limiting factor.
Andrew Dalke
+2  A: 

Hi,

Try replacing the if statement with total += ((inputnum % k) == 0);. That might help a little bit.

But I think you really need to buffer your input into a temporary array. Reading one integer from input at a time is expensive. If you can separate data acquisition from data processing, the compiler may be able to generate optimized code for the mathematical operations.
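For example, a rough sketch of splitting the reading and the processing into two passes (untested, just to show the shape of the idea):

#include <stdio.h>
#include <vector>

int main() {
  int n, k, total = 0;
  scanf("%i%i", &n, &k);

  // Pass 1: data acquisition only.
  std::vector<int> numbers(n);
  for (int i = 0; i < n; i++)
    scanf("%i", &numbers[i]);

  // Pass 2: data processing only -- a tight loop the compiler can optimize.
  for (int i = 0; i < n; i++)
    total += (numbers[i] % k == 0);

  printf("%i", total);
  return 0;
}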

aaa
Tried this. It saved me 0.04 of a second. A small improvement, thanks.
Griffo
+2  A: 

The I/O operations are the bottleneck. Try to limit them whenever you can; for instance, load all the data into a buffer or array with a buffered stream in one step.

That said, your example is so simple that I hardly see what you can eliminate, assuming it's part of the exercise to keep reading from stdin.

A few comments on the code: your example doesn't make use of any streams, so there is no need to include the iostream header. You already load the C library elements into the global namespace by including stdio.h instead of the C++ version of the header, cstdio, so using namespace std is not necessary either.
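For example, here is a rough sketch of reading the whole of stdin into one buffer and parsing from memory (illustrative only, not tuned or tested against the judge's data):

#include <stdio.h>
#include <stdlib.h>

int main() {
  // Slurp all of stdin into one big in-memory buffer, growing it as needed,
  // so the per-line scanf calls disappear entirely.
  size_t cap = 1 << 20, len = 0, got;
  char *buf = (char *)malloc(cap);
  while ((got = fread(buf + len, 1, cap - len, stdin)) > 0) {
    len += got;
    if (len == cap) buf = (char *)realloc(buf, cap *= 2);
  }
  buf = (char *)realloc(buf, len + 1);
  buf[len] = '\0';

  // Parse n, k and then the n values straight out of memory.
  char *p = buf;
  long n = strtol(p, &p, 10);
  long k = strtol(p, &p, 10);
  long total = 0;
  while (n-- > 0)
    total += (strtol(p, &p, 10) % k == 0);

  printf("%ld\n", total);
  free(buf);
  return 0;
}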

mloskot
+1  A: 

You could try reading the input line by line and using atoi() on each input row. This should be a little bit faster than scanf, because you remove the "scan" overhead of the format string.
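Something like this, for example (a quick, untested sketch):

#include <stdio.h>
#include <stdlib.h>

int main() {
  char line[64];
  int n, k, total = 0;

  // First line: n and k.
  fgets(line, sizeof line, stdin);
  sscanf(line, "%d %d", &n, &k);

  // Remaining lines: one integer each, converted with atoi() instead of scanf().
  while (n-- > 0 && fgets(line, sizeof line, stdin))
    total += (atoi(line) % k == 0);

  printf("%d\n", total);
  return 0;
}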

frunsi
This improvement cut the time in half for me, but it's still not a big deal. I think the code is all right; there's probably something else wrong.
ahans
"cut the time in half" **is a big deal** for such a small change in code. The other "wrong" thing was mentioned by others: buffering should help, either automatically by the clib (setvbuf), or by reading big chunks and parsing in-memory.
frunsi
+1  A: 

I think the code is fine. I ran it on my computer in less than 0.3s, and I even ran it on much larger inputs in less than a second.

How are you timing it?

One small thing you could do is remove the if statement: start with total = n and then, inside the loop,

total -= (input % k + k - 1) / k;  // 0 if divisible, 1 if not

DanJ
What input do you use? How many lines?
tur1ng
I don't have the test data, but as the problem statement above says, the input can be up to 10^7 lines long.
Griffo
+10  A: 

The speed is not determined by the computation; most of the time the program takes to run is consumed by I/O.

Add setvbuf calls before the first scanf for a significant improvement:

setvbuf(stdin, NULL, _IOFBF, 32768);
setvbuf(stdout, NULL, _IOFBF, 32768);

-- edit --

The alleged magic numbers are the new buffer size. By default, FILE uses a buffer of 512 bytes. Increasing this size decreases the number of times the C++ runtime library has to issue a read or write call to the operating system, which is by far the most expensive operation in your algorithm.

Keeping the buffer size a multiple of 512 avoids buffer fragmentation. Whether the size should be 1024*10 or 1024*1024 depends on the system it is intended to run on. For 16-bit systems, a buffer size larger than 32K or 64K generally causes difficulty in allocating the buffer, and maybe in managing it. For any larger system, make it as large as is useful, depending on available memory and what else it will be competing against.

Lacking any known memory contention, choose sizes for the buffers at about the size of the associated files. That is, if the input file is 250K, use that as the buffer size. There is definitely a diminishing return as the buffer size increases. For the 250K example, a 100K buffer would require three reads, while the default 512-byte buffer requires 500 reads. Further increasing the buffer size so that only one read is needed is unlikely to make a significant performance improvement over three reads.
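In context, the change to the original program is just this (a sketch; everything else stays the same):

#include <stdio.h>

int main() {
  // Give stdin and stdout bigger stdio buffers before the first I/O call.
  setvbuf(stdin, NULL, _IOFBF, 32768);
  setvbuf(stdout, NULL, _IOFBF, 32768);

  int n, k, inputnum, total = 0;
  scanf("%i%i", &n, &k);
  for (; n > 0; n--) {
    scanf("%i", &inputnum);
    if (inputnum % k == 0) total += 1;
  }
  printf("%i", total);
  return 0;
}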

wallyk
Here are some magic numbers!
Swiss
OK, thanks. I figured it was the computation between each I/O step that would cause the bottleneck. Thanks for your suggestions; would you mind explaining what the change does? I can guess that it buffers the I/O, but how?
Griffo
They're not magic. It's kind of like 42.
Richard Pennington
Doesn't make any difference here (using gcc -O2).
ahans
@ahans It's because you don't believe in magic.
Swiss
I have a 130 MB test file, and setting this buffer to either value doesn't make any difference. The file is cached by the OS; it seems system calls like this aren't that expensive after all (on Linux 2.6.30).
ahans
I ran the program with the test data and these changes, and saw no performance increase whatsoever. Maybe they don't make a difference in this instance because of the way I'm using scanf, where each call only needs a small buffer.
Griffo
For better or worse, the filesystem caching is interfering with what would be the typical condition: an uncached file. I created a data set with one million lines and got these results with the OP's original algorithm: real 0m0.264s, user 0m0.253s, sys 0m0.007s. Adding the 32K buffers makes a minuscule improvement: real 0m0.258s, user 0m0.254s, sys 0m0.005s. I'll try it again when the file stops being cached.
wallyk
+2  A: 

You can read each line with gets(), and parse the strings yourself without scanf(). (Normally I wouldn't recommend gets(), but in this case, the input is well-specified.)

A sample C program to solve this problem:

#include <stdio.h>
int main() {
   int n,k,in,tot=0,i;
   char s[1024];
   gets(s);
   sscanf(s,"%d %d",&n,&k);
   while(n--) {
      gets(s);
      in=s[0]-'0';
      for(i=1; s[i]!=0; i++) {
        in=in*10 + s[i]-'0';   /* For each digit read, multiply the previous 
                                  value of in with 10 and add the current digit */
      }
      tot += in%k==0;          /* returns 1 if in%k is 0, 0 otherwise */
   }
   printf("%d\n",tot);
   return 0;
}

This program is approximately 2.6 times faster than the solution you gave above (on my machine).

stubbscroll
I disagree. One would be hard pressed to measure any performance improvement in doing it this way.
wallyk
This ran in 2.20 seconds with the test data, definitely an improvement on my solution. Could you add comments just to make it clearer what your code does at each step within the for loop? Thanks.
Griffo
@wallyk Have you tried it? It's the fastest solution I've seen so far here. And the improvement is measurable. Though I wouldn't recommend doing it that way, since you sacrifice a lot of readability for a little performance gain here.
ahans
@Griffo, I added a few comments, and made the loop slightly easier to read. s[i]-'0' converts the character value to a numeric value.
stubbscroll
+4  A: 

I tested the following on 28311552 lines of input. It's 10 times faster than your code. What it does is read a large block at once, then read on to the next newline so the block ends on a line boundary. The goal here is to reduce I/O costs, since scanf() is reading a character at a time; even with stdio, the buffer is likely too small.

Once a block is ready, I parse the numbers directly in memory.

This isn't the most elegant code, and I might have some edge cases a bit off, but it should be enough to get you going with a faster approach.

Here are the timings (without the optimizer, my solution is only about 6-7 times faster than your original):

[xavier:~/tmp] dalke% g++ -O3 my_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
0.284u 0.057s 0:00.39 84.6% 0+0k 0+1io 0pf+0w
[xavier:~/tmp] dalke% g++ -O3 your_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
3.585u 0.087s 0:03.72 98.3% 0+0k 0+0io 0pf+0w

Here's the code.

#include <iostream>
#include <stdio.h>
using namespace std;

const int BUFFER_SIZE=400000;
const int EXTRA=30;  // well over the size of an integer 

void read_to_newline(char *buffer) {
  int c;
  while (1) {
    c = getc_unlocked(stdin);
    if (c == '\n' || c == EOF) {
      *buffer = '\0';
      return;
    }
    *buffer++ = c;
  }
} 

int main() {
  char buffer[BUFFER_SIZE+EXTRA];
  char *end_buffer;
  char *startptr, *endptr;

  //n is number of integers to perform calculation on
  //k is the divisor
  //inputnum is the number to be divided by k
  //total is the total number of inputnums divisible by k

  int n,k,inputnum,total,nbytes;

  //initialize total to zero
  total=0;

  //read in n and k from stdin
  read_to_newline(buffer);
  sscanf(buffer, "%i%i",&n,&k);

  while (1) {
    // Read a large block of values
    // There should be one integer per line, with nothing else.
    // This might truncate an integer!
    nbytes = fread(buffer, 1, BUFFER_SIZE, stdin);
    if (nbytes == 0) {
      cerr << "Reached end of file too early" << endl;
      break;
    }
    // Make sure I read to the next newline.
    read_to_newline(buffer+nbytes);

    startptr = buffer;
    while (n>0) {
      inputnum = 0;
      // I had used strtol but that was too slow
      //   inputnum = strtol(startptr, &endptr, 10);
      // Instead, parse the integers myself.
      endptr = startptr;
      while (*endptr >= '0') {
        inputnum = inputnum * 10 + *endptr - '0';
        endptr++;
      }
      // *endptr might be a '\n' or '\0'

      // Might occur with the last field
      if (startptr == endptr) {
        break;
      }
      // skip the newline; go to the
      // first digit of the next number.
      if (*endptr == '\n') {
        endptr++;
      }
      // Test whether inputnum is divisible by k
      if (inputnum % k==0) total += 1;

      // Advance to the next number
      startptr = endptr;

      // Reduce the count by one
      n--;
    }
    // Either we are done, or we need new data
    if (n==0) {
      break;
    }
  }

 // output value of total
 printf("%i\n",total);
 return 0;
}

Oh, and it very much assumes the input data is in the right format.

Andrew Dalke
Thanks very much @dalke. It's certainly the fastest solution yet; it ran with the test data in 0.57 seconds. It's a good bit more involved than my code, but it's the best answer I've got so far and it's proven to work. Well done! Now I just need to see if I can follow it! ;o)
Griffo
Could you add comments to your function? It would just make it clearer to others who come across your solution. Thanks.
Griffo
Feel free to add comments.
Andrew Dalke
+1  A: 

Though I doubt CodeChef will accept it, one possibility is to use multiple threads: one to handle the I/O, and another to process the data. This is especially effective on a multi-core processor, but it can help even with a single core. For example, on Windows you could use code like this (no real attempt at conforming with CodeChef requirements -- I doubt they'll accept it with the timing data in the output):

#include <windows.h>
#include <process.h>
#include <iostream>
#include <time.h>
#include "queue.hpp"

namespace jvc = JVC_thread_queue;

struct buffer { 
    static const int initial_size = 1024 * 1024;
    char buf[initial_size];
    size_t size;

    buffer() : size(initial_size) {}
};

jvc::queue<buffer *> outputs;

void read(HANDLE file) {
    // read data from specified file, put into buffers for processing.
    //
    char temp[32];
    int temp_len = 0;
    int i;

    buffer *b;
    DWORD read;

    do { 
        b = new buffer;

        // If we have a partial line from the previous buffer, copy it into this one.
        if (temp_len != 0)
            memcpy(b->buf, temp, temp_len);

        // Then fill the buffer with data.
        ReadFile(file, b->buf+temp_len, b->size-temp_len, &read, NULL);

        // Look for partial line at end of buffer.
        for (i=read; b->buf[i] != '\n'; --i)
            ;

        // copy partial line to holding area.
        memcpy(temp, b->buf+i, temp_len=read-i);

        // adjust size.
        b->size = i;

        // put buffer into queue for processing thread.
        // transfers ownership.
        outputs.add(b);
    } while (read != 0);
}

// A simplified istrstream that can only read int's.
class num_reader { 
    buffer &b;
    char *pos;
    char *end;
public:
    num_reader(buffer *buf) : b(*buf), pos(b.buf), end(pos+b.size) {}

    num_reader &operator>>(int &value){ 
        int v = 0;

        // skip leading "stuff" up to the first digit.
        while ((pos < end) && !isdigit(*pos))
            ++pos;

        // read digits, create value from them.
        while ((pos < end) && isdigit(*pos)) {
            v = 10 * v + *pos-'0';
            ++pos;
        }
        value = v;
        return *this;
    }

    // return stream status -- only whether we're at end
    operator bool() { return pos < end; }
};

int result;

unsigned __stdcall processing_thread(void *) {
    int value;
    int n, k;
    int count = 0;

    // Read first buffer: n & k followed by values.
    buffer *b = outputs.pop();
    num_reader input(b);
    input >> n;
    input >> k;
    while (input >> value && ++count < n) 
        result += ((value %k ) == 0);

    // Ownership was transferred -- delete buffer when finished.
    delete b;

    // Then read subsequent buffers:
    while ((b=outputs.pop()) && (b->size != 0)) {
        num_reader input(b);
        while (input >> value && ++count < n)
            result += ((value %k) == 0);

        // Ownership was transferred -- delete buffer when finished.
        delete b;
    }
    return 0;
}

int main() { 
    HANDLE standard_input = GetStdHandle(STD_INPUT_HANDLE);
    HANDLE processor = (HANDLE)_beginthreadex(NULL, 0, processing_thread, NULL, 0, NULL);

    clock_t start = clock();
    read(standard_input);
    WaitForSingleObject(processor, INFINITE);
    clock_t finish = clock();

    std::cout << (float)(finish-start)/CLOCKS_PER_SEC << " Seconds.\n";
    std::cout << result;
    return 0;
}

This uses a thread-safe queue class I wrote years ago:

#ifndef QUEUE_H_INCLUDED
#define QUEUE_H_INCLUDED

namespace JVC_thread_queue { 
template<class T, unsigned max = 256>
class queue { 
    HANDLE space_avail; // at least one slot empty
    HANDLE data_avail;  // at least one slot full
    CRITICAL_SECTION mutex; // protect buffer, in_pos, out_pos

    T buffer[max];
    long in_pos, out_pos;
public:
    queue() : in_pos(0), out_pos(0) { 
        space_avail = CreateSemaphore(NULL, max, max, NULL);
        data_avail = CreateSemaphore(NULL, 0, max, NULL);
        InitializeCriticalSection(&mutex);
    }

    void add(T data) { 
        WaitForSingleObject(space_avail, INFINITE);       
        EnterCriticalSection(&mutex);
        buffer[in_pos] = data;
        in_pos = (in_pos + 1) % max;
        LeaveCriticalSection(&mutex);
        ReleaseSemaphore(data_avail, 1, NULL);
    }

    T pop() { 
        WaitForSingleObject(data_avail,INFINITE);
        EnterCriticalSection(&mutex);
        T retval = buffer[out_pos];
        out_pos = (out_pos + 1) % max;
        LeaveCriticalSection(&mutex);
        ReleaseSemaphore(space_avail, 1, NULL);
        return retval;
    }

    ~queue() { 
        DeleteCriticalSection(&mutex);
        CloseHandle(data_avail);
        CloseHandle(space_avail);
    }
};
}

#endif

Exactly how much you gain from this depends on the amount of time spent reading versus the amount of time spent on other processing. In this case, the other processing is sufficiently trivial that it probably doesn't gain much. If more time was spent on processing the data, multi-threading would probably gain more.

Jerry Coffin
A: 

2.5 MB/sec is 400 ns/byte.

There are two big per-byte costs: file input and parsing.

For the file input, I would just load it into a big memory buffer. fread should be able to read that in at roughly full disc bandwidth.

For the parsing, sscanf is built for generality, not speed. atoi should be pretty fast. My habit, for better or worse, is to do it myself, as in:

#define DIGIT(c)((c)>='0' && (c) <= '9')
bool parsInt(char* &p, int& num){
  while(*p && *p <= ' ') p++; // scan over whitespace
  if (!DIGIT(*p)) return false;
  num = 0;
  while(DIGIT(*p)){
    num = num * 10 + (*p++ - '0');
  }
  return true;
}

The loops, first over leading whitespace, then over the digits, should be nearly as fast as the machine can go, certainly taking a lot less than 400 ns/byte.
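Putting the two together, a rough sketch of the whole program might look like this (untested; the buffer grows as needed, and parsInt is repeated from above so the snippet stands alone):

#include <stdio.h>
#include <stdlib.h>

#define DIGIT(c)((c)>='0' && (c) <= '9')
bool parsInt(char* &p, int& num){
  while(*p && *p <= ' ') p++; // scan over whitespace
  if (!DIGIT(*p)) return false;
  num = 0;
  while(DIGIT(*p)){
    num = num * 10 + (*p++ - '0');
  }
  return true;
}

int main(){
  // Load the whole input into one big memory buffer with as few reads as possible.
  size_t cap = 1 << 20, len = 0, got;
  char *buf = (char*)malloc(cap);
  while ((got = fread(buf + len, 1, cap - len - 1, stdin)) > 0){
    len += got;
    if (len + 1 == cap) buf = (char*)realloc(buf, cap *= 2);
  }
  buf[len] = '\0';

  // Then parse everything in memory with the hand-rolled parser.
  char *p = buf;
  int n, k, value, total = 0;
  parsInt(p, n);
  parsInt(p, k);
  while (n-- > 0 && parsInt(p, value))
    total += (value % k == 0);

  printf("%d\n", total);
  free(buf);
  return 0;
}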

Mike Dunlavey