views: 195

answers: 6

Question: Is there a more effective way to compare data bytewise than using the comparison operator of the C++ list container?

I have to compare fairly large amounts of data (10 kByte < size < 500 kByte) bytewise, to verify the integrity of external storage devices.

Therefore I read files bytewise and store the values in a list of unsigned chars. The resources of this list are handled by a shared_ptr, so that I can pass it around in the program without the need to worry about the size of the list:

typedef boost::shared_ptr< list< unsigned char > > contentPtr;
namespace fs = boost::filesystem;

contentPtr GetContent( fs::path filePath ){
 contentPtr actualContent( new list< unsigned char > );
 // Read the file with a stream, put the values read into actualContent
 return actualContent;
}

This is done twice, because there are always two copies of the file. The content of these two files has to be compared, and an exception thrown if a mismatch is found:

void CompareContent() throw( NotMatchingException ){
 // this part is very fast, below 50 ms
 contentPtr contentA = GetContent("/fileA");
 contentPtr contentB = GetContent("/fileB");
 // the next part takes about 2 secs with a file size of ~64 kByte
 if( *contentA != *contentB )
      throw( NotMatchingException() );
}

My problem is this:
With increasing file size, the comparison of the lists gets very slow. Working with file sizes of about 100 kByte, it takes up to two seconds to compare the content, and the time grows and shrinks with the file size.

Is there a more effective way of doing this comparison?
Is it a problem with the container I'm using?

+7  A: 

Don't use a std::list, use a std::vector.

std::list is a linked list; its elements are not guaranteed to be stored contiguously.

std::vector, on the other hand, seems far better suited to the specified task (storing contiguous bytes and comparing large chunks of data).

If you have to compare several files multiple times and don't care about where the differences are, you may also compute a hash of each file and compare the hashes. This would be even faster.
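
For illustration, a minimal sketch of what the asker's GetContent could look like with `std::vector` (the preallocation via Boost.Filesystem's `file_size` is an addition of mine, not part of the original code):

#include <cstddef>
#include <fstream>
#include <iterator>
#include <vector>
#include <boost/shared_ptr.hpp>
#include <boost/filesystem.hpp>

namespace fs = boost::filesystem;

typedef boost::shared_ptr< std::vector< unsigned char > > contentPtr;

contentPtr GetContent( fs::path filePath ){
 contentPtr actualContent( new std::vector< unsigned char > );
 // Preallocate once, so reading never triggers a reallocation
 actualContent->reserve( static_cast<std::size_t>( fs::file_size( filePath ) ) );

 std::ifstream file( filePath.string().c_str(), std::ios::binary );
 actualContent->assign( std::istreambuf_iterator<char>( file ),
                        std::istreambuf_iterator<char>() );
 return actualContent;
}

The comparison code from the question stays the same: operator== on two std::vector<unsigned char> compares the sizes first and then the contiguous bytes.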

ereOn
I will give it a try, thanks for the advice
zitroneneis
This. An array is the fastest storage container by miles if you're not modifying the elements in the middle.
DeadMG
changing the implementation to std::vector totally solved the problem of slow comparison. Thank you very much!
zitroneneis
@DeadMG: (Okay, this is pedantic --) but modifying them is fine, it's inserting or removing that is the performance suck :) +1 to answer.
Billy ONeal
@zitroneneis: Glad I could help. I can see that you have only 50% of accept rate. Don't forget to accept answers when they somehow solve your problems ;)
ereOn
@Billy: You're right, I wasn't very clear. You can change the elements all you like, just don't go around re-ordering them.
DeadMG
@DeadMG: reordering is fine (using `swap`); it's inserting / removing (apart from at the end) that isn't.
Matthieu M.
@ereOn: I already wondered how this accept rate was calculated, now I know ;-)
zitroneneis
A wise man might also find out the size of the files ahead of time and use the `.reserve()` method on the vector to preallocate the needed space.
T.E.D.
+1  A: 

If there's nothing else to be done with the contents of those files (it looks like you're going to let them get deleted by shared_ptr at the end of CompareContent()'s scope), why not compare the files using iterators, without creating any containers at all?

Here's a piece of my code that compares two files bytewise:

// compare files
if (equal(std::istreambuf_iterator<char>(local_f),
          std::istreambuf_iterator<char>(),
          std::istreambuf_iterator<char>(host_f)))
{
    // we're good: move table to OutPath, remove other files
}
EDIT: if you do need to keep the contents around, I think std::deque might be slightly more efficient than std::vector, for the reasons explained in GotW #54... or not -- profiling will tell. And still, only one of the two identical files would need to be loaded in memory -- I'd read one into a deque and compare it with the other file's istreambuf_iterator.
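
As a self-contained illustration of this approach (the helper name and the up-front length check are my additions; std::equal with a single end iterator assumes the second file is at least as long as the first):

#include <algorithm>
#include <fstream>
#include <iterator>

// Hypothetical helper: true if both files can be opened and have identical contents.
bool FilesEqual( const char* pathA, const char* pathB ){
 std::ifstream a( pathA, std::ios::binary );
 std::ifstream b( pathB, std::ios::binary );
 if( !a || !b )
      return false; // treat an unreadable file as a mismatch

 // Compare the lengths first; std::equal below reads exactly as many
 // bytes from b as a contains, so b must not be shorter.
 a.seekg( 0, std::ios::end );
 b.seekg( 0, std::ios::end );
 if( a.tellg() != b.tellg() )
      return false;
 a.seekg( 0 );
 b.seekg( 0 );

 return std::equal( std::istreambuf_iterator<char>( a ),
                    std::istreambuf_iterator<char>(),
                    std::istreambuf_iterator<char>( b ) );
}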

Cubbi
Nice solution. However, what if he has to compare the same file to other files (multiple times)? Wouldn't this solution force him to re-read the file every time?
ereOn
Given the data size of the files they could all be held in RAM for the duration. Depends on the total number of files.
Steve Townsend
@ereOn: OP said *there're always two copies of the file*
Cubbi
actually it's only an excerpt from my code. In my program, there's a database of test objects where content patterns are stored for future reference. Thank you anyway!
zitroneneis
@Cubbi: Perhaps I don't understand it well, but doesn't this strengthen my point?
ereOn
@ereOn: Then you can always read the first file into a `std::vector` once and use the same `std::equal` code to compare it to all the files.
sbi
Anyway, this presumes that the files are of the same length. Is that the case?
sbi
@ereOn: I mean, I've read it as "there's a copy 'fileA' and a copy 'fileB', and those two need to be compared bytewise, and there's always just two." Anyway, since they have to be stored in memory anyway, it's moot.
Cubbi
I'd like to point out that that article is exceedingly old. There are new facilities of vector (like vector::shrink_to_fit), and the implementations will have evolved slightly over the last fifteen years (the article references MSVC 5; we're now on MSVC 10).
DeadMG
A: 

std::list is monumentally inefficient for a char element - there is overhead for every element to facilitate O(1) insertion and removal, which is really not what your task requires.

If you must use STL, then either std::vector or the iterator approach suggested would be preferable to std::list, but why not just read the data into a char* wrapped in some smart pointer of your choice and use memcmp?
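
A rough sketch of that suggestion, assuming boost::shared_array for the ownership (the helper names are illustrative, not from the answer):

#include <cstddef>
#include <cstring>
#include <fstream>
#include <utility>
#include <boost/shared_array.hpp>

// A buffer is the raw bytes plus their length.
typedef std::pair< boost::shared_array<char>, std::size_t > Buffer;

Buffer ReadWholeFile( const char* path ){
 std::ifstream in( path, std::ios::binary );
 if( !in )
      return Buffer( boost::shared_array<char>(), 0 ); // error handling kept minimal

 in.seekg( 0, std::ios::end );
 std::size_t size = static_cast<std::size_t>( in.tellg() );
 in.seekg( 0 );

 boost::shared_array<char> data( new char[size] );
 in.read( data.get(), static_cast<std::streamsize>( size ) );
 return Buffer( data, size );
}

bool BuffersEqual( const Buffer& a, const Buffer& b ){
 // memcmp compares raw bytes; the lengths must match first
 return a.second == b.second
     && ( a.second == 0
          || std::memcmp( a.first.get(), b.first.get(), a.second ) == 0 );
}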

Steve Townsend
If he does C++, why would he use C functions and not the STL? `std::vector` is perfectly designed for this task.
ereOn
"why not just read the data into a char* wrapped in some smart pointer of your choice ...?" Because that wouldn't have any advantage over `std::vector`, but would have disadvantages.
sbi
A: 

It is crazy to use anything other than memcmp for the comparison. (Unless you want it even faster, in which case you might want to code it in assembly language.)

TonyK
I'd expect my standard lib's `std::equal()` to use `std::memcmp()` if that's the fastest. I don't expect to have to decide that myself and adapt that code whenever I change the data type or port to a new platform.
sbi
@sbi: But `std::equal()` is only provided with iterators, not references to the container itself. How can it optimise to use `memcmp()`?
Oli Charlesworth
@Oli Charlesworth: Nothing says iterators cannot reference their parent container. More to the point, at least for vectors, you can get the underlying memory block based on the iterators themselves. (`std::equal` can be template specialized for `std::vector::iterator` and friends)
Billy ONeal
@Billy: I was just poking about inside the GNU `vector` implementation, and found no specialisation for `equal`. I understand what you're saying, but it appears that this isn't done in practice...
Oli Charlesworth
@Oli: 1. There are many more STL implementations than GCC. 2. Even if this is not the case, I seriously doubt `memcmp` and `std::equal` according to a naive implementation would differ in performance. It's not like either function is complex.
Billy ONeal
@Oli: There could be a special implementation for iterators from `std::vector` instances of PODs. That's not particularly hard to find out.
sbi
@Billy: 3. Such special code would be found in the algorithms' implementation, not in the containers'. (It would be the same for pointers and for `std::vector<>::iterator`.)
sbi
@sbi: I've looked in `stl_algobase.h` as well, and can't find anything. I'm going to try compiling something and see where it resolves to...
Oli Charlesworth
@Billy: there are indeed other implementations! But GCC's a pretty major one. I would imagine that the optimised assembler for `memcmp` (which uses x86 string loops and long comparisons) will be a fair bit faster than a character-by-character comparison. But I haven't profiled...
Oli Charlesworth
@Oli: I'd still expect my implementation to handle this for me, although I have to admit that it might fail to do so. `:)` But if it does, then the next version might do. And even if the next version fails, too, I'd _still_ not clutter my code with such low-level details (you change the container's element type and suddenly got a non-POD which you shouldn't use `std::memcmp()` on in a completely different part of your code), _until profiling shows_ that a particular call to `std::equal()` is too slow and that performance indeed increases by using `std::memcmp()` instead.
sbi
@Oli: FWIW, I've looked into VC9's standard lib (from Dinkumware) and found that it will call `std::memcmp()` from `std::equal()` for (`signed char*` OR `unsigned char*`) AND random access iterators. It will, however, resort to its own loop for `char*`. Why it does this, I wouldn't know and I'm not interested either. (I already said I'd expect my implementation to take care of this.)
sbi
@sbi: I take your point. However, if profiling shows that this really is the hotspot in the application, I'd say that would justify replacing with a call to `memcmp()` if it brings a significant speedup, no matter what the cost in clarity. So I guess we agree! (FWIW, I ran `objdump` on a trivial example compiled against the GNU library, and it doesn't resolve to `memcmp()`).
Oli Charlesworth
@Oli: Yes, in that we agree. However, this seems a long shot from "It is crazy to use anything other than memcmp for the comparison", to which I objected.
sbi
You're all crazy! memcmp is a perfectly respectable function. It's fast, and it's portable. That's all you need to know, isn't it?
TonyK
@sbi: Agreed on that, too.
Oli Charlesworth
@Tony: I understand where you're coming from, but `std::memcmp()` isn't very C++. It's not type-safe, you need to manually tell it how long the arrays are (in bytes), and it doesn't extend to arbitrary container types.
Oli Charlesworth
There we have it. 'Isn't very C++'. Is that like Republican In Name Only? This isn't a religious issue, it's a speed issue.
TonyK
@Tony: It's not simply "religious". Type-safety, etc. are all Good Things. You shouldn't sacrifice them unless you're certain that `memcmp()` offers you a significant speed boost.
Oli Charlesworth
You're a loony.
TonyK
@Tony: Yes, all C++ best practices are lunacy. To hell with Amdahl's law...
Oli Charlesworth
Amdahl's law. You really are a loony.
TonyK
@Tony: can you explain why?
Oli Charlesworth
Arguments in favour of using memcmp to compare byte arrays: 1. It's fast. 2. It's portable. 3. It's easy to use. Arguments against using memcmp to compare byte arrays: 1. It isn't very C++. 2. It's not type-safe if you're not comparing byte arrays. 3. It flouts Amdahl's law.
TonyK
@Tony: It isn't very C++ *because* it's not type-safe, generic or encapsulated. It doesn't "flout" Amdahl's law; the point is that premature optimisation which leads to sacrificing any of the above is indeed "lunacy".
Oli Charlesworth
Do you work for IBM?
TonyK
@TonyK: Sometimes, when a whole community seems to disagree, it is wise to listen. If you can't see the difference between `C` and `C++`, you probably haven't mastered either of them. Real arguments against `memcmp`: `std::equal` is as fast, as portable and even easier to use. Moreover, it makes your `C++` code consistent and maintainable.
ereOn
OK, I've written a benchmark program so we can evaluate this objectively. It wouldn't fit in a comment, so I've posted another answer. It turns out that on my system, std::equal is a little slower than memcmp, but operator== is just as fast. What happens on your system?
TonyK
+1  A: 

My first piece of advice would be to profile your code.

The reason I say that is that, no matter how slow your comparison code is, I suspect your file I/O time dwarfs it. You don't want to waste days trying to optimize a part of your code that only takes 1% of your runtime as-is.

It could even be that there is something else you didn't notice before that is actually causing the slowness. You won't know until you profile.

T.E.D.
As wholeheartedly as I normally agree with such advice, reading binary blobs into __linked lists of bytes__ will so _very_ likely yield inferior performance that even I wouldn't bother to measure.
sbi
Agreed on the profiling. However, in this case, using a linked list instead of a contiguous container is an algorithmic problem in itself.
ereOn
I'm not so sure what you mean by profiling, but I think I've done a similar thing, and that's the reason I came up with the question. I log every operation with an abstraction level for the method which wrote the log entry. That's how I found out that the very small comparison method is causing the whole program to slow down, because it's an essential part of the program and it's being called very regularly.
zitroneneis
@zitroneneis: You're getting excellent advice here. Also, if you're not sure what is meant by profiling, you should *get* sure. Here's what I do (http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux/378024#378024). If you're a student, you probably haven't heard of this, but every line of code, whether or not it's a function call, is responsible for a certain percent of overall time, such that if it were not there, that time would not be spent. During that time, it is on the stack for you to see.
Mike Dunlavey
@Mike Dunlavey - your method looks a lot simpler and faster than what I've done till now (use a logger with time stamps and different debug levels, scanning the log file with my eyes to find ineffective methods). Will try it on my program. Thank you for the advice
zitroneneis
@zitroneneis: That's a method that's been discovered independently by some people, as in http://stackoverflow.com/questions/2473666/tips-for-optimizing-c-net-programs/2474118#2474118 and http://stackoverflow.com/questions/266373/one-could-use-a-profiler-but-why-not-just-halt-the-program/317160#317160. I've done my best to spread the word: http://stackoverflow.com/questions/2624667/whats-a-very-easy-c-profiler-vc/2624725#2624725 and http://stackoverflow.com/questions/1777556/alternatives-to-gprof/1779343#1779343
Mike Dunlavey
@Mike Dunlavey - Oddly, my objection to that method is exactly the opposite of the one you dismissed: Often I find that where the debugger stops things is *not* random. There are certain things (e.g. system calls) that you will never interrupt with the debugger. Still, there will be times where it would point you straight at the problem. Way better than guessing.
T.E.D.
@T.E.D.: Yes, if you pause it during a system call, it should halt when that call returns, so that's just as good because it tells where in *your* code (which is all you can edit) it was at that time. What used to frost me was when it would delay the pause until the next windows message. Man, that was maddening!
Mike Dunlavey
@Mike Dunlavey - Yeah, the old SunOS debugger used to do the equivalent thing to me. I guess that's where I got it in my head that I couldn't use that as a profiling technique. I'd have to really trust a debugger before I tried it again. For profiling, if I can't totally trust my results, it is essentially worthless to me.
T.E.D.
@zitroneneis, @T.E.D.: Just yesterday I got a 24% speedup in a simulation program, in two fixes. The first was because of some recent changes I had made, which I didn't think would be a performance hit. The second was in a general routine for operating on matrices in LAPACK that was spending too much time classifying its arguments. It took about 10 stackshots to find these.
Mike Dunlavey
A: 

In the interest of objectivity in the memcmp-vs-equal debate, I offer the following benchmark program, so that you can see for yourselves which, if any, is faster on your system. It also tests operator==. On my system (Borland C++ 5.5.1 for Win32):

std::equal: 1375 clock ticks
operator==: 1297 clock ticks
memcmp: 1297 clock ticks

What happens on your system?

#include <algorithm>
#include <vector>
#include <iostream>
#include <cstring>   // for memcmp
#include <cstdlib>   // for exit
#include <ctime>     // for clock, clock_t

using namespace std;

static int const BufferSize = 100000 ;

static clock_t StartTimer() ;
static clock_t EndTimer (clock_t t) ;

int main (int argc, char** argv)
  {
  // Allocate a buffer (value-initialised, so the compared bytes are defined)
  char* buff = new char[BufferSize]() ;

  // Create two identical vectors from the buffer
  vector<char> v0 (buff, buff + BufferSize) ;
  vector<char> v1 (buff, buff + BufferSize) ;

  clock_t t ;

  // Compare them 10000 times using std::equal
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (!equal (v0.begin(), v0.end(), v1.begin()))
      cout << "Error in std::equal\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "std::equal: " << t << " clock ticks\n" ;

  // Compare them 10000 times using operator==
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (v0 != v1)
      cout << "Error in operator==\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "operator==: " << t << " clock ticks\n" ;

  // Compare them 10000 times using memcmp
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (memcmp (&v0[0], &v1[0], v0.size()))
      cout << "Error in memcmp\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "memcmp: " << t << " clock ticks\n" ;

  delete[] buff ;
  return 0 ;
  }

static clock_t StartTimer()
  {
  // Start on a clock tick, to enhance reproducibility
  clock_t t = clock() ;
  while (clock() == t)
    ;
  return clock() ;
  }

static clock_t EndTimer (clock_t t)
  {
  return clock() - t ;
  }
TonyK
@TonyK: your benchmark gives me: std::equal: 437 clock ticks, operator==: 453 clock ticks, memcmp: 500 clock ticks (using gcc 4.4.0 with mingw). Very interesting results. Glad I went with operator== and didn't hassle with memcmp
zitroneneis
FYI: In the GNU library at least, `std::vector::operator==` essentially wraps a call to `std::equal`. If I run under GCC using `-O2`, the loop around `memcmp` is optimised away, which makes it difficult to compare results...
Oli Charlesworth
@zitroneneis: That is indeed interesting! I've learned something from this! And you have your answer. (I tried debugging the program, to see what the Borland compiler was doing -- but in the debug version, the C++ variants take four times longer, even with -O2. memcmp timing is unchanged.)
TonyK
@TonyK: BCB might not inline in Debugging mode.
sbi