views: 195

answers: 6

Question: Is there a more effective way to compare data bytewise than using the comparison operator of the C++ list container?

I have to compare fairly large amounts of data (10 kByte < size < 500 kByte) bytewise, to verify the integrity of external storage devices.

Therefore I read files bytewise and store the values in a list of unsigned chars. The resources of this list are handled by a shared_ptr, so that I can pass it around in the program without the need to worry about the size of the list:

typedef boost::shared_ptr< list< unsigned char > > contentPtr;
namespace fs = boost::filesystem;

contentPtr GetContent( fs::path filePath ){
 contentPtr actualContent( new list< unsigned char > );
 // Read the file with a stream, put the values read into actualContent
 return actualContent;
}

This is done twice, because there are always two copies of the file. The content of these two files has to be compared, and an exception thrown if a mismatch is found:

void CompareContent() throw( NotMatchingException ){
 // this part is very fast, below 50 ms
 contentPtr contentA = GetContent("/fileA");
 contentPtr contentB = GetContent("/fileB");
 // the next part takes about 2 secs with a file size of ~64 kByte
 if( *contentA != *contentB )
      throw( NotMatchingException() );
}

My problem is this:
With increasing file size, the comparison of the lists gets very slow. Working with file sizes of about 100 kByte, it takes up to two seconds to compare the content, and the time grows and shrinks with the file size.

Is there a more effective way of doing this comparison?
Is it a problem with the container I'm using?

+7  A: 

Don't use a std::list, use a std::vector.

std::list is a linked list; its elements are not guaranteed to be stored contiguously.

std::vector, on the other hand, seems far better suited to the specified task (storing contiguous bytes and comparing large chunks of data).

If you have to compare several files multiple times and don't care about where the differences are, you may also compute a hash of each file and compare the hashes. This would be even faster.
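
For illustration, a minimal sketch of what the asker's GetContent could look like with `std::vector` (the preallocation via Boost.Filesystem's `file_size` is an addition of mine, not part of the original code):

#include <cstddef>
#include <fstream>
#include <iterator>
#include <vector>
#include <boost/shared_ptr.hpp>
#include <boost/filesystem.hpp>

namespace fs = boost::filesystem;

typedef boost::shared_ptr< std::vector< unsigned char > > contentPtr;

contentPtr GetContent( fs::path filePath ){
 contentPtr actualContent( new std::vector< unsigned char > );
 // Preallocate once, so reading never triggers a reallocation
 actualContent->reserve( static_cast<std::size_t>( fs::file_size( filePath ) ) );

 std::ifstream file( filePath.string().c_str(), std::ios::binary );
 actualContent->assign( std::istreambuf_iterator<char>( file ),
                        std::istreambuf_iterator<char>() );
 return actualContent;
}

The comparison code from the question stays the same: operator== on two std::vector<unsigned char> compares the sizes first and then the contiguous bytes.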

ereOn
I will give it a try, thanks for the advice
zitroneneis
This. An array is the fastest storage container by miles if you're not modifying the elements in the middle.
DeadMG
changing the implementation to std::vector totally solved the problem of slow comparison. Thank you very much!
zitroneneis
@DeadMG: (Okay, this is pedantic --) but modifying them is fine, it's inserting or removing that is the performance suck :) +1 to answer.
Billy ONeal
@zitroneneis: Glad I could help. I can see that you have only 50% of accept rate. Don't forget to accept answers when they somehow solve your problems ;)
ereOn
@Billy: You're right, I wasn't very clear. You can change the elements all you like, just don't go around re-ordering them.
DeadMG
@DeadMG: reordering is fine (using `swap`); it's inserting / removing (apart from at the end) that isn't.
Matthieu M.
@ereOn: I already wondered how this accept rate was calculated, now I know ;-)
zitroneneis
A wise man might also find out the size of the files ahead of time and use the `.reserve()` method on the vector to preallocate the needed space.
T.E.D.
+1  A: 

If there's nothing else to be done with the contents of those files (it looks like you're going to let them get deleted by shared_ptr at the end of CompareContent()'s scope), why not compare the files using iterators, without creating any containers at all?

Here's a piece of my code that compares two files bytewise:

// compare files
if (equal(std::istreambuf_iterator<char>(local_f),
          std::istreambuf_iterator<char>(),
          std::istreambuf_iterator<char>(host_f)))
{
    // we're good: move table to OutPath, remove other files
}
EDIT: if you do need to keep the contents around, I think std::deque might be slightly more efficient than std::vector, for the reasons explained in GotW #54... or not -- profiling will tell. And still, only one of the two identical files would need to be loaded in memory -- I'd read one into a deque and compare it with the other file's istreambuf_iterator.
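
As a self-contained illustration of this approach (the helper name and the up-front length check are my additions; std::equal with a single end iterator assumes the second file is at least as long as the first):

#include <algorithm>
#include <fstream>
#include <iterator>

// Hypothetical helper: true if both files can be opened and have identical contents.
bool FilesEqual( const char* pathA, const char* pathB ){
 std::ifstream a( pathA, std::ios::binary );
 std::ifstream b( pathB, std::ios::binary );
 if( !a || !b )
      return false; // treat an unreadable file as a mismatch

 // Compare the lengths first; std::equal below reads exactly as many
 // bytes from b as a contains, so b must not be shorter.
 a.seekg( 0, std::ios::end );
 b.seekg( 0, std::ios::end );
 if( a.tellg() != b.tellg() )
      return false;
 a.seekg( 0 );
 b.seekg( 0 );

 return std::equal( std::istreambuf_iterator<char>( a ),
                    std::istreambuf_iterator<char>(),
                    std::istreambuf_iterator<char>( b ) );
}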

Cubbi
Nice solution. However, what if he has to compare the same file to other files (multiple times)? Wouldn't this solution force him to re-read the file every time?
ereOn
Given the data size of the files they could all be held in RAM for the duration. Depends on the total number of files.
Steve Townsend
@ereOn: OP said *there're always two copies of the file*
Cubbi
actually it's only an excerpt from my code. In my program, there's a database of test objects where content patterns are stored for future reference. Thank you anyway!
zitroneneis
@Cubbi: Perhaps I don't understand it well, but doesn't this strengthen my point?
ereOn
@ereOn: Then you can always read the first file into a `std::vector` once and use the same `std::equal` code to compare it to all the files.
sbi
Anyway, this presumes that the files are of the same length. Is that the case?
sbi
@ereOn: I mean, I've read it as "there's a copy 'fileA' and a copy 'fileB', and those two need to be compared bytewise, and there's always just two." Anyway, since they have to be stored in memory anyway, it's moot.
Cubbi
I'd like to point out that that article is exceedingly old. There are new facilities of vector (like vector::shrink_to_fit), and the implementations will have evolved slightly over the last fifteen years (the article references MSVC 5; we're now on MSVC 10).
DeadMG
A: 

std::list is monumentally inefficient for a char element - there is overhead for every element to facilitate O(1) insertion and removal, which is really not what your task requires.

If you must use STL, then either std::vector or the iterator approach suggested would be preferable to std::list, but why not just read the data into a char* wrapped in some smart pointer of your choice and use memcmp?
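
A rough sketch of that suggestion, assuming boost::shared_array for the ownership (the helper names are illustrative, not from the answer):

#include <cstddef>
#include <cstring>
#include <fstream>
#include <utility>
#include <boost/shared_array.hpp>

// A buffer is the raw bytes plus their length.
typedef std::pair< boost::shared_array<char>, std::size_t > Buffer;

Buffer ReadWholeFile( const char* path ){
 std::ifstream in( path, std::ios::binary );
 if( !in )
      return Buffer( boost::shared_array<char>(), 0 ); // error handling kept minimal

 in.seekg( 0, std::ios::end );
 std::size_t size = static_cast<std::size_t>( in.tellg() );
 in.seekg( 0 );

 boost::shared_array<char> data( new char[size] );
 in.read( data.get(), static_cast<std::streamsize>( size ) );
 return Buffer( data, size );
}

bool BuffersEqual( const Buffer& a, const Buffer& b ){
 // memcmp compares raw bytes; the lengths must match first
 return a.second == b.second
     && ( a.second == 0
          || std::memcmp( a.first.get(), b.first.get(), a.second ) == 0 );
}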

Steve Townsend
If he does C++, why would he use C functions and not the STL? `std::vector` is perfectly designed for this task.
ereOn
"why not just read the data into a char* wrapped in some smart pointer of your choice ...?" Because that wouldn't have any advantage over `std::vector`, but would have disadvantages.
sbi
A: 

It is crazy to use anything other than memcmp for the comparison. (Unless you want it even faster, in which case you might want to code it in assembly language.)

TonyK
I'd expect my standard lib's `std::equal()` to use `std::memcmp()` if that's the fastest. I don't expect to have to decide that myself and adapt that code whenever I change the data type or port to a new platform.
sbi
@sbi: But `std::equal()` is only provided with iterators, not references to the container itself. How can it optimise to use `memcmp()`?
Oli Charlesworth
@Oli Charlesworth: Nothing says iterators cannot reference their parent container. More to the point, at least for vectors, you can get the underlying memory block based on the iterators themselves. (`std::equal` can be template specialized for `std::vector::iterator` and friends)
Billy ONeal
@Billy: I was just poking about inside the GNU `vector` implementation, and found no specialisation for `equal`. I understand what you're saying, but it appears that this isn't done in practice...
Oli Charlesworth
@Oli: 1. There are many more STL implementations than GCC. 2. Even if this is not the case, I seriously doubt `memcmp` and `std::equal` according to a naive implementation would differ in performance. It's not like either function is complex.
Billy ONeal
@Oli: There could be a special implementation for iterators from `std::vector` instances of PODs. That's not particularly hard to find out.
sbi
@Billy: 3. Such special code would be found in the algorithms' implementation, not in the containers'. (It would be the same for pointers and for `std::vector<>::iterator`.)
sbi
@sbi: I've looked in `stl_algobase.h` as well, and can't find anything. I'm going to try compiling something and see where it resolves to...
Oli Charlesworth
@Billy: there are indeed other implementations! But GCC's a pretty major one. I would imagine that the optimised assembler for `memcmp` (which uses x86 string loops and long comparisons) will be a fair bit faster than a character-by-character comparison. But I haven't profiled...
Oli Charlesworth
@Oli: I'd still expect my implementation to handle this for me, although I have to admit that it might fail to do so. `:)` But if it does, then the next version might do. And even if the next version fails, too, I'd _still_ not clutter my code with such low-level details (you change the container's element type and suddenly got a non-POD which you shouldn't use `std::memcmp()` on in a completely different part of your code), _until profiling shows_ that a particular call to `std::equal()` is too slow and that performance indeed increases by using `std::memcmp()` instead.
sbi
@Oli: FWIW, I've looked into VC9's standard lib (from Dinkumware) and found that it will call `std::memcmp()` from `std::equal()` for (`signed char*` OR `unsigned char*`) AND random access iterators. It will, however, resort to its own loop for `char*`. Why it does this, I wouldn't know and I'm not interested either. (I already said I'd expect my implementation to take care of this.)
sbi
@sbi: I take your point. However, if profiling shows that this really is the hotspot in the application, I'd say that would justify replacing with a call to `memcmp()` if it brings a significant speedup, no matter what the cost in clarity. So I guess we agree! (FWIW, I ran `objdump` on a trivial example compiled against the GNU library, and it doesn't resolve to `memcmp()`).
Oli Charlesworth
@Oli: Yes, in that we agree. However, this seems a long shot from "It is crazy to use anything other than memcmp for the comparison", to which I objected.
sbi
You're all crazy! memcmp is a perfectly respectable function. It's fast, and it's portable. That's all you need to know, isn't it?
TonyK
@sbi: Agreed on that, too.
Oli Charlesworth
@Tony: I understand where you're coming from, but `std::memcmp()` isn't very C++. It's not type-safe, you need to manually tell it how long the arrays are (in bytes), and it doesn't extend to arbitrary container types.
Oli Charlesworth
There we have it. 'Isn't very C++'. Is that like Republican In Name Only? This isn't a religious issue, it's a speed issue.
TonyK
@Tony: It's not simply "religious". Type-safety, etc. are all Good Things. You shouldn't sacrifice them unless you're certain that `memcmp()` offers you a significant speed boost.
Oli Charlesworth
You're a loony.
TonyK
@Tony: Yes, all C++ best practices are lunacy. To hell with Amdahl's law...
Oli Charlesworth
Amdahl's law. You really are a loony.
TonyK
@Tony: can you explain why?
Oli Charlesworth
Arguments in favour of using memcmp to compare byte arrays: 1. It's fast. 2. It's portable. 3. It's easy to use. Arguments against using memcmp to compare byte arrays: 1. It isn't very C++. 2. It's not type-safe if you're not comparing byte arrays. 3. It flouts Amdahl's law.
TonyK
@Tony: It isn't very C++ *because* it's not type-safe, generic or encapsulated. It doesn't "flout" Amdahl's law; the point is that premature optimisation which leads to sacrificing any of the above is indeed "lunacy".
Oli Charlesworth
Do you work for IBM?
TonyK
@TonyK: Sometimes, when a whole community seems to disagree, it is wise to listen. If you can't see the difference between `C` and `C++`, you probably haven't mastered either of them. Real arguments against `memcmp`: `std::equal` is as fast, as portable and even easier to use. Moreover, it makes your `C++` code consistent and maintainable.
ereOn
OK, I've written a benchmark program so we can evaluate this objectively. It wouldn't fit in a comment, so I've posted another answer. It turns out that on my system, std::equal is a little slower than memcmp, but operator== is just as fast. What happens on your system?
TonyK
+1  A: 

My first piece of advice would be to profile your code.

The reason I say that is that, no matter how slow your comparison code is, I suspect your file I/O time dwarfs it. You don't want to waste days trying to optimize a part of your code that only takes 1% of your runtime as-is.

It could even be that there is something else you didn't notice before that is actually causing the slowness. You won't know until you profile.

T.E.D.
As wholeheartedly as I normally agree with such advice, reading binary blobs into __linked lists of bytes__ will so _very_ likely yield inferior performance that even I wouldn't bother to measure.
sbi
Agreed on the profiling. However, in this case, using a linked list instead of a contiguous container is an algorithmic problem in itself.
ereOn
I'm not so sure what you mean by profiling, but I think I've done a similar thing, and that's the reason I came up with the question. I log every operation with an abstraction level for the method which wrote the log entry. That's how I found out that the very small comparison method is causing the whole program to slow down, because it's an essential part of the program and it's being called very regularly.
zitroneneis
@zitroneneis: You're getting excellent advice here. Also, if you're not sure what is meant by profiling, you should *get* sure. Here's what I do (http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux/378024#378024). If you're a student, you probably haven't heard of this, but every line of code, whether or not it's a function call, is responsible for a certain percent of overall time, such that if it were not there, that time would not be spent. During that time, it is on the stack for you to see.
Mike Dunlavey
@Mike Dunlavey - your method looks a lot simpler and faster than what I've done till now (use a logger with time stamps and different debug levels, scanning the log file with my eyes to find ineffective methods). Will try it on my program. Thank you for the advice
zitroneneis
@zitroneneis: That's a method that's been discovered independently by some people, as in http://stackoverflow.com/questions/2473666/tips-for-optimizing-c-net-programs/2474118#2474118 and http://stackoverflow.com/questions/266373/one-could-use-a-profiler-but-why-not-just-halt-the-program/317160#317160. I've done my best to spread the word: http://stackoverflow.com/questions/2624667/whats-a-very-easy-c-profiler-vc/2624725#2624725 and http://stackoverflow.com/questions/1777556/alternatives-to-gprof/1779343#1779343
Mike Dunlavey
@Mike Dunlavey - Oddly, my objection to that method is exactly the opposite of the one you dismissed: Often I find that where the debugger stops things is *not* random. There are certain things (e.g. system calls) that you will never interrupt with the debugger. Still, there will be times where it would point you straight at the problem. Way better than guessing.
T.E.D.
@T.E.D.: Yes, if you pause it during a system call, it should halt when that call returns, so that's just as good because it tells where in *your* code (which is all you can edit) it was at that time. What used to frost me was when it would delay the pause until the next windows message. Man, that was maddening!
Mike Dunlavey
@Mike Dunlavey - Yeah, the old SunOS debugger used to do the equivalent thing to me. I guess that's where I got it in my head that I couldn't use that as a profiling technique. I'd have to really trust a debugger before I tried it again. For profiling, if I can't totally trust my results, it is essentially worthless to me.
T.E.D.
@zitroneneis, @T.E.D.: Just yesterday I got a 24% speedup in a simulation program, in two fixes. The first was because of some recent changes I had made, which I didn't think would be a performance hit. The second was in a general routine for operating on matrices in LAPACK that was spending too much time classifying its arguments. It took about 10 stackshots to find these.
Mike Dunlavey
A: 

In the interest of objectivity in the memcmp-vs-equal debate, I offer the following benchmark program, so that you can see for yourselves which, if any, is faster on your system. It also tests operator==. On my system (Borland C++ 5.5.1 for Win32):

std::equal: 1375 clock ticks
operator==: 1297 clock ticks
memcmp: 1297 clock ticks

What happens on your system?

#include <algorithm>
#include <vector>
#include <iostream>
#include <cstring>   // for memcmp
#include <cstdlib>   // for exit
#include <ctime>     // for clock, clock_t

using namespace std;

static int const BufferSize = 100000 ;

static clock_t StartTimer() ;
static clock_t EndTimer (clock_t t) ;

int main (int argc, char** argv)
  {
  // Allocate a buffer (value-initialised, so the compared bytes are defined)
  char* buff = new char[BufferSize]() ;

  // Create two identical vectors from the buffer
  vector<char> v0 (buff, buff + BufferSize) ;
  vector<char> v1 (buff, buff + BufferSize) ;

  clock_t t ;

  // Compare them 10000 times using std::equal
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (!equal (v0.begin(), v0.end(), v1.begin()))
      cout << "Error in std::equal\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "std::equal: " << t << " clock ticks\n" ;

  // Compare them 10000 times using operator==
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (v0 != v1)
      cout << "Error in operator==\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "operator==: " << t << " clock ticks\n" ;

  // Compare them 10000 times using memcmp
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (memcmp (&v0[0], &v1[0], v0.size()))
      cout << "Error in memcmp\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "memcmp: " << t << " clock ticks\n" ;

  delete[] buff ;
  return 0 ;
  }

static clock_t StartTimer()
  {
  // Start on a clock tick, to enhance reproducibility
  clock_t t = clock() ;
  while (clock() == t)
    ;
  return clock() ;
  }

static clock_t EndTimer (clock_t t)
  {
  return clock() - t ;
  }
TonyK
@TonyK: your benchmark gives me: std::equal: 437 clock ticks, operator==: 453 clock ticks, memcmp: 500 clock ticks (using gcc 4.4.0 with mingw). Very interesting results. Glad I went with operator== and didn't hassle with memcmp
zitroneneis
FYI: In the GNU library at least, `std::vector::operator==` essentially wraps a call to `std::equal`. If I run under GCC using `-O2`, the loop around `memcmp` is optimised away, which makes it difficult to compare results...
Oli Charlesworth
@zitroneneis: That is indeed interesting! I've learned something from this! And you have your answer. (I tried debugging the program, to see what the Borland compiler was doing -- but in the debug version, the C++ variants take four times longer, even with -O2. memcmp timing is unchanged.)
TonyK
@TonyK: BCB might not inline in Debugging mode.
sbi