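Looking at the STL that comes with gcc 4.0.0 (libstdc++), the bitset methods _Find_first and _Find_next already do what you want. Note that both are non-standard extensions, so they are not portable to other standard library implementations. Specifically, they use __builtin_ctzl(), a GCC builtin that counts the trailing zero bits of an unsigned long, which should compile to the appropriate instruction where the target has one. (I would guess that the same applies for older versions of gcc.)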
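To make the builtin concrete, here is a minimal sketch of calling it directly on a single word. The wrapper name lowest_set_bit is made up for illustration; only __builtin_ctzl itself comes from GCC (Clang accepts it too). Its result is undefined for an input of 0, so the sketch guards against that with an assert.

```cpp
#include <cassert>

// Hypothetical helper (name is illustrative, not part of any library):
// returns the index of the lowest set bit in one machine word.
// __builtin_ctzl is a GCC/Clang builtin; its result is undefined for 0,
// so the caller has to rule that case out first.
inline int lowest_set_bit(unsigned long word) {
    assert(word != 0 && "__builtin_ctzl(0) is undefined");
    return __builtin_ctzl(word);
}
```

For example, lowest_set_bit(8UL) yields 3, since 8 is binary 1000.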
And the nice thing is that bitset already does the right thing: a single instruction if the bitset fits within one unsigned long, and a loop over the longs if it spans several. In the multi-word case the loop length is known at compile time and the body is only a handful of instructions, so the optimizer may unroll it fully. In other words, it would probably be hard to beat bitset by rolling your own.
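As a rough sketch of how the two methods are typically used together, the function below walks every set bit of a bitset. It assumes libstdc++, where _Find_first() and _Find_next(pos) return the bitset's size when no further bit is set; the function name set_bit_positions is invented for this example.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: collects the indices of all set bits, lowest first.
// Relies on the libstdc++ extensions _Find_first/_Find_next, which return N
// (the bitset size) once there are no more set bits.
template <std::size_t N>
std::vector<std::size_t> set_bit_positions(const std::bitset<N>& bits) {
    std::vector<std::size_t> out;
    for (std::size_t pos = bits._Find_first(); pos < N;
         pos = bits._Find_next(pos)) {
        out.push_back(pos);
    }
    // Sanity check: we should have visited exactly the set bits.
    assert(out.size() == bits.count());
    return out;
}
```

The same loop works whether N fits in one unsigned long or needs several, which is the point made above: the word-by-word handling stays inside bitset.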