I'm not aware of any MIPS variant with a parity instruction, but there's a bit-twiddling trick for calculating parity faster than the obvious method of running through each of the 32 bits in turn. In C:
result = in ^ (in >> 16);
result ^= (result >> 8);
result ^= (result >> 4);
result ^= (result >> 2);
result ^= (result >> 1);
result &= 1;
After the first step, the bottom 16 bits of the result contain the parity of bits N and N+16 of the input - essentially, 16 steps of the parity calculation have been performed in one go. Writing result{N}
to mean "bit N of result
":
result{0} = in{0} ^ in{16}
result{1} = in{1} ^ in{17}
result{2} = in{2} ^ in{18}
...
result{7} = in{7} ^ in{23}
result{8} = in{8} ^ in{24}
...
result{15} = in{15} ^ in{31}
(and the remaining top 16 bits of result
can now be ignored; they serve no useful purpose in the remainder of the calculation).
After the second step, the bottom 8 bits of result
contain the parity of bits N, N+8, N+16, N+24 of the original input:
result{0} = result{0} ^ result{8} = in{0} ^ in{8} ^ in{16} ^ in{24}
result{1} = result{1} ^ result{9} = in{1} ^ in{9} ^ in{17} ^ in{25}
...
result{7} = result{7} ^ result{15} = in{7} ^ in{15} ^ in{23} ^ in{31}
(and again, the remaining bits can be ignored from here onwards).
...and so on, until the parity of all the bits of the original input ends up in the bottom bit of result
:
result{0} = in{0} ^ in{1} ^ in{2} ^ ... ^ in{30} ^ in{31}
This is easy to translate directly to MIPS assembly; it's 11 instructions:
# input in $a0, output in $v0, $t0 corrupted
srl $t0, $a0, 16
xor $v0, $a0, $t0
srl $t0, $v0, 8
xor $v0, $v0, $t0
srl $t0, $v0, 4
xor $v0, $v0, $t0
srl $t0, $v0, 2
xor $v0, $v0, $t0
srl $t0, $v0, 1
xor $v0, $v0, $t0
and $v0, $v0, 1
A possible improvement might be to use a lookup table. For example, after the first two steps, we have:
result{0} = in{0} ^ in{8} ^ in{16} ^ in{24}
result{1} = in{1} ^ in{9} ^ in{17} ^ in{25}
...
result{7} = in{7} ^ in{15} ^ in{23} ^ in{31}
so we could use a 256-byte lookup table at this point. In C:
result = in ^ (in >> 16);
result ^= (result >> 8);
result = lookup_table[result & 0xff];
where lookup_table[n]
has been precalculated, e.g.:
for (i = 0; i < 256; i++) {
n = i ^ (i >> 4);
n ^= (n >> 2);
n ^= (n >> 1);
lookup_table[i] = n & 1;
}
This is 7 MIPS instructions, not counting loading the lookup table base address into a register:
# input in $a0, lookup table address in $a1, output in $v0, $t0 corrupted
srl $t0, $a0, 16
xor $v0, $a0, $t0
srl $t0, $v0, 8
xor $v0, $v0, $t0
andi $v0, $v0, 0xff
addu $t0, $a1, $v0
lbu $v0, 0($t0)
However, this is 7 instructions which include a memory access, versus 11 instructions which are purely register operations; it may or may not be faster. (This sort of micro-optimisation always needs to be profiled!)