mmx

Help with simple assembly mmx exercise

Given a vector of bytes with length multiple of 8, how can I, using mmx instructions, convert all 2's to 5's, for example? .data v1 BYTE 1, 2, 3, 4, 1, 2, 3, 4 Thanks. edit: 2's and 5's are just an example. They are actually parameters of a procedure. ...

Assembly mask logic question

This is very simple, but I haven't been able to figure it out yet. This question is regarding a assembly mmx, but it's pure logic. Imagine the following scenario: MM0: 04 03 02 01 04 03 02 01 <-- input MM1: 02 02 02 02 02 02 02 02 MM2: 04 03 02 01 04 03 02 01 <-- copy of input after pcmpgtw MM0, MM1 MM0: FF FF 00 00 FF FF 00 0...

Common SIMD techniques

Hi! Where can I find information about common SIMD tricks? I have an instruction set and know, how to write non-tricky SIMD code, but I know, SIMD now is much more powerful. It can hold complex conditional branchless code. For example (ARMv6), the following sequence of instructions sets each byte of Rd equal to the unsigned minimum of t...

Benefit of using multiple SIMD instruction sets simultaneously

I'm writing a highly parallel application that's multithreaded. I've already got an SSE accelerated thread class written. If I were to write an MMX accelerated thread class, then run both at the same time (one SSE thread and one MMX thread per core) would the performance improve noticeably? I would think that this setup would help hide...

C/C++ usage of special CPU features

Hi, I am curious, do new compilers use some extra features built into new CPUs such as MMX SSE,3DNow! and so? I mean, in original 8086 there was even no FPU, so compiler that old cannot even use it, but new compilers can, since FPU is part of every new CPU. So, does new compilers use new features of CPU? Or, it should be more right...

Stack usage with MMX intrinsics and Microsoft C++

I have an inline assembler loop that cumulatively adds elements from an int32 data array with MMX instructions. In particular, it uses the fact that the MMX registers can accommodate 16 int32s to calculate 16 different cumulative sums in parallel. I would now like to convert this piece of code to MMX intrinsics but I am afraid that I wi...

Concise SSE and MMX instruction reference with latencies and throughput

I am trying to optimize some arithmetic by using the MMX and SSE instruction sets with inline assembly. However, I have been unable to find good references for the timings and usages of these enhanced instruction sets. Could you please help me find references that contain information about the throughput, latency, operands, and perhaps s...

SSE enhanced libtiff/CCITT Fax4 encoder

Does anyone know of an SSE ehanced version of libtiff? Even just an SSE enhanced version of a CCITT Group4 encoder would do, I could do the work of sliding that one in libtiff myself. I only need to work with bitonal images. Thank you ...