Software
Speed
Optimization
Case Studies
Services
Platforms
· x86
· Itanium
· Embedded
· Other
Technology
· SIMD / MMX / SSE /
SSE2 / SSE3 / 3DNow!
About
Contact

Technology: SIMD / MMX / SSE / SSE2 / SSE3 / 3DNow!
An increasing number of newer processors come with extensions to their
instruction set commonly referred to as SIMD instructions. SIMD stands for
Single Instruction Multiple Data meaning that a single instruction, such
as "add", operates on a number of data items in parallel. A
typical SIMD instruction for example will add 8 16 bit values in parallel.
Obviously this can increase execution speed dramatically which is why
these instruction set extensions were created.
You can either read this page sequentially or jump directly to the
following subsections:
[The 3DNow! instruction set extension is not covered, however, the SSE
instruction set is similar.]
In 1997 Intel launched the Pentium Processor with MMX Technology.
Subsequently the Pentium II, III and 4 were launched and other processors,
such as the AMD Athlon and Duron, added this instruction set extension.
The MMX instruction set introduces 8 new 64 bit wide registers named
mm0 .. mm7; however, these are mapped on the FPU registers, so FPU and MMX
instructions can not be intermixed. The registers can hold the following
data types:
Data type 
MMXRegister 
1x 64 bit (raw) integer 
Bits 63 .. 0 
2x 32 bit integers 
32 bit word 1 
16 bit word 0 
4x 16 bit integers 
16 bit word 3 
16 bit word 2 
16 bit word 1 
16 bit word 0 
8x 8 bit integers 
Byte 7 
Byte 6 
Byte 5 
Byte 4 
Byte 3 
Byte 2 
Byte 1 
Byte 0 
Corresponding integer instructions were introduced:
Instruction class

Instruction

Operation

Data movement 
movq 
64 bit move from/to mmx register 
movd 
32 bit move from/to mmx register 
Data format conversion / movement 
packsswb/dw 
Parallel pack signed words into bytes / dwords into words with
signed saturation 
packuswb/dw 
Parallel pack unsigned words into bytes / dwords into words with
unsigned saturation 
punpckhbw/wd/dq 
Parallel unpack high 32 bits: bytes to words / words to dwords /
dwords to qwords 
punpcklbw/wd/dq 
Parallel unpack low 32 bits: bytes to words / words to dwords /
dwords to qwords 
Arithmetical operations 
paddb/w/d 
Parallel add for bytes / words / dwords 
psubb/w/d 
Parallel subtraction for bytes / words / dwords 
paddsb/w 
Parallel saturated add for signed bytes / words 
paddusb/w 
Parallel saturated add for unsigned bytes / words 
psubsb/w 
Parallel saturated subtraction for signed bytes / words 
psubusb/w 
Parallel saturated subtraction for unsigned bytes / words 
pmaddwd 
Parallel multiply signed words and add the results 
pmullhw 
Parallel multiply signed words and store high 16 bits of results 
pmulllw 
Parallel multiply signed words and store low 16 bits of results 
pcmpeqb/w/d 
Parallel compare signed bytes / words / dwords for equality 
pcmpgtb/w/d 
Parallel compare signed bytes / words / dwords for "greater
than" 
Logical operations 
pand 
Parallel and operation on all 64 bits 
pandn 
Parallel and not operation on all 64 bits 
por 
Parallel or operation on all 64 bits 
pxor 
Parallel xor operation on all 64 bits 
Shift operations 
psllw/d/q 
Parallel shift logical left words / dwords / qwords 
psraw/d 
Parallel shift right signed words / dwords 
psrlw/d/q 
Parallel shift right unsigned words / dwords / qwords 
Enable FPU instructions 
emms 
Resets the FPU register states to enable FPU instructions 
All operations are either register, register or register, memory
operations. No memory, register or memory, memory operations are
available.
Here come some examples:
Instruction

Operation

pxor mm0,mm0 
Clear mm0 
paddb mm0,mm1 
Parallel add bytes from mm1 to mm0 
psubusw mm7,[eax] 
Parallel add unsigned words with saturation from [eax] to mm7 
psslw mm5,3 
Parallel shift left word in mm5 by 3 positions, filling in zeros 
As can be seen from the above instruction list there are a number of limitations on this
initial set of instructions:
 Not all data types are supported for all operations, i.e. unsigned
multiplication, bytewide shifts are missing etc.
 Various interesting operations are missing, i.e. min/max
 The registers are shared with the FPU registers, prohibiting any FPU
operations in parallel
 No floating point operations are possible
 The registers are only 64 bits wide
 There are only 8 registers
 There are not prefetch or writethrough instructions
 There are no real provisions for unaligned access
 Constants are not supported
The subsequent introduction of the SSE and SSE2 instruction set
extensions remove many of these limitations.
Jump to top of page
The Pentium III processor, introduced in 1999, was the first processor
with the SSE instruction set extension. However, AMD at that time already
had added the similar 3DNow! instruction set extension to their AMD K6 CPU.
The SSE instruction set in fact consists of 3 distinct enhancements:
 Additional MMX instructions such as min/max
 Prefetch and writethrough instructions for optimizing data movement
from and to the L2/L3 caches and main memory
 New 128 bit XMM registers and corresponding 32 bit floating point
(single precision) instructions
The SSE instruction set introduces 8 new 128 bit wide registers named xmm0 ..
xmm7. The registers can hold the following
data types:
Data type 
XMMRegister 
1x 128 bit raw integer 
Bits 127 .. 0 
4x 32 bit floating point values 
32 bit floating point value 3 
32 bit floating point value 2 
32 bit floating point value 1 
32 bit floating point value 0 
The SSE instruction set includes the MMX instruction set; however, the
following table lists only the additional SSE instructions:
Instruction class

Instruction

Operation

Prefetch and writethrough instructions 
prefetchx 
Prefetch cache line into L1/L2/L3 cache 
movntq 
Store data directly into main memory, bypassing the caches (MMX
only) 
movntps 
Store data directly into main memory, bypassing the caches (XMM
only) 
maskmovq 
Store individual bytes directly into main memory, bypassing the
caches (MMX only) 
XMM data movement 
movaps 
128 bit move from/to xmm register, memory data must be aligned 
movups 
128 bit move from/to xmm register, memory data may be unaligned 
movss 
32 bit move from/to xmm register 
movlps 
64 bit move from/to lower 64 bit of xmm register 
movhps 
64 bit move from/to high 64 bit of xmm register 
movlhps 
64 bit move from lower 64 bit of xmm register to high 64 bit of
xmm register 
movhlps 
64 bit move from higher 64 bit of xmm register to lower 64 bit of
xmm register 
movmskps 
Extract upper bits of all dwords and store in a general register 
Data format conversion / movement 
pextrw 
Extract word into general register 
pinsrw 
Insert word from general register 
pmovmskb 
Extract upper bits of all bytes and store in a general register 
pshufw 
Shuffle words (MMX only) 
pshufps 
Shuffle dwords (XMM only) 
unpckhps 
Parallel unpack and interleave high dwords (XMM only) 
unpcklps 
Parallel unpack and interleave low dwords (XMM only) 
cvtpi2ps 
Parallel convert integer dwords to single precision floating point
values (XMM only) 
cvtpi2ss 
Convert integer dword to single precision floating point value (XMM only) 
cvtps2si 
Parallel convert single precision floating point
values to integer dwords (XMM only) 
cvtss2si 
Convert single precision floating point value to integer dword (XMM only) 
MMX arithmetical operations 
pavgb/w 
Parallel average for unsigned bytes / words 
pminub 
Parallel minimum for unsigned bytes 
pmaxub 
Parallel maximum for unsigned bytes 
pminsw 
Parallel minimum for signed words 
pmaxsw 
Parallel maximum for signed words 
pmulhuw 
Parallel multiply unsigned words and store high 16 bits of results 
psadbw 
Parallel absolute difference of unsigned bytes and sum up the results 
XMM arithmetical operations 
addps 
Parallel add single precision floating point values 
addss 
Add single precision floating point value in lower 32 bits 
subps 
Parallel subtract single precision floating point values 
subss 
Subtract single precision floating point value in lower 32 bits 
mulps 
Parallel multiply single precision floating point values 
mulss 
Multiply single precision floating point value in lower 32 bits 
divps 
Parallel divide single precision floating point values 
divss 
Divide single precision floating point value in lower 32 bits 
rcpps 
Parallel approximate reciprocal single precision floating point
values 
rcpss 
Approximate reciprocal single precision floating point value in
lower 32 bits 
sqrtps 
Parallel square root single precision floating point values 
sqrtss 
Square root single precision floating point value in lower 32 bits 
rsqrtps 
Parallel approximate reciprocal square root single precision
floating point values 
rsqrtss 
Approximate reciprocal square root single precision floating point
value in lower 32 bits 
minps 
Parallel minimum of single precision floating point values 
minss 
Minimum of precision floating point value in lower 32 bits 
maxps 
Parallel maximum of single precision floating point values 
maxss 
Maximum of precision floating point value in lower 32 bits 
cmpps 
Parallel ompare single precision floating point values and return masks as
result 
cmpss 
Compare single precision floating point value in lower 32 bits and
return mask as result 
comiss 
Compare single precision floating point value in lower 32 bits and
return result in flag register 
ucomiss 
Compare unordered single precision floating point value in lower
32 bits and return result in flag register 
XMM Logical operations 
andps 
Parallel and operation on all 128 bits 
andnps 
Parallel and not operation on all 128 bits 
orps 
Parallel or operation on all 128 bits 
xorps 
Parallel xor operation on all 128 bits 
Here come some examples:
Instruction

Operation

prefetchnta [esi+64] 
Prefetch data into L1 (or L2) cache from [esi+64] 
pminub mm3,mm4 
Parallel minimum of unsigned bytes of mm3 and mm4 
mulps xmm0,xmm5 
Parallel single precision floating point add xmm5 to xmm0 
maxps xmm1,[edi] 
Parallel single precision floating point maximum of xmm1 and data
at [edi] 
As can be seen from the above instruction list there are still a number of limitations
on this set of instructions:
 Not all data types are supported for all operations, i.e. 32 bit
integer multiplies, bytewide shifts are missing etc.
 Only 32 bit floating point data types are available
 There are no integer instructions for the XMM registers
 There are only 8 registers
 There are no real provisions for unaligned access
 Constants are not supported
The subsequent introduction of the SSE2 instruction set extension
removes some of these limitations.
Jump to top of page
The Pentium 4 processor introduced the SSE2 instruction set extensions
in 2000.
Two important improvements were made:
 Support 64 bit (double precision) floating point data types
 Integer instruction for the XMM registers
Because the SSE2 instruction set introduces integer instructions to the
XMM registers and double precision floating data types the XMM registers
can now hold following data types:
Data type 
XMMxRegister 
1x 128 bit raw integer 
Bits 127 .. 0 
2x 64 bit integer 
64 bit word 1 
64 bit word 0 
4x 32 bit integer 
32 bit word 3 
32 bit word 2 
32 bit word 1 
32 bit word 0 
8x 16 bit integer 
16 bit word 7 
16 bit word 6 
16 bit word 5 
16 bit word 4 
16 bit word 3 
16 bit word 2 
16 bit word 1 
16 bit word 0 
16x 8 bit integer 
Byte 15 
Byte 14 
Byte 13 
Byte 12 
Byte 11 
Byte 10 
Byte 9 
Byte 8 
Byte 7 
Byte 6 
Byte 5 
Byte 4 
Byte 3 
Byte 2 
Byte 1 
Byte 0 
2x 64 bit floating point values 
64 bit floating point value 1 
64 bit floating point value 0 
4x 32 bit floating point values 
32 bit floating point value 3 
32 bit floating point value 2 
32 bit floating point value 1 
32 bit floating point value 0 
The SSE2 instruction set includes the SSE instruction set (the MMX and
SSE integer instructions also apply to the XMM registers now and the
single precision floating point instructions of the SSE instruction set
also apply to the double precision data types); the following table lists
only the additional SSE2 instructions:
Instruction class

Instruction

Operation

Prefetch and writethrough instructions 
movntdq / movntpd 
Store data directly into main memory, bypassing the caches (XMM
only) 
movnti 
Store data in general register directly into main memory, bypassing the caches 
maskmovdqu 
Store individual bytes directly into main memory, bypassing the
caches (XMM only) 
XMM data movement 
movapd / movdqa 
128 bit move from/to xmm register, memory data must be aligned 
movupd / movdqu 
128 bit move from/to xmm register, memory data may be unaligned 
movsd 
64 bit move from/to xmm register 
movlpd 
64 bit move from/to lower 64 bit of xmm register 
movhpd 
64 bit move from/to high 64 bit of xmm register 
movlhpd 
64 bit move from lower 64 bit of xmm register to high 64 bit of
xmm register 
movhlpd 
64 bit move from higher 64 bit of xmm register to lower 64 bit of
xmm register 
movmskpd 
Extract upper bits of all qwords and store in a general register 
Data format conversion / movement 
pmovmskpd 
Extract upper bits of all qwords and store in a general register 
pshuflw 
Shuffle words in lower 64 bits (XMM only) 
pshufhw 
Shuffle words in high 64 bits (XMM only) 
pshufpd 
Shuffle dwords (XMM only) 
unpckhpd 
Parallel unpack and interleave high qwords (XMM only) 
unpcklpd 
Parallel unpack and interleave low qwords (XMM only) 
cvtpi2pd 
Parallel convert integer dwords to double precision floating point
values (XMM only) 
punpckhqdq 
Parallel unpack high qwords 
punpcklqdq 
Parallel unpack lower qwords 
cvtpi2sd 
Convert integer dword to double precision floating point value
(XMM only) 
cvtpd2si 
Parallel convert double precision floating point
values to integer qwords (XMM only) 
cvtsd2si 
Convert double precision floating point value to integer qword
(XMM only) 
cvtps2pd 
Parallel convert single precision floating point values to double
precision floating point values 
cvtpd2ps 
Parallel convert double precision floating point values to single
precision floating point values 
cvtss2sd 
Convert single precision floating point value to double precision
floating point value 
cvtsd2ss 
Convert double precision floating point value to single precision
floating point value 
cvtps2dq 
Parallel convert single precision floating point values to signed
dword integers 
cvtdq2ps 
Parallel convert signed dword integers to single precision
floating point values 
movq2dq 
Move MMX register value to XMM register 
movdq2q 
Move lower 64 bit of XMM register to MMX register 
Integer arithmetical operations 
paddq 
Parallel add for 64 bit integers 
psubq 
Parallel subtraction for 64 bit integers 
pmuludq 
Parallel multiply 32 bit unsigned integers and return 64 bit
result 
XMM arithmetical operations 
addpd 
Parallel add double precision floating point values 
addsd 
Add double precision floating point value in lower 64 bits 
subpd 
Parallel subtract double precision floating point values 
subsd 
Subtract double precision floating point value in lower 64 bits 
mulpd 
Parallel multiply double precision floating point values 
mulsd 
Multiply double precision floating point value in lower 64 bits 
divpd 
Parallel divide double precision floating point values 
divsd 
Divide double precision floating point value in lower 64 bits 
rcppd 
Parallel approximate reciprocal double precision floating point
values 
rcpsd 
Approximate reciprocal double precision floating point value in
lower 64 bits 
sqrtpd 
Parallel square root double precision floating point values 
sqrtsd 
Square root double precision floating point value in lower 64 bits 
rsqrtpd 
Parallel approximate reciprocal square root double precision
floating point values 
rsqrtsd 
Approximate reciprocal square root double precision floating point
value in lower 64 bits 
minpd 
Parallel minimum of double precision floating point values 
minsd 
Minimum of precision floating point value in lower 64 bits 
maxpd 
Parallel maximum of double precision floating point values 
maxsd 
Maximum of precision floating point value in lower 64 bits 
cmpps 
Compare double precision floating point values and return masks as
result 
cmpsd 
Compare double precision floating point value in lower 64 bits and
return mask as result 
comisd 
Compare double precision floating point value in lower 64 bits and
return result in flag register 
ucomisd 
Compare unordered double precision floating point value in lower
64 bits and return result in flag register 
XMM logical operations 
andpd 
Parallel and operation on all 128 bits 
andnpd 
Parallel and not operation on all 128 bits 
orpd 
Parallel or operation on all 128 bits 
xorpd 
Parallel xor operation on all 128 bits 
XMM shift operations 
pslldq 
Shift all 128 bits left 
psrldq 
Shift all 128 bits rights, filling in zeros 
Here come some examples:
Instruction

Operation

paddd xmm0,xmm6 
Parallel add integer dwords from xmm6 to xmm0 
pmaddwd xmm2,[edx] 
Parallel multiply signed integer words in xmm2 and at [edi] and
add results 
sqrtpd xmm0,xmm0 
Parallel double precision floating point square root of values in
xmm0 
addpd xmm2,[ebx+16] 
Parallel add double precision floating point values from [ebx+16]
to xmm2 
psrldq xmm5,1 
Shift right all 128 bits by one, filling in a zero 
As can be seen from the above instruction list there are still a number of limitations
on this set of instructions:
 Not all data types are supported for all operations, i.e. 32 bit
saturated adds, bytewide shifts are missing etc.
 There are only 8 registers
 There are no real provisions for unaligned access
 Constants are not supported
It is unclear whether future instruction set extensions will remove
some of these limitations.
Nevertheless the SSE2 instruction set in effect is a parallel version of
the typical integer and floating point instructions found in the basic
instruction set of most processors, thereby enabling impressive speedups.
Jump to top of page
The basic characteristic of SIMD instructions is that they operate on n
data items in parallel, n typically being 1, 2, 4, 8 or 16. There are
mainly 5 problems with this:
 If the data layout does not match the SIMD requirements SIMD
instructions can either not be used or data rearrangement code is
necessary
 In case of unaligned data the performance will suffer
dramatically; this problem leads to massively more complicated
alignment code being necessary
 If the problem does not match the possible values of n, for example
in case of a filter which would require n = 5, the solution is
rather complicated
 (Parallel) Table lookups are not possible
 Compare operations are limited
Basically there are solutions to these problems, but they require some
clever or additional code. Due to these problems even compilers which generate SIMD code will in many cases not do so or not be able to do so.
Jump to top of page
SIMD instructions have the potential to speedup software by factors
of 2 .. 16 or even more. This especially applies to
computeintensive software dealing with arrays of data.
Please also see the case studies which
give numbers on speedup factors.
Jump to top of page
SIMD instructions have the potential to speedup software by factors
of 2 .. 16 or even more.
However, as most compilers do not utilize the (full) potential of these
instructions these speedups can generally only be achieved by an
optimization expert.
Hayes Technologies offers services to help
your software realize the high potential of the SIMD instruction sets.
Jump to top of page
