|
Software
Speed
Optimization
Case Studies
Services
Platforms
· x86
· Itanium
· Embedded
· Other
Technology
· SIMD / MMX / SSE /
SSE2 / SSE3 / 3DNow!
About
Contact
|
Technology: SIMD / MMX / SSE / SSE2 / SSE3 / 3DNow!
An increasing number of newer processors come with extensions to their
instruction set commonly referred to as SIMD instructions. SIMD stands for
Single Instruction Multiple Data meaning that a single instruction, such
as "add", operates on a number of data items in parallel. A
typical SIMD instruction for example will add 8 16 bit values in parallel.
Obviously this can increase execution speed dramatically which is why
these instruction set extensions were created.
You can either read this page sequentially or jump directly to the
following subsections:
[The 3DNow! instruction set extension is not covered, however, the SSE
instruction set is similar.]
In 1997 Intel launched the Pentium Processor with MMX Technology.
Subsequently the Pentium II, III and 4 were launched and other processors,
such as the AMD Athlon and Duron, added this instruction set extension.
The MMX instruction set introduces 8 new 64 bit wide registers named
mm0 .. mm7; however, these are mapped on the FPU registers, so FPU and MMX
instructions can not be intermixed. The registers can hold the following
data types:
| Data type |
MMX-Register |
| 1x 64 bit (raw) integer |
Bits 63 .. 0 |
| 2x 32 bit integers |
32 bit word 1 |
16 bit word 0 |
| 4x 16 bit integers |
16 bit word 3 |
16 bit word 2 |
16 bit word 1 |
16 bit word 0 |
| 8x 8 bit integers |
Byte 7 |
Byte 6 |
Byte 5 |
Byte 4 |
Byte 3 |
Byte 2 |
Byte 1 |
Byte 0 |
Corresponding integer instructions were introduced:
Instruction class
|
Instruction
|
Operation
|
| Data movement |
movq |
64 bit move from/to mmx register |
| movd |
32 bit move from/to mmx register |
| Data format conversion / movement |
packsswb/dw |
Parallel pack signed words into bytes / dwords into words with
signed saturation |
| packuswb/dw |
Parallel pack unsigned words into bytes / dwords into words with
unsigned saturation |
| punpckhbw/wd/dq |
Parallel unpack high 32 bits: bytes to words / words to dwords /
dwords to qwords |
| punpcklbw/wd/dq |
Parallel unpack low 32 bits: bytes to words / words to dwords /
dwords to qwords |
| Arithmetical operations |
paddb/w/d |
Parallel add for bytes / words / dwords |
| psubb/w/d |
Parallel subtraction for bytes / words / dwords |
| paddsb/w |
Parallel saturated add for signed bytes / words |
| paddusb/w |
Parallel saturated add for unsigned bytes / words |
| psubsb/w |
Parallel saturated subtraction for signed bytes / words |
| psubusb/w |
Parallel saturated subtraction for unsigned bytes / words |
| pmaddwd |
Parallel multiply signed words and add the results |
| pmullhw |
Parallel multiply signed words and store high 16 bits of results |
| pmulllw |
Parallel multiply signed words and store low 16 bits of results |
| pcmpeqb/w/d |
Parallel compare signed bytes / words / dwords for equality |
| pcmpgtb/w/d |
Parallel compare signed bytes / words / dwords for "greater
than" |
| Logical operations |
pand |
Parallel and operation on all 64 bits |
| pandn |
Parallel and not operation on all 64 bits |
| por |
Parallel or operation on all 64 bits |
| pxor |
Parallel xor operation on all 64 bits |
| Shift operations |
psllw/d/q |
Parallel shift logical left words / dwords / qwords |
| psraw/d |
Parallel shift right signed words / dwords |
| psrlw/d/q |
Parallel shift right unsigned words / dwords / qwords |
| Enable FPU instructions |
emms |
Resets the FPU register states to enable FPU instructions |
All operations are either register, register or register, memory
operations. No memory, register or memory, memory operations are
available.
Here come some examples:
|
Instruction
|
Operation
|
| pxor mm0,mm0 |
Clear mm0 |
| paddb mm0,mm1 |
Parallel add bytes from mm1 to mm0 |
| psubusw mm7,[eax] |
Parallel add unsigned words with saturation from [eax] to mm7 |
| psslw mm5,3 |
Parallel shift left word in mm5 by 3 positions, filling in zeros |
As can be seen from the above instruction list there are a number of limitations on this
initial set of instructions:
- Not all data types are supported for all operations, i.e. unsigned
multiplication, byte-wide shifts are missing etc.
- Various interesting operations are missing, i.e. min/max
- The registers are shared with the FPU registers, prohibiting any FPU
operations in parallel
- No floating point operations are possible
- The registers are only 64 bits wide
- There are only 8 registers
- There are not prefetch or write-through instructions
- There are no real provisions for unaligned access
- Constants are not supported
The subsequent introduction of the SSE and SSE2 instruction set
extensions remove many of these limitations.
Jump to top of page
The Pentium III processor, introduced in 1999, was the first processor
with the SSE instruction set extension. However, AMD at that time already
had added the similar 3DNow! instruction set extension to their AMD K6 CPU.
The SSE instruction set in fact consists of 3 distinct enhancements:
- Additional MMX instructions such as min/max
- Prefetch and write-through instructions for optimizing data movement
from and to the L2/L3 caches and main memory
- New 128 bit XMM registers and corresponding 32 bit floating point
(single precision) instructions
The SSE instruction set introduces 8 new 128 bit wide registers named xmm0 ..
xmm7. The registers can hold the following
data types:
| Data type |
XMM-Register |
| 1x 128 bit raw integer |
Bits 127 .. 0 |
| 4x 32 bit floating point values |
32 bit floating point value 3 |
32 bit floating point value 2 |
32 bit floating point value 1 |
32 bit floating point value 0 |
The SSE instruction set includes the MMX instruction set; however, the
following table lists only the additional SSE instructions:
Instruction class
|
Instruction
|
Operation
|
| Prefetch and write-through instructions |
prefetchx |
Prefetch cache line into L1/L2/L3 cache |
| movntq |
Store data directly into main memory, bypassing the caches (MMX
only) |
| movntps |
Store data directly into main memory, bypassing the caches (XMM
only) |
| maskmovq |
Store individual bytes directly into main memory, bypassing the
caches (MMX only) |
| XMM data movement |
movaps |
128 bit move from/to xmm register, memory data must be aligned |
| movups |
128 bit move from/to xmm register, memory data may be unaligned |
| movss |
32 bit move from/to xmm register |
| movlps |
64 bit move from/to lower 64 bit of xmm register |
| movhps |
64 bit move from/to high 64 bit of xmm register |
| movlhps |
64 bit move from lower 64 bit of xmm register to high 64 bit of
xmm register |
| movhlps |
64 bit move from higher 64 bit of xmm register to lower 64 bit of
xmm register |
| movmskps |
Extract upper bits of all dwords and store in a general register |
| Data format conversion / movement |
pextrw |
Extract word into general register |
| pinsrw |
Insert word from general register |
| pmovmskb |
Extract upper bits of all bytes and store in a general register |
| pshufw |
Shuffle words (MMX only) |
| pshufps |
Shuffle dwords (XMM only) |
| unpckhps |
Parallel unpack and interleave high dwords (XMM only) |
| unpcklps |
Parallel unpack and interleave low dwords (XMM only) |
| cvtpi2ps |
Parallel convert integer dwords to single precision floating point
values (XMM only) |
| cvtpi2ss |
Convert integer dword to single precision floating point value (XMM only) |
| cvtps2si |
Parallel convert single precision floating point
values to integer dwords (XMM only) |
| cvtss2si |
Convert single precision floating point value to integer dword (XMM only) |
| MMX arithmetical operations |
pavgb/w |
Parallel average for unsigned bytes / words |
| pminub |
Parallel minimum for unsigned bytes |
| pmaxub |
Parallel maximum for unsigned bytes |
| pminsw |
Parallel minimum for signed words |
| pmaxsw |
Parallel maximum for signed words |
| pmulhuw |
Parallel multiply unsigned words and store high 16 bits of results |
| psadbw |
Parallel absolute difference of unsigned bytes and sum up the results |
| XMM arithmetical operations |
addps |
Parallel add single precision floating point values |
| addss |
Add single precision floating point value in lower 32 bits |
| subps |
Parallel subtract single precision floating point values |
| subss |
Subtract single precision floating point value in lower 32 bits |
| mulps |
Parallel multiply single precision floating point values |
| mulss |
Multiply single precision floating point value in lower 32 bits |
| divps |
Parallel divide single precision floating point values |
| divss |
Divide single precision floating point value in lower 32 bits |
| rcpps |
Parallel approximate reciprocal single precision floating point
values |
| rcpss |
Approximate reciprocal single precision floating point value in
lower 32 bits |
| sqrtps |
Parallel square root single precision floating point values |
| sqrtss |
Square root single precision floating point value in lower 32 bits |
| rsqrtps |
Parallel approximate reciprocal square root single precision
floating point values |
| rsqrtss |
Approximate reciprocal square root single precision floating point
value in lower 32 bits |
| minps |
Parallel minimum of single precision floating point values |
| minss |
Minimum of precision floating point value in lower 32 bits |
| maxps |
Parallel maximum of single precision floating point values |
| maxss |
Maximum of precision floating point value in lower 32 bits |
| cmpps |
Parallel ompare single precision floating point values and return masks as
result |
| cmpss |
Compare single precision floating point value in lower 32 bits and
return mask as result |
| comiss |
Compare single precision floating point value in lower 32 bits and
return result in flag register |
| ucomiss |
Compare unordered single precision floating point value in lower
32 bits and return result in flag register |
| XMM Logical operations |
andps |
Parallel and operation on all 128 bits |
| andnps |
Parallel and not operation on all 128 bits |
| orps |
Parallel or operation on all 128 bits |
| xorps |
Parallel xor operation on all 128 bits |
Here come some examples:
|
Instruction
|
Operation
|
| prefetchnta [esi+64] |
Prefetch data into L1 (or L2) cache from [esi+64] |
| pminub mm3,mm4 |
Parallel minimum of unsigned bytes of mm3 and mm4 |
| mulps xmm0,xmm5 |
Parallel single precision floating point add xmm5 to xmm0 |
| maxps xmm1,[edi] |
Parallel single precision floating point maximum of xmm1 and data
at [edi] |
As can be seen from the above instruction list there are still a number of limitations
on this set of instructions:
- Not all data types are supported for all operations, i.e. 32 bit
integer multiplies, byte-wide shifts are missing etc.
- Only 32 bit floating point data types are available
- There are no integer instructions for the XMM registers
- There are only 8 registers
- There are no real provisions for unaligned access
- Constants are not supported
The subsequent introduction of the SSE2 instruction set extension
removes some of these limitations.
Jump to top of page
The Pentium 4 processor introduced the SSE2 instruction set extensions
in 2000.
Two important improvements were made:
- Support 64 bit (double precision) floating point data types
- Integer instruction for the XMM registers
Because the SSE2 instruction set introduces integer instructions to the
XMM registers and double precision floating data types the XMM registers
can now hold following data types:
| Data type |
XMMx-Register |
| 1x 128 bit raw integer |
Bits 127 .. 0 |
| 2x 64 bit integer |
64 bit word 1 |
64 bit word 0 |
| 4x 32 bit integer |
32 bit word 3 |
32 bit word 2 |
32 bit word 1 |
32 bit word 0 |
| 8x 16 bit integer |
16 bit word 7 |
16 bit word 6 |
16 bit word 5 |
16 bit word 4 |
16 bit word 3 |
16 bit word 2 |
16 bit word 1 |
16 bit word 0 |
| 16x 8 bit integer |
Byte 15 |
Byte 14 |
Byte 13 |
Byte 12 |
Byte 11 |
Byte 10 |
Byte 9 |
Byte 8 |
Byte 7 |
Byte 6 |
Byte 5 |
Byte 4 |
Byte 3 |
Byte 2 |
Byte 1 |
Byte 0 |
| 2x 64 bit floating point values |
64 bit floating point value 1 |
64 bit floating point value 0 |
| 4x 32 bit floating point values |
32 bit floating point value 3 |
32 bit floating point value 2 |
32 bit floating point value 1 |
32 bit floating point value 0 |
The SSE2 instruction set includes the SSE instruction set (the MMX and
SSE integer instructions also apply to the XMM registers now and the
single precision floating point instructions of the SSE instruction set
also apply to the double precision data types); the following table lists
only the additional SSE2 instructions:
Instruction class
|
Instruction
|
Operation
|
| Prefetch and write-through instructions |
movntdq / movntpd |
Store data directly into main memory, bypassing the caches (XMM
only) |
| movnti |
Store data in general register directly into main memory, bypassing the caches |
| maskmovdqu |
Store individual bytes directly into main memory, bypassing the
caches (XMM only) |
| XMM data movement |
movapd / movdqa |
128 bit move from/to xmm register, memory data must be aligned |
| movupd / movdqu |
128 bit move from/to xmm register, memory data may be unaligned |
| movsd |
64 bit move from/to xmm register |
| movlpd |
64 bit move from/to lower 64 bit of xmm register |
| movhpd |
64 bit move from/to high 64 bit of xmm register |
| movlhpd |
64 bit move from lower 64 bit of xmm register to high 64 bit of
xmm register |
| movhlpd |
64 bit move from higher 64 bit of xmm register to lower 64 bit of
xmm register |
| movmskpd |
Extract upper bits of all qwords and store in a general register |
| Data format conversion / movement |
pmovmskpd |
Extract upper bits of all qwords and store in a general register |
| pshuflw |
Shuffle words in lower 64 bits (XMM only) |
| pshufhw |
Shuffle words in high 64 bits (XMM only) |
| pshufpd |
Shuffle dwords (XMM only) |
| unpckhpd |
Parallel unpack and interleave high qwords (XMM only) |
| unpcklpd |
Parallel unpack and interleave low qwords (XMM only) |
| cvtpi2pd |
Parallel convert integer dwords to double precision floating point
values (XMM only) |
| punpckhqdq |
Parallel unpack high qwords |
| punpcklqdq |
Parallel unpack lower qwords |
| cvtpi2sd |
Convert integer dword to double precision floating point value
(XMM only) |
| cvtpd2si |
Parallel convert double precision floating point
values to integer qwords (XMM only) |
| cvtsd2si |
Convert double precision floating point value to integer qword
(XMM only) |
| cvtps2pd |
Parallel convert single precision floating point values to double
precision floating point values |
| cvtpd2ps |
Parallel convert double precision floating point values to single
precision floating point values |
| cvtss2sd |
Convert single precision floating point value to double precision
floating point value |
| cvtsd2ss |
Convert double precision floating point value to single precision
floating point value |
| cvtps2dq |
Parallel convert single precision floating point values to signed
dword integers |
| cvtdq2ps |
Parallel convert signed dword integers to single precision
floating point values |
| movq2dq |
Move MMX register value to XMM register |
| movdq2q |
Move lower 64 bit of XMM register to MMX register |
| Integer arithmetical operations |
paddq |
Parallel add for 64 bit integers |
| psubq |
Parallel subtraction for 64 bit integers |
| pmuludq |
Parallel multiply 32 bit unsigned integers and return 64 bit
result |
| XMM arithmetical operations |
addpd |
Parallel add double precision floating point values |
| addsd |
Add double precision floating point value in lower 64 bits |
| subpd |
Parallel subtract double precision floating point values |
| subsd |
Subtract double precision floating point value in lower 64 bits |
| mulpd |
Parallel multiply double precision floating point values |
| mulsd |
Multiply double precision floating point value in lower 64 bits |
| divpd |
Parallel divide double precision floating point values |
| divsd |
Divide double precision floating point value in lower 64 bits |
| rcppd |
Parallel approximate reciprocal double precision floating point
values |
| rcpsd |
Approximate reciprocal double precision floating point value in
lower 64 bits |
| sqrtpd |
Parallel square root double precision floating point values |
| sqrtsd |
Square root double precision floating point value in lower 64 bits |
| rsqrtpd |
Parallel approximate reciprocal square root double precision
floating point values |
| rsqrtsd |
Approximate reciprocal square root double precision floating point
value in lower 64 bits |
| minpd |
Parallel minimum of double precision floating point values |
| minsd |
Minimum of precision floating point value in lower 64 bits |
| maxpd |
Parallel maximum of double precision floating point values |
| maxsd |
Maximum of precision floating point value in lower 64 bits |
| cmpps |
Compare double precision floating point values and return masks as
result |
| cmpsd |
Compare double precision floating point value in lower 64 bits and
return mask as result |
| comisd |
Compare double precision floating point value in lower 64 bits and
return result in flag register |
| ucomisd |
Compare unordered double precision floating point value in lower
64 bits and return result in flag register |
| XMM logical operations |
andpd |
Parallel and operation on all 128 bits |
| andnpd |
Parallel and not operation on all 128 bits |
| orpd |
Parallel or operation on all 128 bits |
| xorpd |
Parallel xor operation on all 128 bits |
| XMM shift operations |
pslldq |
Shift all 128 bits left |
| psrldq |
Shift all 128 bits rights, filling in zeros |
Here come some examples:
|
Instruction
|
Operation
|
| paddd xmm0,xmm6 |
Parallel add integer dwords from xmm6 to xmm0 |
| pmaddwd xmm2,[edx] |
Parallel multiply signed integer words in xmm2 and at [edi] and
add results |
| sqrtpd xmm0,xmm0 |
Parallel double precision floating point square root of values in
xmm0 |
| addpd xmm2,[ebx+16] |
Parallel add double precision floating point values from [ebx+16]
to xmm2 |
| psrldq xmm5,1 |
Shift right all 128 bits by one, filling in a zero |
As can be seen from the above instruction list there are still a number of limitations
on this set of instructions:
- Not all data types are supported for all operations, i.e. 32 bit
saturated adds, byte-wide shifts are missing etc.
- There are only 8 registers
- There are no real provisions for unaligned access
- Constants are not supported
It is unclear whether future instruction set extensions will remove
some of these limitations.
Nevertheless the SSE2 instruction set in effect is a parallel version of
the typical integer and floating point instructions found in the basic
instruction set of most processors, thereby enabling impressive speed-ups.
Jump to top of page
The basic characteristic of SIMD instructions is that they operate on n
data items in parallel, n typically being 1, 2, 4, 8 or 16. There are
mainly 5 problems with this:
- If the data layout does not match the SIMD requirements SIMD
instructions can either not be used or data rearrangement code is
necessary
- In case of unaligned data the performance will suffer
dramatically; this problem leads to massively more complicated
alignment code being necessary
- If the problem does not match the possible values of n, for example
in case of a filter which would require n = 5, the solution is
rather complicated
- (Parallel) Table lookups are not possible
- Compare operations are limited
Basically there are solutions to these problems, but they require some
clever or additional code. Due to these problems even compilers which generate SIMD code will in many cases not do so or not be able to do so.
Jump to top of page
SIMD instructions have the potential to speed-up software by factors
of 2 .. 16 or even more. This especially applies to
compute-intensive software dealing with arrays of data.
Please also see the case studies which
give numbers on speed-up factors.
Jump to top of page
SIMD instructions have the potential to speed-up software by factors
of 2 .. 16 or even more.
However, as most compilers do not utilize the (full) potential of these
instructions these speed-ups can generally only be achieved by an
optimization expert.
Hayes Technologies offers services to help
your software realize the high potential of the SIMD instruction sets.
Jump to top of page
|