Software Speed Optimization · Performance Optimization · High Performance Computing · Number Crunching · C/C++ · Assembly/Assembler · SIMD · MMX · SSE · SSE2 · SSE3 · 3DNow!
Case Studies
This page details a number of software speed optimization cases. For easier understanding, the examples have been boiled down to the core of each task. Only the speed-up factors for optimized assembly code are given; optimized C/C++ code would yield somewhat lower numbers. Please select one of the case studies:
Image Processing: Double Threshold Binarization
Digital Signal Processing: Artificial Neural Network
More case studies will be added, so please check this page regularly!
Image Processing: Double Threshold Binarization
This is an example from machine vision. The task is to generate a binarized image using two thresholds (lower and upper); the thresholds are given per pixel in the form of images. If a source image pixel is at least the lower threshold and below the upper threshold, the binarized image pixel shall be 0x0, otherwise 0xff. The following is the C core function for handling (part of) a horizontal line:
void BinarizeWithDoubleThreshold(
    const unsigned char sourceimage[],
    const size_t nrpixels,
    const unsigned char lowerthreshold[],
    const unsigned char upperthreshold[],
    unsigned char binarizedimage[]
    )
{
    for( size_t i = 0; i < nrpixels; i++ )
        binarizedimage[ i ] = (unsigned char)(
            (sourceimage[ i ] >= lowerthreshold[ i ] &&
             sourceimage[ i ] < upperthreshold[ i ]) ? 0x0 : 0xff);
}
These are the measurement results:
* Specialized, separate functions for cached and
uncached data
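To give an idea of how the per-pixel comparison can be vectorized, here is a minimal sketch using SSE2 intrinsics from <emmintrin.h>. It is not the assembly code behind the measurements on this page; the function name BinarizeWithDoubleThresholdSimd and the SKETCH_HAVE_SSE2 macro are hypothetical, and unaligned loads are assumed for simplicity. On builds without SSE2 the code falls back to the plain loop.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical SIMD counterpart of BinarizeWithDoubleThreshold:
// output is 0x0 where lower <= source < upper, otherwise 0xff.
void BinarizeWithDoubleThresholdSimd(
    const unsigned char sourceimage[],
    const std::size_t nrpixels,
    const unsigned char lowerthreshold[],
    const unsigned char upperthreshold[],
    unsigned char binarizedimage[] )
{
    assert( nrpixels == 0 ||
            (sourceimage && lowerthreshold && upperthreshold && binarizedimage) );
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE2
    const __m128i allones = _mm_set1_epi8( -1 );
    // Process 16 pixels per iteration
    for( ; i + 16 <= nrpixels; i += 16 )
    {
        const __m128i src = _mm_loadu_si128(
            reinterpret_cast<const __m128i *>( sourceimage + i ) );
        const __m128i lo = _mm_loadu_si128(
            reinterpret_cast<const __m128i *>( lowerthreshold + i ) );
        const __m128i hi = _mm_loadu_si128(
            reinterpret_cast<const __m128i *>( upperthreshold + i ) );
        // SSE2 has no unsigned byte compare: max(a, b) == a means a >= b
        const __m128i gelo = _mm_cmpeq_epi8( _mm_max_epu8( src, lo ), src );
        const __m128i gehi = _mm_cmpeq_epi8( _mm_max_epu8( src, hi ), src );
        // Outside the range: (source < lower) OR (source >= upper)
        const __m128i out = _mm_or_si128( _mm_xor_si128( gelo, allones ), gehi );
        _mm_storeu_si128( reinterpret_cast<__m128i *>( binarizedimage + i ), out );
    }
#endif
    // Remaining pixels (or all pixels if SSE2 is unavailable)
    for( ; i < nrpixels; i++ )
        binarizedimage[ i ] = (unsigned char)(
            (sourceimage[ i ] >= lowerthreshold[ i ] &&
             sourceimage[ i ] < upperthreshold[ i ]) ? 0x0 : 0xff);
}
```

The compare results are already 0x0 or 0xff per byte, so no branch or extra select step is needed to produce the output pixel.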
Here is a variant in which the thresholds are constant across all pixels:
void BinarizeWithDoubleThreshold(
    const unsigned char sourceimage[],
    const size_t nrpixels,
    const unsigned char lowerthreshold,
    const unsigned char upperthreshold,
    unsigned char binarizedimage[]
    )
{
    for( size_t i = 0; i < nrpixels; i++ )
        binarizedimage[ i ] = (unsigned char)(
            (sourceimage[ i ] >= lowerthreshold &&
             sourceimage[ i ] < upperthreshold) ? 0x0 : 0xff);
}
These are the corresponding measurement results:
* Specialized, separate functions for cached and
uncached data
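Because the thresholds in this variant do not change from pixel to pixel, the result depends only on the 8-bit source value. One possible way to exploit this, sketched below, is a 256-entry lookup table. This is an illustration only, not the optimization measured on this page; the names BuildBinarizationTable and BinarizeWithLookupTable are hypothetical, and whether a table beats the two comparisons depends on the compiler and CPU.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: precompute the binarized value for every possible
// source pixel value, using the same rule as the function above.
void BuildBinarizationTable(
    const unsigned char lowerthreshold,
    const unsigned char upperthreshold,
    unsigned char table[ 256 ] )
{
    for( int v = 0; v < 256; v++ )
        table[ v ] = (unsigned char)(
            (v >= lowerthreshold && v < upperthreshold) ? 0x0 : 0xff);
}

// Hypothetical helper: binarize a line with one table lookup per pixel.
void BinarizeWithLookupTable(
    const unsigned char sourceimage[],
    const std::size_t nrpixels,
    const unsigned char table[ 256 ],
    unsigned char binarizedimage[] )
{
    assert( nrpixels == 0 || (sourceimage && table && binarizedimage) );
    for( std::size_t i = 0; i < nrpixels; i++ )
        binarizedimage[ i ] = table[ sourceimage[ i ] ];
}
```

The table is built once per threshold pair and can then be reused for every line of the image.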
As these examples show, the speed-up factors can be impressively high, especially when the memory system is not holding the processor back. They also show that slower CPUs profit more from optimization when a memory-intensive algorithm works on uncached data.
Digital Signal Processing: Artificial Neural Network
The (core) task here is to compute the raw* activation (or output) level of a neuron. This is done by summing the products of each input neuron value and its corresponding weight, followed by a rounded normalization. The input neuron values range from 0 to +127 and the weights from -127 to +127, so this is essentially a dot product of an unsigned byte array with a signed byte array.
* The term raw refers to the fact that this is not the output value of the neuron. The output value is generated by applying some sort of sigmoid function to the raw value.
#include <climits> // SCHAR_MAX
#include <cstddef> // size_t
class Neuron_t
{
    const unsigned char * const InputNeuronValues; // Value range = 0 .. +127
    const signed char * const Weights;             // Value range = -127 .. +127
    const size_t NrInputNeurons;                   // Must be > 0
public:
    Neuron_t(
        const unsigned char * const inputneuronvalues,
        const signed char * const weights,
        const size_t nrinputneurons ) :
        InputNeuronValues( inputneuronvalues ),
        Weights( weights ),
        NrInputNeurons( nrinputneurons )
    {
    }
    // Compute raw, normalized, rounded output value
    // Return value = -127 .. +127
    signed char ComputeRawOutputValue( void )
    {
        long sum = 0;
        for( size_t z = 0; z < NrInputNeurons; z++ )
            sum += InputNeuronValues[ z ] * Weights[ z ];
        // Divide in signed arithmetic: mixing long with size_t would convert
        // a negative sum to unsigned. Round half away from zero.
        const long divisor = (long)NrInputNeurons * SCHAR_MAX;
        const long half = divisor / 2;
        return( (signed char)((sum >= 0 ? sum + half : sum - half) / divisor) );
    }
};
The following measurements were made:
* The SSE2 instruction set extension is only available on the Pentium 4 and later processors
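The summation loop of the neuron is the part that lends itself to SIMD. The following is a minimal sketch of the dot product using SSE2 intrinsics, not the assembly code that was measured; the function name DotProductU8S8 and the SKETCH_HAVE_SSE2 macro are hypothetical. It assumes the number of inputs is small enough (roughly below half a million) that the 32-bit partial sums cannot overflow, and it falls back to the plain loop when SSE2 is unavailable.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: sum of inputs[z] * weights[z] for z = 0 .. n-1.
long DotProductU8S8(
    const unsigned char * const inputs,
    const signed char * const weights,
    const std::size_t n )
{
    assert( n == 0 || (inputs && weights) );
    long sum = 0;
    std::size_t z = 0;
#ifdef SKETCH_HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;                       // four 32-bit partial sums
    for( ; z + 16 <= n; z += 16 )
    {
        const __m128i in = _mm_loadu_si128(
            reinterpret_cast<const __m128i *>( inputs + z ) );
        const __m128i w = _mm_loadu_si128(
            reinterpret_cast<const __m128i *>( weights + z ) );
        // 0xff where the weight is negative; used as the high byte below
        const __m128i wsign = _mm_cmpgt_epi8( zero, w );
        // Widen to 16 bits: zero-extend inputs, sign-extend weights
        const __m128i inlo = _mm_unpacklo_epi8( in, zero );
        const __m128i inhi = _mm_unpackhi_epi8( in, zero );
        const __m128i wlo = _mm_unpacklo_epi8( w, wsign );
        const __m128i whi = _mm_unpackhi_epi8( w, wsign );
        // Multiply 16-bit pairs and add adjacent products into 32-bit lanes
        acc = _mm_add_epi32( acc, _mm_madd_epi16( inlo, wlo ) );
        acc = _mm_add_epi32( acc, _mm_madd_epi16( inhi, whi ) );
    }
    int lanes[ 4 ];
    _mm_storeu_si128( reinterpret_cast<__m128i *>( lanes ), acc );
    sum = (long)lanes[ 0 ] + lanes[ 1 ] + lanes[ 2 ] + lanes[ 3 ];
#endif
    // Remaining elements (or all elements if SSE2 is unavailable)
    for( ; z < n; z++ )
        sum += inputs[ z ] * weights[ z ];
    return sum;
}
```

The rounded normalization is then applied once to the returned sum, so it stays in scalar code.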
Platforms: x86 · Pentium · Pentium MMX · Pentium II · Pentium III · Pentium 4 · Core · Core 2 · Xeon · Itanium · Athlon · DSPs · Embedded CPUs · Windows · Linux · RTOSs
Especially Benefiting Application Areas: Image Processing · Signal Processing · High Performance Computing / Number Crunching · Simulations · Compression · Games · 3D Software · Device Drivers · Multi-processor Systems · Multi-Computer Systems / Clusters · Embedded Devices · Real-time Systems · Interactive Systems · And many more...
Last change: Oct. 29, 2006