Hayes Technologies - Software Speed Optimization

Software Speed Optimization · Performance Optimization · High Performance Computing · Number Crunching · C/C++ · Assembly/Assembler · SIMD · MMX · SSE · SSE2 · SSE3 · 3DNow!

Case Studies

This page details a number of Software Speed Optimization cases. For a better understanding the examples have been boiled down to the core of the task.

In these examples only the speed-up factors for optimized assembly code has been given. Optimized C/C++ code would lead to somewhat lower numbers.

Please select from one of the case studies:

 

More case studies will be added, so please check this page regularly!

 

Image Processing: Double Threshold Binarization

This is an example from machine vision. The task is to generate a binarized image using 2 thresholds (lower and upper); the thresholds are given per pixel in form of images. If a source image pixel is between these 2 thresholds the binarized image pixel shall be 0x0, otherwise 0xff.

The following is the C core function for handling (part of) a horizontal line:

 

void BinarizeWithDoubleThreshold(
  const unsigned char sourceimage[],
  const size_t nrpixels,
  const unsigned char lowerthreshold[],
  const unsigned char upperthreshold[],
  unsigned char binarizedimage[]
)
{
  for( size_t i = 0; i < nrpixels; i++ )
    binarizedimage[ i ] = (unsigned char)(
    (sourceimage[ i ] >= lowerthreshold[ i ]  &&
    sourceimage[ i ] < upperthreshold[ i ]) ? 0x0 : 0xff);
}

 

These are the measurement results:

System

CPU Pentium III, 1 GHz Mobile Pentium III, 600 MHz Pentium 4, 1.5 GHz Athlon, 1.2 GHz Itanium, 733 MHz
RAM 256 MB SDRAM, 133 MHz 256 MB SDRAM, 100 MHz 256 MB RDRAM, 800 MHz 256 MB SDRAM, 133 MHz 1 GB SDRAM, 133 MHz
Operating System MS Windows 2000 Pro MS Windows XP Pro MS Windows 2000 Pro MS Windows XP Pro MS Windows 2000 Server IA64

Software

Compiler MS Visual C/C++ 6.0 Intel C/C++ 5.0
Optimization switches /G6 /Ox /Ot /Og /Oi /Ob2 /Ox /Ot /Og /Oi /Oa***
Function parameter(s) sourceimage[] = random values
lowerthreshold[] = random values
upperthreshold[] = random values
nrpixels = 1024

Results

Data cached
MPixels / s (C++) 61.49 31.08 17.95 20.17 144.51
MPixels / s (MMX ASM*) 2016.49 988.37 927.30 432.37 NA
Speed-Up Factor 32.79 31.81 51.66 21.44 NA
MPixels / s (SSE2 ASM*) NA** NA** 1439.56 NA** NA**
Speed-Up Factor NA** NA** 85.18 NA** NA**
Data uncached
MPixels / s (C++) 40.87 25.80 16.87 16.46 48.94
MPixels / s (MMX ASM*) 109.99 131.51 339.41 92.62 NA
Speed-Up Factor 2.69 5.10 20.12 5.63 NA
MPixels / s (SSE2 ASM*) NA** NA** 449.90 NA** NA**
Speed-Up Factor NA** NA** 26.67 NA** NA**

* Specialized, separate functions for cached and uncached data
** The SSE2 instruction set extension is only availably on the Pentium 4 and later processors
*** With aliasing allowed the results are much lower

 

Here comes a variant: The thresholds are constant for each pixel:

 
void BinarizeWithDoubleThreshold(
  const unsigned char sourceimage[],
  const size_t nrpixels,
  const unsigned char lowerthreshold,
  const unsigned char upperthreshold,
  unsigned char binarizedimage[]
)
{
  for( size_t i = 0; i < nrpixels; i++ )
    binarizedimage[ i ] = (unsigned char)(
    (sourceimage[ i ] >= lowerthreshold  &&
    sourceimage[ i ] < upperthreshold) ? 0x0 : 0xff);
}

 

These are the corresponding measurement results:

System

CPU Pentium III, 1 GHz Mobile Pentium III, 600 MHz Pentium 4, 1.5 GHz Athlon, 1.2 GHz Itanium, 733 MHz
RAM 256 MB SDRAM, 133 MHz 256 MB SDRAM, 100 MHz 256 MB RDRAM, 800 MHz 256 MB SDRAM, 133 MHz 1 GB SDRAM, 133 MHz
Operating System MS Windows 2000 Pro MS Windows XP Pro MS Windows 2000 Pro MS Windows XP Pro MS Windows 2000 Server IA64

Software

Compiler MS Visual C/C++ 6.0 Intel C/C++ 5.0
Optimization switches /G6 /Ox /Ot /Og /Oi /Ob2 /Ox /Ot /Og /Oi /Oa***
Function parameter(s) sourceimage[] = random values
lowerthreshold = 128
upperthreshold = 192
nrpixels = 1024

Results

Data cached
MPixels / s (C++) 65.69 33.10 34.61 68.57 165.34
MPixels / s (MMX ASM*) 3046.02 1524.74 2199.49 5270.02 NA
Speed-Up Factor 46.37 46.07 63.54 76.85 NA
MPixels / s (SSE2 ASM*) NA** NA** 3535.08 NA** NA**
Speed-Up Factor NA** NA** 102.13 NA** NA**
Data uncached
MPixels / s (C++) 50.84 29.19 30.66 46.95 66.27
MPixels / s (MMX ASM*) 259.40 272.42 772.75 352.68 NA
Speed-Up Factor 5.10 9.33 25.21 7.51 NA
MPixels / s (SSE2 ASM*) NA** NA** 993.53 NA** NA**
Speed-Up Factor NA** NA** 32.41 NA** NA**

* Specialized, separate functions for cached and uncached data
** The SSE2 instruction set extension is only availably on the Pentium 4 and later processors

*** With aliasing allowed the results are much lower

 

As can be seen from these examples the speed-up factors can be impressively high, especially if the memory system is not bogging the processor down. What also can be seen is that slower CPUs profit more from optimization in case of memory-intensive algorithms where the data is uncached.

Jump to top of page

 

Digital Signal Processing: Artificial Neural Network

The (core) task here is to compute the raw* activation (or output) level of a neuron. This is done by summing up the products of each input neuron and its corresponding weight, followed by a rounded normalization. The input neurons have a value range of 0 .. +127 and the weights of -127 .. +127. So basically this is a vector dot product of an unsigned byte array with a signed byte array.

* The term raw refers to the fact that this is not the output value of the neuron. The output value is generated by computing some sort of sigmoid function, taking the raw value as input.

 
class Neuron_t
{
  const unsigned char * const InputNeuronValues; // Value range == 0 .. +127
  const signed char * const Weights; // Value range = -127 .. +127
  const size_t NrInputNeurons;
public:
  Neuron_t(
  const unsigned char * const inputneuronvalues,
  const signed char * const weights,
  const size_t nrinputneurons ) :
  InputNeuronValues( inputneuronvalues ),
  Weights( weights ),
  NrInputNeurons( nrinputneurons )
  {
  }

  // Compute raw, normalized, rounded output value
  // Return value = -127 .. +127
  signed char ComputeRawOutputValue( void )
  {
    long sum = 0;
    for( size_t z = 0; z < NrInputNeurons; z++ )
      sum += InputNeuronValues[ z ] * Weights[ z ];
    return( (sum + NrInputNeurons * SCHAR_MAX / 2) /
    (NrInputNeurons * SCHAR_MAX) );
  }
};

 

The following measurements were made:

System
CPU Pentium III, 1 GHz Mobile Pentium III, 600 MHz Pentium 4, 1.5 GHz Athlon, 1.2 GHz Itanium, 733 MHz
RAM 256 MB SDRAM, 133 MHz 256 MB SDRAM, 100 MHz 256 MB RDRAM, 800 MHz 256 MB SDRAM, 133 MHz 1 GB SDRAM, 133 MHz
Operating System MS Windows 2000 Pro MS Windows XP Pro MS Windows 2000 Pro MS Windows XP Pro MS Windows 2000 Server IA64
Software
Compiler MS Visual C/C++ 6.0 Intel C/C++ 5.0
Optimization switches /G6 /Ox /Ot /Og /Oi /Ob2 /Ox /Ot /Og /Oi
Class member(s) NrInputNeurons = 1024
Results
Data cached
Mega-Multiply-Adds / s (C++) 242.79 122.06 121.32 197.77 246.03
Mega-Multiply-Adds / s (MMX/SSE1 ASM) 1108.46 560.82 575.55 1160.11 NA
Speed-Up Factor 4.57 4.59 4.74 5.87 NA
Mega-Multiply-Adds / s (SSE2 ASM) NA* NA* 1052.62 NA* NA*
Speed-Up Factor NA* NA* 8.68 NA* NA*
Data uncached
Mega-Multiply-Adds / s (C++) 101.84 71.14 120.36 84.09 95.81
Mega-Multiply-Adds / s (MMX/SSE1 ASM) 283.82 332.46 632.07 235.70 NA
Speed-Up Factor 2.79 4.67 5.25 2.80 NA
Mega-Multiply-Adds / s (SSE2 ASM) NA* NA* 893.39 NA* NA*
Speed-Up Factor NA* NA* 7.42 NA* NA*

* The SSE2 instruction set extension is only availably on the Pentium 4 and later processors

Jump to top of page

Platforms: x86 · Pentium · Pentium MMX · Pentium II · Pentium III · Pentium 4 · Core · Core 2 · Xeon · Itanium · Athlon · DSPs · Embedded CPUs · Windows · Linux · RTOSs

Especially Benefiting Application Areas: Image Processing · Signal processing · High Performance Computing / Number Crunching · Simulations · Compression · Games · 3D Software · Device Drivers · Multi-processor Systems · Multi-Computer Systems / Clusters · Embedded Devices · Real-time Systems · Interactive Systems · And many more...