Convert signed short to float in C++ SIMD

  • A+
Category:Languages

I have an array of signed short that I want to divide by 2048 and get an array of float as a result.

I found SSE: convert short integer to float that allows to convert unsigned shorts to floats, but I want to handle signed shorts also.

The code below works but only for positive shorts.

// We want to divide some signed short by 2048 and get a float. const auto floatScale = _mm256_set1_ps(2048);  short* shortsInput = /* values from somewhere */; float* floatsOutput = /* initialized */;  __m128i* m128iInput = (__m128i*)&shortsInput[0];  // Converts the short vectors to 2 float vectors. This works, but only for positive shorts. __m128i m128iLow = _mm_unpacklo_epi16(m128iInput[0], _mm_setzero_si128()); __m128i m128iHigh = _mm_unpackhi_epi16(m128iInput[0], _mm_setzero_si128()); __m128 m128Low = _mm_cvtepi32_ps(m128iLow); __m128 m128High = _mm_cvtepi32_ps(m128iHigh);  // Puts the 2 __m128 vectors into 1 __m256. __m256 singleComplete = _mm256_castps128_ps256(m128Low); singleComplete = _mm256_insertf128_ps(singleComplete, m128High, 1);  // Finally do the math __m256 scaledVect = _mm256_div_ps(singleComplete, floatScale);  // and puts the result where needed. _mm256_storeu_ps(floatsOutput[0], scaledVect); 

How can I convert my signed shorts to floats? Or maybe there's a better way to tackle this problem?


EDIT: I tried the different answers compared to a non SIMD algorithm, doing it 10M times over a 2048 array, on an AMD Ryzen 7 2700 at ~3.2GHz. I'm using Visual 15.7.3 with mostly the default config:

/permissive- /Yu"stdafx.h" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl  /Fd"x64/Release/vc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /arch:AVX2 /Gd /Oi /MD /openmp /FC /Fa"x64/Release/" /EHsc /nologo /Fo"x64/Release/" /Fp"x64/Release/test.pch" /diagnostics:classic  

Note that I'm very new to SIMD and haven't use C++ for ages. Here's what I get (I reran each test separately and not one after the other and got better results like that):

  • No SIMD: 7300ms
  • wim's answer: 2300ms
  • chtz's SSE2 answer: 1650ms
  • chtz's AVX2 answer: 2100ms

So I get a nice speedup by using SIMD, and chtz's SSE2 answer, though being more verbose and complex to understand, is faster. (At least when compiled with AVX enabled, so it avoids extra instructions to copy registers by using 3-operand VEX-coded instructions. On Intel CPUs, the AVX2 versions should be significantly faster than the 128-bit version.)

Here's my test code:

const int size = 2048; const int loopSize = (int)1e7;  float* noSimd(short* shortsInput) {     float* floatsOutput = new float[size];      auto startTime = std::chrono::high_resolution_clock::now();      for (int i = 0; i < loopSize; i++) {         for (int j = 0; j < size; j++) {             floatsOutput[j] = shortsInput[j] / 2048.0f;         }     }      auto stopTime = std::chrono::high_resolution_clock::now();     long long totalTime = (stopTime - startTime).count();      printf("%lld noSimd/n", totalTime);      return floatsOutput; }  float* wimMethod(short* shortsInput) {     const auto floatScale = _mm256_set1_ps(1.0f / 2048.0f);     float* floatsOutput = new float[size];      auto startTime = std::chrono::high_resolution_clock::now();      for (int i = 0; i < loopSize; i++) {         for (int j = 0; j < size; j += 8) {             __m128i short_vec = _mm_loadu_si128((__m128i*)&shortsInput[j]);             __m256i int_vec = _mm256_cvtepi16_epi32(short_vec);             __m256  singleComplete = _mm256_cvtepi32_ps(int_vec);              // Finally do the math             __m256 scaledVect = _mm256_mul_ps(singleComplete, floatScale);              // and puts the result where needed.             _mm256_storeu_ps(&floatsOutput[j], scaledVect);         }     }      auto stopTime = std::chrono::high_resolution_clock::now();     long long totalTime = (stopTime - startTime).count();      printf("%lld wimMethod/n", totalTime);      return floatsOutput; }  float* chtzMethodSSE2(short* shortsInput) {     float* floatsOutput = new float[size];      auto startTime = std::chrono::high_resolution_clock::now();      for (int i = 0; i < loopSize; i++) {         for (int j = 0; j < size; j += 8) {             // get input:             __m128i val = _mm_loadu_si128((__m128i*)&shortsInput[j]);             // add 0x8000 to wrap to unsigned short domain:             val = _mm_add_epi16(val, const0x8000);             // interleave with upper part of float(1<<23)/2048.f:             __m128i lo = _mm_unpacklo_epi16(val, const0x4580);             __m128i hi = _mm_unpackhi_epi16(val, const0x4580);             // interpret as float and subtract float((1<<23) + (0x8000))/2048.f             __m128 lo_f = _mm_sub_ps(_mm_castsi128_ps(lo), constFloat);             __m128 hi_f = _mm_sub_ps(_mm_castsi128_ps(hi), constFloat);             // store:             _mm_storeu_ps(&floatsOutput[j], lo_f);             _mm_storeu_ps(&floatsOutput[j] + 4, hi_f);         }     }      auto stopTime = std::chrono::high_resolution_clock::now();     long long totalTime = (stopTime - startTime).count();      printf("%lld chtzMethod/n", totalTime);      return floatsOutput; }  float* chtzMethodAVX2(short* shortsInput) {     const auto floatScale = _mm256_set1_ps(1.0f / 2048.0f);     float* floatsOutput = new float[size];      auto startTime = std::chrono::high_resolution_clock::now();      for (int i = 0; i < loopSize; i++) {         for (int j = 0; j < size; j += 8) {              // get input:             __m128i val = _mm_loadu_si128((__m128i*)&shortsInput[j]);             // interleave with 0x0000             __m256i val_unpacked = _mm256_cvtepu16_epi32(val);              // 0x4580'8000             const __m256 magic = _mm256_set1_ps(float((1 << 23) + (1 << 15)) / 2048.f);             const __m256i magic_i = _mm256_castps_si256(magic);              /// convert by xor-ing and subtracting magic value:             // VPXOR avoids port5 bottlenecks on Intel CPUs before SKL             __m256 val_f = _mm256_castsi256_ps(_mm256_xor_si256(val_unpacked, magic_i));             __m256 converted = _mm256_sub_ps(val_f, magic);             // store:             _mm256_storeu_ps(&floatsOutput[j], converted);         }     }      auto stopTime = std::chrono::high_resolution_clock::now();     long long totalTime = (stopTime - startTime).count();      printf("%lld chtzMethod2/n", totalTime);      return floatsOutput; } 

 


You can replace the standard way of doing convert epi16->epi32->float and multiplying by 1.f/2048.f, by manually composing a float.

This works because the divisor is a power of 2, so manually composing the float just means a different exponent.

Thanks to @PeterCordes, here is an optimized AVX2 version of this idea, using XOR to set the upper bytes of the 32-bit float at the same time as flipping the sign bit of the integer value. FP SUB turns those low bits of the mantissa into the correct FP value:

// get input: __m128i val = _mm_loadu_si128((__m128i*)input); // interleave with 0x0000 __m256i val_unpacked = _mm256_cvtepu16_epi32(val);  // 0x4580'8000 const __m256 magic = _mm256_set1_ps(float((1<<23) + (1<<15))/2048.f); const __m256i magic_i = _mm256_castps_si256(magic);  /// convert by xor-ing and subtracting magic value: // VPXOR avoids port5 bottlenecks on Intel CPUs before SKL __m256 val_f = _mm256_castsi256_ps(_mm256_xor_si256(val_unpacked, magic_i)); __m256 converted = _mm256_sub_ps(val_f, magic); // store: _mm256_storeu_ps(output, converted); 

See it on the Godbolt compiler explorer with gcc and clang; on Skylake i7-6700k, a 2048 element loop that's hot in cache takes ~360 clock cycles, the same speed (to within measurement error) as @wim's version that does the standard sign-extend/convert/multiply (with a similar amount of loop unrolling). Tested by @PeterCordes with Linux perf. But on Ryzen, this could be significantly faster, because we avoid _mm256_cvtepi32_ps (Ryzen has 1 per 2 clock throughput for vcvtdq2ps ymm: http://agner.org/optimize/.)

The xor of 0x8000 with the lower half is equivalent to adding/subtracting 0x8000, since overflow/carry is ignored. And coincidentally, this allows to use the same magic constant for XOR-ing and subtracting.

Strangely, gcc and clang prefer to replace the subtraction with an addition of -magic, which will not re-use the constant ... They prefer using add because it's commutative, but in this case there's no benefit because they're not using it with a memory operand.


Here's an SSE2 version that does the signed/unsigned flip separately from setting the high 2 bytes of a 32-bit FP bit-pattern.

We're using one _mm_add_epi16, two _mm_unpackXX_epi16 and two _mm_sub_ps for 8 values (the _mm_castsi128_ps are no-ops, and the _mm_set would be cached in registers):

// get input: __m128i val = _mm_loadu_si128((__m128i*)input); // add 0x8000 to wrap to unsigned short domain: // val = _mm_add_epi16(val, _mm_set1_epi16(0x8000)); val = _mm_xor_si128(val, _mm_set1_epi16(0x8000));  // PXOR runs on more ports, avoids competing with FP add/sub or unpack on Sandybridge/Haswell.  // interleave with upper part of float(1<<23)/2048.f: __m128i lo = _mm_unpacklo_epi16(val, _mm_set1_epi16(0x4580)); __m128i hi = _mm_unpackhi_epi16(val, _mm_set1_epi16(0x4580)); // interpret as float and subtract float((1<<23) + (0x8000))/2048.f __m128 lo_f = _mm_sub_ps(_mm_castsi128_ps(lo), _mm_set_ps1(float((1<<23) + (1<<15))/2048.f)); __m128 hi_f = _mm_sub_ps(_mm_castsi128_ps(hi), _mm_set_ps1(float((1<<23) + (1<<15))/2048.f)); // store: _mm_storeu_ps(output, lo_f); _mm_storeu_ps(output+4, hi_f); 

Usage demonstration: https://ideone.com/b8BfJd

If your input would have been unsigned short, the _mm_add_epi16 would not be necessary (and the 1<<15 in the _mm_sub_ps would need to be removed, of course). Then you'd have Marat's answer on SSE: convert short integer to float.

This can easily be ported to AVX2 with twice as many conversions per iteration, but care has to be taken regarding the order of output elements (thanks @wim for pointing this out).


Also, for a pure SSE solution, one could simply use _mm_cvtpi16_ps, but that's an Intel library function. There's no single instruction that does this.

// cast input pointer: __m64* input64 = (__m64*)input; // convert and scale: __m128 lo_f = _mm_mul_ps(_mm_cvtpi16_ps(input64[0]), _mm_set_ps1(1.f/2048.f)); __m128 hi_f = _mm_mul_ps(_mm_cvtpi16_ps(input64[1]), _mm_set_ps1(1.f/2048.f)); 

I did not benchmark any solution (nor check for theoretical throughputs or latencies)

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: