Crash with icc: can the compiler invent writes where none existed in the abstract machine?

  • A+
Category:Languages

Consider the following simple program:

#include <cstring> #include <cstdio> #include <cstdlib>  void replace(char *str, size_t len) {     for (size_t i = 0; i < len; i++) {         if (str[i] == '/') {             str[i] = '_';         }     } }  const char *global_str = "the quick brown fox jumps over the lazy dog";  int main(int argc, char **argv) {   const char *str = argc > 1 ? argv[1] : global_str;   replace(const_cast<char *>(str), std::strlen(str));   puts(str);   return EXIT_SUCCESS; } 

It takes an (optional) string on the command line and prints it, with / characters replaced by _. This replacement functionality is implemented by the c_repl function1. For example, a.out foo/bar prints:

foo_bar 

Elementary stuff so far, right?

If you don't specify a string, it conveniently uses the global string the quick brown fox jumps over the lazy dog, which doesn't contain any / characters, and so doesn't undergo any replacement.

Of course, string constants are const char[], so I need to cast away the constness first - that's the const_cast you see. Since the string is never actually modified, I am under the impression this is legal.

gcc and clang compile a binary that has the expected behavior, with or without passing a string on the command line. icc crashes, when you don't provide a string, however:

icc -xcore-avx2 char_replace.cpp && ./a.out Segmentation fault (core dumped) 

The underlying cause is the main loop for c_repl which looks like this:

  400c0c:       vmovdqu ymm2,YMMWORD PTR [rsi]   400c10:       add    rbx,0x20   400c14:       vpcmpeqb ymm3,ymm0,ymm2   400c18:       vpblendvb ymm4,ymm2,ymm1,ymm3   400c1e:       vmovdqu YMMWORD PTR [rsi],ymm4   400c22:       add    rsi,0x20   400c26:       cmp    rbx,rcx   400c29:       jb     400c0c <main+0xfc> 

It is a vectorized loop. The basic idea is that 32 bytes are loaded, and then compared against the / character, forming a mask value with a byte set for each byte that matched, and then the existing string is blended against a vector containing 32 _ characters, effectively replacing only the / characters. Finally, the updated register is written back to the string, with the vmovdqu YMMWORD PTR [rsi],ymm4 instruction.

This final store crashes, because the string is read-only and allocated in the .rodata section of the binary, which is loaded using read-only pages. Of course, the store was a logical "no op", writing back the same characters it read, but the CPU doesn't care!

Is my code legal C++ and therefore I should blame icc for miscompiling this, or I am wading into the UB swamp somewhere?


1 The same crash from the same issue occurs with std::replace on a std::string rather than my "C-like" code, but I wanted to simplify the analysis as much as possible and make it entirely self-contained.

 


ICC's behaviour here is not evidence for UB in ISO C++ or C. I think your reasoning is sound, and this is well-defined. You've found an ICC bug. If anyone cares, report it on their forums: https://software.intel.com/en-us/forums/intel-c-compiler. Existing bug reports in that section of their forum have been accepted by developers, e.g. this one.


We can construct an example where it auto-vectorizes the same way (with unconditional and non-atomic read/maybe-modify/rewrite) where it's clearly illegal, because the read / rewrite is happening on a 2nd string that the C abstract machine doesn't even read.

Thus, we can't trust ICC's code-gen to tell us anything about when we've caused UB, because it will make crashing code even in clearly legal cases.

Godbolt: ICC19.0.1 -O2 -march=skylake (Older ICC only understood options like -xcore-avx2, but modern ICC understands the same -march as GCC/clang.)

#include <stddef.h>  void replace(const char *str1, char *str2, size_t len) {     for (size_t i = 0; i < len; i++) {         if (str1[i] == '/') {             str2[i] = '_';         }     } } 

It checks for overlap between str1[0..len-1] and str2[0..len-1], but for large enough len and no overlap it will use this inner loop:

..B1.15:                        # Preds ..B1.15 ..B1.14                //do{         vmovdqu   ymm2, YMMWORD PTR [rsi+r8]                    #6.13   // load from str2         vpcmpeqb  ymm3, ymm0, YMMWORD PTR [rdi+r8]              #5.24   // compare vs. str1         vpblendvb ymm4, ymm2, ymm1, ymm3                        #6.13   // blend         vmovdqu   YMMWORD PTR [r8+rsi], ymm4                    #6.13   // store to str2         add       r8, 32                                        #4.5    // i+=32         cmp       r8, rax                                       #4.5         jb        ..B1.15       # Prob 82%                      #4.5   // }while(i<len); 

For thread-safety, it's well known that inventing write via non-atomic read/rewrite is unsafe.

The C++ abstract machine never touches str2 at all, so that invalidates any arguments for the one-string version about data-race UB being impossible because reading str at the same time another thread is writing it was already UB. Even C++20 std::atomic_ref doesn't change that, because we're reading through a non-atomic pointer.

But even worse than that, str2 can be nullptr. Or pointing to close to the end of an object (which happens to be stored near the end of a page), with str1 containing chars such that no writes past the end of str2 / the page will happen. We could even arrange for only the very last byte (str2[len-1]) to be in a new page, so that it's one-past-the-end of a valid object. It's even legal to construct such a pointer (as long as you don't deref). But it would be legal to pass str2=nullptr; code behind an if() that doesn't run doesn't cause UB.

Or another thread is running the same search/replace function in parallel, with a different key/replacement that will only write different elements of str2. The non-atomic load/store of unmodified values will step on modified values from the other thread. It's definitely allowed, according to the C++11 memory model, for different threads to simultaneously touch different elements of the same array. C++ memory model and race conditions on char arrays. (This is why char must be as large as the smallest unit of memory the target machine can write without a non-atomic RMW. An internal atomic RMW for a byte stores into cache is fine, though, and doesn't stop byte-store instructions from being useful.)

(This example is only legal with the separate str1/str2 version, because reading every element means the threads would be reading array elements the other thread could be in the middle of writing, which is data-race UB.)

As Herb Sutter mentioned in atomic<> Weapons: The C++ Memory Model and Modern Hardware Part 2: Restrictions on compilers and hardware (incl. common bugs); code generation and performance on x86/x64, IA64, POWER, ARM, and more; relaxed atomics; volatile: weeding out non-atomic RMW code-gen has been an ongoing issue for compilers after C++11 was standardized. We're most of the way there, but highly-aggressive and less-mainstream compilers like ICC clearly still have bugs.


Some less-plausible (to see in a real program) examples that this would also break:

Besides nullptr, you could pass a pointer to (an array of) std::atomic<T> or a mutex where a non-atomic read/rewrite breaks things by inventing writes. (char* can alias anything).

Or str2 points to a buffer that you've carved up for dynamic allocation, and the early part of str1 will have some matches, but later parts of str1 won't have any matches, and that part of str2 is being used by other threads. (And for some reason you can't easily calculate a length that stops the loop short).


For future readers: If you want to let compilers auto-vectorize this way:

You can write source like str2[i] = x ? replacement : str2[i]; that always writes the string in the C++ abstract machine. IIRC, that lets gcc/clang vectorize the way ICC does after doing its unsafe if-conversion to blend.

In theory an optimizing compiler can turn it back into a conditional branch in the scalar cleanup or whatever to avoid dirtying memory unnecessarily. (Or if targeting an ISA like ARM32 where a predicated store is possible, instead of only ALU select operations like x86 cmov, PowerPC isel, or AArch64 csel. ARM32 predicated instructions are architecturally a NOP if the predicate is false).

Or if an x86 compiler chose to use AVX512 masked stores, that would also make it safe to vectorize the way ICC does: masked stores do fault suppression, and never actually store to elements where the mask is false. (When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements?).

vpcmpeqb k1, zmm0, [rdi]   ; compare from memory into mask vmovdqu8 [rsi]{k1}, zmm1   ; masked store that only writes elements where the mask is true 

ICC19 actually does basically this (but with indexed addressing modes) with -march=skylake-avx512. But with ymm vectors because 512-bit lowers max turbo too much to be worth it unless your whole program is heavily using AVX512, on Skylake Xeons anyway.

So I think ICC19 is safe when vectorizing this with AVX512, but not AVX2. Unless there are problems in its cleanup code where it does something more complicated with vpcmpuq and kshift / kor, a zero-masked load, and a masked compare into another mask reg.

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: