Iterating over linked list in C++ is slower than in Go with analogous memory access


In a variety of contexts I've observed that linked list iteration is consistently slower in C++ than in Go by 10-15%. My first attempt at resolving this mystery on Stack Overflow is here. The example I coded up was problematic because:

1) memory access was unpredictable because of heap allocations, and

2) no actual work was being done, so some people's compilers were optimizing away the main loop.

To resolve these issues I have a new program with implementations in C++ and Go. The C++ version takes 1.75 secs compared to 1.48 secs for the Go version. This time, I do one large heap allocation before timing begins and use it to operate an object pool from which I release and acquire nodes for the linked list. This way the memory access should be completely analogous between the two implementations.

Hopefully this makes the mystery more reproducible!

C++:

    #include <iostream>
    #include <sstream>
    #include <fstream>
    #include <stdexcept>
    #include <string>
    #include <vector>
    #include <boost/timer.hpp>

    using namespace std;

    struct Node {
        Node *next; // 8 bytes
        int age;    // 4 bytes
    };

    // Object pool, where every free slot points to the previous free slot
    template<typename T, int n>
    struct ObjPool
    {
        typedef T*       pointer;
        typedef pointer* metapointer;

        ObjPool() :
            _top(NULL),
            _size(0)
        {
            pointer chunks = new T[n];
            for (int i=0; i < n; i++) {
                release(&chunks[i]);
            }
        }

        // Give an available pointer to the object pool
        void release(pointer ptr)
        {
            // Store the current pointer at the given address
            *(reinterpret_cast<metapointer>(ptr)) = _top;

            // Advance the pointer
            _top = ptr;

            // Increment the size
            ++_size;
        }

        // Pop an available pointer off the object pool for program use
        pointer acquire(void)
        {
            if(_size == 0){throw std::out_of_range("");}

            // Pop the top of the stack
            pointer retval = _top;

            // Step back to the previous address
            _top = *(reinterpret_cast<metapointer>(_top));

            // Decrement the size
            --_size;

            // Return the next free address
            return retval;
        }

        unsigned int size(void) const {return _size;}

    protected:
        pointer _top;

        // Number of free slots available
        unsigned int _size;
    };

    Node *nodes = nullptr;
    ObjPool<Node, 1000> p;

    void processAge(int age) {
        // If the object pool is empty (every node is in use), pop off the head
        // of the linked list and release it back to the pool
        if (p.size() == 0) {
            Node *head = nodes;
            nodes = nodes->next;
            p.release(head);
        }

        // Insert the new Node with given age in global linked list. The linked
        // list is sorted by age, so this requires iterating through the nodes.
        Node *node = nodes;
        Node *prev = nullptr;
        while (true) {
            if (node == nullptr || age < node->age) {
                Node *newNode = p.acquire();
                newNode->age = age;
                newNode->next = node;

                if (prev == nullptr) {
                    nodes = newNode;
                } else {
                    prev->next = newNode;
                }

                return;
            }

            prev = node;
            node = node->next;
        }
    }

    int main() {
        Node x = {};
        std::cout << "Size of struct: " << sizeof(x) << "\n"; // 16 bytes

        boost::timer t;
        for (int i=0; i<1000000; i++) {
            processAge(i);
        }

        std::cout << t.elapsed() << "\n";
    }

Go:

    package main

    import (
        "time"
        "fmt"
        "unsafe"
    )

    type Node struct {
        next *Node // 8 bytes
        age int32  // 4 bytes
    }

    // Every free slot points to the previous free slot
    type NodePool struct {
        top *Node
        size int
    }

    func NewPool(n int) NodePool {
        p := NodePool{nil, 0}
        slots := make([]Node, n, n)
        for i := 0; i < n; i++ {
            p.Release(&slots[i])
        }

        return p
    }

    func (p *NodePool) Release(l *Node) {
        // Store the current top at the given address
        *((**Node)(unsafe.Pointer(l))) = p.top
        p.top = l
        p.size++
    }

    func (p *NodePool) Acquire() *Node {
        if p.size == 0 {
            fmt.Printf("Attempting to pop from empty pool!\n")
        }
        retval := p.top

        // Step back to the previous address in stack of addresses
        p.top = *((**Node)(unsafe.Pointer(p.top)))
        p.size--
        return retval
    }

    func processAge(age int32) {
        // If the object pool is empty (every node is in use), pop off the head
        // of the linked list and release it back to the pool
        if p.size == 0 {
            head := nodes
            nodes = nodes.next
            p.Release(head)
        }

        // Insert the new Node with given age in global linked list. The linked
        // list is sorted by age, so this requires iterating through the nodes.
        node := nodes
        var prev *Node = nil
        for true {
            if node == nil || age < node.age {
                newNode := p.Acquire()
                newNode.age = age
                newNode.next = node

                if prev == nil {
                    nodes = newNode
                } else {
                    prev.next = newNode
                }
                return
            }

            prev = node
            node = node.next
        }
    }

    // Linked list of nodes, in ascending order by age
    var nodes *Node = nil
    var p NodePool = NewPool(1000)

    func main() {
        x := Node{};
        fmt.Printf("Size of struct: %d\n", unsafe.Sizeof(x)) // 16 bytes

        start := time.Now()
        for i := 0; i < 1000000; i++ {
            processAge(int32(i))
        }

        fmt.Printf("Time elapsed: %s\n", time.Since(start))
    }

Output:

    clang++ -std=c++11 -stdlib=libc++ minimalPool.cpp -O3; ./a.out
    Size of struct: 16
    1.7548

    go run minimalPool.go
    Size of struct: 16
    Time elapsed: 1.487930629s


The big difference between your two programs is that your Go code ignores errors (and will panic or segfault, if you're lucky, if you empty the pool), while your C++ code propagates errors via exception. Compare:

    if p.size == 0 {
        fmt.Printf("Attempting to pop from empty pool!\n")
    }

vs.

    if(_size == 0){throw std::out_of_range("");}

There are at least three ways[1] to make the comparison fair:

  1. Change the C++ code to ignore the error, as you do in Go.
  2. Change both versions to panic/abort on error.
  3. Change the Go version to handle errors idiomatically,[2] as you do in C++.
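
For the Go side of option 2, a minimal sketch might look like the following. It is not the exact code behind any timing in this answer: the method name AcquireOrPanic is made up for illustration, the pool is cut down to the parts needed, and Release writes l.next directly, which touches the same memory as the unsafe cast in the question because next is the first field of Node.

```go
package main

import "fmt"

type Node struct {
	next *Node
	age  int32
}

// Every free slot points to the previous free slot, as in the question.
type NodePool struct {
	top  *Node
	size int
}

func NewPool(n int) NodePool {
	p := NodePool{}
	slots := make([]Node, n)
	for i := range slots {
		p.Release(&slots[i])
	}
	return p
}

func (p *NodePool) Release(l *Node) {
	// Assumption: writing l.next is equivalent to the question's unsafe cast,
	// since next sits at offset 0 of Node.
	l.next = p.top
	p.top = l
	p.size++
}

// AcquireOrPanic is a hypothetical name. Unlike the question's Acquire, it
// stops the program instead of printing a message and carrying on.
func (p *NodePool) AcquireOrPanic() *Node {
	if p.size == 0 {
		panic("attempting to pop from empty pool")
	}
	retval := p.top
	p.top = retval.next
	p.size--
	return retval
}

func main() {
	p := NewPool(2)
	a := p.AcquireOrPanic()
	b := p.AcquireOrPanic()
	fmt.Println(a != b, p.size)

	// The third acquire finds the pool empty and panics; recover here only
	// so the demo exits cleanly.
	defer func() { fmt.Println("recovered:", recover()) }()
	p.AcquireOrPanic()
}
```

The C++ counterpart would replace the throw with a call such as std::abort().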

So, let's do all of them and compare the results[3]:

  • C++ ignoring error: 1.059329s wall, 1.050000s user + 0.000000s system = 1.050000s CPU (99.1%)
  • C++ abort on error: 1.081585s wall, 1.060000s user + 0.000000s system = 1.060000s CPU (98.0%)
  • Go panic on error: Time elapsed: 1.152942427s
  • Go ignoring error: Time elapsed: 1.196426068s
  • Go idiomatic error handling: Time elapsed: 1.322005119s
  • C++ exception: 1.373458s wall, 1.360000s user + 0.000000s system = 1.360000s CPU (99.0%)

So:

  • Without error handling, C++ is faster than Go.
  • With panicking, Go gets faster,[4] but still not as fast as C++.
  • With idiomatic error handling, C++ slows down a lot more than Go.

Why? This exception never actually happens in your test run, so the actual error-handling code never runs in either language. But clang can't prove that it doesn't happen. And, since you never catch the exception anywhere, that means it has to emit exception handlers and stack unwinders for every non-elided frame all the way up the stack. So it's doing more work on each function call and return—not much more work, but then your function is doing so little real work that the unnecessary extra work adds up.


1. You could also change the C++ version to do C-style error handling, or to use an Option type, and probably other possibilities.

2. This, of course, requires a lot more changes: you need to import errors, change the return type of Acquire to (*Node, error), change the return type of processAge to error, change all your return statements, and add at least two if err != nil { … } checks. But that's supposed to be a good thing about Go, right?

3. While I was at it, I replaced your legacy boost::timer with boost::auto_cpu_timer, so we're now seeing wall clock time (as with Go) as well as CPU time.

4. I won't attempt to explain why, because I don't understand it. From a quick glance at the assembly, it's clearly optimized out some checks, but I can't see why it couldn't optimize out those same checks without the panic.
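
The changes listed in footnote 2 could be sketched roughly as follows. This is a reconstruction under assumptions, not the code that was actually timed: ErrPoolEmpty is a name invented here, Release and Acquire use the next field directly instead of the unsafe cast, and the loop in main is shortened so the demo finishes quickly.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPoolEmpty is a hypothetical sentinel error for this sketch.
var ErrPoolEmpty = errors.New("attempting to pop from empty pool")

type Node struct {
	next *Node
	age  int32
}

type NodePool struct {
	top  *Node
	size int
}

func NewPool(n int) NodePool {
	p := NodePool{}
	slots := make([]Node, n)
	for i := range slots {
		p.Release(&slots[i])
	}
	return p
}

func (p *NodePool) Release(l *Node) {
	l.next = p.top
	p.top = l
	p.size++
}

// Acquire now reports an empty pool to the caller instead of printing.
func (p *NodePool) Acquire() (*Node, error) {
	if p.size == 0 {
		return nil, ErrPoolEmpty
	}
	retval := p.top
	p.top = retval.next
	p.size--
	return retval, nil
}

// Linked list of nodes, in ascending order by age
var nodes *Node
var p = NewPool(1000)

// processAge returns an error so the failure can propagate up to main.
func processAge(age int32) error {
	if p.size == 0 {
		head := nodes
		nodes = nodes.next
		p.Release(head)
	}

	node := nodes
	var prev *Node
	for {
		if node == nil || age < node.age {
			newNode, err := p.Acquire()
			if err != nil {
				return err // first err != nil check
			}
			newNode.age = age
			newNode.next = node
			if prev == nil {
				nodes = newNode
			} else {
				prev.next = newNode
			}
			return nil
		}
		prev = node
		node = node.next
	}
}

func main() {
	for i := 0; i < 5000; i++ {
		if err := processAge(int32(i)); err != nil { // second check
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("free slots:", p.size)
}
```

As in the original test run, the error path is never taken here, because processAge always frees a slot before it acquires one.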
