STL random_shuffle generates highly correlated sequences

  • A+
Category:Languages

Notice the accepted answer pointed out the problem lies in reseeding. Reseeding is not the reason. Tests without reseeding also resulted in high correlations before posting. See Note1.

I generated 1,000,000 uniform random numbers in R, sorted the sequence, and called std::random_shuffle() to permute a copy of this sequence 100 times. The 100 permuted sequences turned out extremely correlated. However, if I do not sort the uniform numbers in the first place, the 100 permuted sequences are more or less uncorrelated. Below are the code.

// [[Rcpp::export]] IntegerVector testRandomShuffle(IntegerVector x, int rd) // rd is the seed {   IntegerVector y(x.begin(), x.end()); // copy   std::srand(rd); // seeding   std::random_shuffle(y.begin(), y.end());   return y; }   /***R v = runif(1000000) vSorted = sort(v) sqc = 1L : length(v) # indexes rd = sample.int(length(v), 100) # random seeds   # Compute correlation matrices corMatForUnsorted = cor(as.data.frame(lapply(rd, function(x)    v[testRandomShuffle(sqc, x)]))) corMatForSorted = cor(as.data.frame(lapply(rd, function(x)    vSorted[testRandomShuffle(sqc, x)])))   # plot histograms par(mfrow = c(1, 2))  hist(corMatForUnsorted[abs(corMatForUnsorted) < 1], breaks = 200, xlab =    "Correlation for unsorted") hist(corMatForSorted[abs(corMatForSorted) < 1], breaks = 200, xlab =    "Correlation for sorted") */ 

STL random_shuffle generates highly correlated sequences

Did I do something very wrong? I simply expect shuffling sorted and unsorted sequences yields more or less the same distributions of correlations. How small should those correlations be is another story. The same experiment with R's native function sample.int() for permutation yields low correlations in both scenarios.

Thank you!

Note1: the problem is that I am on Windows using Rtools 3.4 that comes with g++ 4.9.3. The shuffle function in this version of C++ library works incorrectly.

Note2: confirmed that Rcpp::sample() works in multithreading. A small test case:

// [[Rcpp::depends(RcppParallel)]] # include <RcppParallel.h> # include <Rcpp.h> using namespace Rcpp;   struct testSampleInPara: public RcppParallel::Worker {   IntegerVector tmp;   List rst;     void operator() (std::size_t st, std::size_t end)   {     if(st == 0)     {       // is tmp / rst a copy or a reference ?       std::cout << std::to_string((std::size_t)&tmp[0]) + "/n";       IntegerVector rst0 = Rcpp::sample(tmp, 5);       rst[0] = rst0; // assume rst not a copy     }     else // if(st == 1)     {       std::cout << std::to_string((std::size_t)&tmp[0]) + "/n";       IntegerVector rst1 = Rcpp::sample(tmp, 10);       rst[1] = rst1;     }   }     testSampleInPara(IntegerVector tmp, List rst):     tmp(tmp), rst(rst)   {     RcppParallel::parallelFor(0, 2, *this);   } };   // [[Rcpp::export]] List testIfSampleCopy(IntegerVector tmp) {   List rst(2);   testSampleInPara(tmp, rst);   return rst; }  /***R testIfSampleCopy(1L : 10L) # printout: # 356036792 # 356036792 # [[1]] # [1] 10  5  9  7  8 #  # [[2]] #  [1] 10  3  7  6  2  1  8  4  9  5 */ 

My experience with Rcpp containers is bad regarding performance in multithreading. I usually create pointers or array of pointers to the beginning elements of Rcpp containers, share these pointers and the containers' sizes among threads. Notice Rcpp::sample() takes and returns Rcpp containers.


Your statistics intuition and use of a random number generator is not quite right. If I take your code, add the missing include for Rcpp.h and the namespace directive and simply comment out the reseeding then the two histograms overlap as you expected.

STL random_shuffle generates highly correlated sequences

Code below.

#include <Rcpp.h>  using namespace Rcpp;  // [[Rcpp::export]] IntegerVector testRandomShuffle(IntegerVector x, int rd) { // rd is the seed   IntegerVector y(x.begin(), x.end()); // copy   //std::srand(rd); // seeding   std::random_shuffle(&y[0], &*y.end());   return y; }   /***R #v = runif(1000000) v = runif(10000) vSorted = sort(v) sqc = 1L : length(v) # indexes rd = sample.int(length(v), 100) # random seeds   # Compute correlation matrices corMatForUnsorted = cor(as.data.frame(lapply(rd, function(x)    v[testRandomShuffle(sqc, x)]))) corMatForSorted = cor(as.data.frame(lapply(rd, function(x)    vSorted[testRandomShuffle(sqc, x)])))   # plot histograms par(mfrow = c(1, 2))  hist(corMatForUnsorted[abs(corMatForUnsorted) < 1], breaks = 200, xlab =    "Correlation for unsorted") hist(corMatForSorted[abs(corMatForSorted) < 1], breaks = 200, xlab =    "Correlation for sorted") */ 

I also lowered N by two orders of magnitude. Good enough.

Edit: And for completeness, a pure Rcpp version using only one RNG which work whereever Rcpp works, including Windows with g++-4.9.3.

#include <Rcpp.h> using namespace Rcpp;  // [[Rcpp::export]] IntegerVector testRandomShuffle(IntegerVector x, int rd) { // rd is the seed   IntegerVector y(x.begin(), x.end()); // copy   std::random_shuffle(&y[0], &*y.end());   return y; }  // [[Rcpp::export]] IntegerVector testRandomSample(IntegerVector x) { // rd is the seed   IntegerVector y(x.begin(), x.end()); // copy   return sample(y, y.size()); }  /***R set.seed(123)  # now we're reproducible v <- runif(10000) vSorted <- sort(v) sqc <- 1L : length(v) # indexes rd <- sample.int(length(v), 100) # random seeds  # Compute correlation matrices corMatForUnsorted = cor(as.data.frame(lapply(rd, function(x)    v[testRandomSample(sqc)]))) corMatForSorted = cor(as.data.frame(lapply(rd, function(x)    vSorted[testRandomSample(sqc)])))  # plot histograms par(mfrow = c(1, 2))  hist(corMatForUnsorted[abs(corMatForUnsorted) < 1], breaks = 200,       xlab = "Correlation for unsorted", main="Unsorted") hist(corMatForSorted[abs(corMatForSorted) < 1], breaks = 200,       xlab = "Correlation for sorted", main="Sorted") */ 

It still contains the unused older variant. The result plot is now

STL random_shuffle generates highly correlated sequences

And for completeness, in a benchmark the sample() routine from Rcpp is faster too:

R> library(rbenchmark) R> benchmark(testRandomShuffle(x, 1), testRandomSample(x))[,1:4]                      test replications elapsed relative 2     testRandomSample(x)          100   1.402    1.000 1 testRandomShuffle(x, 1)          100   1.868    1.332 R>  

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: