# STL random_shuffle generates highly correlated sequences

• A+
Category：Languages

Notice the accepted answer pointed out the problem lies in reseeding. Reseeding is not the reason. Tests without reseeding also resulted in high correlations before posting. See Note1.

I generated 1,000,000 uniform random numbers in R, sorted the sequence, and called `std::random_shuffle()` to permute a copy of this sequence 100 times. The 100 permuted sequences turned out extremely correlated. However, if I do not sort the uniform numbers in the first place, the 100 permuted sequences are more or less uncorrelated. Below are the code.

``// [[Rcpp::export]] IntegerVector testRandomShuffle(IntegerVector x, int rd) // rd is the seed {   IntegerVector y(x.begin(), x.end()); // copy   std::srand(rd); // seeding   std::random_shuffle(y.begin(), y.end());   return y; }   /***R v = runif(1000000) vSorted = sort(v) sqc = 1L : length(v) # indexes rd = sample.int(length(v), 100) # random seeds   # Compute correlation matrices corMatForUnsorted = cor(as.data.frame(lapply(rd, function(x)    v[testRandomShuffle(sqc, x)]))) corMatForSorted = cor(as.data.frame(lapply(rd, function(x)    vSorted[testRandomShuffle(sqc, x)])))   # plot histograms par(mfrow = c(1, 2))  hist(corMatForUnsorted[abs(corMatForUnsorted) < 1], breaks = 200, xlab =    "Correlation for unsorted") hist(corMatForSorted[abs(corMatForSorted) < 1], breaks = 200, xlab =    "Correlation for sorted") */ `` Did I do something very wrong? I simply expect shuffling sorted and unsorted sequences yields more or less the same distributions of correlations. How small should those correlations be is another story. The same experiment with R's native function `sample.int()` for permutation yields low correlations in both scenarios.

Thank you!

Note1: the problem is that I am on Windows using Rtools 3.4 that comes with g++ 4.9.3. The shuffle function in this version of C++ library works incorrectly.

Note2: confirmed that `Rcpp::sample()` works in multithreading. A small test case:

``// [[Rcpp::depends(RcppParallel)]] # include <RcppParallel.h> # include <Rcpp.h> using namespace Rcpp;   struct testSampleInPara: public RcppParallel::Worker {   IntegerVector tmp;   List rst;     void operator() (std::size_t st, std::size_t end)   {     if(st == 0)     {       // is tmp / rst a copy or a reference ?       std::cout << std::to_string((std::size_t)&tmp) + "/n";       IntegerVector rst0 = Rcpp::sample(tmp, 5);       rst = rst0; // assume rst not a copy     }     else // if(st == 1)     {       std::cout << std::to_string((std::size_t)&tmp) + "/n";       IntegerVector rst1 = Rcpp::sample(tmp, 10);       rst = rst1;     }   }     testSampleInPara(IntegerVector tmp, List rst):     tmp(tmp), rst(rst)   {     RcppParallel::parallelFor(0, 2, *this);   } };   // [[Rcpp::export]] List testIfSampleCopy(IntegerVector tmp) {   List rst(2);   testSampleInPara(tmp, rst);   return rst; }  /***R testIfSampleCopy(1L : 10L) # printout: # 356036792 # 356036792 # [] #  10  5  9  7  8 #  # [] #   10  3  7  6  2  1  8  4  9  5 */ ``

My experience with `Rcpp` containers is bad regarding performance in multithreading. I usually create pointers or array of pointers to the beginning elements of `Rcpp` containers, share these pointers and the containers' sizes among threads. Notice `Rcpp::sample()` takes and returns `Rcpp` containers.

Your statistics intuition and use of a random number generator is not quite right. If I take your code, add the missing include for `Rcpp.h` and the `namespace` directive and simply comment out the reseeding then the two histograms overlap as you expected. Code below.

``#include <Rcpp.h>  using namespace Rcpp;  // [[Rcpp::export]] IntegerVector testRandomShuffle(IntegerVector x, int rd) { // rd is the seed   IntegerVector y(x.begin(), x.end()); // copy   //std::srand(rd); // seeding   std::random_shuffle(&y, &*y.end());   return y; }   /***R #v = runif(1000000) v = runif(10000) vSorted = sort(v) sqc = 1L : length(v) # indexes rd = sample.int(length(v), 100) # random seeds   # Compute correlation matrices corMatForUnsorted = cor(as.data.frame(lapply(rd, function(x)    v[testRandomShuffle(sqc, x)]))) corMatForSorted = cor(as.data.frame(lapply(rd, function(x)    vSorted[testRandomShuffle(sqc, x)])))   # plot histograms par(mfrow = c(1, 2))  hist(corMatForUnsorted[abs(corMatForUnsorted) < 1], breaks = 200, xlab =    "Correlation for unsorted") hist(corMatForSorted[abs(corMatForSorted) < 1], breaks = 200, xlab =    "Correlation for sorted") */ ``

I also lowered `N` by two orders of magnitude. Good enough.

Edit: And for completeness, a pure Rcpp version using only one RNG which work whereever Rcpp works, including Windows with `g++-4.9.3`.

``#include <Rcpp.h> using namespace Rcpp;  // [[Rcpp::export]] IntegerVector testRandomShuffle(IntegerVector x, int rd) { // rd is the seed   IntegerVector y(x.begin(), x.end()); // copy   std::random_shuffle(&y, &*y.end());   return y; }  // [[Rcpp::export]] IntegerVector testRandomSample(IntegerVector x) { // rd is the seed   IntegerVector y(x.begin(), x.end()); // copy   return sample(y, y.size()); }  /***R set.seed(123)  # now we're reproducible v <- runif(10000) vSorted <- sort(v) sqc <- 1L : length(v) # indexes rd <- sample.int(length(v), 100) # random seeds  # Compute correlation matrices corMatForUnsorted = cor(as.data.frame(lapply(rd, function(x)    v[testRandomSample(sqc)]))) corMatForSorted = cor(as.data.frame(lapply(rd, function(x)    vSorted[testRandomSample(sqc)])))  # plot histograms par(mfrow = c(1, 2))  hist(corMatForUnsorted[abs(corMatForUnsorted) < 1], breaks = 200,       xlab = "Correlation for unsorted", main="Unsorted") hist(corMatForSorted[abs(corMatForSorted) < 1], breaks = 200,       xlab = "Correlation for sorted", main="Sorted") */ ``

It still contains the unused older variant. The result plot is now And for completeness, in a benchmark the `sample()` routine from Rcpp is faster too:

``R> library(rbenchmark) R> benchmark(testRandomShuffle(x, 1), testRandomSample(x))[,1:4]                      test replications elapsed relative 2     testRandomSample(x)          100   1.402    1.000 1 testRandomShuffle(x, 1)          100   1.868    1.332 R>  ``