Efficiently create arrays from a next n elements from a array

  • A+
Category:Languages

Short version:

I'm trying to efficiently create a array like x:

input = [0, 1, 2, 3, 4, 5, 6]  x = [ [0,1,2], [1,2,3], [2,3,4], [3,4,5], [4,5,6] ] 

I've tried simple for looping and it takes too long for the real usecase.

Long version:

(extends short version)

I've got a 400k rows long dataframe, which I need to partition into arrays of a next n elements from the element currently iterated over. Currently I group it just like presented below in the process_data function.

A simple for based iteration takes forever here (2.5min on my hardware to be specific). I've searched itertools and pandas documentation, tried searching here too and couldn't find any fitting solution.

My current super time consuming implementation:

class ModelInputParsing(object):     def __init__(self, data):         self.parsed_dataframe = data.fillna(0)      def process_data(self, lb=50):         self.X, self.Y = [],[]         for i in range(len(self.parsed_dataframe)-lb):             self.X.append(self.parsed_dataframe.iloc[i:(i+lb),-2])             self.Y.append(self.parsed_dataframe.iloc[(i+lb),-1])         return (np.array(self.X), np.array(self.Y)) 

The input data looks like this (where Bid is the mentioned input):

    Bid     Changes     Expected 0   1.20102 NaN         0.000000 1   1.20102 0.000000    0.000000 2   1.20102 0.000000    0.000042 3   1.20102 0.000000    0.000017 4   1.20102 0.000000    0.000025 5   1.20102 0.000000    0.000025 6   1.20102 0.000000    0.000100 ... 

And the output should look like this:

array([[  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,           8.34465027e-06,  -8.34465027e-06,   0.00000000e+00],        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,          -8.34465027e-06,   0.00000000e+00,   3.33786011e-05],        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,           0.00000000e+00,   3.33786011e-05,   0.00000000e+00],        ...,         [  0.00000000e+00,   8.34465027e-06,   1.66893005e-05, ...,          -8.34465027e-06,   0.00000000e+00,   0.00000000e+00],        [  8.34465027e-06,   1.66893005e-05,  -8.34465027e-06, ...,           0.00000000e+00,   0.00000000e+00,   0.00000000e+00],        [  1.66893005e-05,  -8.34465027e-06,   0.00000000e+00, ...,           0.00000000e+00,   0.00000000e+00,   1.66893005e-05]], dtype=float32) len(x) 399950 

Below I've presented x[0] and x[1]. Key here is how the the values move one position back in the next array. For example a first non-zero value moved from 7 to 6 position (0 based position).

The first element:

x[0] array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,          0.00000000e+00,  -4.16040421e-05,   2.49147415e-05,         -8.34465027e-06,   0.00000000e+00,  -7.49230385e-05,          ...,          2.50339508e-05,  -8.34465027e-06,   3.33786011e-05,         -2.50339508e-05,  -8.34465027e-06,   8.34465027e-06,         -8.34465027e-06,   0.00000000e+00], dtype=float32) len(x[0]) 50 

The second element:

x[1] array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,         -4.16040421e-05,   2.49147415e-05,  -8.34465027e-06,          0.00000000e+00,  -7.49230385e-05,  -1.58131123e-04,          ....,         -8.34465027e-06,   3.33786011e-05,  -2.50339508e-05,         -8.34465027e-06,   8.34465027e-06,  -8.34465027e-06,          0.00000000e+00,   3.33786011e-05], dtype=float32) len(x[1]) 50 

I'm curious if there is a way to get this done more efficiently as I'm soon planning to parse +20m rows long datasets.


zip() plus some slicing can do that:

>>> list(zip(input[0:], input[1:], input[2:])) [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)] 

if you need the list elements to be lists, use this:

>>> list(map(list, zip(input[0:], input[1:], input[2:]))) [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]] 

In general, if you need n-tuples instead of triples, you can do:

>>> list(zip(*(input[i:] for i in range(3)))) [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)] 

or

>>> list(map(list, zip(*(input[i:] for i in range(3))))) [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]] 

Another way to do it:

>>> [input[i:i+3] for i in range(len(input)-3+1)] [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]] 

Some benchmarks:

Setup:

import timeit  def ff1(input):     return list(map(list, zip(input[0:], input[1:], input[2:])))  def ff2(input):     return list(map(list, zip(*(input[i:] for i in range(3)))))  def ff3(input):     return [input[i:i+3] for i in range(len(input)-3+1)]  def jg(input):     for i in range(0, len(input) - 2):         yield input[i:i+3]  def jg1(input):     return list(jg(input))  import itertools  def n(input, n=3):     i = list(itertoopls.tee(input, n))     for p, it in enumerate(i):         next(itertools.slice(it, p, p), None)     return zip(*i)  def n1(input, _n=3):     return list(map(list, n(input, _n)))  from numpy.lib.stride_tricks import as_strided  def strided_groupby(n, l=3):     s = n.strides[0]     return as_strided(n, shape=(n.size-l+1,l), strides=(s,s)) 

Results:

>>> input = list(range(10000)) >>> timeit.timeit(stmt='ff1(input)', globals=globals(), number=1000) 1.4750333260162733 >>> timeit.timeit(stmt='ff2(input)', globals=globals(), number=1000) 1.486136345018167 >>> timeit.timeit(stmt='ff3(input)', globals=globals(), number=1000) 1.6864491199958138 >>> timeit.timeit(stmt='jg1(input)', globals=globals(), number=1000) 2.300399674975779 >>> timeit.timeit(stmt='n1(input)', globals=globals(), number=1000) 2.2269885840360075 >>> input_arr = np.array(input) >>> timeit.timeit(stmt='strided_groupby(input_arr)', globals=globals(), number=1000) 0.01855822204379365 

Note that the inner list conversion waste a significant amount of CPU cycles. If you can afford to have tuples instead of lists, as the innermost sequences (i.e. (0,1,2), (1,2,3), ...) that is going to perform better.

For fairness of comparison I applied the same list conversion to all algorithms.

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: