Get cumulative count per 2d array


I have general data, e.g. strings:

import numpy as np

np.random.seed(343)
arr = np.sort(np.random.randint(5, size=(10, 10)), axis=1).astype(str)
print(arr)
[['0' '1' '1' '2' '2' '3' '3' '4' '4' '4']
 ['1' '2' '2' '2' '3' '3' '3' '4' '4' '4']
 ['0' '2' '2' '2' '2' '3' '3' '4' '4' '4']
 ['0' '1' '2' '2' '3' '3' '3' '4' '4' '4']
 ['0' '1' '1' '1' '2' '2' '2' '2' '4' '4']
 ['0' '0' '1' '1' '2' '3' '3' '3' '4' '4']
 ['0' '0' '2' '2' '2' '2' '2' '2' '3' '4']
 ['0' '0' '1' '1' '1' '2' '2' '2' '3' '3']
 ['0' '1' '1' '2' '2' '2' '3' '4' '4' '4']
 ['0' '1' '1' '2' '2' '2' '2' '2' '4' '4']]

I need a cumulative count per column that resets whenever the value changes, so I used pandas.

First, create a DataFrame:

import pandas as pd

df = pd.DataFrame(arr)
print(df)
   0  1  2  3  4  5  6  7  8  9
0  0  1  1  2  2  3  3  4  4  4
1  1  2  2  2  3  3  3  4  4  4
2  0  2  2  2  2  3  3  4  4  4
3  0  1  2  2  3  3  3  4  4  4
4  0  1  1  1  2  2  2  2  4  4
5  0  0  1  1  2  3  3  3  4  4
6  0  0  2  2  2  2  2  2  3  4
7  0  0  1  1  1  2  2  2  3  3
8  0  1  1  2  2  2  3  4  4  4
9  0  1  1  2  2  2  2  2  4  4

How it works for one column:

First, compare the data with its shifted version and take the cumulative sum, which labels each consecutive run:

a = (df[0] != df[0].shift()).cumsum()
print(a)
0    1
1    2
2    3
3    3
4    3
5    3
6    3
7    3
8    3
9    3
Name: 0, dtype: int32

And then call GroupBy.cumcount:

b = a.groupby(a).cumcount() + 1
print(b)
0    1
1    1
2    1
3    2
4    3
5    4
6    5
7    6
8    7
9    8
dtype: int64

To apply the solution to all columns, use apply:

print(df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
   0  1  2  3  4  5  6  7  8  9
0  1  1  1  1  1  1  1  1  1  1
1  1  1  1  2  1  2  2  2  2  2
2  1  2  2  3  1  3  3  3  3  3
3  2  1  3  4  1  4  4  4  4  4
4  3  2  1  1  1  1  1  1  5  5
5  4  1  2  2  2  1  1  1  6  6
6  5  2  1  1  3  1  1  1  1  7
7  6  3  1  1  1  2  2  2  2  1
8  7  1  2  1  1  3  1  1  1  1
9  8  2  3  2  2  4  1  1  2  2

But this is slow on large data. Is it possible to create some fast numpy solution?

I found solutions that work only for 1d arrays; a typical one is sketched below.
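For reference, the usual 1d pattern looks like this (a minimal sketch of the approach I found; grp_range_1d is a hypothetical name):

import numpy as np

def grp_range_1d(a):
    # cumulative count that resets at every value change, 1d case
    reset = np.r_[True, a[1:] != a[:-1]]        # True at the start of each run
    starts = np.flatnonzero(reset)              # index where each run begins
    lens = np.diff(np.append(starts, a.size))   # length of each run
    # position within the run, 1-based
    return np.arange(a.size) - np.repeat(starts, lens) + 1

print(grp_range_1d(np.array(['0', '1', '1', '2'])))  # [1 1 2 1]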

 


And the numba solution. For such a tricky problem it always wins, here by a 7x factor vs numpy, since only one pass over res is done.

import numpy as np
from numba import njit

@njit
def thefunc(arrc):
    # arrc[i, j] is True when arr[i+1, j] == arr[i, j]
    m, n = arrc.shape
    res = np.empty((m + 1, n), np.uint32)
    res[0] = 1
    for i in range(1, m + 1):
        for j in range(n):
            if arrc[i - 1, j]:
                res[i, j] = res[i - 1, j] + 1   # same value as above: keep counting
            else:
                res[i, j] = 1                   # value changed: reset the counter
    return res

def numbering(arr):
    return thefunc(arr[1:] == arr[:-1])

I need to externalize arr[1:]==arr[:-1] since numba doesn't support strings.
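The numpy contender grp_range_2dcol used in the timings below is not reproduced in this post; a minimal sketch of a vectorized equivalent might look like this (my reconstruction, not the original code):

import numpy as np

def grp_range_2dcol(a):
    # hypothetical reconstruction: per-column cumulative count,
    # resetting wherever the value changes
    m, n = a.shape
    reset = np.ones((m, n), dtype=bool)
    reset[1:] = a[1:] != a[:-1]            # True at the start of every run
    flat = reset.T.ravel()                 # column-major: each column is contiguous
    starts = np.flatnonzero(flat)
    lens = np.diff(np.append(starts, flat.size))
    out = np.arange(flat.size) - np.repeat(starts, lens) + 1
    return out.reshape(n, m).T

On the 10x10 arr above, this should produce the same table as the pandas apply shown earlier.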

In [75]: %timeit numbering(arr)
13.7 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [76]: %timeit grp_range_2dcol(arr)
111 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For a bigger array (100,000 rows x 100 cols), the gap is not so wide:

In [168]: %timeit a=grp_range_2dcol(arr)
1.54 s ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [169]: %timeit a=numbering(arr)
625 ms ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If arr can be converted to 'S8', we can save a lot of time:

In [398]: %timeit arr[1:]==arr[:-1]
584 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [399]: %timeit arr.view(np.uint64)[1:]==arr.view(np.uint64)[:-1]
196 ms ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
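A sketch of how that fits together, assuming every entry fits in 8 bytes (the 'S8' conversion truncates longer strings):

import numpy as np

# assumption: every string in arr fits in 8 bytes, so 'S8' loses nothing
arr8 = arr.astype('S8')
# each 'S8' element is exactly 8 bytes, so it can be reinterpreted as one uint64;
# comparing machine words is much cheaper than comparing byte strings
u = arr8.view(np.uint64)
res = thefunc(u[1:] == u[:-1])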
