Cumulative apply within window defined by other columns


I am trying to apply a function cumulatively to values that lie within a window defined by the 'start' and 'finish' columns. 'start' and 'finish' define the interval during which a row's value is 'active'; for each row, I want the sum of all values that are 'active' at that row's 'start' time.

Here is a brute-force example that does what I am after - is there a more elegant, faster, or more memory-efficient way of doing this?

import pandas as pd

df = pd.DataFrame(data=[[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
                  columns=['start', 'finish', 'val'])
df['dummy'] = 1
df = df.merge(df, on=['dummy'], how='left')
df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
val = df.groupby('start_x')['val_y'].sum()

Originally, df is:

   start  finish  val
0      1       3  100
1      2       4  200
2      3       6  300
3      4       6  400
4      5       6  500

The result I am after is:

1     100
2     300
3     500
4     700
5    1200
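As an aside on the brute-force approach (a sketch I am adding, not from the original question): on pandas 1.2 or newer, the dummy column can be replaced by an explicit cross join, which produces the same row pairs.

import pandas as pd

df = pd.DataFrame(data=[[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
                  columns=['start', 'finish', 'val'])

pairs = df.merge(df, how='cross')                          # Cartesian product; suffixes _x/_y by default
active = pairs[(pairs['start_y'] <= pairs['start_x']) &    # interval has started...
               (pairs['finish_y'] > pairs['start_x'])]     # ...and has not finished at start_x
val = active.groupby('start_x')['val_y'].sum()
# start_x
# 1     100
# 2     300
# 3     500
# 4     700
# 5    1200
# Name: val_y, dtype: int64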

 


numba

import numpy as np
from numba import njit

@njit
def pir_numba(S, F, V):
    # out[t] accumulates the values whose [start, finish) window covers time t
    mn = S.min()
    mx = F.max()
    out = np.zeros(mx)
    for s, f, v in zip(S, F, V):
        out[s:f] += v
    # trim the unused slots below the smallest start
    return out[mn:]

pir_numba(*[df[c].values for c in ['start', 'finish', 'val']])
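A small usage sketch (my addition, not part of the original answer): the array returned by pir_numba covers the times from the minimum 'start' up to the maximum 'finish' minus one, so it can be labelled accordingly.

# assumes df and pir_numba from above
out = pir_numba(*[df[c].values for c in ['start', 'finish', 'val']])
active = pd.Series(out, index=range(df['start'].min(), df['finish'].max()))
# 1     100.0
# 2     300.0
# 3     500.0
# 4     700.0
# 5    1200.0
# dtype: float64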

np.bincount

s, f, v = [df[col].values for col in ['start', 'finish', 'val']]

np.bincount([i - 1 for r in map(range, s, f) for i in r], v.repeat(f - s))

array([ 100.,  300.,  500.,  700., 1200.])
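To see what the bincount call is doing, here is an illustrative breakdown I added using the question's data: the first argument expands every row's [start, finish) window into 0-based time bins, and the weights repeat each value once per covered time, so bincount sums the active values per time.

# illustrative breakdown (assumes s, f, v from the snippet above)
bins = [i - 1 for r in map(range, s, f) for i in r]   # 0-based time bin for every active (row, time) pair
# [0, 1, 1, 2, 2, 3, 4, 3, 4, 4]

weights = v.repeat(f - s)                             # each row's value, once per time it is active
# [100, 100, 200, 200, 300, 300, 300, 400, 400, 500]

np.bincount(bins, weights)                            # sum of the weights falling into each time bin
# array([ 100.,  300.,  500.,  700., 1200.])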

Comprehension

This depends on the index being unique.

pd.Series({
    (k, i): v
    for i, s, f, v in df.itertuples()
    for k in range(s, f)
}).sum(level=0)

1     100
2     300
3     500
4     700
5    1200
dtype: int64
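A compatibility note I am adding: Series.sum(level=...) has since been deprecated and removed in newer pandas releases, where the equivalent spelling groups on the index level.

# equivalent on current pandas, where Series.sum(level=...) is no longer available
pd.Series({
    (k, i): v
    for i, s, f, v in df.itertuples()
    for k in range(s, f)
}).groupby(level=0).sum()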

With no dependence on the index

pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).sum(level=0)
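To illustrate the difference, here is a hypothetical example I am adding (not from the original answer): with duplicate index labels, the itertuples-based comprehension builds colliding dictionary keys and silently overwrites rows, whereas enumerate gives every row a unique key.

# hypothetical frame with a duplicated index label
dup = pd.DataFrame([[1, 3, 100], [1, 3, 200]],
                   columns=['start', 'finish', 'val'],
                   index=[0, 0])

# index-based comprehension: the keys (k, 0) collide and the first row is lost
pd.Series({(k, i): v
           for i, s, f, v in dup.itertuples()
           for k in range(s, f)}).groupby(level=0).sum()
# 1    200
# 2    200

# enumerate-based comprehension: unique keys per row, nothing is lost
pd.Series({(k, i): v
           for i, (s, f, v) in enumerate(zip(*map(dup.get, ['start', 'finish', 'val'])))
           for k in range(s, f)}).groupby(level=0).sum()
# 1    300
# 2    300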
