Fast looping through Python dataframe with previous row reference

  • A+
Category:Languages

Suppose I have a pandas dataframe with two columns: ID and Days. DataFrame is sorted by ascending order in both variables. For example:

# Initial dataset data = pd.DataFrame({'id': np.repeat([1, 2 ,3], 4),                  'day': [1, 2, 10, 11, 3, 4, 12, 15, 1, 20, 21, 24]})      id  day 0   1   1 1   1   2 2   1   10 3   1   11 4   2   3 5   2   4 6   2   12 7   2   15 8   3   1 9   3   20 10  3   21 11  3   24 

I want to add a third column, which would give a "session" number for every ID*day. By "session" i mean a sequence of days with difference less than 2 days between days of one session. For example, sequence 5,6,7 will be considered as one session, while 5,6,9 will be considered as two sessions and should be marked as 0, 0, 1, i.e. day 5 and 6 were refered toi session #0, while day 9 was refered to session #1. Session numbers shuould start from 0 for each new ID.

In other words, what i want to get is:

    id  day session 0   1   1   0 1   1   2   0 2   1   10  1 3   1   11  1 4   2   3   0 5   2   4   0 6   2   12  1 7   2   15  2   8   3   1   0 9   3   20  1 10  3   21  1 11  3   24  2 

To solve this task I use basic for loop. In this loop i iteratevely run through all unique IDs, then subset a block of data from initial dataset and assign session numbers for every day of a specific ID. The problem i have - since the initial dataset is millions of lines - loop takes a looot of time! For example, for 1 mln of lines my loop spends about a minute, which is too much.

How to improve the speed? Any method is good! If you know how to achieve desired result, for instance, with some numpy matrix manipulations which will reduce time - also good ...

My code for the loop:

# Get sessions for every id sessions = [] for i in data.id.unique():     id_data = data['day'][data['id']==i].reset_index(drop=True)     for ind in id_data.index:         if ind == 0:             temp = [0]         elif ((id_data[ind] - id_data[ind - 1]) < 2):             temp.append(temp[ind - 1])         else:             temp.append(temp[ind - 1] + 1)     sessions.extend(temp)  # Add sessions to the table data['session'] = sessions  

 


You can use groupby() twice with np.where(), diff() and cumsum():

data['session'] = np.where(data.groupby('id')['day'].diff().fillna(0)>1, 1, 0) data['session'] = data.groupby('id')['session'].cumsum() 

Yields:

    id  day  session 0    1    1        0 1    1    2        0 2    1   10        1 3    1   11        1 4    2    3        0 5    2    4        0 6    2   12        1 7    2   15        2 8    3    1        0 9    3   20        1 10   3   21        1 11   3   24        2 

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: