How do I unnest a column in a pandas DataFrame?

  • A+
Category:Languages

I have the following DataFrame where one of the columns is an object (list type cell):

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]}) df Out[458]:     A       B 0  1  [1, 2] 1  2  [1, 2] 

My expected output is:

   A  B 0  1  1 1  1  2 3  2  1 4  2  2 

What should I do to achieve this?

 



As an user with both R and python and spent one year in this site, I have seen this type of question couple times.


Since in R they have the build-in function from package tidyr so called unnest, But in Python(pandas) there is no build-in function for this type of question.


I know object columns type always make the data hard to convert by pandas' function. When I received the data like this , the first thing come into my mind is to 'flatten' or unnesting the columns .


Method 1 apply + pd.Series (easy to understand by in term of performance not recommended . )

df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'}) Out[463]:     A  B 0  1  1 1  1  2 0  2  1 1  2  2 

Method 2 using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )

df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)}) df Out[465]:     A  B 0  1  1 0  1  2 1  2  1 1  2  2 

Method 2.1 for example besides A we have A.1 .....A.n, if we still using the method(Method 2) above it is hard for us to re-create the columns one by one .

Solution : join or merge with the index after 'unnest' the single columns

s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len())) s.join(df.drop('B',1),how='left') Out[477]:     B  A 0  1  1 0  2  1 1  1  2 1  2  2 

If you need the column order exactly same as before , adding reindex at the end

s.join(df.drop('B',1),how='left').reindex(columns=df.columns) 

Method 3 recreate the list

pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns) Out[488]:     A  B 0  1  1 1  1  2 2  2  1 3  2  2 

If more than two columns

s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y]) s.merge(df,left_on=0,right_index=True) Out[491]:     0  1  A       B 0  0  1  1  [1, 2] 1  0  2  1  [1, 2] 2  1  1  2  [1, 2] 3  1  2  2  [1, 2] 

Method 4 using reindex or loc

df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values)) Out[554]:     A  B 0  1  1 0  1  2 1  2  1 1  2  2  #df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values)) 

Method 5 when the list only contain unique values:

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]}) from collections import ChainMap d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A']))) pd.DataFrame(list(d.items()),columns=df.columns[::-1]) Out[574]:     B  A 0  1  1 1  2  1 2  3  2 3  4  2 

Special case have two columns type object

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]}) df Out[592]:     A       B       C 0  1  [1, 2]  [1, 2] 1  2  [3, 4]  [3, 4] 

Self-def function

def unnesting(df, explode):     idx=df.index.repeat(df[explode[0]].str.len())     df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)     df1.index=idx     return df1.join(df.drop(explode,1),how='left')  unnesting(df,['B','C']) Out[609]:     B  C  A 0  1  1  1 0  2  2  1 1  3  3  2 1  4  4  2 

Summary :

I am using pandas and python function for this type of question , if you worry about the speed of above solutions I provided , you can check user3483203's answer , since he is using numpy and most of the time numpy is faster . Just a suggestion if the speed is do matter for your case , I will recommend Cpython and numba

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: