
Combine Pandas String Columns With Missing Values

I need to concatenate the strings in two or more columns of a pandas DataFrame. I found this answer, which works fine if you don't have any missing values. Unfortunately, I do, and this

Solution 1:

You can use apply with if-else:

# join the non-missing values of each row; return None if the whole row is missing
df = df.apply(lambda x: None if x.isnull().all() else ';'.join(x.dropna()), axis=1)
print (df)
0    val_A;val_B
1          val_B
2          val_A
3           None
dtype: object
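
The input DataFrame is not shown above, so here is a minimal, self-contained sketch of the same row-wise approach. The sample data, the column names `A` and `B`, and the helper name `join_row` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: every combination of present/missing values.
df = pd.DataFrame({
    "A": ["val_A", np.nan, "val_A", np.nan],
    "B": ["val_B", "val_B", np.nan, np.nan],
})

def join_row(row, sep=";"):
    """Join the non-missing values of one row; return None if all are missing."""
    present = row.dropna()
    return sep.join(present) if len(present) else None

# axis=1 passes each row to join_row as a Series.
combined = df.apply(join_row, axis=1)
print(combined)
```

Passing a different `sep` through `df.apply(join_row, axis=1, sep="; ")` should change the separator, since `apply` forwards extra keyword arguments to the function.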

For a faster solution, you can use a list comprehension (note that this version uses '; ' as the separator):

# add the separator, replace NaN with empty strings, and convert to lists
arr = df.add('; ').fillna('').values.tolist()
# list comprehension; strip the trailing separator and replace empty strings with NaN
s = pd.Series([''.join(x).strip('; ') for x in arr]).replace('^$', np.nan, regex=True)
# replace NaN with None
s = s.where(s.notnull(), None)
print (s)
0    val_A; val_B
1           val_B
2           val_A
3            None
dtype: object
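
One caveat with the snippet above: `str.strip('; ')` removes any leading or trailing `;` and space characters, so a value that legitimately starts or ends with one of them would be trimmed as well. A possible variant, sketched below, joins only the non-missing values and so needs no stripping. The function name `combine_columns` and the sample frame are hypothetical, and this is not the code that was timed in this answer.

```python
import numpy as np
import pandas as pd

def combine_columns(frame, sep="; "):
    """Join the non-missing string values of each row; all-missing rows become None."""
    rows = frame.to_numpy(dtype=object).tolist()
    joined = [sep.join(v for v in row if pd.notna(v)) for row in rows]
    # dtype=object keeps None as None instead of converting it to NaN
    return pd.Series([s if s else None for s in joined],
                     index=frame.index, dtype=object)

# Hypothetical sample data
sample = pd.DataFrame({
    "A": ["val_A", np.nan, "val_A", np.nan],
    "B": ["val_B", "val_B", np.nan, np.nan],
})
result = combine_columns(sample)
print(result)
```

Like the original, it loops in plain Python over lists rather than calling a function per row through `apply`.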

# 40000 rows
df = pd.concat([df]*10000).reset_index(drop=True)

In [70]: %%timeit
    ...: arr = df.add('; ').fillna('').values.tolist()
    ...: s = pd.Series([''.join(x).strip('; ') for x in arr]).replace('^$', np.nan, regex=True)
    ...: s.where(s.notnull(), None)
    ...: 
10 loops, best of 3: 74 ms per loop


In [71]: %%timeit
    ...: df.apply(lambda x: None if x.isnull().all() else ';'.join(x.dropna()), axis=1)
    ...: 
1 loop, best of 3: 12.7 s per loop

# another solution, but a bit slower
In [72]: %%timeit
     ...: arr = df.add('; ').fillna('').values
     ...: s = [''.join(x).strip('; ') for x in arr]
     ...: pd.Series([y if y != '' else None for y in s])
     ...: 
10 loops, best of 3: 119 ms per loop
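
The timings above come from IPython's %%timeit. Outside IPython, a rough comparison can be sketched with the standard-library `timeit` module. The sample data, the function names, the smaller frame (2000 rows) and the shared `SEP` constant are assumptions chosen to keep the run short; absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

SEP = "; "  # same separator for both variants so the results are comparable

# Hypothetical sample data, repeated to 2000 rows
base = pd.DataFrame({
    "A": ["val_A", np.nan, "val_A", np.nan],
    "B": ["val_B", "val_B", np.nan, np.nan],
})
df_big = pd.concat([base] * 500).reset_index(drop=True)

def with_apply(frame):
    # row-wise apply: one Python function call per row
    return frame.apply(
        lambda x: None if x.isnull().all() else SEP.join(x.dropna()), axis=1)

def with_list_comprehension(frame):
    # append the separator to each value, join each row, strip the trailing separator
    arr = frame.add(SEP).fillna('').values.tolist()
    joined = [''.join(x).strip(SEP) for x in arr]
    return pd.Series([y if y != '' else None for y in joined], dtype=object)

for fn in (with_apply, with_list_comprehension):
    seconds = timeit.timeit(lambda: fn(df_big), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```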
