Skip to content Skip to sidebar Skip to footer

Python Pandas : Compare Two Data-frames Along One Column And Return Content Of Rows Of Both Data Frames In Another Data Frame

I am working with two csv files and imported as dataframe, df1 and df2 df1 has 50000 rows and df2 has 150000 rows. I want to compare (iterate through each row) the 'time' of df2

Solution 1:

You're looking for pandas.merge_asof. It allows you to combine 2 DataFrames on a key, in this case time, without the requirement that they are an exact match. You can choose a direction for prioritizing the match, but in this case it's obvious that you want nearest

A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.

One caveat is that you need to sort things for merge_asof to work.

import pandas as pd

pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'), on='time', direction='nearest')
#          time  velocity   yawrate  velocity_x       yaw
#0  35427009860   12.6556 -0.074351     12.5451 -0.078781
#1  35427029728   12.6556 -0.074351     12.5451 -0.078781
#2  35427049705   12.6444 -0.074351     12.5451 -0.078781
#3  35427929709   12.6583 -0.075049     12.5351 -0.079489
#4  35427949712   12.6556 -0.075049     12.5401 -0.079591

Just be careful about which DataFrame you choose as the left or right frame, as that changes the result. In this case I'm selecting the time in df1 which is closest in absolute distance to the time in df2.

You also need to be careful if you have duplicated on keys in the right df because for exact matches, merge_asof only merges the last sorted row of the right df to the left df, instead of creating multiple entries for each exact match. If that's a problem, you can instead merge the exact keys first to get all of the combinations, and then merge the remainder with asof.


Solution 2:

just a side note (as not an answer)

    min_delta=100000
    for indexer, rows in df2.iterrows():
        if abs(float(row['time'])-float(rows['time']))<min_delta:
            min_delta = abs(float(row['time'])-float(rows['time']))
            #storing the position
            pos = indexer

can be written as

    diff = np.abs(row['time'] - df2['time'])
    pos = np.argmin(diff)

(always avoid for loops)

and don't call your vars with a built-in name (min)


Post a Comment for "Python Pandas : Compare Two Data-frames Along One Column And Return Content Of Rows Of Both Data Frames In Another Data Frame"