
GroupBy On Multiple Columns And Apply Moving Function

Let's suppose that I have this dataset:

Country_id  Company_id  Date        Company_value
1           1           01/01/2018  1
1           1           02/01/2018  0
1           1           03/01/2018  2
1           1           04/01/2018  NA
1           2

Solution 1:

Updated with additional information

data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date':['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01']*4,
              'country_id':[1]*8+[2]*8,
              'company_id':[1]*4+[2]*4+[1]*4+[2]*4,
              'value':[1, 0, 2, np.nan, 1, 2, np.nan, np.nan, 3, 0, 2, np.nan, 1, 2, np.nan, np.nan]})

Create a rolling sum within just country_id

# transform keeps the original row index, so the result aligns with df row by row
df['rolling_sum'] = df.groupby('country_id')['value'].transform(lambda s: s.rolling(window=2, min_periods=1).sum())

Create a rolling count within just country_id

# count() counts only the non-NaN values in each window
df['sum_records'] = df.groupby('country_id')['value'].transform(lambda s: s.rolling(window=2, min_periods=1).count())

Now group by country_id and date, sum the rolling sums, and divide by the sum of the rolling counts

summarized_df = df.groupby(['country_id', 'date']).apply(lambda x: x.rolling_sum.sum()/x.sum_records.sum()).reset_index()

Before the final reset_index(), the result is a Series indexed by country_id and date:

country_id  date      
1           2018-01-01    1.000000
            2018-02-01    1.000000
            2018-03-01    1.333333
            2018-04-01    2.000000
2           2018-01-01    2.000000
            2018-02-01    1.500000
            2018-03-01    1.333333
            2018-04-01    2.000000
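
The three steps above can be collected into one helper. This is only a sketch: the function name `rolling_group_mean` and its `window` parameter are hypothetical, and it assumes the rows are already sorted by date within each country/company block, as in the sample data.

```python
import numpy as np
import pandas as pd


def rolling_group_mean(df, window=2):
    """Pooled rolling mean of `value` per (country_id, date).

    Hypothetical helper; assumes rows are sorted by date within each
    country_id/company_id block, as in the sample data.
    """
    grouped = df.groupby("country_id")["value"]
    # Rolling sum and rolling count of non-NaN values within each country
    roll_sum = grouped.transform(lambda s: s.rolling(window, min_periods=1).sum())
    roll_cnt = grouped.transform(lambda s: s.rolling(window, min_periods=1).count())
    parts = pd.DataFrame({
        "country_id": df["country_id"],
        "date": df["date"],
        "roll_sum": roll_sum,
        "roll_cnt": roll_cnt,
    })
    # Sum the sums and the counts per (country_id, date), then divide
    totals = parts.groupby(["country_id", "date"])[["roll_sum", "roll_cnt"]].sum()
    return totals["roll_sum"] / totals["roll_cnt"]


df = pd.DataFrame({
    "date": ["2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01"] * 4,
    "country_id": [1] * 8 + [2] * 8,
    "company_id": [1] * 4 + [2] * 4 + [1] * 4 + [2] * 4,
    "value": [1, 0, 2, np.nan, 1, 2, np.nan, np.nan,
              3, 0, 2, np.nan, 1, 2, np.nan, np.nan],
})

result = rolling_group_mean(df)
print(result)
```

On the sample data this should reproduce the eight values listed above.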

Let's look at this in further detail. Since we are grouping by country_id, we'll subset out a single country to walk through the method, say country_id == 1:

df2 = df[df['country_id'] == 1]

         date  country_id  company_id  value
0  2018-01-01           1           1    1.0
1  2018-02-01           1           1    0.0
2  2018-03-01           1           1    2.0
3  2018-04-01           1           1    NaN
4  2018-01-01           1           2    1.0
5  2018-02-01           1           2    2.0
6  2018-03-01           1           2    NaN
7  2018-04-01           1           2    NaN

If we want the rolling averages for this one, we can just do:

df2.value.rolling(window=2, min_periods=1).mean()
0    1.0
1    0.5
2    1.0
3    2.0
4    1.0
5    1.5
6    2.0
7    NaN

Here is how each value in the country_id == 1 subset relates to its rolling average:

0    1.0  = (1)/1 = 1
1    0.0  = (0 + 1)/2 = 0.5
2    2.0  = (2 + 0)/2 = 1
3    NaN  = (NaN + 2)/1 = 2
4    1.0  = (1 + NaN)/1 = 1
5    2.0  = (2 + 1)/2 = 1.5
6    NaN  = (NaN + 2)/1 = 2
7    NaN  = (NaN + NaN)/0 = NaN

This is how we get our rolling averages for a single grouping of country_id
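
One thing to watch in the table above: the window for row 4 (the first row of company 2) reaches back to row 3, which belongs to company 1. On the sample data that is harmless, because row 3 is NaN. If the window should not cross company boundaries, one option is to group the rolling step by both country_id and company_id. The sketch below is an assumption about intent rather than part of the original answer; it uses a modified copy of the country_id == 1 data in which row 3 is set to a made-up value of 5.0 so the difference becomes visible.

```python
import numpy as np
import pandas as pd

# country_id == 1 only, with row 3 changed from NaN to 5.0 (hypothetical value)
demo = pd.DataFrame({
    "date": ["2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01"] * 2,
    "country_id": [1] * 8,
    "company_id": [1] * 4 + [2] * 4,
    "value": [1, 0, 2, 5.0, 1, 2, np.nan, np.nan],
})


def roll_sum(s):
    # Two-row window that emits a result as soon as one value is present
    return s.rolling(window=2, min_periods=1).sum()


# Window follows row order within the country, so row 4 also sees row 3
by_country = demo.groupby("country_id")["value"].transform(roll_sum)
# Window restarts at the first row of every company
by_company = demo.groupby(["country_id", "company_id"])["value"].transform(roll_sum)

print(pd.DataFrame({
    "value": demo["value"],
    "by_country": by_country,
    "by_company": by_company,
}))
```

Only row 4 differs between the two columns: the country-level window adds the 5.0 from company 1, while the company-level window starts fresh.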

If we instead grouped first by country_id and then by date, a single group would look like:

df3 = df[(df['country_id'] == 1) & (df['date'] == '2018-03-01')]

df3.value
2    2.0
6    NaN

df3.value.rolling(window=2, min_periods=1).mean()
2    2.0
6    2.0

df3.value
2    2.0 = (2)/1 = 2
6    NaN = (NaN + 2)/1 = 2

The problem here is that grouping by date leaves each group with only one row per company, so the rolling window runs across companies instead of across dates. What you want is the rolling window computed within country_id, across dates, and only then combined per date. Simply averaging the rolling averages for each date would also come out incorrect, because the windows do not all contain the same number of non-NaN values.

So let's go back to the original rolling averages we created for country_id == 1, and look at the dates:

2018-01-01    1.0  = (1)/1 =         1
2018-02-01    0.0  = (0 + 1)/2 =     0.5
2018-03-01    2.0  = (2 + 0)/2 =     1
2018-04-01    NaN  = (NaN + 2)/1 =   2
2018-01-01    1.0  = (1 + NaN)/1 =   1
2018-02-01    2.0  = (2 + 1)/2 =     1.5
2018-03-01    NaN  = (NaN + 2)/1 =   2
2018-04-01    NaN  = (NaN + NaN)/0 = NaN

The tricky part is that at this point we can't just average the rolling averages together. For example, the 2018-03-01 rolling averages are 1 and 2; summing them gives 3, and dividing by 2 gives 1.5. But the two windows behind them hold the values 0, 2, 2 and NaN, which is three non-NaN records summing to 4, so the pooled average is 4/3 ≈ 1.333.

We have to first sum the rolling sums, and then divide by the sum of the rolling counts.
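
To make the difference concrete, here is a small check on the country_id == 1 values for 2018-03-01. It is a minimal sketch; the variable names (`stats`, `march`, `mean_of_means`, `pooled`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Values and dates for country_id == 1, in the same row order as above
values = pd.Series([1, 0, 2, np.nan, 1, 2, np.nan, np.nan])
dates = ["2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01"] * 2

roll = values.rolling(window=2, min_periods=1)
stats = pd.DataFrame({
    "date": dates,
    "mean": roll.mean(),
    "sum": roll.sum(),
    "count": roll.count(),
})
march = stats[stats["date"] == "2018-03-01"]

# Averaging the two rolling averages: (1 + 2) / 2
mean_of_means = march["mean"].mean()
# Summing the rolling sums and dividing by the summed counts: (2 + 2) / (2 + 1)
pooled = march["sum"].sum() / march["count"].sum()

print(mean_of_means, pooled)
```

The first number is the 1.5 from the naive approach; the second is the 1.333 that appears in the summarized output.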


Solution 2:

You can achieve the result you want in the following way (this uses the column names from the question):

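# get company value by date: one row per (country, date), one column per company
avg = df.groupby(["Country_id", "Date", "Company_id"])["Company_value"].sum().unstack(level=2)
# append the previous date's values; shift within each country so that
# one country's last date does not leak into the next country's first date
avg = pd.concat([avg, avg.groupby(level="Country_id").shift(1)], axis=1)
avg["sum"] = avg.sum(axis=1)

# get company count (non-null values) by date
counts = df.groupby(["Country_id", "Date"])["Company_value"].count()
counts2 = counts + counts.groupby(level="Country_id").shift(1)

# get the "mean": two-date sum divided by two-date count
# (the first date of each country has no previous date, so fall back to counts)
result = avg["sum"] / counts2.fillna(counts)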
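
As a sanity check, the same idea can be run on the sample data from Solution 1, renamed to the question's column names. This is a sketch under two assumptions: dates sort correctly as strings, and the shift is done within each country (via `groupby(level="Country_id").shift(1)`) so that the first date of one country does not pick up the last date of the previous one. The names `wide`, `both`, `window_sum` and `window_count` are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data from Solution 1, renamed to the question's column names
df = pd.DataFrame({
    "Date": ["2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01"] * 4,
    "Country_id": [1] * 8 + [2] * 8,
    "Company_id": [1] * 4 + [2] * 4 + [1] * 4 + [2] * 4,
    "Company_value": [1, 0, 2, np.nan, 1, 2, np.nan, np.nan,
                      3, 0, 2, np.nan, 1, 2, np.nan, np.nan],
})

# One row per (country, date), one column per company; NaN values sum to 0
wide = (
    df.groupby(["Country_id", "Date", "Company_id"])["Company_value"]
    .sum()
    .unstack(level=2)
)
# Current and previous date side by side, shifted within each country
both = pd.concat([wide, wide.groupby(level="Country_id").shift(1)], axis=1)
window_sum = both.sum(axis=1)

# Non-null records on the current date plus those on the previous date
counts = df.groupby(["Country_id", "Date"])["Company_value"].count()
window_count = counts + counts.groupby(level="Country_id").shift(1)

result = window_sum / window_count.fillna(counts)
print(result)
```

On this data the values agree with the summarized output of Solution 1.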
