GroupBy On Multiple Columns And Apply Moving Function
Solution 1:
Updated with additional information
Sample data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date':['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01']*4,
              'country_id':[1]*8+[2]*8,
              'company_id':[1]*4+[2]*4+[1]*4+[2]*4,
              'value':[1, 0, 2, np.nan, 1, 2, np.nan, np.nan, 3, 0, 2, np.nan, 1, 2, np.nan, np.nan]})
Create a rolling sum within just country_id
df['rolling_sum'] = df.groupby('country_id').apply(lambda x: x.value.rolling(window=2, min_periods=1).sum()).reset_index(drop=True)
Create a rolling count within just country_id
df['sum_records'] = df.groupby('country_id').apply(lambda x: x.value.rolling(window=2, min_periods=1).count()).reset_index(drop=True)
Now group by country_id and date, sum the rolling sums, and divide by the sum of the rolling counts (the output below is shown before the final reset_index()):
summarized_df = df.groupby(['country_id', 'date']).apply(lambda x: x.rolling_sum.sum()/x.sum_records.sum()).reset_index()
country_id  date      
1           2018-01-01    1.000000
            2018-02-01    1.000000
            2018-03-01    1.333333
            2018-04-01    2.000000
2           2018-01-01    2.000000
            2018-02-01    1.500000
            2018-03-01    1.333333
            2018-04-01    2.000000
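As a possible alternative (a sketch, not the answer's own code), the same numbers can be obtained by calling rolling() directly on the grouped column. This aligns the results on the original row index instead of relying on the rows already being ordered by country_id. The names roll, totals and summarized are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01'] * 4,
    'country_id': [1] * 8 + [2] * 8,
    'company_id': [1] * 4 + [2] * 4 + [1] * 4 + [2] * 4,
    'value': [1, 0, 2, np.nan, 1, 2, np.nan, np.nan,
              3, 0, 2, np.nan, 1, 2, np.nan, np.nan],
})

# Rolling window of 2 rows inside each country_id group.
roll = df.groupby('country_id')['value'].rolling(window=2, min_periods=1)

# The result carries a (country_id, original index) MultiIndex;
# dropping the group level lets pandas align on the original index.
df['rolling_sum'] = roll.sum().reset_index(level=0, drop=True)
df['sum_records'] = roll.count().reset_index(level=0, drop=True)

# Pool the windows per (country_id, date): sum of sums over sum of counts.
totals = df.groupby(['country_id', 'date'])[['rolling_sum', 'sum_records']].sum()
summarized = (totals['rolling_sum'] / totals['sum_records']).rename('rolling_mean')
print(summarized)
```

On the sample data this should reproduce the eight values listed above.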
Let's look at this in more detail. Since we are grouping by country_id, we'll take a single country, say country_id == 1, and work through the method on it:
df2 = df[df['country_id'] == 1]
         date  country_id  company_id  value
0  2018-01-01           1           1    1.0
1  2018-02-01           1           1    0.0
2  2018-03-01           1           1    2.0
3  2018-04-01           1           1    NaN
4  2018-01-01           1           2    1.0
5  2018-02-01           1           2    2.0
6  2018-03-01           1           2    NaN
7  2018-04-01           1           2    NaN
If we want the rolling averages for this one, we can just do:
df2.value.rolling(window=2, min_periods=1).mean()
0    1.0
1    0.5
2    1.0
3    2.0
4    1.0
5    1.5
6    2.0
7    NaN
Here is how each value in the country_id == 1 subset relates to its rolling average. Each window holds the current row and the row before it; NaN values are skipped and do not count toward the divisor:
0    1.0  = (1)/1 = 1
1    0.0  = (0 + 1)/2 = 0.5
2    2.0  = (2 + 0)/2 = 1
3    NaN  = (NaN + 2)/1 = 2
4    1.0  = (1 + NaN)/1 = 1
5    2.0  = (2 + 1)/2 = 1.5
6    NaN  = (NaN + 2)/1 = 2
7    NaN  = (NaN + NaN)/0 = NaN
This is how we get the rolling averages for a single country_id group.
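To make the arithmetic above concrete, here is a small sketch that checks the rolling mean against the rolling sum divided by the rolling count for the same eight values. The variable names values, roll and manual are illustrative.

```python
import numpy as np
import pandas as pd

# The 'value' column of the country_id == 1 subset.
values = pd.Series([1, 0, 2, np.nan, 1, 2, np.nan, np.nan])

roll = values.rolling(window=2, min_periods=1)

# Sum of the non-NaN entries in each window over the number of non-NaN entries.
manual = roll.sum() / roll.count()

print(pd.DataFrame({
    'value': values,
    'sum': roll.sum(),
    'count': roll.count(),
    'mean': roll.mean(),
    'sum/count': manual,
}))
```

The last row has no valid observation in its window, so both the mean and the manual ratio come out as NaN.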
If we instead grouped by country_id and then by date before rolling, a single group would look like this:
df3 = df[(df['country_id'] == 1) & (df['date'] == '2018-03-01')]
df3.value
2    2.0
6    NaN
df3.value.rolling(window=2, min_periods=1).mean()
2    2.0
6    2.0
df3.value
2    2.0 = (2)/1 = 2
6    NaN = (NaN + 2)/1 = 2
The problem is that the rolling windows have to be computed within country_id alone, not within country_id and date. Only once the rolling values exist for each country do we combine them by date. And combining them by simply averaging the rolling averages gives the wrong answer.
So let's go back to the rolling averages we computed for country_id == 1, and look at the dates:
2018-01-01    1.0  = (1)/1 =         1
2018-02-01    0.0  = (0 + 1)/2 =     0.5
2018-03-01    2.0  = (2 + 0)/2 =     1
2018-04-01    NaN  = (NaN + 2)/1 =   2
2018-01-01    1.0  = (1 + NaN)/1 =   1
2018-02-01    2.0  = (2 + 1)/2 =     1.5
2018-03-01    NaN  = (NaN + 2)/1 =   2
2018-04-01    NaN  = (NaN + NaN)/0 = NaN
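The tricky part is that we can't just average these rolling averages per date. Take 2018-03-01: the two rolling averages are 1 and 2, and averaging them gives (1 + 2)/2 = 1.5. But the first window holds two valid records and the second holds only one, so the two averages should not be weighted equally.
Instead we sum the rolling sums and divide by the total count of records: (2 + 2)/(2 + 1) = 1.333333, which is the value shown for country_id 1 on 2018-03-01 in the result above.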
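The difference between the two ways of combining windows can be shown with the 2018-03-01 numbers for country_id == 1. This is only an illustrative sketch; the list and variable names are made up for the example.

```python
# Rolling sums and valid-record counts of the two 2018-03-01 windows
# (company 1: 2 + 0 over 2 records, company 2: NaN + 2 over 1 record).
rolling_sums = [2.0, 2.0]
rolling_counts = [2, 1]

# Per-window averages, then a plain average of those averages.
window_means = [s / c for s, c in zip(rolling_sums, rolling_counts)]
mean_of_means = sum(window_means) / len(window_means)

# Pooled average: sum of the sums divided by sum of the counts.
pooled_mean = sum(rolling_sums) / sum(rolling_counts)

print(window_means)    # [1.0, 2.0]
print(mean_of_means)   # 1.5
print(pooled_mean)     # 1.3333333333333333
```

The pooled figure weights each window by how many valid records it contains, which is why it differs from the plain average.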
Solution 2:
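You can achieve the result you want in the following way (note that the column names here are capitalized and the value column is called Company_value, unlike the sample frame in Solution 1):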
# get company value by date
avg = df.groupby(["Country_id", "Date", "Company_id"]).sum().unstack(level=2).loc[:, "Company_value"]
avg = pd.concat([avg, avg.shift(1)], axis=1)
avg["sum"] = avg.apply("sum", axis=1)
# get company count by date
counts = df.groupby(["Country_id", "Date"]).count().loc[:, "Company_value"]
counts2 = counts + counts.shift(1)
# get the "mean"
result = avg["sum"] / counts2.fillna(counts)
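To try these steps out, here is a sketch that runs them on the Solution 1 sample data, with the columns renamed to match the names used in this solution (that renaming is an assumption made for the example). The intermediate names by_company and window_sum are illustrative, and the row sum is written as .sum(axis=1) rather than .apply("sum", axis=1).

```python
import numpy as np
import pandas as pd

# Solution 1 sample data, with column names as used in Solution 2 (assumed mapping).
df = pd.DataFrame({
    'Date': ['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01'] * 4,
    'Country_id': [1] * 8 + [2] * 8,
    'Company_id': [1] * 4 + [2] * 4 + [1] * 4 + [2] * 4,
    'Company_value': [1, 0, 2, np.nan, 1, 2, np.nan, np.nan,
                      3, 0, 2, np.nan, 1, 2, np.nan, np.nan],
})

# One row per (Country_id, Date), one column per company; all-NaN groups sum to 0.
by_company = (df.groupby(['Country_id', 'Date', 'Company_id']).sum()
                .unstack(level=2)
                .loc[:, 'Company_value'])

# Current row plus previous row, summed across all companies.
window_sum = pd.concat([by_company, by_company.shift(1)], axis=1).sum(axis=1)

# Number of non-NaN values per (Country_id, Date), for the current and previous row.
counts = df.groupby(['Country_id', 'Date']).count().loc[:, 'Company_value']
counts2 = counts + counts.shift(1)

result = window_sum / counts2.fillna(counts)
print(result)
```

Note that shift(1) is applied to the whole frame, so the first date of each country picks up the last row of the previous country. In this sample that row contributes a zero sum and a zero count, so the result is unaffected, but that would not hold for data in general.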