Pandas Int Or Float Column To Percentage Distribution

I have a pandas dataframe df:

import pandas as pd
import numpy as np

data = {'A':[250,100,400,np.nan,300]}
df = pd.DataFrame(data)
print(df)

       A
0  250.0
1  100.0
2  400.0
3

Solution 1:

Here is one way using pd.cut

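# values: sorted bin edges; the output below implies [0, 200, 400, 600]
s = pd.cut(df.A, values).dropna()   # interval containing each non-null value
# lower and upper edge of each value's interval
x = s.map(lambda x: x.left).astype(int).to_frame('V')
y = s.map(lambda x: x.right).astype(int).to_frame('V')
# x.r: distance above the lower edge, y.r: distance below the upper edge,
# both as a fraction of the bin width (assignment aligns on the index)
x['r'] = (df.A - x.V) / (y.V - x.V)
y['r'] = (y.V - df.A) / (y.V - x.V)
# one column per edge; edges and rows with no data are filled with 0
df1 = pd.concat([x, y]).set_index('V', append=True).\
       r.unstack(fill_value=0).\
        reindex(columns=values, index=df.index, fill_value=0)
df1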
Out[110]: 
V  0     200   400  600
0  0.0  0.25  0.75  0.0
1  0.5  0.50  0.00  0.0
2  0.0  1.00  0.00  0.0
3  0.0  0.00  0.00  0.0
4  0.0  0.50  0.50  0.0
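
To see what the first three lines of that snippet produce, here is a minimal sketch. It assumes values = [0, 200, 400, 600], which is inferred from the output columns above rather than stated in the answer; the names left and right are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [250, 100, 400, np.nan, 300]})
values = [0, 200, 400, 600]  # assumed bin edges, inferred from the output columns

# pd.cut labels each value with the half-open interval (left, right] that holds it;
# NaN stays NaN and is dropped, so row 3 disappears from s.
s = pd.cut(df.A, values).dropna()

# Every label is a pd.Interval, so its edges are available as .left and .right.
left = s.map(lambda iv: iv.left).astype(int)
right = s.map(lambda iv: iv.right).astype(int)

print(pd.DataFrame({'A': df.A[s.index], 'left': left, 'right': right}))
```

Because the intervals are closed on the right, 400 lands in (200, 400] rather than (400, 600], which is why row 2 of the result puts its whole weight in one column.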

Solution 2:

Setup

There is quite a bit to do here. I wanted to come up with a completely "vectorized" method, but the setup overhead may mean it is not the fastest option for your data. Include this setup in any timings before choosing an approach.

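u = df['A'].values
v = np.asarray(values)       # bin edges, assumed sorted ascending
nan_u = np.isnan(u)          # remember where the NaNs are
a = np.where(nan_u, 0, u)    # placeholder 0 so searchsorted sees no NaN

d = v[1:] - v[:-1]                    # width of each bin
diffs = np.concatenate((d[:1], d))    # pad to len(v) by repeating the first width
s = np.searchsorted(v, a)             # index of the first edge >= each value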

You can use vectorized methods to calculate each value's ratio against every edge, as if it could have fallen in any bin, and then mask out everything except the two edges that actually bracket it.

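This assumes that every non-null value is greater than min(values) and no greater than max(values). A value above max(values) makes the step that creates the mask m fail with an IndexError. A value at or below min(values) raises nothing, but s-1 becomes -1 and wraps around to the last column, so that row is silently wrong.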
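
A guard along the following lines could run before the mask is built. This is only a sketch: check_in_range is a hypothetical helper, not part of the answer, and the edges [0, 200, 400, 600] are assumed from the output shown elsewhere on this page.

```python
import numpy as np

def check_in_range(u, v):
    """Raise ValueError unless every non-NaN entry of u lies in (v[0], v[-1]]."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v)
    a = u[~np.isnan(u)]
    bad = (a <= v[0]) | (a > v[-1])
    if bad.any():
        raise ValueError(f"values outside ({v[0]}, {v[-1]}]: {a[bad].tolist()}")

v = np.array([0, 200, 400, 600])  # assumed bin edges

# What searchsorted returns at the boundaries:
# 700 -> 4, which is past the last column; 0 and -5 -> 0, so s-1 would be -1.
print(np.searchsorted(v, [250, 600, 700, 0, -5]))  # [2 3 4 0 0]

check_in_range([250, 100, 400, np.nan, 300], v)  # passes silently
```

The check rejects a value equal to the lowest edge as well, since searchsorted maps it to index 0 just like a value below the range.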


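m = np.zeros((u.shape[0], v.shape[0]), dtype='i1')
ar = np.arange(m.shape[0])

m[ar, s] = 1        # upper edge of each value's bin
m[ar, s-1] = 1      # lower edge of each value's bin

f = np.abs((v - a[:, None]) / diffs)   # distance from every edge, scaled by diffs

final = f * m       # keep only the two bracketing edges
final[nan_u] = 0    # rows that were NaN contribute nothing

final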

array([[0.  , 0.25, 0.75, 0.  ],
       [0.5 , 0.5 , 0.  , 0.  ],
       [0.  , 1.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.5 , 0.5 , 0.  ]])

The one step I am not happy about in this answer is where I create the differences:

d = v[1:] - v[:-1]
diffs = np.concatenate((d[:1], d))

You really only have 3 "differences", but since you need 4 for shape reasons, you can either prepend the first difference or append the last one. It seems very clunky, and there may be a way to avoid this.
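
One possible way around the padding, offered as a sketch rather than as the answer's method, is to skip the full ratio matrix and write only the two weights per row. Each row is then divided by the width of its own bin, so no padded diffs array is needed. The names lo, hi, width and out are illustrative, the edges are assumed to be [0, 200, 400, 600], and values are still assumed to lie within the edges.

```python
import numpy as np

u = np.array([250, 100, 400, np.nan, 300])
v = np.array([0, 200, 400, 600])  # assumed bin edges, sorted ascending

valid = ~np.isnan(u)
rows = np.flatnonzero(valid)      # row numbers that hold real data
a = u[valid]

# Index of the upper edge of each value's bin; the clip keeps a value equal
# to the lowest edge in the first bin instead of wrapping to index -1.
hi = np.clip(np.searchsorted(v, a), 1, len(v) - 1)
lo = hi - 1
width = v[hi] - v[lo]             # width of the bin each value falls in

out = np.zeros((u.size, v.size))
out[rows, lo] = (a - v[lo]) / width
out[rows, hi] = (v[hi] - a) / width

print(out)
```

For evenly spaced edges this gives the same array as above. For unevenly spaced edges it would differ from the diffs version, because diffs scales each column by the width of the bin that ends at that edge, whereas this sketch scales both weights by the width of the bin containing the value.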

