How To Implement Kernel Density Estimation In Multivariate/3d
Solution 1:
Interesting problem. You have a few options:
- Continue with scikit-learn
- Use a different library. For instance, if the kernel you are interested in is the gaussian - then you could use scipy.gaussian_kde which is arguably easier to understand / apply. There is a very good example of this technique in this question.
- roll your own from first principles. This is very difficult and I don't recommend it
This blog post goes into detail about the relative merits of various library implementations of Kernel Density Estimation (KDE).
I'm going to show you what (in my opinion - yes this is a bit opinion based) is the simplest way, which I think is option 2 in your case.
NOTE This method uses a rule of thumb as described in the linked docs to determine bandwidth. The exact rule used is Scott's rule. Your mention of the Σ matrix makes me think rule of thumb bandwidth selection is OK for you, but you also talk about optimal bandwidth and the example you present uses cross-validation to determine the best bandwidth. Therefore, if this method doesn't suit your purposes - let me know in comments.
import numpy as np
from scipy import stats
data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
[2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
[1, 4, 3], [5, .5, 0], [2, .5, 1.2]])
data = data.T #The KDE takes N vectors of length K for K data points#rather than K vectors of length N
kde = stats.gaussian_kde(data)
# You now have your kde!! Interpreting it / visualising it can be difficult with 3D data# You might like to try 2D data first - then you can plot the resulting estimated pdf# as the height in the third dimension, making visualisation easier.# Here is the basic way to evaluate the estimated pdf on a regular n-dimensional mesh# Create a regular N-dimensional grid with (arbitrary) 20 points in each dimension
minima = data.T.min(axis=0)
maxima = data.T.max(axis=0)
space = [np.linspace(mini,maxi,20) for mini, maxi in zip(minima,maxima)]
grid = np.meshgrid(*space)
#Turn the grid into N-dimensional coordinates for each point#Note - coords will get very large as N increases...
coords = np.vstack(map(np.ravel, grid))
#Evaluate the KD estimated pdf at each coordinate
density = kde(coords)
#Do what you like with the density values here..#plot them, output them, use them elsewhere...
Caveat
this may give terrible results, depending on your particular problem. Things to bear in mind are obviously:
as your number of dimensions goes up, the number of observed data points you want will have to go up exponentially - your sample data of 9 points in 3 dimensions is pretty sparse - although I assume the dots indicate that in fact you have many more.
As mentioned in the main body - the bandwidth is selected in a particular way - this may result in over- (or conceivably but unlikely under-) smoothing of the estimated pdf.
Post a Comment for "How To Implement Kernel Density Estimation In Multivariate/3d"