Skip to content Skip to sidebar Skip to footer

How To Implement Kernel Density Estimation In Multivariate/3d

I have dataset like the following fromat and im trying to find out the Kernel density estimation with optimal bandwidth. data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],

Solution 1:

Interesting problem. You have a few options:

  1. Continue with scikit-learn
  2. Use a different library. For instance, if the kernel you are interested in is the gaussian - then you could use scipy.gaussian_kde which is arguably easier to understand / apply. There is a very good example of this technique in this question.
  3. roll your own from first principles. This is very difficult and I don't recommend it

This blog post goes into detail about the relative merits of various library implementations of Kernel Density Estimation (KDE).


I'm going to show you what (in my opinion - yes this is a bit opinion based) is the simplest way, which I think is option 2 in your case.

NOTE This method uses a rule of thumb as described in the linked docs to determine bandwidth. The exact rule used is Scott's rule. Your mention of the Σ matrix makes me think rule of thumb bandwidth selection is OK for you, but you also talk about optimal bandwidth and the example you present uses cross-validation to determine the best bandwidth. Therefore, if this method doesn't suit your purposes - let me know in comments.

import numpy as np
from scipy import stats
data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
         [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
         [1, 4, 3], [5, .5, 0], [2, .5, 1.2]])

data = data.T #The KDE takes N vectors of length K for K data points#rather than K vectors of length N

kde = stats.gaussian_kde(data)

# You now have your kde!!  Interpreting it / visualising it can be difficult with 3D data# You might like to try 2D data first - then you can plot the resulting estimated pdf# as the height in the third dimension, making visualisation easier.# Here is the basic way to evaluate the estimated pdf on a regular n-dimensional mesh# Create a regular N-dimensional grid with (arbitrary) 20 points in each dimension
minima = data.T.min(axis=0)
maxima = data.T.max(axis=0)
space = [np.linspace(mini,maxi,20) for mini, maxi in zip(minima,maxima)]
grid = np.meshgrid(*space)

#Turn the grid into N-dimensional coordinates for each point#Note - coords will get very large as N increases...
coords = np.vstack(map(np.ravel, grid))

#Evaluate the KD estimated pdf at each coordinate
density = kde(coords)

#Do what you like with the density values here..#plot them, output them, use them elsewhere...

Caveat

this may give terrible results, depending on your particular problem. Things to bear in mind are obviously:

  1. as your number of dimensions goes up, the number of observed data points you want will have to go up exponentially - your sample data of 9 points in 3 dimensions is pretty sparse - although I assume the dots indicate that in fact you have many more.

  2. As mentioned in the main body - the bandwidth is selected in a particular way - this may result in over- (or conceivably but unlikely under-) smoothing of the estimated pdf.

Post a Comment for "How To Implement Kernel Density Estimation In Multivariate/3d"