Dataframe Into Numpy Array With Values Comma Separated
The Scenario: I've read a CSV (which is \t separated) into a DataFrame, which now needs to be in numpy array format for clustering, without changing type. The Problem: So far as
Solution 1:
Use label-based selection and the .values attribute of the resulting pandas objects, which will be some sort of numpy array:
>>> df
   uid  iid  rat
0  196  242  3.0
1  186  302  3.0
2   22  377  1.0
>>> df.loc[:,['iid','rat']]
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0
>>> df.loc[:,['iid','rat']].values
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])
Note that your integer column will get promoted to float, because a plain numpy array holds a single dtype for all of its elements.
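To see the promotion for yourself, here is a minimal sketch. The frame below is rebuilt by hand from the three rows shown above, and the variable names mixed and ints_only are just illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the example: two integer columns and one float column.
df = pd.DataFrame({"uid": [196, 186, 22],
                   "iid": [242, 302, 377],
                   "rat": [3.0, 3.0, 1.0]})

# Mixing an integer column with a float column forces a common float dtype.
mixed = df.loc[:, ["iid", "rat"]].values

# Selecting only integer columns keeps an integer dtype.
ints_only = df.loc[:, ["uid", "iid"]].values

print(mixed.dtype)
print(ints_only.dtype)
```

So if the "without changing type" part matters, one option is to convert the integer and float columns separately rather than in one array.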
Also note, this particular selection could be approached in different ways:
>>> df.iloc[:, 1:] # integer-position based
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0
>>> df[['iid','rat']] # plain indexing performs column-based selection
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0
I like label-based because it is more explicit.
Edit
The commas are still there in the data; not seeing them is just an artifact of how numpy arrays are printed:
>>> df[['iid','rat']].values
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])
>>> print(df[['iid','rat']].values)
[[ 242.    3.]
 [ 302.    3.]
 [ 377.    1.]]
And actually, it is the difference between the str and repr results of the numpy array:
>>> print(repr(df[['iid','rat']].values))
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])
>>> print(str(df[['iid','rat']].values))
[[ 242.    3.]
 [ 302.    3.]
 [ 377.    1.]]
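If you really do want the commas in the printed text, a small sketch of two options follows. The array arr is typed in by hand to match the values above; np.array2string with its separator argument and ndarray.tolist are both standard numpy calls:

```python
import numpy as np

# Same values as the array shown above, entered by hand for illustration.
arr = np.array([[242.0, 3.0], [302.0, 3.0], [377.0, 1.0]])

# Option 1: ask numpy for a string that uses ", " between elements.
with_commas = np.array2string(arr, separator=", ")

# Option 2: convert to nested Python lists, which always print with commas.
as_lists = arr.tolist()

print(with_commas)
print(as_lists)
```

Either way the underlying array is unchanged; only the text representation differs.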
Solution 2:
Why don't you just import the 'csv' as a numpy array?
import numpy as np
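def read_file(fname):
    # The tab delimiter must be written "\t"; "/t" would be a literal slash followed by t.
    # unpack=True transposes the result, so each column comes back as its own row.
    return np.genfromtxt(fname, delimiter="\t", comments="%", unpack=True)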
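A quick sketch of what unpack does to the shape, since clustering code usually expects one row per sample. The tab-separated sample text is made up for illustration and is fed through io.StringIO instead of a real file:

```python
import io
import numpy as np

# Hypothetical tab-separated data: 4 rows, 3 columns, no header.
sample = "196\t242\t3\n186\t302\t3\n22\t377\t1\n244\t51\t2\n"

# Default: one row per line of the file.
rows = np.genfromtxt(io.StringIO(sample), delimiter="\t")

# unpack=True: the array is transposed, one row per column of the file.
cols = np.genfromtxt(io.StringIO(sample), delimiter="\t", unpack=True)

print(rows.shape)
print(cols.shape)
```

If the clustering step wants samples as rows, dropping unpack=True (or transposing back with .T) is probably what you need.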
Solution 3:
It seems you need read_csv first to build the DataFrame, keeping only the second and third columns with usecols, and then convert to a numpy array with .values:
import pandas as pd
from sklearn.cluster import KMeans
from io import StringIO  # pandas.compat.StringIO is no longer available in current pandas

temp = u"""col,iid,rat
4,1,0
5,2,4
6,3,3
7,4,1"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), usecols=[1, 2])
print(df)
   iid  rat
0    1    0
1    2    4
2    3    3
3    4    1
X = df.values
print(X)
[[1 0]
 [2 4]
 [3 3]
 [4 1]]
kmeans = KMeans(n_clusters=2)
a = kmeans.fit(X)
print (a)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
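Since the file in the question is tab-separated rather than comma-separated, the same approach might look like the sketch below. The header names and sample rows are assumptions borrowed from Solution 1, and sep="\t" is the only real change to the read_csv call:

```python
from io import StringIO

import pandas as pd

# Hypothetical tab-separated content; swap StringIO(temp) for the real filename.
temp = "uid\tiid\trat\n196\t242\t3.0\n186\t302\t3.0\n22\t377\t1.0\n"

# sep="\t" tells read_csv the fields are tab-delimited;
# usecols keeps only the two columns wanted for clustering.
df = pd.read_csv(StringIO(temp), sep="\t", usecols=["iid", "rat"])

# to_numpy() returns the same kind of array as .values.
X = df.to_numpy()

print(X.shape)
print(X.dtype)
```

X can then be passed to KMeans(...).fit(X) exactly as above.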