
Dataframe Into Numpy Array With Values Comma Separated

The Scenario: I've read a CSV (which is \t separated) into a DataFrame, which now needs to be in NumPy array format for clustering, without changing the type. The Problem: So far as

Solution 1:

Use label-based selection and the .values attribute of the resulting pandas objects, which will be some sort of numpy array:

>>> df
   uid  iid  rat
0  196  242  3.0
1  186  302  3.0
2   22  377  1.0
>>> df.loc[:,['iid','rat']]
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0
>>> df.loc[:,['iid','rat']].values
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])

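Note that your integer column gets promoted to float: a NumPy array holds a single dtype, so selecting an integer column together with a float column yields a float array.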
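A minimal sketch of that promotion, rebuilding the sample frame by hand (the constructor call is an assumption, since the original data came from a file). It compares the dtype of a mixed selection with the dtype of an all-integer selection:

```python
import pandas as pd

# Same sample values as the frame shown above, built by hand for illustration
df = pd.DataFrame({"uid": [196, 186, 22],
                   "iid": [242, 302, 377],
                   "rat": [3.0, 3.0, 1.0]})

mixed = df.loc[:, ["iid", "rat"]].values      # integer column + float column
ints_only = df.loc[:, ["uid", "iid"]].values  # two integer columns

print(mixed.dtype)      # a float dtype: the common type for both columns
print(ints_only.dtype)  # an integer dtype: no promotion is needed
```

If the integer values must stay integers, one option is to convert the columns to separate arrays instead of one combined array.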

Also note, this particular selection could be approached in different ways:

>>> df.iloc[:, 1:] # integer-position based
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0
>>> df[['iid','rat']] # plain indexing performs column-based selection
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0

I like label-based because it is more explicit.

Edit

The reason you aren't seeing commas is an artifact of how numpy arrays are printed:

>>> df[['iid','rat']].values
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])
>>> print(df[['iid','rat']].values)
[[ 242.    3.]
 [ 302.    3.]
 [ 377.    1.]]

And actually, it is the difference between the str and repr results of the numpy array:

>>> print(repr(df[['iid','rat']].values))
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])
>>> print(str(df[['iid','rat']].values))
[[ 242.    3.]
 [ 302.    3.]
 [ 377.    1.]]
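If commas are actually wanted in printed output, rather than just understanding why they are missing, here is a sketch of two possible ways to get them. The array is typed in by hand to match the values above; `np.array2string` and `ndarray.tolist` are standard NumPy calls, but whether either fits the original use case is an assumption:

```python
import numpy as np

# Hand-typed copy of the array printed above
arr = np.array([[242.0, 3.0], [302.0, 3.0], [377.0, 1.0]])

# array2string lets you choose the separator placed between elements
text = np.array2string(arr, separator=", ")
print(text)

# tolist() returns nested Python lists, whose printed form uses commas
print(arr.tolist())
```

Neither call changes the array itself; the commas are purely a matter of display, so clustering code can take the array as it is.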

Solution 2:

Why don't you just import the 'csv' as a numpy array?

import numpy as np

def read_file(fname):
    # "\t" is the tab delimiter; unpack=True returns the data transposed (one row per column)
    return np.genfromtxt(fname, delimiter="\t", comments="%", unpack=True)
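As a quick check of this approach, the following sketch reads tab-separated text from an in-memory buffer instead of a file. The sample string is made up to match the values used in Solution 1, and `io.StringIO` stands in for the real file name:

```python
import io
import numpy as np

# Made-up tab-separated sample standing in for the real file
sample = "196\t242\t3\n186\t302\t3\n22\t377\t1\n"

# Default orientation: one row per input line
rows = np.genfromtxt(io.StringIO(sample), delimiter="\t")
print(rows.shape)  # (3, 3)

# unpack=True transposes the result, so each variable receives one column
uid, iid, rat = np.genfromtxt(io.StringIO(sample), delimiter="\t", unpack=True)
print(iid)
```

Note that `genfromtxt` reads everything as float by default, so this route also converts the integer columns unless a `dtype` is passed.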

Solution 3:

It seems you need read_csv to build the DataFrame first, keeping only the second and third columns, and then convert it to a numpy array with values:

import pandas as pd
from sklearn.cluster import KMeans
from io import StringIO  # pandas.compat.StringIO was removed in later pandas versions

temp=u"""col,iid,rat
4,1,0
5,2,4
6,3,3
7,4,1"""
#after testing replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), usecols = [1,2])
print (df)
   iid  rat
0    1    0
1    2    4
2    3    3
3    4    1

X = df.values
print (X)
[[1 0]
 [2 4]
 [3 3]
 [4 1]]

kmeans = KMeans(n_clusters=2)
a = kmeans.fit(X)
print (a)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
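Printing the fitted estimator only shows its parameters. A sketch of how the cluster assignments themselves might be inspected follows; the array is retyped by hand from the sample data, and `n_init` and `random_state` are set explicitly here as an assumption to make the run repeatable (they are not part of the original answer):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hand-typed copy of the sample (iid, rat) pairs
X = np.array([[1, 0], [2, 4], [3, 3], [4, 1]])

# n_init and random_state are fixed so repeated runs give the same result
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # one cluster index per row of X
print(kmeans.cluster_centers_)  # one centroid per cluster, shape (2, 2)
```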
