
Pandas Dataframe Memory Python

I want to transform a sparse matrix (156060x11780) into a DataFrame, but I get a memory error. This is my code: vect = TfidfVectorizer(sublinear_tf=True, analyzer='word',

Solution 1:

Try this:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(sublinear_tf=True, analyzer='word', stop_words='english',
                       tokenizer=tokenize,
                       strip_accents='ascii',dtype=np.float16)
X = vect.fit_transform(df.pop('Phrase'))  # NOTE: `.toarray()` was removed

for i, col in enumerate(vect.get_feature_names()):
    df[col] = pd.SparseSeries(X[:, i].toarray().reshape(-1,), fill_value=0)

UPDATE: for Pandas 0.20+ we can construct SparseDataFrame directly from sparse arrays:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(sublinear_tf=True, analyzer='word', stop_words='english',
                       tokenizer=tokenize,
                       strip_accents='ascii',dtype=np.float16)

df = pd.SparseDataFrame(vect.fit_transform(df.pop('Phrase')),
                        columns=vect.get_feature_names(),
                        index=df.index)

UPDATE (2022-01-22): in modern versions of pandas the pd.SparseDataFrame class is no longer available (it was deprecated in 0.25 and removed in 1.0, along with pd.SparseSeries), so please use pd.DataFrame.sparse.from_spmatrix() instead.
