How To Pre-process Data Before Pandas.read_csv()

I have a slightly broken CSV file that I want to pre-process before reading it with pandas.read_csv(), i.e. do some search/replace on it. I tried to open the file and do the pr

Solution 1:

After further investigation of pandas' source, it became apparent that it doesn't simply require an iterable but also wants it to be a file, which it recognizes by the presence of a read method (is_file_like() in inference.py).
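As a rough illustration of that check: as far as I can tell, it boils down to "has a read or write attribute and is iterable". The helper below, looks_file_like, is a hypothetical stand-in written for this answer, not pandas' own function, so treat it as a sketch of the idea rather than the library's exact logic.

```python
import io


def looks_file_like(obj):
    """Approximate the duck-typing test described above (hypothetical helper)."""
    # A file-like object must offer read() or write() ...
    if not (hasattr(obj, "read") or hasattr(obj, "write")):
        return False
    # ... and must also be iterable.
    return hasattr(obj, "__iter__")


print(looks_file_like(io.StringIO("a,b\n1,2\n")))  # True: a real text buffer
print(looks_file_like(["a,b\n", "1,2\n"]))         # False: a plain list of lines
```

A plain list or generator of lines is iterable but has no read method, which would explain why it is rejected.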

So, I built a generator the old way:

import re

class InFile(object):
    def __init__(self, infile):
        self.infile = open(infile)

    def __next__(self):
        return self.next()

    def __iter__(self):
        return self

    def read(self, *args, **kwargs):
        # an empty string tells the caller that the end of the file was reached
        return next(self, '')

    def next(self):
        # readline() returns '' at end of file; it does not raise
        line: str = '' if self.infile.closed else self.infile.readline()
        if not line:
            self.infile.close()
            raise StopIteration
        line = re.sub(r'","', r',', line)  # do some fixing
        return line

This works in pandas.read_csv():

df = pd.read_csv(InFile("some.csv"))

To me this looks super complicated and I wonder if there is any better (→ more elegant) solution.

Solution 2:

Here's a solution that will work for smaller CSV files. All lines are first read into memory, processed, and concatenated. This will probably perform badly for larger files.

import re
from io import StringIO
import pandas as pd

with open('file.csv') as file:
    lines = [re.sub(r'","', r',', line) for line in file]

df = pd.read_csv(StringIO(''.join(lines)))  # each line already ends with '\n'
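For larger files, one possible middle ground is to fix the lines lazily instead of loading them all first. The sketch below is only an illustration under a few assumptions: FixingReader and fix_quotes are hypothetical names invented here, the file is opened with the default text encoding, and the fix function is assumed to keep each line's trailing newline. Unlike the class in Solution 1, read() honors the requested size.

```python
import re


class FixingReader:
    """Lazily apply `fix` to each line; expose read() and line iteration.

    Hypothetical helper for illustration; not part of pandas.
    """

    def __init__(self, path, fix):
        self._file = open(path)
        self._lines = (fix(line) for line in self._file)
        self._buffer = ""

    def read(self, size=-1):
        # Pull fixed lines until the request can be met or the input ends.
        while size is None or size < 0 or len(self._buffer) < size:
            line = next(self._lines, None)
            if line is None:
                self._file.close()
                break
            self._buffer += line
        if size is None or size < 0:
            size = len(self._buffer)
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

    def __iter__(self):
        return self

    def __next__(self):
        # Complete any partially read line before handing out the next one.
        while "\n" not in self._buffer:
            line = next(self._lines, None)
            if line is None:
                self._file.close()
                break
            self._buffer += line
        if not self._buffer:
            raise StopIteration
        end = self._buffer.find("\n") + 1 or len(self._buffer)
        line, self._buffer = self._buffer[:end], self._buffer[end:]
        return line


def fix_quotes(line):
    # Same search/replace as in the solutions above.
    return re.sub(r'","', r',', line)
```

Given the file-like check described in Solution 1, FixingReader("file.csv", fix_quotes) should be usable as the first argument to pd.read_csv(), though this is an assumption and may depend on the pandas version and parser engine.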