
Best And Efficient Way To Concat Or Append Huge Multiple Xlsx Files In Pandas

I'm new to pandas and making progress through self-learning, so I want the best and most efficient way to handle this: I have 3 (sometimes more than 3) Excel files ('.xlsx'), each about 100 MB.

Solution 1:

You can use multiprocessing to speed up the loading, then use concat to merge all the DataFrames:

import pandas as pd
import multiprocessing
import glob
import time


def read_excel(filename):
    # Each call parses one workbook into a DataFrame.
    return pd.read_excel(filename)


if __name__ == "__main__":
    files = glob.glob("./data/*.xlsx")  # all workbooks to load

    print("Sequential")
    print(f"Loading excel files: {time.strftime('%H:%M:%S', time.localtime())}")
    start = time.time()
    data = [read_excel(filename) for filename in files]
    end = time.time()
    print(f"Loaded excel files in {time.strftime('%H:%M:%S', time.gmtime(end-start))}")
    df_sq = pd.concat(data).reset_index(drop=True)

    print("Multiprocessing")
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        print(f"Loading excel files: {time.strftime('%H:%M:%S', time.localtime())}")
        start = time.time()
        data = pool.map(read_excel, files)  # one file per worker process
        end = time.time()
        print(f"Loaded excel files in {time.strftime('%H:%M:%S', time.gmtime(end-start))}")
        df_mp = pd.concat(data).reset_index(drop=True)

Example: 50 files of 25 MB each (about a 2x gain)

Sequential
Loading excel files: 09:12:17
Loaded excel files in 00:00:14

Multiprocessing
Loading excel files: 09:12:33
Loaded excel files in 00:00:07
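If you prefer the standard-library concurrent.futures API, the same pattern can be written with ProcessPoolExecutor. This is a minimal sketch, not part of the original answer; it assumes the same ./data/*.xlsx layout as the example above:

import glob
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def read_excel(filename):
    # Each worker process parses one workbook into a DataFrame.
    return pd.read_excel(filename)


if __name__ == "__main__":
    files = glob.glob("./data/*.xlsx")
    # ProcessPoolExecutor defaults to os.cpu_count() workers.
    with ProcessPoolExecutor() as executor:
        # executor.map preserves the input order of `files`.
        frames = list(executor.map(read_excel, files))
    # ignore_index=True is equivalent to concat + reset_index(drop=True).
    df = pd.concat(frames, ignore_index=True)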

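Not part of the original answer, but if you have to reload the same workbooks repeatedly, a common complementary trick is to cache each file as Parquet once and read the cached copies on later runs, since xlsx parsing is the slow step. A sketch, assuming pyarrow (or fastparquet) is installed and the same ./data layout:

import glob
import os

import pandas as pd

for xlsx in glob.glob("./data/*.xlsx"):
    parquet = os.path.splitext(xlsx)[0] + ".parquet"
    if not os.path.exists(parquet):
        # One-time conversion; later runs skip the slow xlsx parsing.
        pd.read_excel(xlsx).to_parquet(parquet)

# Subsequent loads read the much faster Parquet copies.
frames = [pd.read_parquet(p) for p in sorted(glob.glob("./data/*.parquet"))]
df = pd.concat(frames, ignore_index=True)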