
Read Multiple Lines From A File Batch By Batch

I would like to know whether there is a method that can read multiple lines from a file batch by batch. For example: with open(filename, 'rb') as f: for n_lines in f: process(n

Solution 1:

itertools.islice and two-argument iter can be used to accomplish this, though the idiom looks a little odd:

from itertools import islice

n = 5  # Or whatever chunk size you want
with open(filename, 'rb') as f:
    for n_lines in iter(lambda: tuple(islice(f, n)), ()):
        process(n_lines)

This keeps slicing off n lines at a time with islice (tuple forces the whole chunk to be read in) until f is exhausted. At that point the lambda returns an empty tuple, which equals the sentinel (), and the loop stops. The final chunk will have fewer than n lines if the number of lines in the file isn't an exact multiple of n. If you want each chunk as a single string, change the for loop to:

    # The b prefixes are ignored on 2.7, and necessary on 3.x since you opened
    # the file in binary mode
    for n_lines in iter(lambda: b''.join(islice(f, n)), b''):
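
For reuse, the same idiom can be wrapped in a small helper. This is only a sketch: read_batches is a name made up here, and io.BytesIO stands in for a real file opened in binary mode.

```python
import io
from itertools import islice

def read_batches(f, n):
    """Return an iterator over tuples of up to n lines from file-like f."""
    # iter(callable, sentinel) keeps calling the lambda until it returns
    # the sentinel, here the empty tuple produced once f is exhausted.
    return iter(lambda: tuple(islice(f, n)), ())

# io.BytesIO is a stand-in for open(filename, 'rb').
fake_file = io.BytesIO(b"a\nb\nc\nd\ne\n")
batches = list(read_batches(fake_file, 2))
# Five lines with n = 2 give two full batches and a final batch of one line.
```

Because the helper only calls islice on f, it should work with any iterator of lines, not just file objects.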

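Another approach is to use izip_longest, which avoids the lambda. As before, the for loops below go inside the with block, and the last tuple is padded with the fill value when the file runs out: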

from future_builtins import map  # Only on Py2
from itertools import izip_longest  # zip_longest on Py3

    # gets tuples possibly padded with empty strings at end of file
    for n_lines in izip_longest(*[f]*n, fillvalue=b''):

    # Or to combine into a single string:
    for n_lines in map(b''.join, izip_longest(*[f]*n, fillvalue=b'')):
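
To make the padding visible, here is a minimal Python 3 sketch using zip_longest. The io.BytesIO object is a stand-in for a real binary file, and the names padded and trimmed are illustrative only.

```python
import io
from itertools import zip_longest  # izip_longest on Python 2

n = 2
fake_file = io.BytesIO(b"a\nb\nc\n")  # stand-in for open(filename, 'rb')

# [fake_file] * n repeats the SAME iterator n times, so each output tuple
# pulls n consecutive lines from it.
padded = list(zip_longest(*[fake_file] * n, fillvalue=b""))

# Drop the b"" padding from the last batch if it must not reach process().
# A real line is never empty (a blank line is b"\n"), so this is safe.
trimmed = [tuple(line for line in batch if line) for batch in padded]
```

If the lines are joined into a single string anyway, the padding is harmless, since joining empty byte strings adds nothing.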

Solution 2:

You can iterate directly over the lines of a file (see the file.next docs; this also works on Python 3):

with open(filename) as f:
    for line in f:
        something(line)

so your code can be rewritten as:

n = 5  # your batch size
with open(filename) as f:
    batch = []
    for line in f:
        batch.append(line)
        if len(batch) == n:
            process(batch)
            batch = []
if batch:  # the final batch may be smaller than n; skip it if it is empty
    process(batch)

Normally, though, just processing line by line (the first example) is more convenient.

If you don't care exactly how many lines are read for each batch, only that a batch doesn't take too much memory, use file.readlines with a size hint:

size_hint = 1 << 24  # 16 MiB
with open(filename) as f:
    while True:
        lines = f.readlines(size_hint)
        if not lines:  # readlines returns an empty list at end of file
            break
        process(lines)
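
The size-hint approach can also be written with two-argument iter, since readlines returns an empty list once the file is exhausted. The sketch below assumes Python 3; read_sized_batches is a hypothetical helper name, and io.StringIO stands in for a real text file.

```python
import io

def read_sized_batches(f, size_hint):
    """Return an iterator over lists of lines, each roughly size_hint in size."""
    # readlines stops collecting lines once their total size reaches the
    # hint, and returns [] at end of file, which matches the sentinel.
    return iter(lambda: f.readlines(size_hint), [])

fake_file = io.StringIO("aaaa\nbbbb\ncccc\n")  # stand-in for open(filename)
batches = list(read_sized_batches(fake_file, 6))
all_lines = [line for batch in batches for line in batch]
```

The hint is only approximate, so batch boundaries can differ between file types; what is guaranteed is that every line appears exactly once and in order.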
