Python Postgres Can I Fetchall() 1 Million Rows?

March 31, 2024

I am using the psycopg2 module in Python to read from a Postgres database. I need to do some operation on all rows in a column that has more than 1 million rows. I would like to know whether I can fetchall() that many rows.

Solution 1:

The solution Burhan pointed out reduces the memory usage for large datasets by only fetching single rows:

row = cursor.fetchone()

However, I noticed a significant slowdown when fetching rows one by one. I access an external database over an internet connection, which might be a reason for it.

Using a server-side cursor and fetching batches of rows proved to be the most performant solution. You can change the SQL statements (as in alecxe's answer), but there is also a pure Python approach using a feature provided by psycopg2:

cursor = conn.cursor('name_of_the_new_server_side_cursor')
cursor.execute(""" SELECT * FROM table LIMIT 1000000 """)

while True:
    rows = cursor.fetchmany(5000)
    if not rows:
        break

    for row in rows:
        # do something with row
        pass

You can find more about server-side cursors in the psycopg2 wiki.

Solution 2:

Consider using a server-side cursor:

When a database query is executed, the Psycopg cursor usually fetches all the records returned by the backend, transferring them to the client process. If the query returned a huge amount of data, a proportionally large amount of memory will be allocated by the client.
If the dataset is too large to be practically handled on the client side, it is possible to create a server side cursor. Using this kind of cursor it is possible to transfer to the client only a controlled amount of data, so that a large dataset can be examined without keeping it entirely in memory.

Here's an example:

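# DECLARE must run inside a transaction; psycopg2 opens one by default.
# A plain (text-format) cursor is used, since psycopg2 expects text results.
cursor.execute("DECLARE super_cursor CURSOR FOR SELECT names FROM myTable")
while True:
    cursor.execute("FETCH 1000 FROM super_cursor")
    rows = cursor.fetchall()

    if not rows:
        break

    for row in rows:
        doSomething(row)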

Solution 3:

fetchall() returns every remaining row of the result at once (arraysize only sets the default batch size for fetchmany()), so to avoid one massive fetch you can either fetch rows in manageable batches, or simply step through the cursor till it's exhausted:

row = cur.fetchone()
while row:
    # do something with row
    row = cur.fetchone()
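
As a quick check of how arraysize interacts with these methods under the Python DB-API, here is a minimal sketch. It uses the standard-library sqlite3 driver purely as a stand-in so that it runs without a Postgres server; the table `t` and its contents are made up for the demonstration. psycopg2 cursors expose the same fetchmany(), fetchall() and arraysize names.

```python
import sqlite3

# sqlite3 is only a stand-in DB-API driver here, so the example runs
# without a Postgres server. The table "t" is hypothetical demo data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t (n) VALUES (?)", [(i,) for i in range(10)])

cur = conn.cursor()
cur.arraysize = 3            # default batch size used by fetchmany()

cur.execute("SELECT n FROM t ORDER BY n")
batch = cur.fetchmany()      # no size given, so arraysize (3) applies
rest = cur.fetchall()        # all remaining rows, regardless of arraysize

print(len(batch), len(rest))  # prints: 3 7
conn.close()
```

In other words, arraysize limits a fetchmany() call, while fetchall() always returns whatever is left of the result.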

Solution 4:

Here is how to use a simple server-side cursor while keeping the speed of fetchmany() batching.

The principle is to use a named cursor in psycopg2 and give it a good itersize, so that it loads many rows at once like fetchmany() would, but with a single "for rec in cursor" loop that does an implicit fetchone().

With this code I run queries returning 150 million rows from a multi-billion-row table within 1 hour, using 200 MB of RAM.
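
A minimal sketch of the approach described above, not the author's original code. It assumes `conn` is an open psycopg2 connection; the function name `stream_rows`, the cursor name, and the default itersize of 5000 are illustrative choices, not taken from the answer.

```python
def stream_rows(conn, sql, params=None, itersize=5000):
    """Yield rows one at a time from a psycopg2 named (server-side) cursor."""
    # Passing a name makes psycopg2 create a server-side cursor.
    # "stream_rows_cursor" is an arbitrary, hypothetical name.
    with conn.cursor(name="stream_rows_cursor") as cur:
        # Number of rows pulled from the server per round trip while iterating.
        cur.itersize = itersize
        cur.execute(sql, params)
        for rec in cur:
            yield rec


# Hypothetical usage (connection string and table name are placeholders):
#   import psycopg2
#   conn = psycopg2.connect("dbname=example")
#   for rec in stream_rows(conn, "SELECT * FROM big_table"):
#       handle(rec)
```

Only about itersize rows should be held on the client at any moment, which is what keeps memory use flat. A named cursor lives inside a transaction, so the connection should not be in autocommit mode while the loop runs.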

Solution 5:

EDIT: Using fetchmany() (along with fetchone() and fetchall()), even with a row limit (arraysize), will still send the entire result set and keep it client-side (stored in the underlying C library, I think libpq) for any additional fetchmany() calls, etc. Without a named cursor (which requires an open transaction), you have to resort to LIMIT in the SQL with an ORDER BY, then analyze the results and augment the next query with WHERE (ordered_val = %(last_seen_val)s AND primary_key > %(last_seen_pk)s OR ordered_val > %(last_seen_val)s).

This is misleading on the library's part, to say the least, and there should be a blurb in the documentation about it. I don't know why it's not there.

I'm not sure a named cursor is a good fit unless you need to scroll forward/backward interactively. I could be wrong here.
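
The LIMIT plus ORDER BY paging mentioned in the EDIT can be sketched roughly as follows. This is an illustration under stated assumptions, not code from the answer: it uses the standard-library sqlite3 driver as a stand-in so it runs without a Postgres server, assumes a `waffles_table` with an integer primary key `id` and a non-negative, NOT NULL `num_waffles` column, and uses sqlite3's `:name` placeholders where psycopg2 would use `%(name)s`.

```python
import sqlite3

# Same WHERE shape as in the EDIT: resume strictly after the last row seen.
PAGE_SQL = """
    SELECT id, num_waffles
    FROM waffles_table
    WHERE (num_waffles = :last_val AND id > :last_pk)
       OR num_waffles > :last_val
    ORDER BY num_waffles, id
    LIMIT :page_size
"""


def iter_keyset(conn, page_size=1000):
    """Yield every row in (num_waffles, id) order, one LIMIT-ed query at a time."""
    # Start below any real value; assumes both columns are non-negative.
    last_val, last_pk = -1, -1
    while True:
        rows = conn.execute(
            PAGE_SQL,
            {"last_val": last_val, "last_pk": last_pk, "page_size": page_size},
        ).fetchall()
        if not rows:
            return
        yield from rows
        last_pk, last_val = rows[-1]   # remember where this page ended


# Demo on a small in-memory table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE waffles_table (id INTEGER PRIMARY KEY, num_waffles INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO waffles_table (id, num_waffles) VALUES (?, ?)",
    [(i, i % 4) for i in range(1, 11)],
)
rows = list(iter_keyset(conn, page_size=3))
print(len(rows))  # prints: 10
```

Each query holds at most page_size rows on the client, at the cost of re-running the query per page, so an index on (num_waffles, id) would matter on a large table.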

The fetchmany loop is tedious but I think it's the best solution here. To make life easier, you can use the following:

from functools import partial
from itertools import chain

# shorthand for flattening an iterable of chunks into single rows
from_iterable = chain.from_iterable

# util function
def run_and_iterate(curs, sql, parms=None, chunksize=1000):
    if parms is None:
        curs.execute(sql)
    else:
        curs.execute(sql, parms)
    # call curs.fetchmany(chunksize) repeatedly until it returns an empty list
    chunks_until_empty = iter(partial(curs.fetchmany, chunksize), [])
    return from_iterable(chunks_until_empty)

# example scenario
for row in run_and_iterate(cur, 'select * from waffles_table where num_waffles > %s', (10,)):
    print('lots of waffles: %s' % (row,))
