Batch Processing in SQLite: how to efficiently update a table with millions of records

(Sample run: updating in batches from the last id)

Tl;Dr — if you want to bulk update a database table, you can do it in batched iterations. The whole run takes more time, but each iteration holds a much shorter lock on the database. This matters especially in Sqlite3; other database engines have mechanisms that mitigate the issue (e.g. MVCC). In addition, to avoid re-scanning the table on each iteration, save the id of the last updated row and continue from there.

For full code see: https://gist.github.com/yuval1024/8b63174779861bb5f3dc1baf32994ae1

Let’s say you have an SQL database and you would like to update a field in bulk, e.g. reset it to NULL. What happens when you have thousands of records? Hundreds of thousands? Millions?

You can end up with the DB locked. Some notes:

1. This mainly applies to Sqlite3. Postgresql and other advanced DBs have some kind of versioning system (e.g. MVCC in Postgres), so they don’t lock the entire table, only some of the records; in Sqlite3 the locking issue is much more present.
2. Between each run, we set the rankings back to random values.
3. Between each run, we *must purge the page cache*, so that consecutive runs aren’t faster simply because of caching (see the sketch after this list).
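How exactly you reset state between runs depends on your setup; here is a minimal sketch of what it could look like, assuming Linux with root access. The reset_between_runs helper is hypothetical and not part of the gist:

import sqlite3
import subprocess

def reset_between_runs(db_path: str):
    # Put every row back into a non-NULL, random state so each method
    # starts from the same data.
    with sqlite3.connect(db_path) as conn:
        conn.execute("UPDATE domains SET domain_ranking = ABS(RANDOM()) % 1000000 + 1")

    # Drop the OS page cache so a later run doesn't win just because its
    # pages are already in memory (Linux only, requires root).
    subprocess.run(["sync"], check=True)
    subprocess.run(["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True)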

Creating and populating 20M records

Let’s create and populate the table:

CREATE TABLE IF NOT EXISTS domains (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    domain_name TEXT NOT NULL,
    domain_ranking INTEGER
);

def populate_data(self, num_records: int = 10_000_000):
    with sqlite3.connect(self.db_path) as conn:
        # insert in batches of 10k rows; the connection context manager
        # commits the transaction on exit
        batch_size = 10000
        for i in range(0, num_records, batch_size):
            batch = [
                (f"domain_{j}", random.randint(1, 1000000))
                for j in range(i, min(i + batch_size, num_records))
            ]
            conn.executemany(
                "INSERT INTO domains (domain_name, domain_ranking) VALUES (?, ?)",
                batch
            )
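These snippets are written as methods; for context, here is a minimal usage sketch, assuming they hang off a small wrapper class (the class name DomainDB and the create_table helper are hypothetical; the real structure is in the gist):

db = DomainDB(db_path="domains.db")        # hypothetical wrapper class
db.create_table()                          # runs the CREATE TABLE statement above
db.populate_data(num_records=20_000_000)   # 20M rows, as in the heading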

Anti-pattern: update row by row

We update row by row. Note that we didn’t go full anti-pattern: we still commit every 1000 records rather than after each record; committing after every single record would be the worst possible option.

def reset_row_by_row(self):
    with sqlite3.connect(self.db_path) as conn:
        cursor = conn.execute("SELECT id FROM domains WHERE domain_ranking IS NOT NULL")
        total_reset = 0
        for (id_,) in cursor:
            conn.execute(
                "UPDATE domains SET domain_ranking = NULL WHERE id = ?",
                (id_,)
            )
            total_reset += 1
            if total_reset % 1000 == 0:
                conn.commit()
        conn.commit()  # commit the remaining rows

First method: using 1 SQL statement

This approach is simple and also the quickest; however, it locks the table for the entire duration of the statement:

def reset_all(self):
    with sqlite3.connect(self.db_path) as conn:
        conn.execute("UPDATE domains SET domain_ranking = NULL")

Second method: using a loop + batches, i.e. reset in chunks

This approach looks better: inside a while True loop, each iteration resets 10k records. This still locks the table, but for a much shorter period each time; the total time for the whole table will be longer, but each individual lock is much shorter. However, each iteration needs to scan the table to find the next relevant records.

def reset_in_chunks(self, chunk_size: int = 10_000):
    with sqlite3.connect(self.db_path) as conn:
        while True:
            cursor = conn.execute(f"""
                UPDATE domains
                SET domain_ranking = NULL
                WHERE id IN (
                    SELECT id FROM domains
                    WHERE domain_ranking IS NOT NULL
                    LIMIT {chunk_size}
                )
            """)

            if cursor.rowcount == 0:
                break

            conn.commit()
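One way to see that scan (a quick check added here for illustration; it is not part of the gist) is to ask SQLite for the query plan of a single chunk. With no index on domain_ranking, the subquery shows up as a full scan of domains:

import sqlite3

with sqlite3.connect("domains.db") as conn:
    plan = conn.execute("""
        EXPLAIN QUERY PLAN
        UPDATE domains SET domain_ranking = NULL
        WHERE id IN (
            SELECT id FROM domains WHERE domain_ranking IS NOT NULL LIMIT 10000
        )
    """).fetchall()
    for row in plan:
        print(row)  # look for a SCAN over domains (exact wording varies by SQLite version)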

Improvement: save the last id and start from it

In the last example, we still have to scan the table on each iteration. Many of the pages may already be in the cache, and we could also add an index on the domain_ranking field; but when we don’t have an index, we can instead save the id of the last updated record and start the next chunk from that id, avoiding the scan.

def reset_with_tracking(self, chunk_size: int = 10000):
    last_id = 0
    with sqlite3.connect(self.db_path) as conn:
        while True:
            cursor = conn.execute(f"""
                UPDATE domains
                SET domain_ranking = NULL
                WHERE id IN (
                    SELECT id FROM domains
                    WHERE id > {last_id}
                    AND domain_ranking IS NOT NULL
                    ORDER BY id
                    LIMIT {chunk_size}
                )
                RETURNING id
            """)

            updated_ids = cursor.fetchall()  # RETURNING requires SQLite 3.35+
            if not updated_ids:
                break

            last_id = max(row[0] for row in updated_ids)
            conn.commit()
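As mentioned above, the other way to avoid the scan is an index. A minimal sketch of that alternative, assuming you can afford the extra index maintenance while the UPDATE runs (the sketch uses a partial index, so only rows that still have a ranking are indexed; a plain index on domain_ranking is the other option):

import sqlite3

with sqlite3.connect("domains.db") as conn:
    # Partial index: only rows with a non-NULL ranking are indexed, so the
    # chunked subquery can find them without a full table scan, and the
    # index shrinks as rankings are reset to NULL.
    conn.execute("""
        CREATE INDEX IF NOT EXISTS idx_domains_ranking_not_null
        ON domains(domain_ranking)
        WHERE domain_ranking IS NOT NULL
    """)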

In conclusion, when working with DBs and performing heavy-duty tasks, always keep in mind the locking mechanism and your DB access patterns.