PostgreSQL/Wraparound and Freeze
Two Fundamental Problems
[edit | edit source]Transactions are identified by ids which are realized as unsigned 32-bit integers and named XID. Transactions and their XIDs are known at the cluster level, covering all databases. Similar to sequences, XIDs are incremented by for every new transaction. Sooner or later, this limited space of numbers is exhausted, and it becomes necessary to restart the sequence from the beginning (the values 0, 1, and 2 are skipped because they are reserved for particular purposes). This restart of XIDs is called a wraparound and each cycle an epoch.
It's unlikely that more than transactions exist in a cluster at the same time or that a single transaction lasts for such a long time that its XID collides with the same value of the next cycle. At first glance, this cyclic usage of the ' Universe' seems to be safe and easy to implement. Nevertheless, huge problems will arise with this simple strategy. The reason is that XIDs are stored in system columns within every row (see xmin, xmax
in the MVCC chapter). And rows stay for a very long time in the database, in many cases forever.
XID Collision
[edit | edit source]The first problem is that after a wraparound, the next XIDs (3, 4, 5, ...) may collide with XIDs of the previous epoch. They are no longer unique because system columns of old rows may contain the same values. But transactions must be able to decide whether retrieved rows are modified (by other transactions) after their own start-time or a long time ago. We call this first problem XID Collision.
Sudden Death
[edit | edit source]The second problem correlates with MVCC and the timeline of transactions. Rows may exist in multiple versions. When a transaction modifies a row and stays alive a little longer - no COMMIT or ROLLBACK because of more activities -, other processes shall 'see' the version of the row as of the start-time of the transaction and not the uncommited modification. Hence, a mechanism must hide the ongoing changes and give other transactions the feeling of a stable data environment. The system realizes this by considering additional criteria with every SQL command, especially - but not only - the system column xmin
.
You can imagine of those additional criteria that the system silently supplements every query with the predicate xmin < my_xid
. (This is only an illustration in pseudo-code, the real implementation is different.) It guarantees that changes happening after the start of the requesting transaction are invisible to it.
So far, so good.
But what will happen after a wraparound? The next transactions will have very small XIDs, e.g., '5'. And what will be the result of an xmin < 5
? Near nothing. All rows with xmin
between and are no longer part of any query. In contrast to the situation a few moments ago, all data suddenly disappeared. It is still in the database, but it is unreachable. We call this second problem Sudden Death.
Solution
[edit | edit source]Step 1: At the conceptual level, the full ' Universe' is divided into two halves of numbers. One split point is the current transaction id pg_current_xact_id
(it was called txid_current
before PostgreSQL version 13) and the other is the opposite side of the circle pg_current_xact_id +
(or pg_current_xact_id -
what is the same). So, the split points are not fixed values but follow dynamically the ongoing of new transactions. One halve represents the previously used and therefore exhausted XIDs; the other halve such XIDs, which are - per definition - free. They will be allocated in the future. Please note the dynamic aspect: with every new transaction in the cluster pg_current_xact_id
and the border between 'past / future' moves forward. This is a metaphor of an endless walk through time where after some time the old problems will be forgotten or at least idealized.
The idea can be realized by a modification of the above xmin < my_xid
predicate to an if/else
block:
if (my_xid < ) return rows with: xmin < my_xid OR xmin > my_xid + else return rows with: xmin < my_xid AND xmin > my_xid +
Of course, this is a simplification and many other criteria like COMMIT status, 'is deleted', and other things must be considered. It focuses purely on the aspect of the 'past / future' metaphor.
Note: With this algorithm the 'critical point' changes from resp.
to
pg_current_xact_id +
. It is called the wraparound point and the line between pg_current_xact_id +
and pg_current_xact_id
the wraparound horizon.
Step 2: The outlined algorithm ensures the visibility of 50% of all possible XIDs. But what's going on with the others? As mentioned, rows may stay in the database forever holding very old XIDs in xmin
. This halve must also be considered. The idea of how to access the complete range of possible XIDs is to complement the previous algorithm with the introduction of a flag that marks certain rows as 'visible forever' (respectively visible until the next write operation to them). This marking is not possible as long as one or more transactions potentially get write access to them. Fortunately, the sequence of new XIDs goes strictly forward, and overtime transactions with old XIDs finish. PostgreSQL does not only know, which is the current XID pg_current_xact_id
, but also which is the oldest active XID per connection (pg_stat_activity.backend_xmin), and which is the lower bound of all unfrozen XIDs per table (pg_class.relfrozenxid) and per database (pg_database.datfrozenxid). Rows with xmin
older than oldest(pg_stat_activity.backend_xmin)
are candidates for such a flag. No running transaction has or will get write access to them, only newer ones. According to MVCC, they will create a new row version, this one keeps as it is.
It is one of the two main duties of VACUUM to perform this freezing. It marks the identified rows with a flag in their header t_infomask
as 'visible forever'. From this point on no comparisons with xmin
take place. The rows are always treated as visible, even if they are part of the 'future'. This marking is called FREEZE and the status of the row FROZEN.
Now the algorithm for retrieving rows changes to:
-- 'frozen' rows will always be returned if (my_xid < ) return rows with: frozen OR (xmin < my_xid OR xmin > my_xid + ) else return rows with: frozen OR (xmin < my_xid AND xmin > my_xid + )
With this extension, the two mentioned problems are solved. The system can generate XIDs even after a wraparound without risk of collision with old XIDs. The old ones may exist, but they are not touched in any way. Second, the algorithm finds all relevant XIDs whether there was a wraparound or not.
Wraparound Failure
[edit | edit source]It is possible that a transaction - intentionally or by an error in an application - stays alive for a long time. Over time its XID becomes the oldest one in the complete cluster and can be retrieved from pg_stat_activity.backend_xmin. As long as this situation continues, the gap between the ongoing wrapping point pg_current_xact_id +
and pg_stat_activity.backend_xmin gets smaller and smaller. If the gap would close completely, we would see all the problems described at the beginning of the chapter. This is called wraparound failure and must be avoided under all circumstances. VACUUM is doing its best to freeze as many rows as possible. But if a long-living transaction prevents freezing and the size of the gap falls below a certain limit, VACUUM runs in an 'aggressive mode' and works on all pages of affected tables, independent from the above-mentioned values; if also this fails, the cluster stops the creation of new transactions and prevents further write actions.
Details
[edit | edit source]Note: These details can be skipped by a novice reader without losing the context of the ongoing chapters.
To freeze any row version, VACUUM must check several criteria:
xmax
must be zero because only non-deleted rows can be visible forever.xmin
must be older than all currently existing transactionsoldest(pg_stat_activity.backend_xmin)
. This guarantees that no existing transaction can modify or delete the version.- The transactions of
xmin
andxmax
must be committed.
At what point in time does the freeze operation take place? Please note that there are configuration parameters with names like xxx_age
. They define distances - mostly to pg_current_xact_id
-, where the actions of VACUUM shall start. 'age' in this context doesn't imply a certain period of time, it's always a pure number that counts transactions, e.g., 50 million. Please also note, that VACUUM always reads complete physical pages and works on the row versions found there.
- When a client issues the SQL command VACUUM with its FREEZE option. In this case, all pages of the affected tables are processed that are marked in the Visibility Map as potentially having unfrozen rows.
- When a client issues the SQL command VACUUM without any option and there are XIDs older than vacuum_freeze_table_age (default: 150 million) minus vacuum_freeze_min_age (default: 50 million). As before, all pages are processed that are marked in the Visibility Map to potentially have unfrozen rows.
- When an Autovacuum process runs. Such a process acts in one of two modes: In the normal mode it skips pages with row versions that are younger than vacuum_freeze_min_age (default: 50 million) and works only on pages where all XIDs are older. The skipping of young XIDs prevents work on such pages, which are likely to be changed by one of the future SQL commands. The process switches to an aggressive mode if it recognizes that for the processed table the oldest XID exceeds vacuum_freeze_table_age (default: 150 million). In this aggressive mode, Autovacuum processes all pages of the affected table.
VACUUM and Autovacuum know to which value the oldest unfrozen XID has moved forward per table and logs the value in pg_class.relfrozenxid. The distance between this value and the pg_current_xact_id
split point becomes smaller (there are potentially unfrozen rows), and the distance to the wraparound point pg_current_xact_id +
becomes larger (there are only frozen rows). That is how the freezing follows the moving 'past' / 'future' horizon.
Note: Before version 9.4 of PostgreSQL, the freeze algorithm stored the value '2' (FrozenTransactionId) in xmin
instead of setting a flag in t_infomask
.