Commit 5dc0418

Prefetch data referenced by the WAL, take II.
Introduce a new GUC recovery_prefetch. When enabled, look ahead in the WAL and try to initiate asynchronous reading of referenced data blocks that are not yet cached in our buffer pool. For now, this is done with posix_fadvise(), which has several caveats. Since not all OSes have that system call, "try" is provided so that it can be enabled where available. Better mechanisms for asynchronous I/O are possible in later work.

Set to "try" for now for test coverage. Default setting to be finalized before release.

The GUC wal_decode_buffer_size limits the distance we can look ahead in bytes of decoded data.

The existing GUC maintenance_io_concurrency is used to limit the number of concurrent I/Os allowed, based on pessimistic heuristics used to infer that I/Os have begun and completed. We'll also not look more than maintenance_io_concurrency * 4 block references ahead.

Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (earlier version)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> (earlier version)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (earlier version)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> (earlier version)
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> (earlier version)
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
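
As a reading aid for the message above, here is a minimal C sketch of how an off/on/try setting could gate a posix_fadvise() hint. It is not the code from this commit: the enum, the function names and the 8192-byte block size are hypothetical, and treating "on" as an error on platforms without the call is an assumption of the sketch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical names; they mirror the GUC values, not PostgreSQL's identifiers. */
typedef enum { PREFETCH_OFF, PREFETCH_ON, PREFETCH_TRY } PrefetchSetting;

#define SKETCH_BLOCK_SIZE 8192	/* assumed block size for the sketch */

/* True when this platform exposes the WILLNEED advice at compile time. */
static bool
os_has_prefetch(void)
{
#if defined(POSIX_FADV_WILLNEED)
	return true;
#else
	return false;
#endif
}

/*
 * Returns 1 if prefetching should run, 0 if it should stay off, and -1 if
 * "on" was requested but the platform cannot honour it (sketch assumption).
 */
static int
prefetch_decision(PrefetchSetting setting)
{
	switch (setting)
	{
		case PREFETCH_OFF:
			return 0;
		case PREFETCH_TRY:
			return os_has_prefetch() ? 1 : 0;
		case PREFETCH_ON:
			return os_has_prefetch() ? 1 : -1;
	}
	return 0;
}

/*
 * Ask the kernel to start reading one block. Returns 0 on success (or when
 * the call is unavailable) and an error number otherwise, which is how
 * posix_fadvise() itself reports failure.
 */
int
prefetch_block(int fd, unsigned int blockno)
{
	assert(fd >= 0);
#if defined(POSIX_FADV_WILLNEED)
	return posix_fadvise(fd, (off_t) blockno * SKETCH_BLOCK_SIZE,
						 SKETCH_BLOCK_SIZE, POSIX_FADV_WILLNEED);
#else
	(void) blockno;
	return 0;
#endif
}
```

The hint is advisory only: a successful return says the request was accepted, not that the block is in the page cache, which is one reason the message speaks of pessimistic heuristics for inferring when I/Os have completed.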
1 parent 9553b41 commit 5dc0418

27 files changed

+1595
-77
lines changed

doc/src/sgml/config.sgml

Lines changed: 64 additions & 0 deletions
@@ -3657,6 +3657,70 @@ include_dir 'conf.d'
 </variablelist>
 </sect2>

+<sect2 id="runtime-config-wal-recovery">
+
+<title>Recovery</title>
+
+<indexterm>
+<primary>configuration</primary>
+<secondary>of recovery</secondary>
+<tertiary>general settings</tertiary>
+</indexterm>
+
+<para>
+This section describes the settings that apply to recovery in general,
+affecting crash recovery, streaming replication and archive-based
+replication.
+</para>
+
+
+<variablelist>
+<varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+<term><varname>recovery_prefetch</varname> (<type>enum</type>)
+<indexterm>
+<primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+</indexterm>
+</term>
+<listitem>
+<para>
+Whether to try to prefetch blocks that are referenced in the WAL that
+are not yet in the buffer pool, during recovery. Valid values are
+<literal>off</literal> (the default), <literal>on</literal> and
+<literal>try</literal>. The setting <literal>try</literal> enables
+prefetching only if the operating system provides the
+<function>posix_fadvise</function> function, which is currently used
+to implement prefetching. Note that some operating systems provide the
+function, but it doesn't do anything.
+</para>
+<para>
+Prefetching blocks that will soon be needed can reduce I/O wait times
+during recovery with some workloads.
+See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+<xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+prefetching activity.
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+<term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+<indexterm>
+<primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+</indexterm>
+</term>
+<listitem>
+<para>
+A limit on how far ahead the server can look in the WAL, to find
+blocks to prefetch. If this value is specified without units, it is
+taken as bytes.
+The default is 512kB.
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+</sect2>
+
 <sect2 id="runtime-config-wal-archive-recovery">

 <title>Archive Recovery</title>

doc/src/sgml/monitoring.sgml

Lines changed: 84 additions & 2 deletions
@@ -328,6 +328,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
 </entry>
 </row>

+<row>
+<entry><structname>pg_stat_recovery_prefetch</structname><indexterm><primary>pg_stat_recovery_prefetch</primary></indexterm></entry>
+<entry>Only one row, showing statistics about blocks prefetched during recovery.
+See <xref linkend="pg-stat-recovery-prefetch-view"/> for details.
+</entry>
+</row>
+
 <row>
 <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
 <entry>At least one row per subscription, showing information about
@@ -2979,6 +2986,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 copy of the subscribed tables.
 </para>

+<table id="pg-stat-recovery-prefetch-view" xreflabel="pg_stat_recovery_prefetch">
+<title><structname>pg_stat_recovery_prefetch</structname> View</title>
+<tgroup cols="3">
+<thead>
+<row>
+<entry>Column</entry>
+<entry>Type</entry>
+<entry>Description</entry>
+</row>
+</thead>
+
+<tbody>
+<row>
+<entry><structfield>prefetch</structfield></entry>
+<entry><type>bigint</type></entry>
+<entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+</row>
+<row>
+<entry><structfield>hit</structfield></entry>
+<entry><type>bigint</type></entry>
+<entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+</row>
+<row>
+<entry><structfield>skip_init</structfield></entry>
+<entry><type>bigint</type></entry>
+<entry>Number of blocks not prefetched because they would be zero-initialized</entry>
+</row>
+<row>
+<entry><structfield>skip_new</structfield></entry>
+<entry><type>bigint</type></entry>
+<entry>Number of blocks not prefetched because they didn't exist yet</entry>
+</row>
+<row>
+<entry><structfield>skip_fpw</structfield></entry>
+<entry><type>bigint</type></entry>
+<entry>Number of blocks not prefetched because a full page image was included in the WAL</entry>
+</row>
+<row>
+<entry><structfield>skip_rep</structfield></entry>
+<entry><type>bigint</type></entry>
+<entry>Number of blocks not prefetched because they were already recently prefetched</entry>
+</row>
+<row>
+<entry><structfield>wal_distance</structfield></entry>
+<entry><type>integer</type></entry>
+<entry>How many bytes ahead the prefetcher is looking</entry>
+</row>
+<row>
+<entry><structfield>block_distance</structfield></entry>
+<entry><type>integer</type></entry>
+<entry>How many blocks ahead the prefetcher is looking</entry>
+</row>
+<row>
+<entry><structfield>io_depth</structfield></entry>
+<entry><type>integer</type></entry>
+<entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+</row>
+</tbody>
+</tgroup>
+</table>
+
+<para>
+The <structname>pg_stat_recovery_prefetch</structname> view will contain
+only one row. It is filled with nulls if recovery has not run or
+<xref linkend="guc-recovery-prefetch"/> is not enabled. The
+columns <structfield>wal_distance</structfield>,
+<structfield>block_distance</structfield>
+and <structfield>io_depth</structfield> show current values, and the
+other columns show cumulative counters that can be reset
+with the <function>pg_stat_reset_shared</function> function.
+</para>
+
 <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
 <title><structname>pg_stat_subscription</structname> View</title>
 <tgroup cols="1">
@@ -5199,8 +5278,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 all the counters shown in
 the <structname>pg_stat_bgwriter</structname>
 view, <literal>archiver</literal> to reset all the counters shown in
-the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+the <structname>pg_stat_archiver</structname> view,
+<literal>wal</literal> to reset all the counters shown in the
+<structname>pg_stat_wal</structname> view or
+<literal>recovery_prefetch</literal> to reset all the counters shown
+in the <structname>pg_stat_recovery_prefetch</structname> view.
 </para>
 <para>
 This function is restricted to superusers by default, but other users
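
The six cumulative columns documented in the monitoring.sgml hunks above each count one outcome for a block reference. The C sketch below shows one way such a classification could be written so that every reference bumps exactly one counter. The struct layouts, the function name and the order of the checks are hypothetical and are not taken from this commit; only the counter names come from the view.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Counter names follow the view's columns; the struct itself is hypothetical. */
typedef struct PrefetchCounters
{
	int64_t		prefetch;
	int64_t		hit;
	int64_t		skip_init;
	int64_t		skip_new;
	int64_t		skip_fpw;
	int64_t		skip_rep;
} PrefetchCounters;

/* Hypothetical description of one block reference found in a WAL record. */
typedef struct BlockRef
{
	bool		has_full_page_image;	/* record carries the whole page */
	bool		will_be_zeroed;			/* replay zero-initializes the page */
	bool		recently_prefetched;	/* same block was hinted a moment ago */
	bool		exists_on_disk;			/* block is within the current file */
	bool		in_buffer_pool;			/* block is already cached */
} BlockRef;

/*
 * Bump exactly one counter for the reference and return true only when a
 * prefetch would be issued. The order of the checks is an assumption.
 */
static bool
count_block_ref(PrefetchCounters *counters, const BlockRef *ref)
{
	assert(counters != NULL && ref != NULL);

	if (ref->has_full_page_image)
	{
		counters->skip_fpw++;
		return false;
	}
	if (ref->will_be_zeroed)
	{
		counters->skip_init++;
		return false;
	}
	if (ref->recently_prefetched)
	{
		counters->skip_rep++;
		return false;
	}
	if (!ref->exists_on_disk)
	{
		counters->skip_new++;
		return false;
	}
	if (ref->in_buffer_pool)
	{
		counters->hit++;
		return false;
	}
	counters->prefetch++;
	return true;
}
```

Under this model the sum of the six counters equals the number of block references examined, which gives a quick sanity check when reading the view.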

doc/src/sgml/wal.sgml

Lines changed: 12 additions & 0 deletions
@@ -803,6 +803,18 @@
 counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
 in <structname>pg_stat_wal</structname>, respectively.
 </para>
+
+<para>
+The <xref linkend="guc-recovery-prefetch"/> parameter can be used to reduce
+I/O wait times during recovery by instructing the kernel to initiate reads
+of disk blocks that will soon be needed but are not currently in
+<productname>PostgreSQL</productname>'s buffer pool.
+The <xref linkend="guc-maintenance-io-concurrency"/> and
+<xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+concurrency and distance, respectively. By default, it is set to
+<literal>try</literal>, which enabled the feature on systems where
+<function>posix_fadvise</function> is available.
+</para>
 </sect1>

 <sect1 id="wal-internals">
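
The wal.sgml paragraph above names two limits, and the commit message adds a third: no more than maintenance_io_concurrency * 4 block references ahead. The C sketch below shows how the three caps could be checked together. The struct and function names are hypothetical; the state fields are named after the view's wal_distance, block_distance and io_depth columns, and the exact comparisons are assumptions rather than the commit's logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the two settings that bound prefetching. */
typedef struct LookaheadLimits
{
	int			maintenance_io_concurrency;	/* cap on I/Os believed in flight */
	size_t		wal_decode_buffer_size;		/* cap on decoded WAL bytes ahead */
} LookaheadLimits;

/* Hypothetical snapshot of how far ahead the prefetcher currently is. */
typedef struct LookaheadState
{
	int			io_depth;			/* prefetches not yet known complete */
	int			block_distance;		/* block references looked ahead */
	size_t		wal_distance;		/* decoded bytes looked ahead */
} LookaheadState;

/* May another prefetch be initiated right now? */
static bool
can_start_io(const LookaheadLimits *lim, const LookaheadState *st)
{
	return st->io_depth < lim->maintenance_io_concurrency;
}

/* May the prefetcher decode one more record of the given size? */
static bool
can_look_further(const LookaheadLimits *lim, const LookaheadState *st,
				 size_t next_record_size)
{
	assert(lim->maintenance_io_concurrency >= 0);

	/* Block-reference cap stated in the commit message. */
	if (st->block_distance >= lim->maintenance_io_concurrency * 4)
		return false;

	/* Byte cap: the next record must still fit in the decode buffer. */
	return st->wal_distance + next_record_size <= lim->wal_decode_buffer_size;
}
```

With the 512 * 1024 byte default for wal_decode_buffer_size from this commit and an illustrative concurrency of 10, the sketch stops looking ahead at 40 block references or 512kB of decoded data, whichever is reached first.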

src/backend/access/transam/Makefile

Lines changed: 1 addition & 0 deletions
@@ -31,6 +31,7 @@ OBJS = \
 xlogarchive.o \
 xlogfuncs.o \
 xloginsert.o \
+xlogprefetcher.o \
 xlogreader.o \
 xlogrecovery.o \
 xlogutils.o

src/backend/access/transam/xlog.c

Lines changed: 2 additions & 0 deletions
@@ -59,6 +59,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -133,6 +134,7 @@ int CommitDelay = 0; /* precommit delay in microseconds */
 int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int wal_retrieve_retry_interval = 5000;
 int max_slot_wal_keep_size_mb = -1;
+int wal_decode_buffer_size = 512 * 1024;
 bool track_wal_io_timing = false;

 #ifdef WAL_DEBUG
