
Commit 0b096e3

Change pg_test_timing to measure in nanoseconds not microseconds.
Most of our platforms have better-than-microsecond timing resolution, so the
original definition of this program is getting less and less useful. Make it
report nanoseconds not microseconds. Also, add a second output table that
reports the exact observed timing durations, up to a limit of 1024 ns; and be
sure to report the largest observed duration.

The documentation for this program included a lot of system-specific details
that now seem largely obsolete. Move all that text to the PG wiki, where
perhaps it will be easier to maintain and update.

Also, improve the TAP test so that it actually runs a short standard run,
allowing most of the code to be exercised; its coverage before was abysmal.

Author: Hannu Krosing <hannuk@google.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/be0339cc-1ae1-4892-9445-8e6d8995a44d@eisentraut.org
1 parent a27893d commit 0b096e3

File tree: 3 files changed, +251 -217 lines changed

doc/src/sgml/ref/pgtesttiming.sgml

Lines changed: 93 additions & 186 deletions
@@ -30,11 +30,23 @@ PostgreSQL documentation
  <title>Description</title>

 <para>
-  <application>pg_test_timing</application> is a tool to measure the timing overhead
-  on your system and confirm that the system time never moves backwards.
+  <application>pg_test_timing</application> is a tool to measure the
+  timing overhead on your system and confirm that the system time never
+  moves backwards. It simply reads the system clock over and over again
+  as fast as it can for a specified length of time, and then prints
+  statistics about the observed differences in successive clock readings.
+ </para>
+ <para>
+  Smaller (but not zero) differences are better, since they imply both
+  more-precise clock hardware and less overhead to collect a clock reading.
   Systems that are slow to collect timing data can give less accurate
   <command>EXPLAIN ANALYZE</command> results.
  </para>
+ <para>
+  This tool is also helpful to determine if
+  the <varname>track_io_timing</varname> configuration parameter is likely
+  to produce useful results.
+ </para>
 </refsect1>

 <refsect1>
@@ -59,6 +71,21 @@ PostgreSQL documentation
    </listitem>
   </varlistentry>

+  <varlistentry>
+   <term><option>-c <replaceable class="parameter">cutoff</replaceable></option></term>
+   <term><option>--cutoff=<replaceable class="parameter">cutoff</replaceable></option></term>
+   <listitem>
+    <para>
+     Specifies the cutoff percentage for the list of exact observed
+     timing durations (that is, the changes in the system clock value
+     from one reading to the next). The list will end once the running
+     percentage total reaches or exceeds this value, except that the
+     largest observed duration will always be printed. The default
+     cutoff is 99.99.
+    </para>
+   </listitem>
+  </varlistentry>
+
   <varlistentry>
    <term><option>-V</option></term>
    <term><option>--version</option></term>
@@ -92,212 +119,92 @@ PostgreSQL documentation
   <title>Interpreting Results</title>

  <para>
-  Good results will show most (>90%) individual timing calls take less than
-  one microsecond. Average per loop overhead will be even lower, below 100
-  nanoseconds. This example from an Intel i7-860 system using a TSC clock
-  source shows excellent performance:
-
-<screen><![CDATA[
-Testing timing overhead for 3 seconds.
-Per loop time including overhead: 35.96 ns
-Histogram of timing durations:
-  < us   % of total      count
-     1     96.40465   80435604
-     2      3.59518    2999652
-     4      0.00015        126
-     8      0.00002         13
-    16      0.00000          2
-]]></screen>
+  The first block of output has four columns, with rows showing a
+  shifted-by-one log2(ns) histogram of timing durations (that is, the
+  differences between successive clock readings). This is not the
+  classic log2(n+1) histogram as it counts zeros separately and then
+  switches to log2(ns) starting from value 1.
  </para>
-
  <para>
-  Note that different units are used for the per loop time than the
-  histogram. The loop can have resolution within a few nanoseconds (ns),
-  while the individual timing calls can only resolve down to one microsecond
-  (us).
+  The columns are:
+  <itemizedlist spacing="compact">
+   <listitem>
+    <simpara>nanosecond value that is &gt;= the durations in this
+     bucket</simpara>
+   </listitem>
+   <listitem>
+    <simpara>percentage of durations in this bucket</simpara>
+   </listitem>
+   <listitem>
+    <simpara>running-sum percentage of durations in this and previous
+     buckets</simpara>
+   </listitem>
+   <listitem>
+    <simpara>count of durations in this bucket</simpara>
+   </listitem>
+  </itemizedlist>
  </para>
-
- </refsect2>
- <refsect2>
-  <title>Measuring Executor Timing Overhead</title>
-
  <para>
-  When the query executor is running a statement using
-  <command>EXPLAIN ANALYZE</command>, individual operations are timed as well
-  as showing a summary. The overhead of your system can be checked by
-  counting rows with the <application>psql</application> program:
-
-<screen>
-CREATE TABLE t AS SELECT * FROM generate_series(1,100000);
-\timing
-SELECT COUNT(*) FROM t;
-EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
-</screen>
+  The second block of output goes into more detail, showing the exact
+  timing differences observed. For brevity this list is cut off when the
+  running-sum percentage exceeds the user-selectable cutoff value.
+  However, the largest observed difference is always shown.
  </para>
-
  <para>
-  The i7-860 system measured runs the count query in 9.8 ms while
-  the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each
-  processing just over 100,000 rows. That 6.8 ms difference means the timing
-  overhead per row is 68 ns, about twice what pg_test_timing estimated it
-  would be. Even that relatively small amount of overhead is making the fully
-  timed count statement take almost 70% longer. On more substantial queries,
-  the timing overhead would be less problematic.
+  The example results below show that 99.99% of timing loops took between
+  8 and 31 nanoseconds, with the worst case somewhere between 32768 and
+  65535 nanoseconds. In the second block, we can see that typical loop
+  time is 16 nanoseconds, and the readings appear to have full nanosecond
+  precision.
  </para>

- </refsect2>
-
- <refsect2>
-  <title>Changing Time Sources</title>
  <para>
-  On some newer Linux systems, it's possible to change the clock source used
-  to collect timing data at any time. A second example shows the slowdown
-  possible from switching to the slower acpi_pm time source, on the same
-  system used for the fast results above:
-
 <screen><![CDATA[
-# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-tsc hpet acpi_pm
-# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
-# pg_test_timing
-Per loop time including overhead: 722.92 ns
+Testing timing overhead for 3 seconds.
+Per loop time including overhead: 16.40 ns
 Histogram of timing durations:
-  < us   % of total      count
-     1     27.84870    1155682
-     2     72.05956    2990371
-     4      0.07810       3241
-     8      0.01357        563
-    16      0.00007          3
+  <= ns   % of total  running %      count
+      0       0.0000     0.0000          0
+      1       0.0000     0.0000          0
+      3       0.0000     0.0000          0
+      7       0.0000     0.0000          0
+     15       4.5452     4.5452    8313178
+     31      95.4527    99.9979  174581501
+     63       0.0001    99.9981        253
+    127       0.0001    99.9982        165
+    255       0.0000    99.9982         35
+    511       0.0000    99.9982          1
+   1023       0.0013    99.9994       2300
+   2047       0.0004    99.9998        690
+   4095       0.0000    99.9998          9
+   8191       0.0000    99.9998          8
+  16383       0.0002   100.0000        337
+  32767       0.0000   100.0000          2
+  65535       0.0000   100.0000          1
+
+Observed timing durations up to 99.9900%:
+     ns   % of total  running %      count
+     15       4.5452     4.5452    8313178
+     16      58.3785    62.9237  106773354
+     17      33.6840    96.6078   61607584
+     18       3.1151    99.7229    5697480
+     19       0.2638    99.9867     482570
+     20       0.0093    99.9960      17054
+...
+  38051       0.0000   100.0000          1
 ]]></screen>
 </para>

-<para>
-  In this configuration, the sample <command>EXPLAIN ANALYZE</command> above
-  takes 115.9 ms. That's 1061 ns of timing overhead, again a small multiple
-  of what's measured directly by this utility. That much timing overhead
-  means the actual query itself is only taking a tiny fraction of the
-  accounted for time, most of it is being consumed in overhead instead. In
-  this configuration, any <command>EXPLAIN ANALYZE</command> totals involving
-  many timed operations would be inflated significantly by timing overhead.
-</para>
-
-<para>
-  FreeBSD also allows changing the time source on the fly, and it logs
-  information about the timer selected during boot:
-
-<screen>
-# dmesg | grep "Timecounter"
-Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
-Timecounter "i8254" frequency 1193182 Hz quality 0
-Timecounters tick every 10.000 msec
-Timecounter "TSC" frequency 2531787134 Hz quality 800
-# sysctl kern.timecounter.hardware=TSC
-kern.timecounter.hardware: ACPI-fast -> TSC
-</screen>
-</para>
-
-<para>
-  Other systems may only allow setting the time source on boot. On older
-  Linux systems the "clock" kernel setting is the only way to make this sort
-  of change. And even on some more recent ones, the only option you'll see
-  for a clock source is "jiffies". Jiffies are the older Linux software clock
-  implementation, which can have good resolution when it's backed by fast
-  enough timing hardware, as in this example:
-
-<screen><![CDATA[
-$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-jiffies
-$ dmesg | grep time.c
-time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
-time.c: Detected 2400.153 MHz processor.
-$ pg_test_timing
-Testing timing overhead for 3 seconds.
-Per timing duration including loop overhead: 97.75 ns
-Histogram of timing durations:
-  < us   % of total      count
-     1     90.23734   27694571
-     2      9.75277    2993204
-     4      0.00981       3010
-     8      0.00007         22
-    16      0.00000          1
-    32      0.00000          1
-]]></screen></para>
-
 </refsect2>
-
-<refsect2>
- <title>Clock Hardware and Timing Accuracy</title>
-
-<para>
-  Collecting accurate timing information is normally done on computers using
-  hardware clocks with various levels of accuracy. With some hardware the
-  operating systems can pass the system clock time almost directly to
-  programs. A system clock can also be derived from a chip that simply
-  provides timing interrupts, periodic ticks at some known time interval. In
-  either case, operating system kernels provide a clock source that hides
-  these details. But the accuracy of that clock source and how quickly it can
-  return results varies based on the underlying hardware.
-</para>
-
-<para>
-  Inaccurate time keeping can result in system instability. Test any change
-  to the clock source very carefully. Operating system defaults are sometimes
-  made to favor reliability over best accuracy. And if you are using a virtual
-  machine, look into the recommended time sources compatible with it. Virtual
-  hardware faces additional difficulties when emulating timers, and there are
-  often per operating system settings suggested by vendors.
-</para>
-
-<para>
-  The Time Stamp Counter (TSC) clock source is the most accurate one available
-  on current generation CPUs. It's the preferred way to track the system time
-  when it's supported by the operating system and the TSC clock is
-  reliable. There are several ways that TSC can fail to provide an accurate
-  timing source, making it unreliable. Older systems can have a TSC clock that
-  varies based on the CPU temperature, making it unusable for timing. Trying
-  to use TSC on some older multicore CPUs can give a reported time that's
-  inconsistent among multiple cores. This can result in the time going
-  backwards, a problem this program checks for. And even the newest systems
-  can fail to provide accurate TSC timing with very aggressive power saving
-  configurations.
-</para>
-
-<para>
-  Newer operating systems may check for the known TSC problems and switch to a
-  slower, more stable clock source when they are seen. If your system
-  supports TSC time but doesn't default to that, it may be disabled for a good
-  reason. And some operating systems may not detect all the possible problems
-  correctly, or will allow using TSC even in situations where it's known to be
-  inaccurate.
-</para>
-
-<para>
-  The High Precision Event Timer (HPET) is the preferred timer on systems
-  where it's available and TSC is not accurate. The timer chip itself is
-  programmable to allow up to 100 nanosecond resolution, but you may not see
-  that much accuracy in your system clock.
-</para>
-
-<para>
-  Advanced Configuration and Power Interface (ACPI) provides a Power
-  Management (PM) Timer, which Linux refers to as the acpi_pm. The clock
-  derived from acpi_pm will at best provide 300 nanosecond resolution.
-</para>
-
-<para>
-  Timers used on older PC hardware include the 8254 Programmable Interval
-  Timer (PIT), the real-time clock (RTC), the Advanced Programmable Interrupt
-  Controller (APIC) timer, and the Cyclone timer. These timers aim for
-  millisecond resolution.
-</para>
-</refsect2>
 </refsect1>

 <refsect1>
  <title>See Also</title>

  <simplelist type="inline">
   <member><xref linkend="sql-explain"/></member>
+  <member><ulink url="https://wiki.postgresql.org/wiki/Pg_test_timing">Wiki
+  discussion about timing</ulink></member>
  </simplelist>
 </refsect1>
 </refentry>
