27 | 27 | </para>
28 | 28 |
29 | 29 | <para>
30 |
| - While forcing data periodically to the disk platters might seem like |
| 30 | + While forcing data to the disk platters periodically might seem like |
31 | 31 | a simple operation, it is not. Because disk drives are dramatically
32 | 32 | slower than main memory and CPUs, several layers of caching exist
33 | 33 | between the computer's main memory and the disk platters.
48 | 48 | some later time. Such caches can be a reliability hazard because the
49 | 49 | memory in the disk controller cache is volatile, and will lose its
50 | 50 | contents in a power failure. Better controller cards have
51 |
| - <firstterm>battery-backed unit</> (<acronym>BBU</>) caches, meaning |
| 51 | + <firstterm>battery-backup units</> (<acronym>BBU</>s), meaning |
52 | 52 | the card has a battery that
53 | 53 | maintains power to the cache in case of system power loss. After power
54 | 54 | is restored the data will be written to the disk drives.
57 | 57 | <para>
58 | 58 | And finally, most disk drives have caches. Some are write-through
59 | 59 | while some are write-back, and the same concerns about data loss
60 |
| - exist for write-back drive caches as exist for disk controller |
| 60 | + exist for write-back drive caches as for disk controller |
61 | 61 | caches. Consumer-grade IDE and SATA drives are particularly likely
62 |
| - to have write-back caches that will not survive a power failure, |
63 |
| - though <acronym>ATAPI-6</> introduced a drive cache flush command |
64 |
| - (<command>FLUSH CACHE EXT</>) that some file systems use, e.g. |
65 |
| - <acronym>ZFS</>, <acronym>ext4</>. (The SCSI command |
66 |
| - <command>SYNCHRONIZE CACHE</> has long been available.) Many |
67 |
| - solid-state drives (SSD) also have volatile write-back caches, and |
68 |
| - many do not honor cache flush commands by default. |
69 |
| - </para> |
70 |
| - |
71 |
| - <para> |
72 |
| - To check write caching on <productname>Linux</> use |
73 |
| - <command>hdparm -I</>; it is enabled if there is a <literal>*</> next |
74 |
| - to <literal>Write cache</>; <command>hdparm -W</> to turn off |
75 |
| - write caching. On <productname>FreeBSD</> use |
76 |
| - <application>atacontrol</>. (For SCSI disks use <ulink |
77 |
| - url="http://sg.danny.cz/sg/sdparm.html"><application>sdparm</></ulink> |
78 |
| - to turn off <literal>WCE</>.) On <productname>Solaris</> the disk |
79 |
| - write cache is controlled by <ulink |
80 |
| - url="http://www.sun.com/bigadmin/content/submitted/format_utility.jsp"><literal>format |
81 |
| - -e</></ulink>. (The Solaris <acronym>ZFS</> file system is safe with |
82 |
| - disk write-cache enabled because it issues its own disk cache flush |
83 |
| - commands.) On <productname>Windows</> if <varname>wal_sync_method</> |
84 |
| - is <literal>open_datasync</> (the default), write caching is disabled |
85 |
| - by unchecking <literal>My Computer\Open\{select disk |
86 |
| - drive}\Properties\Hardware\Properties\Policies\Enable write caching on |
87 |
| - the disk</>. Also on Windows, <literal>fsync</> and |
88 |
| - <literal>fsync_writethrough</> never do write caching. The |
89 |
| - <literal>fsync_writethrough</> option can also be used to disable |
90 |
| - write caching on <productname>MacOS X</>. |
91 |
| - </para> |
92 |
| - |
93 |
| - <para> |
94 |
| - Many file systems that use write barriers (e.g. <acronym>ZFS</>, |
95 |
| - <acronym>ext4</>) internally use <command>FLUSH CACHE EXT</> or |
96 |
| - <command>SYNCHRONIZE CACHE</> commands to flush data to the platters on |
97 |
| - write-back-enabled drives. Unfortunately, such write barrier file |
98 |
| - systems behave suboptimally when combined with battery-backed unit |
| 62 | + to have write-back caches that will not survive a power failure. Many |
| 63 | + solid-state drives (SSD) also have volatile write-back caches. |
| 64 | + </para> |
| 65 | + |
| 66 | + <para> |
| 67 | + These caches can typically be disabled; however, the method for doing |
| 68 | + this varies by operating system and drive type: |
| 69 | + </para> |
| 70 | + |
| 71 | + <itemizedlist> |
| 72 | + <listitem> |
| 73 | + <para> |
| 74 | + On <productname>Linux</>, IDE drives can be queried using |
| 75 | + <command>hdparm -I</command>; write caching is enabled if there is |
| 76 | + a <literal>*</> next to <literal>Write cache</>. <command>hdparm -W 0</>
| 77 | + can be used to turn off write caching. SCSI drives can be queried |
| 78 | + using <ulink url="http://sg.danny.cz/sg/sdparm.html"><application>sdparm</></ulink>. |
| 79 | + Use <command>sdparm --get=WCE</command> to check |
| 80 | + whether the write cache is enabled and <command>sdparm --clear=WCE</> |
| 81 | + to disable it. |
| 82 | + </para> |
| 83 | + </listitem> |
| 84 | + |
| 85 | + <listitem> |
| 86 | + <para> |
| 87 | + On <productname>FreeBSD</>, IDE drives can be queried using |
| 88 | + <command>atacontrol</command>, and SCSI drives using |
| 89 | + <command>sdparm</command>. |
| 90 | + </para> |
| 91 | + </listitem> |
| 92 | + |
| 93 | + <listitem> |
| 94 | + <para> |
| 95 | + On <productname>Solaris</>, the disk write cache is controlled by |
| 96 | + <ulink url="http://www.sun.com/bigadmin/content/submitted/format_utility.jsp"><literal>format -e</></ulink>. |
| 97 | + (The Solaris <acronym>ZFS</> file system is safe with disk write-cache |
| 98 | + enabled because it issues its own disk cache flush commands.) |
| 99 | + </para> |
| 100 | + </listitem> |
| 101 | + |
| 102 | + <listitem> |
| 103 | + <para> |
| 104 | + On <productname>Windows</>, if <varname>wal_sync_method</> is |
| 105 | + <literal>open_datasync</> (the default), write caching can be disabled |
| 106 | + by unchecking <literal>My Computer\Open\<replaceable>disk drive</>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</>. |
| 107 | + Alternatively, set <varname>wal_sync_method</varname> to |
| 108 | + <literal>fsync</> or <literal>fsync_writethrough</>, which prevent |
| 109 | + write caching. |
| 110 | + </para> |
| 111 | + </listitem> |
| 112 | + |
| 113 | + <listitem> |
| 114 | + <para> |
| 115 | + On <productname>Mac OS X</productname>, write caching can be prevented by |
| 116 | + setting <varname>wal_sync_method</> to <literal>fsync_writethrough</>. |
| 117 | + </para> |
| 118 | + </listitem> |
| 119 | + </itemizedlist> |
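As a rough illustration of the Linux check described above, the following Python sketch looks for the <literal>*</> marker on the <literal>Write cache</> line of <command>hdparm -I</> output. The layout of that output is assumed from typical <application>hdparm</> versions, the function names are hypothetical, and running <command>hdparm</> usually requires root privileges.

```python
import subprocess
import sys
from typing import Optional


def write_cache_enabled(hdparm_output: str) -> Optional[bool]:
    """Return True/False from an 'hdparm -I' feature listing, or None if absent.

    Assumption: feature lines look like '   *    Write cache' when the cache is
    enabled and '        Write cache' when it is supported but disabled.
    """
    for line in hdparm_output.splitlines():
        fields = line.split()
        if fields[:3] == ["*", "Write", "cache"]:
            return True
        if fields[:2] == ["Write", "cache"]:
            return False
    return None


def query_device(device: str) -> Optional[bool]:
    """Run 'hdparm -I <device>' (usually needs root) and parse its output."""
    result = subprocess.run(["hdparm", "-I", device],
                            capture_output=True, text=True, check=True)
    return write_cache_enabled(result.stdout)


if __name__ == "__main__":
    state = query_device(sys.argv[1] if len(sys.argv) > 1 else "/dev/sda")
    labels = {True: "write cache ON", False: "write cache off", None: "not reported"}
    print(labels[state])
```

A <literal>None</> result only means the line was not found in the expected form, not that the drive is safe.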
| 120 | + |
| 121 | + <para> |
| 122 | + Recent SATA drives (those following <acronym>ATAPI-6</> or later) |
| 123 | + offer a drive cache flush command (<command>FLUSH CACHE EXT</>), |
| 124 | + while SCSI drives have long supported a similar command |
| 125 | + <command>SYNCHRONIZE CACHE</>. These commands are not directly |
| 126 | + accessible to <productname>PostgreSQL</>, but some file systems |
| 127 | + (e.g., <acronym>ZFS</>, <acronym>ext4</>) can use them to flush |
| 128 | + data to the platters on write-back-enabled drives. Unfortunately, such |
| 129 | + file systems behave suboptimally when combined with battery-backup unit |
99 | 130 | (<acronym>BBU</>) disk controllers. In such setups, the synchronize
100 |
| - command forces all data from the BBU to the disks, eliminating much |
101 |
| - of the benefit of the BBU. You can run the utility |
| 131 | + command forces all data from the controller cache to the disks, |
| 132 | + eliminating much of the benefit of the BBU. You can run the utility |
102 | 133 | <filename>src/tools/fsync</> in the PostgreSQL source tree to see
103 | 134 | if you are affected. If you are affected, the performance benefits
104 |
| - of the BBU cache can be regained by turning off write barriers in |
| 135 | + of the BBU can be regained by turning off write barriers in |
105 | 136 | the file system or reconfiguring the disk controller, if that is
106 | 137 | an option. If write barriers are turned off, make sure the battery
107 |
| - remains active; a faulty battery can potentially lead to data loss. |
| 138 | + remains functional; a faulty battery can potentially lead to data loss. |
108 | 139 | Hopefully file system and disk controller designers will eventually
109 | 140 | address this suboptimal behavior.
110 | 141 | </para>
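For intuition about what a tool such as <filename>src/tools/fsync</> measures, here is a minimal Python sketch. It is not that tool's code. It rewrites one 8 kB block and calls <function>fsync</> after each write, then compares the rate with a common rule of thumb: a rotating drive with no volatile cache and no BBU can commit the same block at most about once per revolution, i.e. rpm/60 times per second (120/s at 7200 rpm). A much higher rate is expected with a working BBU. Without a BBU, a much higher rate suggests that a volatile cache is absorbing the writes. The probe path, the 2x slack factor, and the function names are assumptions. Point the probe at the file system that will hold the WAL, because a temporary directory may live on tmpfs, where <function>fsync</> is essentially free. <function>os.pwrite</> is Unix-only.

```python
import os
import tempfile
import time

PAGE = b"\0" * 8192  # one 8 kB block, PostgreSQL's default page size


def fsyncs_per_second(path: str, iterations: int = 200) -> float:
    """Rewrite the same 8 kB block `iterations` times, calling fsync after each."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.pwrite(fd, PAGE, 0)  # overwrite offset 0 every time (Unix-only)
            os.fsync(fd)            # ask the kernel to push it to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return iterations / elapsed


def rotational_limit(rpm: int) -> float:
    """Rule of thumb: at most one commit of the same block per platter revolution."""
    return rpm / 60.0


def looks_cached(rate: float, rpm: int, slack: float = 2.0) -> bool:
    """True if the measured rate is well above what bare platters could sustain."""
    return rate > slack * rotational_limit(rpm)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        rate = fsyncs_per_second(os.path.join(d, "probe"))
    print(f"{rate:.0f} fsyncs/s; above 7200 rpm limit: {looks_cached(rate, 7200)}")
```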
117 | 148 | ensure data integrity. Avoid disk controllers that have non-battery-backed
118 | 149 | write caches. At the drive level, disable write-back caching if the
119 | 150 | drive cannot guarantee the data will be written before shutdown.
| 151 | + If you use SSDs, be aware that many of these do not honor cache flush |
| 152 | + commands by default. |
120 | 153 | You can test for reliable I/O subsystem behavior using <ulink
121 | 154 | url="http://brad.livejournal.com/2116715.html"><filename>diskchecker.pl</filename></ulink>.
122 | 155 | </para>
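Tools like <filename>diskchecker.pl</filename> follow a simple pattern. They record which writes the operating system acknowledged as durable, cut the power, and then verify that every acknowledged write survived. The Python sketch below models only the record-and-verify logic. The 64-byte record format and the function names are hypothetical. The real tool keeps its acknowledgment log on a second machine so that the log itself survives the power cut.

```python
import os
import struct
from typing import Callable, List

# Hypothetical record layout: 8-byte little-endian sequence number + 56 pad bytes.
RECORD = struct.Struct("<Q56x")


def write_records(path: str, count: int, acknowledge: Callable[[int], None]) -> None:
    """Append numbered records, fsync each one, and only then acknowledge it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for seq in range(1, count + 1):
            os.write(fd, RECORD.pack(seq))
            os.fsync(fd)
            acknowledge(seq)  # diskchecker.pl sends the equivalent report to another host
    finally:
        os.close(fd)


def missing_after_crash(path: str, last_acked: int) -> List[int]:
    """Return acknowledged sequence numbers that did not survive (empty list = pass)."""
    with open(path, "rb") as f:
        data = f.read()
    found = {RECORD.unpack_from(data, off)[0]
             for off in range(0, len(data) - RECORD.size + 1, RECORD.size)}
    return [seq for seq in range(1, last_acked + 1) if seq not in found]
```

After the power cycle, a non-empty result from <function>missing_after_crash</> means some layer reported a write as durable when it was not.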
126 | 159 | operations themselves. Disk platters are divided into sectors,
127 | 160 | commonly 512 bytes each. Every physical read or write operation
128 | 161 | processes a whole sector.
129 |
| - When a write request arrives at the drive, it might be for 512 bytes, |
130 |
| - 1024 bytes, or 8192 bytes, and the process of writing could fail due |
| 162 | + When a write request arrives at the drive, it might be for some multiple |
| 163 | + of 512 bytes (<productname>PostgreSQL</> typically writes 8192 bytes, or |
| 164 | + 16 sectors, at a time), and the process of writing could fail due |
131 | 165 | to power loss at any time, meaning some of the 512-byte sectors were
132 |
| - written, and others were not. To guard against such failures, |
| 166 | + written while others were not. To guard against such failures, |
133 | 167 | <productname>PostgreSQL</> periodically writes full page images to
134 | 168 | permanent WAL storage <emphasis>before</> modifying the actual page on
135 | 169 | disk. By doing this, during crash recovery <productname>PostgreSQL</> can
136 |
| - restore partially-written pages. If you have a battery-backed disk |
| 170 | + restore partially-written pages from WAL. If you have a battery-backed disk |
137 | 171 | controller or file-system software that prevents partial page writes
138 |
| - (e.g., ZFS), you can turn off this page imaging by turning off the |
| 172 | + (e.g., ZFS), you can safely turn off this page imaging by turning off the |
139 | 173 | <xref linkend="guc-full-page-writes"> parameter.
140 | 174 | </para>
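To make the torn-page scenario concrete, this toy Python model treats an 8192-byte page as 16 sectors of 512 bytes. It simulates a power loss partway through writing the page. It then shows recovery restoring the page from a full-page image and reapplying a later change. The record shapes (<literal>("fpi", image)</> and <literal>("delta", offset, data)</>) are hypothetical simplifications. Real WAL records generally describe changes relative to a consistent page, which is why they cannot repair a torn page on their own.

```python
from typing import List, Tuple, Union

SECTOR = 512
PAGE = 8192
SECTORS_PER_PAGE = PAGE // SECTOR  # 16 sectors per 8 kB page

# Hypothetical toy WAL records: ("fpi", image) or ("delta", offset, data).
Record = Union[Tuple[str, bytes], Tuple[str, int, bytes]]


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Model power loss after only the first `sectors_written` sectors of `new` landed."""
    cut = sectors_written * SECTOR
    return new[:cut] + old[cut:]


def replay(page: bytes, wal: List[Record]) -> bytes:
    """Toy redo: a full-page image replaces the page; a delta patches bytes in place."""
    buf = bytearray(page)
    for rec in wal:
        if rec[0] == "fpi":
            buf[:] = rec[1]
        else:
            _, off, data = rec
            buf[off:off + len(data)] = data
    return bytes(buf)


if __name__ == "__main__":
    old = b"A" * PAGE
    # First change after a checkpoint: its WAL record carries a full-page image.
    after_first = replay(old, [("delta", 100, b"X"), ("delta", 5000, b"Y")])
    wal = [("fpi", after_first), ("delta", 6000, b"Z")]  # later change: delta only
    final = replay(after_first, wal[1:])
    on_disk = torn_write(old, final, 4)  # only 4 of 16 sectors reached the platter
    print(on_disk == old, on_disk == final)  # False False: the page is torn
    print(replay(on_disk, wal) == final)     # True: rebuilt from the image
```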
141 | 175 | </sect1>