|
1 | 1 | <sect1 id="multimaster">
|
2 | 2 | <title>multimaster</title>
|
3 | 3 | <para><emphasis role="strong">Table of Contents</emphasis></para>
|
| 4 | + <para><link linkend="multimaster-usage">Limitations</link></para> |
4 | 5 | <para><link linkend="multimaster-architecture">Architecture</link></para>
|
5 | 6 | <para><link linkend="multimaster-installation">Installation and Setup</link></para>
|
6 |
| - <para><link linkend="multimaster-usage">Using multimaster for Data Replication</link></para> |
7 | 7 | <para><link linkend="multimaster-administration">Multi-Master Cluster Administration</link></para>
|
8 | 8 | <para><link linkend="multimaster-reference">Reference</link></para>
|
9 | 9 | <para><link linkend="multimaster-compatibility">Compatibility</link></para>
|
|
57 | 57 | <filename>multimaster</filename> uses a three-phase commit protocol
|
58 | 58 | and heartbeats for failure discovery. A multi-master cluster of <replaceable>N</replaceable>
|
59 | 59 | nodes can continue working while the majority of the nodes are
|
60 |
| - alive and reachable by other nodes. In most cases, three |
61 |
| - cluster nodes are enough to ensure high availability. When the node |
| 60 | + alive and reachable by other nodes. To be configured with <filename>multimaster</filename>, the cluster must include at least two nodes. In most cases, three |
| 61 | + cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes.</para> |
| 62 | + <para>When a failed node |
62 | 63 | is reconnected to the cluster, <filename>multimaster</filename> can automatically
|
63 | 64 | fast-forward the node to the current state based on the
|
64 |
| - Write-Ahead Log (<acronym>WAL</acronym>) data in the corresponding replication slot. If <acronym>WAL</acronym> data is no longer available for the time when the node was excluded from the cluster, you can restore the node using <application>pg_basebackup</application>. |
| 65 | + Write-Ahead Log (<acronym>WAL</acronym>) data in the corresponding replication slot. If <acronym>WAL</acronym> data is no longer available for the time when the node was excluded from the cluster, you can <link linkend="multimaster-restoring-a-node-manually">restore the node using <application>pg_basebackup</application></link>. |
65 | 66 | </para>
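<para>Once the cluster is up, you can verify from any node that all nodes are visible by querying the cluster state; a minimal example, using the <function>mtm.get_cluster_state()</function> function described later in this chapter (the output columns are not listed here):</para>
<programlisting>SELECT * FROM mtm.get_cluster_state();</programlisting>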
|
66 | 67 | <important><para>When using <filename>multimaster</filename>, make sure to take its replication restrictions into account. For details, see <xref linkend="multimaster-usage">.</para></important>
|
67 | 68 | <para>
|
68 | 69 | To learn more about the <filename>multimaster</filename> internals, see
|
69 | 70 | <xref linkend="multimaster-architecture">.
|
70 | 71 | </para>
|
| 72 | + |
| 73 | + <sect2 id="multimaster-usage"> |
| 74 | + <title>Limitations</title> |
| 75 | + <para>The <filename>multimaster</filename> extension takes care of database replication in a fully automated way. You can perform write transactions on any node and work with temporary tables on each cluster node simultaneously. However, make sure to take the following replication restrictions into account:</para>
| 76 | + <itemizedlist> |
| 77 | + <listitem> |
| 78 | + <para> |
| 79 | + <filename>multimaster</filename> can only replicate one database |
| 80 | + per cluster, which is specified in the <varname>multimaster.conn_strings</varname> variable. If you try to connect to a different database, <filename>multimaster</filename> will return a corresponding error message. |
| 81 | + </para> |
| 82 | + </listitem> |
| 83 | + <listitem> |
| 84 | + <para> |
| 85 | + The replicated tables must have primary keys or replica identity. Otherwise,
| 86 | + <filename>multimaster</filename> does not allow replication
| 87 | + because of logical replication restrictions. Unlogged tables are not replicated, as in standard <productname>PostgreSQL</productname>. For an example of adding a primary key or replica identity, see the sketch after this list.
| 88 | + </para> |
| 89 | + <note><para>You can enable replication |
| 90 | +of tables without primary keys by setting the <varname>multimaster.ignore_tables_without_pk</varname> variable to <literal>false</literal>. However, take into account that |
| 91 | +<filename>multimaster</filename> does not allow update operations on such tables.</para></note> |
| 92 | + </listitem> |
| 93 | + <listitem> |
| 94 | + <para> |
| 95 | + Isolation level. The <filename>multimaster</filename> extension |
| 96 | + supports the <emphasis><literal>read committed</literal></emphasis> and <emphasis><literal>repeatable read</literal></emphasis> isolation levels. The <emphasis><literal>serializable</literal></emphasis> isolation level is currently not supported.</para>
| 97 | + <important> |
| 98 | + <para>Using the <literal>repeatable read</literal> isolation level increases
| 99 | + the probability of serialization failure at commit time. Unlike in standard <productname>PostgreSQL</productname>, the <literal>read committed</literal> level can also cause serialization failures on a multi-master cluster.</para>
| 100 | + <para>When performing a write transaction, <filename>multimaster</filename> locks the affected objects only on the node on which the transaction is performed. However, since write transactions are allowed on all nodes, other transactions can try to change the same objects on the neighbor nodes at the same time. The replication of the first transaction can then fail because the affected objects on the neighbor nodes are already locked by another transaction, while the latter transaction cannot be replicated to the first node either: a distributed deadlock occurs. In this case, one of the transactions is automatically rolled back and needs to be repeated. The application must be ready to retry such transactions.
| 101 | + </para> |
| 102 | + <para>If your typical workload has too many rollbacks, it is recommended to use the <literal>read committed</literal> isolation level. However, <literal>read committed</literal> still does not guarantee the absence of deadlocks on a multi-master cluster. If using the <literal>read committed</literal> level does not help, you can try directing all write transactions to a single node.</para>
| 103 | + </important> |
| 104 | + </listitem> |
| 105 | + <listitem> |
| 106 | + <para> |
| 107 | + Sequence generation. To avoid conflicts between unique identifiers on different nodes, <filename>multimaster</filename> modifies the default behavior of sequence generators. On each node, ID generation starts with the node number and is incremented by the number of nodes. For example, in a three-node cluster, IDs 1, 4, and 7 are allocated to objects written on the first node, while IDs 2, 5, and 8 are reserved for the second node. If you change the number of nodes in the cluster, the increment for new IDs is adjusted accordingly. Thus, the generated sequence values are not monotonic. A plain SQL illustration follows this list.
| 108 | + </para> |
| 109 | + </listitem> |
| 110 | + <listitem> |
| 111 | + <para> |
| 112 | + <acronym>DDL</acronym> replication. While <filename>multimaster</filename> |
| 113 | + replicates data on the logical level, <acronym>DDL</acronym> is replicated on the |
| 114 | + statement level, which results in distributed commits of the same |
| 115 | + statement on different nodes. As a result, complex <acronym>DDL</acronym> |
| 116 | + scenarios, such as stored procedures and temporary tables, may |
| 117 | + work differently than in standard <productname>PostgreSQL</productname>.
| 118 | + </para> |
| 119 | + </listitem> |
| 120 | + <listitem> |
| 121 | + <para> |
| 122 | + Commit latency. In the current implementation of logical |
| 123 | + replication, <filename>multimaster</filename> sends data to subscriber nodes only after the |
| 124 | + local commit, so you have to wait for transaction processing twice: first on the local node, |
| 125 | + and then on all the other nodes simultaneously. For write-heavy transactions, this may result in a noticeable delay.
| 126 | + </para> |
| 127 | + </listitem> |
| 128 | + </itemizedlist> |
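<para>A minimal sketch of preparing an existing table for replication; the table, column, and index names are examples only:</para>
<programlisting>-- Add a primary key so that row updates can be replicated:
ALTER TABLE orders ADD PRIMARY KEY (order_id);

-- Alternatively, declare a replica identity based on an existing
-- unique index over NOT NULL columns:
ALTER TABLE orders REPLICA IDENTITY USING INDEX orders_order_id_idx;</programlisting>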
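<para>The effect on sequences can be pictured with a plain SQL sketch. <filename>multimaster</filename> performs this adjustment automatically; the statement below is only an illustration, and the sequence name is hypothetical:</para>
<programlisting>-- On node 2 of a three-node cluster, newly generated IDs behave as if
-- the sequence produced the series 2, 5, 8, and so on:
ALTER SEQUENCE orders_order_id_seq RESTART WITH 2 INCREMENT BY 3;</programlisting>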
| 129 | +<para>If you have any data that must be present on one of the nodes only, you can exclude a particular table from replication, as follows:
| 130 | + <programlisting>SELECT * FROM <function>mtm.make_table_local</function>('<replaceable>table_name</replaceable>'::regclass::oid);</programlisting>
| 131 | + where <replaceable>table_name</replaceable> is the name of the table; the <literal>::regclass::oid</literal> cast converts the name into the unique table identifier (OID) that the function expects.</para>
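<para>For example, assuming a node-local statistics table named <literal>local_stats</literal> (a hypothetical name), you could run:</para>
<programlisting>SELECT * FROM mtm.make_table_local('local_stats'::regclass::oid);</programlisting>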
| 132 | + </sect2> |
| 133 | + |
71 | 134 | <sect2 id="multimaster-architecture">
|
72 | 135 | <title>Architecture</title>
|
73 | 136 | <sect3 id="multimaster-replication">
|
|
80 | 143 | <filename>multimaster</filename> uses <link linkend="logicaldecoding-synchronous">logical replication</link> and the <link linkend="multimaster-credits">three-phase E3PC commit protocol</link>.
|
81 | 144 | </para>
|
82 | 145 | <para>
|
83 |
| - When PostgeSQL loads the <filename>multimaster</filename> shared |
| 146 | + When <productname>&productname;</productname> loads the <filename>multimaster</filename> shared |
84 | 147 | library, <filename>multimaster</filename> sets up a logical
|
85 | 148 | replication producer and consumer for each node, and hooks into
|
86 | 149 | the transaction commit pipeline. The typical data replication
|
|
125 | 188 | </listitem>
|
126 | 189 | </orderedlist>
|
127 | 190 | <important>
|
128 |
| - <para>When performing a write transaction, <filename>multimaster</filename> blocks the affected objects only on the node on which the transaction is performed. However, since write transactions are allowed on all nodes, other transactions can try to change the same objects on the neighbor nodes at the same time. In this case, the replication of the first transaction can fail because the affected objects on the neighbor nodes are already blocked by another transaction. Similarly, the latter transaction cannot be replicated to the first node. In this case, a distributed deadlock occurs, and one of the transactions needs to be rolled back and repeated. |
129 |
| - </para> |
130 |
| - <para> |
131 |
| -If your typical workload has too many rollbacks, it is recommended to use <literal>read committed</literal> isolation level. If it does not help, you can try directing all the write transactions to a single node.</para> |
| 191 | + <para><filename>multimaster</filename> currently supports the <literal>read committed</literal> and <literal>repeatable read</literal> isolation levels only, which can cause unexpected serialization failures in your workload. For details, see <xref linkend="multimaster-usage">.</para> |
132 | 192 | </important>
|
133 | 193 | <para>
|
134 | 194 | If a node crashes or gets disconnected from the cluster between
|
135 | 195 | the <literal>PREPARE</literal> and <literal>COMMIT</literal>
|
136 | 196 | phases, the <literal>PRECOMMIT</literal> phase ensures that the
|
137 | 197 | surviving nodes have enough information to complete the prepared
|
138 |
| - transaction. The <literal>PRECOMMITTED</literal> messages help you |
139 |
| - avoid the situation when the crashed node have already committed |
140 |
| - or aborted the transaction, but have not notified other nodes |
| 198 | + transaction. The <literal>PRECOMMITTED</literal> messages help |
| 199 | + avoid the situation when the crashed node has already committed |
| 200 | + or aborted the transaction, but has not notified other nodes |
141 | 201 | about the transaction status. In a two-phase commit (2PC), such a
|
142 | 202 | transaction would block resources (hold locks) until the recovery
|
143 | 203 | of the crashed node. Otherwise, you could get data inconsistencies
|
@@ -242,20 +302,21 @@ If your typical workload has too many rollbacks, it is recommended to use <liter
|
242 | 302 | <application>pg_basebackup</application>.
|
243 | 303 | </para></note>
|
244 | 304 | <para><emphasis role="strong">See Also</emphasis></para>
|
245 |
| - <para><link linkend="multimaster-restoring-a-node-manually">Restoring a Node Manually</link></para> |
| 305 | + <para><link linkend="multimaster-restoring-a-node-manually">Restoring a Cluster Node</link></para> |
246 | 306 | </sect3>
|
247 | 307 | </sect2>
|
248 | 308 | <sect2 id="multimaster-installation">
|
249 | 309 | <title>Installation and Setup</title>
|
250 | 310 | <para>
|
251 | 311 | To use <filename>multimaster</filename>, you need to install
|
252 | 312 | <productname>&productname;</productname> on all nodes of your cluster. <productname>&productname;</productname> includes all the required dependencies and
|
253 |
| - extensions. |
| 313 | + extensions. |
254 | 314 | </para>
|
255 | 315 | <sect3 id="multimaster-setting-up-a-multi-master-cluster">
|
256 | 316 | <title>Setting up a Multi-Master Cluster</title>
|
257 | 317 | <para>After installing <productname>&productname;</productname> on all nodes, you need to
|
258 |
| - configure the cluster with <filename>multimaster</filename>. Suppose |
| 318 | + configure the cluster with <filename>multimaster</filename>.</para> |
| 319 | + <para>Suppose |
259 | 320 | you are setting up a cluster of three nodes, with
|
260 | 321 | <literal>node1</literal>, <literal>node2</literal>, and
|
261 | 322 | <literal>node3</literal> domain names. First, set up the database to be replicated, and make sure you have a user with superuser rights to perform replication:
|
@@ -365,7 +426,7 @@ multimaster.conn_strings = 'dbname=mydb user=myuser host=node1, dbname=mydb user
|
365 | 426 | # of connection strings
|
366 | 427 | # to neighbor nodes
|
367 | 428 | </programlisting>
|
368 |
| -<para>The <literal>max_nodes</literal> variable defines the cluster size. In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you typically do not need more than five cluster nodes.</para> |
| 429 | +<para>The <literal>max_nodes</literal> variable defines the cluster size. In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes.</para> |
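<para>If you expect to add nodes to the cluster later, you can set this variable above the initial number of nodes; a sketch for a three-node cluster that may grow to five (the value is an example only):</para>
<programlisting>multimaster.max_nodes = 5       # maximum cluster size,
                                # leaving room for two more nodes
</programlisting>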
369 | 430 | <important><para>The
|
370 | 431 | <literal>node_id</literal> variable takes natural
|
371 | 432 | numbers starting from 1, without any gaps in numbering.
|
@@ -507,64 +568,6 @@ SELECT * FROM mtm.get_cluster_state();
|
507 | 568 | </sect4>
|
508 | 569 | </sect3>
|
509 | 570 | </sect2>
|
510 |
| - <sect2 id="multimaster-usage"> |
511 |
| - <title>Using multimaster for Data Replication</title> |
512 |
| - <para>The <filename>multimaster</filename> extension takes care of the database replication in a fully automated way. You can perform write transactions on any node, and work with temporary tables on each cluster node simultaneosly. However, make sure to take the following replication restrictions into account:</para> |
513 |
| - <itemizedlist> |
514 |
| - <listitem> |
515 |
| - <para> |
516 |
| - <filename>multimaster</filename> can only replicate one database |
517 |
| - per cluster, which is specified in the <varname>multimaster.conn_strings</varname> variable. If you try to connect to a different database, <filename>multimaster</filename> will return a corresponding error message. |
518 |
| - </para> |
519 |
| - </listitem> |
520 |
| - <listitem> |
521 |
| - <para> |
522 |
| - The replicated tables must have primary keys or replica identity. Otherwise, |
523 |
| - <filename>multimaster</filename> will not allow replication |
524 |
| - because of the logical replication restrictions. Unlogged tables are not replicated, as in the standard <productname>PostgreSQL</productname>. |
525 |
| - </para> |
526 |
| - <note><para>You can enable replication |
527 |
| -of tables without primary keys by setting the <varname>multimaster.ignore_tables_without_pk</varname> variable to <literal>false</literal>. However, take into account that |
528 |
| -<filename>multimaster</filename> does not allow update operations on such tables.</para></note> |
529 |
| - </listitem> |
530 |
| - <listitem> |
531 |
| - <para> |
532 |
| - Isolation level. The <filename>multimaster</filename> extension |
533 |
| - supports <emphasis><literal>read committed</literal></emphasis> and <emphasis><literal>repeatable read</literal></emphasis> isolation levels. <emphasis><literal>Serializable</literal></emphasis> isolation level is currently not supported.</para> |
534 |
| - <important> |
535 |
| - <para>When performing a write transaction, <filename>multimaster</filename> blocks the affected objects only on the node on which the transaction is performed. However, since write transactions are allowed on all nodes, other transactions can try to change the same objects on the neighbor nodes at the same time. In this case, the replication of the first transaction can fail because the affected objects on the neighbor nodes are already blocked by another transaction. Similarly, the latter transaction cannot be replicated to the first node. In this case, a distributed deadlock occurs, and one of the transactions needs to be rolled back and repeated. |
536 |
| - </para> |
537 |
| - <para>If your typical workload has too many rollbacks, it is recommended to use <literal>read committed</literal> isolation level. If it does not help, you can try directing all the write transactions to a single node.</para> |
538 |
| - </important> |
539 |
| - </listitem> |
540 |
| - <listitem> |
541 |
| - <para> |
542 |
| - Sequence generation. To avoid conflicts between unique identifiers on different nodes, <filename>multimaster</filename> modifies the default behavior of sequence generators. For each node, ID generation is started with the node number and is incremented by the number of nodes. For example, in a three-node cluster, 1, 4, and 7 IDs are allocated to the objects written onto the first node, while 2, 5, and 8 IDs are reserved for the second node. If you change the number of nodes in the cluster, the incrementation interval for new IDs is adjusted accordingly. |
543 |
| - </para> |
544 |
| - </listitem> |
545 |
| - <listitem> |
546 |
| - <para> |
547 |
| - <acronym>DDL</acronym> replication. While <filename>multimaster</filename> |
548 |
| - replicates data on the logical level, <acronym>DDL</acronym> is replicated on the |
549 |
| - statement level, which results in distributed commits of the same |
550 |
| - statement on different nodes. As a result, complex <acronym>DDL</acronym> |
551 |
| - scenarios, such as stored procedures and temporary tables, may |
552 |
| - work differently as compared to the standard <productname>PostgreSQL</productname>. |
553 |
| - </para> |
554 |
| - </listitem> |
555 |
| - <listitem> |
556 |
| - <para> |
557 |
| - Commit latency. In the current implementation of logical |
558 |
| - replication, <filename>multimaster</filename> sends data to subscriber nodes only after the |
559 |
| - local commit, so you have to wait for transaction processing twice: first on the local node, |
560 |
| - and then on all the other nodes simultaneously. In the case of a heavy-write transaction, this may result in a noticeable delay. |
561 |
| - </para> |
562 |
| - </listitem> |
563 |
| - </itemizedlist> |
564 |
| - <para>If you have any data that must be present on one of the nodes only, you can exclude a particular table from replication, as follows: |
565 |
| - <programlisting>SELECT * FROM <function>mtm.make_table_local</function>(::regclass::oid) </programlisting> |
566 |
| - where <literal>regclass</literal> is the name of the table and <literal>oid</literal> is the unique table identifier.</para> |
567 |
| - </sect2> |
568 | 571 | <sect2 id="multimaster-administration"><title>Multi-Master Cluster Administration</title>
|
569 | 572 | <itemizedlist>
|
570 | 573 | <listitem>
|
@@ -801,7 +804,7 @@ pg_ctl -D <replaceable>datadir</replaceable> -l <replaceable>pg.log</replaceable
|
801 | 804 | <varname>multimaster.arbiter_port</varname> variable.
|
802 | 805 | </para></listitem></varlistentry>
|
803 | 806 | <varlistentry><term><varname>multimaster.max_nodes</varname><indexterm><primary><varname>multimaster.max_nodes</varname></primary></indexterm></term><listitem><para>
|
804 |
| - The maximum number of nodes allowed in the cluster. In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you typically do not need more than five cluster nodes. The maximum possible number of nodes is limited to 64.</para><para>Default: the number of nodes specified in the <varname>multimaster.conn_strings</varname> variable |
| 807 | + The maximum number of nodes allowed in the cluster. In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes. The maximum possible number of nodes is limited to 64.</para><para>Default: the number of nodes specified in the <varname>multimaster.conn_strings</varname> variable |
805 | 808 | </para></listitem></varlistentry>
|
806 | 809 | <varlistentry><term><varname>multimaster.arbiter_port</varname><indexterm><primary><varname>multimaster.arbiter_port</varname></primary></indexterm></term><listitem><para>
|
807 | 810 | Port for the arbiter
|
|