Problems When Creating Reports From Slave
For the last month I have been working on fixing some issues in a database system I maintain. We
have a PostgreSQL master instance that is replicated to a slave instance using streaming replication.
As there are a lot of connections on the master server, using a lot of resources, we had the idea to
move all the reporting work to the slave server. And this is usually the way to do it. Why spend
critical resources on the master when you only have to read data and generate reports that are
displayed as web pages? So, after a few tweaks to our application we had the first reports coming
from the slave. And everybody was happy. Well, until someone ran a slightly bigger report. Thirty
seconds later you get an error that looks like this:
ERROR: canceling statement due to conflict with recovery
SQL state: 40001
Detail: User query might have needed to see row versions that must be removed.
Or one of the following flavours:
FATAL: terminating connection due to conflict with recovery
SQL state: 40001
Detail: User was holding a relation lock for too long.
Hint: In a moment you should be able to reconnect to the database and repeat your command.

FATAL: terminating connection due to conflict with recovery
SQL state: 40001
Detail: User query might have needed to see row versions that must be removed.
Hint: In a moment you should be able to reconnect to the database and repeat your command.
So what is going on here? Well, our slave instance has some limitations regarding the queries you
can run on it. On the master, new data is coming in all the time and some of the existing rows are
changed by it. When those changes are replayed on the slave, the queries running there might be
canceled or, even worse, their connections can be terminated. This is a well known problem, so there
are some parameters you can change on the slave to fix it. Or better yet, you can discover that using
PostgreSQL you develop superpowers and are now able to stop time [1].
[1] Well, you can't stop time, but you can do almost as well. Read on :)
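If you want to see what your slave is currently using, the settings in question can be checked from
SQL. The three names below are the ones discussed later in this post:

-- slave: current values of the conflict-related settings
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('max_standby_archive_delay',
               'max_standby_streaming_delay',
               'hot_standby_feedback');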
Simulation preparations
We have a table a with 10000 rows of data and we want to change some of it. To do that we can run
the following script:
-- master - connection 1 = mc1
WITH random_rows AS (
    SELECT floor(random() * 10000) + 1 AS id
    FROM generate_series(1, 20)
)
UPDATE a
SET c = now()
WHERE id IN (SELECT id FROM random_rows);
SELECT txid_current();
You should run the script as a whole, in a single transaction, so that the last statement reports the
transaction ID that will be stamped on the rows the UPDATE has changed.
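If you want to follow along, the table can be set up with something like the following. Only the
columns id and c are known from the update script; the rest of the schema is just a minimal
stand-in:

-- master: a minimal stand-in for table a (only id and c are known
-- from the update script, the rest of the schema is assumed)
CREATE TABLE a (
    id integer PRIMARY KEY,
    c  timestamptz DEFAULT now()
);

INSERT INTO a (id, c)
SELECT i, now()
FROM generate_series(1, 10000) AS i;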
On the other server you can run a very intensive reporting query. The exact report does not really
matter here; any query that keeps a snapshot open on the slave for a few hundred seconds will do, so
a stand-in like the following is enough:
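-- slave - connection 1 = sc1
-- a stand-in for the "very intensive" report: it reads table a and then
-- holds its snapshot open for roughly 300 seconds
SELECT count(*), pg_sleep(300)
FROM a;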
Our reporting query lasts only 300 seconds, so it slows down time only for that amount of time. If I
can run a similar query before the first one finishes, it should keep slowing time down for its own
duration. Let's use the same query, this time on a second connection:
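-- slave - connection 2 = sc2
-- the same stand-in query, started before the one on sc1 finishes
SELECT count(*), pg_sleep(300)
FROM a;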
The described scenarios are only theoretical, but with the values
of max_standby_archive_delay and max_standby_streaming_delay set to -1, and a steady stream
of reporting queries, it might just be possible. Also, your reports might be wrong because they are
operating on ancient data instead of almost real-time data. And the DoS scenario is really far-fetched
(but you should really try it at home, not on production).
Setting hot_standby_feedback = on seems to solve the reporting errors, but not fully. We still
got some errors after turning it on, but I haven't found a set of example tables/queries to
demonstrate that. If I do, I will write a sequel to this post. Until then, use some sane values
for max_standby_archive_delay and max_standby_streaming_delay, say a few minutes, so
that your longest report can finish. Some reports will still fail, but be prepared for that and you will be ok.
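For illustration, the same recommendation expressed as SQL on the slave; the five-minute value is
just an example, pick whatever fits the duration of your longest report:

-- slave: example values, adjust to the duration of your longest report
ALTER SYSTEM SET max_standby_archive_delay = '5min';
ALTER SYSTEM SET max_standby_streaming_delay = '5min';
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();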