Page MenuHomePhabricator

Lock wait timeout exceeded in SqlIdGenerator::generateNewId
Closed, ResolvedPublicPRODUCTION ERROR

Description

2018-05-09 17:47:48 [WvM-1NApAMEMAADtSlpwAAADA] mw1232 wikidatawiki 1.32.0-wmf.2 DBQuery ERROR: Wikibase\SqlIdGenerator::generateNewId  10.64.48.26     1205    Lock wait timeout exceeded; try restarting transaction (10.64.48.26)    SELECT  id_value  FROM `wb_id_counters`    WHERE id_type = 'wikibase-item'  LIMIT 1   FOR UPDATE {"db_server":"10.64.48.26","db_name":"wikidatawiki","db_user":"wikiuser","method":"Wikimedia\\Rdbms\\Database::makeQueryException","errno":1205,"error":"Lock wait timeout exceeded; try restarting transaction (10.64.48.26)","sql1line":"SELECT  id_value  FROM `wb_id_counters`    WHERE id_type = 'wikibase-item'  LIMIT 1   FOR UPDATE","fname":"Wikibase\\SqlIdGenerator::generateNewId"}

This is happening occasionally, 0 to 5 times a day.

Patch-For-Review / TODO:

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

That will indeed work:

INSERT INTO wb_id_counters (id_type, id_value) VALUES ('wikibase-item', 1) ON DUPLICATE KEY UPDATE id_value = LAST_INSERT_ID(id_value + 1);
SELECT LAST_INSERT_ID();

This solution is specific to MySQL, right?

I guess that's ok, since Wikibase doesn't support anything else. Do we check on install?

Yeah it's definitely a mysql-specific technique but it's atomic, fast and recommended in the mysql documentation.

I guess that's ok, since Wikibase doesn't support anything else. Do we check on install?

The implementation in the patch currently uses the mediawiki abstraction for upsert, so this should work for all db types mediawiki supports.
However LAST_INSERT_ID is indeed mysql only.

I was planning on making the id generator configurable to begin with anyway, so we can
leave the old implementation there too for people to use (it will be needed for sqlite).

Although it looks like we could make it work for sqlite too? https://sqlite.org/c3ref/last_insert_rowid.html

Change 474725 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Inject IdGenerator into SqlStore

https://gerrit.wikimedia.org/r/474725

Change 474731 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Add config to choose which IdGenerator to use

https://gerrit.wikimedia.org/r/474731

Change 474274 abandoned by Addshore:
WIP SqlIdGenerator, Use upsert for ID Generation

Reason:
See https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/ /474731/

https://gerrit.wikimedia.org/r/474274

Change 474278 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] IdGenerator, Introduce UpsertSqlIdGenerator

https://gerrit.wikimedia.org/r/474278

Change 474725 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Inject IdGenerator into SqlStore

https://gerrit.wikimedia.org/r/474725

Change 474731 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add config to choose which IdGenerator to use

https://gerrit.wikimedia.org/r/474731

Given the lock timeouts happen on SELECT, I don't really see how using a single query instead of SELECT+UPDATE/INSERT would greatly help us here. While that's definitely nice to have performance wise (less queries, less roundtrips), the main problem seems to me that something is getting a new id and then keeping the transaction (=> the lock) open to do other things before commit/rollback.

While I definitely think that using UpsertSqlIdGenerator is an improvement and will greatly help us here once the contention around this table grows even more… I don't think it can fix the problem at hand. We wont be able to solve the problem as long as we don't commit immediately after UPDATE/INSERT, I fear.

The new ID generator should be deployed to Test Wikidata within the next few hours – do we want to test it afterwards, or wait until the train reaches Wikidata and test there?

The new ID generator should be deployed to Test Wikidata within the next few hours – do we want to test it afterwards, or wait until the train reaches Wikidata and test there?

It's behind a config variable and should be turned off for now. That's my understanding of it.

Yes – what I meant is: do we want to write and deploy a config change to enable it on Test Wikidata separately, or go ahead with Wikidata directly (and only test on the debug server before deploying it everywhere)?

Yes – what I meant is: do we want to write and deploy a config change to enable it on Test Wikidata separately, or go ahead with Wikidata directly (and only test on the debug server before deploying it everywhere)?

We want to do beta, then test, then prod :)

And then also leave it enabled on all three, I assume.

Should we attach the necessary config changes to this task or create a subtask?

And then also leave it enabled on all three, I assume.

Should we attach the necessary config changes to this task or create a subtask?

I think we could move this back to TODO and create the config patches here for it :)
It looks like the code is already deployed so we could look at deploying this next week.?

Addshore updated the task description. (Show Details)

Change 480774 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] Wikibase: prepare to set $wgWBRepoSettings['idGenerator']

https://gerrit.wikimedia.org/r/480774

Change 480775 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] BETA: Wikibase, use mysql-upsert id generator

https://gerrit.wikimedia.org/r/480775

Change 480777 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] Wikibase, use mysql-upsert id generator on testwikidatawiki

https://gerrit.wikimedia.org/r/480777

Change 480778 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] Wikibase, use mysql-upsert on all Wikibases

https://gerrit.wikimedia.org/r/480778

Change 480774 merged by jenkins-bot:
[operations/mediawiki-config@master] Wikibase: prepare to set $wgWBRepoSettings['idGenerator']

https://gerrit.wikimedia.org/r/480774

Change 480775 merged by jenkins-bot:
[operations/mediawiki-config@master] BETA: Wikibase, use mysql-upsert id generator

https://gerrit.wikimedia.org/r/480775

Mentioned in SAL (#wikimedia-operations) [2018-12-19T17:04:40Z] <addshore@deploy1001> Synchronized wmf-config: Wikibase: prepare to set $wgWBRepoSettings idGenerator, T194299 (duration: 00m 53s)

Change 480777 merged by jenkins-bot:
[operations/mediawiki-config@master] Wikibase, use mysql-upsert id generator on testwikidatawiki

https://gerrit.wikimedia.org/r/480777

Mentioned in SAL (#wikimedia-operations) [2018-12-19T17:13:31Z] <addshore@deploy1001> Synchronized wmf-config: Wikibase: testwikidatawiki upsert idGenerator, T194299 (duration: 00m 52s)

Change 480778 merged by jenkins-bot:
[operations/mediawiki-config@master] Wikibase, use mysql-upsert on all Wikibases

https://gerrit.wikimedia.org/r/480778

Mentioned in SAL (#wikimedia-operations) [2018-12-19T17:22:39Z] <addshore@deploy1001> Synchronized wmf-config: Wikibase: wikidatawiki upsert idGenerator, T194299 (duration: 00m 52s)

So the new IdGenerator has just been deployed to all production wikibases.
We should be able to check logstash @ https://logstash.wikimedia.org/goto/b2a28cc42a9203bfe1d796929696fe55 and make sure that the old error stops occouring and no new errors occour with the new ID generator.

From yesterday:

Wikibase\UpsertSqlIdGenerator::upsertId	10.64.48.26	1205	Lock wait timeout exceeded; try restarting transaction (10.64.48.26)	INSERT INTO `wb_id_counters` (id_type,id_value) VALUES ('wikibase-item','1') ON DUPLICATE KEY UPDATE id_value = LAST_INSERT_ID(id_value + 1)

So we will need another different approach in order to make this stop happening as the small optimization didn't cut it.

The next thing would be T194299#4713654
Looking at T194299#4715287 it sounds like right now we should be aiming to use CONN_TRX_AUTO still.
@aaron is that correct? We can just add a note next to the use stating that we don't actually care about the TRX mode.

Addshore moved this task from Incoming to Needs Work on the Wikidata-Campsite board.
Addshore updated the task description. (Show Details)
Addshore updated the task description. (Show Details)
Addshore updated the task description. (Show Details)
Addshore updated the task description. (Show Details)
Addshore updated the task description. (Show Details)
Addshore claimed this task.
Addshore moved this task from Older to Resolved on the Wikimedia-production-error board.

This should now be resolved.
I don't see any errors in the log since the deployment.

Change 432156 abandoned by Lucas Werkmeister (WMDE):
Retry SELECT FOR UPDATE in SqlIdGenerator once

Reason:
This was superseded by other efforts, mainly I4fd46adae1, I believe.

https://gerrit.wikimedia.org/r/432156

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM
hashar subscribed.

From yesterday:

Wikibase\UpsertSqlIdGenerator::upsertId	10.64.48.26	1205	Lock wait timeout exceeded; try restarting transaction (10.64.48.26)	INSERT INTO `wb_id_counters` (id_type,id_value) VALUES ('wikibase-item','1') ON DUPLICATE KEY UPDATE id_value = LAST_INSERT_ID(id_value + 1)

Running Wikibase Selenium tests in parallel on CI seems to trigger the same issue. I have filed it as T298682