Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions CHANGELOG
Original file line number Diff line number Diff line change
@@ -1,6 +1,11 @@
devel
-----

* Fix BTS-2211: AQL queries using the async-prefetch rule, which result in a
failure due to a runtime exception or possibly a kill, had a possible race
during query destruction that could result in undefined behavior (e.g.
memory leaks or crashes).

* Add a separate thread to the UserManager to handle the load of the user-cache asynchronously.
This pre-loads the user-cache as fast as possible (as soon as the global version changes)
Also, compared to the old version, this allows the read-only operation to work on the user-cache
Expand Down
8 changes: 4 additions & 4 deletions arangod/Aql/ExecutionBlockImpl.h
Original file line number Diff line number Diff line change
Expand Up @@ -377,20 +377,20 @@ class ExecutionBlockImpl final : public ExecutionBlock {
* consumed.
*
* An additional note on stopping the query early: In this case
* we now have at least two concurrent operations, the cleannup
* we now have at least two concurrent operations, the cleanup
* which will eventually destruct all objects in the query, and
* an async prefetch task.
* This stop process now needs to ensure that the prefetchTask is
* either completed or will not be started.
* Therefore if the task is not consumed the prefetch task is tried to
* Therefore, if the task is not consumed, the prefetch task is tried to
* get claimed with above mechanism. If we can not claim it, we need
* to wait for it to finish, as someone is actively working on it.
* If we claimed it, we need to discard it, as it is not needed anymore.
*
* Also please note: When we are cleaning up the query in the stopAsyncTasks
* Also, please note: When we are cleaning up the query in stopAsyncTasks(),
* the execution order guarantees that for any ExecutionBlock we can only have
* one thread that is either in the "execute" or in the "stopAsyncTasks"
* codepath The Executionblock may have the prefetch task running, but that
* codepath. The ExecutionBlock may have the prefetch task running, but that
* only is within the executeFetcher, so already on the way to the next
* Executor. And as we are stopping executors from "RETURN" to "Singleton" we
* guarantee that we can only have prefetch tasks active for Executors in
Expand Down
27 changes: 22 additions & 5 deletions arangod/Aql/ExecutionBlockImpl.tpp
Original file line number Diff line number Diff line change
Expand Up @@ -381,12 +381,19 @@ void ExecutionBlockImpl<Executor>::stopAsyncTasks() {
// some thread is still working on our prefetch task
// -> we need to wait for that task to finish first!
_prefetchTask->waitFor();
} else {
// We took responsibility for the prefetch task, but we are not going to
// run anything, this in the shutdown process.
// We will simply discard the task
_prefetchTask->discard(/*isFinished*/ false);
}
// We either claimed the task, or it is finished. Now we have to discard
// it, for two reasons:
// 1) The state must not stay InProgress (if we claimed the task), so a
// second call to `stopAsyncTasks()`, which is currently done, will not
// wait forever.
// 2) We must destroy the result (if the task finished), so a possible
// SharedAqlItemBlockPtr will return the AqlItemBlock to the
// AqlItemBlockManager. Note that the callback executed by the
// scheduler queue will release its shared_ptr to the prefetch task
// after it has finished; so it is possible that the task outlives the
// query, and thus the AqlItemBlockManager.
_prefetchTask->discard(/*isFinished*/ false);
}
}
}
Expand Down Expand Up @@ -1143,6 +1150,16 @@ auto ExecutionBlockImpl<Executor>::executeFetcher(ExecutionContext& ctx,
block->getPlanNode()->id().id(), ": ",
block->getPlanNode()->getTypeString(), "]")});
}

TRI_IF_FAILURE("AsyncPrefetch::delayDestructor") {
// Delay the destruction of `task` to make it likely that the
// Query, and with it the AqlItemBlockManager, is destructed
// before the PrefetchTask. That way we can assert that the
// result, and possible SharedAqlItemBlockPtrs in it, will be
// destroyed before the AqlItemBlockManager.
using namespace std::chrono_literals;
std::this_thread::sleep_for(1ms);
}
});

if (!queued) {
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,4 @@
/*jshint globalstrict:false, strict:false, maxlen: 500 */
/*global assertEqual, assertNotEqual */
/* jshint strict:true */

// //////////////////////////////////////////////////////////////////////////////
// / DISCLAIMER
Expand All @@ -25,13 +24,11 @@
// //////////////////////////////////////////////////////////////////////////////

const jsunity = require("jsunity");
const internal = require('internal');
const db = internal.db;
const {assertEqual, assertNotEqual} = jsunity.jsUnity.assertions;
const helper = require("@arangodb/aql-helper");
const errors = internal.errors;
const assertQueryError = helper.assertQueryError;
let IM = global.instanceManager;
const {aql} = require('@arangodb');
const {aql, db, errors} = require('@arangodb');

function AsyncPrefetchFailure () {
const ruleName = "async-prefetch";
Expand All @@ -53,9 +50,15 @@ function AsyncPrefetchFailure () {
db._drop(cn);
},

// This failure was observed when the exectution block were not destroyed in topological order
// and the async task had outlived its execution block
testAsyncPrefetchFailure : function () {
tearDown : function () {
IM.debugClearFailAt();
},

// This failure was observed when the execution blocks were not destroyed in
// topological order, and the async task had outlived its execution block.
// Regression test for https://arangodb.atlassian.net/browse/BTS-2119
// (https://github.com/arangodb/arangodb/pull/21709).
testAsyncPrefetchRegressionBts2119 : function () {
IM.debugSetFailAt("AsyncPrefetch::blocksDestroyedOutOfOrder");
// This query is forced to have async prefetch optimization
const query = aql`FOR d IN ${col} FILTER d.val > 0 LET x = FAIL('failed') RETURN d`;
Expand All @@ -64,23 +67,69 @@ function AsyncPrefetchFailure () {

const result = db._createStatement(query).explain();

const actualNodes = result.plan.nodes.map((n) => [n.type, n.isAsyncPrefetchEnabled === true ? true : false]);
const actualNodes = result.plan.nodes.map((n) => [n.type, n.isAsyncPrefetchEnabled === true]);
assertEqual(expectedNodes, actualNodes, query);

let expectPrefetchRule = expectedNodes.filter((n) => n[1]).length > 0;
if (expectPrefetchRule) {
assertNotEqual(-1, result.plan.rules.indexOf(ruleName));
} else {
assertEqual(-1, result.plan.rules.indexOf(ruleName));
}

assertNotEqual(-1, result.plan.rules.indexOf(ruleName));

assertQueryError(
errors.ERROR_QUERY_FAIL_CALLED.code,
query,
query.bindParams,
query.options
);
},

// Regression test for https://arangodb.atlassian.net/browse/BTS-2211.
// There was a race after the prefetch task has written its result, but
// before the shared_ptr to the task is destructed. If the Query is
// destructed during that time window, and with it the AqlItemBlockManager,
// the prefetch task could hold a SharedAqlItemBlockPtr in its result, that
// would then be released to the already-destroyed AqlItemBlockManager.
// In maintainer mode, the AqlItemBlockManager's destructor asserts that
// no AqlItemBlocks are still leased: This test would run into that
// assertion.
testAsyncPrefetchRegressionBts2211 : function () {
// After a prefetch task has written its result, wait a little before
// destructing the shared_ptr to the task. It's now likely to be
// destructed after the Query. The AqlItemBlockManager's destructor
// asserts that no AqlItemBlocks are still in use, particularly for a
// result in the prefetch task.
IM.debugSetFailAt("AsyncPrefetch::delayDestructor");
const query = aql`
FOR d IN ${col}
FILTER NOOPT(d.val >= 0)
LET x = FAIL('failed')
RETURN d.val
`;
query.options = {};
const expectedNodes = [
["SingletonNode", false],
["EnumerateCollectionNode", true],
["CalculationNode", false],
["FilterNode", true],
["CalculationNode", false],
["ReturnNode", false],
];

const result = db._createStatement(query).explain();

IM.debugRemoveFailAt("AsyncPrefetch::blocksDestroyedOutOfOrder");
const actualNodes = result.plan.nodes.map((n) =>
[ n.type,
n.isAsyncPrefetchEnabled === true,
]);
assertEqual(expectedNodes, actualNodes, query);

assertNotEqual(-1, result.plan.rules.indexOf(ruleName));

for (let i = 0; i < 10000; ++i) {
assertQueryError(
errors.ERROR_QUERY_FAIL_CALLED.code,
query,
query.bindParams,
query.options
);
}
},
};
}
Expand Down