Bug#31790329 PERFORMANCE REGRESSION DUE TO TOO MUCH RELEASE OF NDB API OBJECTS
A performance regression was introduced by WL#8351, where too many
NDB API objects were released to the memory allocator instead of being
kept in the free-list. The overhead of the frequent allocator calls
mainly showed up as increased transaction latency at high load.
Maximum TPS was also somewhat affected.
The root cause appears to be that the NDB API object usage statistics
introduced by WL#8351 assumed object usage to be normally distributed
over the transactions executed in each Ndb context. That is not always
the case, so the object recycling mechanism in WL#8351 did not behave
as intended. When some customer use cases were instrumented, the observed
behavior was typically that a majority (~99%) of transactions used 1-2
NDB API objects, while a few outliers used many more. As all object
usages were sampled with the same 'priority', the small usages dominated
the usage statistics, preventing the larger transactions from being
served from the free-list of API objects.
As we want even these larger transactions to be served from the free-list,
the statistics should focus on sampling them. Smaller usage periods
in between should be ignored until it is likely that a more permanent
change in behaviour has occurred.
This patch implements logic for ignoring such intermediate low-usage
periods, sampling only the larger usage peaks - from comments in the patch:
+ * update_stats() is called whenever a new local peak of 'm_used_cnt'
+ * objects has been observed.
+ *
+ * The high usage peaks are the most interesting, as we want to scale the
+ * free-list to accommodate them - the smaller peaks in between are mostly
+ * considered 'noise' in these statistics. Sampling them could result in
+ * too-low usage statistics being collected, such that the high usage
+ * peaks could not be served from the free-list.
+ *
+ * In order to implement this we use a combination of statistics and
+ * heuristics. Heuristics is based on observing free-list behavior of
+ * an instrumented version of this code.
+ *
+ * 1) A 'high peak' is any peak value above or equal to the current
+ * sampled mean value -> added to the statistics immediately.
+ * 2) A sampled peak value of 2 or less is considered as 'noise' and
+ * just ignored.
+ * 3) Other peak values, less than the current mean:
+ * These are observed over a period of such smaller peaks, and their
+ * max value is collected in 'm_sample_max'. When the window size has
+ * expired, the 'm_sample_max' value is sampled.
+ * The intent of this heuristic is that temporarily reduced usage of
+ * objects should be ignored, but longer-term changes should be
+ * accounted for.
+ *
+ * When we have taken a valid sample, we use the statistics to calculate
+ * the 95th percentile for the max objects in use of 'class T'.
Reviewed by: Frazer Clement <frazer.clement@oracle.com>