
Termbox SSR connection terminated very often
Closed, Resolved · Public

Description

Starting last Wednesday, 2020-06-10, the number of connection failures with the Termbox SSR service seems to have risen sharply to about 300–450 per hour.

https://logstash.wikimedia.org/goto/808228af4133ed6e857a168de0c6cd4c

image.png (209×1 px, 26 KB)

The error message is given as:

Wikibase\View\Termbox\Renderer\TermboxRemoteRenderer: encountered a bad response from the remote renderer

with the content being:

upstream connect error or disconnect/reset before headers. reset reason: connection termination

It is currently unclear what is going on or what the effects are. At the very least, there is a lot of logspam that started recently.

IRC log of discussion: https://wm-bot.wmflabs.org/browser/index.php?start=06%2F15%2F2020&end=06%2F15%2F2020&display=%23wikimedia-serviceops

Investigate:

  • what is going on
  • what are the effects

Likely it is only the "top-level" MediaWiki to termbox requests that time out.

Acceptance criteria:

  • Investigation results are provided in this task

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jun 15 2020, 9:53 AM

Change 605554 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] services_proxy: Add keepalive, retries

https://gerrit.wikimedia.org/r/605554

Change 605554 merged by Alexandros Kosiaris:
[operations/puppet@production] services_proxy: Add keepalive, retries

https://gerrit.wikimedia.org/r/605554

The patch above (Ib7bfacf88/02109a44) seemingly resolved the original issue, but some new errors started appearing (though much less common than before).

Those new errors seem to come from a timeout in Termbox/Kubernetes, which results in a timeout in the php/mediawiki code.

The following two links illustrate that error for the rendering process of a single item, 庞家村 (Q14836111):

Termbox/k8s: https://logstash.wikimedia.org/goto/480ada0d3692e0cadd04e9e09f9d95f5
Mediawiki: https://logstash.wikimedia.org/goto/fdc9bc9dab0d92bc440f7c6a20e016be

This was judged to be not pressing (@Tarrow, @Addshore please correct me if I'm talking rubbish here). Currently, the internal timeouts seem to be 3 seconds both for the MediaWiki side (ssrServerTimeout?) and for the Termbox side. It is not clear what we can do about those errors. While the internal requests should not take longer than 3 seconds, they sometimes do.

Follow-ups:

Logstash links are unfortunately not copy-pastable. Anyone clicking on those links will get an "Unable to completely restore the URL, be sure to use the share functionality." warning.

Meh! Thanks for pointing that out. I still have those tabs open. I'll try to see if there is some persistent share functionality there.

@akosiaris I edited my comment and it now has links that hopefully should work. There should be only a single event behind each of the searches linked there.

Also, I added another follow-up task to understand those k8s timeouts themselves.

@Michael thanks for writing this up!

So, if it is safe to assume the MW -> termbox timeout is 3s, I would suggest we configure the envoys accordingly by setting tls.upstream_timeout: "3s" in termbox's values.yaml as well as timeout: "3s" in the appservers' envoy config (also lowering keepalive there).

@Michael thanks for writing this up!

So, if it is safe to assume the MW -> termbox timeout is 3s, I would suggest we configure the envoys accordingly by setting tls.upstream_timeout: "3s" in termbox's values.yaml as well as timeout: "3s" in the appservers' envoy config (also lowering keepalive there).

That sounds sensible to me, but I'm really not the expert here. What do @Addshore and @Tarrow think? Note that in the context of T255436 we might look at timeouts again, but this has not yet been prioritized, so I cannot speculate on when that might be.

So one thing relating to this ticket and timeouts that we realized is that it would likely make sense for the initial MW -> termbox timeout to be longer than the termbox -> MW timeouts.
In an ideal situation, MW's first request would receive an error message from the termbox service saying that the inner MediaWiki request failed, rather than the current behaviour where the first MW -> termbox request times out before that happens.
Another thing to take into account is that termbox can make multiple requests to MediaWiki, depending on the state of the cache. I guess these each have a timeout?

In summary, we should look at all of this and figure out the correct values.
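
To make that timeout-budget reasoning concrete, here is a minimal sketch of the arithmetic, with purely hypothetical numbers; the real values for ssrServerTimeout, the termbox-side timeout, and the number of inner requests per render would need to be confirmed:

# Hypothetical MW -> termbox -> MW timeout budget, illustrative only (Python).
inner_timeout_s = 3.0    # termbox -> MediaWiki API timeout (assumed from this thread)
max_inner_requests = 2   # termbox may hit the MW API more than once on a cache miss (assumption)
render_overhead_s = 0.5  # time spent rendering inside termbox itself (made-up figure)

# For MW to receive termbox's own error message instead of timing out first,
# the outer timeout must exceed the worst-case inner budget.
inner_budget_s = max_inner_requests * inner_timeout_s + render_overhead_s
outer_timeout_s = 3.0    # current MW -> termbox timeout (ssrServerTimeout?, per this thread)

if outer_timeout_s <= inner_budget_s:
    print(f"outer timeout {outer_timeout_s}s <= inner budget {inner_budget_s}s: "
          "MW gives up before termbox can report the inner failure")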

Note that while we are still seeing this issue, the event rate is much lower than what we used to see:

image.png (200×1 px, 25 KB)

Further, looking into T255436: Termbox Error Logging Should Differentiate between RemoteRenderer and Service timeouts might be helpful, though I'll note that there are already two error messages present:

image.png (215×1 px, 21 KB)
(looking at the last 7 days), though I do not know whether they correspond to what is requested in that task.

We (@Michael + @toan + @Tarrow + myself) looked at the last 12 hours of those events. We could not immediately see a connection between the items of interest, but what stood out was that those errors ("Request failed with status 0. Usually this means network failure or timeout") seem to happen in bursts.

This leads me to the assumption that we are not dealing with problems caused by the termbox service itself responding poorly to individual requests, but rather that the infrastructure at those moments is facing challenges outside of our team's control. I think we need someone with a better understanding of how the production hosting is currently set up (buzzwords: tls-proxy, kubernetes) and with insight into the health of similar services at the same time, so that we do not look at this problem at the wrong layer of abstraction and lose the majority of relevant data points in the process (e.g. there is no point in searching for "termbox" in Kibana if we are in fact dealing with a network problem).

Possibly related (=> T263764):
We are seeing our termbox service, when reaching out to the mediawiki API, occasionally receiving a 503 (e.g.). We are also seeing a very similar error from a TLS-proxy (e.g.)
What we did not manage to track down is a corresponding error from an app server (MW; acting as API for the termbox service) which we would have expected to see.
We wonder whether the errors may have something to do with the "recent" move of termbox to TLS (T254581); we don't know how that setup works, or how we would track down those errors.

(The few errors like Bad request. Language not existing which were mentioned during the session are already tracked in T231007)

It seems like we've gone as far as we can in investigating this; now waiting for WMF input in the subtasks.

We are seeing our termbox service, when reaching out to the mediawiki API, occasionally receiving a 503 (e.g.). We are also seeing a very similar error from a TLS-proxy (e.g.)

Logstash urls aren't copy-pasteable, I am afraid. I get "Unable to completely restore the URL, be sure to use the share functionality." Could you please update with a logstash shared url so we can be sure we are looking into the same thing before we delve deeper into this?

Hi,

be sure we are looking into the same thing before we delve deeper into this?

That makes absolute sense.

Could you please update with a logstash shared url

I unfortunately don't know how to do this for single documents. The links show the warning for me as well but reproduce fine.

Maybe one of my more knowledgeable colleagues can help out with this on T263764, which we created for this particular sub-problem.

Thanks

@akosiaris I did some tinkering in the Kibana UI and came up with this (hopefully) shareable link.

The 503s seem to occur at random, but in an interesting bursty pattern.

Sorry for not answering earlier.

I unfortunately don't know how to do this for single documents. The links show the warning for me as well but reproduce fine.

Ok, then that should suffice. Thanks for confirming that. I can confirm there is a minor increase in errors during the time period of that document.

That being said, I've been looking a bit more into this. A couple of notes:

  • Grafana for the last 7 days (https://grafana.wikimedia.org/d/JcMStTFGz/termbox-slo-panel?orgId=1&from=1601337600000&to=1602028799000) reports 517 HTTP 500s.
  • Logstash (https://logstash.wikimedia.org/goto/2d69a0f714d7fb66c1959da2f0e8b69a) says 533 errors. Note also that logstash includes eqiad, which isn't pooled in those 7 days but is still receiving health-check requests and has to reach over to codfw with added latency. So I am going to assume the two numbers are roughly equal, because that 16-entry discrepancy doesn't matter for the rest of this comment.
  • We are starting to work on some preliminary/draft SLOs for mediawiki. There is some work to be done on getting numbers, but once we come up with those, it would be prudent to adjust the SLO of termbox to match, as termbox depends on mediawiki and it doesn't make sense for it to promise stricter SLOs than mediawiki does.

So we get an error rate of 0.01889% (or 0.0001889), with the SLO of the service being 0.1% (or 0.001) per T212189#5007579. The flip side of that is an availability of 99.98111%, which is something to be rather proud of (see https://en.wikipedia.org/wiki/High_availability#Percentage_calculation).
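
For reference, the arithmetic behind those percentages looks roughly like this; the total request count is not stated in this task, so the figure below is inferred from the reported error rate and is purely illustrative:

# Back-of-the-envelope SLO check for the 7d window above (Python).
errors_7d = 517                # 500s reported by the Grafana SLO panel
total_requests_7d = 2_737_000  # NOT a measured number: inferred from the 0.01889% error rate

error_rate = errors_7d / total_requests_7d
availability = 1 - error_rate
slo_error_budget = 0.001       # 0.1% per T212189#5007579

print(f"error rate:   {error_rate:.5%}")    # ~0.01889%
print(f"availability: {availability:.5%}")  # ~99.98111%
print("within SLO" if error_rate <= slo_error_budget else "SLO violated")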

If we increase the timespan a bit to 30d (https://grafana.wikimedia.org/d/JcMStTFGz/termbox-slo-panel?orgId=1&from=1599350400000&to=1602028799000), we get 0.07301%, roughly four times higher, but still below the SLO.

Note that if we extend the timespan to before August 26th, the picture changes heavily, e.g. https://grafana.wikimedia.org/d/JcMStTFGz/termbox-slo-panel?orgId=1&from=1597449600000&to=1602028799000. However, as https://grafana.wikimedia.org/d/wJRbI7FGk/termbox?viewPanel=15&orgId=1&from=1598227200000&to=1598659199000 shows, a deployment (https://sal.toolforge.org/production?p=0&q=&d=2020-08-26 points to e03ee593f57adc7556f7a4, which enabled the service proxy) already fixed that, so corrective action has been taken since this task was created.

Let's stick to the 7d timespan for now. Notes again:

  • Total of 533 errors in logstash
  • 273 are "timeout of 3000ms exceeded" - tracked in T255450. This seems to me the most interesting one to look into.
  • 170 are "Request failed with status code 500" - all of those are constrained to the timespan 2020-09-30T20:28:57 to 2020-09-30T22:01:05, and it's mediawiki that is returning those errors. https://logstash.wikimedia.org/goto/02a4bbcab3b7864b4b9a91fd7a26fb4a.
  • 77 are "Request failed with status code 503". Those come from the sidecar envoy instance that termbox uses to connect to mediawiki. The reasons for adopting envoy are explained in https://wikitech.wikimedia.org/wiki/Envoy#Envoy_at_WMF (note that the same component also offers TLS termination, so that termbox doesn't need to know or care about our internal TLS configuration). I guess this is also tracked in T263764, so I'll add a bit more information to that.
  • The remaining 13 events don't seem worth looking into further.

Overall, I am inclined to say that, given that the SLO isn't being violated over the course of the quarter, this should be a low priority.

@akosiaris I did some tinkering in the Kibana UI and came up with this (hopefully) shareable link.

The 503s seem to occur at random, but in an interesting bursty pattern.

Thanks. I am having some problems drilling down into the events using that link, as it covers a variety of events (e.g. all errors from the termbox service, multiple different errors from the Wikibase MW extension). I'd like to believe that the termbox dashboard I linked to above is a bit more helpful.

First off, thank you @akosiaris for your time investigating this.

At the start of this task we were presented with the title "Termbox SSR connection terminated very often" and the logstash screenshot of the events that took place starting on 2020-06-10. Since then, the patch https://gerrit.wikimedia.org/r/c/605554 has been applied and we are now seeing a much lower error rate. No other fix that we are aware of could be credited with mitigating these errors.

https://phabricator.wikimedia.org/T255450: "timeout of 3000ms exceeded" seems to happen in bursts; these errors rarely occur alone, but rather spread out over multiple Kubernetes containers for a couple of seconds or less.

For the months going back to 2020-07-12, these errors were much more frequent: error rates averaged around 5000 per week on eqiad until 2020-08-31 (https://logstash-next.wikimedia.org/goto/4d74e7ec568b38ecebfe01328a2bad2d). Since then and up to the time of writing, the error rates are again much lower, ranging from 100 to 300 per week for both eqiad and codfw (https://logstash-next.wikimedia.org/goto/86501120a879772a1db250abb1c250dc).

https://phabricator.wikimedia.org/T263764: The second recurring error is "Request failed with status code 503" ("upstream connect error or disconnect/reset before headers. reset reason: local reset"), which does not follow the same pattern as the first timeout. These happen seemingly at random, and most of the time only a single error is reported, not a series.

For the last month, these seem to occur around 100 or fewer times per week, with the exception of the period 2020-09-10 16:00 - 21:00 (https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=termbox&var-destination=mwapi-async&from=1599753677237&to=1599770627856).

As mentioned in https://phabricator.wikimedia.org/T263764#6524589, errors like these are expected to occur, and as we are still below the SLO we will not investigate them further for the time being.

After discussing this with @Michael this evening, we've decided that these two most prominent errors will not be investigated further for now, using the SLO as the guideline for that decision. Instead, they will be documented on https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service in a section describing these errors, their expected frequency, and where in the stack they occur.

@akosiaris Thank you a lot for your detailed response. I did look into those errors a tiny bit more to properly document them, as can now be seen on wikitech.

In the course of that, I looked at the last few days and noticed some discrepancies from the numbers you provided above. All the following data is for the 7 days between 2020-10-07 00:00:00 and 2020-10-13 23:59:59.

  • the Grafana SLO panel shows 277 errors.
  • the kubernetes logstash for Termbox SSR has 171 errors in that time frame
    • 120 timeout errors, 51 envoy 503 errors
    • I excluded some 19 errors about "startup finished"; those are probably the ones you mentioned as "not worth looking into further".

I was surprised by that, but noticed that there were also a similar number of network errors between MediaWiki and the Termbox SSR app in that timeframe:

It would make sense to me if the SLO covered those network problems as well, as they de facto impact the availability of the service to MediaWiki. Also, taking those errors together, we can account for 275 of the 277 errors shown in the Grafana SLO panel.

Is the understanding laid out above correct?

@akosiaris Thank you a lot for your detailed response. I did look into those errors a tiny bit more to properly document them, as can now be seen on wikitech.

In the course of that, I looked at the last few days and noticed some discrepancies from the numbers you provided above. All the following data is for the 7 days between 2020-10-07 00:00:00 and 2020-10-13 23:59:59.

I think you just exposed some weird behavior/bug in prometheus's increase() function regarding counter resets. I've added a panel to the graph showcasing it. If you manually subtract the valleys from the peaks for the 3 distinct timeframes depicted there, you get almost the same errors as logstash: 62 - 0 + 99 - 0 + 484 - 440 = 170. It's probably that last (first timewise) timeframe that throws prometheus off. Given that, per the docs [1],

It is syntactic sugar for rate(v) multiplied by the number of seconds under the specified time range window, and should be used primarily for human readability.

there is probably something funny going on over the large timeframe. The rate() is also depicted in the panel; it is gradually dropping as well, but it's quite a bit higher in the first timeframe.

A couple of notes, though, to clarify a few things.

  • the Grafana SLO panel shows 277 errors.

This is from the PoV of termbox itself. It counts the HTTP 500s termbox knows it emitted.

  • the kubernetes logstash for Termbox SSR has 171 errors in that time frame
    • 120 timeout errors, 51 envoy 503 errors
    • I excluded some 19 errors about "startup finished"; those are probably the ones you mentioned as "not worth looking into further".

Same PoV, but at the log level.

I was surprised by that, but noticed that there were also a similar number of network errors between MediaWiki and the Termbox SSR app in that timeframe:

That's actually from the PoV of mediawiki. If you put this logstash dashboard and the termbox one side by side, there's considerable overlap, as events are depicted in both.

It would make sense to me if the SLO covered those network problems as well, as they de facto impact the availability of the service to MediaWiki. Also, taking those errors together, we can account for 275 of the 277 errors shown in the Grafana SLO panel.

Is the understanding laid out above correct?

I think it's wrong to sum the two logstash dashboards (in fact, it's just a coincidence that the numbers added up to something close to 277, as that was a made-up number from prometheus). They are of a different nature, and summing them would double-count events.

[1] https://prometheus.io/docs/prometheus/latest/querying/functions/#increase

@akosiaris Thank you a lot for your detailed response. I did look into those errors a tiny bit more to properly document them, as can now be seen on wikitech.

In the course of that, I looked at the last few days and noticed some discrepancies from the numbers you provided above. All the following data is for the 7 days between 2020-10-07 00:00:00 and 2020-10-13 23:59:59.

I think you just exposed some weird behavior/bug in prometheus's increase() function regarding counter resets. I've added a panel to the graph showcasing it. If you manually subtract the valleys from the peaks for the 3 distinct timeframes depicted there, you get almost the same errors as logstash: 62 - 0 + 99 - 0 + 484 - 440 = 170. It's probably that last (first timewise) timeframe that throws prometheus off. Given that, per the docs [1],

It is syntactic sugar for rate(v) multiplied by the number of seconds under the specified time range window, and should be used primarily for human readability.

there is probably something funny going on over the large timeframe. The rate() is also depicted in the panel; it is gradually dropping as well, but it's quite a bit higher in the first timeframe.

That seems very strange. I would have expected the error rate to be calculated by (number of errors / number of total requests) for the given timeframe. How does it actually work? Something like (number of milliseconds with error/number of total milliseconds in timeframe)?

I was surprised by that, but noticed that there were also a similar number of network errors between MediaWiki and the Termbox SSR app in that timeframe:

That's actually from the PoV of mediawiki. If you put this logstash dashboard and the termbox one side by side, there's considerable overlap, as events are depicted in both.

Oh, you are right! If I look only at the timeouts and remove the timeouts to the unused datacenter (triggered by the health checks), then they even line up almost perfectly!

Is the understanding laid out above correct?

I think it's wrong to sum the two logstash dashboards (in fact, it's just a coincidence that the numbers added up to something close to 277, as that was a made-up number from prometheus). They are of a different nature, and summing them would double-count events.

Thank you for providing this feedback! I've updated our documentation.

Documentation looks great in my opinion.

That seems very strange. I would have expected the error rate to be calculated by (number of errors / number of total requests) for the given timeframe. How does it actually work? Something like (number of milliseconds with error/number of total milliseconds in timeframe)?

You can say that again :-). The main formula is what you described. In prometheus terms, it's

sum(increase(service_runner_request_duration_seconds_count{service="termbox", prometheus="k8s", uri="termbox", status="500"}[$__range]))

/

sum(increase(service_runner_request_duration_seconds_count{service="termbox", prometheus="k8s", uri="termbox", status=~"200|500"}[$__range]))

and that's what the left panel in that dashboard has. The issue isn't with the division; it's rather with the increase() function (the right-hand panel is just the numerator of the above expression), so it's

sum(
  increase(
     service_runner_request_duration_seconds_count{service="termbox", prometheus="k8s", uri="termbox", status="500"}[$__range]
  )
)

The sum() is there to sum across all the instances of termbox in that timeframe; the increase() calculates the change in that quantity from the start to the end of the timeframe. Normally this works, but in this case it has failed. My guess as to what happened is that, due to 2 deployments (you can use the main termbox dashboard to spot them), termbox pods were destroyed and new ones started. So the metrics changed and the internal counter-reset detection of rate() could not function. If you target a week without deploys, you aren't going to witness that.

If you are more interested in prometheus counters, there's more info about how they work at https://www.robustperception.io/how-does-a-prometheus-counter-work
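
To make the "subtract the valleys from the peaks" idea concrete, here is a minimal Python sketch of reset-aware delta summing over a single counter series; it is a toy model, not Prometheus's actual implementation, which additionally extrapolates rate()/increase() to the window boundaries:

def total_increase(samples):
    """Sum the per-interval deltas of a counter, treating any drop in value
    as a counter reset (e.g. the pod was replaced)."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # After a reset the counter restarts from zero, so the whole
            # observed value counts as increase since the reset.
            total += cur
    return total

# Toy series: the counter climbs, the pod is replaced (value drops to 0), climbs again.
print(total_increase([0, 40, 90, 0, 25, 60]))  # -> 150.0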

It also means we'll have to figure out how to better calculate the SLO across large timeframes.

We are starting to work on some preliminary/draft SLOs for mediawiki. There is some work to be done on getting numbers, but once we come up with those, it would be prudent to adjust the SLO of termbox to match, as termbox depends on mediawiki and it doesn't make sense for it to promise stricter SLOs than mediawiki does.

Well said. The current objectives and timeouts were... educated guesses at the time, and the - to an extent - circular dependency did not help in drafting them. There is poor T220404: Calculate sensible MW->Node timeout once deployed, a last-ditch effort from back then to document the need to revisit this once we had real-world data to base decisions on, but nothing ever came of it. I'm looking forward to the mediawiki SLOs.

Thanks for updating the documentation on wikitech!