
Performance heavy timing problems in epoll handling #7102

Open
MarkusBeerwerth opened this issue Feb 11, 2025 · 1 comment
Labels
Component: Eventloop Issues related to the eventloop code Type: Performance / RT

Comments


MarkusBeerwerth commented Feb 11, 2025

Description

Hi,

after upgrading from 1.3 to the newest 1.4 version on a Linux system I noticed some performance issues.

In my case the average CPU time of the server increased around 20 times compared to the 1.3 version, noticeable even when running an idle server without any connections. So I took a look into it.

After some investigation, trying different 1.4 versions and recording traces, all 1.4 versions show the same behavior, which seems to be caused by the switch from the select to the epoll kernel API.

On every epoll_wait wakeup the OPC server makes many additional nonblocking epoll_wait calls with timeout 0. Far too many to count; in my case it seems to be somewhere in the high two-digit to low hundreds range, usually adding up to more than 500 µs of CPU time per server call. The overhead therefore usually adds up to far more than the expected CPU load of the server itself.

After building and debugging the library, the following issue seems to occur in the eventloop_posix.c and eventloop_posix_epoll.c modules, in the functions UA_EventLoopPOSIX_run and UA_EventLoopPOSIX_pollFDs:

  • all timeouts of the internally added callbacks are managed with a precision of 100 ns
  • the epoll_wait call used only supports a precision of 1 ms; the input is rounded down by integer division
  • epoll_wait therefore wakes up to a ms too early, so the server main loop in UA_Server_run spins nonblocking until the timeout a few hundred µs in the future is reached, the callback is executed, and the timeout is reset

It depends a bit on the starting time and jitter, but in the worst case every callback takes around 1 ms of extra CPU time per execution, independent of system performance, creating a noticeable amount of overhead. Even though epoll only supports ms precision, it usually triggers rather precisely, with minimal jitter in the double-digit microsecond range, x ms after it was called. So the size of the overhead is consistent after the first call.

All in all, this should be a general problem that affects all Linux builds of the 1.4.x versions.

Background Information / Reproduction Steps

  • run a server on Linux.
  • monitor cpu usage
  • record traces with kernel events (trace-cmd, lttng, etc.)
  • count timeouts < 1 ms in UA_EventLoopPOSIX_run

Issue #6593 seems to be related

Used CMake options (should not really matter):

  • Default debug Build config
  • tested on both arm64 and x86-64

Possible Solution:

At first glance I see a few possible options to fix the problem, but maybe you have better ideas.

Use epoll_pwait2:
There is already a commented-out epoll_pwait2 implementation right under the epoll_wait implementation in UA_EventLoopPOSIX_pollFDs.
epoll_pwait2 would probably work fine, but could be a breaking change for older systems; I have already encountered some older Linux versions that didn't support it.

Ceil the timeout that goes into epoll:
A rather quick and dirty solution would be to ceil the timeout that goes into epoll_wait.
I tried this as a temporary workaround and it seems to work rather well in my case. Subsequent calls of the repeated callbacks will usually have the same offset to the epoll_wait call, so they still get called rather precisely in x ms intervals. It would also only affect the Linux version.

Manage the timeouts with ms precision:
A cleaner solution would probably be to manage the callback times with ms precision only and account for the jitter by checking whether the timestamp is near the expected one. The callbacks are only parameterized with ms precision anyway; I am not sure whether there are good use cases for getting more precise than that on other platforms.

User Workarounds:

Run UA_Server_run_iterate with nonblocking waitInternal yourself and sleep one ms longer than it returns.

Checklist

Please provide the following information:

  • open62541 Version (release number or git tag): All 1.4.x Versions
  • Other OPC UA SDKs used (client or server):
  • Operating system: Debian Bookworm
  • Logs (with UA_LOGLEVEL set as low as necessary) attached
  • Wireshark network dump attached
  • Self-contained code example attached
  • Critical issue
@marwinglaser marwinglaser added Component: Eventloop Issues related to the eventloop code Type: Performance / RT labels Feb 13, 2025

jpfr commented Feb 21, 2025

We can reproduce.
This will get fixed.
