
Performance heavy timing problems in epoll handling #7102

Open
MarkusBeerwerth opened this issue Feb 11, 2025 · 1 comment
Labels
Component: Eventloop Issues related to the eventloop code Type: Performance / RT

Comments


MarkusBeerwerth commented Feb 11, 2025

Description

Hi,

after upgrading from 1.3 to the newest 1.4 version on a Linux system I noticed some performance issues.

In my case the average CPU time of the server increased around 20 times compared to the 1.3 version, noticeable even when running an idle server without any connections. So I took a look into it.

After some investigation, trying different 1.4 versions and recording traces, all 1.4 versions show the same behavior, which seems to be caused by the switch from the select to the epoll kernel API.

On every epoll_wait wakeup the OPC server makes many additional nonblocking epoll_wait calls with timeout 0. Far too many to count; in my case it seems to be somewhere in the high two-digit to low hundreds range, usually adding up to more than 500 µs of CPU time per server call. The overhead therefore usually adds up to far more than the expected CPU load of the server itself.

After building and debugging the library, the following issue seems to occur in the eventloop_posix.c and eventloop_posix_epoll.c modules, in the functions UA_EventLoopPOSIX_run and UA_EventLoopPOSIX_pollFDs:

  • all timeouts of the internally added callbacks are managed with a precision of 100 ns
  • the epoll_wait call used only supports a precision of 1 ms; the input is rounded down by integer division
  • epoll_wait therefore wakes up to a ms too early, so the server main loop in UA_Server_run spins nonblocking until the timeout a few hundred µs in the future is reached, the callback is executed, and the timeout is reset

It depends a bit on the starting time and jitter, but in the worst case every callback takes around 1 ms of extra CPU time per execution, independent of system performance, creating a noticeable amount of overhead. Even though epoll only supports ms precision, it usually triggers rather precisely, with minimal jitter in the double-digit microsecond range, x ms after it was called. So the size of the overhead is consistent after the first call.

All in all, this should be a general problem that affects all Linux builds of the 1.4.x versions.

Background Information / Reproduction Steps

  • run a server on Linux.
  • monitor cpu usage
  • record traces with kernel events (trace-cmd, lttng, etc.)
  • count timeouts < 1 ms in UA_EventLoopPOSIX_run

Issue #6593 seems to be related

Used CMake options (should not really matter):

  • Default debug Build config
  • tested on both arm64 and x86-64

Possible Solution:

At first glance I see a few possible options to fix the problem, but maybe you have better ideas.

Use epoll_pwait2:
There is already a commented-out epoll_pwait2 implementation right under the epoll_wait implementation in UA_EventLoopPOSIX_pollFDs.
epoll_pwait2 would probably work fine, but could be a breaking change for older systems; I have already encountered some older Linux versions that didn't support it.

Ceil the timeout that goes into epoll:
A rather quick and dirty solution would be to ceil the timeout that goes into epoll_wait.
I tried this as a temporary workaround and it seems to work rather well in my case. Subsequent calls of the repeated callbacks will usually have the same offset to the epoll_wait call, so they still get called rather precisely in x ms intervals. It would also only affect the Linux version.

Manage the timeouts with ms precision:
A cleaner solution would probably be to manage the callback times with ms precision only and account for the jitter by checking whether the timestamp is near the expected one. The callbacks are only parameterized with ms precision anyway; I am not sure whether there are good use cases for getting more precise than that on other platforms.

User Workarounds:

Run UA_Server_run_iterate with nonblocking waitInternal yourself and sleep one ms longer than it returns.

Checklist

Please provide the following information:

  • open62541 Version (release number or git tag): All 1.4.x Versions
  • Other OPC UA SDKs used (client or server):
  • Operating system: Debian Bookworm
  • Logs (with UA_LOGLEVEL set as low as necessary) attached
  • Wireshark network dump attached
  • Self-contained code example attached
  • Critical issue
@marwinglaser marwinglaser added Component: Eventloop Issues related to the eventloop code Type: Performance / RT labels Feb 13, 2025

jpfr commented Feb 21, 2025

We can reproduce.
This will get fixed.
