Skip to content

Output file can be corrupted if library keyword uses BuiltIn.run_keyword and timeout occurs #5417

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pekkaklarck opened this issue Apr 30, 2025 · 7 comments

Comments

@pekkaklarck
Copy link
Member

Output files can be corrupted if a timeout occurs when Robot is writing log messages. That is a serious issue, because it prevents creating normal log and report files. Investigating this kind of problems is also pretty annoying, because what happens depends on when exactly the timeout occurs. It could be that most times everything goes well, but now and then the output is corrupted.

To avoid these problems, we nowadays delay writing log messages when timeouts are enabled (#2839). Before we had this mechanism in place, the following example reproduced the problem pretty reliably regardless the operating system:

*** Settings ***
Library     lib.py

*** Test Cases ***
Test
    [Timeout]    0.1 sec
    Keyword
from robot.api import logger

MSG = 'A rather long message that is slow to write on the disk. ' * 10000

def keyword():
    while True:
        logger.info(MSG)

Unfortunately the current mechanism doesn't prevent the problem if a library keyword uses BuiltIn.run_keyword for executing other keywords that log messages. For example, this variation corrupts the output file nearly every time I run it on Windows:

from robot.api import logger
from robot.libraries.BuiltIn import BuiltIn

MSG = 'A rather long message that is slow to write on the disk. ' * 10000

def keyword():
    while True:
        BuiltIn().run_keyword('Log Huge Message')

def log_huge_message():
    logger.info(MSG)

I haven't been able to get the above example, or any variation of, to fail on Linux. This is most likely due to the platform specific timeout implementation, but looking at the related code, it seems the problem could happen anywhere. Anyway, this is a serious issue even if it would occur only on Windows.

@pekkaklarck
Copy link
Member Author

pekkaklarck commented Apr 30, 2025

The related code was changed recently when fixing #5395, but I tested that the same problem occurs also with earlier versions. I noticed this bug by looking at the new code, but I initially couldn't confirm it because there were not problems on Linux.

My understanding on why the bug occurs is this:

  • As mentioned in the description, the current mechanism prevents problems by delaying writing log messages. They are written to the output file only after the keyword has finished when timeouts, normally, cannot anymore occur.
  • If a library keyword runs other keywords, both the original library keyword and the keyword executed by it have their own timeouts. A timeout killing the parent keyword can thus occur when delayed messages by the child keyword are written.
  • The parent keyword could also be killed while other child keyword information (e.g. its documentation) is written. This is not a concern normally and there's no protection for timeouts in these cases.

I believe that fixing this requires a new approach to avoid problems with timeouts interrupting writing to output files. We could try making it possible to pause timeouts whenever we are writing something to an output file, but getting it to work reliably on all operating systems can be challenging. We probably should also make sure no more than one timeout can be active at the same time.

@pekkaklarck
Copy link
Member Author

Keywords executing other keywords should actually be registered as "run keyword variants" using the register_run_keyword method. If the above earlier example is changed to this problem doesn't anymore occur:

from robot.api import logger
from robot.libraries.BuiltIn import BuiltIn, register_run_keyword

register_run_keyword(
    library='lib',
    keyword='keyword', 
    deprecation_warning=False
)

MSG = 'A rather long message that is slow to write on the disk. ' * 10000

def keyword():
    while True:
        BuiltIn().run_keyword('Log Huge Message')

def log_huge_message():
    logger.info(MSG)

The problem is that this registration isn't well known and that it should be cleaned up (#2190). Output files getting corrupted is such a severe issue, that I hope we could avoid that also without extra configuration. Anyway, if fixing this problem is too complicated for RF 7.3, we should at least try to enhance the related documentation and make the registration API more well known.

pekkaklarck added a commit that referenced this issue Apr 30, 2025
Previous messages weren't restored meaning that messages could be
lost. The unused variable also made linters unhappy.

This code attemps to avoid problems with timeouts interrupting writing
to output files. The current code was written to fix #5395 and is
basically a rewrite of the fix for #2839. As #5417 explains, there are
still problems and bigger changes are needed.
@aaltat
Copy link
Contributor

aaltat commented Apr 30, 2025

So on Browser and SeleniumLibrary almost all keywords should be registered as run keywords because of their run on failure functionality? Which they are not done, at least on Browser library

@pekkaklarck
Copy link
Member Author

Yeah, the run-on-failure functionality uses BuiltIn.run_keyword and is affected by this issue. Registering these keywords as run keywords would mean that timeouts wouldn't be able to stop them and that would probably be worse. This makes fixing this issue even more important.

pekkaklarck added a commit that referenced this issue May 6, 2025
- Move high level timeout implementation from `__init__.py` to
  `timeout.py`.

- Make it an error to start or otherwise interact with inactive
  timeouts.

- Rename platform specific timeout classes to from `Timeout` to
  `<Platform>Runner`.

- Create common base class for runners.

- Add `Timeout.get_runner()` for getting a runner. This can be used
  later to get an access to a runner to be able to pause it (#5417).

- Preserve `Timeout.run()` as a convenience method.

- Remove locking and `_finished` flag from Windows implementation.
  They didn't seem to be actually needed, and also the commit were
  they were added mentioned that the fixed issue didn't really require
  them. See 1a5eeaa.
@pekkaklarck pekkaklarck changed the title Output file can be corrupted if library reyword runs other keywords and timeout occurs Output file can be corrupted if library keyword runs other keywords and timeout occurs May 7, 2025
@pekkaklarck pekkaklarck changed the title Output file can be corrupted if library keyword runs other keywords and timeout occurs Output file can be corrupted if library keyword BuiltIn.run_keyword and timeout occurs May 7, 2025
@pekkaklarck pekkaklarck changed the title Output file can be corrupted if library keyword BuiltIn.run_keyword and timeout occurs Output file can be corrupted if library keyword uses BuiltIn.run_keyword and timeout occurs May 7, 2025
@pekkaklarck pekkaklarck added the bug label May 7, 2025
@pekkaklarck
Copy link
Member Author

While investigating this, I noticed that timeouts outside Windows had a bug with nested timeouts (#5422). That explained why this issue occurred only on Windows and after fixing that problem this issue manifested itself also on Linux.

@pekkaklarck
Copy link
Member Author

This bug was fixed by pausing timeouts when BuiltIn.run_keyword is used in 00d45d4. The fix in that commit is pretty simple, but figuring out why exactly the problem occurred and what is the best way to fix it was rather complicated. Timeout related code was also refactored pretty heavily in 66d5300 to make the code easier to understand and to ease fixing this issue.

pekkaklarck added a commit that referenced this issue May 9, 2025
Missing part of the already closed #5417.
pekkaklarck added a commit that referenced this issue May 12, 2025
Also small adjustment to when to raise a timeout on resume. Earlier
timeout was raised on the first resume after timeout had been
exceeded, but now the runner must be fully resumed. This behavior is
more logical than the earlier.

The change shouldn't affect execution at all. The reason is that if
timeout is paused in nested manner, `BuiltIn.run_keyword` has been
used in recursively and each recursive call has its own timeout. The
last one of them will raise a timeout on resume immediately after the
timeout has been exceeded anyway.

This is part of #5417.
@pekkaklarck
Copy link
Member Author

The fix didn't work properly if BuiltIn.run_keyword was used recursively, but that was fixed by the above commits. This fix isn't in RF 7.3 rc 1, but it will be in the final release an in the possible rc 2.

pekkaklarck added a commit that referenced this issue May 14, 2025
- Enhance tests related to pausing timeouts.
- Fix the aforementioned tests on Windows.
- Enhance Windows timeout code to avoid race conditions.

Related to #5417.
pekkaklarck added a commit that referenced this issue May 19, 2025
Explain that using `BuiltIn.register_run_keyword` to disable timeouts
isn't relevant anymore now that #5417 is fixed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants