x/build: add LUCI linux-s390x builder #67307

Open
dmitshur opened this issue May 10, 2024 · 17 comments
Assignees
Labels
arch-s390x Issues solely affecting the s390x architecture. Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. new-builder OS-Linux
Milestone

Comments

@dmitshur
Contributor

There currently isn't a LUCI builder that tests the linux/s390x port (other than the misc-compile builder, which tests only that the port compiles). This is the tracking issue for it.

The next steps that a builder owner will need to follow to make progress here are documented at https://go.dev/wiki/DashboardBuilders#luci-builders.

@dmitshur dmitshur added OS-Linux Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. new-builder arch-s390x Issues solely affecting the s390x architecture. labels May 10, 2024
@dmitshur dmitshur added this to the Unreleased milestone May 10, 2024
@srinivas-pokala srinivas-pokala self-assigned this Aug 15, 2024
@dmitshur dmitshur moved this to In Progress in Go Release Aug 16, 2024
@gopherbot
Contributor

Change https://go.dev/cl/617359 mentions this issue: crypto/internal/fips/sha3: reduce s390x divergence

@srinivas-pokala
Contributor

Hostname for the builder: linux-s390x-ibm
The CSR is attached:
linux-s390x-ibm.csr.txt

@dmitshur
Contributor Author

dmitshur commented Oct 4, 2024

Thanks. CC @mknyszek.

@mknyszek mknyszek assigned mknyszek and unassigned srinivas-pokala Oct 4, 2024
@mknyszek
Contributor

mknyszek commented Oct 4, 2024

Here's the certificate:
linux-s390x-ibm-1728069494.cert.txt

@mknyszek mknyszek assigned srinivas-pokala and unassigned mknyszek Oct 4, 2024
cpu pushed a commit to cpu/go that referenced this issue Oct 16, 2024
It's a little annoying, but we can fit the IBM instructions on top of
the regular state, avoiding more intrusive interventions.

Going forward we should not accept assembly that replaces the whole
implementation, because it doubles the work to do any refactoring like
the one in this chain.

Also, it took me a while to find the specification of these
instructions, which should have been linked from the source for the next
person who'd have to touch this.

Finally, it's really painful to test this without a LUCI TryBot, per golang#67307.

For golang#69536

Change-Id: I90632a90f06b2aa2e863967de972b12dbaa5b2ae
gopherbot pushed a commit that referenced this issue Oct 28, 2024
It's a little annoying, but we can fit the IBM instructions on top of
the regular state, avoiding more intrusive interventions.

Going forward we should not accept assembly that replaces the whole
implementation, because it doubles the work to do any refactoring like
the one in this chain.

Also, it took me a while to find the specification of these
instructions, which should have been linked from the source for the next
person who'd have to touch this.

Finally, it's really painful to test this without a LUCI TryBot, per #67307.

For #69536

Change-Id: I90632a90f06b2aa2e863967de972b12dbaa5b2ae
Reviewed-on: https://go-review.googlesource.com/c/go/+/617359
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Filippo Valsorda <filippo@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Daniel McCarney <daniel@binaryparadox.net>
Reviewed-by: Roland Shoemaker <roland@golang.org>
@gopherbot
Contributor

Change https://go.dev/cl/636055 mentions this issue: crypto/internal/cryptotest: skip TestAllocations on s390x

gopherbot pushed a commit that referenced this issue Dec 13, 2024
TestXAESAllocations fails like #70448, and crypto/rand's fails in FIPS
mode. We can't keep chasing these without even a LUCI builder.

Updates #67307

Change-Id: I5d0edddf470180a321dec55cabfb018db62eb940
Reviewed-on: https://go-review.googlesource.com/c/go/+/636055
Auto-Submit: Filippo Valsorda <filippo@golang.org>
Reviewed-by: Roland Shoemaker <roland@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
@srinivas-pokala
Contributor

@mknyszek I am facing an issue while following the steps mentioned above for the LUCI builder setup. After step 3, I did the following.

Note: the default builder machines run under the user linux1.

Step 4: Created a cron job for luci_machine_tokend using a new user (a2).

[a2@go-s390x01 ~]$ crontab -l
*/10 * * * * luci_machine_tokend -backend luci-token-server.appspot.com -cert-pem /home/a2/linux-s390x-ibm-1728069494.cert.txt -pkey-pem /home/a2/linux-s390x-ibm.key -token-file=/var/lib/luci_machine_tokend/token.json

Step 5: Created a systemd service to run "bootstrapswarm" as below, using the same user (a2).

[Unit]
Description=Bootstrapswarm Service
After=network.target

[Service]
User=a2
Group=a2
ExecStart=/home/a2/go/bin/bootstrapswarm -hostname linux-s390x-ibm
Restart=always
RestartSec=5
Environment="PATH=/usr/local/go/bin:/usr/bin:/bin:/home/a2/go/bin"

[Install]
WantedBy=multi-user.target

After this, when I check the bot's start-up log, I see the output below:

[a2@go-s390x01 ~]$ sudo journalctl -u bootstrapswarm -f
[sudo] password for a2: 
-- Logs begin at Mon 2025-01-20 04:52:31 EST. --
Jan 20 09:22:32 go-s390x01 systemd[1]: Stopped Bootstrapswarm Service.
Jan 20 09:22:32 go-s390x01 systemd[1]: Started Bootstrapswarm Service.
Jan 20 09:22:32 go-s390x01 systemd[1]: bootstrapswarm.service: Main process exited, code=exited, status=203/EXEC
Jan 20 09:22:32 go-s390x01 systemd[1]: bootstrapswarm.service: Failed with result 'exit-code'.
Jan 20 09:22:38 go-s390x01 systemd[1]: bootstrapswarm.service: Service RestartSec=5s expired, scheduling restart.
Jan 20 09:22:38 go-s390x01 systemd[1]: bootstrapswarm.service: Scheduled restart job, restart counter is at 3102.
Jan 20 09:22:38 go-s390x01 systemd[1]: Stopped Bootstrapswarm Service.
Jan 20 09:22:38 go-s390x01 systemd[1]: Started Bootstrapswarm Service.
Jan 20 09:22:38 go-s390x01 systemd[1]: bootstrapswarm.service: Main process exited, code=exited, status=203/EXEC
Jan 20 09:22:38 go-s390x01 systemd[1]: bootstrapswarm.service: Failed with result 'exit-code'.

But when I run "/home/a2/go/bin/bootstrapswarm -hostname linux-s390x-ibm" manually, I get a 401 (authentication) error. I tried to investigate the issue but could not make much progress.
Can you please help me troubleshoot this issue?
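
For readers following along: status=203/EXEC in the journal above is systemd's exit code for a failed exec of the ExecStart= binary, so the service never reached the point of contacting the server. A minimal sketch of a first check is below, assuming it is run as the service user; `diagnose_exec` is a hypothetical helper, not part of bootstrapswarm or systemd, and it does not cover other possible causes such as a security policy denying execution from a home directory.

```python
import os
import stat


def diagnose_exec(path):
    """Return a list of likely reasons why `path` could not be exec'd.

    Hypothetical helper for illustration only; an empty list means none of
    the simple checks below found a problem.
    """
    problems = []
    if not os.path.exists(path):
        problems.append("file does not exist")
        return problems
    mode = os.stat(path).st_mode
    if not stat.S_ISREG(mode):
        problems.append("not a regular file")
    if not os.access(path, os.X_OK):
        problems.append("not executable by the current user")
    return problems


if __name__ == "__main__":
    # Path taken from the ExecStart= line of the unit shown above.
    print(diagnose_exec("/home/a2/go/bin/bootstrapswarm") or "no obvious problem")
```

Running it as the user named in User= matters, because os.access checks permissions for whoever invokes the script.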

@dmitshur
Contributor Author

@srinivas-pokala Thanks for working on this. I looked at the logs on our end for error details.

I'm seeing 403s that are failing because the bot ID being reported is "go-s390x01" instead of the expected "linux-s390x-ibm", which results in a "Bot ID doesn't match the token used" error. Can you check if the bootstrapswarm binary you're using is the latest version available in x/build? Looking at its code, if the -hostname linux-s390x-ibm flag is being propagated correctly, it should be sent to the server. As unlikely as it is, maybe you can check that the metadata.OnGCE() path isn't being taken somehow, since that does override hostname?

A 401 that I saw failed with the error that the token was expired 4 hrs earlier; perhaps /var/lib/luci_machine_tokend/token.json stopped being refreshed at the time you tried to run bootstrapswarm?
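
One rough way to spot a token file that has stopped being refreshed is to look at its modification time. The sketch below assumes nothing about the file's contents; `token_age_seconds`, `looks_stale` and the two-hour default threshold are hypothetical choices for illustration. Because luci_machine_tokend may skip rewriting a token that is still valid, the age is only a hint, not proof of expiry.

```python
import os
import time

# Assumed threshold for illustration; the real token lifetime is not stated here.
DEFAULT_MAX_AGE_SECONDS = 2 * 60 * 60


def token_age_seconds(path, now=None):
    """Seconds since `path` was last modified, or None if it does not exist."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    current = time.time() if now is None else now
    return current - mtime


def looks_stale(path, max_age_seconds=DEFAULT_MAX_AGE_SECONDS, now=None):
    """True if the file is missing or older than `max_age_seconds`."""
    age = token_age_seconds(path, now)
    return age is None or age > max_age_seconds


if __name__ == "__main__":
    # Path taken from the -token-file flag in the crontab entry above.
    print(looks_stale("/var/lib/luci_machine_tokend/token.json"))
```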

@srinivas-pokala
Contributor

@dmitshur Thanks for the reply.

> Can you check if the bootstrapswarm binary you're using is the latest version available in x/build?

Yes, I have checked; it is the latest version.

> Looking at its code, if the -hostname linux-s390x-ibm is being propagated correctly, it should be sent to the server. As unlikely as it is, maybe you can check that the metadata.OnGCE() path isn't being taken somehow, since that does override hostname?

I traced metadata.OnGCE(), which invokes testOnGCE(), where the connectivity check to the GCE instance fails. I cross-checked with ping, and surprisingly 100% of the packets are lost.

[a2@go-s390x01 ~]$ ping google.com
PING google.com(lga15s49-in-x0e.1e100.net (2607:f8b0:4006:80d::200e)) 56 data bytes
^C
--- google.com ping statistics ---
11 packets transmitted, 0 received, 100% packet loss, time 10434ms

So I suspect this could be what causes bootstrapswarm to fail, and that the metadata.OnGCE() branch is not being taken.


@srinivas-pokala
Contributor

@dmitshur I investigated further by running the bootstrapswarm binary manually and got the output below (trimmed).

[swarming@go-s390x01 ~]$ /home/swarming/go/bin/bootstrapswarm -hostname linux-s390x-ibm
2025/02/25 10:58:04 Bootstrapping the swarming bot with certificate authentication
2025/02/25 10:58:04 retrieving the luci-machine-token from the token file /var/lib/luci_machine_tokend/token.json (default path for GOOS != windows)
2025/02/25 10:58:04 Downloading the swarming bot
2025/02/25 10:58:06 Starting the swarming bot /home/swarming/.swarming/swarming_bot.zip
Traceback (most recent call last):
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 267, in _call_hook_safe
    return _call_hook(chained, botobj, name, *args)
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 158, in hook
    return func(chained, botobj, name, *args, **kwargs)
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 237, in _call_hook
    return hook(botobj, *args, **kwargs)
  File "/home/swarming/.swarming/swarming_bot.1.zip/config/bot_config.py", line 675, in get_dimensions
    dimensions = os_utilities.get_dimensions()
  File "/home/swarming/.swarming/swarming_bot.1.zip/api/os_utilities.py", line 1025, in get_dimensions
    'cpu': get_cpu_dimensions(),
  File "/home/swarming/.swarming/swarming_bot.1.zip/utils/tools.py", line 211, in wrapper
    v = func(*args)
  File "/home/swarming/.swarming/swarming_bot.1.zip/api/os_utilities.py", line 372, in get_cpu_dimensions
    info = get_cpuinfo()
  File "/home/swarming/.swarming/swarming_bot.1.zip/utils/tools.py", line 211, in wrapper
    v = func(*args)
  File "/home/swarming/.swarming/swarming_bot.1.zip/api/os_utilities.py", line 406, in get_cpuinfo
    info = platforms.linux.get_cpuinfo()
  File "/home/swarming/.swarming/swarming_bot.1.zip/utils/tools.py", line 211, in wrapper
    v = func(*args)
  File "/home/swarming/.swarming/swarming_bot.1.zip/api/platforms/linux.py", line 199, in get_cpuinfo
    cpu_info['flags'] = values['flags']
KeyError: 'flags'
3861687 2025-02-25 15:59:17.466 E: get_dimensions() threw
Traceback (most recent call last):
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 267, in _call_hook_safe
    return _call_hook(chained, botobj, name, *args)
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 158, in hook
    return func(chained, botobj, name, *args, **kwargs)
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 237, in _call_hook
    return hook(botobj, *args, **kwargs)
  File "/home/swarming/.swarming/swarming_bot.1.zip/config/bot_config.py", line 675, in get_dimensions
    dimensions = os_utilities.get_dimensions()
  File "/home/swarming/.swarming/swarming_bot.1.zip/api/os_utilities.py", line 1025, in get_dimensions
    'cpu': get_cpu_dimensions(),
  File "/home/swarming/.swarming/swarming_bot.1.zip/utils/tools.py", line 211, in wrapper
    v = func(*args)
  File "/home/swarming/.swarming/swarming_bot.1.zip/api/os_utilities.py", line 372, in get_cpu_dimensions
    info = get_cpuinfo()
  File "/home/swarming/.swarming/swarming_bot.1.zip/utils/tools.py", line 211, in wrapper
    v = func(*args)
  File "/home/swarming/.swarming/swarming_bot.1.zip/api/os_utilities.py", line 406, in get_cpuinfo
    info = platforms.linux.get_cpuinfo()
  File "/home/swarming/.swarming/swarming_bot.1.zip/utils/tools.py", line 211, in wrapper
    v = func(*args)
  File "/home/swarming/.swarming/swarming_bot.1.zip/api/platforms/linux.py", line 199, in get_cpuinfo
    cpu_info['flags'] = values['flags']
KeyError: 'flags'
3861687 2025-02-25 15:59:17.466 E: post_error(Failed to call hook get_dimensions(): 'flags'
Traceback (most recent call last):
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 267, in _call_hook_safe
    return _call_hook(chained, botobj, name, *args)
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 158, in hook
    return func(chained, botobj, name, *args, **kwargs)
  File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 237, in _call_hook
    return hook(botobj, *args, **kwargs)
  File "/home/swarming/.swarming/swarming_bot.1.zip/config/bot_config.py", line 675, in get_dimensions
    dimensions = os_utilities.get_dimensions()
  File "/home/swarming/.swarming/swarming_bot.1.zip/api/os_utilities.py", line 1025, in get_dimensions
    'cpu': get_cpu_dimensions(),
  File "/home/swarming/.swarming/swarming_bot.1.zip/utils/tools.py", line 211, in wrapper
    v = func(*args)
  File "/home/swarming/.swarming/swarming_bot.1.zip/api/os_utilities.py", line 372, in get_cpu_dimensions
    info = get_cpuinfo()
  File "/home/swarming/.swarming/swarming_bot.1.zip/utils/tools.py", line 211, in wrapper
    v = func(*args)
  File "/home/swarming/.swarming/swarming_bot.1.zip/api/os_utilities.py", line 406, in get_cpuinfo
    info = platforms.linux.get_cpuinfo()
  File "/home/swarming/.swarming/swarming_bot.1.zip/utils/tools.py", line 211, in wrapper
    v = func(*args)
  File "/home/swarming/.swarming/swarming_bot.1.zip/api/platforms/linux.py", line 199, in get_cpuinfo
    cpu_info['flags'] = values['flags']
KeyError: 'flags'
)

From the backtrace we can see it fails in get_cpuinfo. I suspect this function does not handle s390x; do we need changes for s390x similar to those made for other ports?

@dmitshur
Contributor Author

dmitshur commented Feb 25, 2025

Thanks for getting to that trace. Yes, it looks like swarming_bot's logic for auto-detecting information about the CPU in order to populate its dimensions doesn't work as expected on the machine you're running it on.

I see a TODO mentioning s390x here:

https://source.chromium.org/chromium/infra/infra_superproject/+/main:infra/luci/appengine/swarming/swarming_bot/api/os_utilities.py;l=341-342;drc=76b6d28778e8aeaa55f0cd892b500b35ee32b9d2

Though the exception you're seeing is happening as part of get_cpu_dimensions, which is used to populate the "cpu" dimension here, which leads to the problem happening on this line:

https://source.chromium.org/chromium/infra/infra_superproject/+/main:infra/luci/appengine/swarming/swarming_bot/api/platforms/linux.py;l=199;drc=4b5b630058c9a41d6ddcb7cef61630b434eeccb0

Note that section has a comment "# Intel". If it shouldn't apply, perhaps the if condition should be updated to also check for 'flags' being present in values, not only 'vendor_id'.

For the purposes of this builder, the "cipd_platform" dimension needs to be computed as "linux-s390x". Other dimensions aren’t currently used in our configuration and so their exact values are not critical, but of course computing them must not throw an exception.
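
To illustrate the suggestion about checking for 'flags' as well as 'vendor_id', here is a minimal sketch. It is not the swarming bot's code: `parse_cpuinfo` and `cpu_info_from` are hypothetical helpers, and the /proc/cpuinfo excerpts used with them are assumed samples rather than output copied from the builder machine.

```python
def parse_cpuinfo(text):
    """Parse /proc/cpuinfo-style text, keeping the first value seen per key."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        values.setdefault(key.strip(), value.strip())
    return values


def cpu_info_from(values):
    """Build a small CPU description without assuming x86-only keys exist."""
    info = {}
    if "vendor_id" in values:
        info["vendor"] = values["vendor_id"]
    # Guarded access: a bare values['flags'] raises KeyError when the key
    # is absent, which is the failure shown in the traceback above.
    if "vendor_id" in values and "flags" in values:
        info["flags"] = sorted(values["flags"].split())
    return info


if __name__ == "__main__":
    # Assumed sample: a vendor_id line is present but there is no flags line.
    sample = "vendor_id       : IBM/S390\nfeatures        : esan3 zarch stfle\n"
    print(cpu_info_from(parse_cpuinfo(sample)))
```

The point of the guard is only that a missing key degrades to a smaller dictionary instead of an exception; which dimensions the bot should report on s390x is a separate decision.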

If you're okay with navigating the LUCI contribution process and sending a CL directly, that would be an ideal way to make progress on this builder. Since you have access to the machine where swarming bot will run, you can test your changes there, and we'll help with code review. You can refer to a past issue where swarming bot changes were required for reference, for example please see #64660 and its CLs like crrev.com/c/5792941 (CC @prattmic, @pmur).

@srinivas-pokala
Contributor

@dmitshur I followed the contribution steps described at https://chromium.googlesource.com/infra/luci/luci-py/+/HEAD, but I am unable to push the suggested changes. I am hitting an issue with the "git cl upload -s" command:

root@t83lp35:~/dev/luci/luci-py (work)# git cl upload -s
Bundled Python 3.11 not found. Use VPYTHON_BYPASS if prebuilt cpython not available on this platform: open /root/src/depot_tools/.cipd_bin/.cipd/pkgs/0/G_xu-Y97FN_30TkdKMbm3javlcOhCsX1UOrSO31dk3QC/3.11/.versions/cpython3.cipd_version: no such file or directory

It looks like Python 3.11 is missing from depot_tools, and I could not find the path mentioned. Could you help me with this issue?

@dmitshur
Contributor Author

dmitshur commented Apr 8, 2025

There are two pages linked from https://chromium.googlesource.com/infra/infra/+/main/doc/source.md that might be most relevant, if you haven't already looked there:

Are you mailing the CL from a linux/s390x machine, or is this happening on something like linux/amd64? From the error message, if this is on s390x, it might be indeed that cpython3 isn't available for linux-s390x, so you could try setting VPYTHON_BYPASS as suggested and try using system python instead. That env var needs to be set to a specific value as documented here:

VPYTHON_BYPASS='manually managed python not supported by chrome operations'

Another approach you can try, especially if the above doesn't work well, is to create the Gerrit CL using Gerrit's lower-level primitives (documented here):

git push origin HEAD:refs/for/main

Assuming HEAD is pointing to the commit you wish to mail. This has potential downsides in that if there are some extra steps taken by git cl upload, they don't get a chance to run this way, but the upside is that it doesn't require additional dependencies.

@srinivas-pokala
Contributor

srinivas-pokala commented Apr 9, 2025

@dmitshur Thanks for the details.

> Are you mailing the CL from a linux/s390x machine, or is this happening on something like linux/amd64? From the error message, if this is on s390x, it might be indeed that cpython3 isn't available for linux-s390x, so you could try setting VPYTHON_BYPASS as suggested and try using system python instead. That env var needs to be set to a specific value as documented here:

I am running on a linux/s390x machine. I was able to resolve the issue and have uploaded the CL. Could you please review it?

@dmitshur
Contributor Author

@srinivas-pokala I believe the change in crrev.com/c/6439429 is rolled out and you should be able to try starting up the builder again.

@srinivas-pokala
Contributor

@dmitshur Thanks for the help.
I tried to launch the swarming bot, but it printed the errors below and then hung:

[a2@go-s390x01 bin]$ luci_machine_tokend -backend luci-token-server.appspot.com -cert-pem /home/a2/linux-s390x-ibm-1728069494.cert.txt -pkey-pem /home/a2/linux-s390x-ibm.key -token-file=/var/lib/luci_machine_tokend/token.json
[I2025-04-23T03:43:13.587455-04:00 2322 0 iface.go:154] tsmon is disabled because no endpoint is configured
[I2025-04-23T03:43:13.592213-04:00 2322 0 main.go:242] The token is valid, skipping the update
[a2@go-s390x01 bin]$ ./bootstrapswarm -hostname linux-s390x-ibm
2025/04/23 03:59:11 Bootstrapping the swarming bot with certificate authentication
2025/04/23 03:59:11 retrieving the luci-machine-token from the token file /var/lib/luci_machine_tokend/token.json (default path for GOOS != windows)
2025/04/23 03:59:11 Downloading the swarming bot
2025/04/23 03:59:11 Starting the swarming bot /home/a2/.swarming/swarming_bot.zip
Can't open display :0.0
3005 2025-04-23 07:59:18.182 E: is_display_attached(): Command '['xrandr', '--display', ':0.0', '--query']' returned non-zero exit status 1.
Can't open display :0.0
3005 2025-04-23 07:59:18.221 E: get_display_resolution(): Command '['xrandr', '--display', ':0.0', '--query']' returned non-zero exit status 1.
Can't open display :0.0
3005 2025-04-23 07:59:18.239 E: is_display_attached(): Command '['xrandr', '--display', ':0.0', '--query']' returned non-zero exit status 1.
Can't open display :0.0
3005 2025-04-23 07:59:18.246 E: get_display_resolution(): Command '['xrandr', '--display', ':0.0', '--query']' returned non-zero exit status 1.
Can't open display :0.0
3005 2025-04-23 07:59:21.755 E: is_display_attached(): Command '['xrandr', '--display', ':0.0', '--query']' returned non-zero exit status 1.
Can't open display :0.0
3005 2025-04-23 07:59:22.264 E: get_display_resolution(): Command '['xrandr', '--display', ':0.0', '--query']' returned non-zero exit status 1.

These look like failures in functions called from get_dimensions(), similar to the ones we saw above. Is there something I'm doing wrong here?

@srinivas-pokala
Contributor

@dmitshur Can you help me with the above issue?

@dmitshur
Contributor Author

dmitshur commented Apr 28, 2025

From looking at the source for those functions (here), it looks like "Can't open display :0.0" with exit code 1 is the expected outcome when there's no display attached. The code prints that as a log message but otherwise considers there not to be a display attached for the purposes of computing the bot's dimensions. So I think that part is working as intended and not causing problems.
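
The behavior described above, where a non-zero exit from xrandr is treated as "no display" rather than as a fatal error, can be sketched roughly as follows. This is an illustration under stated assumptions, not the swarming bot's implementation: the function only mirrors the command line visible in the logs, and the injectable `run` parameter is a hypothetical addition to make it easy to try without X installed.

```python
import subprocess


def is_display_attached(display=":0.0", run=subprocess.run):
    """Return True only if `xrandr --query` succeeds for `display`.

    A non-zero exit status (for example "Can't open display") or a missing
    xrandr binary is reported as False instead of raising.
    """
    cmd = ["xrandr", "--display", display, "--query"]
    try:
        result = run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        # xrandr is not installed, so there is nothing to query.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print(is_display_attached())
```

On a headless builder this simply reports that no display is attached, which matches the reading that the logged messages are informational.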

Looking at the build history for the bot at https://chromium-swarm.appspot.com/bot?id=linux-s390x-ibm, I see 3 builds from April 22-23. The first one was entirely successful, completing in 34 min: https://ci.chromium.org/b/8716953408078429585. That's great to see!

The following 2 builds failed, and I see a Bot denied running for user "a2" error on the aforementioned bot status page. Note that as mentioned at https://go.dev/wiki/DashboardBuilders, bootstrapswarm needs to be run as the swarming user and preferably without root permissions. Is it possible it was run with the wrong user while you were investigating the issue above?

Since the bot worked the first time, I suggest trying it again while making sure it's executed as the swarming user, and keeping an eye on its status on the https://chromium-swarm.appspot.com/bot?id=linux-s390x-ibm page as well as on the local logs you have (ignoring the "xrandr returned exit code 1" messages since they're not a problem).

Projects
Status: In Progress

4 participants