Skip to content

Conversation

naushir
Copy link
Contributor

@naushir naushir commented Aug 13, 2025

The existing code unconditionally removes jobs from the job_queue list when all the nodes in a node group have stopped streaming. This will also remove jobs for any other node groups as the job_queue is a common list. Fix this by only conditionally deleting jobs that belong to the current node group.

Additionally, delete jobs as soon as the first node in the node group stops streaming. Running a job with an incomplete set of active nodes is invalid.

Fixes: 880153e ("media: pisp_be: Split jobs creation and scheduling")

The existing code unconditionally removes jobs from the job_queue list
when all the nodes in a node group have stopped streaming. This will
also remove jobs for any other node groups as the job_queue is a common
list. Fix this by only conditionally deleting jobs that belong to the
current node group.

Additionally, delete jobs as soon as the first node in the node group
stops streaming. Running a job with an incomplete set of active nodes
is invalid.

Fixes: 880153e ("media: pisp_be: Split jobs creation and scheduling")
Signed-off-by: Naushir Patuck <naush@raspberrypi.com>
@njhollinghurst
Copy link
Contributor

Is it possible to be in a situation where jobs are queued for this node-group, which don't reference the node that was stopped? (It seems unlikely.)

@naushir
Copy link
Contributor Author

naushir commented Aug 13, 2025

Is it possible to be in a situation where jobs are queued for this node-group, which don't reference the node that was stopped? (It seems unlikely.)

I think that would be a malformed job and probably fail validation?

@naushir
Copy link
Contributor Author

naushir commented Aug 13, 2025

Sadly, this does not seem to be enough to fix raspberrypi/rpicam-apps#849.

pispbe_schedule() currently takes a node group as a parameter, which is
left over from before the job prepare/scheduling refactoring. This is
now invalid, as jobs are executed in the order which they were queued.

As part of this old code, there was a check if the current node group
mached the node group of the job, and if unmathched, use a "continue"
statement. This is invalid as there is no loop to iterate over any more.
The reason this was not a compile bug is because of the for loop used
as part of the scoped_guard macro.

A consequence of breaking out of the scoped_guard loop early is that
the job structure gets freed, but not actually removed from the queue
and may be accessed after freeing.

Fix this by removing the node group test in pispbe_schedule() as it is
no longer valid to use.

Signed-off-by: Naushir Patuck <naush@raspberrypi.com>
@naushir
Copy link
Contributor Author

naushir commented Aug 14, 2025

With 1506a5e, I'm now pretty confident that raspberrypi/rpicam-apps#849 ought to be resolved.

@naushir
Copy link
Contributor Author

naushir commented Aug 14, 2025

@njhollinghurst this is ready to be reviewed and merged if you are happy with it.

@njhollinghurst
Copy link
Contributor

LGTM. I assume the checkpatch error is just an artifact of a shallow clone?

@6by9
Copy link
Contributor

6by9 commented Aug 14, 2025

LGTM. I assume the checkpatch error is just an artifact of a shallow clone?

Yes, it's one of the current limitations of the checkpatch job.

Looks like we can squash that one by adding --ignore UNKNOWN_COMMIT_ID to the checkpatch invocation - I'll sort a PR.

@naushir naushir merged commit cd18dc3 into raspberrypi:rpi-6.12.y Aug 14, 2025
11 of 12 checks passed
@jmondi
Copy link
Contributor

jmondi commented Aug 18, 2025

Hi Naush, thanks for fixing this. Indeed I introduced this when moving the patches from the non-multi-group mainline version to the multi-group BSP version of the driver. Sorry about that, I recall I tested with 2 cameras, but I don't recall seeing this (however, it should happen all the times one camera is stopped while the other is left running ?)

1506a5e is certainly good, I don't think there is anything to upport to mainline right ?

I'm a bit less sure about 3e4f99e. continue is certainly wrong, but I wonder if we should return ? node_group is still a thing for the BSP driver, but I guess the check in pispbe_schedule() is redundant as if we have any job at the front of the queue, we shall run it regardless of the group that has just created a new job ?

Thanks for fixing this, and sorry if you caught me on the week I was on leave.

@naushir
Copy link
Contributor Author

naushir commented Aug 18, 2025

Hi Naush, thanks for fixing this. Indeed I introduced this when moving the patches from the non-multi-group mainline version to the multi-group BSP version of the driver. Sorry about that, I recall I tested with 2 cameras, but I don't recall seeing this (however, it should happen all the times one camera is stopped while the other is left running ?)

Not to worry! This bug did take some time to trigger - and in very specific timing conditions. So most of out testing will not have caught it.

1506a5e is certainly good, I don't think there is anything to upport to mainline right ?

Correct, I checked and there is no need to mainline this change.

I'm a bit less sure about 3e4f99e. continue is certainly wrong, but I wonder if we should return ? node_group is still a thing for the BSP driver, but I guess the check in pispbe_schedule() is redundant as if we have any job at the front of the queue, we shall run it regardless of the group that has just created a new job ?

My feeling is that it's best if the test is removed entirely. Since we now rely on a single job queue across all users, we should unconditionally run the first job in the queue to maintain strict FIFO ordering. There is no reason to jump to the "current" user's job in such cases. Again, there is no need to upstream this as the logic is different, i.e. only handles a single user.

Thanks for fixing this, and sorry if you caught me on the week I was on leave.

No problem at all - hope you had a good rest 😄

popcornmix added a commit to raspberrypi/firmware that referenced this pull request Aug 20, 2025
kernel: INA260 driver support 70fb84a109c639637f0636281dbdb21ed8ffb000 upstream
See: raspberrypi/linux#7004

kernel: drivers: media: pisp_be: Fix for job queue removal in stop_streaming
See: raspberrypi/linux#6999

kernel: Improve PIO DMA performance
See: raspberrypi/linux#6994
popcornmix added a commit to raspberrypi/rpi-firmware that referenced this pull request Aug 20, 2025
kernel: INA260 driver support 70fb84a109c639637f0636281dbdb21ed8ffb000 upstream
See: raspberrypi/linux#7004

kernel: drivers: media: pisp_be: Fix for job queue removal in stop_streaming
See: raspberrypi/linux#6999

kernel: Improve PIO DMA performance
See: raspberrypi/linux#6994
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants