Skip to content

Self-hosting: bulk deleting TaskRun rows is slow #2105

@rharkor

Description

@rharkor

Provide environment information

System:
OS: Linux 6.8 Ubuntu 24.04.2 LTS 24.04.2 LTS (Noble Numbat)
CPU: (4) x64 AMD EPYC-Milan Processor
Memory: 10.10 GB / 15.24 GB
Container: Yes
Shell: 5.2.21 - /bin/bash
Binaries:
Node: 18.19.1 - /usr/bin/node
npm: 9.2.0 - /usr/bin/npm

Describe the bug

I self hosted trigger for a long time. I have t many runs stored in my db so I recently decided to clean the older ones. But my script seems to took a very long time to only delete 1500 runs. When I tried to describe the query plan to delete one row it appears to me that two relations where missing some key indexes.

Find slow foreign keys

postgres=# begin;
explain (analyze,buffers,timing)
delete from "TaskRun" where id='cmawag7nafpp8o0281yebzmdz';
rollback;
BEGIN
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Delete on "TaskRun"  (cost=0.42..8.44 rows=0 width=0) (actual time=3.989..3.990 rows=0 loops=1)
   Buffers: shared hit=7 read=5 dirtied=1
   ->  Index Scan using "TaskRun_pkey" on "TaskRun"  (cost=0.42..8.44 rows=1 width=6) (actual time=1.936..1.938 rows=1 loops=1)
         Index Cond: (id = 'cmawag7nafpp8o0281yebzmdz'::text)
         Buffers: shared hit=2 read=2
 Planning Time: 0.136 ms
 Trigger for constraint TaskRunAttempt_taskRunId_fkey on TaskRun: time=2.418 calls=1
 Trigger for constraint TaskRunDependency_taskRunId_fkey on TaskRun: time=2.381 calls=1
 Trigger for constraint BatchTaskRunItem_taskRunId_fkey on TaskRun: time=2.630 calls=1
 Trigger for constraint Checkpoint_runId_fkey on TaskRun: time=0.861 calls=1
 Trigger for constraint CheckpointRestoreEvent_runId_fkey on TaskRun: time=0.742 calls=1
 Trigger for constraint BulkActionItem_sourceRunId_fkey on TaskRun: time=0.032 calls=1
 Trigger for constraint BulkActionItem_destinationRunId_fkey on TaskRun: time=0.015 calls=1
 Trigger for constraint _TaskRunToTaskRunTag_A_fkey on TaskRun: time=2.354 calls=1
 Trigger for constraint ProjectAlert_taskRunId_fkey on TaskRun: time=0.015 calls=1
 Trigger for constraint TaskRun_rootTaskRunId_fkey on TaskRun: time=1.790 calls=1
 Trigger for constraint TaskRun_parentTaskRunId_fkey on TaskRun: time=1.418 calls=1
 Trigger for constraint TaskRunDependency_dependentAttemptId_fkey on TaskRunAttempt: time=0.022 calls=1
 Trigger for constraint BatchTaskRun_dependentTaskAttemptId_fkey on TaskRunAttempt: time=0.638 calls=1
 Trigger for constraint Checkpoint_attemptId_fkey on TaskRunAttempt: time=0.814 calls=1
 Trigger for constraint BatchTaskRunItem_taskRunAttemptId_fkey on TaskRunAttempt: time=1.489 calls=1
 Trigger for constraint CheckpointRestoreEvent_attemptId_fkey on TaskRunAttempt: time=40.039 calls=1
 Trigger for constraint ProjectAlert_taskRunAttemptId_fkey on TaskRunAttempt: time=0.033 calls=1
 Trigger for constraint TaskRun_parentTaskRunAttemptId_fkey on TaskRunAttempt: time=336.831 calls=1
 Execution Time: 398.553 ms
(25 rows)

ROLLBACK

Fixing the indexes

postgres=# -- does not block writers while it builds
CREATE INDEX CONCURRENTLY IF NOT EXISTS
       "TaskRun_parentTaskRunAttemptId_idx"
    ON "TaskRun" ("parentTaskRunAttemptId");

CREATE INDEX CONCURRENTLY IF NOT EXISTS
       "CheckpointRestoreEvent_attemptId_idx"
    ON "CheckpointRestoreEvent" ("attemptId");
CREATE INDEX
CREATE INDEX
postgres=# begin;                                   
explain (analyze,buffers,timing)
delete from "TaskRun" where id='cmawag7nafpp8o0281yebzmdz';
rollback;
BEGIN
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Delete on "TaskRun"  (cost=0.42..8.44 rows=0 width=0) (actual time=0.132..0.133 rows=0 loops=1)
   Buffers: shared hit=11 dirtied=1
   ->  Index Scan using "TaskRun_pkey" on "TaskRun"  (cost=0.42..8.44 rows=1 width=6) (actual time=0.046..0.047 rows=1 loops=1)
         Index Cond: (id = 'cmawag7nafpp8o0281yebzmdz'::text)
         Buffers: shared hit=4
 Planning:
   Buffers: shared hit=17 read=1
 Planning Time: 0.301 ms
 Trigger for constraint TaskRunAttempt_taskRunId_fkey on TaskRun: time=0.134 calls=1
 Trigger for constraint TaskRunDependency_taskRunId_fkey on TaskRun: time=0.092 calls=1
 Trigger for constraint BatchTaskRunItem_taskRunId_fkey on TaskRun: time=0.087 calls=1
 Trigger for constraint Checkpoint_runId_fkey on TaskRun: time=0.138 calls=1
 Trigger for constraint CheckpointRestoreEvent_runId_fkey on TaskRun: time=0.379 calls=1
 Trigger for constraint BulkActionItem_sourceRunId_fkey on TaskRun: time=0.109 calls=1
 Trigger for constraint BulkActionItem_destinationRunId_fkey on TaskRun: time=0.061 calls=1
 Trigger for constraint _TaskRunToTaskRunTag_A_fkey on TaskRun: time=0.079 calls=1
 Trigger for constraint ProjectAlert_taskRunId_fkey on TaskRun: time=0.050 calls=1
 Trigger for constraint TaskRun_rootTaskRunId_fkey on TaskRun: time=0.189 calls=1
 Trigger for constraint TaskRun_parentTaskRunId_fkey on TaskRun: time=0.245 calls=1
 Trigger for constraint TaskRunDependency_dependentAttemptId_fkey on TaskRunAttempt: time=0.067 calls=1
 Trigger for constraint BatchTaskRun_dependentTaskAttemptId_fkey on TaskRunAttempt: time=0.081 calls=1
 Trigger for constraint Checkpoint_attemptId_fkey on TaskRunAttempt: time=0.059 calls=1
 Trigger for constraint BatchTaskRunItem_taskRunAttemptId_fkey on TaskRunAttempt: time=0.063 calls=1
 Trigger for constraint CheckpointRestoreEvent_attemptId_fkey on TaskRunAttempt: time=0.135 calls=1
 Trigger for constraint ProjectAlert_taskRunAttemptId_fkey on TaskRunAttempt: time=0.043 calls=1
 Trigger for constraint TaskRun_parentTaskRunAttemptId_fkey on TaskRunAttempt: time=0.154 calls=1
 Execution Time: 2.353 ms
(27 rows)

ROLLBACK

It could be usefull to add those by default.

If you are okay with that I could maybe create a pr for this.

Thanks

Reproduction repo

https://trigger.dev/docs/open-source-self-hosting#self-hosting

To reproduce

Just selfhost trigger and try having more than >500k task run in your db.

Additional information

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions