Skip to content

Commit 688e6c7

Browse files
committed
drm/i915: Slaughter the thundering i915_wait_request herd
One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
1 parent 1f15b76 commit 688e6c7

File tree

10 files changed

+648
-110
lines changed

10 files changed

+648
-110
lines changed

drivers/gpu/drm/i915/Makefile

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -38,6 +38,7 @@ i915-y += i915_cmd_parser.o \
3838
i915_gem_userptr.o \
3939
i915_gpu_error.o \
4040
i915_trace_points.o \
41+
intel_breadcrumbs.o \
4142
intel_lrc.o \
4243
intel_mocs.o \
4344
intel_ringbuffer.o \

drivers/gpu/drm/i915/i915_debugfs.c

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -788,10 +788,22 @@ static int i915_gem_request_info(struct seq_file *m, void *data)
788788
static void i915_ring_seqno_info(struct seq_file *m,
789789
struct intel_engine_cs *engine)
790790
{
791+
struct intel_breadcrumbs *b = &engine->breadcrumbs;
792+
struct rb_node *rb;
793+
791794
seq_printf(m, "Current sequence (%s): %x\n",
792795
engine->name, engine->get_seqno(engine));
793796
seq_printf(m, "Current user interrupts (%s): %x\n",
794797
engine->name, READ_ONCE(engine->user_interrupts));
798+
799+
spin_lock(&b->lock);
800+
for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
801+
struct intel_wait *w = container_of(rb, typeof(*w), node);
802+
803+
seq_printf(m, "Waiting (%s): %s [%d] on %x\n",
804+
engine->name, w->tsk->comm, w->tsk->pid, w->seqno);
805+
}
806+
spin_unlock(&b->lock);
795807
}
796808

797809
static int i915_gem_seqno_info(struct seq_file *m, void *data)
@@ -1428,6 +1440,8 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
14281440
engine->hangcheck.seqno,
14291441
seqno[id],
14301442
engine->last_submitted_seqno);
1443+
seq_printf(m, "\twaiters? %d\n",
1444+
intel_engine_has_waiter(engine));
14311445
seq_printf(m, "\tuser interrupts = %x [current %x]\n",
14321446
engine->hangcheck.user_interrupts,
14331447
READ_ONCE(engine->user_interrupts));
@@ -2415,7 +2429,7 @@ static int count_irq_waiters(struct drm_i915_private *i915)
24152429
int count = 0;
24162430

24172431
for_each_engine(engine, i915)
2418-
count += engine->irq_refcount;
2432+
count += intel_engine_has_waiter(engine);
24192433

24202434
return count;
24212435
}

drivers/gpu/drm/i915/i915_drv.h

Lines changed: 38 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -506,6 +506,7 @@ struct drm_i915_error_state {
506506
bool valid;
507507
/* Software tracked state */
508508
bool waiting;
509+
int num_waiters;
509510
int hangcheck_score;
510511
enum intel_ring_hangcheck_action hangcheck_action;
511512
int num_requests;
@@ -551,6 +552,12 @@ struct drm_i915_error_state {
551552
u32 tail;
552553
} *requests;
553554

555+
struct drm_i915_error_waiter {
556+
char comm[TASK_COMM_LEN];
557+
pid_t pid;
558+
u32 seqno;
559+
} *waiters;
560+
554561
struct {
555562
u32 gfx_mode;
556563
union {
@@ -1429,7 +1436,7 @@ struct i915_gpu_error {
14291436
#define I915_STOP_RING_ALLOW_WARN (1 << 30)
14301437

14311438
/* For missed irq/seqno simulation. */
1432-
unsigned int test_irq_rings;
1439+
unsigned long test_irq_rings;
14331440
};
14341441

14351442
enum modeset_restore {
@@ -3064,7 +3071,6 @@ ibx_disable_display_interrupt(struct drm_i915_private *dev_priv, uint32_t bits)
30643071
ibx_display_interrupt_update(dev_priv, bits, 0);
30653072
}
30663073

3067-
30683074
/* i915_gem.c */
30693075
int i915_gem_create_ioctl(struct drm_device *dev, void *data,
30703076
struct drm_file *file_priv);
@@ -3975,4 +3981,34 @@ static inline void i915_trace_irq_get(struct intel_engine_cs *engine,
39753981
i915_gem_request_assign(&engine->trace_irq_req, req);
39763982
}
39773983

3984+
static inline bool __i915_request_irq_complete(struct drm_i915_gem_request *req)
3985+
{
3986+
/* Ensure our read of the seqno is coherent so that we
3987+
* do not "miss an interrupt" (i.e. if this is the last
3988+
* request and the seqno write from the GPU is not visible
3989+
* by the time the interrupt fires, we will see that the
3990+
* request is incomplete and go back to sleep awaiting
3991+
* another interrupt that will never come.)
3992+
*
3993+
* Strictly, we only need to do this once after an interrupt,
3994+
* but it is easier and safer to do it every time the waiter
3995+
* is woken.
3996+
*/
3997+
if (i915_gem_request_completed(req, false))
3998+
return true;
3999+
4000+
/* We need to check whether any gpu reset happened in between
4001+
* the request being submitted and now. If a reset has occurred,
4002+
* the seqno will have been advance past ours and our request
4003+
* is complete. If we are in the process of handling a reset,
4004+
* the request is effectively complete as the rendering will
4005+
* be discarded, but we need to return in order to drop the
4006+
* struct_mutex.
4007+
*/
4008+
if (i915_reset_in_progress(&req->i915->gpu_error))
4009+
return true;
4010+
4011+
return false;
4012+
}
4013+
39784014
#endif

drivers/gpu/drm/i915/i915_gem.c

Lines changed: 53 additions & 90 deletions
Original file line numberDiff line numberDiff line change
@@ -1343,17 +1343,6 @@ i915_gem_check_wedge(unsigned reset_counter, bool interruptible)
13431343
return 0;
13441344
}
13451345

1346-
static void fake_irq(unsigned long data)
1347-
{
1348-
wake_up_process((struct task_struct *)data);
1349-
}
1350-
1351-
static bool missed_irq(struct drm_i915_private *dev_priv,
1352-
struct intel_engine_cs *engine)
1353-
{
1354-
return test_bit(engine->id, &dev_priv->gpu_error.missed_irq_rings);
1355-
}
1356-
13571346
static unsigned long local_clock_us(unsigned *cpu)
13581347
{
13591348
unsigned long t;
@@ -1386,7 +1375,7 @@ static bool busywait_stop(unsigned long timeout, unsigned cpu)
13861375
return this_cpu != cpu;
13871376
}
13881377

1389-
static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
1378+
static bool __i915_spin_request(struct drm_i915_gem_request *req, int state)
13901379
{
13911380
unsigned long timeout;
13921381
unsigned cpu;
@@ -1401,17 +1390,14 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
14011390
* takes to sleep on a request, on the order of a microsecond.
14021391
*/
14031392

1404-
if (req->engine->irq_refcount)
1405-
return -EBUSY;
1406-
14071393
/* Only spin if we know the GPU is processing this request */
14081394
if (!i915_gem_request_started(req, true))
1409-
return -EAGAIN;
1395+
return false;
14101396

14111397
timeout = local_clock_us(&cpu) + 5;
1412-
while (!need_resched()) {
1398+
do {
14131399
if (i915_gem_request_completed(req, true))
1414-
return 0;
1400+
return true;
14151401

14161402
if (signal_pending_state(state, current))
14171403
break;
@@ -1420,12 +1406,9 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
14201406
break;
14211407

14221408
cpu_relax_lowlatency();
1423-
}
1424-
1425-
if (i915_gem_request_completed(req, false))
1426-
return 0;
1409+
} while (!need_resched());
14271410

1428-
return -EAGAIN;
1411+
return false;
14291412
}
14301413

14311414
/**
@@ -1450,123 +1433,97 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
14501433
s64 *timeout,
14511434
struct intel_rps_client *rps)
14521435
{
1453-
struct intel_engine_cs *engine = i915_gem_request_get_engine(req);
1454-
struct drm_i915_private *dev_priv = req->i915;
1455-
const bool irq_test_in_progress =
1456-
ACCESS_ONCE(dev_priv->gpu_error.test_irq_rings) & intel_engine_flag(engine);
14571436
int state = interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE;
14581437
DEFINE_WAIT(reset);
1459-
DEFINE_WAIT(wait);
1460-
unsigned long timeout_expire;
1438+
struct intel_wait wait;
1439+
unsigned long timeout_remain;
14611440
s64 before = 0; /* Only to silence a compiler warning. */
1462-
int ret;
1441+
int ret = 0;
14631442

1464-
WARN(!intel_irqs_enabled(dev_priv), "IRQs disabled");
1443+
might_sleep();
14651444

14661445
if (list_empty(&req->list))
14671446
return 0;
14681447

14691448
if (i915_gem_request_completed(req, true))
14701449
return 0;
14711450

1472-
timeout_expire = 0;
1451+
timeout_remain = MAX_SCHEDULE_TIMEOUT;
14731452
if (timeout) {
14741453
if (WARN_ON(*timeout < 0))
14751454
return -EINVAL;
14761455

14771456
if (*timeout == 0)
14781457
return -ETIME;
14791458

1480-
timeout_expire = jiffies + nsecs_to_jiffies_timeout(*timeout);
1459+
timeout_remain = nsecs_to_jiffies_timeout(*timeout);
14811460

14821461
/*
14831462
* Record current time in case interrupted by signal, or wedged.
14841463
*/
14851464
before = ktime_get_raw_ns();
14861465
}
14871466

1488-
if (INTEL_INFO(dev_priv)->gen >= 6)
1489-
gen6_rps_boost(dev_priv, rps, req->emitted_jiffies);
1490-
14911467
trace_i915_gem_request_wait_begin(req);
14921468

1493-
/* Optimistic spin for the next jiffie before touching IRQs */
1494-
ret = __i915_spin_request(req, state);
1495-
if (ret == 0)
1496-
goto out;
1469+
if (INTEL_INFO(req->i915)->gen >= 6)
1470+
gen6_rps_boost(req->i915, rps, req->emitted_jiffies);
14971471

1498-
if (!irq_test_in_progress && WARN_ON(!engine->irq_get(engine))) {
1499-
ret = -ENODEV;
1500-
goto out;
1501-
}
1472+
/* Optimistic spin for the next ~jiffie before touching IRQs */
1473+
if (__i915_spin_request(req, state))
1474+
goto complete;
15021475

1503-
add_wait_queue(&dev_priv->gpu_error.wait_queue, &reset);
1504-
for (;;) {
1505-
struct timer_list timer;
1476+
set_current_state(state);
1477+
add_wait_queue(&req->i915->gpu_error.wait_queue, &reset);
15061478

1507-
prepare_to_wait(&engine->irq_queue, &wait, state);
1508-
1509-
/* We need to check whether any gpu reset happened in between
1510-
* the request being submitted and now. If a reset has occurred,
1511-
* the seqno will have been advance past ours and our request
1512-
* is complete. If we are in the process of handling a reset,
1513-
* the request is effectively complete as the rendering will
1514-
* be discarded, but we need to return in order to drop the
1515-
* struct_mutex.
1479+
intel_wait_init(&wait, req->seqno);
1480+
if (intel_engine_add_wait(req->engine, &wait))
1481+
/* In order to check that we haven't missed the interrupt
1482+
* as we enabled it, we need to kick ourselves to do a
1483+
* coherent check on the seqno before we sleep.
15161484
*/
1517-
if (i915_reset_in_progress(&dev_priv->gpu_error)) {
1518-
ret = 0;
1519-
break;
1520-
}
1521-
1522-
if (i915_gem_request_completed(req, false)) {
1523-
ret = 0;
1524-
break;
1525-
}
1485+
goto wakeup;
15261486

1487+
for (;;) {
15271488
if (signal_pending_state(state, current)) {
15281489
ret = -ERESTARTSYS;
15291490
break;
15301491
}
15311492

1532-
if (timeout && time_after_eq(jiffies, timeout_expire)) {
1533-
ret = -ETIME;
1534-
break;
1535-
}
1536-
15371493
/* Ensure that even if the GPU hangs, we get woken up.
15381494
*
15391495
* However, note that if no one is waiting, we never notice
15401496
* a gpu hang. Eventually, we will have to wait for a resource
15411497
* held by the GPU and so trigger a hangcheck. In the most
15421498
* pathological case, this will be upon memory starvation!
15431499
*/
1544-
i915_queue_hangcheck(dev_priv);
1545-
1546-
timer.function = NULL;
1547-
if (timeout || missed_irq(dev_priv, engine)) {
1548-
unsigned long expire;
1500+
i915_queue_hangcheck(req->i915);
15491501

1550-
setup_timer_on_stack(&timer, fake_irq, (unsigned long)current);
1551-
expire = missed_irq(dev_priv, engine) ? jiffies + 1 : timeout_expire;
1552-
mod_timer(&timer, expire);
1502+
timeout_remain = io_schedule_timeout(timeout_remain);
1503+
if (timeout_remain == 0) {
1504+
ret = -ETIME;
1505+
break;
15531506
}
15541507

1555-
io_schedule();
1556-
1557-
if (timer.function) {
1558-
del_singleshot_timer_sync(&timer);
1559-
destroy_timer_on_stack(&timer);
1560-
}
1561-
}
1562-
remove_wait_queue(&dev_priv->gpu_error.wait_queue, &reset);
1508+
if (intel_wait_complete(&wait))
1509+
break;
15631510

1564-
if (!irq_test_in_progress)
1565-
engine->irq_put(engine);
1511+
set_current_state(state);
15661512

1567-
finish_wait(&engine->irq_queue, &wait);
1513+
wakeup:
1514+
/* Carefully check if the request is complete, giving time
1515+
* for the seqno to be visible following the interrupt.
1516+
* We also have to check in case we are kicked by the GPU
1517+
* reset in order to drop the struct_mutex.
1518+
*/
1519+
if (__i915_request_irq_complete(req))
1520+
break;
1521+
}
1522+
remove_wait_queue(&req->i915->gpu_error.wait_queue, &reset);
15681523

1569-
out:
1524+
intel_engine_remove_wait(req->engine, &wait);
1525+
__set_current_state(TASK_RUNNING);
1526+
complete:
15701527
trace_i915_gem_request_wait_end(req);
15711528

15721529
if (timeout) {
@@ -2796,6 +2753,12 @@ i915_gem_init_seqno(struct drm_i915_private *dev_priv, u32 seqno)
27962753
}
27972754
i915_gem_retire_requests(dev_priv);
27982755

2756+
/* If the seqno wraps around, we need to clear the breadcrumb rbtree */
2757+
if (!i915_seqno_passed(seqno, dev_priv->next_seqno)) {
2758+
while (intel_kick_waiters(dev_priv))
2759+
yield();
2760+
}
2761+
27992762
/* Finally reset hw state */
28002763
for_each_engine(engine, dev_priv)
28012764
intel_ring_init_seqno(engine, seqno);

0 commit comments

Comments
 (0)