About TPUs in GKE


This page describes how Cloud TPU works with Google Kubernetes Engine (GKE), including terminology, the benefits of tensor processing units (TPUs), and workload scheduling considerations. TPUs are Google's custom-developed application-specific integrated circuits (ASICs) for accelerating ML workloads that use frameworks such as TensorFlow, PyTorch, and JAX.

This page is for Platform admins and operators and Data and AI specialists who run machine learning (ML) models that are large-scale, long-running, or dominated by matrix computations. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.

Before reading this page, ensure that you're familiar with how ML accelerators work. For details, see Introduction to Cloud TPU.

Benefits of using TPUs in GKE

GKE provides full support for TPU node and node pool lifecycle management, including creating, configuring, and deleting TPU VMs. GKE also supports Spot VMs and using reserved Cloud TPU. The benefits of using TPUs in GKE include:

  • Consistent operational environment: You can use a single platform for all machine learning and other workloads.
  • Automatic upgrades: GKE automates version updates, which reduces operational overhead.
  • Load balancing: GKE distributes the load, thus reducing latency and improving reliability.
  • Responsive scaling: GKE automatically scales TPU resources to meet the needs of your workloads.
  • Resource management: With Kueue, a Kubernetes-native job queuing system, you can manage resources across multiple tenants within your organization using queuing, preemption, prioritization, and fair sharing.

Benefits of using TPU Trillium

Trillium is Google's sixth-generation TPU. Trillium has the following benefits:

  • Trillium increases compute performance per chip compared to TPU v5e.
  • Trillium increases the High Bandwidth Memory (HBM) capacity and bandwidth, and also increases the Interchip Interconnect (ICI) bandwidth over TPU v5e.
  • Trillium is equipped with third-generation SparseCore, a specialized accelerator for processing ultra-large embeddings common in advanced ranking and recommendation workloads.
  • Trillium is over 67% more energy-efficient than TPU v5e.
  • Trillium can scale up to 256 TPUs in a single high-bandwidth, low-latency TPU slice.
  • Trillium supports collection scheduling. Collection scheduling lets you declare a group of TPUs (single-host and multi-host TPU slice node pools) to ensure high availability for the demands of your inference workloads.

On all technical surfaces like APIs and logs, and in specific parts of the GKE documentation, we use v6e or TPU Trillium (v6e) to refer to Trillium TPUs. To learn more about the benefits of Trillium, read the Trillium announcement blog post. To start your TPU setup, see Plan TPUs in GKE.

Terminology related to TPU in GKE

This page uses the following terminology related to TPUs:

  • TPU type: the Cloud TPU type, like v5e.
  • TPU slice node: a Kubernetes node represented by a single VM that has one or more interconnected TPU chips.
  • TPU slice node pool: a group of Kubernetes nodes within a cluster that all have the same TPU configuration.
  • TPU topology: the number and physical arrangement of the TPU chips in a TPU slice.
  • Atomic: GKE treats all the interconnected nodes in a multi-host TPU slice node pool as a single unit. During scaling operations, GKE scales the entire set of nodes to 0 and creates new nodes. If a machine in the group fails or terminates, GKE recreates the entire set of nodes as a new unit.
  • Immutable: You can't manually add new nodes to the set of interconnected nodes. However, you can create a new node pool that has the TPU topology that you want and schedule workloads on the new node pool.

Type of TPU slice node pool

GKE supports two types of TPU node pools: multi-host TPU slice node pools and single-host TPU slice node pools.

The TPU type and topology determine whether your TPU slice node can be multi-host or single-host. We recommend:

  • For large-scale models, use multi-host TPU slice nodes.
  • For small-scale models, use single-host TPU slice nodes.

Multi-host TPU slice node pool

A multi-host TPU slice node pool is a node pool that contains two or more interconnected TPU VMs. Each VM has a TPU device connected to it. The TPUs in a multi-host TPU slice are connected over a high-speed interconnect (ICI). After a multi-host TPU slice node pool is created, you can't add nodes to it. For example, you can't create a v4-32 node pool and then later add another Kubernetes node (TPU VM) to the node pool. To add another TPU slice to a GKE cluster, you must create a new node pool.

The VMs in a multi-host TPU slice node pool are treated as a single atomic unit. If GKE is unable to deploy one node in the slice, no nodes in the TPU slice node pool are deployed.

If a node within a multi-host TPU slice requires repair, GKE shuts down all VMs in the TPU slice, which forces eviction of all the Kubernetes Pods in the workload. After all VMs in the TPU slice are up and running again, the Kubernetes Pods can be scheduled on the VMs in the new TPU slice.

The following diagram shows a v5litepod-16 (v5e) multi-host TPU slice. This TPU slice has four VMs. Each VM in the TPU slice has four TPU v5e chips connected with high-speed interconnects (ICI), and each TPU v5e chip has one TensorCore.

Multi-host TPU slice diagram

The following diagram shows a GKE cluster that contains one TPU v5litepod-16 (v5e) TPU slice (topology: 4x4) and one TPU v5litepod-8 (v5e) slice (topology: 2x4):

TPU v5e Pod diagram
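
The node counts in these examples follow from simple arithmetic: the product of the topology dimensions gives the number of chips, and dividing by the chips per VM gives the number of TPU slice nodes. The following Python sketch illustrates that calculation. The function names are hypothetical and are not part of any GKE API, and the sketch assumes that every VM in the slice holds the same number of chips.

```python
from math import prod


def chips_in_topology(topology: str) -> int:
    """Multiply the dimensions of a topology string such as '4x4'."""
    return prod(int(dim) for dim in topology.split("x"))


def nodes_in_slice(topology: str, chips_per_node: int) -> int:
    """Return how many TPU slice nodes (VMs) hold the chips in the slice."""
    total = chips_in_topology(topology)
    if total % chips_per_node != 0:
        raise ValueError(
            f"{topology} cannot be split into nodes of {chips_per_node} chips"
        )
    return total // chips_per_node


# v5litepod-16 (4x4) with four chips per VM, as in the first diagram.
print(nodes_in_slice("4x4", 4))
# v5litepod-8 (2x4), assuming four chips per VM.
print(nodes_in_slice("2x4", 4))
```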

Single-host TPU slice node pools

A single-host slice node pool is a node pool that contains one or more independent TPU VMs. Each VM has a TPU device connected to it. While the VMs within a single-host slice node pool can communicate over the data center network (DCN), the TPUs attached to the VMs are not interconnected.

The following diagram shows an example of a single-host TPU slice that contains seven v4-8 machines:

Single-host slice node pool diagram

Characteristics of TPUs in GKE

TPUs have unique characteristics that require special planning and configuration.

Topology

The topology defines the physical arrangement of TPUs within a TPU slice. GKE provisions a TPU slice in two- or three-dimensional topologies, depending on the TPU version. You specify a topology as the number of TPU chips in each dimension, as follows:

For TPU v4 and v5p scheduled in multi-host TPU slice node pools, you define the topology in 3-tuples ({A}x{B}x{C}), for example 4x4x4. The product of {A}x{B}x{C} defines the number of TPU chips in the node pool. For example, you can define small topologies that have fewer than 64 TPU chips with topology forms such as 2x2x2, 2x2x4, or 2x4x4. If you use larger topologies that have more than 64 TPU chips, the values that you assign to {A}, {B}, and {C} must meet the following conditions:

  • {A}, {B}, and {C} must be multiples of four.
  • The largest topology supported for v4 is 12x16x16, and the largest for v5p is 16x16x24.
  • The assigned values must keep the A ≤ B ≤ C pattern. For example, 4x4x8 or 8x8x8.
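
The conditions above can be expressed as a small check. The following Python sketch is illustrative only: the function names are hypothetical, it is not how GKE validates topologies, and it assumes that "largest topology" means an upper bound on each dimension. Because the text states the conditions only for topologies with more than 64 chips, the sketch reports no violations at or below that size.

```python
from math import prod

# Largest topologies quoted in the text, as (A, B, C) tuples.
MAX_TOPOLOGY = {"v4": (12, 16, 16), "v5p": (16, 16, 24)}


def parse_topology(topology: str) -> tuple:
    """Split a string such as '4x4x8' into integer dimensions."""
    return tuple(int(part) for part in topology.split("x"))


def chip_count(topology: str) -> int:
    """Number of TPU chips: the product of the topology dimensions."""
    return prod(parse_topology(topology))


def large_topology_violations(topology: str, version: str) -> list:
    """List which of the stated conditions a 3-tuple topology breaks."""
    dims = parse_topology(topology)
    if len(dims) != 3:
        return ["expected a 3-tuple such as 4x4x4"]
    if chip_count(topology) <= 64:
        return []  # the conditions are stated only for more than 64 chips
    problems = []
    if any(dim % 4 != 0 for dim in dims):
        problems.append("dimensions must be multiples of four")
    if list(dims) != sorted(dims):
        problems.append("dimensions must satisfy A <= B <= C")
    # Assumption: the maximum is applied per dimension.
    if any(dim > limit for dim, limit in zip(dims, MAX_TOPOLOGY[version])):
        problems.append(f"exceeds the largest {version} topology")
    return problems
```

For example, `large_topology_violations("8x4x4", "v4")` reports only the ordering condition, because the dimensions are multiples of four and within the v4 maximum but are not in A ≤ B ≤ C order.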

Machine type

Machine types that support TPU resources follow a naming convention that includes the TPU version and the number of TPU chips per TPU slice node, such as ct<version>-hightpu-<node-chip-count>t. For example, the machine type ct5lp-hightpu-1t supports TPU v5e and contains just one TPU chip.
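
As a rough illustration of the naming convention, the following Python sketch splits a machine type name into its version token and chip count. The function name and the regular expression are hypothetical; the pattern assumes that the version token consists of lowercase letters and digits, which holds for the examples on this page but is not guaranteed for every machine type.

```python
import re

# Pattern for ct<version>-hightpu-<node-chip-count>t.
# Assumption: the version token is lowercase letters and digits only.
_MACHINE_TYPE = re.compile(
    r"^ct(?P<version>[0-9a-z]+)-hightpu-(?P<chips>[0-9]+)t$"
)


def parse_tpu_machine_type(machine_type: str) -> tuple:
    """Return (version token, chips per node) for a TPU machine type name."""
    match = _MACHINE_TYPE.match(machine_type)
    if match is None:
        raise ValueError(f"not a TPU machine type name: {machine_type!r}")
    return match.group("version"), int(match.group("chips"))


print(parse_tpu_machine_type("ct5lp-hightpu-1t"))
```

The version token is returned as written in the name (for example, 5lp); mapping it to a TPU version such as v5e is left out because the page gives only one such mapping.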

Privileged mode

If you use GKE versions earlier than 1.28, you must configure your containers with special capabilities to access TPUs. In Standard mode clusters, you can use privileged mode to grant this access. Privileged mode overrides many of the other security settings in the securityContext. For details, see Run containers without privileged mode.

Versions 1.28 and later don't require privileged mode or special capabilities.
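
If you generate manifests for clusters on different versions, the version rule above can be encoded as a simple comparison. The following Python sketch is a minimal illustration; the function name is hypothetical, and it assumes that the version string starts with the major and minor numbers separated by a period.

```python
def needs_privileged_mode(gke_version: str) -> bool:
    """Return True for GKE versions earlier than 1.28."""
    # Assumption: the string begins with "<major>.<minor>".
    major, minor = (int(part) for part in gke_version.split(".")[:2])
    return (major, minor) < (1, 28)


print(needs_privileged_mode("1.27"))
print(needs_privileged_mode("1.28"))
```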

How TPUs in GKE work

Kubernetes resource management and prioritization treat VMs on TPUs the same as other VM types. To request TPU chips, use the resource name google.com/tpu:

    resources:
      requests:
        google.com/tpu: 4
      limits:
        google.com/tpu: 4

When using TPUs in GKE, consider the following TPU characteristics:

  • A VM can access up to 8 TPU chips.
  • A TPU slice contains a fixed number of TPU chips, with the number depending on the TPU machine type that you choose.
  • The number of requested google.com/tpu must be equal to the total number of available TPU chips on the TPU slice node. Any container in a GKE Pod that requests TPUs must consume all the TPU chips in the node. Otherwise, your Deployment fails because GKE can't partially consume TPU resources. Consider the following scenarios:
    • The machine type ct5l-hightpu-8t has a single TPU slice node with eight TPU chips, so on a node you:
      • Can deploy one GKE Pod that requires eight TPU chips.
      • Can't deploy two GKE Pods that require four TPU chips each.
    • The machine type ct5lp-hightpu-4t with a 2x4 topology contains two TPU slice nodes with four TPU chips each, for a total of eight TPU chips. With this machine type, you:
      • Can't deploy a GKE Pod that requires eight TPU chips on the nodes in this node pool.
      • Can deploy two Pods that require four TPU chips each, each Pod on one of the two nodes in this node pool.
    • TPU v5e with topology 4x4 has 16 TPU chips in four nodes. The GKE Autopilot workload that selects this configuration must request four TPU chips in each replica, for one to four replicas.
  • In Standard clusters, multiple Kubernetes Pods can be scheduled on a VM, but only one container in each Pod can access the TPU chips.
  • To create kube-system Pods, such as kube-dns, each Standard cluster must have at least one non-TPU slice node pool.
  • By default, TPU slice nodes have the google.com/tpu taint which prevents non-TPU workloads from being scheduled on the TPU slice nodes. Workloads that don't use TPUs are run on non-TPU nodes, freeing up compute on TPU slice nodes for code that uses TPUs. Note that the taint does not guarantee TPU resources are fully utilized.
  • GKE collects the logs emitted by containers running on TPU slice nodes. To learn more, see Logging.
  • TPU utilization metrics, such as runtime performance, are available in Cloud Monitoring. To learn more, see Observability and metrics.
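
The rule that a container must request every TPU chip on its node can be sketched as a check that runs before a manifest is written. The following Python example is illustrative only: the function name is hypothetical, and the check mirrors the scenarios in the list above rather than the actual GKE scheduler.

```python
TPU_RESOURCE = "google.com/tpu"


def tpu_resources(requested_chips: int, chips_on_node: int) -> dict:
    """Build a resources stanza, or fail if the request is a partial one."""
    if requested_chips != chips_on_node:
        raise ValueError(
            f"requested {requested_chips} TPU chips, but the TPU slice node "
            f"has {chips_on_node}; the request must equal the node total"
        )
    # Requests and limits carry the same value, as in the example manifest.
    return {
        "requests": {TPU_RESOURCE: requested_chips},
        "limits": {TPU_RESOURCE: requested_chips},
    }


# One Pod that requests all eight chips of an eight-chip node is accepted.
print(tpu_resources(8, 8))
```

Calling `tpu_resources(4, 8)` raises a `ValueError`, which corresponds to the scenario in which two Pods that each require four chips can't share an eight-chip node.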

How collection scheduling works

In TPU Trillium, you can use collection scheduling to group TPU slice nodes. Grouping these TPU slice nodes makes it easier to adjust the number of replicas to meet the workload demand. Google Cloud controls software updates to ensure that sufficient slices within the collection are always available to serve traffic.

Collection scheduling has the following limitations:

  • You can only schedule collections for TPU Trillium.
  • You can define collections only during node pool creation.
  • Spot VMs are not supported.

You can configure collection scheduling in the following scenarios:

What's next

To learn how to set up Cloud TPU in GKE, see the following pages: