Kubeflow Training
Overview
The Kubeflow Training Operator simplifies the management of distributed training jobs on Kubernetes. It lets users define training jobs as Kubernetes custom resources, making it easy to scale, manage, and monitor model training within a Kubernetes cluster.
This guide explains how to use the Kubeflow Training Operator to submit, manage, and monitor training jobs.
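Before submitting a job, make sure the Training Operator is running in your cluster. The namespace depends on how the operator was installed; the sketch below assumes a standalone installation, which typically deploys it to the kubeflow namespace.

# Check that the Training Operator deployment is running (namespace is an assumption).
kubectl get deployments -n kubeflow

# Confirm that its custom resource definitions are registered.
kubectl get crd | grep kubeflow.org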
Key Concepts
- Training job CRDs (TFJob, PyTorchJob, and others): Custom resource definitions used to define training jobs. A job spec includes the details of the training task, such as the container image, number of replicas, and more.
- Worker: A replica that runs the training logic. Multiple workers enable distributed training.
- PS (Parameter Server): Used in some distributed training frameworks to handle model parameter updates.
- Chief/Master: A single pod responsible for coordinating the training (the replica type is called Chief in TFJobs and Master in PyTorchJobs).
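Each of these roles maps to a replica type in the job spec. To see which job kinds the operator has registered and browse a job's spec from the cluster, you can query the Kubernetes API; a minimal sketch (kubectl explain only works if the installed CRD publishes an OpenAPI schema):

# List the training job kinds provided by the operator.
kubectl api-resources --api-group=kubeflow.org

# Browse the TFJob spec, including tfReplicaSpecs (Chief, Worker, PS, ...),
# assuming the CRD exposes an OpenAPI schema to kubectl explain.
kubectl explain tfjob.spec.tfReplicaSpecs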
Submitting a Training Job
To submit a training job, you need to create a YAML manifest file that describes the job. Below is an example YAML file for a TensorFlow training job.
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: example-tfjob
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.17.0
              command: ["python", "/app/train.py"]
              args: ["--epochs", "5"]
              resources:
                requests:
                  memory: "4Gi"
                  cpu: "2"
                limits:
                  memory: "4Gi"
                  cpu: "2"
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.17.0
              command: ["python", "/app/train.py"]
              args: ["--epochs", "5"]
              resources:
                requests:
                  memory: "4Gi"
                  cpu: "2"
                limits:
                  memory: "4Gi"
                  cpu: "2"
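Save the manifest to a file (tfjob.yaml is just an example filename) and create the job with kubectl:

# Create the TFJob from the manifest.
kubectl apply -f tfjob.yaml

# List TFJobs and check the job's high-level state.
kubectl get tfjobs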
The same kubectl workflow applies to other frameworks: common options for kind are TFJob for TensorFlow and PyTorchJob for PyTorch.
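To manage and monitor a running job, the usual kubectl commands apply. The label selector below assumes the training.kubeflow.org/job-name label that recent Training Operator releases add to the pods they create; older releases may use different label keys, and pod names typically follow the <job-name>-<replica-type>-<index> pattern.

# Show detailed status, replica states, and recent events for the job.
kubectl describe tfjob example-tfjob

# Find the pods belonging to the job (label key is an assumption; see above).
kubectl get pods -l training.kubeflow.org/job-name=example-tfjob

# Stream logs from the coordinating pod (name follows the pattern noted above).
kubectl logs -f example-tfjob-chief-0

# Delete the job and its pods when finished.
kubectl delete tfjob example-tfjob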
For more information on other job types such as PyTorchJob or MXJob, or to explore related features such as hyperparameter tuning, refer to the Kubeflow documentation.