Installation

Prerequisites

Online Installation

  1. ACP version: v4.0 or later

  2. Cluster administrator access to your ACP cluster

  3. Ensure that bash is installed on each NPU node. Otherwise, the driver and firmware installation script may fail to be parsed.

  4. Worker Node Operating System Requirements

    • Worker nodes (or node groups) running NPU workloads must use one of the following operating systems (Arm architecture):

      • openEuler 22.03 LTS
      • Ubuntu 22.04
    • Worker nodes running CPU workloads only can use any operating system, as the NPU operator performs no configuration on nodes without NPU workloads.

  5. Supported NPU Hardware

    • Nodes must use supported NPUs:

      • Ascend 910B
      • Ascend 310P
    • For detailed OS and hardware compatibility, see the MindCluster documentation.

  6. The Alauda Build of Node Feature Discovery cluster plugin must be installed.
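The OS, architecture, and bash requirements above can be checked on a node with a short script. This is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes that /etc/os-release reports ID=openEuler with VERSION_ID=22.03 or ID=ubuntu with VERSION_ID=22.04, and that Arm nodes report aarch64 or arm64 from uname -m.

```shell
# Sketch: check the bash, architecture, and OS prerequisites on an NPU node.
# Assumption: the ID and VERSION_ID values below match what the supported
# distributions write to /etc/os-release.

# is_supported_os ID VERSION_ID -> exit status 0 for a supported pair.
is_supported_os() {
  case "$1:$2" in
    openEuler:22.03|ubuntu:22.04) return 0 ;;
    *) return 1 ;;
  esac
}

# is_arm_arch MACHINE -> exit status 0 for 64-bit Arm machine names.
is_arm_arch() {
  case "$1" in
    aarch64|arm64) return 0 ;;
    *) return 1 ;;
  esac
}

# check_npu_node prints one line per check; run it on the node itself.
check_npu_node() {
  if command -v bash >/dev/null 2>&1; then echo "bash: ok"; else echo "bash: MISSING"; fi
  machine=$(uname -m)
  if is_arm_arch "$machine"; then echo "arch: ok"; else echo "arch: unsupported ($machine)"; fi
  if [ -r /etc/os-release ]; then
    # Source the file in a subshell so its variables do not leak into the caller.
    (
      . /etc/os-release
      if is_supported_os "$ID" "$VERSION_ID"; then
        echo "os: ok ($ID $VERSION_ID)"
      else
        echo "os: unsupported ($ID $VERSION_ID)"
      fi
    )
  else
    echo "os: cannot read /etc/os-release"
  fi
}
```

Calling check_npu_node on a worker node prints the result of each check. It does not verify the NPU model or the ACP version.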

Offline Installation

  1. Offline installation requires all online installation prerequisites plus additional preparation steps.

  2. Prepare the driver and firmware package and the MindIO SDK package. Download the following packages (if you do not need to install MindIO, then you do not need to download the MindIO package):

    • For the driver and firmware package, open the config.json file in the npu-driver-installer GitCode repository and download the package from the link that matches the version you want and the NPU model and OS architecture of the target node.
    • For the MindIO SDK package, open the config.json file in the npu-node-provision GitCode repository and download the SDK package from the link that matches the NPU model and OS architecture of the target node.

  3. Save the ZIP file of the driver and firmware package to the /tmp/driver_pkg/ path of the node where the offline installation is to be performed.

  4. Save the ZIP file of the MindIO package to the /opt/openFuyao/mindio/ path of the node where the offline installation is to be performed. (If you do not need to install MindIO, skip this step.)

  5. Check that the target node has the following packages installed.

    • For systems using Yum as the package manager, the following packages must be installed: "jq wget unzip which net-tools pciutils gcc make kernel-devel-$(uname -r) kernel-headers-$(uname -r) dkms".
    • For systems using apt-get as the package manager, the following packages must be installed: "jq wget unzip debianutils net-tools pciutils gcc make dkms linux-headers-$(uname -r)".
    • For systems using DNF as the package manager, the following packages must be installed: "jq wget unzip which net-tools pciutils gcc make kernel-devel-$(uname -r) kernel-headers-$(uname -r) dkms".
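The offline preparation steps above can be sanity-checked before starting the installation. The sketch below is illustrative only, and its function names are hypothetical. It assumes net-tools can be detected through the netstat command and pciutils through lspci. Kernel headers and kernel-devel are packages rather than commands, so confirm those separately with the package manager (for example rpm -q or dpkg -s).

```shell
# Sketch: offline-installation preflight for one node.

# missing_tools CMD... -> prints each command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# has_zip DIR -> exit status 0 when DIR contains at least one .zip file.
has_zip() {
  for f in "$1"/*.zip; do
    [ -e "$f" ] && return 0
  done
  return 1
}

# preflight prints whatever is still missing on this node.
preflight() {
  missing=$(missing_tools jq wget unzip which netstat lspci gcc make dkms)
  # $missing is left unquoted on purpose so the names print on one line.
  [ -z "$missing" ] || echo "missing tools:" $missing
  has_zip /tmp/driver_pkg || echo "no driver/firmware ZIP in /tmp/driver_pkg/"
  # Ignore this message when MindIO is not being installed.
  has_zip /opt/openFuyao/mindio || echo "no MindIO ZIP in /opt/openFuyao/mindio/"
}
```

Running preflight on the target node prints nothing when every command and package file is in place.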

Procedure

Downloading Packages

INFO

From the Marketplace on the Customer Portal website, download the following packages:

  • The Alauda Build of NPU Operator operator package (delivered as an OLM OperatorBundle).
  • The Alauda Build of Node Feature Discovery cluster plugin package.
  • (Optional) The Volcano cluster plugin package — only needed if you plan to enable the ClusterD component during deployment.

Uploading Packages

The platform provides the violet command-line tool for uploading both operator packages and cluster plugin packages downloaded from the Customer Portal Marketplace.

For details, see Upload Packages.

Installing the Node Feature Discovery Cluster Plugin

Alauda Build of Node Feature Discovery is a cluster plugin, not an operator. Install it first because the NPU Operator depends on its node labelling.

  1. Navigate to Administrator > Marketplace > Cluster Plugins.
  2. Switch to the target cluster.
  3. Locate Alauda Build of Node Feature Discovery and click Install.

TIP

The Volcano cluster plugin can be left uninstalled for now. Install it from the same Cluster Plugins page only if you later enable the ClusterD component of the NPU Operator.
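After the plugin is installed, one way to confirm that node labelling works is to count the Node Feature Discovery labels on a node. The helper below is a sketch with a hypothetical name. It assumes the Alauda build keeps the upstream label prefix feature.node.kubernetes.io/, which this document does not state.

```shell
# count_nfd_labels reads a node's labels on stdin (JSON or plain text) and
# prints how many label keys start with the upstream NFD prefix.
# Assumption: the plugin uses the feature.node.kubernetes.io/ prefix.
count_nfd_labels() {
  grep -o 'feature\.node\.kubernetes\.io/[^":,= ]*' | wc -l | tr -d ' '
}

# Usage against a live cluster (NODE is a placeholder for a node name):
#   kubectl get node "$NODE" -o jsonpath='{.metadata.labels}' | count_nfd_labels
```

A result of 0 suggests the plugin has not labelled that node yet.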

Installing the Alauda Build of NPU Operator

Alauda Build of NPU Operator is delivered as an operator (OLM bundle), so the install flow is the OperatorHub flow, not the Cluster Plugins flow.

  1. Apply the label masterselector=dls-master-node to all master nodes and the label workerselector=dls-worker-node to all worker nodes.

    kubectl label nodes {master-node-id} masterselector=dls-master-node
    kubectl label nodes {worker-node-id} workerselector=dls-worker-node

  2. Navigate to Administrator > Marketplace > OperatorHub, switch to the target cluster, and locate the Alauda Build of NPU Operator entry.

  3. If the status is Absent, confirm the operator package was uploaded with violet in the previous step.

  4. Click the operator to open its details page, then click Install.

  5. On the install page, choose the target namespace (the default is npu-operator; all pods managed by this operator will land here) and fill in the deployment form below. Click Install to start; confirm in the dialog and wait for the subscription to reach Succeeded.

    Deployment form parameter description:

    WARNING

    If any of the components listed in the table below are already installed, be sure to turn off the corresponding switches during deployment.

    TIP

    Ascend Operator, NodeD, ClusterD, Resilience Controller, MindIO TFT, and MindIO ACP are not deployed by default. Please deploy them only when there is a clear need for them.

    NOTE

    All pods created by the operator (driver, device plugin, docker runtime, NPU exporter, Ascend Operator, NodeD, ClusterD, Resilience Controller, MindIO TFT, MindIO ACP, NPU Feature Discovery, and the operator itself) are deployed into the same namespace as the operator pod (the default is npu-operator). Volcano-related components (vc-controller / vc-scheduler) are intentionally not exposed in this form — the platform's own Volcano cluster plugin should be installed separately when ClusterD is enabled.

    Component               Default    Description
    Driver                  Enabled    Whether to install driver and firmware.
    Driver Version          25.5.0     Driver and firmware version. You must select the version number from the repository directory npu-driver-installer. Hidden when Driver is disabled.
    Ascend Device Plugin    Enabled    Whether to install Ascend Device Plugin.
    Ascend Docker Runtime   Enabled    Whether to install Ascend Docker Runtime.
    NPU Exporter            Enabled    Whether to install NPU Exporter.
    Ascend Operator         Disabled   Whether to install Ascend Operator.
    NodeD                   Disabled   Whether to install NodeD.
    ClusterD                Disabled   Whether to install ClusterD. Requires the Volcano cluster plugin to be installed first.
    Resilience Controller   Disabled   Whether to install Resilience Controller.
    MindIO TFT              Disabled   Whether to install MindIO TFT.
    MindIO ACP              Disabled   Whether to install MindIO ACP.
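On clusters with many nodes, the labelling in step 1 can be scripted rather than run per node. The following is a sketch under one assumption: control-plane nodes carry the standard node-role.kubernetes.io/control-plane label and every other node is treated as a worker. The function names are hypothetical. Adjust the selectors if your cluster labels roles differently.

```shell
# label_nodes SELECTOR LABEL -> applies LABEL to every node matching SELECTOR.
label_nodes() {
  kubectl get nodes -l "$1" -o name | while read -r node; do
    # --overwrite keeps the command repeatable if the label already exists.
    kubectl label --overwrite "$node" "$2"
  done
}

# label_dls_nodes applies both labels from step 1.
# Assumption: master nodes have the node-role.kubernetes.io/control-plane label.
label_dls_nodes() {
  label_nodes 'node-role.kubernetes.io/control-plane' 'masterselector=dls-master-node'
  label_nodes '!node-role.kubernetes.io/control-plane' 'workerselector=dls-worker-node'
}
```

Run label_dls_nodes from a machine with cluster administrator access, then confirm the result with kubectl get nodes --show-labels.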

Verification

  1. On the Administrator > Marketplace > OperatorHub details page of Alauda Build of NPU Operator, the subscription status should be Succeeded. The corresponding CSV (npu-operator.v<version>) appears under Installed Operators in the target namespace.

  2. Wait for the npu-driver pod to reach the Running state. Offline installation takes about 10 minutes, while online installation is much faster.

    kubectl -n npu-operator get pod -w | grep npu-driver

    NOTE

    Replace npu-operator with the namespace you chose during installation if you installed the operator into a different namespace. All pods managed by the operator share this namespace.

  3. Reboot all the NPU nodes.

  4. Run the following command on the NPU node.

    npu-smi info

    Make sure the NPU information is displayed correctly.

  5. Run the following command on the master node.

    kubectl get npuclusterpolicy cluster

    Make sure the status of the npuclusterpolicy is Ready.

  6. From a control node of the business cluster, check whether the NPU node reports allocatable NPU resources. Run the following command:

    kubectl get node ${nodeName} -o=jsonpath='{.status.allocatable}'
    # Example: the output contains "huawei.com/Ascend310P":"1" (the specific value depends on the number of NPU cards)

  7. Run a validation workload. Create the spec file:

    key="huawei.com/Ascend310P" # For 310P
    cat <<EOF > deploy-npu.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ascend-pytorch
    spec:
      replicas: 1
      selector:
        matchLabels:
          service.cpaas.io/name: deployment-ascend-pytorch
      strategy:
        rollingUpdate:
          maxSurge: 1
          maxUnavailable: 1
        type: RollingUpdate
      template:
        metadata:
          labels:
            service.cpaas.io/name: deployment-ascend-pytorch
        spec:
          affinity: {}
          containers:
            - args:
                - |
                  sleep infinity
              command:
                - /bin/bash
                - -c
              image: ascendai/pytorch:ubuntu-python3.8-cann8.0.rc1.beta1-pytorch2.1.0
              imagePullPolicy: Always
              name: ascend-pytorch
              resources:
                limits:
                  cpu: 500m
                  $key: "1"
                  memory: 2Gi
                requests:
                  cpu: 500m
                  memory: 2Gi
          runtimeClassName: ascend
    EOF

    Apply spec:

    kubectl apply -f deploy-npu.yaml
    kubectl exec -it deploy/ascend-pytorch -- bash

    Then run the following command in the container:

    npu-smi info

    Make sure the NPU information is displayed correctly.
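Two of the checks above lend themselves to small helpers: extracting the NPU entries from the allocatable output in step 6, and waiting for the validation Deployment in step 7 before running kubectl exec. This is a sketch with hypothetical function names. It assumes the jsonpath output is a JSON map like the example in step 6 and that NPU resource names start with huawei.com/Ascend.

```shell
# npu_allocatable reads the output of
#   kubectl get node NODE -o=jsonpath='{.status.allocatable}'
# on stdin and prints each huawei.com/Ascend* resource as NAME=COUNT.
npu_allocatable() {
  grep -o '"huawei\.com/Ascend[^"]*":"[0-9]*"' | tr -d '"' | sed 's/:/=/'
}

# wait_for_workload blocks until the ascend-pytorch Deployment has rolled out
# or the timeout expires. The 300s timeout is an arbitrary choice.
wait_for_workload() {
  kubectl rollout status deploy/ascend-pytorch --timeout=300s
}

# Usage against a live cluster (nodeName is a placeholder):
#   kubectl get node "$nodeName" -o=jsonpath='{.status.allocatable}' | npu_allocatable
#   kubectl apply -f deploy-npu.yaml && wait_for_workload
```

Empty output from npu_allocatable means the node reports no Ascend resources in its allocatable list.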

Installing Monitoring

If the NPU Exporter component was deployed when installing the Alauda Build of NPU Operator, perform the following steps to create a monitoring panel.

  1. The operator automatically deploys a ServiceMonitor named npu-exporter-servicemonitor in the operator namespace, wired up to the npu-exporter Service. No manual ServiceMonitor creation is required. You can verify it with:

    kubectl -n npu-operator get servicemonitor npu-exporter-servicemonitor

  2. Import the Grafana dashboard JSON file by following Import Dashboard, which converts it into a monitoring dashboard. The JSON file is available in ascend-npu-dashboard.

    NOTE

    Tags in the Grafana dashboard JSON file cannot contain Chinese characters; delete any such tags manually before importing. For example:

    {
      "tags": [
        "ascend",
        "昇腾"
      ]
    }

    After modification:

    {
      "tags": [
        "ascend"
      ]
    }
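The tag cleanup shown above can also be done with jq, which is already in the offline prerequisite list. The sketch below is one possible approach with a hypothetical function name. It removes every tag that contains any non-ASCII character, which is slightly broader than removing Chinese characters only, and it assumes jq 1.5 or later.

```shell
# strip_non_ascii_tags FILE -> prints the dashboard JSON with every tag that
# contains a non-ASCII character removed. Other fields are left unchanged.
strip_non_ascii_tags() {
  jq 'if (.tags | type) == "array"
      then .tags |= map(select(explode | all(. < 128)))
      else .
      end' "$1"
}

# Usage (both file names are placeholders):
#   strip_non_ascii_tags dashboard.json > dashboard-clean.json
```

The function writes to standard output rather than editing the file in place, so the original JSON stays available for comparison.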

FAQ

What should I pay attention to when uninstalling Alauda Build of NPU Operator?

Even after Alauda Build of NPU Operator is uninstalled, the driver may still exist on the host machine. On the NPU node, execute the following command to uninstall the driver:

/usr/local/Ascend/driver/script/uninstall.sh
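When cleaning up several nodes, a guard avoids an error on nodes where the driver was never installed. This is a minimal sketch with a hypothetical function name. The default path is the one given above, and the script normally needs to be run as root.

```shell
# uninstall_npu_driver [SCRIPT] -> runs the driver uninstall script if present.
uninstall_npu_driver() {
  script=${1:-/usr/local/Ascend/driver/script/uninstall.sh}
  if [ -x "$script" ]; then
    "$script"
  else
    echo "driver uninstall script not found: $script"
  fi
}
```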