Huawei Ascend CloudMatrix 384 Supernode: In-Depth Project Analysis
Huawei Cloud
Source: https://mp.weixin.qq.com/s/Kb9k4_xnEbELn8AKbO8bEw
June 17, 2026
384 × Ascend 910C + 192 × Kunpeng — China's Largest Domestic AI Computing Infrastructure
Total Compute (BF16): ~289 PFLOPS
NPU Configuration: 384 × Ascend 910C
Total NPU Memory: ~48 TB HBM
UB Interconnect BW: 196 GB/s / NPU
Physical Scale: 16 Racks
I. Project Background & Positioning
CloudMatrix 384 is an AI supernode platform that Huawei Cloud released in April 2025 and first deployed at the Ulanqab Data Center in Inner Mongolia. The supernode is built around 384 Ascend 910C NPUs and 192 Kunpeng CPUs, linked by Huawei's self-developed Unified Bus (UB) into a fully non-blocking all-to-all interconnect. The result is a single-domain AI supercomputing unit with roughly 289 PFLOPS of BF16 compute (close to 300 PFLOPS) and about 48 TB of HBM. The project is positioned as a domestic computing foundation for training and inference of trillion-parameter models, and it supports large-scale deployment of domestic models such as DeepSeek V4 Pro.
II. Architecture Overview
The CloudMatrix 384 system architecture has four layers: Compute Node Layer (48 nodes × 8 NPUs) → L1 On-Board Switching Layer → L2 Rack-Level Switching Layer → RDMA Horizontal Scaling Layer. The UB bus provides fully non-blocking all-to-all interconnect at both the L1 and L2 levels.
▲ CloudMatrix 384 System Architecture — Four-layer stack: Compute Nodes (48×) → L1 UB Switches (336×) → L2 Rack Switches (112×) → RDMA RoCE (76.8 Tbps)
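The component counts in the architecture summary can be cross-checked with a few multiplications. The sketch below is illustrative only: the class and field names are hypothetical, the per-node and per-sub-plane values are the ones quoted in this article (4 CPUs per node comes from section 4.3), and nothing here is an official specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupernodeTopology:
    """Component counts as quoted in this article (not an official spec)."""
    nodes: int = 48
    npus_per_node: int = 8
    cpus_per_node: int = 4            # section 4.3: 8 NPUs and 4 CPUs per node
    l1_switches_per_node: int = 7
    sub_planes: int = 7
    l2_switches_per_sub_plane: int = 16

    @property
    def total_npus(self) -> int:
        return self.nodes * self.npus_per_node

    @property
    def total_cpus(self) -> int:
        return self.nodes * self.cpus_per_node

    @property
    def total_l1_switches(self) -> int:
        return self.nodes * self.l1_switches_per_node

    @property
    def total_l2_switches(self) -> int:
        return self.sub_planes * self.l2_switches_per_sub_plane


topo = SupernodeTopology()
print(topo.total_npus, topo.total_cpus, topo.total_l1_switches, topo.total_l2_switches)
# prints: 384 192 336 112
```

The four totals match the 384 NPUs, 192 CPUs, 336 L1 switches, and 112 L2 switches stated in the text and the figure caption.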
III. Core Compute Configuration
3.1 NPU — Ascend 910C
3.2 CPU — Kunpeng
3.3 System Total Compute Summary
IV. UB Interconnect Network Architecture
4.1 Why Not PCIe?
Traditional GPU clusters rely on PCIe for interconnection, and the PCIe bandwidth bottleneck becomes more severe as model parameter counts grow. CloudMatrix 384 instead adopts Huawei's self-developed Unified Bus (UB), which delivers 196 GB/s of unidirectional bandwidth per NPU, about three times the roughly 64 GB/s per direction of a PCIe 5.0 x16 link. The UB fabric is split into 7 sub-planes, each an independent full-mesh network, a design intended to give zero contention between any two NPUs.
4.2 UB Bus Technical Specifications
The UB bus uses 224 Gbps high-speed transceivers, with 7 transceivers per NPU, totaling 1,568 Gbps (196 GB/s) unidirectional bandwidth. The L1 on-board switching layer uses 7 UB Switch chips per node, and the L2 rack-level switching layer uses 16 switch chips per sub-plane, achieving fully non-blocking all-to-all interconnect for 384 NPUs.
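The per-NPU bandwidth figure follows directly from the transceiver count and line rate. As a minimal sketch, using only the numbers quoted in this article (the constant and function names are hypothetical, and the PCIe value is the 64 GB/s figure from section 4.1):

```python
BITS_PER_BYTE = 8


def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / BITS_PER_BYTE


TRANSCEIVER_GBPS = 224       # per-transceiver line rate quoted in the article
TRANSCEIVERS_PER_NPU = 7     # 7 transceivers per NPU, per the article
PCIE5_GBYTES_PER_S = 64      # PCIe 5.0 figure quoted in section 4.1

per_port = gbps_to_gbytes_per_s(TRANSCEIVER_GBPS)
per_npu = per_port * TRANSCEIVERS_PER_NPU
ratio_vs_pcie = per_npu / PCIE5_GBYTES_PER_S

print(f"{per_port:.0f} GB/s per port, {per_npu:.0f} GB/s per NPU, "
      f"{ratio_vs_pcie:.1f}x PCIe 5.0")
# prints: 28 GB/s per port, 196 GB/s per NPU, 3.1x PCIe 5.0
```

The 28 GB/s per-port value is the same one that section 4.3 quotes for each NPU-to-switch link.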
4.3 L1 On-Board Switching Layer
Each compute node is equipped with 7 UB Switch chips, responsible for on-board switching between 8 NPUs and 4 CPUs. Each NPU is directly connected to 7 UB Switches, with 28 GB/s per port, totaling 196 GB/s unidirectional bandwidth.
4.4 L2 Rack-Level Switching Layer
4 communication racks house 112 L2 UB switch chips (7 sub-planes × 16 chips). Each sub-plane is an independent 384-port full-mesh network, with 28 GB/s per port and zero oversubscription. The L2 layer provides 448 GB/s uplink bandwidth per node.
4.5 RDMA Horizontal Scaling
Each NPU is equipped with a 200 Gbps RoCE RDMA NIC, providing 76.8 Tbps total horizontal scaling bandwidth. The RDMA plane is completely isolated from the VPC management plane, ensuring pure NPU-to-NPU communication without interference.
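The aggregate RDMA figure is simply the per-NPU NIC rate multiplied by the NPU count. The sketch below reproduces it and, as a point of comparison, converts the RoCE rate to bytes so it can be set against the UB figure; the variable names are hypothetical and all inputs come from this article.

```python
NPUS = 384
ROCE_GBPS_PER_NPU = 200      # RoCE NIC rate per NPU, per the article
UB_GBYTES_PER_NPU = 196      # UB unidirectional bandwidth per NPU, per the article

total_tbps = NPUS * ROCE_GBPS_PER_NPU / 1000
roce_gbytes_per_npu = ROCE_GBPS_PER_NPU / 8
ub_to_roce = UB_GBYTES_PER_NPU / roce_gbytes_per_npu

print(f"{total_tbps:.1f} Tbps aggregate, {roce_gbytes_per_npu:.0f} GB/s per NPU, "
      f"UB is {ub_to_roce:.1f}x the RoCE rate")
# prints: 76.8 Tbps aggregate, 25 GB/s per NPU, UB is 7.8x the RoCE rate
```

The gap between the two planes is why scale-out over RoCE is treated as a separate, lower-bandwidth tier than the UB domain.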
V. Physical Deployment & Rack Configuration
5.1 Rack Layout
The 12 compute racks are arranged in 3 rows of 4 racks each, with 4 compute nodes per rack. The 4 communication racks are located in the center of the computer room, connected to the compute racks via optical fiber. The distance between each compute node and the L2 switch is kept within 10 meters to ensure signal integrity.
5.2 Power & Cooling
The total system power consumption is estimated at 350–500 kW, requiring dual 10 kV power supply. The cooling system uses a combination of row-level air conditioning and liquid cooling, with a PUE target of 1.2–1.3. Each rack is equipped with independent UPS and PDU to ensure high availability.
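To give a sense of scale for the power figures, the sketch below converts them into annual energy. It rests on two assumptions that the article does not confirm: that the 350–500 kW estimate refers to IT load (so facility draw is IT power multiplied by PUE), and that the system runs at constant load all year. The function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def annual_facility_energy_mwh(it_power_kw: float, pue: float) -> float:
    """Facility energy per year in MWh, assuming constant load.

    Assumption: facility power = IT power x PUE.
    """
    return it_power_kw * pue * HOURS_PER_YEAR / 1000


low = annual_facility_energy_mwh(350, 1.2)    # low power estimate, low PUE target
high = annual_facility_energy_mwh(500, 1.3)   # high power estimate, high PUE target

print(f"about {low:,.0f} to {high:,.0f} MWh per year")
# prints: about 3,679 to 5,694 MWh per year
```

If the 350–500 kW estimate already includes cooling overhead, the PUE factor should be dropped and the range becomes smaller.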
VI. DeepSeek V4 Pro Inference Performance Benchmarks
6.1 Inference Performance Benchmarks
In DeepSeek V4 Pro testing, the CloudMatrix 384 supernode reaches a prefill throughput of 6,688 tokens/s/NPU and a decode throughput of 1,943 tokens/s/NPU. Both figures are reported as significantly higher than those of H100 and H800 under the same conditions.
6.2 Comparison with H100 / H800
Under a strict latency constraint (time per output token, TPOT, below 50 ms), CloudMatrix 384 still sustains 538 tokens/s/NPU. H100 and H800, by contrast, must significantly reduce batch size to meet the same latency requirement, which drops their throughput to 30–40% of peak.
6.3 Memory Capacity Advantage
Each NPU has 128 GB of HBM, 60% more than the 80 GB of H100/H800, which allows CloudMatrix 384 to load larger models or support longer contexts. In DeepSeek V4 Pro's 1M-context test, the KV cache occupies only hundreds of GB, because V4's HCA attention mechanism reduces the KV cache to 10% of that of V3.2.
VII. DeepSeek V4 Pro Full-Parameter Post-Training
7.1 Training Task Overview
In June 2026, a joint team from Huawei and Shenzhen-based institutions completed full-parameter post-training of the V4 Pro model on the CloudMatrix 384 supernode at Ulanqab. The run covered 1.6 trillion parameters with FP32 precision and the AdamW optimizer, for a total memory requirement of approximately 25.6 TB (about 16 bytes per parameter), well within the 48 TB capacity of CloudMatrix 384.
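The 25.6 TB figure is consistent with the usual accounting for plain FP32 AdamW, in which every parameter carries a weight, a gradient, and two optimizer moments. The sketch below works through that arithmetic. It is an assumption about how the number was derived, it ignores activations and communication buffers, and the even sharding across all 384 NPUs is hypothetical, since the card count of the training run has not been announced.

```python
PARAMS = 1.6e12          # parameter count quoted in the article
BYTES_FP32 = 4
# Assumed per-parameter state under plain FP32 AdamW:
# weight, gradient, first moment, second moment.
STATE_COPIES = 4

NPUS = 384               # hypothetical: the actual card count is not disclosed
HBM_GB_PER_NPU = 128

state_bytes = PARAMS * BYTES_FP32 * STATE_COPIES
state_tb = state_bytes / 1e12
total_hbm_tb = NPUS * HBM_GB_PER_NPU / 1e3     # decimal TB
share_of_hbm = state_tb / total_hbm_tb
per_npu_gb = state_bytes / NPUS / 1e9          # if sharded evenly across NPUs

print(f"{state_tb:.1f} TB of state, {per_npu_gb:.1f} GB per NPU if sharded evenly, "
      f"{share_of_hbm:.0%} of total HBM")
# prints: 25.6 TB of state, 66.7 GB per NPU if sharded evenly, 52% of total HBM
```

Under these assumptions roughly half of the HBM remains for activations and buffers, which is the headroom the text refers to.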
Clarification: This validation was completed by the joint team of Shenzhen Hetao College, Harbin Institute of Technology (Shenzhen), Shenzhen Institute of Big Data, and Huawei, relying on the Ascend 910C domestic computing cluster.
7.2 Key Breakthroughs
7.3 Training Performance Metrics
The training MFU (Model FLOPs Utilization) exceeded 30%, with key operator efficiency improved by 14% through CANN graph optimization. The entire training process completed 1,500+ steps with zero interruptions, fully demonstrating the stability and reliability of the Ascend 910C cluster.
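MFU is the ratio of the FLOPs a training run actually performs on the model per second to the theoretical peak of the hardware. A minimal sketch of that definition follows; the function name and every input value are hypothetical, chosen only so that the example lands on 30%, and they are not measurements from the Ulanqab run.

```python
def mfu(model_flops_per_step: float, step_time_s: float,
        num_devices: int, peak_flops_per_device: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs/s over aggregate peak FLOPs/s."""
    achieved_flops_per_s = model_flops_per_step / step_time_s
    peak_flops_per_s = num_devices * peak_flops_per_device
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical inputs for illustration only.
example = mfu(
    model_flops_per_step=8.64e18,   # FLOPs required by one optimizer step
    step_time_s=100.0,              # measured wall-clock time per step
    num_devices=384,
    peak_flops_per_device=7.5e14,   # peak FLOPs/s at the training precision
)
print(f"MFU = {example:.1%}")
# prints: MFU = 30.0%
```

Because the denominator is the peak at the precision actually used, an MFU quoted for an FP32 run is not directly comparable to one quoted against a BF16 peak.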
VIII. Ascend 950DT Roadmap & CloudMatrix Evolution
8.1 950DT Core Upgrades
The Ascend 950DT (expected August 2026) is slated to bring significant upgrades: HBM capacity rises to 144 GB (self-developed HiZQ 2.0), memory bandwidth rises to 4.0 TB/s, and the architecture moves to Dual-Die UMA (Unified Memory Access). An on-device AI CPU reduces dependency on the host CPU, and FP8/BF16 compute is expected to exceed 1,000 TFLOPS.
8.2 Next-Gen CloudMatrix Outlook
Based on the 950DT, the next-generation CloudMatrix is expected to achieve 512+ NPUs per supernode, with total compute exceeding 500 PFLOPS. The UB bandwidth is expected to increase to 280+ GB/s, further expanding the scale of all-to-all interconnect.
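The roadmap numbers are mutually consistent, which a short calculation shows. The sketch below uses the lower bounds quoted in the article ("512+" NPUs, "exceed 1,000 TFLOPS", "280+ GB/s") and hypothetical variable names. Note that the article does not say whether the 1,000 TFLOPS figure is FP8 or BF16, so the comparison with the current BF16 total is indicative at best.

```python
CURRENT_NPUS = 384
CURRENT_PFLOPS = 289            # BF16 total quoted in the article
NEXT_NPUS = 512                 # lower bound ("512+")
NEXT_TFLOPS_PER_NPU = 1000      # lower bound ("exceed 1,000 TFLOPS")
CURRENT_UB_GBYTES = 196
NEXT_UB_GBYTES = 280            # lower bound ("280+ GB/s")

current_tflops_per_npu = CURRENT_PFLOPS * 1000 / CURRENT_NPUS
next_pflops_floor = NEXT_NPUS * NEXT_TFLOPS_PER_NPU / 1000
ub_gain = NEXT_UB_GBYTES / CURRENT_UB_GBYTES

print(f"{current_tflops_per_npu:.0f} TFLOPS per 910C today, "
      f"{next_pflops_floor:.0f} PFLOPS next-gen floor, UB x{ub_gain:.2f}")
# prints: 753 TFLOPS per 910C today, 512 PFLOPS next-gen floor, UB x1.43
```

A floor of 512 PFLOPS agrees with the "exceeding 500 PFLOPS" projection in the text.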
IX. CloudMatrix 384 Use Cases & Selection Guide
9.1 Applicable Scenarios
9.2 Selection Recommendations
For enterprises requiring trillion-parameter model inference, the CloudMatrix 384 standard configuration is recommended. For training scenarios, the extended configuration (2 sets cascaded via RoCE) can be considered. For small and medium enterprises or pilot projects, the basic configuration or Huawei Cloud Ascend cloud instances can be used.
X. Deployment Plan & Software Stack
10.1 Deployment Topology
10.2 Matrix Software Stack
The Matrix series software stack provides full lifecycle management from resource pooling to application scheduling. MatrixResource is responsible for NPU/CPU/Memory/Network resource pooling and topology-aware scheduling; MatrixLink provides UB dynamic routing and RDMA bandwidth allocation; MatrixCompute manages bare metal/VM/container instances; MatrixContainer provides K8s-based container scheduling; MatrixStorage provides distributed parallel file system access.
10.3 Storage Solution
The storage system uses a three-tier architecture: EVS (Elastic Volume Service) provides block storage for OS and container images; SFS (Scalable File Service) provides shared file access for training logs; OBS (Object Storage Service) provides object storage for massive training datasets. All storage services support multi-protocol access (POSIX/S3/HDFS).
10.4 Network Plane Segmentation
The CloudMatrix 384 network is divided into three planes: UB computing plane (196 GB/s NPU interconnect), RDMA scaling plane (200 Gbps per NPU), and VPC management plane (400 Gbps per node via QingTian DPU). The three planes are physically isolated to ensure computing, storage, and management traffic do not interfere with each other.
XI. Project Cost & ROI Analysis
11.1 Investment Breakdown (CapEx)
11.2 Three Configuration Options Comparison
11.3 Annual Operating Costs (OpEx)
11.4 TCO Comparison & ROI Reference
CapEx (one-time): 127–162 million CNY. Standard configuration: 1 set of CloudMatrix 384 full hardware, infrastructure, software, and 3-year service.
Annual OpEx: ~9.4–16.3 million CNY/year, including electricity, personnel, maintenance renewal, storage expansion, and bandwidth.
ROI Reference: If used for external DeepSeek V4 Pro inference API services (at public cloud rates of 40–80 CNY/million tokens), full load operation is expected to recover CapEx investment within 12–18 months. The 910C already has mature and stable inference capabilities, and combined with DeepSeek V4 Pro's proven full-parameter post-training, the ROI window for private deployment has opened.
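The payback claim can be turned around to ask what revenue it implies. The sketch below uses a deliberately simple, undiscounted model (no taxes, financing, or ramp-up period), with hypothetical function names and the CapEx, OpEx, price, and payback ranges quoted in this article as inputs.

```python
def required_annual_revenue(capex: float, annual_opex: float,
                            payback_months: float) -> float:
    """Revenue per year needed to recover capex within payback_months (undiscounted)."""
    return capex * 12 / payback_months + annual_opex


def tokens_needed_per_year(revenue_cny: float,
                           price_cny_per_million_tokens: float) -> float:
    """Tokens that must be sold per year to earn revenue_cny at the given price."""
    return revenue_cny / price_cny_per_million_tokens * 1e6


# Most favorable corner: low CapEx, low OpEx, 18-month payback, 80 CNY/M tokens.
best_revenue = required_annual_revenue(127e6, 9.4e6, 18)
best_tokens = tokens_needed_per_year(best_revenue, 80)

# Least favorable corner: high CapEx, high OpEx, 12-month payback, 40 CNY/M tokens.
worst_revenue = required_annual_revenue(162e6, 16.3e6, 12)
worst_tokens = tokens_needed_per_year(worst_revenue, 40)

print(f"revenue: {best_revenue / 1e6:.1f} to {worst_revenue / 1e6:.1f} million CNY per year")
print(f"tokens:  {best_tokens:.2e} to {worst_tokens:.2e} per year")
# prints: revenue: 94.1 to 178.3 million CNY per year
```

Comparing the required token volume with the sustained throughput a deployment actually achieves is a quick way to test whether the 12–18 month window is realistic for a given workload.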
11.5 Cost Optimization Recommendations
XII. Insights from the Ulanqab CloudMatrix 384 Deployment
The Ulanqab CloudMatrix 384 deployment and the DeepSeek V4 Pro adaptation send the following signals:
1. Ascend 910C production capacity and maturity have been verified — 384 NPU large-scale deployment + completion of 1.6T parameter post-training demonstrates the chip has entered stable mass production, and customer solutions for Atlas 800I A2/A3 and other derivative servers can be confidently advanced.
2. The UB bus is a differentiated weapon — 196 GB/s NPU interconnect bandwidth far exceeds traditional PCIe solutions, providing a strong technical selling point for customers requiring multi-card collaborative large models.
3. The "Ascend + DeepSeek" combination is highly persuasive — R1 inference efficiency exceeds H100, and V4 Pro post-training has been proven in production, forming a complete technical closed loop for domestic-mandatory customers (government, central SOEs, research).
4. 910C is the preferred solution currently available for deployment — chips are already in stable mass production, the Ulanqab project has verified large-scale deployment capabilities, and Ascend computing solutions can be immediately advanced without waiting for next-generation products.
5. The Ulanqab model is replicable — the "East Data West Computing" model of western natural cooling + eastern computing consumption has been successfully verified, with similar opportunities at Horinger, Zhongwei, and other nodes.
6. Atlas 800I A2/A3 can carry V4 Pro inference — standard server solutions are available for small and medium customer trillion-model inference scenarios, mature and stable with quick results.
7. 1M context is a differentiated selling point — V4's HCA attention reduces KV Cache to 10% of V3.2, with unique advantages in long document understanding, code base analysis, and other scenarios.
Data Sources: Huawei Cloud Official Release (April 2025 Huawei Cloud Ecosystem Conference), Shenzhen Release / Hetao College Joint Announcement (June 2026), SemiAnalysis Ascend 950DT Trace Analysis (June 2026), DeepSeek Official V4 Technical Report. Prices in this article are for reference only; please consult Huawei and ecosystem partners for latest market quotes.
Note: Power consumption/TDP, SSD storage configuration, CPU DRAM specific capacity, and other parameters have not been disclosed by Huawei official channels. The above analysis is based on publicly available information. DeepSeek V4 Pro post-training cluster specific card counts have not been officially announced.






















