Huawei’s CloudMatrix Ascend: The AI Supernode for Ultra-Large-Scale Computing
- 11 November, 2025 / by Fosbite
Huawei’s CloudMatrix Ascend platform leads the latest innovation ranking
Huawei’s CloudMatrix Ascend platform — built around the Ascend AI line — took top billing at the company’s Innovation and IP Forum in Beijing. I’ll be honest: the press blurb sounds like marketing at first. But when you peel back the specs and the engineering stories, you start to see why this matters for people actually trying to train foundation models at scale.
What is CloudMatrix and why does it matter?
Put simply, the CloudMatrix Ascend platform is Huawei’s answer to an emerging class of ultra-large-scale AI computing systems — what they call an AI supernode. The baseline CloudMatrix 384 pairs 384 Ascend 910C NPUs with 192 Kunpeng server CPUs and ties them together with a unified high-bandwidth, low-latency interconnect. That design isn’t just about raw FLOPs; it’s about creating a coherent training fabric where tens of thousands of chips can be stitched into one workload.
That coherence is crucial. Ask any ML engineer: building foundation models isn’t a bunch of independent boxes — it’s a single orchestra. If the instruments aren’t in sync (data movement, checkpoints, power), the performance falls apart. CloudMatrix aims to be an orchestra conductor, not just a louder trumpet.
Key technical features
- Dense Ascend NPU integration: The 384 Ascend 910C NPU cluster in the baseline gives huge matrix-compute throughput — the kind of dense NPU integration you need when model parallelism meets gargantuan parameter counts.
- Kunpeng CPU coordination: 192 Kunpeng CPUs handle orchestration, IO coordination and system tasks — think of them as traffic controllers for massive NPU fleets.
- Unified high-bandwidth bus: A single fabric reduces latency and simplifies scaling. When you’re trying to do model and data parallelism at scale, that high-bandwidth low-latency interconnect is the glue.
- Scalable architecture: The idea is to scale from the 384/192 baseline up to tens of thousands of chips, preserving consistency, reliability and throughput across nodes.
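Taken together, those four points describe a parallelism fabric rather than a pile of servers. As a very rough illustration of how a 384-NPU supernode gets carved up, here is a minimal Python sketch that maps an assumed tensor-parallel x pipeline-parallel x data-parallel split onto 384 devices; the group sizes are my illustrative assumptions, not Huawei’s published configuration.

```python
# Minimal sketch: mapping a 3D parallelism layout onto a 384-NPU supernode.
# The group sizes below are illustrative assumptions, not Huawei specs.

TOTAL_NPUS = 384          # baseline CloudMatrix 384 accelerator count
TENSOR_PARALLEL = 8       # NPUs sharing each layer's matrices (assumed)
PIPELINE_PARALLEL = 6     # pipeline stages per model replica (assumed)
NPUS_PER_REPLICA = TENSOR_PARALLEL * PIPELINE_PARALLEL   # 48
DATA_PARALLEL = TOTAL_NPUS // NPUS_PER_REPLICA            # 8 replicas

def device_groups():
    """Return replicas -> pipeline stages -> the NPU ids that must
    exchange activations and gradients with each other."""
    groups = []
    for replica in range(DATA_PARALLEL):
        base = replica * NPUS_PER_REPLICA
        stages = []
        for stage in range(PIPELINE_PARALLEL):
            start = base + stage * TENSOR_PARALLEL
            stages.append(list(range(start, start + TENSOR_PARALLEL)))
        groups.append(stages)
    return groups

if __name__ == "__main__":
    replicas = device_groups()
    assert sum(len(s) for r in replicas for s in r) == TOTAL_NPUS
    print(f"{DATA_PARALLEL} replicas x {PIPELINE_PARALLEL} stages x "
          f"{TENSOR_PARALLEL} NPUs = {TOTAL_NPUS} devices")
```

The tensor-parallel groups are the chatty ones: they exchange activations on every layer, which is exactly where a unified low-latency interconnect earns its keep.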
What challenges did Huawei solve?
The engineering story here is concrete: co-design. Sun Hongwei described how hardware, software and packaging had to be iterated together. The problems are familiar if you’ve run a multi-node cluster — keeping power delivery stable, making sure a fleeting storage hiccup doesn’t corrupt checkpoints, and recovering gracefully from component failures without restarting weeks of work.
They also addressed orchestration and data movement at scale. Coordinating thousands of NPUs is less sexy than peak TFLOPS, but it’s the real obstacle to efficient, repeatable training.
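In practice, the standard engineering answer to that failure-recovery problem is checkpoint-and-resume rather than restart-from-scratch. The minimal Python sketch below shows the idea in a framework-agnostic way; the callables (load_latest_checkpoint, train_steps, save_checkpoint) are placeholders I’ve assumed, not Huawei or MindSpore APIs.

```python
# Generic resume-from-checkpoint control loop (illustrative, framework-agnostic).
# A transient failure costs at most one checkpoint interval, not the whole run.
import time

CHECKPOINT_EVERY = 1_000   # steps between checkpoints (assumed)
TOTAL_STEPS = 100_000      # length of the run (assumed)

def train_run(load_latest_checkpoint, train_steps, save_checkpoint):
    step, state = load_latest_checkpoint()      # (0, fresh state) on a first run
    while step < TOTAL_STEPS:
        try:
            state = train_steps(state, step, CHECKPOINT_EVERY)
            step += CHECKPOINT_EVERY
            save_checkpoint(step, state)        # durable write before moving on
        except RuntimeError as err:             # e.g. a node or link failure
            print(f"failure at step {step}: {err}; resuming from last checkpoint")
            time.sleep(30)                      # give the fleet time to recover
            step, state = load_latest_checkpoint()
```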
Storage innovations: beating the capacity wall
Here’s a part I liked — and I say this as someone who’s watched teams choke on IO. Huawei showcased next-generation SSDs designed to push past the so-called capacity wall. By combining chip packaging innovations with system-level design, they’re aiming for denser, faster persistent storage for petabyte-scale datasets and model checkpoints.
Why does that matter? For foundation models, checkpoints are massive. If your persistent storage can’t sustain throughput, the whole pipeline stalls. The combination of persistent storage for ML checkpoints and an SSD design tuned for continuous, high-throughput writes is exactly the pragmatic fix engineers need.
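A quick back-of-envelope shows why sustained write bandwidth, not just capacity, is the number to watch. Assuming, purely for illustration, a 2 TB checkpoint written every 30 minutes, this small Python sketch compares how long training stalls at different aggregate write speeds; the figures are my assumptions, not CloudMatrix or SSD specs.

```python
# Back-of-envelope: how long does a synchronous checkpoint stall training?
# All figures below are illustrative assumptions, not CloudMatrix specs.

CHECKPOINT_TB = 2.0            # checkpoint size in terabytes (assumed)
INTERVAL_MIN = 30              # checkpoint every 30 minutes (assumed)

for gb_per_s in (5, 50, 500):  # aggregate sustained write bandwidth, GB/s
    stall_s = CHECKPOINT_TB * 1_000 / gb_per_s
    overhead = stall_s / (INTERVAL_MIN * 60) * 100
    print(f"{gb_per_s:>4} GB/s -> stall {stall_s:7.1f} s "
          f"({overhead:.1f}% of the training interval)")
```

At 5 GB/s a synchronous save eats roughly a fifth of every training interval; at 500 GB/s it is effectively noise. That is the gap a denser, faster SSD tier is meant to close.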
Other notable inventions presented
- Ultra-chroma camera: Extremely accurate colour reproduction — already in the Mate 70 series. A nice reminder: Huawei’s work spans from devices to data centre physics.
- Real-time assisted driving: A line of research on real-time environment awareness, the kind of perception an assisted-driving system needs to mirror human decision-making in live traffic.
- IP and patents: Huawei emphasised scale: more than 150,000 active patents and sizable licensing revenue. That IP base feeds into their vertical integration strategy.
Geopolitics and R&D scale
We shouldn’t pretend geopolitics isn’t part of this story. Export controls and sanctions have nudged Huawei toward vertical integration — building Ascend NPUs, Kunpeng CPUs, and their own storage ecosystem. The logic is simple: if some suppliers are uncertain, owning more of the stack hedges risk and helps secure sovereign AI infrastructure.
Still — there are trade-offs. Building vertically can accelerate resilience, but success depends on software ecosystems and developer adoption. The hardware can be brilliant; without tooling and community, adoption stalls.
Investment in innovation
Numbers matter. Huawei spent 179.7 billion yuan (about US$25.2 billion) on R&D in 2024 — roughly 20.8% of revenue — and reports more than 1.24 trillion yuan of cumulative R&D investment over the past decade. That’s industrial-scale commitment, and it explains how projects like CloudMatrix, next-gen SSDs and assisted-driving systems get funded long enough to iterate.
Real-world perspective
From my time working with data-centre teams, the main pain point in scaling large models wasn’t just compute — it was orchestration of data movement, power, and error recovery across thousands of components. I remember a team training a multimodal model where checkpoints were measured in petabytes; a single delayed write (or an SSD that couldn’t sustain throughput) forced a rerun and cost weeks.
So when you hear about how CloudMatrix combines dense NPUs, Kunpeng coordination, a high-bandwidth fabric and persistent storage for ML checkpoints — that’s hitting the exact pain points. The real test, though, will be operational: sustained reliability, tooling, and whether software vendors — or open-source projects — make it easy to use.
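One common mitigation, independent of vendor, is to shard the checkpoint so every worker persists only its own slice in parallel and a single slow writer doesn’t serialise the whole save. Here is a minimal, framework-agnostic Python sketch; the paths, shard naming and thread-based concurrency are illustrative assumptions (in a real cluster each rank is its own process on its own node).

```python
# Sharded checkpoint writes: each worker persists only its own slice,
# so aggregate bandwidth scales with the number of writers.
# Paths, shard naming and thread-based concurrency are illustrative only;
# threads here merely stand in for workers that would run on separate nodes.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import pickle

def write_shard(step: int, rank: int, shard_state: dict, root: Path) -> Path:
    path = root / f"step_{step:08d}" / f"shard_{rank:05d}.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(shard_state, f)             # stand-in for a real serializer
    return path

def save_sharded(step: int, shards: dict[int, dict], root: Path) -> list[Path]:
    """shards maps worker rank -> that worker's slice of the model state."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(write_shard, step, rank, state, root)
                   for rank, state in shards.items()]
        return [f.result() for f in futures]    # block until every shard is durable

if __name__ == "__main__":
    demo = {rank: {"weights": [0.0] * 4} for rank in range(8)}
    print(save_sharded(step=1_000, shards=demo, root=Path("/tmp/ckpt_demo")))
```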
What CloudMatrix means for AI builders
- Faster training at scale: More Ascend 910C NPUs plus a tight interconnect tend to shorten time-to-model for deep-learning projects, especially foundation models.
- End-to-end co-design matters: You can’t bolt best-in-class SSDs onto a mismatched interconnect and expect seamless scaling. System-level co-design for reliability wins.
- Geopolitical resilience: Vertical integration helps mitigate exposure to export controls — but it’s not a silver bullet.
- Commercial impact: If the platform delivers operational reliability and an ecosystem, it could reduce model training costs and lower risk for production-scale AI deployments.
People often ask:
- Can CloudMatrix train GPT-style foundation models? In principle, yes; the architecture is targeted at exactly that kind of workload.
- How many Ascend 910C chips are in a CloudMatrix 384? The name gives it away: 384.
- Is CloudMatrix available to cloud customers? Huawei’s messaging hints at both on-prem and cloud variants, but adoption will depend on software support and provider offerings.
- How does an Ascend NPU compare to an Nvidia GPU? They’re different design points: NPUs tuned for certain matrix ops and tight system integration versus GPUs with broad ecosystem support.
One final thought: specs are impressive, but the long game is software, deployment tooling and the developer community. The hardware only becomes transformative when people actually build with it — and keep it running.
For additional context on multi-cloud and compute sourcing for large-scale model training, see our guide to Nvidia supercomputers.