An Introduction to the ARM-Based Server Ecosystem

From Mobile Roots to Data Center Ambitions

ARM Holdings, known for the low-power, high-density processor designs that dominate the mobile device market, has set its sights on the server space to address the growing demand for more efficient and sustainable compute resources. ARM's RISC (Reduced Instruction Set Computing) architecture, characterized by its simplicity, compact design, and low power consumption, presents an attractive proposition for server manufacturers seeking alternatives to the power-hungry x86 processors prevalent in the industry.

Public cloud service providers (CSPs), unlike traditional server Original Equipment Manufacturers (OEMs), have distinct concerns. Companies such as Amazon Web Services (AWS) sell IT organizations business and operational models that promise enhanced efficiency. The underlying server technology, while significant, matters less to customers than the convenience and flexibility these companies offer. CSPs themselves are primarily preoccupied with operational efficiency. Their profit margins rise and fall with operational expenses, a basic accounting reality that is amplified in hyperscale data centers, where costs can escalate rapidly. Among the most substantial operational expenditures for hyperscalers are those associated with power consumption and cooling infrastructure.

What sets the Arm processor architecture apart is its ability to deliver high performance at astonishingly efficient energy levels. Arm's energy efficiency narrative has underpinned its dominance in the embedded and smartphone markets. This same emphasis on energy efficiency also prompted Apple to transition all of its Intel-based products to internally designed Arm-based processors. Thus, it is hardly surprising that energy efficiency plays a pivotal role for Arm in the cloud domain as well.

Technological Advancements in ARM Server Chips

Performance and Efficiency

ARM-based server processors have undergone substantial architectural enhancements to meet the stringent requirements of data center workloads. These advancements include:

Multi-core Scalability: Modern ARM server chips feature a high core count, often surpassing those found in comparable x86 CPUs. This enables parallel processing of multiple tasks, enhancing overall system throughput while maintaining power efficiency.
Hardware Acceleration: ARM server SoCs integrate accelerators for tasks such as cryptography, machine learning, and networking, offloading these functions from the CPU cores to boost performance and reduce power consumption.
Memory Subsystems: Advanced memory controllers and support for high-speed memory and interconnect technologies such as DDR5, HBM (High-Bandwidth Memory), and CXL (Compute Express Link) enable faster data transfer and improved memory bandwidth, crucial for demanding workloads.

Software Ecosystem Maturity

A critical factor in the adoption of any new server platform is the availability and compatibility of software. ARM has made significant strides in this area:

Operating System Support: Major OS vendors, including Linux distributions (e.g., Ubuntu, Red Hat Enterprise Linux, and CentOS), FreeBSD, and Windows Server, now offer native or certified support for ARM-based servers.
Runtime Environments and Frameworks: Java, Python, Node.js, .NET, and other popular runtime environments have been ported or optimized for ARM64 architectures, ensuring broad application compatibility.
Cloud Native Compatibility: ARM servers seamlessly integrate with cloud-native technologies like Kubernetes, Docker, and container orchestration platforms, enabling easy deployment and management in modern data centers.
Developer Tools and Libraries: Tools such as Arm Development Studio provide comprehensive toolchains, debuggers, and profilers tailored for ARM-based systems, facilitating software development and optimization.
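
As a small illustration of what this runtime support means in practice, the sketch below shows how a Python deployment script might normalize the machine name reported by the operating system. Linux reports 64-bit ARM as `aarch64`, while macOS and most container tooling use `arm64`. The function names `normalize_arch` and `artifact_name`, the alias table, and the artifact naming scheme are assumptions made for this example, not part of any standard API.

```python
import platform

# Hypothetical alias table: OSes and tools name the same ISA differently.
_ARCH_ALIASES = {
    "aarch64": "arm64",
    "arm64": "arm64",
    "x86_64": "amd64",
    "amd64": "amd64",
}


def normalize_arch(machine: str) -> str:
    """Map an OS-reported machine string to one canonical architecture name."""
    key = machine.lower()
    return _ARCH_ALIASES.get(key, key)


def artifact_name(package: str, machine: str) -> str:
    """Build an (assumed) per-architecture artifact file name."""
    return f"{package}-linux-{normalize_arch(machine)}.tar.gz"


if __name__ == "__main__":
    # platform.machine() returns e.g. "aarch64" on an ARM Linux server.
    print(artifact_name("myservice", platform.machine()))
```

Keeping the architecture in one normalized variable like this is one simple way to make build and deployment scripts processor-agnostic.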

ARM ISA

The ARM Instruction Set Architecture (ISA) serves as the foundation for the design of microprocessors based on the ARM architecture, dictating the instructions that these processors understand and execute. Two significant iterations of this ISA, ARMv8 and ARMv9, have shaped the evolution of ARM-based systems in recent years, each introducing key advancements and features to meet the demands of an ever-evolving technological landscape.

ARMv8

Introduced in 2011, ARMv8 marked a major milestone in the history of the ARM architecture by introducing support for 64-bit processing. This was a significant departure from the previous 32-bit ARMv7 architecture and was crucial for ARM to compete effectively in the domains of high-performance computing, servers, and data centers, where the demand for increased memory addressing capabilities and computational power had been growing.

Key features introduced in ARMv8 include:

AArch64: A new execution state and instruction set that provides a clean break from the 32-bit past, supporting 64-bit general-purpose registers, larger virtual and physical address spaces, and more comprehensive instruction encoding schemes for improved performance and efficiency.
Advanced SIMD (NEON): Enhanced vector processing capabilities, enabling acceleration of multimedia, signal processing, and machine learning workloads.
Security Extensions (TrustZone): Carried forward from earlier ARM architectures rather than new in ARMv8, TrustZone provides hardware-enforced security in both the AArch32 and AArch64 execution states, facilitating secure boot, secure storage, and isolated execution environments for sensitive operations.
Virtualization Extensions: Support for efficient hardware-assisted virtualization, allowing multiple virtual machines to run concurrently on a single system with minimal overhead, catering to the needs of cloud computing and containerized environments.
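
On Linux, many of these capabilities are visible to software as flags on the `Features` line of `/proc/cpuinfo`. The following is a minimal sketch, assuming a Linux arm64 system, of how a script might check for a few of them. The flag names (`asimd`, `aes`, `sha2`, `atomics`, `sve`) are the ones the arm64 kernel prints; the helper names and the selection of flags are choices made for this example.

```python
from pathlib import Path

# Flag names as printed by the Linux arm64 kernel in /proc/cpuinfo.
FLAGS_OF_INTEREST = {
    "asimd": "Advanced SIMD (NEON)",
    "aes": "AES instructions",
    "sha2": "SHA-2 instructions",
    "atomics": "Large System Extensions (LSE) atomics",
    "sve": "Scalable Vector Extension",
}


def parse_features(cpuinfo_text: str) -> set[str]:
    """Return the flags from the first 'Features' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


def describe(cpuinfo_text: str) -> dict[str, bool]:
    """Report which of the flags of interest are present."""
    present = parse_features(cpuinfo_text)
    return {name: flag in present for flag, name in FLAGS_OF_INTEREST.items()}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    for name, available in describe(text).items():
        print(f"{name}: {'yes' if available else 'no'}")
```

On a non-ARM or non-Linux machine the script simply reports every feature as absent, since no `Features` line is found.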

ARMv9

Announced in 2021, ARMv9 builds upon the success of ARMv8 while introducing several groundbreaking features and enhancements to address emerging trends in computing, such as artificial intelligence, machine learning, and enhanced security requirements.

Key features introduced in ARMv9 include:

SVE2 (Scalable Vector Extension 2): An evolution of the SVE extension found in some ARMv8-A processors, SVE2 offers broader vector processing support with variable-length vectors, enabling more efficient execution of a wide range of compute-intensive workloads, particularly in the areas of AI, ML, and digital signal processing.
MTE (Memory Tagging Extension): A security feature, first defined in Armv8.5-A and carried into ARMv9, that adds fine-grained memory protection by assigning small tags to memory allocations and to the pointers that reference them. A mismatch between the pointer tag and the memory tag signals an error, which helps detect memory safety vulnerabilities such as buffer overflows and use-after-free errors without significantly impacting performance.
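CryptoIsland Technology: A dedicated security subsystem that Arm offers as separate IP rather than as part of the ISA itself, designed to provide highly secure, hardware-based cryptographic services for sensitive data and cryptographic operations. Within the ISA, the main ARMv9 security addition is the Confidential Compute Architecture (CCA), which isolates workloads in hardware-protected Realms.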
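
To make the memory tagging idea concrete, here is a toy software model of the check that MTE performs in hardware. It assumes the architectural parameters of MTE (4-bit tags, one tag per 16-byte granule, pointer tag held in address bits 59:56); everything else, including the class and method names, is hypothetical and exists only for illustration. Real MTE is enforced by the CPU and memory system, not by code like this.

```python
GRANULE = 16      # MTE tags memory in 16-byte granules
TAG_BITS = 4      # each tag is 4 bits, so there are 16 possible values
TAG_SHIFT = 56    # the pointer's tag lives in address bits 59:56


class TagCheckFault(Exception):
    """Raised when a pointer's tag does not match the memory's tag."""


class TaggedMemory:
    """Toy model: a byte array plus one tag per 16-byte granule."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(size)
        self.tags = [0] * (size // GRANULE)

    def allocate(self, addr: int, length: int, tag: int) -> int:
        """Tag the granules covering [addr, addr + length); return a tagged pointer."""
        first = addr // GRANULE
        last = (addr + length - 1) // GRANULE
        for granule in range(first, last + 1):
            self.tags[granule] = tag
        return (tag << TAG_SHIFT) | addr

    def load(self, pointer: int) -> int:
        """Read one byte, faulting if pointer tag and memory tag differ."""
        tag = (pointer >> TAG_SHIFT) & ((1 << TAG_BITS) - 1)
        addr = pointer & ((1 << TAG_SHIFT) - 1)
        if self.tags[addr // GRANULE] != tag:
            raise TagCheckFault(f"tag mismatch at address {addr:#x}")
        return self.data[addr]


if __name__ == "__main__":
    mem = TaggedMemory(64)
    p = mem.allocate(0, 16, tag=0x3)
    print(mem.load(p + 15))          # in bounds: tags match
    try:
        mem.load(p + 16)             # overflow into the next granule
    except TagCheckFault as err:
        print("caught:", err)
```

In this model a buffer overflow lands in a granule with a different tag and faults, and a use-after-free can be caught the same way if the allocator retags memory when it is freed or reused.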

Introduction to Neoverse: ARM's Dedicated Server Platform

A key component in ARM's strategic push into the server market is the Neoverse platform, a purpose-built product line specifically designed to address the unique demands of cloud, hyperscale, and edge computing infrastructures. Neoverse represents ARM's commitment to delivering best-in-class performance, efficiency, and scalability for data center workloads, further solidifying the company's position as a formidable player in the server processor arena.

Recent Neoverse cores (N2, V2, and later) are built upon ARM's v9 instruction set architecture (ISA), which introduces several features to enhance security, vector processing, and artificial intelligence/machine learning (AI/ML) capabilities. The first-generation N1 and V1 cores implement ARMv8-A.

Neoverse Product Lines

The Neoverse platform encompasses three distinct product lines: Neoverse V-series, Neoverse N-series, and Neoverse E-series, each catering to different segments of the server market.

Neoverse V-series: Targeted at high-performance computing (HPC), cloud gaming, and other compute-intensive workloads, the V-series is characterized by its focus on raw compute power and vector processing capabilities. It features larger core designs with wider vector units and higher clock speeds, making it ideal for applications that benefit from strong single-threaded performance. The V-series excels in scenarios where high floating-point performance and low latency are critical, such as scientific simulations, financial modeling, and real-time graphics rendering.
Neoverse N-series: Focused on cloud infrastructure, edge computing, and network processing, the N-series prioritizes power efficiency and core density. It offers highly scalable designs with a large number of smaller cores, optimized for parallel workloads and exceptional throughput-per-watt. The N-series is well-suited for tasks that can be effectively distributed across many threads, such as web serving, data analytics, and content delivery networks. Its energy-efficient profile ensures cost-effective operation in environments with stringent power constraints or those requiring high compute capacity within a limited thermal envelope.
Neoverse E-series: Designed specifically for edge and enterprise applications, the E-series strikes a balance between performance and power efficiency. It targets use cases where a combination of single-threaded performance and multi-threaded scalability is required, such as real-time data processing, edge inference, and virtualized network functions. The E-series cores offer a medium-sized core design with a balanced mix of integer, floating-point, and vector processing capabilities, providing competitive performance per watt and a versatile solution for workloads with varying compute profiles. This series is particularly attractive for customers seeking a flexible platform that can adapt to diverse application demands while maintaining efficient energy consumption.

Neoverse V1:

The first generation in the V-series, Neoverse V1, focuses on delivering exceptional single-threaded performance and vector processing capabilities. Key features include:

High-performance cores: Large, powerful cores designed for maximum compute efficiency, with wide vector units to handle demanding floating-point and vectorized workloads.
Scalable Vector Extension (SVE): Supports the ARM Scalable Vector Extension, which allows for variable-length vector processing, enhancing performance in scientific computing, machine learning, and signal processing applications.
High clock speeds: Optimized for operating at high frequencies, enabling rapid execution of complex instructions and reducing latency in time-sensitive applications.
Advanced memory subsystem: Incorporates features such as large L2 caches, high-bandwidth interconnects, and support for DDR5 memory to facilitate fast data access and efficient memory management.
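
The "variable-length" aspect of SVE means that the same loop runs unchanged on hardware with different vector widths. The sketch below simulates that idea in plain Python; it is not SVE code. It assumes 64-bit lanes and models the per-iteration predicate (in the spirit of SVE's WHILELT instruction) as a list of booleans. The function name `vla_add` is hypothetical.

```python
def vla_add(a: list[float], b: list[float], vector_bits: int) -> list[float]:
    """Add two arrays the way a vector-length-agnostic loop would.

    vector_bits models the hardware vector length. SVE allows any multiple
    of 128 bits up to 2048. Lanes are assumed to hold 64-bit elements.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if vector_bits % 128 != 0 or not 128 <= vector_bits <= 2048:
        raise ValueError("vector length must be a multiple of 128 in [128, 2048]")

    lanes = vector_bits // 64
    out = [0.0] * len(a)
    i = 0
    while i < len(a):
        # Predicate: which lanes still fall inside the array on this iteration.
        active = [i + lane < len(a) for lane in range(lanes)]
        for lane, enabled in enumerate(active):
            if enabled:
                out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes  # advance by however many lanes this "hardware" has
    return out


if __name__ == "__main__":
    x = [float(n) for n in range(10)]
    y = [1.0] * 10
    # The result is identical regardless of the simulated vector length.
    assert vla_add(x, y, 128) == vla_add(x, y, 512) == vla_add(x, y, 2048)
    print(vla_add(x, y, 256))
```

Because the tail of the array is handled by the predicate rather than by a separate scalar loop, the code never needs to know the vector length at compile time.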

Neoverse V2:

The second generation in the V-series, Neoverse V2, builds upon the strengths of V1 while introducing additional enhancements and optimizations for even higher performance and efficiency. Key improvements include:

Enhanced core design: Further refinement of the core architecture, potentially featuring wider vector units, increased IPC (instructions per cycle), and improved branch prediction, resulting in even stronger single-threaded performance.
Updated vector extensions: May incorporate the next iteration of the Scalable Vector Extension (e.g., SVE2), offering expanded functionality and improved performance for vectorized workloads, particularly in AI/ML and data analytics domains.
Advanced power management: Introduction of power-saving techniques and more granular control over power consumption, allowing for better performance-per-watt without sacrificing raw compute power.
Increased memory bandwidth and I/O capabilities: Support for faster memory technologies (e.g., HBM2E or DDR5-5600), along with improved I/O interfaces, ensuring that the processor can efficiently handle the massive data flows characteristic of HPC applications.
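
For a sense of scale, theoretical peak DRAM bandwidth follows directly from the transfer rate: transfers per second times bytes per transfer times the number of channels. The short calculation below applies this to DDR5-5600, assuming a 64-bit (8-byte) data path per channel. The eight-channel configuration is a hypothetical example, not a specification of any Neoverse-based product, and sustained bandwidth in practice is lower than the theoretical peak.

```python
def peak_bandwidth_gbs(transfers_mt_s: int, channels: int = 1, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes).

    transfers_mt_s: data rate in megatransfers per second (5600 for DDR5-5600)
    bus_bytes: data width per channel in bytes (assumed 8, i.e. 64 bits)
    """
    return transfers_mt_s * 1e6 * bus_bytes * channels / 1e9


if __name__ == "__main__":
    print(f"{peak_bandwidth_gbs(5600):.1f} GB/s per channel")            # 44.8
    print(f"{peak_bandwidth_gbs(5600, channels=8):.1f} GB/s, 8 channels")  # 358.4
```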

Neoverse V3:

Arm released the latest generation of Neoverse products on February 21, 2024.

The Arm Neoverse V3 CPU is built to deliver maximum performance for cloud applications, high-performance computing (HPC), and machine learning (ML) workloads. Neoverse V3 delivers double-digit performance improvements over Neoverse V2 on cloud and ML applications. It is the first Neoverse CPU to support the Arm Confidential Compute Architecture (CCA).

The Neoverse V3 processor implements the Armv9.2-A architecture, and its core interface uses DSU-120. There is no official figure for the overall per-core performance improvement, but some analysts estimate a gain of roughly 10 to 20% over the previous generation.

Its MMU retains the classic two-level TLB structure. The more detailed microarchitecture manuals say little beyond that: traditional techniques such as the translation cache, entry aggregation, and prefetching are still present, and the L2 TLB shows no obvious changes in this third-generation microarchitecture. The biggest change is in the L1 TLBs: the ITLB grows from 48 entries in V2 to 128 entries in V3, while the DTLB grows from 48 entries to 96. The large ITLB increase is the most distinctive change, and it may be intended to cope with the more frequent changes of instruction addresses in AI scenarios. This may also be why ARM places more emphasis on the analysis of AI scenarios in V3.
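
One way to read these entry counts is through "TLB reach", the amount of memory that can be translated without a TLB miss, which is the number of entries multiplied by the page size. The calculation below uses the L1 TLB entry counts quoted above and assumes that every entry maps a single 4 KiB page. With larger pages the reach grows proportionally, so these figures are a lower bound for illustration rather than a statement about real workloads.

```python
def tlb_reach_kib(entries: int, page_kib: int = 4) -> int:
    """Memory covered by a TLB, assuming one page of page_kib KiB per entry."""
    return entries * page_kib


# (V2 entries, V3 entries), as quoted in the text above.
L1_TLB = {"ITLB": (48, 128), "DTLB": (48, 96)}

for name, (v2, v3) in L1_TLB.items():
    print(f"{name}: {tlb_reach_kib(v2)} KiB -> {tlb_reach_kib(v3)} KiB ({v3 / v2:.2f}x)")

# ITLB: 192 KiB -> 512 KiB (2.67x)
# DTLB: 192 KiB -> 384 KiB (2.00x)
```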

Arm highlighted the huge performance improvements in AI data analysis for its next-generation products.

Across the three generations of the Neoverse architecture, the trend is clear: more cores, larger L2 cache/LLC, and more powerful vector extensions.

GPUs, NPUs, and Accelerators in the ARM Ecosystem

The ARM ecosystem, characterized by its widespread adoption in mobile devices, embedded systems, and increasingly in servers and edge computing, benefits from a diverse array of GPU, NPU, and accelerator technologies. These specialized components enable optimized performance, energy efficiency, and workload-specific capabilities, contributing to the growing demand for heterogeneous computing architectures.

GPUs for ARM-Based Systems

ARM Mali GPUs: ARM itself develops and licenses the Mali series of GPUs, which are widely used in SoCs designed for mobile devices, set-top boxes, smart TVs, and automotive applications. Mali GPUs are based on successive architectures, such as Midgard, Bifrost, and Valhall, with products tailored to varying performance and power requirements. The older Mali-T series is based on Midgard, while the Mali-G series, built on Bifrost and Valhall, spans entry-level to high-end mobile graphics. ARM continually updates its GPU designs to support the latest graphics and compute standards, such as Vulkan, OpenGL ES, and OpenCL, ensuring compatibility with modern applications and games.

Third-party GPU Solutions: ARM's licensing model allows for flexibility in choosing GPU partners. Companies like Imagination Technologies and AMD provide alternative GPU IP that can be integrated into ARM-based SoCs. For example, Imagination's PowerVR series has been used in several ARM-based chips, offering competitive performance and power efficiency. AMD, following its acquisition of Xilinx, could potentially bring its Radeon graphics technology to the ARM ecosystem through custom-designed solutions or IP licensing.

NPUs for AI Workloads

ARM Ethos NPUs: ARM's Ethos line of NPUs is purpose-built to accelerate machine learning inference tasks, particularly in mobile and edge devices. Ethos NPUs are designed to work seamlessly with ARM Cortex CPUs and Mali GPUs, forming a comprehensive AI processing solution. They offer high performance per watt, optimized for tasks like image recognition, natural language processing, and computer vision, allowing devices to execute AI algorithms locally without relying on cloud connectivity.

Partner NPUs: ARM-based SoC designers may also opt for third-party NPU solutions. For instance, Huawei's Kirin chips integrate the Da Vinci NPU architecture, which is optimized for on-device AI processing. Qualcomm's Snapdragon SoCs feature the Hexagon DSP, which serves as a vector processor and can be utilized for AI inference tasks in conjunction with the Snapdragon Neural Processing Engine (NPE).

Accelerators and Co-processors

The ARM ecosystem is rich in accelerators and co-processors that enhance system performance for specific workloads:

DSPs and ISPs: ARM provides licensable digital signal processor (DSP) and image signal processor (ISP) IP for audio, video, and imaging tasks. These accelerators offload compute-intensive operations, such as noise reduction, color correction, and video encoding/decoding, from the main CPU, improving overall system efficiency.

Security Accelerators: With increasing concerns over device security, ARM offers TrustZone technology and Secure Processing Units (SPUs) to provide hardware-based security features, isolating sensitive operations and data from the rest of the system.

FPGAs and ASICs: Partners like Xilinx (now part of AMD) and Intel (through its acquisition of Altera) offer field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs) that can be tailored to specific workloads, such as network processing, data compression, or custom AI algorithms. These flexible accelerators can be integrated into ARM-based systems to boost performance and efficiency for targeted applications.

Edge Computing Accelerators: ARM-based platforms often incorporate accelerators for edge computing scenarios, such as machine learning inference, computer vision, and sensor fusion. For example, Qualcomm's Snapdragon Auto platforms include dedicated computer vision and AI accelerators for advanced driver assistance systems (ADAS) and autonomous driving applications.

In conclusion, the ARM ecosystem thrives on the availability of diverse GPU, NPU, and accelerator technologies, both from ARM itself and its numerous partners. This broad selection enables SoC designers to create highly customized solutions tailored to the unique requirements of various markets, from low-power wearables to high-performance servers. As AI, machine learning, and data-intensive applications continue to drive innovation in the tech industry, the ARM ecosystem's ability to accommodate a wide range of specialized compute components will remain a key factor in its ongoing success and relevance.

Industry Adoption and Collaborations

Leading semiconductor companies and system-on-chip (SoC) designers have embraced the Neoverse platform to develop custom server solutions for various markets. Examples include Amazon Web Services (Graviton3, based on Neoverse V1), Ampere Computing (Altra and Altra Max, based on Neoverse N1), Marvell (OCTEON Fusion, based on Neoverse N2), Microsoft Azure (Cobalt 100, based on Neoverse N2), and Google Cloud (Axion, based on Neoverse V2). These collaborations demonstrate the versatility and adaptability of Neoverse architectures in addressing diverse data center requirements.

Major cloud providers like AWS, Microsoft Azure, Google Cloud Platform, and Alibaba Cloud have introduced ARM-based instances, allowing customers to leverage the benefits of ARM technology without investing in dedicated hardware. This not only expands the reach of ARM servers but also fosters a diverse and competitive ecosystem within the cloud market.

In 2022, when Arm released the Neoverse V2, it specifically mentioned Alibaba's Yitian 710 (based on Neoverse N2) as the first CPU to score above 500 on the SPEC CPU2017 Integer Rate.

Perspectives and Challenges

Neoverse, the reference platform for ARM servers, has greatly accelerated ARM's technological progress in the server field, making ARM servers a strong competitor to x86 servers that cannot be ignored, with a market share that is likely to keep rising. ARM's flexibility and openness allow manufacturers to adapt CPU designs to market demand. A vendor can make minor modifications on top of the Neoverse platform, or, as Ampere did, use the Neoverse reference design to prove out its own design and then move on to a self-developed server CPU microarchitecture. This kind of flexibility is something that traditional x86 cannot offer.

About 40% of Arm servers are currently in use in China. However, Huawei has been placed on the US government's blacklist, which prevents it from accessing the advanced manufacturing processes offered by TSMC. Consequently, it is unable to offer processors that can compete with market leaders in terms of performance. Alibaba's T-Head, like other Chinese companies, can still use TSMC's latest process technologies, but US and UK export control regulations and the Wassenaar Arrangement restrict it from licensing Arm's Neoverse V-series CPU cores for high-performance computing purposes. In addition, the complex relationship between Arm China and Arm adds a layer of uncertainty to Arm's market expansion in China.

Ironically, Arm's foray into the server domain has paved the way for a proliferation of alternative architectures. The adoption of Arm within the public cloud ecosystem demonstrates the feasibility of Arm-based servers. Some organizations have adapted their software and operational toolchains to be processor-agnostic, thereby lowering the barrier to entry for new processor architectures, such as RISC-V. Numerous chip designers, including Huawei, are closely watching RISC-V. It is a rapidly evolving open-source Instruction Set Architecture (ISA) that is unencumbered by restrictions and can accommodate highly customized general-purpose cores tailored for very specific workloads.

Meanwhile, the Arm ISA enjoys the backing of influential companies like AWS, Google, Nvidia, Microsoft, Qualcomm, and Samsung, all of which understand how to ensure software ecosystem support for their CPUs. Consequently, Arm adoption is poised to expand further across the data center landscape in the coming years.

RISC-V does not require the payment of royalties or licensing fees, which could squeeze further profit out of the value chain. On the other hand, unlike Arm, RISC-V is not centrally governed, making it nearly impossible to guarantee interoperability among different RISC-V implementations. The result is fragmentation: there are many small RISC-V CPU vendors today, and each vendor's CPUs may have different characteristics, which complicates software porting and observability.

Given the advantages described above, I believe that Arm-based CPUs will continue to expand their share and strengthen their position in the server market. In the medium to long term (perhaps 5 to 10 years from now), however, they may be challenged or even replaced by RISC-V.
