“Five years ago, as we at Qualcomm Datacenter Technologies began to map out our strategy to enter the datacenter market, we recognized the then-nascent trend toward cloud computing and designed a technology roadmap with cloud computing in mind. The cornerstone design point for our datacenter server product roadmap was to deliver right-sized solutions, optimizing throughput performance and efficiency for emerging multi-core cloud workloads. Cloud services need to perform well in highly loaded, multi-tenant environments, and the hardware platform needs to maximize aggregate compute performance while reducing the cloud operator’s operational costs, which are largely driven by the cost of power and cooling.
At the Hot Chips Conference this week, we are unveiling details of the Qualcomm Falkor CPU core. Falkor is the custom CPU design at the heart of the Qualcomm Centriq 2400 SoC, the world’s first 10nm server processor, which will begin shipping commercially later this year. Here are some of the key highlights about the Falkor core that will be shared at the conference:
- Fully custom core design: Falkor was designed from the ground up specifically for the cloud datacenter server market. A pure 64-bit micro-architecture, Falkor is fully ARMv8 compliant. Our processor design team has a rich history of delivering high-performance, yet power-efficient, custom ARM CPUs for mobile platforms, and has brought this world-class design expertise to architect a CPU core specifically designed to support the features and performance demanded by cloud datacenters.
- Scalable building block: The Falkor core duplex includes two custom Falkor CPUs, a shared L2 cache and a shared bus interface to the Qualcomm System Bus (QSB) ring interconnect. This modular building block serves as the foundation for our highly scalable 48-core Qualcomm Centriq 2400 SoC design.
- Designed for performance, optimized for power: Falkor is designed to deliver high-end compute performance using a 4-issue, 8-dispatch heterogeneous pipeline. Falkor’s heterogeneous pipeline is designed to optimize performance per unit of power, with variable length pipelines that are tuned per function to maximize throughput and minimize idle hardware. Additionally, Falkor’s out-of-order and rename resources are sized to prevent instruction retirement from being in the performance-critical path, allowing unbridled usage of the multiple execution engines. Other performance-critical elements of the micro-architecture, such as branch prediction algorithms and the cache hierarchy, are state-of-the-art for today’s server class processors. A plethora of sophisticated power management techniques were baked into the design from day one, including such mechanisms as independent p-state control for each of the CPUs and L2, with entry to and exit from low-power states controlled by hardware state machines for ultra-fast state transitions, and hardware state retention for power-collapsed sleep states with ultra-fast recovery.
- Performance under memory-intensive workloads: Falkor is designed to fulfill the demand for larger instruction footprints using an innovative split instruction cache composed of a single-cycle, low-power 24KB L0 I-cache complementing its 64KB L1 I-cache. The two caches are managed exclusively, meaning a given line resides in only one of them at a time, so their capacities add up to a total of 88KB of low-latency I-cache. Optimizing data performance where it’s critical, the core supports a 32KB L1 D-cache with an impressive 3-cycle load-use latency. The L1 D-cache is augmented by a sophisticated multi-level hardware prefetch engine that dynamically adapts to system conditions.
- Datacenter features: Falkor is loaded with the features and functionality required for the cloud service environment, ready for multi-tenant and other virtualized workloads with the full suite of ARM Exception Levels (EL0-EL3) and the TrustZone secure execution environment. Falkor supports the ARMv8 instruction extensions that accelerate the cryptographic transform and secure hash operations needed for efficient performance when running network security protocols such as HTTPS. Falkor also delivers the RAS mechanisms needed to keep a datacenter running, such as fault isolation, reporting, and handling techniques.
- System on a chip: The 48 Falkor CPUs are brought together in a fully integrated SoC, obviating the real estate, power, and cost of a separate chipset. The memory efficiency of the Falkor CPU is extended throughout the SoC with an extremely high-bandwidth, low-latency ring interconnect extending out to its large L3 cache and multiple memory controllers, avoiding on-die NUMA effects. In addition, the memory subsystem includes innovative shared resource management techniques such as L3 Quality of Service (QoS) extensions and effective memory bandwidth enhancement via in-line, transparent memory compression. The SoC also supports the most robust form of secure boot in the industry: an on-die, hardware-based immutable root of trust that authenticates firmware before the first line of firmware is ever executed.
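As a rough illustration of the exclusive L0/L1 instruction-cache arrangement described in the list above, the following Python sketch models two cache levels that never hold the same line, so the distinct code they can keep resident adds up to 24KB + 64KB = 88KB. This is a toy model, not Falkor’s actual design: the 64-byte line size, the fully associative LRU organization, the policy of demoting L0 victims into L1, and all class and method names are assumptions made for illustration only.

```python
from collections import OrderedDict

LINE_BYTES = 64  # ASSUMPTION: line size is not stated in the source


class ExclusiveICache:
    """Toy two-level exclusive cache (hypothetical model, fully associative LRU)."""

    def __init__(self, l0_bytes=24 * 1024, l1_bytes=64 * 1024, line_bytes=LINE_BYTES):
        self.line_bytes = line_bytes
        self.l0_lines = l0_bytes // line_bytes
        self.l1_lines = l1_bytes // line_bytes
        self.l0 = OrderedDict()  # oldest entry first
        self.l1 = OrderedDict()

    def fetch(self, addr):
        """Return where the line was found: 'L0', 'L1', or 'miss'."""
        line = addr // self.line_bytes
        if line in self.l0:
            self.l0.move_to_end(line)  # refresh LRU position
            return "L0"
        if line in self.l1:
            del self.l1[line]          # promote: the line leaves L1 (exclusivity)
            self._fill_l0(line)
            return "L1"
        self._fill_l0(line)            # miss: allocate in L0 only
        return "miss"

    def _fill_l0(self, line):
        self.l0[line] = True
        if len(self.l0) > self.l0_lines:
            victim, _ = self.l0.popitem(last=False)  # evict oldest L0 line
            self.l1[victim] = True                   # demote it into L1
            if len(self.l1) > self.l1_lines:
                self.l1.popitem(last=False)          # L1 victim is dropped

    def resident_bytes(self):
        return (len(self.l0) + len(self.l1)) * self.line_bytes


if __name__ == "__main__":
    cache = ExclusiveICache()
    footprint_lines = (88 * 1024) // LINE_BYTES
    for i in range(footprint_lines):          # first pass warms both levels
        cache.fetch(i * LINE_BYTES)
    second_pass = [cache.fetch(i * LINE_BYTES) for i in range(footprint_lines)]
    print(cache.resident_bytes(), second_pass.count("miss"))
```

In this model, an 88KB loop of code fits entirely across the two levels after one warm-up pass, whereas an inclusive arrangement of the same two arrays would hold at most 64KB of distinct lines.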
ARM provides an ideal foundation for a CPU that will address the needs of the cloud age of computing. Falkor, designed from its inception as a server-class CPU core, has established the launching point for a Qualcomm Datacenter Technologies product roadmap tailored to the emerging demands of highly scalable, performant, and power-efficient servers that will fuel the next wave of cloud datacenters.
We look forward to unveiling more details about the Qualcomm Centriq 2400 SoC architecture and product specifications in the coming months, and to providing a competitive advantage to cloud service providers looking for solutions.”