Has Arm Closed the Gap on Intel’s x86?

By Rick Merritt

The Neoverse N1 and E1 cores geared for the 7-nm node mark another step deeper into sockets for infrastructure systems dominated by Intel's x86.

SAN JOSE, Calif. — The first two cores of Arm’s Neoverse line for servers and communications systems narrow but don’t close its performance gap with the x86. The N1 and E1 cores exceed Arm’s target of 30% annual gains, but one analyst said that SoCs still could be as much as 30% to 40% behind the performance of Intel Xeon processors.

The two cores target the 7-nm node and add significant changes for servers and comms gear. Arm claims that it has plenty of headroom to continue adding at least 30% annual gains with its 7-nm-plus Zeus and 5-nm Poseidon cores in the works for 2020 and 2021, respectively.
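As a rough illustration of what the 30% annual target means against the analyst's estimated 30% to 40% deficit, the sketch below compounds the gains year over year. It assumes, unrealistically, that the x86 baseline stays fixed and that "30% behind" means 70% of baseline performance. The function name and that interpretation are illustrative assumptions, not figures from Arm or the analyst.

```python
# Sketch only: compound Arm's stated 30% annual target against a FIXED
# baseline. Real x86 parts will also improve, so this is a lower bound.

ANNUAL_GAIN = 0.30  # Arm's stated minimum annual improvement


def years_to_close(deficit: float, annual_gain: float = ANNUAL_GAIN) -> int:
    """Whole years of compounding needed to reach a fixed baseline.

    `deficit` is the fraction by which performance trails the baseline
    (assumed meaning: 0.30 -> 70% of baseline performance).
    """
    relative = 1.0 - deficit
    years = 0
    while relative < 1.0:
        relative *= 1.0 + annual_gain
        years += 1
    return years


for deficit in (0.30, 0.40):
    print(f"{deficit:.0%} behind -> {years_to_close(deficit)} year(s)")
```

Under these assumptions, either end of the estimated range takes two annual steps to reach today's baseline, which is why the pace of Intel's own roadmap matters as much as Arm's.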

To date, Arm SoCs have made significant inroads into comms gear, where Intel also has a strong presence. With the exception of storage servers, Arm has yet to gain a foothold in the mainstream server sector that the x86 dominates.

The N1 core aims to bolster Arm’s position in servers. In RTL simulations, single-core and 64-core SoCs scored 37 and 1310, respectively, on the SPECint2006 benchmark when running at 105 W.
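One way to read the two scores is as a scaling ratio: how much of the ideal 64× speedup the 64-core figure represents. The sketch below does that arithmetic. It assumes the two scores are directly comparable, which the article does not confirm (SPEC speed and rate metrics are measured differently), so treat the result as indicative only.

```python
# Sketch only: ratio of the quoted 64-core score to perfect linear scaling.
# Assumes the single-core and 64-core scores are on a comparable scale.

SINGLE_CORE_SCORE = 37
SOC_SCORE = 1310
CORES = 64

ideal = SINGLE_CORE_SCORE * CORES   # score if every core scaled perfectly
efficiency = SOC_SCORE / ideal      # fraction of ideal actually quoted

print(f"ideal: {ideal}, quoted: {SOC_SCORE}, ratio: {efficiency:.1%}")
```

Under that assumption the quoted 64-core score is a little over half of perfect linear scaling, which would be consistent with shared cache, memory bandwidth, and power limits at the SoC level.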

The N1 improved Java and memcached performance 1.7× and 2.5×, respectively, compared to the Cortex-A72 core. Going from the A72 to the N1, memory latency fell from 110 ns to 83 ns and DRAM streaming bandwidth rose from 64 GB/s to 175 GB/s.
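The memory figures are given as absolute values rather than ratios, so a short calculation puts them on the same footing as the 1.7× and 2.5× claims. This is plain arithmetic on the numbers quoted above; the variable names are illustrative.

```python
# Sketch only: convert the quoted A72 -> N1 memory figures into ratios.

A72_LATENCY_NS, N1_LATENCY_NS = 110, 83
A72_STREAM_GBPS, N1_STREAM_GBPS = 64, 175

latency_reduction = (A72_LATENCY_NS - N1_LATENCY_NS) / A72_LATENCY_NS
bandwidth_ratio = N1_STREAM_GBPS / A72_STREAM_GBPS

print(f"latency reduction: {latency_reduction:.1%}")
print(f"streaming bandwidth: {bandwidth_ratio:.2f}x")
```

The result is roughly a one-quarter cut in latency and about a 2.7× gain in streaming bandwidth.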

A 1.8-W core with a 1-Mbyte L2 cache can run up to 3.1 GHz at 1 V and fit into 1.4 mm², Arm estimates. A 64-core block, which could be typical for many SoCs, will fit into 400 mm².
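The per-core and 64-core estimates can be cross-checked with simple multiplication. The sketch below assumes the 1.4 mm² and 1.8 W figures apply unchanged to each of the 64 cores, which may not hold at a different voltage or frequency. What occupies the remaining area is not stated in the article.

```python
# Sketch only: scale Arm's per-core estimates to a 64-core block.
# Assumes every core runs at the quoted 3.1 GHz / 1 V operating point.

CORE_AREA_MM2 = 1.4
CORE_POWER_W = 1.8
CORES = 64
BLOCK_AREA_MM2 = 400

core_area_total = CORE_AREA_MM2 * CORES             # area of cores alone
core_area_share = core_area_total / BLOCK_AREA_MM2  # fraction of the block
core_power_total = CORE_POWER_W * CORES             # power of cores alone

print(f"cores: {core_area_total:.1f} mm^2 ({core_area_share:.1%} of block)")
print(f"core power at the quoted operating point: {core_power_total:.1f} W")
```

On these numbers the cores themselves account for under a quarter of the 400 mm² estimate.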

In general, Arm expects the N1 to be used in data center server SoCs running beyond 150 W and using 64 to 128 cores. At the low end, SoCs for network storage and security systems may run 8 to 32 cores at 25 W to 65 W.

Amazon and Huawei are the first big users of Arm server SoCs. Late last year, AWS started offering access to Amazon’s Graviton SoC, based on A72 cores. Huawei announced in January a 7-nm server SoC for its own systems based on 64 custom Arm cores. In August, Fujitsu described a 7-nm Arm core that it is building into a supercomputer.

Arm did not compare the new cores directly to any Intel parts. However, last year, it claimed that its then-new A76 came within 10% of the performance of Intel’s Skylake chips for notebooks.

Neoverse N1 Perf
Arm released a wealth of performance comparisons with its A72 but none with its more recent A76 or Intel’s Xeon. (Source: Arm)

For the comms-focused E1 core, Arm boosted throughput 2.7×, efficiency 2.4×, and compute performance 2.1× compared to its A53. The dual-threaded core can fit in 0.46 mm² of silicon and run at up to 2.5 GHz while consuming 183 mW.

The E1 targets a broad range of gear, including 16-core SoCs running at 15 W for gateways or 35 W for 5G base stations. At the high end, 32-core versions could run the data plane for routers with multiple 100-Gbit/s Ethernet ports.
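To put the E1's 183-mW per-core figure in context, the sketch below estimates how much of a 16-core SoC's power budget the cores alone would draw. It assumes all 16 cores consume the quoted 183 mW at once, an assumption the article does not confirm. The rest of each budget would go to caches, interconnect, I/O, and accelerators.

```python
# Sketch only: share of the quoted SoC power budgets used by 16 E1 cores.
# Assumes each core draws the quoted 183 mW simultaneously.

CORE_POWER_W = 0.183
CORES = 16
SOC_BUDGETS_W = {"gateway": 15, "5G base station": 35}

core_power_w = CORE_POWER_W * CORES
shares = {name: core_power_w / budget for name, budget in SOC_BUDGETS_W.items()}

print(f"16 cores: {core_power_w:.2f} W")
for name, share in shares.items():
    print(f"{name}: cores use {share:.1%} of the SoC budget")
```

Under that assumption the cores draw just under 3 W, a minority of either budget.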

The biggest unknown in the comms space is around so-called carrier edge networks, a new tier of the internet that AT&T and startups have discussed — but not yet built.

“There has to be compute at the edge for services to scale … but the reality is that different use cases will have different needs,” said Robert Dimond, a system architect at Arm. “We are trying to address all of these, but the next year will be an interesting battleground,” as telcos, web giants, and content companies decide what to build.

Taking a look inside the N1 and E1 cores

The N1 sports several enhancements targeting infrastructure gear that uses virtualization heavily, according to a talk by Mike Filippo, a chief processor architect at Arm.

For example, each core has a private L2 cache of up to a megabyte linked to a shared L3 cache by a mesh interconnect with a 22-ns latency. In addition, local instruction caches were made coherent.

The 11-stage pipeline can collapse to nine stages to lower latency. It can ingest four operations/cycle and issue up to eight.

Overall, the “caches were sized for large, branch-heavy infrastructure workloads with a table walker built for virtualization and low-latency switching between privilege levels for the OS and hypervisor,” Filippo said.

The N1 sports 2.2× the vector performance of the A72. It delivers an amazing 4.7× boost over the A72 on Baidu’s DeepBench test for deep learning.

Neoverse N1 Pipeline

A deep dive into the guts of the N1 core for servers. (Source: Arm)

The core sports Arm’s first software execution profiler, a block that identifies instructions causing bottlenecks. It also adds new activity monitors supporting dynamic frequency and voltage scaling. Finally, it includes changes to close Arm’s speculative-execution vulnerabilities to the Spectre and Meltdown attacks.

For its part, the E1 sports a Neon SIMD block and generally smaller L1 and L2 caches than the N1. The cores can be packed in clusters of up to eight, sharing 1 to 4 Mbytes of L3 cache.

The cores use an eight- to 10-stage out-of-order design and dual threading, in part to hide latency and reduce cache misses that stall packet processing. Accelerators for cryptography and other functions can place data directly into a core’s L2 cache.

Analysts see unclear impact on market share

It’s unclear what impact the new cores will have on Arm’s share of infrastructure markets given that multiple forces are at play. The cores debut three months after AMD revealed its 7-nm Epyc server SoC based on its new Zen core, a part reviving strong competition in x86 processors for the data center.

Meanwhile, Intel is gearing up its first 10-nm CPUs. In August, it announced a Xeon roadmap that includes proprietary links to its Optane DIMMs, which are being tested by Google and others.

In this market, Arm is “certainly getting closer to Intel, but there’s still a significant gap, especially in single-core, single-thread performance,” said Linley Gwennap of the Linley Group.

“Arm has been slowly closing the gap, but I’m not seeing Neoverse as a big leap … I was expecting the cores would have bigger, more accurate branch predictors or a wider memory pipeline or bigger load and store queues,” Gwennap said.

Long term, custom Arm cores in SoCs from designers at Ampere, Huawei, or Marvell may be the first to get significant traction in servers, he added.

Neoverse E1 pipeline

The E1 is focused on throughput and efficiency for packet processing. (Source: Arm)

Another veteran microprocessor analyst was more bullish. With the N1 and E1, “Arm has significantly extended the range of Arm cores,” said Kevin Krewell of Tirias Research.

“The improved performance of the N1 cores significantly closes the single-thread performance gap with Intel’s Core CPUs,” he said. “The N1 also adds advanced RAS, SIMD, virtualization, and security features, closing a feature gap with Intel’s Xeon.”

It’s been a 10-year journey making Arm “a first-class architecture with the x86, IBM Power, and IBM Z in Red Hat Linux,” said Jon Masters, chief Arm architect at Red Hat. That said, just two commercial Arm systems have been publicly certified for its software so far.

In networking, embedded processor vendors such as NXP and Texas Instruments have largely migrated from proprietary cores to Arm. The E1 “may not make a huge impact on market share, but it shows that Arm recognizes the networking challenges and is addressing them,” Gwennap said.
