Speculating What Makes “Apple Silicon” Fast

The recent buzz (as of around June 2020) in the tech world is that Apple has announced it will switch from Intel processors to what it calls “Apple Silicon,” which in this case is its own implementation of ARM in the A-series systems-on-chip (SoCs). Even more interesting is that the first Mac hardware to use Apple Silicon, a developer kit built around the A12Z, manages to get a better benchmark score running in emulated x86 mode than Microsoft’s Surface Pro X with the SQ1 gets running the same benchmark in native ARM mode.

Earlier reviews of Apple’s mobile products, notably the iPad Pro, paint a similar picture. For instance, in Ars Technica’s review of the 2020 iPad Pro, the tablet managed similar single-core performance in Geekbench to the 2019 16″ MacBook Pro. Granted, that is a tough pill to swallow given how many variables differ between the two machines, but it’s reasonable to assume that Geekbench at least runs the same set of tests on both. And if comparing Geekbench scores across two different architectures is completely wrong, then we also can’t compare 3DMark scores between NVIDIA, AMD, and Intel GPUs, since below the API everything is essentially different.

The general presumption I gather from the tech community is that ARM shouldn’t be able to perform this fast. That presumption comes from seeing ARM used almost exclusively in mobile and other power-limited applications. It’s flawed because ARM is an instruction set architecture (ISA), which only describes what software needs to do to interact with a processor. It doesn’t describe how the processor carries out those instructions, or how quickly. As an example, Intel’s NetBurst microarchitecture, which served the Pentium 4 and Pentium D series, used the x86 ISA, but so did the much better performing Core 2 Duo. Similarly, AMD’s Bulldozer used the x86 ISA, but so did the much higher performing Ryzen.

What we can look at when speculating about how well a processor can perform is the core itself, specifically the execution engine of that core. While we could also look at the front end, the instruction-handling portion of the core, benchmarks are likely designed to hit the ideal cases so that it doesn’t become the bottleneck. After all, if measuring processor performance is the goal, making sure all of the instructions line up for clean execution should be a must.

In any case, what does Apple’s latest implementation of its A-series processor look like? While I couldn’t find any neat diagrams, AnandTech does provide a description of what it has:

Monsoon (A11) and Vortex (A12) are extremely wide machines – with 6 integer execution pipelines among which two are complex units, two load/store units, two branch ports, and three FP/vector pipelines this gives an estimated 13 execution ports, far wider than Arm’s upcoming Cortex A76 and also wider than Samsung’s M3. In fact, assuming we’re not looking at an atypical shared port situation, Apple’s microarchitecture seems to far surpass anything else in terms of width, including desktop CPUs.

AnandTech – The iPhone XS & XS Max Review: Unveiling the Silicon Secrets by Andrei Frumusanu

For comparison, AnandTech also provided a table showing how many of each instruction type the A12 and its competitors (ARM’s A75, ARM’s A76, and Samsung’s M3) can handle and how long each takes, truncated here to just the integer side.

Compared to just the A75, the A12 can at times handle more instructions at once and, in a few cases, finish them in less time. How does this compare to, say, Intel’s Skylake?

(Skylake was chosen because Intel’s most recent processors, based on the so-called Coffee Lake microarchitecture, are basically the same as Skylake in terms of core design.)

Skylake has 8 ports in total to submit instructions to. Right off the bat, one can see that Ports 0 and 1 have a lot going on, and three of the four ports that take an integer operation share their slot with a vector or floating-point one. Even under optimal conditions, when testing nothing but integer operations, Skylake can start four of them at a time against the six integer pipelines of Apple’s A12, so the A12 is still wider.
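To make the port sharing concrete, here is a minimal Python sketch of a single issue cycle. It is a toy model and not how either core actually schedules: the port tables are simplifications (Skylake’s integer ports taken as 0, 1, 5, and 6, with 0, 1, and 5 also accepting vector work; the A12’s pipelines taken as fully separate, which AnandTech itself flags as an assumption), and the names `SKYLAKE`, `A12`, and `issued_per_cycle` are made up for illustration.

```python
# Toy model: each port accepts a set of operation kinds, and each port
# can start at most one operation per cycle. Port tables are simplified
# assumptions for illustration, not a full description of either core.
SKYLAKE = {
    "p0": {"int", "vec"},
    "p1": {"int", "vec"},
    "p5": {"int", "vec"},
    "p6": {"int", "branch"},
}

# Assumes the A12's 6 integer and 3 FP/vector pipelines are not shared.
A12 = {f"int{i}": {"int"} for i in range(6)}
A12.update({f"vec{i}": {"vec"} for i in range(3)})


def issued_per_cycle(ports, ops):
    """Greedily place ops on free ports and count how many issue.

    This is a simple greedy pass (most constrained kind first), not an
    optimal matcher, which is enough for the small cases shown here.
    """
    free = dict(ports)
    issued = 0

    def choices(kind):
        # Number of ports that could accept this kind of operation.
        return sum(kind in caps for caps in ports.values())

    for kind in sorted(ops, key=choices):
        port = next((name for name, caps in free.items() if kind in caps), None)
        if port is not None:
            del free[port]  # the port is busy for the rest of this cycle
            issued += 1
    return issued


if __name__ == "__main__":
    mixed = ["int"] * 4 + ["vec"] * 2
    print("Skylake:", issued_per_cycle(SKYLAKE, mixed))
    print("A12:    ", issued_per_cycle(A12, mixed))
```

In this model, a bundle of four integer and two vector operations fits in one cycle on the A12 table but not on the Skylake table, because the vector operations take slots the integer operations would otherwise use.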

How about in comparison to AMD’s Zen?

Zen 1 has four ports for ALU operations, and these are not shared with any other type of operation. Even so, as with Skylake, Apple’s A12 is still the wider processor.

What makes knowing these specs more interesting is that, in theory, this is the raw performance of the processor regardless of what ISA it runs. Even if I didn’t tell you what ISA these processors used and somehow made a block diagram for the A12, you might surmise that the A12 is, in theory, the better processor because it can crunch more instructions at once. Of course, how many instructions can be crunched at once is only half the equation. There’s still the issue of how long each one takes, which depends on instruction latency and clock speed.
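As a rough back-of-the-envelope sketch of that equation, the Python below multiplies the number of integer pipelines by a clock speed to get a theoretical peak. The pipeline counts are the ones quoted in this article; the 2.5 GHz clock is a hypothetical round number, the function names are made up for illustration, and the model assumes every pipeline starts one single-cycle operation every cycle, which real code rarely achieves.

```python
# Integer pipeline counts as quoted in this article.
INTEGER_PIPES = {
    "Apple A12": 6,
    "Intel Skylake": 4,
    "AMD Zen 1": 4,
}


def peak_ops_per_ns(pipes: int, clock_ghz: float) -> float:
    """Theoretical peak integer operations per nanosecond.

    Assumes every pipeline starts one operation each cycle, so this is
    an upper bound and not a prediction of real-world performance.
    """
    return pipes * clock_ghz


def clock_to_match(pipes: int, target_ops_per_ns: float) -> float:
    """Clock (GHz) a core with `pipes` pipelines needs to hit the target."""
    return target_ops_per_ns / pipes


if __name__ == "__main__":
    hypothetical_clock_ghz = 2.5  # illustrative value, not a measured spec
    target = peak_ops_per_ns(INTEGER_PIPES["Apple A12"], hypothetical_clock_ghz)
    for name, pipes in INTEGER_PIPES.items():
        needed = clock_to_match(pipes, target)
        print(f"{name}: {pipes} pipes, needs {needed:.2f} GHz for {target:.1f} ops/ns")
```

Under these assumptions, a four-wide integer engine has to clock 1.5 times higher than a six-wide one to reach the same theoretical peak, which is why width alone doesn’t settle the comparison.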

Going back to the Apple A12Z vs. the SQ1 benchmarks and focusing on just the integer side of things, the SQ1 is already at a disadvantage with the A76 core it uses. And given that Apple has an intimate understanding of x86, it can make a high-performance emulator to run on top of that wider core. All of this comes down to how Apple implemented its processors, not to the ISA they run.

Granted, this is merely a benchmark. How this performs in the real world is yet to be seen.
