How Apple Neural Engine Differs from GPU and CPU

If you’ve shipped a machine learning model on a Mac, you’ve probably noticed CoreML asking you to choose a Compute Unit: CPU only, CPU and GPU, or All. The “All” option includes a third processor most developers haven’t dealt with directly — the Apple Neural Engine (ANE, or just NE). This post walks through what the NE actually is, how it differs architecturally from the CPU and GPU, and how to think about which workloads belong on which.

The Apple M1 package (left) sits next to two LPDDR memory modules (right), the physical layout that gives Apple Silicon its unified memory advantage. Image by Henriok, CC0.

The Three Compute Units on Apple Silicon

A modern M-series chip integrates three classes of processor on a single die, all sharing one unified memory pool:

flowchart TB
    subgraph SoC["Apple Silicon SoC"]
        CPU["CPU<br/>P-cores (4-12)<br/>E-cores (4-8)"]
        GPU["GPU<br/>(10-76 cores)"]
        NE["Neural Engine<br/>(16 cores, 32 on Ultra)"]
        MEM[("Unified Memory<br/>(LPDDR, 8-192 GB)")]
        CPU --- MEM
        GPU --- MEM
        NE --- MEM
    end

All three units read and write the same physical RAM, which is the architectural superpower of Apple Silicon: there’s no PCIe transfer cost when handing data between the CPU, GPU, and NE.

What Each Unit Is Good At

A rough mental model:

Unit	Strength	Where it shines	Where it stalls
CPU	Branchy, latency-sensitive, sequential code	Control flow, single-threaded throughput, anything with unpredictable memory access	Massively parallel SIMD math
GPU	Massively parallel float32/float16 math	Graphics, large dense matrix ops, custom kernels via Metal	Heavy branching, small workloads (overhead dominates)
Neural Engine	Fixed-function neural network inference (int8, fp16)	Convolutions, attention blocks, matrix multiplies in the shapes ML models use	Custom ops, dynamic shapes, anything outside the supported op set

The Neural Engine is the most specialized of the three. Where the GPU is a flexible parallel processor that can do ML well, the NE is a fixed-function accelerator that only does ML, and that narrowness buys dramatic efficiency gains for the workloads it does support.

Why the Neural Engine Wins on Power

The headline number Apple cites is 15.8 trillion operations per second (M2 NE) or higher on later chips, but the more interesting number is power consumption. On many ML workloads, the NE delivers similar or better latency than the GPU at a fraction of the power:

Workload: SAM2 image encoding (1024×1024 input)

CPU (P-cores):  ████████████████████████████  ~3000ms,  8W avg
GPU:            ██████████                    ~ 800ms, 12W avg
Neural Engine:  ████                          ~ 310ms,  3W avg

(Numbers are illustrative — actual benchmarks vary by chip generation and model.)
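Latency and power multiply into energy per inference, which is what the battery actually sees. The short calculation below uses the illustrative figures from the chart above (placeholders from this post, not measurements) to make that concrete:

```python
# Energy per inference = average power (W) x latency (s).
# Figures are the illustrative ones from the chart above, not benchmarks.
RUNS = {
    "CPU (P-cores)": {"latency_ms": 3000, "power_w": 8},
    "GPU": {"latency_ms": 800, "power_w": 12},
    "Neural Engine": {"latency_ms": 310, "power_w": 3},
}


def energy_joules(latency_ms: float, power_w: float) -> float:
    """Return the energy in joules consumed by one inference."""
    return power_w * latency_ms / 1000.0


energy = {unit: energy_joules(**run) for unit, run in RUNS.items()}

for unit, joules in energy.items():
    ratio = joules / energy["Neural Engine"]
    print(f"{unit:<14} {joules:6.2f} J  ({ratio:4.1f}x the NE)")
```

On these placeholder numbers the NE is only about 2.6 times faster than the GPU, yet it uses roughly a tenth of the energy per image, because it is both quicker and lower-power.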

The implication for laptops is huge. If your app does heavy ML inference and you keep it on the GPU, the machine runs hotter and, on models with a fan, the fan is likely to kick on. Move the same workload to the NE and the system tends to stay cool and quiet. For a continuously running model like real-time segmentation or background photo enhancement, that’s the difference between an app users keep open and one they quit to save battery.

How CoreML Decides

When you load a model with MLModelConfiguration().computeUnits = .all, CoreML doesn’t blindly run everything on the NE. It analyzes the model graph and partitions it across all three units based on what each one supports best:

flowchart TB
    M["ML Model Graph<br/>Conv → ReLU → Custom → Conv → Softmax"]
    M --> C[CoreML graph compiler]
    C --> NE["NE<br/>Conv, ReLU, Conv"]
    C --> GPU["GPU<br/>Custom layer<br/>(no NE support)"]
    C --> CPU["CPU<br/>Softmax<br/>(small, fast on CPU)"]

The graph is sliced into segments, each segment runs on whichever unit handles it best, and CoreML automatically schedules the data movement between them. In practice, many non-trivial models end up with more than one unit active during inference, even if most of the math (say 80%) is on the NE.
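To make the slicing idea concrete, here is a toy partitioner for the five-op graph in the diagram above. It is only a sketch: the `PREFERRED_UNIT` table is a hypothetical stand-in, and CoreML's real placement logic is not public and is certainly more involved than a lookup.

```python
from itertools import groupby

# Hypothetical support table: which unit this toy model prefers per op type.
# It mirrors the diagram above; it is not CoreML's actual rule set.
PREFERRED_UNIT = {
    "Conv": "NE",
    "ReLU": "NE",
    "Custom": "GPU",
    "Softmax": "CPU",
}


def partition(ops: list[str]) -> list[tuple[str, list[str]]]:
    """Group consecutive ops that share a preferred unit into segments."""
    # Unknown ops default to the CPU, the unit of last resort.
    tagged = [(PREFERRED_UNIT.get(op, "CPU"), op) for op in ops]
    return [
        (unit, [op for _, op in group])
        for unit, group in groupby(tagged, key=lambda pair: pair[0])
    ]


graph = ["Conv", "ReLU", "Custom", "Conv", "Softmax"]
segments = partition(graph)
for unit, seg_ops in segments:
    print(f"{unit:>3}: {' -> '.join(seg_ops)}")

# Each boundary between segments is a hand-off between units.
handoffs = len(segments) - 1
print(f"{handoffs} hand-offs")
```

Note what the single custom op does here: it splits the NE work into two separate segments and adds extra hand-offs, which is why one unsupported layer in the middle of a model can cost more than its own runtime suggests.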

The decision isn’t perfect. Sometimes CoreML places a small op on the GPU when the CPU would be faster (because the round-trip overhead matters more than the per-op speed). The flag computeUnits = .cpuAndNeuralEngine tells CoreML “skip the GPU even if you think it would help” — useful when you’ve measured and the GPU path is slower for your specific model.
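"Measured" is the operative word. One way to compare settings from Python is sketched below, using coremltools' `ComputeUnit` enum, which mirrors the Swift options. The `model_path` and `sample_input` arguments are placeholders for your own model and inputs, and the prediction part only runs on macOS.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(predict: Callable[[], object],
                      warmup: int = 3, runs: int = 20) -> float:
    """Call predict() repeatedly and return the median wall-clock time in ms."""
    for _ in range(warmup):  # early calls include one-time setup costs
        predict()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def compare_units(model_path: str, sample_input: dict) -> dict[str, float]:
    """Time one model under each compute-unit setting (macOS only)."""
    import coremltools as ct  # imported lazily so the timer works without it

    results = {}
    for unit in (ct.ComputeUnit.ALL, ct.ComputeUnit.CPU_AND_NE,
                 ct.ComputeUnit.CPU_AND_GPU, ct.ComputeUnit.CPU_ONLY):
        model = ct.models.MLModel(model_path, compute_units=unit)
        results[unit.name] = median_latency_ms(
            lambda: model.predict(sample_input)
        )
    return results
```

The median is used rather than the mean so that one slow run (a thermal hiccup, a background process) does not skew the comparison.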

(See Porting WobblePic to macOS for a real-world ONNX-to-CoreML migration.)

What the NE Can’t Do

The fixed-function nature has costs. The NE supports a specific set of operations and tensor layouts. If your model uses anything outside that set, those ops fall back to the GPU or CPU. Common gotchas:

Issue	Why	Workaround
Dynamic shapes	NE prefers static shapes baked at compile time	Use fixed input dimensions; pad/crop instead of variable sizing
Non-standard activations	Only common activations (ReLU, GELU, etc.) are NE-native	Replace with standard ones during model conversion
Custom ops	NE has no equivalent of CUDA kernels	Run the custom op on GPU or CPU; keep the rest on NE
Very small models	Compile + dispatch overhead dominates	CPU is often faster for sub-1M-parameter models
fp32-required precision	NE is optimized for fp16/int8	Quantize the model, or pin precision-sensitive ops to GPU

The “very small models” entry surprises people — there’s a per-inference overhead of dispatching to the NE that’s measured in hundreds of microseconds. For a model that takes 50 microseconds on CPU, that’s pure loss.
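A back-of-the-envelope model shows where the break-even sits. Both constants below are assumptions chosen for illustration (a 300 microsecond dispatch cost and a 10× speedup on the math itself), so substitute your own measurements:

```python
# Toy latency model: NE time = fixed dispatch overhead + compute time.
# All numbers below are assumptions for illustration, not measurements.
NE_DISPATCH_OVERHEAD_US = 300.0  # "hundreds of microseconds"
NE_SPEEDUP = 10.0                # assumed NE-vs-CPU speedup on the math itself


def ne_latency_us(cpu_latency_us: float) -> float:
    """Estimated NE latency for a model that takes cpu_latency_us on the CPU."""
    return NE_DISPATCH_OVERHEAD_US + cpu_latency_us / NE_SPEEDUP


def break_even_cpu_latency_us() -> float:
    """CPU latency above which the NE starts to win under this model."""
    # Solve overhead + t / speedup = t for t.
    return NE_DISPATCH_OVERHEAD_US * NE_SPEEDUP / (NE_SPEEDUP - 1.0)


for cpu_us in (50.0, 500.0, 5000.0):
    ne_us = ne_latency_us(cpu_us)
    winner = "NE" if ne_us < cpu_us else "CPU"
    print(f"CPU {cpu_us:7.0f} us | NE {ne_us:7.0f} us | faster: {winner}")

print(f"break-even near {break_even_cpu_latency_us():.0f} us of CPU time")
```

Under these assumptions the 50 microsecond model is about six times slower on the NE, and the NE only starts to pay off once the CPU version takes a few hundred microseconds.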

Quantization Matters More Than You’d Think

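The NE is built for 8-bit integer (int8) and 16-bit float (fp16) data. A model kept in fp32 gets far less out of it, since fp32 work generally has to be converted down or pushed to the GPU or CPU. As a rough, illustrative picture: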

NE throughput (relative):

int8:   ████████████████████████  100%
fp16:   ████████████              50%
fp32:   ███                       12%

This is why most production ML deployments on Apple Silicon convert their models to fp16 (or int8 with calibration) during the CoreML conversion step. The architecture stays the same; only the weights are stored in narrower types. Inference on the NE can become several times faster (roughly 2-8×, depending on the model), usually with negligible accuracy impact for vision and language models.
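To get a feel for what "negligible accuracy impact" means at the level of a single weight, the sketch below rounds a few made-up values to half precision using only the standard library (`struct`'s `e` format is IEEE 754 fp16) and compares storage sizes:

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


weights = [0.1234567, -1.5, 3.14159265, 0.0009765625]  # made-up sample weights

for w in weights:
    h = to_fp16(w)
    rel_err = abs(h - w) / abs(w)
    print(f"{w:>14.10f} -> {h:>14.10f}  (relative error {rel_err:.2e})")

# Storage: 4 bytes per fp32 weight vs 2 bytes per fp16 weight.
fp32_bytes = len(struct.pack(f"<{len(weights)}f", *weights))
fp16_bytes = len(struct.pack(f"<{len(weights)}e", *weights))
print(f"fp32: {fp32_bytes} bytes, fp16: {fp16_bytes} bytes")
```

For values in fp16's normal range the relative rounding error is at most 2^-11 (about 0.05%), which is why most networks tolerate the conversion. The caveat is range rather than precision: fp16 tops out at 65504, so very large intermediate values are what tends to need attention.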

CoreMLTools makes this conversion fairly mechanical for most architectures:

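import coremltools as ct
import torch

# ct.convert expects a traced (TorchScript) model plus its input shape.
example_input = torch.rand(1, 3, 224, 224)  # placeholder shape; use your model's
traced_model = torch.jit.trace(pytorch_model.eval(), example_input)

mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=example_input.shape)],
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,  # <-- key setting
    minimum_deployment_target=ct.target.macOS14,
)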

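Using FLOAT16 rather than FLOAT32 on this single line is the lowest-effort, highest-impact optimization for most CoreML deployments. (Recent coremltools versions already default to FLOAT16 for mlprogram models, so stating it explicitly mainly documents the intent and guards against someone overriding it.)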

Measuring What Actually Runs Where

CoreML doesn’t tell you out of the box where each op landed. The way to find out is Xcode’s Core ML performance report (the Performance tab shown when you open the model in Xcode), which gives a layer-by-layer breakdown along these lines:

Layer	Unit	Time	Note
conv1	NE	0.42 ms
bn1	NE	0.08 ms
relu1	NE	0.05 ms
custom_op	CPU	1.20 ms	← outlier
conv2	NE	0.38 ms
…

If you see a single layer taking 10× the time of its neighbors, it’s almost always a layer that fell off the NE onto the CPU. Either rewrite the model to avoid that op, replace it with an NE-supported equivalent, or accept the cost.
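Once the breakdown is in hand, spotting fallbacks can be automated. The sketch below takes the rows from the table above as plain tuples and flags layers that left the NE and are slow relative to the typical layer. The `slow_factor` threshold is an arbitrary assumption, and getting the rows out of Xcode is left to you, since the report's export format is not covered here.

```python
import statistics

# Rows copied from the table above: (layer, unit, time in ms).
profile = [
    ("conv1", "NE", 0.42),
    ("bn1", "NE", 0.08),
    ("relu1", "NE", 0.05),
    ("custom_op", "CPU", 1.20),
    ("conv2", "NE", 0.38),
]


def flag_fallbacks(rows, slow_factor: float = 3.0):
    """Return layers that left the NE and are slow relative to the median layer."""
    median_ms = statistics.median(ms for _, _, ms in rows)
    return [
        (layer, unit, ms)
        for layer, unit, ms in rows
        if unit != "NE" and ms >= slow_factor * median_ms
    ]


flagged = flag_fallbacks(profile)
total_ms = sum(ms for _, _, ms in profile)
for layer, unit, ms in flagged:
    share = 100.0 * ms / total_ms
    print(f"{layer} ran on {unit}: {ms:.2f} ms ({share:.0f}% of total)")
```

Reporting the layer's share of total time helps with the "accept the cost" decision: a fallback that accounts for a few percent of the run is rarely worth a model rewrite, while one that dominates it usually is.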

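For profiling outside the Xcode report, the lower-level os_signpost API can bracket your own prediction calls so they show up as intervals in Instruments. Signposts mark regions of your code rather than CoreML’s internal NE/GPU/CPU hand-offs, so treat them as a way to time whole predictions, not to see per-op placement.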

Beyond Apple: What This Tells Us About AI Hardware

The NE is part of a broader trend: nearly every major mobile and desktop chip vendor is shipping a dedicated neural accelerator. Qualcomm has the Hexagon NPU, Intel has AI Boost (the NPU in Core Ultra), AMD has XDNA, and Google has the TPU. Each is narrowly specialized for the same reason: modern ML inference is dominated by a small handful of operation types (convolutions, matmuls, attention), and a chip designed around those operations is dramatically more efficient than a general-purpose GPU running the same workload.

The downside is fragmentation: each accelerator has its own SDK, its own supported op set, and its own quirks. Cross-platform ML deployment increasingly means targeting a specific accelerator on each platform — CoreML/NE on Apple, DirectML/NPU on Windows, NNAPI on Android — rather than writing once and shipping everywhere.

Wrapping Up

A practical mental model:

NE is a specialist. It does ML inference brilliantly and almost nothing else. For workloads it supports, nothing beats it on power efficiency.
GPU is a generalist. Usually slower and more power-hungry than the NE on the ML ops the NE supports, but it handles custom ops, graphics, and any compute kernel you want to write in Metal.
CPU is the fallback. It handles whatever the other two can’t, plus any control flow that needs to run between them.
CoreML’s .all mode is usually correct. Trust it to partition unless you’ve measured and found a specific layer that should be pinned elsewhere.
Quantize aggressively. fp16 is the default, int8 if your accuracy budget allows.

If you’re doing on-device ML on Apple Silicon, the NE is doing most of the work whether you knew it or not. Understanding what it’s actually doing is the first step to making it work harder.