
I Almost Bought an RTX 5090. Then Apple’s Unified Memory Changed My Mind

6 min read · Sep 20, 2025

Apple Unified Memory vs. NVIDIA VRAM for Local LLM Inference (2025 Deep Dive)


The Hum of Servers, and Some Curiosity

A few weeks back, we moved our dev/staging Kubernetes cluster out of the cloud and into a rack next to my cabin. The hum of fans became my new soundtrack. And like any engineer who hears servers at night, I got the itch: What if we run LLMs locally, served as OpenAI-compatible APIs inside our Agentic AI framework?
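(To make “OpenAI-compatible” concrete: the local server speaks the same /v1 chat API, so existing clients work unchanged. A minimal sketch, assuming something like Ollama or llama.cpp’s server is listening locally; the URL, API key, and model tag below are placeholders for whatever your setup actually exposes.)

```python
# Minimal sketch: point the standard OpenAI client at a local,
# OpenAI-compatible server (e.g. Ollama or llama.cpp's llama-server).
# base_url, api_key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # local endpoint, not api.openai.com
    api_key="not-needed-locally",          # most local servers ignore the key
)

resp = client.chat.completions.create(
    model="llama3.1:8b",  # whatever model tag your server serves
    messages=[{"role": "user", "content": "Summarize our on-call runbook."}],
)
print(resp.choices[0].message.content)
```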

That sent me down the GPU rabbit hole.

I spoke to half the GPU resellers in India. Prices were weird. New cards sell above MSRP. “Pre-loved” cards under 16GB are everywhere (and honestly, you can’t do much for LLMs with that). The moment you step into 24GB / 32GB / 48GB territory, prices go crazy and sellers have long queues. A used A6000 48GB quote landed north of ₹4.2 lakh, and even then you can’t reliably run many 32B-parameter models in full precision. So the old plan was simple: save up, get a 5090, and push hard on quantized models.

Then came a filter coffee in Bangalore that changed everything.

Filter Coffee, a Friend, and Geeking Out on Tech

Over filter coffee in Bangalore, a friend and I started geeking out about Apple’s Unified Memory Architecture (UMA): the idea of the CPU and GPU sharing one big memory pool instead of juggling separate RAM and VRAM.

No ferrying tensors across a PCIe bus. No “copy to VRAM” tax. Just one giant, high-bandwidth pool shared by the CPU, GPU, and Neural Engine. On Apple silicon, memory sits in the same package as the compute (the SoC), so data moves less and latency drops. The whole pipeline feels simpler and more predictable. (This is literally how Apple describes its SoC + unified memory approach to developers.)

I went home wondering: Have I been optimizing for tokens-per-second when my real bottleneck is tokens-per-memory?

Can a Mac Really Handle Big Models?

Short answer: under the right conditions, yes. Apple’s current Mac Studio lineup can be configured with M3 Ultra and up to 512GB of unified memory.

Real-world? A respected reviewer ran a quantized DeepSeek R1 671B model entirely in a Mac Studio M3 Ultra’s unified memory by raising macOS’s default GPU memory limit so the GPU could grab a massive slice of that shared pool, and power draw stayed under ~200W. That’s… wild. (It was a 4-bit quantized variant, not full precision. But still.) [original post]

Does that mean “Apple beats NVIDIA, full stop”? No. It means UMA changes the constraint. For many workloads, the gating factor isn’t raw FLOPs — it’s whether the model fits. If you can fit the whole model (or most of it) into one pool without shuttling chunks back and forth, your experience changes: fewer “OOM” surprises, less orchestration overhead, simpler pipelines.
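Here’s a back-of-the-envelope way to sanity-check “does it fit” before buying anything. This is a rough sketch, not a benchmark: real quantized files (GGUF and friends) run larger than the naive params × bits / 8 figure because of mixed-precision layers, and you still need headroom for the KV cache, the runtime, and macOS itself. (On recent macOS versions the GPU’s share of the pool can reportedly be raised via the iogpu.wired_limit_mb sysctl, which is roughly what that reviewer did.)

```python
# Rough "does it fit?" arithmetic for a quantized model.
# Naive estimate only: real GGUF files are larger (mixed-precision tensors),
# and you need headroom for KV cache, runtime buffers, and the OS.

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# DeepSeek R1, 671B parameters at ~4 bits per weight:
print(f"671B @ 4-bit  ≈ {weights_gb(671, 4):.0f} GB")   # ~336 GB of weights
# A 70B-class model at 4-bit:
print(f"70B  @ 4-bit  ≈ {weights_gb(70, 4):.0f} GB")    # ~35 GB
# The same 70B in full 16-bit precision:
print(f"70B  @ 16-bit ≈ {weights_gb(70, 16):.0f} GB")   # ~140 GB
```

That is why the 256GB and 512GB configurations are interesting: even the naive estimate for a 4-bit 671B model lands in the mid-300GB range, which leaves room for cache and context once the GPU is allowed to claim most of the pool.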

Why This Matters for Builder-Founders (like me)

I don’t care about winning synthetic benchmarks. I care about:

  • Serving grounded answers via agentic pipelines, not just raw speed.
  • Batch throughput for nightly jobs and async flows.
  • Latency that’s predictable, even if not blazing-fast.
  • Operational simplicity my team can support at 2am.

In that world, fit > flash. If a model fits comfortably, life is good. If it barely fits or spills, you enter a maze of hacks, sharding, and prayer.

UMA in Plain English

  • One big memory pool: CPU, GPU, and Neural Engine share it. No separate “VRAM island.”
  • Less copying: You’re not constantly moving tensors across a bus.
  • Dynamic allocation: When the GPU needs a lot, it can grab a lot (within system limits).
  • Lower energy, lower friction: Data stays close to compute.
  • Simplicity: Fewer moving parts to manage at the software layer.

Caveats: UMA doesn’t magically remove all bottlenecks. If the CPU, GPU, and other units hammer the pool at once, they still compete for bandwidth. And on Apple machines, the memory is soldered: buy once, cry once.

When NVIDIA Still Wins

I ran lightweight models (3B–9B) across a motley lab of Apple silicon (M1 → M4) and a Lenovo with an RTX 3060 (6GB). In this lightweight division, NVIDIA often wins on raw tokens/sec, especially for tight kernels and mature CUDA paths. For custom training and niche ops, CUDA’s ecosystem still feels like home base.

But when I tried to push Gemma 9B beyond the 3060’s 6GB VRAM cliff, performance cratered. Once a model spills past VRAM, you’re suddenly paying for CPU offload and host-device transfers you didn’t sign up for (sketched after the list below). On the Mac side, 8–16GB of unified memory also hit walls with larger models once you count macOS and its background processes. That’s the heart of it: memory headroom determines how much pain you feel.

Translation:

  • Small models, tight loops? NVIDIA throughput is a rush.
  • Bigger models, less fuss? UMA feels… peaceful.
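That VRAM cliff isn’t abstract; it’s a knob you end up turning by hand. A sketch of what partial offload looks like with llama-cpp-python (the model path and layer count are placeholders, and the right split depends on the model, quant, and card):

```python
# Sketch of partial GPU offload with llama-cpp-python when the model
# doesn't fully fit in VRAM. Every layer left on the CPU costs tokens/sec;
# every layer pushed to the GPU costs VRAM. Path and numbers are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-2-9b-it-Q4_K_M.gguf",
    n_gpu_layers=28,   # tuned to squeeze into a 6GB card... until context grows
    n_ctx=4096,        # bigger context => bigger KV cache => less room for layers
)

out = llm("Explain unified memory in one paragraph.", max_tokens=128)
print(out["choices"][0]["text"])
```

On Apple silicon the same library runs on the Metal backend and the usual move is simply n_gpu_layers=-1: everything lives in the one pool, with no split to babysit.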

The Two-GPU Myth (and why “64GB” isn’t 32+32)

I also revisited a common idea: “What if I buy two 32GB GPUs and get 64GB?”

That’s not how it works. VRAM doesn’t stack into one big address space. With consumer cards, SLI is long gone, and even in multi-GPU setups, model parallelism is a specialized path. It adds complexity and overhead; it’s a way to fit a too-big model across cards, not a magical speed cheat. If your model fits on a single GPU, splitting it can make you slower, not faster.
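For completeness, here is roughly what that specialized path looks like in a serving stack like vLLM: tensor parallelism shards each layer’s weights across the cards, at the cost of per-layer synchronization. A sketch only; the model name is a placeholder and it assumes both GPUs are visible to the process.

```python
# Sketch: tensor parallelism in vLLM splits each weight matrix across two
# GPUs so a too-big model can load at all. It is not pooled VRAM:
# activations are exchanged between cards at every layer, which adds
# interconnect overhead. Model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",
    tensor_parallel_size=2,   # shard across 2 GPUs; adds sync cost per layer
)

outputs = llm.generate(["Why doesn't VRAM stack?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```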

The Pivot

I started this journey believing “5090 or bust.” Raw horsepower. Max tokens/sec. The dream.

Then reality nudged me:

  • Local inference is memory-first. If your model doesn’t fit, everything becomes a workaround.
  • My team needs reliability. A simpler pipeline we trust beats a hair-on-fire cluster we fear.
  • My costs need clarity. UMA rigs sip power compared to big GPU towers (and my cabin ears are grateful).

So I pivoted. Instead of chasing a 5090, I’m now leaning toward a Mac Studio M3 Ultra configured with 256GB unified memory (and eyeing 512GB if budget permits). Apple’s own specs confirm that range, and the flexibility is exactly the point.

Decision Framework (Use This Before You Buy)

1) Start with your models, not the hardware.

  • What’s the largest model you’ll run this quarter? Next quarter?
  • What quantization quality still gives you acceptable answers for your domain?
  • How much context (RAG) do you really need per request? (The KV-cache sketch below shows what that costs.)
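Context isn’t free: every token you keep in the window carries a KV-cache cost on top of the weights. A rough sketch of that math; the layer/head/dimension numbers below are illustrative placeholders, not pulled from any particular model card.

```python
# Rough KV-cache sizing: 2 (K and V) * layers * kv_heads * head_dim
# * bytes_per_value * tokens, per concurrent sequence.
# Architecture numbers are illustrative; check the model card for real ones
# (GQA models have far fewer KV heads than attention heads).

def kv_cache_gb(layers, kv_heads, head_dim, tokens, bytes_per_value=2, sequences=1):
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens * sequences / 1e9

# Example: a 70B-class model with 80 layers, 8 KV heads (GQA), head_dim 128,
# serving 4 concurrent requests at a 32k-token context with an fp16 cache:
print(f"{kv_cache_gb(80, 8, 128, 32_768, sequences=4):.1f} GB of KV cache")
# ≈ 43 GB, on top of the weights.
```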

2) Define your workload shape.

  • Batch jobs (nightly/async) vs. interactive (latency-sensitive)?
  • Peak concurrency and batch size targets?
  • Tolerance for OOM retries, offloading, or sharding?

3) Choose your bottleneck.

  • If your ceiling is tokens/sec on small models, NVIDIA likely wins.
  • If your ceiling is “does it fit?” on larger models, UMA can change the game.

4) Price the operational tax.

  • Multi-GPU orchestration, NVLink availability, cooling, PSU headroom.
  • Driver updates, container builds, CUDA toolchain maintenance.
  • Noise, heat, and power (your future self will thank you).

5) Buy memory like you’ll regret not buying memory.

  • UMA: pick the largest RAM you won’t hate paying for.
  • Discrete GPUs: err on more VRAM than your spreadsheet says you need.

What I’m Actually Building (Right Now)

Primary: Mac Studio M3 Ultra with 256GB unified memory (stretch target: 512GB).

Workload mix:

  • Production inference for agentic workflows (RAG-heavy, grounded).
  • Batch processing prioritized; interactive responses second.

Models in rotation: the Llama 3.1/3.2 family, Gemma 2, Qwen 2.5 (up to the 72B class), and task-tuned variants, mostly quantized for memory headroom.

Why this rig:

  • It fits more tiers of models without gymnastics.
  • Predictable latency and fewer brittle hacks.
  • Power draw stays sane for an on-prem, always-on setup.
  • Teams can operate it without summoning a GPU priesthood.

The Final Takeaways

  • Local LLMs are memory-first. If it doesn’t fit, you’ll pay somewhere else.
  • Apple UMA collapses the RAM/VRAM split and can handle surprisingly large quantized models in-memory.
  • NVIDIA still rules small-model throughput and custom CUDA workflows.
  • Two GPUs ≠ pooled VRAM. Memory doesn’t stack; model parallelism adds complexity.
  • Buy for your next model, not your last. Future-proof your memory choice.
  • Simplicity scales. The rig your team can operate at 2am is the right rig.
  • If your world is tiny models, max throughput, and heavy CUDA, go NVIDIA.
  • If your world is big models that must fit, predictable energy use, and less orchestration, try UMA.

I chose UMA. I might choose differently in six months. That’s allowed.


Written by Karan Singh

Co-Founder & CTO @ Scogo AI ♦ I Love to solve problems using Tech
