TL;DR

Thorsten Meyer AI has published guidance saying local LLM inference often stays near full speed under a GPU power limit because the workload is usually bound by memory bandwidth. The guide cites RTX 4090 measurements in which a 70% power cap cuts about 90 watts of heat while keeping about 93.4% of tokens/sec, though results vary by card and workload.

Thorsten Meyer AI has published a tuning guide that says local inference users can often cut GPU heat and fan noise by applying a power limit or undervolt, with only a modest loss in tokens per second. The finding matters for people running high-power AI workstations at home or in small offices.

The guide’s central claim is workload-specific: local LLM inference is often constrained more by VRAM bandwidth than by GPU core compute. Because of that, the author says reducing GPU power can lower voltage, clocks, temperature and fan speed while leaving much of the token throughput intact.

In the cited RTX 4090 example, stock operation is listed at 390 watts, 72°C and 100% speed. At a 70% power limit, the guide lists 300 watts, 67°C and 93.4% of the original tokens/sec, a reduction of roughly 90 watts for about a 6.6% speed drop. At 80%, it lists 330 watts, 70°C and 98.6% speed. At 55%, it lists 240 watts, 60°C and 89.2% speed, which the guide labels the peak-efficiency area.
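The arithmetic behind these figures can be checked with a short script. The Python sketch below uses only the four RTX 4090 data points quoted above. The "relative efficiency" metric (speed retained per watt, normalized so stock equals 1.0) is an assumption made here for illustration, not a formula taken from the guide, and the function name is hypothetical.

```python
# Data points quoted from the guide's RTX 4090 example:
# power-limit setting -> (reported watts, % of stock tokens/sec)
GUIDE_POINTS = {
    "stock": (390, 100.0),
    "80%": (330, 98.6),
    "70%": (300, 93.4),
    "55%": (240, 89.2),
}


def summarize(points, baseline="stock"):
    """Return watts saved, speed lost and relative efficiency per setting."""
    base_watts, base_speed = points[baseline]
    base_eff = base_speed / base_watts
    rows = {}
    for name, (watts, speed) in points.items():
        rows[name] = {
            "watts_saved": base_watts - watts,
            "speed_lost_pct": round(base_speed - speed, 1),
            # Assumed metric: speed per watt, normalized so stock == 1.0
            "relative_efficiency": round((speed / watts) / base_eff, 2),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(GUIDE_POINTS).items():
        print(name, row)
```

On these figures the 55% setting delivers about 1.45 times the stock throughput per watt, the highest of the four, which is consistent with the guide labeling it the peak-efficiency area.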

The article recommends starting with power limiting rather than manual undervolting. On Windows, it points to MSI Afterburner. On Linux, it cites nvidia-smi or LACT, with the example command sudo nvidia-smi -pl 300, which sets the card’s power limit to 300 watts. It says users should test with their real inference workload and measure actual tokens/sec, power draw, clock behavior and temperature over a sustained run.
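One way to log the sustained-run measurements the guide asks for is to poll nvidia-smi's query mode from a script. The following is a minimal sketch, assuming an NVIDIA card with nvidia-smi on the PATH. The function names, the 10-minute duration and the 5-second interval are hypothetical choices made here, not settings from the guide. Tokens/sec is not reported by nvidia-smi and has to come from the inference software itself.

```python
import subprocess
import time

# Fields requested from nvidia-smi's query interface.
FIELDS = ["power.draw", "temperature.gpu", "clocks.sm"]


def parse_sample(line):
    """Parse one CSV line of watts, degrees C and SM clock MHz into a dict."""
    values = [float(part.strip()) for part in line.split(",")]
    return dict(zip(FIELDS, values))


def read_gpu_sample(gpu_index=0):
    """Query the GPU once; needs an NVIDIA driver with nvidia-smi installed."""
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            f"--query-gpu={','.join(FIELDS)}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])


def average(samples):
    """Average each field over a list of sample dicts."""
    return {f: sum(s[f] for s in samples) / len(samples) for f in FIELDS}


def log_run(duration_s=600, interval_s=5):
    """Sample the GPU for a sustained period and return per-field averages."""
    samples = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        samples.append(read_gpu_sample())
        time.sleep(interval_s)
    return average(samples)
```

A typical use would be to call log_run() while the normal inference workload is generating, once at stock settings and once after applying a cap, then compare the averages alongside the tokens/sec reported by the inference software. Some cards report a field as not available, in which case parse_sample() would raise an error and need adapting.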

Why It Matters

The guidance matters because many local AI users are trying to run large models on consumer GPUs that can draw hundreds of watts under load. Lowering GPU power can reduce heat dumped into a room, ease fan noise, and lower electricity use without requiring a new case, cooler or GPU.

For workstation owners, the tradeoff is practical rather than theoretical: a small tokens/sec loss may be acceptable if it makes long inference sessions quieter and more stable. The guide frames power limiting as a free first step before hardware purchases, while still warning that card behavior varies.

Background

Modern high-end GPUs ship with factory voltage and clock settings designed to maintain rated performance across many chips. Thorsten Meyer AI argues that this leaves extra voltage headroom that can be reclaimed for inference workloads, where peak core clocks may not be the limiting factor.

The guide places this advice as the first lever in a broader series on reducing heat and noise in high-power AI workstations. It distinguishes simple power limiting, which restricts the card’s power budget, from undervolting, which directly edits the voltage-frequency curve and needs more careful stability testing.

“Local inference is memory-bound — the GPU core spends much of its time waiting on VRAM.”

— Thorsten Meyer AI guide

“Power limiting moves one slider and can’t damage anything.”

— Thorsten Meyer AI guide

“Test under your real workload — a curve stable for 10 min can fail on hour 3.”

— Thorsten Meyer AI guide

What Remains Unclear

The exact gains are not guaranteed. The guide says figures are illustrative and vary by GPU, model, quantization, cooling, silicon quality and workload. It is also unclear how closely the listed RTX 4090 pattern will map to every inference stack, especially workloads that are more compute-heavy or already tuned for a specific power profile.

Manual undervolting also carries stability risk. The source describes undervolting and power limits as reversible and widely used, but says users make changes at their own risk and should confirm current hardware specs before acting on purchase or tuning advice.

What’s Next

The next step for readers is to test their own system with a conservative power cap, often in the 60% to 80% range cited by the guide, then compare sustained tokens/sec, wattage, temperatures and fan noise. Users who need finer control may then test a voltage curve around the guide’s suggested 0.9V to 0.95V starting range, saving the profile only after a long run under the actual inference workload.
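Once a reader has measurements for several caps, picking one comes down to a simple comparison. The sketch below is one possible approach, not the guide's method: it returns the lowest-wattage setting that still keeps a chosen fraction of baseline throughput. The function name and the 0.93 default threshold are hypothetical, and the sample data reuses the guide's RTX 4090 figures, with percent of stock speed standing in for measured tokens/sec.

```python
def pick_power_cap(results, min_speed_fraction=0.93):
    """Return the lowest-wattage setting that keeps enough throughput.

    results maps a cap label to (average_watts, tokens_per_sec) and must
    include a "stock" baseline entry. The threshold is an arbitrary
    illustrative default, not a recommendation from the guide.
    """
    baseline_tps = results["stock"][1]
    acceptable = {
        label: watts
        for label, (watts, tps) in results.items()
        if tps / baseline_tps >= min_speed_fraction
    }
    # The stock entry always qualifies, so acceptable is never empty.
    return min(acceptable, key=acceptable.get)


if __name__ == "__main__":
    # Guide's RTX 4090 figures; speed is given as % of stock tokens/sec.
    guide_example = {
        "stock": (390, 100.0),
        "80%": (330, 98.6),
        "70%": (300, 93.4),
        "55%": (240, 89.2),
    }
    print(pick_power_cap(guide_example))
    print(pick_power_cap(guide_example, min_speed_fraction=0.95))
```

In practice the dictionary would hold a reader's own sustained-run averages rather than the guide's numbers, since the guide stresses that results vary by card and workload.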

Key Questions

Is this mainly for gaming or local AI inference?

The guide is aimed at local AI inference. It says gaming workloads are often more compute-bound, so cutting GPU core power may cost more performance than it does in memory-bound LLM inference.

What is the easiest setting to try first?

The guide recommends a simple power limit first, such as moving from 100% toward about 70%, then measuring real tokens/sec and temperatures during a sustained inference run.

Does a power limit or undervolt keep the same tokens/sec?

Not always. The cited RTX 4090 example keeps 93.4% of speed at a 70% power limit and 98.6% at 80%, but the source says results vary by card and workload.

Can power limiting damage the GPU?

The guide says power limiting “can’t damage anything”: it restricts the card rather than pushing it harder. Manual undervolting is also described as reversible, but the source warns that all tuning is done at the user’s own risk.

What should users measure before saving a profile?

Users should compare sustained tokens/sec, GPU power draw, temperature, fan noise and clock behavior under the same model and settings they normally use.

Source: Thorsten Meyer AI
