TL;DR

A developer has optimized a Swift implementation of matrix multiplication used for training large language models, lifting throughput from gigaflop/s to near-teraflop/s levels on Apple Silicon. The result suggests faster ML training is possible in native Swift without relying on external libraries.


The developer ported a plain C implementation of matrix multiplication, originally part of a GPT-2-compatible training loop, to Swift. Initial tests were very slow, but targeted optimizations, leveraging SIMD, the AMX coprocessor, and the GPU, brought the code to roughly one teraflop per second. These gains were achieved without any third-party ML libraries, relying solely on low-level Swift and hardware-specific features. The developer notes that the optimizations are still in progress and that the full training loop, including forward and backward passes, is being tuned for speed.
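The SIMD step can be sketched in pure Swift using the standard library's `SIMD4<Float>` type. This is an illustrative example, not the developer's actual code: the function name `matmulSIMD` and the ikj loop order are assumptions for the sketch.

```swift
// Sketch: vectorize the matmul inner loop with Swift's built-in SIMD4<Float>.
// Computes C += A * B for row-major n×n matrices, processing four columns of
// B and C per iteration. Assumes n is divisible by 4 to keep the sketch short.
func matmulSIMD(_ a: [Float], _ b: [Float], _ c: inout [Float], n: Int) {
    precondition(n % 4 == 0, "sketch assumes n divisible by 4")
    for i in 0..<n {
        for k in 0..<n {
            // Broadcast A[i][k] into all four lanes.
            let aik = SIMD4<Float>(repeating: a[i * n + k])
            for j in stride(from: 0, to: n, by: 4) {
                // Load four contiguous elements of C's row i and B's row k.
                var cv = SIMD4<Float>(c[i*n+j], c[i*n+j+1], c[i*n+j+2], c[i*n+j+3])
                let bv = SIMD4<Float>(b[k*n+j], b[k*n+j+1], b[k*n+j+2], b[k*n+j+3])
                cv += aik * bv  // four multiply-adds in one SIMD operation
                c[i*n+j] = cv[0]; c[i*n+j+1] = cv[1]
                c[i*n+j+2] = cv[2]; c[i*n+j+3] = cv[3]
            }
        }
    }
}
```

Real kernels go further (register blocking, cache tiling, `unsafe` buffer pointers to avoid bounds checks), but lane-parallel inner loops like this are typically the first step past Gflop/s territory.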

Why It Matters

This development demonstrates that high-performance neural network training can be achieved in Swift on Apple Silicon hardware, potentially reducing reliance on external frameworks and enabling more developers to experiment with ML directly in native code. Achieving Tflop/s performance in matrix multiplication is a critical step toward making training large models feasible on consumer-grade hardware, which could democratize access to AI research and development.


Background

Two years ago, the developer revisited an old neural network project, and the release of Andrej Karpathy's llm.c, a plain C implementation of GPT-2 training, renewed that motivation. An initial rewrite in Swift revealed significant performance limitations. While Apple offers optimized ML frameworks, the developer chose a no-library approach to understand and improve low-level performance, focusing on matrix multiplication, the core operation in neural network training. Recent efforts have concentrated on exploiting Apple Silicon's hardware features to push performance into the Tflop/s range, a notable milestone compared with the earlier Gflop/s levels.
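For reference, the Gflop/s-level starting point is essentially a triple nested loop. The sketch below is illustrative rather than the project's code (the name `matmulNaive` is hypothetical); it also shows how the flop/s figures are derived, since an n×n matmul performs 2·n³ floating-point operations.

```swift
// Plain, unoptimized triple-loop matmul for row-major n×n matrices.
// This is the kind of baseline that Gflop/s figures refer to.
func matmulNaive(_ a: [Float], _ b: [Float], n: Int) -> [Float] {
    var c = [Float](repeating: 0, count: n * n)
    for i in 0..<n {
        for j in 0..<n {
            var sum: Float = 0
            for k in 0..<n {
                sum += a[i * n + k] * b[k * n + j]  // one mul + one add
            }
            c[i * n + j] = sum
        }
    }
    return c
}
// Throughput = 2·n³ / elapsed seconds.
// Example: n = 1024 taking 0.5 s ≈ 2·1024³ / 0.5 ≈ 4.3 Gflop/s.
```

The gap between this loop and the Tflop/s figure quoted above is roughly three orders of magnitude, which is why the SIMD, AMX, and GPU work matters.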

“Achieving teraflop-level performance in Swift for matrix multiplication on Apple Silicon is a promising step toward faster, more accessible ML training.”

— Developer

“Leveraging SIMD, AMX, and GPU features on Apple Silicon can dramatically boost computational throughput for matrix operations.”

— Hardware expert


What Remains Unclear

It remains unclear how these low-level optimizations will translate into full training speed for large models, or how they compare with established ML frameworks’ performance. The developer is still refining the code, and real-world training benchmarks are pending.


What’s Next

The developer plans to complete the full training loop in Swift, including forward and backward passes, and to benchmark training speed at scale. Future work will also explore deeper Metal integration for GPU acceleration and further tuning across the chip's compute units.
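One possible shape for the Metal work, offered as a sketch under assumptions rather than the developer's plan, is Apple's built-in MetalPerformanceShaders matrix kernel; `gpuMatmul` below is a hypothetical wrapper around it.

```swift
import Metal
import MetalPerformanceShaders

// Sketch: multiply two row-major n×n Float32 matrices on the GPU via
// MPSMatrixMultiplication. Returns nil if no Metal device is available.
func gpuMatmul(_ a: [Float], _ b: [Float], n: Int) -> [Float]? {
    guard let device = MTLCreateSystemDefaultDevice(),
          let queue = device.makeCommandQueue() else { return nil }
    let rowBytes = n * MemoryLayout<Float>.stride
    let desc = MPSMatrixDescriptor(rows: n, columns: n,
                                   rowBytes: rowBytes, dataType: .float32)
    // Copy host arrays into GPU-visible buffers.
    func buffer(_ data: [Float]) -> MTLBuffer? {
        data.withUnsafeBytes { device.makeBuffer(bytes: $0.baseAddress!, length: $0.count) }
    }
    guard let bufA = buffer(a), let bufB = buffer(b),
          let bufC = device.makeBuffer(length: n * rowBytes) else { return nil }
    let matA = MPSMatrix(buffer: bufA, descriptor: desc)
    let matB = MPSMatrix(buffer: bufB, descriptor: desc)
    let matC = MPSMatrix(buffer: bufC, descriptor: desc)
    let mul = MPSMatrixMultiplication(device: device, resultRows: n,
                                      resultColumns: n, interiorColumns: n)
    guard let cmd = queue.makeCommandBuffer() else { return nil }
    mul.encode(commandBuffer: cmd, leftMatrix: matA, rightMatrix: matB, resultMatrix: matC)
    cmd.commit()
    cmd.waitUntilCompleted()
    let ptr = bufC.contents().bindMemory(to: Float.self, capacity: n * n)
    return Array(UnsafeBufferPointer(start: ptr, count: n * n))
}
```

Note that MPS is an Apple framework, so leaning on it would soften the strict no-library approach; hand-written Metal compute shaders are the alternative that keeps every kernel under the developer's control.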


Key Questions

Can this Swift code replace existing ML frameworks for training large models?

Currently, the code is experimental and focused on low-level performance. While promising, it does not yet match the ease and optimization of established frameworks like PyTorch or TensorFlow. Further development is needed before it can replace these tools.

What hardware is required to achieve Tflop/s performance in Swift matrix multiplication?

Apple Silicon chips (such as M1, M2, or newer) with support for SIMD, AMX, and GPU acceleration are necessary to reach these performance levels.
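Apple does not publicly document the AMX units; the supported route to them is Accelerate's BLAS, which is generally understood to dispatch `sgemm` to AMX on Apple Silicon. The sketch below (the name `matmulBLAS` is hypothetical) shows that route for comparison; the developer's no-library code targets the hardware more directly.

```swift
import Accelerate

// Sketch: row-major n×n single-precision matmul through Accelerate's BLAS.
// On Apple Silicon this path typically runs on the AMX matrix coprocessor.
func matmulBLAS(_ a: [Float], _ b: [Float], n: Int) -> [Float] {
    var c = [Float](repeating: 0, count: n * n)
    let ni = Int32(n)
    // C = alpha * A * B + beta * C
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                ni, ni, ni,    // M, N, K
                1.0, a, ni,    // alpha, A, leading dimension of A
                b, ni,         // B, leading dimension of B
                0.0, &c, ni)   // beta, C, leading dimension of C
    return c
}
```

Benchmarking hand-written Swift against this call is a natural way to measure how close the no-library kernels get to the AMX ceiling.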

How does this performance compare to traditional GPU-based training?

Initial results suggest that optimized Swift code on Apple Silicon can approach or exceed some GPU performance levels for specific matrix operations, but comprehensive training benchmarks are still in progress.

Will this approach be practical for training large-scale LLMs?

While promising at the matrix operation level, full-scale training involves many other factors. The developer is working toward integrating full training routines, but practical deployment remains a future goal.

