A 21.9M hybrid recurrent-attention model learns speech in 1.8k steps
Ok, so I might have had a light W with this little speech experiment. Yes, OLMo-type recurrent-attention hybrids are pretty good.
I wanted a cheap answer to a simple question: can a tiny OLMo-hybrid style recurrent-attention language model learn enough codec structure to produce voice-like audio instead of static?
Seems like it did, haha.
By about step 1800, the 21.9M-parameter OLMo-hybrid trained on LJ Speech with EnCodec 24 kHz tokens was already producing samples with a clear speaking voice. The outputs were still mildly babbly and not semantically coherent, but they were decisively past the “hiss / drone / decoder garbage” failure mode.
That happened at only about 15% of the originally planned 12k training budget.
Setup
| Item | Value |
|---|---|
| Model | OLMo-hybrid speech LM |
| Parameters | 21.9M |
| Backbone | 8 layers = 6 Gated DeltaNet blocks + 2 attention blocks |
| Width | d_model=384, d_ff=1024 |
| Attention | 6 heads, 2 KV heads |
| Hybrid schedule | attention every 4th block |
| Data | LJ Speech |
| Tokenizer | EnCodec 24 kHz, 8 codebooks, vocab 1027 |
| Chunking | 8s chunks |
| Split | 12,624 train / 666 val |
| Context | 1024 delayed steps |
| Hardware | A100-SXM4-80GB |
| Runtime | bf16, fused AdamW, CUDA SDPA flash path on, fused FLA GDN path off |
| Stable batch | true batch 24, grad accum 1 |
| Throughput | about 17.9k tok/s |
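The “delayed steps” context refers to a delay-pattern interleaving of the 8 EnCodec codebooks, in the style popularized by MusicGen. A minimal sketch of the idea, assuming codebook k is shifted right by k steps and that one of the 1027 vocab entries serves as a pad token (`PAD = 1026` here is an illustrative assumption, not the run's actual special-token layout):

```python
# Hypothetical sketch: MusicGen-style delay pattern for 8 EnCodec codebooks.
# Codebook k is shifted right by k steps, so at generation time the model
# only conditions on codebook entries that are already available.
PAD = 1026  # assumed pad id within the 1027-entry vocab (illustrative)

def apply_delay(codes, pad=PAD):
    """codes: list of n_q lists of length T -> n_q x (T + n_q - 1) delayed grid."""
    n_q = len(codes)
    out = []
    for k, row in enumerate(codes):
        # k pads in front, (n_q - 1 - k) behind, so all rows stay the same length
        out.append([pad] * k + list(row) + [pad] * (n_q - 1 - k))
    return out

def undo_delay(delayed, pad=PAD):
    """Invert apply_delay: strip each row's shift to recover the T-step grid."""
    n_q = len(delayed)
    T = len(delayed[0]) - (n_q - 1)
    return [row[k:k + T] for k, row in enumerate(delayed)]
```

The round trip is lossless, so the decoder-side un-delay recovers exactly the codebook grid that EnCodec needs.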
An important note: this run did not use the intended fused Flash Linear Attention Gated DeltaNet kernel. The Triton backward kernel was unstable on my RunPod stack and kept crashing, so I had the recurrent core fall back to the plain PyTorch scan implementation. That means the run was still much slower than ideal for an A100, which is a backhanded upside: with the fused kernel working, the same result should arrive even faster.
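For reference, the fallback scan is conceptually just a sequential state update. Here is a simplified single-head sketch in NumPy of a gated delta-rule recurrence; the actual implementation is batched, multi-head PyTorch with learned gate parameterizations, so treat the names and the exact gating form as illustrative assumptions:

```python
import numpy as np

def gated_deltanet_scan(q, k, v, alpha, beta):
    """Sequential gated delta-rule scan (single head, no batching).

    q, k, v: (T, d) arrays; alpha (decay gate), beta (write strength): (T,).
    State update: S_t = alpha_t * S_{t-1} @ (I - beta_t * k_t k_t^T)
                        + beta_t * outer(v_t, k_t)
    Output:       o_t = S_t @ q_t
    """
    T, d = q.shape
    S = np.zeros((d, d))
    I = np.eye(d)
    out = np.zeros((T, d))
    for t in range(T):
        kt = k[t]
        # delta rule: decay the state, erase along k_t, then write v_t k_t^T
        S = alpha[t] * S @ (I - beta[t] * np.outer(kt, kt)) + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]
    return out
```

The fused Triton kernel computes the same recurrence in chunked parallel form; this loop is the slow-but-simple path the run actually took.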
Validation Trajectory
The A100 run improved monotonically through every saved eval up to step 1800.
| Step | EMA Val Loss | PPL |
|---|---|---|
| 200 | 6.7878 | 886.99 |
| 400 | 6.1530 | 470.13 |
| 600 | 5.6474 | 283.54 |
| 800 | 5.1973 | 180.78 |
| 1000 | 4.8510 | 127.87 |
| 1200 | 4.6169 | 101.18 |
| 1400 | 4.4626 | 86.72 |
| 1600 | 4.3585 | 78.14 |
| 1800 | 4.2847 | 72.58 |
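The PPL column is just the exponential of the EMA val loss (mean token cross-entropy in nats); a quick spot-check against the table:

```python
import math

def ppl(ema_val_loss):
    # perplexity = exp(mean token cross-entropy in nats)
    return math.exp(ema_val_loss)

# small mismatches with the table come from the losses being rounded
assert abs(ppl(4.2847) - 72.58) < 0.05
assert abs(ppl(6.7878) - 886.99) < 0.5
```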
This matters because my earlier M4 Mini run had already shown the model could get to “speech-like babble,” but it plateaued and got operationally ugly. The A100 run kept moving cleanly.
What The Samples Sound Like
Qualitatively:
- there is clearly a human voice present
- the outputs are not static
- they are not tonal droning failures
- they are still largely babble rather than intelligible text
- later checkpoints sound more stable than earlier ones, but sample quality variance is still real
That is exactly the threshold I cared about for this pilot. I was not trying to get polished TTS from a 21.9M unconditioned model; I wanted evidence that the hybrid architecture could learn speech codec structure at all.
It did.
Local sample bundles were saved for:
steps 1200, 1400, 1600, and 1800,
and the best saved checkpoint by the end of the session was step 1800.
What Failed
The systems story was much worse than the learning story.
Main problems:
- the fused FLA Gated DeltaNet kernel did not train cleanly on the pod software stack
- cheap pod infrastructure was volatile and restarted at bad times
- pod-local storage could not be trusted
- the run likely died right as step 2000 was being saved
That said, I did manage to pull the important checkpoints locally before the pod disappeared:
1200, 1400, 1600, 1800, latest, and best.pt
So the final result survived even though the pod did not.
Significance?
I believe the primary lesson is:
a tiny OLMo-hybrid style recurrent-attention LM can learn enough speech codec structure on a clean single-speaker corpus to produce clear voice-like audio surprisingly early in training.
Even without:
- a fused recurrent kernel
- a long training budget
- text conditioning
- semantic token hierarchies
- a polished speech decoder stack
So the obvious next step is text conditioning. Once the model is already making convincing voice-like audio, the high-value problem moves from simple unconditioned sampling to actual steering.
Next Steps
- use the 1800 checkpoint as the main saved pilot result
- compare 1400 / 1600 / 1800 sample quality side-by-side
- add a matched transformer baseline later, on real GPU access
- move to text-conditioned speech generation next
- revisit the fused GDN kernel on a more stable CUDA/Triton stack
This run was messy to execute, but the core result was clean: the model learned real speech structure.