A 21.9M hybrid recurrent-attention model learns speech in 1.8k steps
Ok, so I might have had a light W with this little speech experiment. Yes, OLMo-type recurrent-attention hybrids are pretty good.
I wanted a cheap answer to a simple question: can a tiny OLMo-hybrid style recurrent-attention language model learn enough codec structure to produce voice-like audio instead of static?
Seems like it did, haha.
By about step 1800, the 21.9M-parameter OLMo-hybrid trained on LJ Speech with EnCodec 24 kHz tokens was already producing samples with a clear speaking voice. The outputs were still mildly babbly and not semantically coherent, but they were decisively past the “hiss / drone / decoder garbage” failure mode.
That happened at only about 15% of the originally planned 12k training budget.
Setup
| Item | Value |
|---|---|
| Model | OLMo-hybrid speech LM |
| Parameters | 21.9M |
| Backbone | 8 layers = 6 Gated DeltaNet blocks + 2 attention blocks |
| Width | d_model=384, d_ff=1024 |
| Attention | 6 heads, 2 KV heads |
| Hybrid schedule | attention every 4th block |
| Data | LJ Speech |
| Tokenizer | EnCodec 24 kHz, 8 codebooks, vocab 1027 |
| Chunking | 8s chunks |
| Split | 12,624 train / 666 val |
| Context | 1024 delayed steps |
| Hardware | A100-SXM4-80GB |
| Runtime | bf16, fused AdamW, CUDA SDPA flash path on, fused FLA GDN path off |
| Stable batch | true batch 24, grad accum 1 |
| Throughput | about 17.9k tok/s |
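The “delayed steps” context refers to a delay-pattern interleaving of the 8 EnCodec codebooks, in the style popularized by MusicGen. A minimal sketch of the idea, assuming codebook k is shifted right by k steps and that one of the 1027 vocab entries serves as a pad token (`PAD = 1026` here is an illustrative assumption, not the run's actual special-token layout):

```python
# Hypothetical sketch: MusicGen-style delay pattern for 8 EnCodec codebooks.
# Codebook k is shifted right by k steps, so at generation time the model
# only conditions on codebook entries that are already available.
PAD = 1026  # assumed pad id within the 1027-entry vocab (illustrative)

def apply_delay(codes, pad=PAD):
    """codes: list of n_q lists of length T -> n_q x (T + n_q - 1) delayed grid."""
    n_q = len(codes)
    out = []
    for k, row in enumerate(codes):
        # k pads in front, (n_q - 1 - k) behind, so all rows stay the same length
        out.append([pad] * k + list(row) + [pad] * (n_q - 1 - k))
    return out

def undo_delay(delayed, pad=PAD):
    """Invert apply_delay: strip each row's shift to recover the T-step grid."""
    n_q = len(delayed)
    T = len(delayed[0]) - (n_q - 1)
    return [row[k:k + T] for k, row in enumerate(delayed)]
```

The round trip is lossless, so the decoder-side un-delay recovers exactly the codebook grid that EnCodec needs.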
An important note: this run did not use the intended fused Flash Linear Attention Gated DeltaNet kernel. The Triton backward kernel was unstable on my RunPod stack and kept crashing, so I had the recurrent core fall back to the plain PyTorch scan implementation. That means the run was still much slower than ideal for an A100, which is a backhanded upside: with the fused kernel working, the same result should arrive even faster.
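For reference, the fallback scan is conceptually just a sequential state update. Here is a simplified single-head sketch in NumPy of a gated delta-rule recurrence; the actual implementation is batched, multi-head PyTorch with learned gate parameterizations, so treat the names and the exact gating form as illustrative assumptions:

```python
import numpy as np

def gated_deltanet_scan(q, k, v, alpha, beta):
    """Sequential gated delta-rule scan (single head, no batching).

    q, k, v: (T, d) arrays; alpha (decay gate), beta (write strength): (T,).
    State update: S_t = alpha_t * S_{t-1} @ (I - beta_t * k_t k_t^T)
                        + beta_t * outer(v_t, k_t)
    Output:       o_t = S_t @ q_t
    """
    T, d = q.shape
    S = np.zeros((d, d))
    I = np.eye(d)
    out = np.zeros((T, d))
    for t in range(T):
        kt = k[t]
        # delta rule: decay the state, erase along k_t, then write v_t k_t^T
        S = alpha[t] * S @ (I - beta[t] * np.outer(kt, kt)) + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]
    return out
```

The fused Triton kernel computes the same recurrence in chunked parallel form; this loop is the slow-but-simple path the run actually took.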
Validation Trajectory
The A100 run improved monotonically through every saved eval up to step 1800.
| Step | EMA Val Loss | PPL |
|---|---|---|
| 200 | 6.7878 | 886.99 |
| 400 | 6.1530 | 470.13 |
| 600 | 5.6474 | 283.54 |
| 800 | 5.1973 | 180.78 |
| 1000 | 4.8510 | 127.87 |
| 1200 | 4.6169 | 101.18 |
| 1400 | 4.4626 | 86.72 |
| 1600 | 4.3585 | 78.14 |
| 1800 | 4.2847 | 72.58 |
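The PPL column is just the exponential of the EMA val loss (mean token cross-entropy in nats); a quick spot-check against the table:

```python
import math

def ppl(ema_val_loss):
    # perplexity = exp(mean token cross-entropy in nats)
    return math.exp(ema_val_loss)

# small mismatches with the table come from the losses being rounded
assert abs(ppl(4.2847) - 72.58) < 0.05
assert abs(ppl(6.7878) - 886.99) < 0.5
```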
This matters because my earlier M4 Mini run had already shown the model could get to “speech-like babble,” but it plateaued and got operationally ugly. The A100 run kept moving cleanly.
What The Samples Sound Like
Qualitatively:
- there is clearly a human voice present
- the outputs are not static
- they are not tonal droning failures
- they are still largely babble rather than intelligible text
- later checkpoints sound more stable than earlier ones, but sample quality variance is still real
That is exactly the threshold I cared about for this pilot. I was not trying to get polished TTS from a 21.9M unconditioned model; I wanted evidence that the hybrid architecture could learn speech codec structure at all.
It did.
Local sample bundles were saved for:
steps 1200, 1400, 1600, and 1800,
and the best saved checkpoint by the end of the session was step 1800.
What Failed
The systems story was much worse than the learning story.
Main problems:
- the fused FLA Gated DeltaNet kernel did not train cleanly on the pod software stack
- cheap pod infrastructure was volatile and restarted at bad times
- pod-local storage could not be trusted
- the run likely died right as step 2000 was being saved
That said, I did manage to pull the important checkpoints locally before the pod disappeared:
1200, 1400, 1600, 1800, latest, and best.pt
So the final result survived even though the pod did not.
Significance?
I believe the primary lesson is:
a tiny OLMo-hybrid style recurrent-attention LM can learn enough speech codec structure on a clean single-speaker corpus to produce clear voice-like audio surprisingly early in training.
Even without:
- a fused recurrent kernel
- a long training budget
- text conditioning
- semantic token hierarchies
- a polished speech decoder stack
So the obvious next step is text conditioning. Once the model is already making convincing voice-like audio, the high-value problem moves from simple unconditioned sampling to actual steering.
Next Steps
- use the 1800 checkpoint as the main saved pilot result
- compare 1400 / 1600 / 1800 sample quality side-by-side
- add a matched transformer baseline later, on real GPU access
- move to text-conditioned speech generation next
- revisit the fused GDN kernel on a more stable CUDA/Triton stack
This run was messy to execute, but the core result was clean: the model learned real speech structure.