Ok, so I might have had a light W with this little speech experiment. Yes, OLMo-style recurrent-attention hybrids are pretty good.

I wanted a cheap answer to a simple question: can a tiny OLMo-hybrid style recurrent-attention language model learn enough codec structure to produce voice-like audio instead of static?

Seems like it did, haha.

By about step 1800, the 21.9M-parameter OLMo-hybrid trained on LJ Speech with EnCodec 24 kHz tokens was already producing samples with a clear speaking voice. The outputs were still mildly babbly and not semantically coherent, but they were clearly past the “hiss / drone / decoder garbage” failure mode.

That happened at only about 15% of the originally planned 12k training budget.


Setup

Item              Value
Model             OLMo-hybrid speech LM
Parameters        21.9M
Backbone          8 layers = 6 Gated DeltaNet blocks + 2 attention blocks
Width             d_model=384, d_ff=1024
Attention         6 heads, 2 KV heads
Hybrid schedule   attention every 4th block
Data              LJ Speech
Tokenizer         EnCodec 24 kHz, 8 codebooks, vocab 1027
Chunking          8s chunks
Split             12,624 train / 666 val
Context           1024 delayed steps
Hardware          A100-SXM4-80GB
Runtime           bf16, fused AdamW, CUDA SDPA flash path on, fused FLA GDN path off
Stable batch      true batch 24, grad accum 1
Throughput        about 17.9k tok/s
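
For concreteness, the data pipeline looks roughly like the sketch below. This is not the actual training script: it assumes the facebookresearch/encodec package plus torchaudio, and the delay-pattern detail (each codebook shifted right by its index, MusicGen-style) is my reading of the "1024 delayed steps" row rather than a quote of the run's code.

```python
import torch
import torchaudio
from encodec import EncodecModel

# 24 kHz EnCodec; the 6 kbps bandwidth setting corresponds to 8 codebooks.
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)

def wav_to_codes(path: str, chunk_seconds: float = 8.0) -> torch.Tensor:
    """Load one LJ Speech clip and return EnCodec codes of shape [8, T]."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, codec.sample_rate)
    wav = wav[:, : int(chunk_seconds * codec.sample_rate)]  # crude 8 s chunking
    with torch.no_grad():
        frames = codec.encode(wav.unsqueeze(0))              # input is [B=1, C=1, T]
    return torch.cat([codes for codes, _ in frames], dim=-1)[0]  # [8, T]

def apply_delay(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Assumed MusicGen-style delay: codebook k is shifted right by k steps.

    pad_id stands in for one of the few special ids implied by the 1027 vocab.
    """
    n_q, T = codes.shape
    out = torch.full((n_q, T + n_q - 1), pad_id, dtype=codes.dtype)
    for k in range(n_q):
        out[k, k:k + T] = codes[k]
    return out
```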

A bit of an important note: this run did not use the intended fused Flash Linear Attention Gated DeltaNet kernel. The Triton backward kernel was unstable on my runpod stack and kept causing crashes, so I had the recurrent core fall back to the plain PyTorch scan implementation. That means the run was still much slower than ideal for an A100, which is a sort of upside: with a working fused kernel, the same result should arrive much quicker.
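
For the curious, the fallback path is essentially a per-timestep loop instead of a chunked Triton kernel. The snippet below is not the FLA implementation, just a minimal single-head sketch of the kind of gated delta-rule scan the fallback performs; the exact gating and normalization in the real block differ.

```python
import torch

def gated_delta_scan(q, k, v, beta, g):
    """Naive O(T) gated delta-rule recurrence for one head.

    q, k: [T, d_k]; v: [T, d_v]; beta: [T] write strength; g: [T] decay gate.
    A fused kernel computes this kind of recurrence in parallel chunks; this
    sequential version is the shape of what the run actually fell back to.
    """
    T, d_k = k.shape
    S = q.new_zeros(d_k, v.shape[-1])                       # running "fast weight" state
    outs = []
    for t in range(T):
        S = g[t] * S                                        # decay / forget
        pred = k[t] @ S                                     # current prediction for k_t
        S = S + beta[t] * torch.outer(k[t], v[t] - pred)    # delta-rule write
        outs.append(q[t] @ S)                               # read with the query
    return torch.stack(outs)                                # [T, d_v]
```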


Validation Trajectory

Validation loss on the A100 run improved monotonically through every saved eval up to step 1800.

Step    EMA Val Loss    PPL
200     6.7878          886.99
400     6.1530          470.13
600     5.6474          283.54
800     5.1973          180.78
1000    4.8510          127.87
1200    4.6169          101.18
1400    4.4626          86.72
1600    4.3585          78.14
1800    4.2847          72.58
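
For reference, the PPL column is just the exponential of the EMA validation loss:

```python
import math

# e.g. the step-1800 checkpoint:
math.exp(4.2847)  # ≈ 72.58
```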

This matters because my earlier M4 Mini run had already shown the model could get to “speech-like babble,” but it plateaued and got operationally ugly. The A100 run kept moving cleanly.


What The Samples Sound Like

Qualitatively:

  • there is clearly a human voice present
  • the outputs are not static
  • they are not tonal droning failures
  • they are still largely babble rather than intelligible text
  • later checkpoints sound more stable than earlier ones, but sample quality variance is still real

That is exactly the threshold I cared about for this pilot. I was not trying to get polished TTS from a 21.9M unconditioned model; I wanted evidence that the hybrid architecture could learn speech codec structure at all.

It did.
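
Listening to a sample is just the tokenization step in reverse. A minimal decode sketch, again assuming the facebookresearch/encodec decoder and that the sampled tokens have already been un-delayed back to an [8, T] code grid (the helper name and file name are hypothetical):

```python
import torch
import torchaudio
from encodec import EncodecModel

codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)

def codes_to_wav(codes: torch.Tensor, out_path: str) -> None:
    """codes: [8, T] EnCodec indices with the delay pattern already undone."""
    with torch.no_grad():
        wav = codec.decode([(codes.unsqueeze(0), None)])  # -> [1, 1, samples]
    torchaudio.save(out_path, wav[0].cpu(), codec.sample_rate)

# codes_to_wav(sampled_codes, "sample_step1800.wav")
```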

Local sample bundles were saved for:

  • step 1200
  • step 1400
  • step 1600
  • step 1800

and the best saved checkpoint by the end of the session was step 1800.


What Failed

The systems story was much worse than the learning story.

Main problems:

  • the fused FLA Gated DeltaNet kernel did not train cleanly on the pod software stack
  • cheap pod infrastructure was volatile and restarted at bad times
  • pod-local storage could not be trusted
  • the run likely died right as step 2000 was being saved

That said, I did manage to pull the important checkpoints locally before the pod disappeared:

  • 1200
  • 1400
  • 1600
  • 1800
  • latest best.pt

So the final result survived even though the pod did not.


Significance?

I believe the primary lesson is:

a tiny OLMo-hybrid style recurrent-attention LM can learn enough speech codec structure on a clean single-speaker corpus to produce clear voice-like audio surprisingly early in training.

Even without:

  • a fused recurrent kernel
  • a long training budget
  • text conditioning
  • semantic token hierarchies
  • a polished speech decoder stack

So it seems fairly obvious that my next step is text conditioning. Once the model is already making convincing voice-like audio, the high-value problem shifts from simple unconditioned sampling to actual steering.
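
Mechanically, the simplest version I have in mind (a sketch of a plan, not something implemented, and every name below is a placeholder) is to prefix each delayed codec-token sequence with the transcript's byte tokens and keep the same decoder-only objective:

```python
import torch

def build_conditioned_example(text: str, codec_tokens: torch.Tensor,
                              text_offset: int, sep_id: int) -> torch.Tensor:
    """Hypothetical layout: [text bytes] [SEP] [flattened delayed codec tokens].

    text_offset and sep_id stand in for wherever the new ids land once the
    current 1027-entry vocab is extended to cover text.
    """
    text_ids = torch.tensor([text_offset + b for b in text.encode("utf-8")], dtype=torch.long)
    sep = torch.tensor([sep_id], dtype=torch.long)
    return torch.cat([text_ids, sep, codec_tokens.flatten().long()])
```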


Next Steps

  • use the 1800 checkpoint as the main saved pilot result
  • compare 1400 / 1600 / 1800 sample quality side-by-side
  • add a matched transformer baseline later on real GPU access
  • move to text-conditioned speech generation next
  • revisit the fused GDN kernel on a more stable CUDA/Triton stack

This run was messy to execute, but the core result was clean: the model learned real speech structure.