Completed 12k Run: Early Speech Emergence in a 21.9M OLMo-Hybrid Speech LM
This is the follow-up to my earlier 1800-step pilot. I resumed the same 21.9M OLMo-hybrid speech codec LM on an A100 and let it run all the way to 12000 steps.
The short version is that the run finished cleanly, validation kept improving the whole time, and the final samples are noticeably stronger than the earlier 15%-budget checkpoint.
Final result:
- best checkpoint: step 12000
- EMA validation loss: 3.8207
- perplexity: 45.63
- dataset: LJ Speech
- tokenizer: EnCodec 24 kHz, 8 codebooks
- hardware: A100-SXM4-80GB
Setup
This is still an unconditional codec language model, not a text-conditioned TTS system yet.
| Item | Value |
|---|---|
| Model | OLMo-hybrid speech LM |
| Parameters | 21.9M |
| Backbone | 8 layers = 6 Gated DeltaNet blocks + 2 attention blocks |
| Width | d_model=384, d_ff=1024 |
| Attention | 6 heads, 2 KV heads |
| Hybrid schedule | attention every 4th block, final block forced to attention |
| Data | LJ Speech |
| Tokenizer | EnCodec 24 kHz, 8 codebooks, vocab 1027 |
| Chunking | 8s chunks |
| Split | 12,624 train / 666 val |
| Context | 1024 delayed steps |
| Hardware | A100-SXM4-80GB |
| Runtime | bf16, fused AdamW, CUDA SDPA flash path on, fused FLA GDN path off |
| Stable batch | true batch 24, grad accum 1 |
| Throughput | about 18k tok/s |
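The hybrid schedule row can be read as a simple rule over block positions. The following is a minimal sketch, assuming 1-indexed block counting; the function name `layer_types` and the labels `"gdn"` and `"attention"` are hypothetical and not taken from the training code. Under that assumption the rule reproduces the 6 + 2 split listed in the Backbone row.

```python
def layer_types(n_layers: int = 8, attn_every: int = 4) -> list[str]:
    """Return one block type per layer (positions counted from 1).

    Assumption: a block is attention if its position is a multiple of
    `attn_every`, or if it is the final block; everything else is a
    Gated DeltaNet ("gdn") block.
    """
    types = []
    for pos in range(1, n_layers + 1):
        is_attention = (pos % attn_every == 0) or (pos == n_layers)
        types.append("attention" if is_attention else "gdn")
    return types


schedule = layer_types()
print(schedule)
print(schedule.count("gdn"), "GDN blocks,", schedule.count("attention"), "attention blocks")

# Grouped-query attention: 6 query heads share 2 KV heads.
n_heads, n_kv_heads = 6, 2
print(n_heads // n_kv_heads, "query heads per KV head")
```

With 8 layers the forced final attention block coincides with the "every 4th" rule, so the override only matters for depths that are not a multiple of 4.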
One important systems caveat: this successful run still used the plain PyTorch recurrent fallback for the Gated DeltaNet blocks. The intended fused FLA recurrent kernel was unstable on the available pod stack, so this is a clean modeling result more than a clean kernel-stack result.
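For readers unfamiliar with "delayed steps" in the Context row: a common way to flatten multi-codebook codec tokens is a delay pattern, where codebook k is shifted right by k steps so that all codebooks can be predicted in one pass per step. The sketch below assumes that scheme; the exact pattern used in this run is not spelled out here, and `apply_delay`, `delayed_steps`, and the `PAD` id are hypothetical names for illustration. The 75 frames/s figure is the EnCodec 24 kHz frame rate.

```python
FRAME_RATE_HZ = 75  # EnCodec 24 kHz produces 75 code frames per second
PAD = -1            # hypothetical placeholder id for shifted-in positions


def apply_delay(codes: list[list[int]], pad: int = PAD) -> list[list[int]]:
    """Shift codebook k right by k steps.

    codes is laid out as [n_codebooks][n_frames]; the result has
    n_frames + n_codebooks - 1 steps per codebook.
    """
    n_codebooks = len(codes)
    n_frames = len(codes[0])
    total = n_frames + n_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        right_pad = total - n_frames - k
        delayed.append([pad] * k + list(row) + [pad] * right_pad)
    return delayed


def delayed_steps(seconds: float, n_codebooks: int = 8) -> int:
    """Sequence length of one chunk after applying the delay pattern."""
    return int(seconds * FRAME_RATE_HZ) + n_codebooks - 1


# Toy example with 3 codebooks and 3 frames.
toy = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]
for row in apply_delay(toy):
    print(row)

print(delayed_steps(8.0))  # 600 frames + 7 delay steps
```

Under these assumptions an 8 s chunk occupies 607 delayed steps, which fits inside the 1024-step context.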
Validation Progression
The important part is that validation never rolled over: EMA validation loss fell at every logged checkpoint, just more slowly toward the end.
| Step | Train Loss | EMA Val Loss | PPL |
|---|---|---|---|
| 200 | 5.0049 | 6.7878 | 886.99 |
| 1000 | 4.1367 | 4.8510 | 127.87 |
| 1800 | 4.2224 | 4.2847 | 72.58 |
| 3200 | 4.2821 | 4.0500 | 57.40 |
| 5200 | 3.7050 | 3.9194 | 50.37 |
| 7400 | 3.8102 | 3.8551 | 47.23 |
| 10000 | 3.8445 | 3.8267 | 45.91 |
| 12000 | 3.7043 | 3.8207 | 45.63 |
So the original 1800-step pilot result held up, but it was not the end of the useful training regime: EMA validation loss went from 4.2847 at step 1800 to 3.8207 at step 12000.
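The PPL column in the table above is consistent with perplexity computed as the exponential of the EMA validation loss, assuming the loss is a natural-log cross-entropy per token. A quick check of that relationship, using the values copied from the table (the variable names are mine):

```python
import math

# step -> (EMA validation loss, reported perplexity), copied from the table
reported = {
    200: (6.7878, 886.99),
    1000: (4.8510, 127.87),
    1800: (4.2847, 72.58),
    3200: (4.0500, 57.40),
    5200: (3.9194, 50.37),
    7400: (3.8551, 47.23),
    10000: (3.8267, 45.91),
    12000: (3.8207, 45.63),
}

for step, (loss, ppl) in reported.items():
    recomputed = math.exp(loss)  # perplexity = exp(cross-entropy in nats)
    print(f"step {step:>5}  loss={loss:.4f}  reported={ppl:.2f}  exp(loss)={recomputed:.2f}")
```

Small differences in the last digit are expected because the losses in the table are rounded to four decimals.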
Sample Progression
What I cared about most was whether this architecture could stay in a clearly speech-like regime and get cleaner with more training. It did.
Earlier pilot checkpoint: step 1800
Mid-run checkpoint: step 7400
Final-best checkpoint: step 12000
Sample 1
Sample 2
Sample 3
These are still babbly and not semantically grounded. But they are much less in the “barely holding together” regime than the early pilot, and that matters.
What I Think This Says
The narrow claim I am comfortable making is:
a small OLMo-hybrid / Gated-DeltaNet-style speech codec LM can learn enough local speech structure on a clean single-speaker corpus to produce stable, clearly voice-like audio, and it keeps improving well past the first emergence point.
What I am not claiming:
- that this is a finished TTS system
- that it beats a matched transformer baseline
- that the current no-FLA run reflects the architecture’s ideal efficiency
- that these samples are semantically meaningful speech
This is still an architecture viability result. But it is a real one.
Next Step
This is enough to justify moving into text conditioning.
The next question is whether I can inject text cleanly enough to get controllable reading rather than only speech-like babble.
So the plan from here is:
- keep the current audio decoder
- add a small text encoder
- add cross-attention from selected decoder blocks into text states
- start with sentence-level LJ Speech transcripts before worrying about anything longer-form
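To make the cross-attention item in the plan above concrete, here is a minimal, dependency-free sketch of the mechanism: each audio decoder state forms a query, attends over the text encoder states, and adds the resulting mixture back as a residual. This is an illustration only, not the planned implementation. It assumes a single head and identity projections (no learned Q/K/V weights), and the function name `cross_attention` is hypothetical; a real version would use learned projections in the training framework.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(audio_states: list[list[float]],
                    text_states: list[list[float]]) -> list[list[float]]:
    """Single-head cross-attention with identity projections.

    audio_states: [T_audio][d], used as queries.
    text_states:  [T_text][d], used as both keys and values.
    Returns [T_audio][d]: each audio state plus its attended text mixture.
    """
    d = len(text_states[0])
    out = []
    for q in audio_states:
        # Scaled dot-product score of this query against every text state.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in text_states]
        weights = softmax(scores)
        # Weighted sum of text states (the values).
        mixed = [sum(w * v[j] for w, v in zip(weights, text_states))
                 for j in range(d)]
        # Residual connection back into the audio stream.
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


audio = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in cross_attention(audio, text):
    print(row)
```

The output keeps the audio sequence length, so a block like this can be dropped into selected decoder layers without changing the shape of the audio stream.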