<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://obiohagwu.github.io/obiohagwu/feed.xml" rel="self" type="application/atom+xml" /><link href="https://obiohagwu.github.io/obiohagwu/" rel="alternate" type="text/html" /><updated>2026-04-08T20:35:05+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/feed.xml</id><title type="html">obiohagwu</title><subtitle>Independent research notes on interpretability, representation geometry, and sequence and speech modeling.</subtitle><entry><title type="html">Topological Signal In Learned Odor Embeddings</title><link href="https://obiohagwu.github.io/obiohagwu/2026/04/08/Topological-signal-in-learned-odor-embeddings.html" rel="alternate" type="text/html" title="Topological Signal In Learned Odor Embeddings" /><published>2026-04-08T00:00:00+00:00</published><updated>2026-04-08T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/04/08/Topological-signal-in-learned-odor-embeddings</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/04/08/Topological-signal-in-learned-odor-embeddings.html"><![CDATA[<p>Got nerdsniped recently.</p>

<p>Q:</p>
<blockquote>
  <p>if a learned odor embedding appears to show nontrivial topology, does that signal survive metric choice, subsampling, null models, checkpoint variation, and comparison to ordinary chemical baselines?</p>
</blockquote>

<p>That is a much more defensible question than “what is the topology of odor space?”</p>

<p>Useful links:</p>

<ul>
  <li><a href="/obiohagwu/images/odor-topology/robust-but-not-unique-odor-topology.pdf">paper (PDF)</a></li>
  <li><a href="https://github.com/Obiohagwu/odor-topology">code and analysis repo</a></li>
</ul>

<hr />

<h3 id="what-i-actually-tested">What I Actually Tested</h3>

<p>The representation under audit was OpenPOM, the open implementation of the Principal Odor Map line of work.</p>

<p>The comparison stack was:</p>

<ul>
  <li>two primary OpenPOM checkpoints for matched direct comparisons</li>
  <li>all <code class="language-plaintext highlighter-rouge">10</code> released OpenPOM ensemble checkpoints for stability checks</li>
  <li>a curated <code class="language-plaintext highlighter-rouge">4,983</code>-molecule GoodScents/Leffingwell table</li>
  <li>a broader <code class="language-plaintext highlighter-rouge">5,862</code>-molecule GS/LF table</li>
  <li>a <code class="language-plaintext highlighter-rouge">1,600</code>-molecule non-overlap subset</li>
  <li>Morgan bit fingerprints</li>
  <li>Morgan count fingerprints</li>
  <li>a small RDKit physicochemical descriptor baseline</li>
</ul>

<p>The main target was persistent <code class="language-plaintext highlighter-rouge">H1</code>, not “all topology.” The whole point was to see whether any apparent loop-like structure was actually robust enough to survive the usual failure modes.</p>
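<p>The core robustness check can be sketched as a subsample-vs-null protocol. The sketch below is illustrative rather than the paper's actual code: in the real analysis the summary statistic would be an <code class="language-plaintext highlighter-rouge">H1</code> persistence summary from a TDA library such as ripser; here a simple radial-spread proxy stands in so the example runs with numpy alone, and all names are placeholders.</p>

```python
import numpy as np

def robustness_ratio(X, summary, n_sub=20, frac=0.5, seed=0):
    """Average a topological summary over random subsamples of X, divided by
    the same summary on matched nulls (each coordinate shuffled independently,
    which preserves marginals but destroys joint structure)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    real, null = [], []
    for _ in range(n_sub):
        sub = X[rng.choice(n, size=int(frac * n), replace=False)]
        real.append(summary(sub))
        shuffled = np.column_stack([rng.permutation(col) for col in sub.T])
        null.append(summary(shuffled))
    return float(np.mean(real) / np.mean(null))

# Toy stand-in for an H1 persistence summary: points on a loop sit at a
# near-constant distance from their centroid, shuffled nulls do not.
def radial_concentration(P):
    r = np.linalg.norm(P - P.mean(axis=0), axis=1)
    return 1.0 / (1.0 + r.std())

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
ratio = robustness_ratio(circle, radial_concentration)  # > 1 for loop-like data
```

<p>A ratio well above <code class="language-plaintext highlighter-rouge">1</code> under many subsamples is the kind of evidence the audit looks for; a ratio near <code class="language-plaintext highlighter-rouge">1</code> means the apparent structure does not beat its matched null.</p>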

<hr />

<h3 id="the-main-result">The Main Result</h3>

<p>The shortest honest summary is:</p>

<ul>
  <li>POM really does show reproducible <code class="language-plaintext highlighter-rouge">H1</code> signal</li>
  <li>that signal survives repeated subsampling, matched nulls, dataset changes, and checkpoint variation</li>
  <li>but it is <strong>not unique to POM</strong></li>
</ul>

<p>That last point matters. The strongest Morgan bit fingerprint baselines were at least as strong as, and often stronger than, POM on the raw robustness ratios used here.</p>

<p>So this is <strong>not</strong> a “we found hidden loops in odor space and POM wins” paper.</p>

<p>It is a much narrower and, I think, more useful result:</p>

<blockquote>
  <p>topological signal in learned odor representations is real enough to survive serious scrutiny, but current evidence does not justify treating that topology as uniquely favorable to POM.</p>
</blockquote>

<hr />

<h3 id="why-i-still-think-the-result-is-interesting">Why I Still Think The Result Is Interesting</h3>

<p>The part I find most interesting is not the trivial leaderboard question of which space gets the biggest persistence ratio.</p>

<p>The interesting part is the compression mismatch.</p>

<p>POM here is a <code class="language-plaintext highlighter-rouge">256</code>-dimensional dense learned embedding. The strongest chemical baselines are <code class="language-plaintext highlighter-rouge">2048</code>-bit sparse combinatorial encodings. Those are not the same kind of objects.</p>

<p>So we have:</p>
<ul>
  <li>sparse fingerprints preserve lots of raw combinatorial chemical detail, which can naturally produce strong persistent structure</li>
  <li>POM is much more compressed, so the fact that it still retains robust <code class="language-plaintext highlighter-rouge">H1</code> at all is nontrivial</li>
  <li>but that still does <strong>not</strong> prove that POM preserves <em>better</em> odor-relevant structure</li>
</ul>

<p>That distinction is basically the spine of the paper now.</p>

<p>If I push this line further, the next version would likely compare POM against <strong>matched-bottleneck chemical baselines</strong> rather than only against much larger sparse spaces. PCA-compressed fingerprints, random projections, or autoencoder-compressed baselines would make the fairness question much cleaner.</p>
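<p>A minimal version of one such matched-bottleneck baseline, assuming nothing beyond numpy and using random toy data in place of real fingerprints: PCA-compress <code class="language-plaintext highlighter-rouge">2048</code>-bit fingerprints down to the same <code class="language-plaintext highlighter-rouge">256</code> dimensions as the learned embedding before running the same topology pipeline.</p>

```python
import numpy as np

def pca_compress(F, k=256):
    """Project centered fingerprint rows onto their top-k principal axes,
    giving a dense k-dim baseline with the same bottleneck as the embedding."""
    Fc = F - F.mean(axis=0)
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
    return Fc @ Vt[:k].T

rng = np.random.default_rng(0)
fingerprints = (rng.random((500, 2048)) < 0.05).astype(float)  # toy sparse bits
Z = pca_compress(fingerprints, k=256)  # dense (500, 256) baseline
```

<p>Random projections would be an even cheaper control: they need no fitting and still give a fixed-width dense space to compare against.</p>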

<hr />

<h3 id="utility-beyond-descriptive-topology">Utility: Beyond descriptive topology?</h3>

<p>I also did a fairly narrow first utility pass, because I did not want the whole project to stop at “look, persistence diagrams.”</p>

<p>The utility question was:</p>

<blockquote>
  <p>do topology-derived local features explain neighborhood-level odor-label structure better than local geometry alone?</p>
</blockquote>

<p>The answer was a cautious yes, though in a limited and target-dependent way.</p>

<p>Local topology sometimes improved prediction of neighborhood ambiguity / entropy-style targets beyond local geometry alone. But again, the gains were not unique to POM: some ordinary chemistry baselines also seemed to benefit.</p>
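<p>The kind of target meant here can be made concrete. A numpy-only sketch of a neighborhood-entropy target, with all names illustrative rather than taken from the paper's code:</p>

```python
import numpy as np

def neighborhood_entropy(X, labels, k=10):
    """Per-point Shannon entropy of the label distribution among the k
    nearest neighbors: high entropy = an ambiguous odor neighborhood."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)              # exclude self from neighbors
    nn = np.argsort(D, axis=1)[:, :k]
    out = np.empty(len(X))
    for i, row in enumerate(nn):
        _, counts = np.unique(labels[row], return_counts=True)
        p = counts / counts.sum()
        out[i] = float(-(p * np.log(p)).sum())
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))              # toy embedding
labels = rng.integers(0, 5, size=100)      # toy odor labels
H = neighborhood_entropy(X, labels, k=10)
```

<p>The utility question is then whether local topological features predict <code class="language-plaintext highlighter-rouge">H</code> better than local geometric features (density, neighbor distances) alone.</p>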

<p>So the utility result is quite modest:</p>

<ul>
  <li>topology can add explanatory value</li>
  <li>the gains are conditional and not currently enough to claim that topology is a uniquely powerful odor-modeling advantage</li>
</ul>

<h3 id="clear-s">Clear Non-Claims</h3>

<p>I want to be very explicit about the non-claims here.</p>

<p>I am <strong>not</strong> claiming:</p>

<ul>
  <li>that this reveals the true topology of human odor perception</li>
  <li>that persistent <code class="language-plaintext highlighter-rouge">H1</code> implies clean interpretable circular perceptual dimensions</li>
  <li>that current learned odor embeddings uniquely beat strong chemical baselines</li>
  <li>that topology has already been shown to confer large practical gains for odor design</li>
</ul>

<p>The paper is strongest as a representation audit.</p>

<p>That is still worth something. In a lot of modern embedding work, people jump from “the space predicts well” to “the geometry must mean something deep.” This project is an attempt to put that move under pressure in a domain where the representations are interesting, but the claims can get fuzzy very quickly.</p>

<hr />

<h3 id="why">Why?</h3>
<p>I think there is real value in results that sharpen what a representation does and does not justify.</p>

<p>The cleanest positive claim from this line is mostly:</p>

<ul>
  <li>“compressed learned odor spaces can preserve real topological structure, but robust topology is not automatically a learned-embedding-only phenomenon”</li>
</ul>

<p>That is a stricter and less glamorous claim, but it is the one I can currently defend.</p>

<p>If you want the compact artifact version, the paper is here again:</p>

<ul>
  <li><a href="/obiohagwu/images/odor-topology/robust-but-not-unique-odor-topology.pdf">Topological signal in learned odor embeddings (PDF)</a></li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[Got nerdsniped recently.]]></summary></entry><entry><title type="html">Hmm1</title><link href="https://obiohagwu.github.io/obiohagwu/2026/03/28/Hmm1.html" rel="alternate" type="text/html" title="Hmm1" /><published>2026-03-28T00:00:00+00:00</published><updated>2026-03-28T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/03/28/Hmm1</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/03/28/Hmm1.html"><![CDATA[<p>I’m not entirely sure. Wouldn’t you also want to?</p>

<p>There’s nothing else to all this really, friends.</p>

<ul>
  <li>More personal Diary entry post.</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[I’m not entirely sure. Wouldn’t you also want to?]]></summary></entry><entry><title type="html">Completed 12k Run: Early Speech Emergence in a 21.9M OLMo-Hybrid Speech LM</title><link href="https://obiohagwu.github.io/obiohagwu/2026/03/14/Completed-12k-Run-Early-Speech-Emergence-in-a-21.9M-OLMo-Hybrid-Speech-LM.html" rel="alternate" type="text/html" title="Completed 12k Run: Early Speech Emergence in a 21.9M OLMo-Hybrid Speech LM" /><published>2026-03-14T00:00:00+00:00</published><updated>2026-03-14T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/03/14/Completed%2012k%20Run%20Early%20Speech%20Emergence%20in%20a%2021.9M%20OLMo-Hybrid%20Speech%20LM</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/03/14/Completed-12k-Run-Early-Speech-Emergence-in-a-21.9M-OLMo-Hybrid-Speech-LM.html"><![CDATA[<p>This is the follow-up to my earlier <code class="language-plaintext highlighter-rouge">1800</code>-step pilot. I resumed the same <code class="language-plaintext highlighter-rouge">21.9M</code> OLMo-hybrid speech codec LM on A100 and let it run all the way to <code class="language-plaintext highlighter-rouge">12000</code> steps.</p>

<p>The short version is that the run finished cleanly, validation kept improving the whole time, and the final samples are noticeably stronger than the earlier <code class="language-plaintext highlighter-rouge">15%</code>-budget checkpoint.</p>

<p>Final result:</p>

<ul>
  <li>best checkpoint: <code class="language-plaintext highlighter-rouge">step 12000</code></li>
  <li>EMA validation loss: <code class="language-plaintext highlighter-rouge">3.8207</code></li>
  <li>perplexity: <code class="language-plaintext highlighter-rouge">45.63</code></li>
  <li>dataset: <code class="language-plaintext highlighter-rouge">LJ Speech</code></li>
  <li>tokenizer: <code class="language-plaintext highlighter-rouge">EnCodec 24 kHz</code>, <code class="language-plaintext highlighter-rouge">8</code> codebooks</li>
  <li>hardware: <code class="language-plaintext highlighter-rouge">A100-SXM4-80GB</code></li>
</ul>

<p>Useful links:</p>

<ul>
  <li><a href="/obiohagwu/images/olmo-hybrid-speech/final_report.pdf">full technical report (PDF)</a></li>
  <li><a href="https://github.com/Obiohagwu/olmo-hybrid-speech">code</a></li>
  <li><a href="/obiohagwu/images/olmo-hybrid-speech/a100_eval_history_full.csv">full eval history CSV</a></li>
</ul>

<hr />

<h3 id="setup">Setup</h3>

<p>This is still an unconditional codec language model, not a text-conditioned TTS system yet.</p>

<table>
  <thead>
    <tr>
      <th>Item</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Model</td>
      <td>OLMo-hybrid speech LM</td>
    </tr>
    <tr>
      <td>Parameters</td>
      <td><code class="language-plaintext highlighter-rouge">21.9M</code></td>
    </tr>
    <tr>
      <td>Backbone</td>
      <td><code class="language-plaintext highlighter-rouge">8</code> layers = <code class="language-plaintext highlighter-rouge">6</code> Gated DeltaNet blocks + <code class="language-plaintext highlighter-rouge">2</code> attention blocks</td>
    </tr>
    <tr>
      <td>Width</td>
      <td><code class="language-plaintext highlighter-rouge">d_model=384</code>, <code class="language-plaintext highlighter-rouge">d_ff=1024</code></td>
    </tr>
    <tr>
      <td>Attention</td>
      <td><code class="language-plaintext highlighter-rouge">6</code> heads, <code class="language-plaintext highlighter-rouge">2</code> KV heads</td>
    </tr>
    <tr>
      <td>Hybrid schedule</td>
      <td>attention every <code class="language-plaintext highlighter-rouge">4th</code> block, final block forced to attention</td>
    </tr>
    <tr>
      <td>Data</td>
      <td><code class="language-plaintext highlighter-rouge">LJ Speech</code></td>
    </tr>
    <tr>
      <td>Tokenizer</td>
      <td><code class="language-plaintext highlighter-rouge">EnCodec 24 kHz</code>, <code class="language-plaintext highlighter-rouge">8</code> codebooks, vocab <code class="language-plaintext highlighter-rouge">1027</code></td>
    </tr>
    <tr>
      <td>Chunking</td>
      <td><code class="language-plaintext highlighter-rouge">8s</code> chunks</td>
    </tr>
    <tr>
      <td>Split</td>
      <td><code class="language-plaintext highlighter-rouge">12,624</code> train / <code class="language-plaintext highlighter-rouge">666</code> val</td>
    </tr>
    <tr>
      <td>Context</td>
      <td><code class="language-plaintext highlighter-rouge">1024</code> delayed steps</td>
    </tr>
    <tr>
      <td>Hardware</td>
      <td><code class="language-plaintext highlighter-rouge">A100-SXM4-80GB</code></td>
    </tr>
    <tr>
      <td>Runtime</td>
      <td><code class="language-plaintext highlighter-rouge">bf16</code>, fused AdamW, CUDA SDPA flash path on, fused FLA GDN path off</td>
    </tr>
    <tr>
      <td>Stable batch</td>
      <td>true batch <code class="language-plaintext highlighter-rouge">24</code>, grad accum <code class="language-plaintext highlighter-rouge">1</code></td>
    </tr>
    <tr>
      <td>Throughput</td>
      <td>about <code class="language-plaintext highlighter-rouge">18k tok/s</code></td>
    </tr>
  </tbody>
</table>

<p>One important systems caveat: this successful run still used the plain PyTorch recurrent fallback for the Gated DeltaNet blocks. The intended fused FLA recurrent kernel was unstable on the available pod stack, so this is a clean modeling result more than a clean kernel-stack result.</p>
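<p>For concreteness, the layer layout from the table above ("attention every 4th block, final block forced to attention") can be sketched as a small helper; the block-kind names are illustrative, not the repo's actual identifiers.</p>

```python
def layer_schedule(n_layers=8, attn_every=4):
    """8 layers = 6 Gated-DeltaNet-style blocks + 2 attention blocks:
    attention at every attn_every-th position, final block forced to attention."""
    kinds = ["attn" if (i + 1) % attn_every == 0 else "gdn" for i in range(n_layers)]
    kinds[-1] = "attn"  # final block forced to attention
    return kinds

schedule = layer_schedule()
# ['gdn', 'gdn', 'gdn', 'attn', 'gdn', 'gdn', 'gdn', 'attn']
```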

<hr />

<h3 id="validation-progression">Validation Progression</h3>

<p>The important part is that the run never really rolled over. It just kept getting better more slowly.</p>

<table>
  <thead>
    <tr>
      <th>Step</th>
      <th style="text-align: right">Train Loss</th>
      <th style="text-align: right">EMA Val Loss</th>
      <th style="text-align: right">PPL</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>200</td>
      <td style="text-align: right">5.0049</td>
      <td style="text-align: right">6.7878</td>
      <td style="text-align: right">886.99</td>
    </tr>
    <tr>
      <td>1000</td>
      <td style="text-align: right">4.1367</td>
      <td style="text-align: right">4.8510</td>
      <td style="text-align: right">127.87</td>
    </tr>
    <tr>
      <td>1800</td>
      <td style="text-align: right">4.2224</td>
      <td style="text-align: right">4.2847</td>
      <td style="text-align: right">72.58</td>
    </tr>
    <tr>
      <td>3200</td>
      <td style="text-align: right">4.2821</td>
      <td style="text-align: right">4.0500</td>
      <td style="text-align: right">57.40</td>
    </tr>
    <tr>
      <td>5200</td>
      <td style="text-align: right">3.7050</td>
      <td style="text-align: right">3.9194</td>
      <td style="text-align: right">50.37</td>
    </tr>
    <tr>
      <td>7400</td>
      <td style="text-align: right">3.8102</td>
      <td style="text-align: right">3.8551</td>
      <td style="text-align: right">47.23</td>
    </tr>
    <tr>
      <td>10000</td>
      <td style="text-align: right">3.8445</td>
      <td style="text-align: right">3.8267</td>
      <td style="text-align: right">45.91</td>
    </tr>
    <tr>
      <td>12000</td>
      <td style="text-align: right">3.7043</td>
      <td style="text-align: right"><strong>3.8207</strong></td>
      <td style="text-align: right"><strong>45.63</strong></td>
    </tr>
  </tbody>
</table>

<p>So the original <code class="language-plaintext highlighter-rouge">1800</code>-step pilot was real, but it was not the end of the useful training regime.</p>
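<p>As a sanity check on the table, the PPL column is just the exponential of the EMA validation loss (token-level cross-entropy in nats):</p>

```python
import math

# perplexity = exp(cross-entropy loss in nats)
losses = {200: 6.7878, 1800: 4.2847, 12000: 3.8207}
ppl = {step: math.exp(loss) for step, loss in losses.items()}
```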

<hr />

<h3 id="sample-progression">Sample Progression</h3>

<p>What I cared about most was whether this architecture could stay in a clearly speech-like regime and get cleaner with more training. It did.</p>

<p>Earlier pilot checkpoint: <code class="language-plaintext highlighter-rouge">step 1800</code></p>

<audio controls="" preload="none" src="/obiohagwu/images/olmo-hybrid-speech/step1800_sample1.wav"></audio>

<p>Mid-run checkpoint: <code class="language-plaintext highlighter-rouge">step 7400</code></p>

<audio controls="" preload="none" src="/obiohagwu/images/olmo-hybrid-speech/step7400_sample0.wav"></audio>

<p>Final-best checkpoint: <code class="language-plaintext highlighter-rouge">step 12000</code></p>

<p>Sample 1</p>

<audio controls="" preload="none" src="/obiohagwu/images/olmo-hybrid-speech/step12000_sample0.wav"></audio>

<p>Sample 2</p>

<audio controls="" preload="none" src="/obiohagwu/images/olmo-hybrid-speech/step12000_sample1.wav"></audio>

<p>Sample 3</p>

<audio controls="" preload="none" src="/obiohagwu/images/olmo-hybrid-speech/step12000_sample2.wav"></audio>

<p>These are still babbly and not semantically grounded. But they are much less in the “barely holding together” regime than the early pilot, and that matters.</p>

<hr />

<h3 id="what-i-think-this-says">What I Think This Says</h3>

<p>The narrow claim I am comfortable making is:</p>

<blockquote>
  <p>a small OLMo-hybrid / Gated-DeltaNet-style speech codec LM can learn enough local speech structure on a clean single-speaker corpus to produce stable, clearly voice-like audio, and it keeps improving well past the first emergence point.</p>
</blockquote>

<p>What I am <strong>not</strong> claiming:</p>

<ul>
  <li>that this is a finished TTS system</li>
  <li>that it beats a matched transformer baseline</li>
  <li>that the current no-FLA run reflects the architecture’s ideal efficiency</li>
  <li>that these samples are semantically meaningful speech</li>
</ul>

<p>This is still an architecture viability result. But it is a real one.</p>

<hr />

<h3 id="next-step">Next Step</h3>

<p>This is enough to justify moving into text conditioning.</p>

<p>The next question is whether I can inject text cleanly enough to get controllable reading rather than only speech-like babble.</p>

<p>So the plan from here is:</p>

<ul>
  <li>keep the current audio decoder</li>
  <li>add a small text encoder</li>
  <li>add cross-attention from selected decoder blocks into text states</li>
  <li>start with sentence-level LJ Speech transcripts before worrying about anything longer-form</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[This is the follow-up to my earlier 1800-step pilot. I resumed the same 21.9M OLMo-hybrid speech codec LM on A100 and let it run all the way to 12000 steps.]]></summary></entry><entry><title type="html">A 21.9M hybrid recurrent attention to learn speech on 1.8k steps</title><link href="https://obiohagwu.github.io/obiohagwu/2026/03/10/A-21.9M-hybrid-recurrent-attention-to-learn-speech-on-1.8k-steps.html" rel="alternate" type="text/html" title="A 21.9M hybrid recurrent attention to learn speech on 1.8k steps" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/03/10/A%2021.9M%20hybrid%20recurrent%20attention%20to%20learn%20speech%20on%201.8k%20steps</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/03/10/A-21.9M-hybrid-recurrent-attention-to-learn-speech-on-1.8k-steps.html"><![CDATA[<p>Ok, so I might have had a light W with this little speech experiment. Yes, Olmo type recurrent attention hybrids are pretty good.</p>

<p>I wanted a cheap answer to a simple question: can a tiny OLMo-hybrid style recurrent-attention language model learn enough codec structure to produce <strong>voice-like audio instead of static</strong>?</p>

<p>Seems like it did, haha.</p>

<p>By about step <code class="language-plaintext highlighter-rouge">1800</code>, the <code class="language-plaintext highlighter-rouge">21.9M</code> parameter OLMo-hybrid trained on <code class="language-plaintext highlighter-rouge">LJ Speech</code> with <code class="language-plaintext highlighter-rouge">EnCodec 24 kHz</code> tokens was already producing samples with a clear speaking voice. The outputs were still mildly babbly and not semantically coherent, but they were clearly past the “hiss / drone / decoder garbage” failure mode.</p>

<p>That happened at only about <code class="language-plaintext highlighter-rouge">15%</code> of the originally planned <code class="language-plaintext highlighter-rouge">12k</code> training budget.</p>

<hr />

<h3 id="setup">Setup</h3>

<table>
  <thead>
    <tr>
      <th>Item</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Model</td>
      <td>OLMo-hybrid speech LM</td>
    </tr>
    <tr>
      <td>Parameters</td>
      <td><code class="language-plaintext highlighter-rouge">21.9M</code></td>
    </tr>
    <tr>
      <td>Backbone</td>
      <td><code class="language-plaintext highlighter-rouge">8</code> layers = <code class="language-plaintext highlighter-rouge">6</code> Gated DeltaNet blocks + <code class="language-plaintext highlighter-rouge">2</code> attention blocks</td>
    </tr>
    <tr>
      <td>Width</td>
      <td><code class="language-plaintext highlighter-rouge">d_model=384</code>, <code class="language-plaintext highlighter-rouge">d_ff=1024</code></td>
    </tr>
    <tr>
      <td>Attention</td>
      <td><code class="language-plaintext highlighter-rouge">6</code> heads, <code class="language-plaintext highlighter-rouge">2</code> KV heads</td>
    </tr>
    <tr>
      <td>Hybrid schedule</td>
      <td>attention every <code class="language-plaintext highlighter-rouge">4th</code> block</td>
    </tr>
    <tr>
      <td>Data</td>
      <td><code class="language-plaintext highlighter-rouge">LJ Speech</code></td>
    </tr>
    <tr>
      <td>Tokenizer</td>
      <td><code class="language-plaintext highlighter-rouge">EnCodec 24 kHz</code>, <code class="language-plaintext highlighter-rouge">8</code> codebooks, vocab <code class="language-plaintext highlighter-rouge">1027</code></td>
    </tr>
    <tr>
      <td>Chunking</td>
      <td><code class="language-plaintext highlighter-rouge">8s</code> chunks</td>
    </tr>
    <tr>
      <td>Split</td>
      <td><code class="language-plaintext highlighter-rouge">12,624</code> train / <code class="language-plaintext highlighter-rouge">666</code> val</td>
    </tr>
    <tr>
      <td>Context</td>
      <td><code class="language-plaintext highlighter-rouge">1024</code> delayed steps</td>
    </tr>
    <tr>
      <td>Hardware</td>
      <td><code class="language-plaintext highlighter-rouge">A100-SXM4-80GB</code></td>
    </tr>
    <tr>
      <td>Runtime</td>
      <td><code class="language-plaintext highlighter-rouge">bf16</code>, fused AdamW, CUDA SDPA flash path on, fused FLA GDN path off</td>
    </tr>
    <tr>
      <td>Stable batch</td>
      <td>true batch <code class="language-plaintext highlighter-rouge">24</code>, grad accum <code class="language-plaintext highlighter-rouge">1</code></td>
    </tr>
    <tr>
      <td>Throughput</td>
      <td>about <code class="language-plaintext highlighter-rouge">17.9k tok/s</code></td>
    </tr>
  </tbody>
</table>

<p>One important note: this run did <strong>not</strong> use the intended fused Flash Linear Attention Gated DeltaNet kernel. The Triton backward kernel was unstable on my runpod stack and kept causing crashes, so I had the recurrent core fall back to the plain PyTorch scan implementation. That means the run was much slower than ideal for an A100, which is arguably an upside: with a working fused kernel, the same result should arrive even faster.</p>
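<p>For readers unfamiliar with what that fallback computes, here is a numpy sketch of one common gated delta rule formulation (state decay <code class="language-plaintext highlighter-rouge">alpha</code>, write strength <code class="language-plaintext highlighter-rouge">beta</code>). This illustrates the sequential scan the fused Triton kernel replaces; the exact FLA parameterization may differ.</p>

```python
import numpy as np

def gated_delta_scan(q, k, v, alpha, beta):
    """Sequential gated delta rule over T steps:
       S_t = alpha_t * S_{t-1} @ (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
       o_t = S_t @ q_t
    A fused kernel computes the same recurrence in chunked parallel form."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))
    I = np.eye(d_k)
    out = np.zeros((T, d_v))
    for t in range(T):
        S = alpha[t] * S @ (I - beta[t] * np.outer(k[t], k[t])) \
            + beta[t] * np.outer(v[t], k[t])
        out[t] = S @ q[t]
    return out

rng = np.random.default_rng(0)
T, d = 16, 8
k = rng.normal(size=(T, d))
k /= np.linalg.norm(k, axis=1, keepdims=True)  # unit keys keep the update contractive
o = gated_delta_scan(rng.normal(size=(T, d)), k, rng.normal(size=(T, d)),
                     rng.uniform(0.9, 1.0, T), rng.uniform(0.0, 0.5, T))
```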

<hr />

<h3 id="validation-trajectory">Validation Trajectory</h3>

<p>The A100 run also improved monotonically through every saved eval up to step <code class="language-plaintext highlighter-rouge">1800</code>.</p>

<table>
  <thead>
    <tr>
      <th>Step</th>
      <th style="text-align: right">EMA Val Loss</th>
      <th style="text-align: right">PPL</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>200</td>
      <td style="text-align: right">6.7878</td>
      <td style="text-align: right">886.99</td>
    </tr>
    <tr>
      <td>400</td>
      <td style="text-align: right">6.1530</td>
      <td style="text-align: right">470.13</td>
    </tr>
    <tr>
      <td>600</td>
      <td style="text-align: right">5.6474</td>
      <td style="text-align: right">283.54</td>
    </tr>
    <tr>
      <td>800</td>
      <td style="text-align: right">5.1973</td>
      <td style="text-align: right">180.78</td>
    </tr>
    <tr>
      <td>1000</td>
      <td style="text-align: right">4.8510</td>
      <td style="text-align: right">127.87</td>
    </tr>
    <tr>
      <td>1200</td>
      <td style="text-align: right">4.6169</td>
      <td style="text-align: right">101.18</td>
    </tr>
    <tr>
      <td>1400</td>
      <td style="text-align: right">4.4626</td>
      <td style="text-align: right">86.72</td>
    </tr>
    <tr>
      <td>1600</td>
      <td style="text-align: right">4.3585</td>
      <td style="text-align: right">78.14</td>
    </tr>
    <tr>
      <td>1800</td>
      <td style="text-align: right">4.2847</td>
      <td style="text-align: right">72.58</td>
    </tr>
  </tbody>
</table>

<p>This matters because my earlier M4 Mini run had already shown the model could get to “speech-like babble,” but it plateaued and got operationally ugly. The A100 run kept moving cleanly.</p>

<hr />

<h3 id="what-the-samples-sound-like">What The Samples Sound Like</h3>

<p>Qualitatively:</p>

<ul>
  <li>there is clearly a human voice present</li>
  <li>the outputs are not static</li>
  <li>they are not tonal droning failures</li>
  <li>they are still largely babble rather than intelligible text</li>
  <li>later checkpoints sound more stable than earlier ones, but sample quality variance is still real</li>
</ul>

<p>That is exactly the threshold I cared about for this pilot. I was not trying to get polished TTS from a <code class="language-plaintext highlighter-rouge">21.9M</code> unconditioned model; I wanted evidence that the hybrid architecture could learn speech codec structure at all.</p>

<p>It did.</p>

<p>Local sample bundles were saved for:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">step 1200</code></li>
  <li><code class="language-plaintext highlighter-rouge">step 1400</code></li>
  <li><code class="language-plaintext highlighter-rouge">step 1600</code></li>
  <li><code class="language-plaintext highlighter-rouge">step 1800</code></li>
</ul>

<p>and the best saved checkpoint by the end of the session was <code class="language-plaintext highlighter-rouge">step 1800</code>.</p>

<hr />

<h3 id="what-failed">What Failed</h3>

<p>The systems story was much worse than the learning story.</p>

<p>Main problems:</p>

<ul>
  <li>the fused FLA Gated DeltaNet kernel did not train cleanly on the pod software stack</li>
  <li>cheap pod infrastructure was volatile and restarted at bad times</li>
  <li>pod-local storage could not be trusted</li>
  <li>the run likely died right as <code class="language-plaintext highlighter-rouge">step 2000</code> was being saved</li>
</ul>

<p>That said, I did manage to pull the important checkpoints locally before the pod disappeared:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">1200</code></li>
  <li><code class="language-plaintext highlighter-rouge">1400</code></li>
  <li><code class="language-plaintext highlighter-rouge">1600</code></li>
  <li><code class="language-plaintext highlighter-rouge">1800</code></li>
  <li>latest <code class="language-plaintext highlighter-rouge">best.pt</code></li>
</ul>

<p>So the final result survived even though the pod did not.</p>

<hr />

<h3 id="significance">Significance?</h3>

<p>I believe the primary lesson is:</p>

<blockquote>
  <p>a tiny OLMo-hybrid style recurrent-attention LM can learn enough speech codec structure on a clean single-speaker corpus to produce clear voice-like audio surprisingly early in training.</p>
</blockquote>

<p>Even without:</p>

<ul>
  <li>a fused recurrent kernel</li>
  <li>a long training budget</li>
  <li>text conditioning</li>
  <li>semantic token hierarchies</li>
  <li>a polished speech decoder stack</li>
</ul>

<p>So my next step is fairly obvious: <strong>text conditioning</strong>. Once the model is already making convincing voice-like audio, the high-value problem shifts from unconditioned sampling to actual steering.</p>

<hr />

<h3 id="next-steps">Next Steps</h3>

<ul>
  <li>use the <code class="language-plaintext highlighter-rouge">1800</code> checkpoint as the main saved pilot result</li>
  <li>compare <code class="language-plaintext highlighter-rouge">1400 / 1600 / 1800</code> sample quality side-by-side</li>
  <li>add a matched transformer baseline later on real GPU access</li>
  <li>move to text-conditioned speech generation next</li>
  <li>revisit the fused GDN kernel on a more stable CUDA/Triton stack</li>
</ul>

<p>This run was messy to execute, but the core result was clean: the model learned real speech structure.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Ok, so I might have had a light W with this little speech experiment. Yes, Olmo type recurrent attention hybrids are pretty good.]]></summary></entry><entry><title type="html">OLMo Hybrid for Autoregressive Music Token Modeling: A 300M Single-A100 Baseline</title><link href="https://obiohagwu.github.io/obiohagwu/2026/03/06/OLMo-Hybrid-for-Autoregressive-Music-Token-Modeling-A-300M-Single-A100-Baseline.html" rel="alternate" type="text/html" title="OLMo Hybrid for Autoregressive Music Token Modeling: A 300M Single-A100 Baseline" /><published>2026-03-06T00:00:00+00:00</published><updated>2026-03-06T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/03/06/OLMo%20Hybrid%20for%20Autoregressive%20Music%20Token%20Modeling:%20A%20300M%20Single-A100%20Baseline</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/03/06/OLMo-Hybrid-for-Autoregressive-Music-Token-Modeling-A-300M-Single-A100-Baseline.html"><![CDATA[<p>Something really cool happened yesterday.</p>

<p>It’s really convenient that while I was working on alternate subquadratic architectures for music generation, OLMo hybrid just got dropped.</p>

<p>The funny, weird, cool thing is that we seem to have had similar (maybe obvious?) intuitions about the required architectural tweaks, haha, although they are much more competent and serious about their research and implementation.</p>

<p>But yeah, their OLMo 7B hybrid model is not necessarily for music, but the subquadratic scaling of the architecture it uses is really quite promising as a decoder for my DAC tokens.</p>

<p>I’m really excited to break ground on this.</p>

<p>Here is a more thorough view of my intentions:</p>

<p>Long-context music generation lives in an awkward middle ground for sequence models.</p>

<p>On one end, full attention is dependable and expressive, but expensive. On the other, purely recurrent models are efficient, but they can blur local detail or struggle to preserve the exact interactions that matter for music. For autoregressive music token modeling, that tradeoff matters a lot: music depends on both fine local structure and longer-range organization.</p>

<p>This post describes a new baseline I added to my existing music language modeling codebase: an <strong>OLMo-hybrid-style decoder for DAC-token music generation</strong>. The model uses a repeating <strong>3:1 layer schedule</strong> (three DeltaNet-style recurrent blocks followed by one full attention block) and combines that with <strong>RoPE</strong>, <strong>per-head QK RMS normalization</strong>, <strong>SwiGLU feed-forwards</strong>, <strong>grouped-query attention</strong>, and <strong>PyTorch scaled dot-product attention</strong> so it can take advantage of FlashAttention on A100 hardware.</p>

<p>The instantiated model lands at <strong>294.3M parameters</strong> and is designed as a <strong>single-A100 baseline</strong> for unconditional music modeling on <strong>44.1 kHz DAC tokens</strong>.</p>

<p>This is not a results post, lol. I’m posting this more as a methodology note: the goal is to define the architecture, training configuration, and evaluation plan for a new baseline in the repository.</p>

<h2 id="why-this-architecture-may-be-really-useful-for-music">Why this architecture may be really useful for music</h2>

<p>Autoregressive music token modeling needs to preserve local acoustic structure while also carrying information across long spans.</p>

<p>With DAC tokens, a 24-second clip turns into roughly <strong>2,000 autoregressive steps</strong> once you apply a MusicGen-style delay pattern over <strong>9 RVQ codebooks</strong>. That is long enough for quadratic attention to become costly, but not so long that abandoning exact attention entirely feels justified (PAIN!).</p>
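<p>As a quick sanity check on that step count (pure arithmetic, using the frame rate and codebook count quoted in this post):</p>

```python
# Back-of-envelope check on the "~2,000 autoregressive steps" figure,
# using the numbers from this post: ~86 DAC frames/s, 9 RVQ codebooks,
# and a MusicGen-style delay pattern (codebook k shifted by k steps).
frame_rate = 86
n_codebooks = 9
clip_seconds = 24

frames = clip_seconds * frame_rate        # base DAC frames
delayed = frames + (n_codebooks - 1)      # delay adds k_max = 8 extra steps
print(frames, delayed)  # 2064 2072, right around the 2048-step training cap
```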

<p>That makes hybrid architectures a lot more appealing from an efficiency point of view as we’ve already covered in prior experiments.</p>

<p>Recent OLMo Hybrid models suggest a useful compromise: let recurrent sequence layers do most of the work, but insert periodic full-attention layers so the model can refresh exact token-to-token interactions. For music, that seems especially natural. Musical sequences contain local events—chord color, rhythmic edges, timbral transitions—but also slower-moving structure such as motif reuse, phrasing, and section-level form.</p>

<p>The goal here is simple: define a practical <strong>300M-class hybrid baseline</strong> that can be trained on a single A100 and compared against the transformer, Mamba, hybrid, and memory-caching baselines already present in the codebase.</p>

<h2 id="tokenization">Tokenization</h2>

<p>Audio is represented using <strong>Descript Audio Codec (DAC)</strong> at <strong>44.1 kHz</strong>.</p>

<p>The codec emits <strong>9 RVQ codebooks</strong> at roughly <strong>86 frames per second</strong>, with a <strong>codebook size of 1024</strong>. As in MusicGen, the model uses a <strong>delay pattern</strong> that offsets codebook <em>k</em> by <em>k</em> steps, allowing all codebooks to be predicted in parallel within each autoregressive step.</p>

<p>The final vocabulary size is <strong>1027</strong>, after adding <strong>pad</strong>, <strong>BOS</strong>, and <strong>EOS</strong> tokens. During training, sequences are truncated to a maximum length of <strong>2048 timesteps</strong>.</p>
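<p>For concreteness, here is a minimal sketch of what a MusicGen-style delay pattern does to a codes tensor. The <code>PAD</code> id and function shape are placeholders, not the repository’s actual implementation:</p>

```python
import torch

# Sketch of a MusicGen-style delay pattern over RVQ codebooks: codebook k
# is shifted right by k steps so all codebooks can be predicted in
# parallel at each autoregressive step. PAD is a hypothetical pad id.
PAD = 1024  # placeholder pad token id

def apply_delay(codes: torch.Tensor) -> torch.Tensor:
    """codes: (n_codebooks, T) -> (n_codebooks, T + n_codebooks - 1)."""
    K, T = codes.shape
    out = torch.full((K, T + K - 1), PAD, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]  # shift codebook k right by k steps
    return out

codes = torch.arange(9 * 4).reshape(9, 4)  # toy (9 codebooks, 4 frames)
delayed = apply_delay(codes)
print(delayed.shape)  # torch.Size([9, 12])
```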

<h2 id="the-model">The model</h2>

<p>The architecture added to the repository is called <code class="language-plaintext highlighter-rouge">olmo_hybrid</code>.</p>

<p>At a high level, it is an OLMo-hybrid-style decoder with the following design:</p>

<ul>
  <li>a repeating <strong>3:1 block schedule</strong>, with three recurrent DeltaNet-style blocks followed by one full attention block</li>
  <li><strong>RoPE</strong> in attention layers</li>
  <li><strong>per-head RMS normalization</strong> on queries and keys before attention</li>
  <li><strong>SwiGLU</strong> feed-forward layers</li>
  <li><strong>grouped-query attention (GQA)</strong> with separate query-head and KV-head counts</li>
  <li><strong>PyTorch scaled dot-product attention</strong>, enabling the FlashAttention fast path on A100s when masks are full-length</li>
  <li><strong>no learned positional embeddings</strong> on this path</li>
</ul>

<p>Each DeltaNet-style recurrent block uses a standard pre-norm residual structure:</p>

<ol>
  <li>RMS normalization on the input</li>
  <li>a recurrent mixer with learned forget and update gates</li>
  <li>a residual connection</li>
  <li>a second RMS normalization</li>
  <li>a SwiGLU feed-forward layer</li>
  <li>a second residual connection</li>
</ol>

<p>The attention blocks mirror that same structure, replacing the recurrent mixer with RoPE-based attention.</p>
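<p>The 3:1 schedule reduces to a simple layout rule. The exact layer indexing here is my assumption, not necessarily the repo’s:</p>

```python
# Sketch of the repeating 3:1 block schedule: with an attention period
# of 4, every 4th block is full attention and the rest are
# DeltaNet-style recurrent blocks.
def block_types(n_layers: int, attention_period: int) -> list[str]:
    return [
        "attention" if (i + 1) % attention_period == 0 else "recurrent"
        for i in range(n_layers)
    ]

schedule = block_types(n_layers=22, attention_period=4)
print(schedule.count("recurrent"), schedule.count("attention"))  # 17 5
```

With 22 layers and period 4, this reproduces the 17 recurrent / 5 attention split of the 300M preset.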

<h2 id="current-implementation-status">Current implementation status</h2>

<p>One important detail: the attention path is aligned with the public OLMo recipe in terms of architecture and kernel choice, but the recurrent path is <strong>not</strong> a verbatim import of fused OLMo-core kernels, at least for my use case at this point.</p>

<p>Instead, it uses a <strong>native PyTorch DeltaNet-style approximation</strong> implemented inside the local codebase.</p>

<p>That is on purpose. The immediate goal is to test the architecture at music scale first, before investing in lower-level kernel work.</p>

<h2 id="the-300m-baseline-configuration">The 300M baseline configuration</h2>

<p>The repository should now have a preset called <code class="language-plaintext highlighter-rouge">music_olmo_hybrid_300m_a100</code>.</p>

<p>Here is the model configuration:</p>

<ul>
  <li><strong>Model dimension:</strong> 1024</li>
  <li><strong>Layers:</strong> 22</li>
  <li><strong>Attention heads:</strong> 16</li>
  <li><strong>KV heads:</strong> 4</li>
  <li><strong>Feed-forward width:</strong> 2816</li>
  <li><strong>Dropout:</strong> 0.1</li>
  <li><strong>Max sequence length:</strong> 2048</li>
  <li><strong>Attention period:</strong> 4</li>
  <li><strong>Total parameters:</strong> 294,263,072</li>
</ul>

<p>With an attention period of 4, the 22-layer model contains <strong>17 recurrent blocks</strong> and <strong>5 full-attention blocks</strong>. The grouped-query setup reduces KV projection cost while preserving full query resolution.</p>

<h2 id="training-plan">Training plan</h2>

<p>The default preset is designed for <strong>mixed-precision training on a single A100 GPU</strong>.</p>

<p>Initial training configuration:</p>

<ul>
  <li><strong>Dataset:</strong> FMA-Large DAC tokens</li>
  <li><strong>Optimizer:</strong> AdamW</li>
  <li><strong>Learning rate:</strong> 2e-4</li>
  <li><strong>Warmup:</strong> 2000 steps</li>
  <li><strong>Max steps:</strong> 200,000</li>
  <li><strong>Per-device batch size:</strong> 4</li>
  <li><strong>Gradient accumulation:</strong> 8</li>
  <li><strong>Effective batch size:</strong> 32 sequences</li>
  <li><strong>Precision:</strong> bfloat16 autocast on CUDA</li>
  <li><strong>Evaluation interval:</strong> 2000 steps</li>
  <li><strong>Checkpoint interval:</strong> 5000 steps</li>
</ul>

<p>For a <strong>40GB A100</strong>, the repository should also now expose <code class="language-plaintext highlighter-rouge">--grad_accum_steps</code>, which makes it easier to shift toward smaller microbatches without changing the effective batch size too aggressively.</p>

<h3 id="why-this-is-a-useful-baseline-and-the-actual-crux">Why this is a useful baseline and the ACTUAL crux</h3>

<p>This model is meant to answer a fairly narrow but important question:</p>

<blockquote>
  <p>Can an OLMo-hybrid-style decoder preserve more musical structure than the current transformer and Mamba-family baselines at roughly the same parameter scale?</p>
</blockquote>

<p>I think this is a worthwhile music baseline for three reasons.</p>

<p>First, it avoids a false choice between exact attention and recurrent state. Music benefits from both.</p>

<p>Second, it is practical. A <strong>294M model that trains on one A100</strong> is much easier to iterate on than a multi-billion-parameter reproduction of the full public OLMo stack.</p>

<p>Third, it is easy to ablate. The <strong>3:1 schedule</strong>, <strong>RoPE setup</strong>, <strong>GQA configuration</strong>, and <strong>feed-forward width</strong> can all be varied cleanly in follow-up experiments.</p>

<h3 id="evaluation-plan">Evaluation plan</h3>

<p>The current evaluation setup is below.</p>

<p>Primary metrics include:</p>

<ul>
  <li>validation loss and perplexity</li>
  <li>codebook-wise token accuracy</li>
  <li>long-range coherence across sliding windows</li>
  <li>Fréchet Audio Distance on decoded generations… maybe?</li>
  <li>listening-based inspection of rhythmic stability, harmonic continuity, and phrase development</li>
</ul>

<p>The comparison set should include:</p>

<ul>
  <li>the existing transformer baseline</li>
  <li>pure Mamba baselines</li>
  <li>earlier hybrid models already in the repo</li>
  <li>memory-caching variants where relevant</li>
</ul>

<h3 id="limitations">Limitations</h3>

<p>This is more of an implementation and experiment-design note than a completed empirical claim. No training results are reported here.</p>

<p>There are also two technical limitations worth being explicit about.</p>

<p>First, the recurrent mixer is a <strong>native PyTorch approximation</strong>, not the fused <strong>OLMo-core Gated DeltaNet</strong> implementation (I will definitely use this in subsequent runs).</p>

<p>Second, FlashAttention acceleration only applies to the <strong>attention blocks</strong>. The recurrent blocks remain sequential by design.</p>

<p>If this baseline performs well, the natural next step is to move the recurrent side closer to the public runtime stack.</p>

<h3 id="thoughts">Thoughts?</h3>

<p>The repository now contains an OLMo-hybrid-style music model that is large enough to be meaningful, but still small enough to train on a single A100. That makes it a strong next baseline.</p>

<p>The contribution here is seemingly straightforward: we have a concrete architecture, a decent preset, and a clear evaluation plan. Whether it actually beats the existing baselines is now an experimental question rather than an implementation gap.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Something really cool happened yesterday.]]></summary></entry><entry><title type="html">Janky memory cache over mamba hybrid for music?</title><link href="https://obiohagwu.github.io/obiohagwu/2026/03/04/Janky-memory-cache-over-mamba-hybrid-for-music.html" rel="alternate" type="text/html" title="Janky memory cache over mamba hybrid for music?" /><published>2026-03-04T00:00:00+00:00</published><updated>2026-03-04T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/03/04/Janky%20memory%20cache%20over%20mamba%20hybrid%20for%20music</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/03/04/Janky-memory-cache-over-mamba-hybrid-for-music.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']],
    displayMath: [['$$', '$$'], ['\\[', '\\]']]
  }
};
</script>

<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" async=""></script>

<style>
html, body {
  overflow-x: hidden;
  max-width: 100%;
}
table {
  display: block;
  overflow-x: auto;
  white-space: nowrap;
  max-width: 100%;
  font-size: 0.9em;
}
.post-content {
  overflow-x: hidden;
}
</style>

<p>So I’ve been attending house-type dance events more frequently. Not to make this too romantic, but I really enjoy music, and dancing, and like, house really is the perfect manifestation of that haha.</p>

<p>Anyway, I decided to try a janky re-implementation of this new memory-caching fix over RNN or linear-attention-type operations.</p>

<p><strong>Now to be more specific:</strong>
State space models are a cool, more efficient class of sequence modelers. They are pretty good at global coherence, and unlike self-attention with its quadratic burden, state space methods like Mamba are subquadratic: linear in the input sequence length.
Mamba processes sequences by maintaining a sort of compressed hidden, or for analogy’s sake, latent state. It holds a fixed-size summary of everything it’s seen. You can probably already see why this is an efficiency boost, but it might degrade over large enough corpora, or at least at some point be unable to easily decode local detail.
More specifically, that state gets updated with every new token, and old information gets gradually overwritten. For a 30-second music clip tokenized at 86 frames/second across 9 codebooks, that’s over 23,000 tokens. A lot of information to compress into one fixed-size state vector.</p>
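<p>A toy picture of that compression pressure, with an illustrative scalar decay standing in for Mamba’s input-dependent, per-channel dynamics (the gate values here are made up):</p>

```python
import torch

# Toy sketch of why a fixed-size recurrent state overwrites old info:
# ~23k tokens in a 30 s clip all funneled through one small state.
# The scalar forget/update gates are illustrative, not Mamba's values.
total_tokens = 30 * 86 * 9   # seconds * frames/s * codebooks = 23,220
d_state = 16
h = torch.zeros(d_state)
a, b = 0.95, 0.05            # hypothetical forget/update gates

for x in torch.randn(total_tokens, d_state):
    h = a * h + b * x        # old information decays at every step

# token 0's contribution is scaled by a**(total_tokens-1), i.e. ~0
print(total_tokens, 0.95 ** (total_tokens - 1) < 1e-300)  # 23220 True
```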

<p>So, imagine a scenario where the model could sort of <em>remember</em> what happened earlier? Not in the vague, compressed-state-vector sense. Actually remember it. Store specific checkpoints from earlier in the sequence and pull them back when they’re relevant?</p>

<p>Yeah, see below…</p>

<hr />

<h2 id="the-memory-caching-paper">The Memory Caching Paper</h2>

<p>Some days ago actually, Behrouz et al. published “Memory Caching: RNNs with Growing Memory” (arXiv:2602.24281). The idea is pretty sweet:</p>

<ol>
  <li><strong>Segment the sequence.</strong> Divide the input into fixed-size segments of S tokens.</li>
  <li><strong>Cache boundary states.</strong> At the end of each segment, save the model’s hidden state.</li>
  <li><strong>Retrieve when needed.</strong> For each new token, let the model query the cache – “which of my past states is most relevant right now?” – and inject that information via a learned gating mechanism.</li>
</ol>

<p>The retrieval uses what they call <strong>Gated Residual Memory (GRM)</strong>. For each position, the model computes a softmax distribution over all cached segments plus the current segment. The gates decide: how much should I trust my current computation vs. what I cached from 5 segments ago?</p>

<p>It’s a way of giving RNNs an explicit, growing memory bank without abandoning the efficiency of recurrence. Think of it as Mamba’s usual compressed state, supplemented by like a series of snapshots from earlier in the sequence.</p>
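<p>Here is a minimal sketch of that retrieval-plus-gating step as I read it. The shapes, the exact mixing form, and names like <code>W_u</code> and <code>gate_bias</code> are my guesses at the mechanism, not the paper’s code:</p>

```python
import torch

# Minimal GRM-style sketch: a softmax over cached segment summaries
# picks what to retrieve, and a sigmoid gate decides how much to mix
# it into the current computation. All names/shapes are assumptions.
def grm_mix(h_t, summaries, W_u, gate_bias):
    """h_t: (d,), summaries: (n_segments, d) -> mixed (d,), weights."""
    scores = summaries @ (W_u @ h_t)        # relevance of each cached segment
    weights = torch.softmax(scores, dim=0)  # distribution over the cache
    retrieved = weights @ summaries         # weighted memory readout
    g = torch.sigmoid(gate_bias)            # trust current vs. cached
    return (1 - g) * h_t + g * retrieved, weights

d = 8
out, w = grm_mix(torch.randn(d), torch.randn(5, d),
                 torch.eye(d), torch.tensor(-2.0))
print(out.shape, round(float(w.sum()), 3))  # torch.Size([8]) 1.0
```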

<p>And it seems they only tested it on linear attention and Titans and haven’t tried state space models. I mean, it is a pretty recent paper though lol.</p>

<p>I did want to know if it could work for music generation though. A bit of a redundant side project, but I was just bored and have extra RunPod creds.</p>

<hr />

<h2 id="mambas-hidden-state-seems-difficult-to-penetrate">Mamba’s Hidden state seems difficult to penetrate</h2>

<p>Here’s the thing about Mamba that makes this non-trivial.</p>

<p>The Memory Caching paper assumes you can grab the model’s hidden state at any point during the forward pass. For linear attention that’s fairly straightforward. The recurrent state is an explicit matrix $S_t \in \mathbb{R}^{d_k \times d_v}$ that you compute and can cache directly.</p>

<p>Mamba doesn’t work that way. Its fused CUDA kernel runs the entire selective scan inside GPU registers. The hidden state (a (d_inner, d_state) tensor per layer) is never materialized in accessible memory during training. You can’t just reach in and grab it.</p>

<p>So I decided not to rewrite the kernel and to just cache a proxy of the hidden state (I know, I know: janky, sloppy… haha, but this really is purely experimental).</p>

<hr />

<h3 id="the-implementations-and-the-primary">The implementations and the primary comparison</h3>

<p>I built two models to create a controlled comparison:</p>

<h3 id="mc-linear-attention-the-faithful-version">MC-Linear-Attention (the “faithful” version)</h3>

<p>Linear attention <em>does</em> expose its state matrices. So I built a version of MC that caches the actual recurrent state S_t at segment boundaries, matching the paper as closely as I could. When the model wants to retrieve cached context, it does <code class="language-plaintext highlighter-rouge">phi(q_t) @ S_i</code>, querying the cached state matrix with the current token’s feature-mapped key. This is about as close to the paper’s formulation as you can get.</p>
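<p>A small sketch of that state caching and retrieval. The ELU+1 feature map for <code>phi</code> and the tiny shapes are assumptions for illustration:</p>

```python
import torch

# Sketch of the "faithful" MC-LA path: the linear-attention state S_t is
# an explicit (d_k, d_v) matrix, so it can be checkpointed at segment
# boundaries and queried later with phi(q_t) @ S_i.
phi = lambda x: torch.nn.functional.elu(x) + 1  # positive feature map

d_k, d_v, seg = 8, 8, 32
S = torch.zeros(d_k, d_v)
cache = []
keys, vals = torch.randn(96, d_k), torch.randn(96, d_v)
for t in range(96):
    S = S + torch.outer(phi(keys[t]), vals[t])  # linear-attention update
    if (t + 1) % seg == 0:
        cache.append(S.clone())                 # cache at segment boundary

q_t = torch.randn(d_k)
readouts = [phi(q_t) @ S_i for S_i in cache]    # query each cached state
print(len(cache), readouts[0].shape)  # 3 torch.Size([8])
```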

<h3 id="mc-mamba-the-proxy-version">MC-Mamba (the “proxy” version)</h3>

<p>For Mamba, I cache the <strong>output activations</strong> at segment boundaries instead of hidden states. After each Mamba block processes the full sequence (preserving the fast CUDA kernel), I extract the output vector at positions S-1, 2S-1, 3S-1, etc. These d_model-dimensional vectors serve as compressed summaries of what the model computed at each segment boundary.</p>
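<p>A sketch of that boundary extraction (names and sizes are illustrative, not the repo’s):</p>

```python
import torch

# Sketch of the MC-Mamba proxy: instead of hidden states (locked inside
# the fused scan kernel), cache the block's *output activations* at
# segment boundaries, i.e. positions S-1, 2S-1, 3S-1, ... of the output.
def boundary_summaries(block_out: torch.Tensor, segment: int) -> torch.Tensor:
    """block_out: (T, d_model) -> (n_segments, d_model)."""
    T = block_out.shape[0]
    idx = torch.arange(segment - 1, T, segment)  # S-1, 2S-1, 3S-1, ...
    return block_out[idx]

out = torch.randn(1024, 512)          # toy Mamba-block output, (T, d_model)
summ = boundary_summaries(out, segment=128)
print(summ.shape)  # torch.Size([8, 512])
```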

<p>The GRM gating mechanism is identical in both: softmax over segment means from the input space, with a single learnable W_u projection per layer. The only difference is <em>what</em> gets cached and <em>how</em> it gets retrieved.</p>

<p>If MC works on linear attention but fails on Mamba, the proxy approach is the bottleneck, not the mechanism itself. That’s the controlled comparison.</p>

<hr />

<h3 id="train-setup">Train Setup</h3>

<p>Both models were trained on music tokenized with <strong>DAC</strong> (Descript Audio Codec) at 44.1kHz – 9 codebooks at ~86 tokens/second. The dataset is <strong>FMA-Large</strong>, a collection of 106,574 freely licensed tracks spanning a wide range of genres. Training ran on A100 80GB GPUs rented through RunPod.</p>

<p>The task is unconditional music generation (again, seriously GPU constrained even though I’m doing this): predict the next audio token across all 9 codebooks, autoregressively.</p>

<hr />

<h3 id="experiment-1-the-48m-model-proof-of-life">Experiment 1: The 48M Model (Proof of Life)</h3>

<p>Before going big, I ran a small-scale test to see if the MC mechanism could learn at all.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>MC-LA 48M</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Architecture</td>
      <td>MC-Linear-Attention</td>
    </tr>
    <tr>
      <td>Parameters</td>
      <td>48.6M (2.5% MC overhead)</td>
    </tr>
    <tr>
      <td>Dataset</td>
      <td>25k tracks (FMA subset)</td>
    </tr>
    <tr>
      <td>Segment size</td>
      <td>256 tokens</td>
    </tr>
    <tr>
      <td>Peak LR</td>
      <td>1e-4</td>
    </tr>
    <tr>
      <td>Batch size</td>
      <td>32</td>
    </tr>
    <tr>
      <td>Steps</td>
      <td>~24,700</td>
    </tr>
  </tbody>
</table>

<p>This was an early version of the code. Not optimized, some rough edges. I’m telling you this upfront because what happened next matters more than the absolute numbers.</p>

<h3 id="it-learned-haha-wtf">It learned, haha wtf.</h3>

<p>The val loss dropped steadily across the entire run:</p>

<table>
  <thead>
    <tr>
      <th>Step</th>
      <th>Val Loss</th>
      <th>Perplexity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2,000</td>
      <td>6.854</td>
      <td>947.5</td>
    </tr>
    <tr>
      <td>4,000</td>
      <td>6.746</td>
      <td>850.6</td>
    </tr>
    <tr>
      <td>8,000</td>
      <td>6.363</td>
      <td>579.8</td>
    </tr>
    <tr>
      <td>12,000</td>
      <td>6.139</td>
      <td>463.6</td>
    </tr>
    <tr>
      <td>16,000</td>
      <td>6.046</td>
      <td>422.3</td>
    </tr>
    <tr>
      <td>20,000</td>
      <td>5.997</td>
      <td>402.1</td>
    </tr>
    <tr>
      <td>24,000</td>
      <td>5.969</td>
      <td>391.3</td>
    </tr>
  </tbody>
</table>

<p>No instability. No divergence. Just a clean, steady decline over 24,000 steps.</p>

<p>But the really interesting signal was the <strong>GRM entropy</strong>.</p>

<p>GRM entropy measures how spread out the attention weights are across cached segments. High entropy means the model is attending uniformly and looking at everything equally, which is basically the same as looking at nothing in particular. Low entropy means the model is being <em>selective</em> and it has learned to focus on specific cached segments when they’re relevant.</p>

<p>The 48M model started with GRM entropy around 1.26 (roughly uniform over cached segments). By step 24,000 it had dropped to <strong>0.22</strong>.</p>
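<p>For a sense of scale on those entropy numbers, here is the entropy of a uniform versus a peaked distribution over a handful of cached segments (toy distributions, not the model’s actual weights):</p>

```python
import torch

# Reading the GRM entropy numbers: entropy of the softmax weights over
# cached segments. Uniform over ~3-4 segments gives ~1.2-1.4 nats;
# a sharply peaked (selective) distribution gives entropy near 0.
def entropy(weights: torch.Tensor) -> float:
    return float(-(weights * torch.log(weights + 1e-12)).sum())

uniform = torch.full((4,), 0.25)                 # attending to everything
peaked = torch.tensor([0.94, 0.02, 0.02, 0.02])  # selective retrieval
print(round(entropy(uniform), 2), round(entropy(peaked), 2))  # 1.39 0.29
```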

<p>The gating mechanism was learning. The model wasn’t just blindly averaging cached states. It seemed to be developing preferences for specifically <em>which</em> cached segment to retrieve at each position. That seems to align with the behavior the MC paper predicts: learned, position-dependent retrieval over a growing memory bank.</p>

<p>Was the audio actually good though? To be honest, with 48M parameters on 25k tracks, “good” is a stretch. But there surprisingly was a <em>semblance of something</em>. Like rhythmic patterns that held, frequency content that wasn’t just noise. Enough to warrant scaling up.</p>

<hr />

<h3 id="experiment-2-scaling-to-95m-where-things-went-wrong">Experiment 2: Scaling to 95M (Where Things Went Wrong)</h3>

<p>Encouraged by the 48M results, I scaled up. Bigger model, bigger data, improved codebase.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>MC-LA 95M</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Architecture</td>
      <td>MC-Linear-Attention</td>
    </tr>
    <tr>
      <td>Parameters</td>
      <td>95.1M (3.1% MC overhead)</td>
    </tr>
    <tr>
      <td>Dataset</td>
      <td>106k tracks (full FMA-Large)</td>
    </tr>
    <tr>
      <td>Segment size</td>
      <td>128 tokens</td>
    </tr>
    <tr>
      <td>Peak LR</td>
      <td>3e-4</td>
    </tr>
    <tr>
      <td>Batch size</td>
      <td>64-72</td>
    </tr>
    <tr>
      <td>MC start layer</td>
      <td>10 (only top half gets MC)</td>
    </tr>
  </tbody>
</table>

<p>I made several changes from the 48M run: halved the segment size (128 vs 256), tripled the learning rate (3e-4 vs 1e-4), only applied MC to the top 10 layers (leaving layers 0-9 as plain linear attention), and switched to the full 106k-track dataset.</p>

<p>I ran it twice. Both times, the same thing happened.</p>

<h3 id="gradient-explosions">Gradient explosions</h3>

<p>The gradient norms tell the story:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>step   2900 | grad_norm 18.98
step   3100 | grad_norm  5.55
step   3500 | grad_norm  3.95
step   3800 | grad_norm  2.82
step   3900 | grad_norm  3.83
step   4500 | grad_norm  5.10
step   4600 | grad_norm  5.41
step   5000 | grad_norm 13.76
step   5100 | grad_norm 15.64
step   5300 | grad_norm 14.42
step   5500 | grad_norm  8.41
</code></pre></div></div>

<p>Between the spikes, gradient norms sat around 0.15-0.20, which was pretty OK. But every few hundred steps, something in the MC pathway would fire and blow the norm up by two orders of magnitude. These spikes seem mechanistic rather than random, though they could just be a weird artifact of early training dynamics, I don’t know.</p>

<h3 id="the-gate-never-opened">The gate never opened</h3>

<p>The model logs a <code class="language-plaintext highlighter-rouge">gate</code> value representing the learned bias on the GRM gate. It initialized at -2.0, which corresponds to sigmoid(-2.0) ≈ 0.12: a very conservative starting point that barely lets cached context through.</p>

<p>Over 5,500 steps, the gate moved from -2.00 to -1.91.</p>

<p>That’s nothing. The gate was supposed to learn when to open and trust the cached memory. Instead, it stayed almost exactly where it started. The MC mechanism was architecturally present but functionally dormant.</p>

<h3 id="grm-entropy-stayed-high">GRM entropy stayed high</h3>

<p>Remember how the 48M model’s entropy dropped from 1.26 to 0.22? The 95M model’s GRM entropy sat at <strong>~2.4 the entire run</strong>. That’s near-uniform attention – the model never learned to selectively retrieve from its cache. It was looking at everything equally, which is the same as ignoring the cache entirely.</p>

<h3 id="the-loss-seems-fine-which-made-it-more-annoying">The loss seems fine (which made it more annoying)</h3>

<p>The <em>loss</em> was actually quite reasonable.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>step   2000 | EVAL: val_loss=6.6908
step   4000 | EVAL: val_loss=6.0978
</code></pre></div></div>

<p>The underlying linear attention was doing all the work while the cache sat there with its gate barely open, likely contributing little but gradient noise.</p>

<hr />

<h3 id="what-to-do">What to do?</h3>

<p>The 48M model does seem to work though.</p>

<p><strong>What I suspect:</strong></p>

<p>The core issue is a <strong>cold-start problem</strong>. The MC mechanism needs to learn useful retrieval patterns, but it starts nearly disabled (gate at 0.12) while the base model races ahead. By the time the gate bias might have wandered open, the base linear attention has already found a loss landscape that doesn’t need the cache. The MC gradients become noise: they push the gate around randomly without a clear signal, occasionally spiking hard enough to disrupt training.</p>

<p>The 48M model avoided this maybe because: (a) the lower learning rate gave the GRM more time to adapt gradually, (b) the smaller model capacity meant the base attention couldn’t “solve” the task on its own as easily, creating demand for the cached context, and (c) the larger segment size (256) meant fewer cache entries, making the softmax distribution easier to sharpen.</p>

<hr />

<h2 id="limitation-">Limitation :)</h2>

<p>To be clear, this was more of a learning-through-experimenting thing, and I’d definitely take it further if I had more GPU time.</p>

<p><strong>For 48M:</strong> It’s a smaller model, smaller dataset, lower learning rate, and an older version of the code. The declining GRM entropy is a genuinely positive signal, and by every metric the model learned selective retrieval. But I can’t rule out that the entropy drop is an artifact of the training dynamics rather than evidence that the cache is meaningfully improving generation quality (I’d need more compute and data to rule that out).</p>

<p><strong>I never ran a 48M model <em>without</em> MC for comparison to get a decent baseline:</strong> The 48M result stands alone. I can’t tell you how much of its final loss is attributable to MC vs. how much the base linear attention would have achieved on its own. That’s a gap I need to fill.</p>

<p><strong>The 95M runs were cut short.</strong> 5,500 steps isn’t always enough to declare failure, as I said earlier. It’s definitely possible the grad spikes would have settled and the gate would have eventually opened. But the trajectory wasn’t promising, and I didn’t want to burn GPU hours watching it not work.</p>

<p><strong>Three variables changed</strong> between the 48M and 95M runs: model size, dataset size, and learning rate. Any one of these could be the culprit. Proper experimental design would change one at a time. I didn’t, because GPU hours cost money, and a bit of impatience.</p>

<hr />

<h2 id="learnings">Learnings?</h2>

<p><strong>1. The mechanism works (when it works)</strong></p>

<p>The 48M experiment is proof that GRM-based memory caching <em>can</em> learn selective retrieval over cached segment states in a music generation context. The entropy dropping from 1.26 to 0.22 is not nothing: the model seemingly learns to look at specific past segments when generating new tokens. That’s the whole promise of MC.</p>

<p><strong>2. Initialization and learning rate matter more than I expected</strong></p>

<p>The gate bias at -2.0 combined with LR 3e-4 seems to be a bad combination. The MC paper doesn’t discuss initialization sensitivity because they’re working with linear attention models where the state matrices are already part of the standard forward pass. When you’re bolting MC onto an architecture as an <em>addition</em>, the optimization dynamics change. The MC pathway needs to be initialized assertively enough to receive gradient signal, but not so aggressively that it destabilizes the base model.</p>

<p><strong>3. The “proxy vs. faithful” comparison is still pending</strong></p>

<p>I haven’t actually trained MC-Mamba (the output-activation proxy version) yet. The experiments above are all MC-Linear-Attention. The original question I had, whether you can cache Mamba’s <em>output activations</em> and get useful retrieval, remains a bit open. The 48M result suggests the GRM mechanism itself works for music, so the next step is testing whether the proxy representation (boundary output vectors) is informative enough to retrieve from.</p>

<p><strong>4. Scale-up isn’t free</strong></p>

<p>Also learned the hard way that a mechanism that works at small scale can fail at large scale for reasons that have nothing to do with the mechanism’s fundamental validity. Initialization, learning rate schedules, interactions with other training hyperparameters, and so on can all obviously change the outcome. The 48M model’s success didn’t predict the 95M model’s failure.</p>

<hr />

<h2 id="maybe-next-steps">Maybe next steps</h2>

<p>Some ideas I’ll work on later.</p>

<hr />

<h3 id="outlook">Outlook</h3>

<p>I started this project because I thought recurrent models needed better long-range memory for music generation, and MC seemed like an elegant way to provide it. The results so far are… humbling.</p>

<p>The 48M experiment proved the concept can work, like a pretty good concept-level proof, although I messed up by trying to scale prematurely. The gate staying shut, the entropy staying flat, the grad norms spiking: these are all symptoms of general issues in my model’s optimization landscape. So the whitepill is that it might not necessarily be anything wrong with my architecture “design”.</p>

<p>What I have right now is a mechanism that learned selective memory retrieval on a small model, and a larger model where that same mechanism refused to engage because of mostly dumb mistakes on my part.</p>

<hr />

<p><em>All code is available in the <a href="https://github.com/Obiohagwu/mc-la-mamba">mc-mamba repository</a>. Training logs for all runs are included in <code class="language-plaintext highlighter-rouge">pod_runs/</code>.</em></p>

<p><em>If you’ve dealt with similar cold-start problems when adding auxiliary mechanisms to neural networks, I’d genuinely love to hear how you solved it. Sometimes the difference between “doesn’t work” and “works beautifully” is one initialization trick.</em></p>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Transport Geometry as a Testable Hypothesis?</title><link href="https://obiohagwu.github.io/obiohagwu/2026/02/22/Transport-Geometry-as-a-Testable-Hypothesis.html" rel="alternate" type="text/html" title="Transport Geometry as a Testable Hypothesis?" /><published>2026-02-22T00:00:00+00:00</published><updated>2026-02-22T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/02/22/Transport%20Geometry%20as%20a%20Testable%20Hypothesis</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/02/22/Transport-Geometry-as-a-Testable-Hypothesis.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']],
    displayMath: [['$$', '$$'], ['\\[', '\\]']]
  }
};
</script>

<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" async=""></script>

<style>
html, body {
  overflow-x: hidden;
  max-width: 100%;
}
table {
  display: block;
  overflow-x: auto;
  white-space: nowrap;
  max-width: 100%;
  font-size: 0.9em;
}
.post-content {
  overflow-x: hidden;
}
</style>

<p><em>mongoobi, Feb 2026</em></p>

<hr />

<p>Continuation of <a href="/2026/02/09/your-sae-looks-solved-your-model-disagrees.html">Part I</a>.</p>

<p>To summarize it though, in mid-layer Pythia residual streams at fixed sparsity budget $k$, $R^2$ can go up with model scale while patched CE loss gets worse. I called it the proxy gap.</p>

<p>This post is about what happens when you train longer, what that rules out, and where the explanation might live. I found some more seemingly interesting stuff. But, unbeknownst to naive me, this turns out to be a fairly active research area. I’ll expand more below.</p>

<hr />

<h2 id="the-sign-mismatch-was-real-but-temporary">The sign mismatch was real but temporary</h2>

<p>The most salient part of the previous post was the apparent sign flip in certain regimes. More specifically: $R^2$ says the 410M SAE is better, CE says the 70M SAE is better. That was on a 10M-token training budget.</p>

<p>I extended to 50M and 100M tokens with seed repeats. Mid-layer, $k=8$:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">tokens</th>
      <th style="text-align: right">seeds</th>
      <th style="text-align: right">$\Delta_{CE}$</th>
      <th style="text-align: right">95% CI</th>
      <th style="text-align: right">$\Delta_{R^2}$</th>
      <th style="text-align: right">mismatch rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">10M</td>
      <td style="text-align: right">4</td>
      <td style="text-align: right">-0.099</td>
      <td style="text-align: right">[-0.117, -0.082]</td>
      <td style="text-align: right">+0.150</td>
      <td style="text-align: right">100%</td>
    </tr>
    <tr>
      <td style="text-align: right">50M</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">+0.018</td>
      <td style="text-align: right">—</td>
      <td style="text-align: right">+0.108</td>
      <td style="text-align: right">0%</td>
    </tr>
    <tr>
      <td style="text-align: right">100M</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">+0.009</td>
      <td style="text-align: right">[+0.004, +0.013]</td>
      <td style="text-align: right">+0.096</td>
      <td style="text-align: right">0%</td>
    </tr>
  </tbody>
</table>

<p>Sign mismatch: gone by 50M. The larger model just needed more tokens to converge. Fair criticism of Part I, and I’m reporting it.</p>

<p>Some anomalies persist, though. The magnitudes at 100M are still off by roughly 10x: $R^2$ says the 410M SAE is substantially better ($\Delta = 0.096$), while CE says they’re basically tied ($\Delta = 0.009$). At $k=16$ the ratio is about 3x. I may be missing something, but this looks like a real issue: a systematic, scale-dependent distortion in the metric everyone defaults to.</p>

<p>So, <strong>H0 (pure optimization artifact) explains the sign flip but not the magnitude gap.</strong> Which means something else may be going on. Worth investigating.</p>

<hr />

<h2 id="three-spaces-and-the-mismatch-between-them">Three spaces and the mismatch between them</h2>

<p>Ok, so here’s my running mental model right now. SAE reconstruction lives in three different spaces, and they don’t necessarily agree about what “close” means.</p>

<p><strong>Activation space.</strong> This is where SAEs optimize. The error is Euclidean: $\lVert h - \hat{h} \rVert^2$. The metric that normalizes this is $R^2$. This space is shaped by the covariance of activations, so high-variance directions dominate.</p>

<p><strong>Probability-sensitive space.</strong> This is the real meat: it’s what downstream computation actually seems to care about. The natural local metric here is the pullback of the output Fisher through the downstream Jacobian:</p>

\[G_L = J_L^\top F_{out} J_L\]

<p>where $F_{out} = \text{diag}(p) - pp^\top$ for the output distribution. A perturbation that’s small in activation space can be large in this space if it points along a direction the model is sensitive to.</p>
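<p>For concreteness, here’s a minimal numpy sketch of the pullback metric for a toy linear readout $z = Wh$, where the downstream Jacobian is just $W$. Everything here (the dimensions, the readout) is hypothetical; a real model needs autograd Jacobians at the hookpoint:</p>

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pullback_fisher(W, h):
    """G_L = J^T F_out J. For a linear readout z = W h, the Jacobian J is W."""
    p = softmax(W @ h)
    F_out = np.diag(p) - np.outer(p, p)   # Fisher of the softmax output
    return W.T @ F_out @ W

# Toy instance: 8-dim hidden state, 20-token vocabulary (both made up).
rng = np.random.default_rng(0)
d, v = 8, 20
W = rng.standard_normal((v, d)) / np.sqrt(d)
h = rng.standard_normal(d)
G = pullback_fisher(W, h)   # symmetric PSD; defines "probability-sensitive" length
```

<p>A perturbation $\delta$ with small $\lVert\delta\rVert$ can still have large $\delta^\top G_L \delta$ if it points along a sensitive direction.</p>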

<p><strong>Attention-plan space.</strong> This one is more speculative, and I won’t go deep on it given my lack of sufficient background at the moment, but interesting recent work frames attention weights as entropy-regularized transport plans (Litman, 2025). If you buy that framing, then patching activations changes the transport plan, and the cost of that change isn’t Euclidean either. This is a second-wave thing; I plan to dive a lot deeper into it soon.</p>

<p>In this framing, the proxy gap is a gap between space 1 and space 2. $R^2$ measures fidelity in activation space; CE measures fidelity in probability-sensitive space. When the geometry of those two spaces diverges (which it seems to do more at larger scales, especially in mid-layer low-PR regimes), the metrics disagree.</p>

<p>This isn’t a new idea in the abstract. “Reconstruction isn’t behavior” is something people say and maybe usually know intuitively. What I’m trying to do is make it <em>testable</em> and <em>localized</em>: where exactly does it happen, how bad is it, and can you build a cheap proxy that tracks the right space?</p>

<hr />

<h2 id="swd-a-first-pass-bridge-metric">SWD: a first-pass bridge metric</h2>

<p>Before trying to estimate the full pullback Fisher (expensive, probably unstable at this scale), there’s a much simpler object. Take the CE gradient at the hookpoint:</p>

\[g_L = \nabla_{h_L} \mathcal{L}_{CE}\]

<p>and define sensitivity-weighted distortion:</p>

\[\text{SWD}_L = \mathbb{E}\left[(g_L^\top \delta h_L)^2\right]\]

<p>where $\delta h_L = \hat{h}_L - h_L$ is the reconstruction error.</p>

<p>$R^2$ weights error by the covariance structure of activations. SWD weights error by the sensitivity structure of the loss. If those two weightings diverge, $R^2$ is lying to you, and SWD should catch it.</p>
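<p>A minimal sketch of the estimator on toy arrays (the real version averages over eval tokens). The point: SWD scores the same Euclidean error magnitude very differently depending on its alignment with $g_L$:</p>

```python
import numpy as np

def swd(g, dh):
    """Sensitivity-weighted distortion: E[(g^T dh)^2] over tokens.
    g, dh: (n_tokens, d) arrays of CE gradients and reconstruction errors."""
    return np.mean(np.einsum('td,td->t', g, dh) ** 2)

# Toy contrast (synthetic): the loss only cares about dimension 0.
g = np.tile([1.0, 0.0], (100, 1))
dh_aligned = np.tile([0.1, 0.0], (100, 1))   # same ||dh||, along g
dh_orthog  = np.tile([0.0, 0.1], (100, 1))   # same ||dh||, orthogonal to g
# Identical MSE, opposite SWD verdicts: 0.01 for aligned, 0.0 for orthogonal.
```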

<p>The concrete test: across a grid of (model, $k$, seed) conditions, does $1/\text{SWD}$ correlate with $CE_{rec}$ better than $R^2$? I’m building a proxy leaderboard for $R^2$, cosine sim, $1/\text{NMSE}$, $1/\text{SWD}$, $1/\lvert g^\top\delta\rvert$ — ranked by Pearson and Spearman with bootstrap CIs.</p>
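<p>The ranking logic itself is small; here’s a sketch with purely hypothetical numbers (the actual Exp-B run adds bootstrap CIs, omitted here):</p>

```python
import numpy as np

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    # Rank transform (no tie handling; fine for illustration), then Pearson.
    rank = lambda v: np.argsort(np.argsort(np.asarray(v))).astype(float)
    return pearson(rank(x), rank(y))

def leaderboard(ce_rec, proxies):
    """proxies: {name: array}; rank each proxy by Spearman with CE_rec."""
    rows = [(name, pearson(v, ce_rec), spearman(v, ce_rec))
            for name, v in proxies.items()]
    return sorted(rows, key=lambda r: -r[2])

# Hypothetical scores purely to show the interface:
ce = np.array([0.34, 0.24, 0.58, 0.53, 0.73, 0.70])
board = leaderboard(ce, {
    "r2":    np.array([0.81, 0.96, 0.86, 0.97, 0.89, 0.94]),
    "1/swd": np.array([2.1, 1.4, 3.8, 3.5, 5.2, 4.9]),
})
# board[0] is the proxy that best tracks CE_rec (here "1/swd" by construction).
```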

<p>This is running on my local Mac right now, quite slowly. I’ll append the results, with accompanying edits, when the run is done in about 28 hours.</p>

<hr />

<h2 id="the-hypothesis-stack-right-now">The hypothesis stack right now</h2>

<p>I want to be precise about what claims are on the table and what gates them.</p>

<p><strong>H0 (optimization-only).</strong> The cross-scale CE gap shrinks to zero as SAE training budget increases. <em>Status: partially supported.</em> Sign mismatch gone. Magnitude gap not gone.</p>

<p><strong>H1 (residual intrinsic component).</strong> After convergence, a nonzero magnitude gap remains. <em>Status: supported at 100M tokens, but I’d want higher token budgets and more seeds to be confident.</em></p>

<p><strong>H2 (geometry-aware proxies explain residual).</strong> SWD or similar sensitivity-weighted metrics predict CE better than $R^2$. <em>Status: running.</em></p>

<p><strong>H3 (task-relevant dimensional mismatch).</strong> MI-derived task-relevant dimensionality diverges from geometric PR in exactly the regimes where proxy failure is worst. <em>Status: planned, contingent on H2.</em></p>

<p>The rule I set before writing any code: don’t interpret H2/H3 until H0 vs H1 is adjudicated, lol. Phase 2 data adjudicates H0 vs H1 in favor of H1 (magnitude gap persists). So H2 is now live.</p>

<hr />

<h2 id="where-this-is-going">Where this is going</h2>

<p>The decision tree, stated plainly:</p>

<p>If SWD outpredicts $R^2$ for CE, the sensitivity-geometry story has teeth and I have a cheap diagnostic. Proceed.</p>

<p>If SWD doesn’t beat $R^2$, the magnitude gap is real but the explanation isn’t sensitivity geometry. Maybe it’s purely an SST normalization artifact? Maybe the gap lives in higher-order terms that a linear sensitivity proxy can’t capture, sigh. Either way, the empirical characterization still stands as a fair contribution, and I’ll probably branch to the anisotropy/deflated-PR story or to pullback Fisher approximations as a second wave.</p>

<p>Second-wave experiments, if the first wave works:</p>

<ul>
  <li><strong>Pullback Fisher approximation.</strong> Approximate $G_L$ via subsampled Jacobians and diagonal $F_{out}$, compute Fisher-Euclidean deviation, test whether it adds predictive value beyond SWD.</li>
  <li><strong>MI critic transfer.</strong> Run separable vs hybrid MI estimators on $(h_L, \text{logits})$, compare inferred task-relevant $k^*$ profiles with proxy gap magnitude.</li>
</ul>

<p>I’m not pre-committing to these. They’re on the shelf if the simple version works and I want to push the mechanism story deeper.</p>

<hr />

<h2 id="rate-distortion-geometry">Rate-distortion-geometry</h2>

<p>One framing I keep coming back to is treating an SAE as a lossy code. The sparsity budget $k$ determines the rate. What changes is which distortion measure you evaluate against.</p>

<p>The strange part: in Euclidean distortion space, the SAE looks increasingly good with scale (high $R^2$), while in sensitivity-weighted distortion the picture may be different. The rate-distortion curve <em>depends on the distortion geometry</em>, and if you’re measuring in the wrong geometry, your curve is wrong.</p>

<p>This connects, loosely, to bounded-observer information frameworks (Finzi et al., 2026, on epiplexity). The common thread: the same representation can contain structure that’s “there” in an information-theoretic sense but not extractable by a given tool class under a given budget. The proxy gap is a concrete instance: $R^2$ says the structure is there, CE says your SAE didn’t extract it in a way that matters.</p>

<p>To be clear, I’m not claiming a formal reduction. I’m using this more as a lens.</p>

<hr />

<h2 id="concurrent-work-worth-reading-if-youre-really-interested">Concurrent work worth reading if you’re really interested:</h2>

<p><a href="https://arxiv.org/abs/2405.12241">Braun et al. (2024)</a> trained SAEs to minimize KL divergence instead of reconstruction error (“end-to-end SAEs”) and showed a Pareto improvement: more CE explained with fewer features. Their argument is basically “reconstruction objectives learn dataset structure, not computational structure.”</p>

<p>Same core problem, different angle. They built a better training objective; I’m trying to build a better evaluation diagnostic. The e2e paper is evidence that the problem is real: if a separate group independently builds an entire training pipeline to work around it, it isn’t an artifact of my setup. But most people in the interpretability community are using existing SAEs (SAELens, Anthropic’s published dictionaries), not retraining from scratch. A cheap post-hoc metric that flags “your $R^2$ is misleading here” is useful even in that world.</p>

<hr />

<h2 id="some-more-limitations">Some more limitations?</h2>

<p>Everything from Part I still applies (two model sizes, one decoder class, not ablation-normalized, late-layer confound). New:</p>

<ul>
  <li>Seed counts are low. 3-4 at anchor conditions. Not enough for robust seed-level CIs.</li>
  <li>The 50M row has one seed per $k$. Single data point.</li>
  <li>Exp-B (SWD vs classical proxies) hasn’t finished. I’m reporting setup and rationale, not results.</li>
  <li>I haven’t computed deflated PR yet. The canyon story is suggestive but might not survive deflation.</li>
</ul>

<hr />

<h2 id="repro">Repro</h2>

<ul>
  <li>Phase 2 data: <code class="language-plaintext highlighter-rouge">info-geo/outputs/phase2_repeat_analysis.md</code></li>
  <li>Token sweep: <code class="language-plaintext highlighter-rouge">info-geo/outputs/proxy_gap_lowk_10m_100m_report.md</code></li>
  <li>Exp-B script: <code class="language-plaintext highlighter-rouge">info-geo/run_expb.py</code> (standalone, runs on MPS/CUDA/CPU)</li>
  <li>Paper outline: <code class="language-plaintext highlighter-rouge">info-geo/full_paper_outline.md</code></li>
</ul>

<hr />

<h2 id="references">References</h2>

<ol>
  <li>Litman, E. (2025). <em>Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport</em>. arXiv:2508.08369.</li>
  <li>Park, K., et al. (2026). <em>The Information Geometry of Softmax: Probing and Steering</em>. arXiv:2602.15293.</li>
  <li>Gulati, P., et al. (2026). <em>Mutual Information and Task-Relevant Latent Dimensionality</em>. arXiv:2602.08105.</li>
  <li>Braun et al. (2024). <em>Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning</em>. arXiv:2405.12241.</li>
  <li>Gao, L., et al. (2024). <em>Scaling and Evaluating Sparse Autoencoders</em>. arXiv:2406.04093.</li>
  <li>Bricken et al. (2023). <em>Towards Monosemanticity</em>. Anthropic.</li>
  <li>Finzi et al. (2026). <em>From Entropy to Epiplexity</em>. arXiv:2601.03220.</li>
</ol>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Your SAE looks solved. Your model disagrees</title><link href="https://obiohagwu.github.io/obiohagwu/2026/02/09/Your-SAE-looks-solved.-Your-model-disagrees.html" rel="alternate" type="text/html" title="Your SAE looks solved. Your model disagrees" /><published>2026-02-09T00:00:00+00:00</published><updated>2026-02-09T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/02/09/Your%20SAE%20looks%20solved.%20Your%20model%20disagrees</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/02/09/Your-SAE-looks-solved.-Your-model-disagrees.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']],
    displayMath: [['$$', '$$'], ['\\[', '\\]']]
  }
};
</script>

<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" async=""></script>

<style>
html, body {
  overflow-x: hidden;
  max-width: 100%;
}
table {
  display: block;
  overflow-x: auto;
  white-space: nowrap;
  max-width: 100%;
  font-size: 0.9em;
}
.post-content {
  overflow-x: hidden;
}
</style>

<hr />

<p>This is a research log with a small argument:</p>

<ol>
  <li>If you’re using SAEs as interpretability tools or safety monitors, <strong>reconstruction fidelity alone is not a reliable acceptance test</strong> — behavioral metrics like patched CE loss or ablation-normalized CE recovery should be primary.</li>
  <li>In one common regime (mid-layer residual stream), \(R^2\) can be <strong>systematically inflated across scale at fixed sparsity</strong>, due to a concrete mechanism (activation variance scaling), while behavioral preservation gets worse.</li>
</ol>

<p>This is not a scaling-law fit and not evidence for a hard interpretability ceiling.</p>

<p><strong>Update note (Feb 21):</strong></p>

<ul>
  <li>The CI-overlap statement is now explicitly scoped to the Fast3 mid-only sweep.</li>
  <li>I added a new 50M-token in-tree anchor run for 70M mid-layer \(k=8\).</li>
</ul>

<hr />

<h2 id="brief-definitions">Brief Definitions</h2>

<p><strong>Sparse autoencoder (SAE).</strong> A learned lossy codec for activations: encode dense activations into a sparse vector of “features,” then decode back to the activation space.</p>

<p><strong>TopK SAE.</strong> An SAE where each token activates exactly <code class="language-plaintext highlighter-rouge">k</code> nonzero features (hard sparsity). In this post, the decoder is linear.</p>

<p><strong>Hookpoint.</strong> Where I patch the activation: <code class="language-plaintext highlighter-rouge">blocks.{L}.hook_resid_post</code> (residual stream post-block).</p>

<p><strong>Reconstruction fidelity (\(R^2\)).</strong> Variance explained in activation space:</p>

\[R^2 := 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}\]

<p>with mean-centered totals:</p>

\[\mathrm{SSE} = \sum \lVert a - \hat{a} \rVert^2, \quad \mathrm{SST} = \sum \lVert a - \mathbb{E}[a] \rVert^2.\]

<p><strong>Patched loss score (\(CE_{rec}\)).</strong> For each eval batch, let \(L_{orig}\) be original next-token cross-entropy and \(L_{recon}\) be CE after replacing the hookpoint activation with the SAE reconstruction. I report:</p>

\[CE_{rec} := 1 - \frac{L_{recon} - L_{orig}}{L_{orig}} = 2 - \frac{L_{recon}}{L_{orig}}.\]

<p>Notes (to be specific):</p>

<ul>
  <li>This is <strong>not</strong> ablation-normalized. It is <strong>not</strong> the same as “CE loss recovered” in Bricken et al. (2023) or SAE Lens’s <code class="language-plaintext highlighter-rouge">ce_loss_score</code>, which normalize against a zero-ablation baseline. Do not compare magnitudes across papers without converting.</li>
  <li>Because this metric normalizes by each model’s own \(L_{orig}\), cross-model comparisons can be affected by baseline-loss differences. I mainly use it here as a within-depth behavior preservation score and for sign patterns across \(k\).</li>
  <li>\(CE_{rec}=1\) means perfect preservation (\(L_{recon}=L_{orig}\)).</li>
  <li>\(CE_{rec}=0\) means loss doubled (\(L_{recon}=2 L_{orig}\)).</li>
  <li>It can be negative.</li>
</ul>
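<p>The score is a one-liner; a sketch with the anchor values from the notes above:</p>

```python
def ce_rec(l_orig, l_recon):
    """Patched-loss score: 1 - (L_recon - L_orig)/L_orig = 2 - L_recon/L_orig."""
    return 2.0 - l_recon / l_orig

assert ce_rec(3.0, 3.0) == 1.0   # perfect preservation
assert ce_rec(3.0, 6.0) == 0.0   # loss doubled
assert ce_rec(3.0, 7.0) < 0.0    # worse than doubled: negative
```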

<p><strong>Per-token MSE (<code class="language-plaintext highlighter-rouge">mse_mean</code>).</strong> Mean squared reconstruction error, averaged over model dimensions:</p>

\[\mathrm{MSE} := \mathbb{E}\left[\frac{1}{d_{model}}\lVert a - \hat{a} \rVert^2\right].\]

<p><strong>Cosine similarity (<code class="language-plaintext highlighter-rouge">cosine sim</code>).</strong> Mean tokenwise cosine between original and reconstructed activation vectors:</p>

\[\cos(a,\hat{a}) := \frac{a^\top \hat{a}}{\lVert a\rVert\,\lVert \hat{a}\rVert}, \quad \text{report } \mathbb{E}[\cos(a,\hat{a})].\]

<p><strong>Relative error norm (<code class="language-plaintext highlighter-rouge">relative error norm</code>).</strong> Mean tokenwise relative L2 error:</p>

\[\mathrm{RelErr} := \mathbb{E}\left[\frac{\lVert a - \hat{a}\rVert}{\lVert a\rVert + \varepsilon}\right], \quad \varepsilon=10^{-10}.\]
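<p>The reconstruction metrics above can be computed in one pass. A numpy sketch matching the definitions, with activations shaped <code class="language-plaintext highlighter-rouge">(n_tokens, d_model)</code>:</p>

```python
import numpy as np

def recon_metrics(a, a_hat, eps=1e-10):
    """All four metrics from the definitions; a, a_hat: (n_tokens, d_model)."""
    err = a - a_hat
    sse = np.sum(err ** 2)
    sst = np.sum((a - a.mean(axis=0)) ** 2)               # mean-centered totals
    a_n, ah_n = np.linalg.norm(a, axis=1), np.linalg.norm(a_hat, axis=1)
    return {
        "r2": 1 - sse / sst,
        "mse_mean": np.mean(np.sum(err ** 2, axis=1) / a.shape[1]),
        "cosine": np.mean(np.sum(a * a_hat, axis=1) / (a_n * ah_n)),
        "rel_err": np.mean(np.linalg.norm(err, axis=1) / (a_n + eps)),
    }

# Sanity check: a perfect reconstruction maxes every metric.
rng = np.random.default_rng(0)
a = rng.standard_normal((200, 16))
perfect = recon_metrics(a, a.copy())
```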

<p><strong>Alive fraction (<code class="language-plaintext highlighter-rouge">alive %</code>).</strong> Fraction of SAE features that fire at least once on the eval set (for TopK, “fires” means encoder output \(\neq 0\) on some token):</p>

\[\mathrm{alive} := \frac{\{i : \exists\,t \text{ s.t. } z_{t,i}\neq 0\}}{d_{sae}}.\]

<p><strong>Participation ratio (PR).</strong> A geometry diagnostic for effective dimensionality, computed from eigenvalues \(\lambda_i\) of the mean-centered covariance:</p>

\[PR := \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2}.\]

<p><strong>Rate proxy (bits/token).</strong> An explicit coding-budget proxy for TopK codes: entropy-coded active indices plus fixed-point values (8 bits/value in this sweep). This is not mutual information; it’s just a concrete budget.</p>
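<p>A sketch of these two budget/geometry diagnostics. The index cost here is a simplifying assumption (\(\log_2 d_{sae}\) bits per active index) standing in for the entropy-coded cost used in the sweep:</p>

```python
import numpy as np

def participation_ratio(a):
    """PR = (sum lam_i)^2 / sum lam_i^2, eigenvalues of the mean-centered covariance."""
    lam = np.linalg.eigvalsh(np.cov(a, rowvar=False))
    return float(lam.sum() ** 2 / (lam ** 2).sum())

def rate_proxy_bits(k, d_sae, bits_per_value=8):
    # Assumption: log2(d_sae) bits per active index, not true entropy coding.
    return k * (np.log2(d_sae) + bits_per_value)

rng = np.random.default_rng(0)
pr_iso = participation_ratio(rng.standard_normal((20000, 10)))  # close to 10: isotropic
bits = rate_proxy_bits(k=8, d_sae=32 * 512)                     # 8 * (14 + 8) = 176 bits/token
```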

<hr />

<h2 id="experiment-setup-still-quite-crude-atm">Experiment Setup (still quite crude atm)</h2>

<p>I trained a grid of TopK SAEs and evaluated them two ways: (1) patching reconstructions back into the model and measuring CE change, and (2) direct reconstruction metrics in activation space.</p>

<p><strong>Main sweep (Fast2):</strong> two models, three depths, six sparsity budgets = 36 SAEs.</p>

<p>Models:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">pythia-70m</code> (<code class="language-plaintext highlighter-rouge">d_model=512</code>, <code class="language-plaintext highlighter-rouge">n_layers=6</code>)</li>
  <li><code class="language-plaintext highlighter-rouge">pythia-410m</code> (<code class="language-plaintext highlighter-rouge">d_model=1024</code>, <code class="language-plaintext highlighter-rouge">n_layers=24</code>)</li>
</ul>

<p>Hookpoints (matched by rough relative depth; late is confounded). Due to intense compute constraints, I’ve been limited to only 2 models at the moment. Will follow up.</p>

<table>
  <thead>
    <tr>
      <th>depth regime</th>
      <th style="text-align: right">70M layer</th>
      <th style="text-align: right">410M layer</th>
      <th>caveat</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>early</td>
      <td style="text-align: right">L1</td>
      <td style="text-align: right">L4</td>
      <td>roughly pre-“canyon”</td>
    </tr>
    <tr>
      <td>mid</td>
      <td style="text-align: right">L3</td>
      <td style="text-align: right">L12</td>
      <td>inside the “canyon” regime</td>
    </tr>
    <tr>
      <td>late</td>
      <td style="text-align: right">L5</td>
      <td style="text-align: right">L20</td>
      <td>70M late is final block; 410M late is not</td>
    </tr>
  </tbody>
</table>

<p><strong>Supplementary sweep (Fast3):</strong> three models at mid depth only, \(k \in \{8, 16, 32, 64\}\) = 12 SAEs. Adds Pythia-160M (<code class="language-plaintext highlighter-rouge">d_model=768</code>) as a bridge point.</p>

<p>SAE class and budgets (both sweeps):</p>

<ul>
  <li>Decoder class: single-layer TopK SAE with <strong>linear decoder</strong>.</li>
  <li>Expansion: \(d_{sae} = 32 \cdot d_{model}\).</li>
  <li>Training budget: 10,000,000 tokens per SAE (budgeted pilot).</li>
  <li>Dataset: streaming <code class="language-plaintext highlighter-rouge">NeelNanda/pile-small-tokenized-2b</code>.</li>
  <li>Eval: 200 held-out sequences of length 256 (≈51k tokens) from the same stream (separated by skipping ahead in the stream), with 95% CIs from a bootstrap over sequences (5,000 resamples).</li>
</ul>

<hr />

<h2 id="the-number-that-started-this">The Number That Started This</h2>

<p>Mid-layer, \(k=8\) (Fast2):</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th style="text-align: right">layer</th>
      <th style="text-align: right">\(R^2\)</th>
      <th style="text-align: right">\(CE_{rec}\) [95% CI]</th>
      <th style="text-align: right">implied \(L_{recon}/L_{orig}\)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">0.807</td>
      <td style="text-align: right">0.340 [0.169, 0.469]</td>
      <td style="text-align: right">1.660</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">12</td>
      <td style="text-align: right">0.961</td>
      <td style="text-align: right">0.235 [0.050, 0.382]</td>
      <td style="text-align: right">1.765</td>
    </tr>
  </tbody>
</table>

<p>So the larger model looks “nearly solved” by variance explained, but patched loss is still much worse than baseline. A blunt translation of \(CE_{rec}=0.235\) is: “patching reconstruction increases loss by about 76.5%.”</p>

<p>Stats caution: the CIs overlap at \(k=8\), so do not treat this single row as decisive. The stronger evidence is the <strong>consistent sign across the full \(k\) sweep</strong> (in this 10M-token/SAE budgeted regime) below.</p>

<hr />

<h2 id="result-1-the-proxy-gap-is-depth-localized">Result 1: The Proxy Gap Is Depth-Localized</h2>

<p>Across depths, the relationship between reconstruction and patched loss behaves differently.</p>

<p>The most direct evidence is the mid-layer delta table. At every \(k\) in this <strong>10M-token/SAE budgeted sweep</strong> (Fast2), scaling from 70M to 410M <strong>increases</strong> \(R^2\) while <strong>decreasing</strong> \(CE_{rec}\):</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">\(k\)</th>
      <th style="text-align: right">\(\Delta R^2\) (410M minus 70M)</th>
      <th style="text-align: right">\(\Delta CE_{rec}\) (410M minus 70M)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">8</td>
      <td style="text-align: right">+0.154</td>
      <td style="text-align: right">-0.106</td>
    </tr>
    <tr>
      <td style="text-align: right">16</td>
      <td style="text-align: right">+0.112</td>
      <td style="text-align: right">-0.068</td>
    </tr>
    <tr>
      <td style="text-align: right">32</td>
      <td style="text-align: right">+0.084</td>
      <td style="text-align: right">-0.017</td>
    </tr>
    <tr>
      <td style="text-align: right">64</td>
      <td style="text-align: right">+0.057</td>
      <td style="text-align: right">-0.027</td>
    </tr>
    <tr>
      <td style="text-align: right">128</td>
      <td style="text-align: right">+0.036</td>
      <td style="text-align: right">-0.032</td>
    </tr>
    <tr>
      <td style="text-align: right">256</td>
      <td style="text-align: right">+0.017</td>
      <td style="text-align: right">-0.037</td>
    </tr>
  </tbody>
</table>

<p>All six rows show the same sign pattern: \(\Delta R^2 &gt; 0\), \(\Delta CE_{rec} &lt; 0\).</p>

<p>The gap is largest at low \(k\) and shrinks substantially by \(k \ge 32\), where the absolute CE delta drops below 0.04. This is consistent with the first ~8-16 features being where variance-capture and loss-sensitivity diverge most: at low budgets, the SAE prioritizes high-variance directions (inflating \(R^2\)) while missing loss-sensitive structure. At higher budgets, there’s enough capacity to cover both.</p>

<p>For reference, correlations between \(R^2\) and \(CE_{rec}\) from the full Fast2 grid:</p>

<table>
  <thead>
    <tr>
      <th>depth</th>
      <th style="text-align: right">corr pooling \(k\) and scale</th>
      <th style="text-align: right">corr within fixed \(k\) (scale-only)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>early</td>
      <td style="text-align: right">+0.907</td>
      <td style="text-align: right">+0.353</td>
    </tr>
    <tr>
      <td>mid</td>
      <td style="text-align: right">+0.474</td>
      <td style="text-align: right"><strong>-0.943</strong></td>
    </tr>
    <tr>
      <td>late</td>
      <td style="text-align: right">+0.702</td>
      <td style="text-align: right">+0.976 (confounded)</td>
    </tr>
  </tbody>
</table>

<p><strong>Important caveat:</strong> <code class="language-plaintext highlighter-rouge">corr within fixed k</code> is computed by subtracting the mean within each \(k\) (i.e. “demeaning by \(k\)”) and then correlating the residuals across all points in that depth (\(n=12\) here = 2 models \(\times\) 6 \(k\) values). With only two model sizes, treat it mainly as a <strong>sign diagnostic</strong> for how scaling moves \(R^2\) vs \(CE_{rec}\) at fixed \(k\), not as a stable statistic. The delta table above is the real evidence.</p>

<hr />

<h2 id="why-this-can-happen-no-alien-neuralese-required">Why This Can Happen (no alien neuralese required)</h2>

<p>The mid-layer sign flip (higher \(R^2\) but worse \(CE_{rec}\) at fixed \(k\)) is explainable with two mundane facts, plus one important confound.</p>

<h3 id="1-r2-is-variance-normalized-and-variance-scale-changes-with-model-size">1) \(R^2\) is variance-normalized, and variance scale changes with model size</h3>

<p>Mid-layer, \(k=8\):</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th style="text-align: right">mse mean</th>
      <th style="text-align: right">relative error norm</th>
      <th style="text-align: right">\(R^2\)</th>
      <th style="text-align: right">SSE/token</th>
      <th style="text-align: right">SST/token</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M</td>
      <td style="text-align: right">0.0854</td>
      <td style="text-align: right">0.509</td>
      <td style="text-align: right">0.807</td>
      <td style="text-align: right">43.7</td>
      <td style="text-align: right">226.7</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">0.1575</td>
      <td style="text-align: right">0.576</td>
      <td style="text-align: right">0.961</td>
      <td style="text-align: right">161.3</td>
      <td style="text-align: right">4131.1</td>
    </tr>
  </tbody>
</table>

<p>Here SSE/token = mse_mean times d_model and SST/token is backed out from \(R^2\) via \(R^2 = 1 - \mathrm{SSE}/\mathrm{SST}\). The 410M mid layer has about 18x larger mean-centered variance scale (SST/token), so variance explained can look great even when absolute errors are not small.</p>
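<p>The backed-out columns are a two-line computation, and running it reproduces the roughly 18x variance-scale gap:</p>

```python
def sse_per_token(mse_mean, d_model):
    return mse_mean * d_model

def sst_per_token(mse_mean, d_model, r2):
    # From R^2 = 1 - SSE/SST  =>  SST = SSE / (1 - R^2).
    return sse_per_token(mse_mean, d_model) / (1.0 - r2)

sst_70m  = sst_per_token(0.0854, 512, 0.807)    # ~226.5
sst_410m = sst_per_token(0.1575, 1024, 0.961)   # ~4135
ratio = sst_410m / sst_70m                      # ~18x
```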

<p><strong>Prediction:</strong> metrics that do <em>not</em> divide by SST (e.g. cosine similarity, relative error norm, raw MSE) should not show the same “looks solved” inflation. And they don’t. At mid-layer \(k=8\) (Fast3 three-model check):</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th style="text-align: right">\(R^2\)</th>
      <th style="text-align: right">cosine sim</th>
      <th style="text-align: right">relative error norm</th>
      <th style="text-align: right">\(CE_{rec}\)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M</td>
      <td style="text-align: right">0.810</td>
      <td style="text-align: right">0.856</td>
      <td style="text-align: right">0.506</td>
      <td style="text-align: right">0.372</td>
    </tr>
    <tr>
      <td>160M</td>
      <td style="text-align: right">0.907</td>
      <td style="text-align: right">0.869</td>
      <td style="text-align: right">0.485</td>
      <td style="text-align: right">0.331</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">0.961</td>
      <td style="text-align: right"><strong>0.809</strong></td>
      <td style="text-align: right"><strong>0.574</strong></td>
      <td style="text-align: right">0.269</td>
    </tr>
  </tbody>
</table>

<p>\(R^2\) improves monotonically with scale. Cosine similarity and relative error norm both show the 410M reconstruction is <strong>worse</strong>, consistent with \(CE_{rec}\). The proxy gap here is specific to variance-normalized reconstruction metrics like \(R^2\).</p>

<h3 id="2-loss-sensitivity-weights-directions-differently-than-covariance-geometry">2) Loss sensitivity weights directions differently than covariance geometry</h3>

<p>Let \(\hat{a} = a + \delta\) be the reconstruction. For downstream loss \(L(a)\), a first-order Taylor approximation gives:</p>

\[L(\hat{a}) - L(a) \approx g^\top \delta, \quad g := \nabla_a L.\]

<p>Reconstruction metrics weight directions by the activation distribution (covariance). Loss change weights directions by sensitivity \(g\) (and, beyond first order, curvature). If sensitivity mass lives in comparatively low-variance directions, you can have high \(R^2\) and still hurt loss.</p>

<p>This is the “proxy gap” mechanism in one sentence: <strong>MSE/variance and loss sensitivity are different measures on activation space.</strong></p>
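<p>A two-dimensional caricature of the mechanism: put almost all of the variance in a direction the loss ignores, let the “SAE” keep only that direction, and \(R^2\) looks nearly solved while the first-order loss shift stays large (all numbers synthetic):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
a = np.stack([10.0 * rng.standard_normal(n),    # high-variance, loss-insensitive
              0.5 * rng.standard_normal(n)],    # low-variance, loss-sensitive
             axis=1)
g = np.array([0.0, 4.0])                        # loss gradient lives in dim 1 only

a_hat = a.copy()
a_hat[:, 1] = 0.0                               # keep only the high-variance direction
delta = a_hat - a

r2 = 1 - np.sum(delta ** 2) / np.sum((a - a.mean(axis=0)) ** 2)   # high, ~0.9975
loss_shift_sq = np.mean((delta @ g) ** 2)                          # ~4.0, not small
```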

<h3 id="confound-fixed-sae-training-budget-undertrains-larger-saes">Confound: fixed SAE training budget undertrains larger SAEs</h3>

<p>In these sweeps, every SAE gets the same 10M-token training budget, while \(d_{sae} \propto d_{model}\). Empirically, alive fraction often falls with model size (especially early/mid at low \(k\)) under this fixed budget. That is consistent with larger SAEs being relatively more undertrained, which can degrade behavior preservation and direction-sensitive reconstruction metrics even when \(R^2\) looks strong. (See Limitations and the 100M-token check below.)</p>

<hr />

<h2 id="result-2-three-model-check-fast3-mid-layer-only">Result 2: Three-Model Check (Fast3, Mid-Layer Only)</h2>

<p>A separate mid-only sweep adds Pythia-160M as a third point:</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th style="text-align: right">\(d_{model}\)</th>
      <th style="text-align: right">\(k\)</th>
      <th style="text-align: right">\(R^2\)</th>
      <th style="text-align: right">\(CE_{rec}\) [95% CI]</th>
      <th style="text-align: right">cosine</th>
      <th style="text-align: right">alive %</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M</td>
      <td style="text-align: right">512</td>
      <td style="text-align: right">8</td>
      <td style="text-align: right">0.810</td>
      <td style="text-align: right">0.372 [0.232, 0.478]</td>
      <td style="text-align: right">0.856</td>
      <td style="text-align: right">15.2%</td>
    </tr>
    <tr>
      <td>160M</td>
      <td style="text-align: right">768</td>
      <td style="text-align: right">8</td>
      <td style="text-align: right">0.907</td>
      <td style="text-align: right">0.331 [0.122, 0.477]</td>
      <td style="text-align: right">0.869</td>
      <td style="text-align: right">12.9%</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">1024</td>
      <td style="text-align: right">8</td>
      <td style="text-align: right">0.961</td>
      <td style="text-align: right">0.269 [0.094, 0.404]</td>
      <td style="text-align: right">0.809</td>
      <td style="text-align: right">11.9%</td>
    </tr>
    <tr>
      <td>70M</td>
      <td style="text-align: right">512</td>
      <td style="text-align: right">16</td>
      <td style="text-align: right">0.855</td>
      <td style="text-align: right">0.580 [0.439, 0.671]</td>
      <td style="text-align: right">0.893</td>
      <td style="text-align: right">37.1%</td>
    </tr>
    <tr>
      <td>160M</td>
      <td style="text-align: right">768</td>
      <td style="text-align: right">16</td>
      <td style="text-align: right">0.924</td>
      <td style="text-align: right">0.538 [0.369, 0.653]</td>
      <td style="text-align: right">0.908</td>
      <td style="text-align: right">28.5%</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">1024</td>
      <td style="text-align: right">16</td>
      <td style="text-align: right">0.968</td>
      <td style="text-align: right">0.530 [0.394, 0.634]</td>
      <td style="text-align: right">0.883</td>
      <td style="text-align: right">25.4%</td>
    </tr>
    <tr>
      <td>70M</td>
      <td style="text-align: right">512</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.891</td>
      <td style="text-align: right">0.729 [0.622, 0.796]</td>
      <td style="text-align: right">0.920</td>
      <td style="text-align: right">63.2%</td>
    </tr>
    <tr>
      <td>160M</td>
      <td style="text-align: right">768</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.938</td>
      <td style="text-align: right">0.696 [0.569, 0.780]</td>
      <td style="text-align: right">0.932</td>
      <td style="text-align: right">51.1%</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">1024</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.974</td>
      <td style="text-align: right">0.723 [0.618, 0.801]</td>
      <td style="text-align: right">0.919</td>
      <td style="text-align: right">51.4%</td>
    </tr>
    <tr>
      <td>70M</td>
      <td style="text-align: right">512</td>
      <td style="text-align: right">64</td>
      <td style="text-align: right">0.922</td>
      <td style="text-align: right">0.836 [0.758, 0.882]</td>
      <td style="text-align: right">0.944</td>
      <td style="text-align: right">83.8%</td>
    </tr>
    <tr>
      <td>160M</td>
      <td style="text-align: right">768</td>
      <td style="text-align: right">64</td>
      <td style="text-align: right">0.951</td>
      <td style="text-align: right">0.809 [0.712, 0.871]</td>
      <td style="text-align: right">0.934</td>
      <td style="text-align: right">71.2%</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">1024</td>
      <td style="text-align: right">64</td>
      <td style="text-align: right">0.979</td>
      <td style="text-align: right">0.814 [0.730, 0.875]</td>
      <td style="text-align: right">0.901</td>
      <td style="text-align: right">65.0%</td>
    </tr>
  </tbody>
</table>

<p>Three observations:</p>

<ol>
  <li>
    <p><strong>At \(k=8\) and \(k=16\), the pattern holds across three models:</strong> \(R^2\) increases monotonically with scale while \(CE_{rec}\) decreases monotonically.</p>
  </li>
  <li>
    <p><strong>At \(k \ge 32\), monotonicity breaks for \(CE_{rec}\).</strong> At \(k=32\): 410M (\(CE_{rec}=0.723\)) recovers more behavior than 160M (0.696). At \(k=64\): 410M (0.814) slightly outperforms 160M (0.809). The proxy gap is a <strong>low-\(k\) phenomenon</strong>.</p>
  </li>
  <li>
    <p><strong>In Fast3 (mid-only), all pairwise CIs overlap at every \(k\).</strong> At \(k=8\) (the largest gap): 70M CI = [0.232, 0.478], 410M CI = [0.094, 0.404]. Per-batch variability is large — CE standard deviations range from 0.36 to 0.93 across conditions. Cohen’s \(d\) for the 70M-vs-410M difference at \(k=8\) is approximately 0.14. The consistent sign across \(k\) values is more informative than any individual comparison.</p>
  </li>
</ol>
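<p>The effect size quoted above is standard Cohen’s \(d\) with a pooled standard deviation; a self-contained sketch (toy arrays with the \(k=8\) means and a representative SD, not the sweep’s actual batches):</p>

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d with pooled (Bessel-corrected) standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# With per-batch SDs around 0.7, a mean gap of ~0.1 is a small effect.
rng = np.random.default_rng(0)
ce_70m = rng.normal(loc=0.372, scale=0.7, size=400)
ce_410m = rng.normal(loc=0.269, scale=0.7, size=400)
print(f"d = {cohens_d(ce_70m, ce_410m):.2f}")
```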

<p>For clarity: the CI overlap does <strong>not</strong> extend to Fast2 late depth, where 70M-vs-410M CE CIs are non-overlapping at \(k=32,64,128,256\).</p>

<hr />

<h2 id="result-3-there-is-a-measurable-k-tax-earlymid">Result 3: There Is a Measurable \(k^*\) Tax (Early/Mid)</h2>

<p>To make scaling concrete, define \(k^*(t)\) as the minimum \(k\) needed to reach \(CE_{rec} \ge t\) (linear interpolation on the Fast2 \(k\) grid; single-seed, no uncertainty estimate on the interpolation itself).</p>
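<p>The interpolation is nothing exotic; a sketch of the \(k^*\) computation on an illustrative grid (not the sweep’s values), assuming \(CE_{rec}\) increases monotonically in \(k\):</p>

```python
import numpy as np

def k_star(ks, ce_rec, target):
    """Minimum k reaching CE_rec >= target by linear interpolation on the
    (k, CE_rec) grid; returns None if the target is never reached.
    Assumes ce_rec is monotonically increasing in k (np.interp requires
    increasing x-coordinates)."""
    ks = np.asarray(ks, dtype=float)
    ce_rec = np.asarray(ce_rec, dtype=float)
    if ce_rec.max() < target:
        return None
    return float(np.interp(target, ce_rec, ks))

ks = [8, 16, 32, 64, 128]              # illustrative, not the paper's grid
ce = [0.34, 0.58, 0.73, 0.84, 0.91]
print(round(k_star(ks, ce, 0.85), 1))  # 73.1
```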

<table>
  <thead>
    <tr>
      <th>depth</th>
      <th style="text-align: right">target \(t\)</th>
      <th style="text-align: right">\(k^*\)(70M)</th>
      <th style="text-align: right">\(k^*\)(410M)</th>
      <th style="text-align: right">ratio</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>early</td>
      <td style="text-align: right">0.85</td>
      <td style="text-align: right">40.6</td>
      <td style="text-align: right">55.0</td>
      <td style="text-align: right">1.36</td>
    </tr>
    <tr>
      <td>early</td>
      <td style="text-align: right">0.90</td>
      <td style="text-align: right">62.0</td>
      <td style="text-align: right">106.9</td>
      <td style="text-align: right">1.72</td>
    </tr>
    <tr>
      <td>mid</td>
      <td style="text-align: right">0.85</td>
      <td style="text-align: right">71.2</td>
      <td style="text-align: right">98.0</td>
      <td style="text-align: right">1.38</td>
    </tr>
    <tr>
      <td>mid</td>
      <td style="text-align: right">0.90</td>
      <td style="text-align: right">116.5</td>
      <td style="text-align: right">187.6</td>
      <td style="text-align: right">1.61</td>
    </tr>
  </tbody>
</table>

<p>This is not a blow-up (at these scales), but it is also not zero. Under a fixed decoder class and a fixed tool-training budget, larger scale can require more active features to preserve loss. However, see the alive-fraction caveat in Limitations — this may partly reflect undertraining rather than intrinsic difficulty.</p>

<hr />

<h2 id="geometric-context-a-raw-pr-canyon">Geometric Context: A Raw PR “Canyon”</h2>

<p>Recall: this is diagnostic, not the primary claim.</p>

<p>I computed raw PR of <code class="language-plaintext highlighter-rouge">hook_resid_post</code> across layers.</p>
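<p>For concreteness, participation ratio is \(PR = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2\) over covariance eigenvalues; a sketch on synthetic data (note that “raw” PR conventions differ on centering):</p>

```python
import numpy as np

def participation_ratio(acts: np.ndarray) -> float:
    """PR = (sum lam)^2 / sum lam^2 over covariance eigenvalues.
    Ranges from 1 (one dominant direction) to d (isotropic)."""
    centered = acts - acts.mean(axis=0)   # conventions differ on centering
    lam = np.linalg.eigvalsh(np.cov(centered, rowvar=False))
    lam = np.clip(lam, 0.0, None)         # guard tiny negative eigenvalues
    return float(lam.sum() ** 2 / (lam ** 2).sum())

rng = np.random.default_rng(0)
iso = rng.normal(size=(5000, 64))         # isotropic: PR near d = 64
aniso = iso * np.r_[30.0, np.ones(63)]    # one dominant direction: PR near 1
print(round(participation_ratio(iso), 1), round(participation_ratio(aniso), 1))
```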

<p>Selected points:</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th style="text-align: right">layer</th>
      <th style="text-align: right">raw PR</th>
      <th style="text-align: right">PR/\(d_{model}\)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">51.914</td>
      <td style="text-align: right">0.101</td>
    </tr>
    <tr>
      <td>70M</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">3.961</td>
      <td style="text-align: right">0.008</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">4</td>
      <td style="text-align: right">111.967</td>
      <td style="text-align: right">0.109</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">5</td>
      <td style="text-align: right">2.045</td>
      <td style="text-align: right">0.002</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">12</td>
      <td style="text-align: right">1.177</td>
      <td style="text-align: right">0.001</td>
    </tr>
  </tbody>
</table>

<p>So both models show a sharp drop into a long low-PR band, with 410M exhibiting an especially extreme canyon (\(PR \approx 1\) for many layers).</p>

<p>Two important caveats:</p>

<ul>
  <li>Raw PR can be dominated by a leading direction. Standard practice is to “deflate” (remove) the top eigenvector and recompute PR. When you do that, the absolute PR values change a lot, but the “expand then compress” depth profile often remains. That’s a known phenomenon, and I’m not claiming novelty on it.</li>
  <li>In this post, PR is a <strong>warning label</strong>: “this is where variance-normalized reconstruction proxies are likely to lie.” The mid-layer proxy gap above happens inside this low-PR regime.</li>
</ul>
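<p>Deflation itself is cheap: project out the top covariance eigendirection(s) and recompute PR. A self-contained sketch on synthetic data, showing how one dominant direction can pin raw PR near 1 while the deflated PR recovers the bulk dimensionality:</p>

```python
import numpy as np

def pr(acts: np.ndarray) -> float:
    lam = np.linalg.eigvalsh(np.cov(acts - acts.mean(0), rowvar=False))
    lam = np.clip(lam, 0.0, None)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

def deflate_top(acts: np.ndarray, n_remove: int = 1) -> np.ndarray:
    """Remove the top-n covariance eigendirections (eigh sorts ascending)."""
    centered = acts - acts.mean(0)
    _, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    top = vecs[:, -n_remove:]
    return centered - (centered @ top) @ top.T

rng = np.random.default_rng(0)
acts = rng.normal(size=(5000, 64)) * np.r_[30.0, np.ones(63)]
print(round(pr(acts), 1), round(pr(deflate_top(acts)), 1))  # ~1.1 then ~63
```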

<p>I haven’t included deflated PR plots yet because my current RunPod environment was unstable (numpy/datasets stack broke). I’ll add deflated PR once it’s cleanly reproducible.</p>

<hr />

<h2 id="information-theoretic-framing">Information-Theoretic Framing</h2>

<p>If you squint, an SAE is a lossy code. Then “interpretability under budgets” starts looking like <strong>rate-distortion</strong>:</p>

<ul>
  <li>Rate: bits/token of your sparse code (I log a crude proxy).</li>
  <li>Distortion: either reconstruction distortion (\(1-R^2\)) or behavior distortion (\(1-CE_{rec}\)).</li>
</ul>
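<p>One plausible way to construct such a rate proxy (my construction for illustration, not necessarily the one logged in the sweeps): count the bits needed to name which \(k\) of \(d_{sae}\) features fire, plus a fixed bit width per nonzero coefficient.</p>

```python
import math

def rate_proxy_bits_per_token(k: int, d_sae: int, value_bits: int = 16) -> float:
    """Crude bits/token proxy for a TopK sparse code:
    log2 C(d_sae, k) index bits + value_bits per active coefficient."""
    index_bits = math.log2(math.comb(d_sae, k))
    return index_bits + k * value_bits

for k in (8, 32, 128):
    print(k, round(rate_proxy_bits_per_token(k, d_sae=4096), 1))
```

<p>Under this proxy, sweeping \(k\) moves you along a real rate axis, so the \(k^*\) tax above can be re-read as extra bits/token required to hit a loss target.</p>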

<p>The practical point is not that “mutual information is falling” (I did not measure MI). The point is that <strong>you can put a real budget axis on the x-axis</strong> instead of just reporting \(k\) or \(R^2\).</p>

<p>This also connects (loosely) to recent “bounded observer” information frameworks, such as Finzi et al. (2026) on epiplexity. The common theme is: the same object can contain structure that is “there,” but not extractable by a weaker observer/tool class under a budget. In this sweep, a fixed SAE class looks better by \(R^2\) but worse by patched loss in mid layers as scale increases, which is at least qualitatively consistent with “extractable structure for this tool class” degrading even when “variance structure” looks easy.</p>

<p>I am not claiming a formal reduction from SAE failure modes to epiplexity or time-bounded entropy. I’m using it as a framing: interpretability tools are observers with constraints, and proxy metrics can hide when you’re falling out of the extractable regime.</p>

<hr />

<h2 id="relevance">Relevance?</h2>

<p>Given that PR/anisotropy phenomena are already documented, any novel claim here is fairly narrow. Still, the sweeps show:</p>

<ul>
  <li>A depth-localized, reproducible proxy gap where \(R^2\) can improve with scale at fixed \(k\) while patched loss gets worse — <strong>strongest at low \(k\), closing by \(k \ge 32\)</strong>.</li>
  <li>A concrete mechanism (SST inflation in anisotropic layers) that <strong>predicts which reconstruction metrics are affected</strong>: \(R^2\) yes, cosine/relative error norm no — but the deeper point is that reconstruction metrics in general can diverge from behavioral ones.</li>
  <li>An operational “interpretability budget” object (\(k^*\) for loss targets, plus rate proxies) that moves with depth and scale.</li>
  <li>A concrete diagnostic (raw PR canyon) that flags where reconstruction-only evaluation is especially untrustworthy.</li>
</ul>

<p>If you already believe “reconstruction isn’t behavior,” you may find this useful as an attempt to make that belief operational, with a knob you can sweep and a failure mode you can reproduce.</p>

<p><strong>Practical methodology tweak:</strong> if you’re evaluating SAE quality, behavioral metrics like patched CE loss or ablation-normalized CE recovery (e.g. SAE Lens’s <code class="language-plaintext highlighter-rouge">ce_loss_score</code>) should be the acceptance test — not reconstruction fidelity alone. Among reconstruction metrics, cosine similarity or relative error norm are more robust than \(R^2\) in mid-layer residual streams, but they still measure reconstruction, not behavior.</p>
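<p>For reference, the ablation-normalized convention looks like this (a sketch of the standard formula; the exact SAE Lens implementation may differ in its choice of ablation baseline):</p>

```python
def ce_recovered(ce_clean: float, ce_patched: float, ce_ablate: float) -> float:
    """Ablation-normalized CE recovery: 1.0 = patching in the SAE
    reconstruction fully restores the clean loss; 0.0 = no better than
    ablating the hook (e.g. zero-ablation)."""
    return (ce_ablate - ce_patched) / (ce_ablate - ce_clean)

# Toy numbers: clean CE 3.0, zero-ablated CE 5.0, SAE-patched CE 3.2
# -> 90% of the ablation gap is recovered.
print(round(ce_recovered(3.0, 3.2, 5.0), 3))  # 0.9
```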

<hr />

<h2 id="limitations-things-a-reviewer-should-hit-me-for">Limitations (Things A Reviewer Should Hit Me For)</h2>

<ul>
  <li><strong>Two model sizes in the main sweep</strong> is not a scaling law fit. Fast3 adds a third mid-layer point but does not span depths.</li>
  <li><strong>One SAE training seed per condition.</strong> Bootstrap CIs reflect eval-batch variability, not training variability. In Fast3 mid-only, all pairwise CIs overlap at every \(k\); in Fast2 late depth, 70M-vs-410M CIs do not overlap at \(k=32,64,128,256\).</li>
  <li><strong>Fixed tool-training budget (10M tokens/SAE).</strong> These are budgeted curves, not best-achievable. Alive fraction <strong>often decreases with model size</strong> in early/mid at low \(k\) under this budget (with exceptions at higher \(k\) / different depths), consistent with larger SAEs being relatively more undertrained under the same token budget. The observed \(CE_{rec}\) gap may partly reflect differential undertraining rather than intrinsic representation difficulty. See the 100M-token check below.</li>
  <li><strong>Late depth is confounded</strong> (final block vs non-final), and late 70M runs show very low alive fractions consistent with undertraining.</li>
  <li><strong>My \(CE_{rec}\) definition is nonstandard.</strong> It is not the ablation-normalized “CE loss recovered” used in Bricken et al. (2023), Gao et al. (2024), and SAE Lens. Do not compare magnitudes without converting. See Definitions section.</li>
  <li><strong>The \(R^2\) inflation is specific to variance-normalized metrics.</strong> Cosine similarity and relative error norm agree with \(CE_{rec}\) that larger-model reconstructions are worse at fixed \(k\). But more broadly, reconstruction metrics and behavioral metrics can diverge — behavioral acceptance tests (patched CE, ablation-normalized CE recovery) are the more direct measure of what matters.</li>
</ul>

<h2 id="update-a-100m-token-mid-layer-k32-check-token-budget-sensitivity">Update: A 100M-Token Mid-Layer \(k=32\) Check (Token-Budget Sensitivity)</h2>

<p>To probe the “fixed token budget” confound, I retrained the mid-layer \(k=32\) TopK SAE for both models with a 10x larger training budget (100M tokens/SAE; SAE Lens v6.36.0, context size 256). Compared to the budgeted Fast2 runs:</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th style="text-align: right">train tokens</th>
      <th style="text-align: right">\(k\)</th>
      <th style="text-align: right">\(CE_{rec}\)</th>
      <th style="text-align: right">implied \(L_{recon}/L_{orig}\)</th>
      <th style="text-align: right">alive %</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M (Fast2)</td>
      <td style="text-align: right">10M</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.731</td>
      <td style="text-align: right">1.269</td>
      <td style="text-align: right">61.7%</td>
    </tr>
    <tr>
      <td>410M (Fast2)</td>
      <td style="text-align: right">10M</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.714</td>
      <td style="text-align: right">1.286</td>
      <td style="text-align: right">50.1%</td>
    </tr>
    <tr>
      <td>70M (100M)</td>
      <td style="text-align: right">100M</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.898</td>
      <td style="text-align: right">1.102</td>
      <td style="text-align: right">94.7%</td>
    </tr>
    <tr>
      <td>410M (100M)</td>
      <td style="text-align: right">100M</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.920</td>
      <td style="text-align: right">1.080</td>
      <td style="text-align: right">94.1%</td>
    </tr>
  </tbody>
</table>

<p>This suggests the mid-layer \(k=32\) “scale hurts \(CE_{rec}\)” effect in the 10M-token sweep is at least partly an undertraining artifact: alive fraction jumps, behavior preservation improves sharply, and the scale ordering flips. This does <strong>not</strong> test the low-\(k\) proxy gap (where the original effect is strongest).</p>

<hr />

<h2 id="update-a-50m-token-mid-layer-k8-anchor-in-tree">Update: A 50M-Token Mid-Layer \(k=8\) Anchor (In-Tree)</h2>

<p>To add a low-\(k\) budget anchor in the same regime as the main headline, I trained a 70M mid-layer \(k=8\) SAE for 50M tokens (in-tree artifact):</p>

<p><code class="language-plaintext highlighter-rouge">interpretability/interpretability/workspace/results/budget_check_m4_mid_k8_50M/run_pythia-70m_L3_mid_k8.json</code></p>

<p>Compared to the 10M Fast2 anchor:</p>

<table>
  <thead>
    <tr>
      <th>model/depth/k</th>
      <th style="text-align: right">train tokens</th>
      <th style="text-align: right">\(CE_{rec}\) [95% CI]</th>
      <th style="text-align: right">\(R^2\)</th>
      <th style="text-align: right">alive %</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M mid \(k=8\) (Fast2)</td>
      <td style="text-align: right">10M</td>
      <td style="text-align: right">0.340 [0.169, 0.469]</td>
      <td style="text-align: right">0.807</td>
      <td style="text-align: right">13.2%</td>
    </tr>
    <tr>
      <td>70M mid \(k=8\) (new)</td>
      <td style="text-align: right">50M</td>
      <td style="text-align: right">0.670 [0.629, 0.706]</td>
      <td style="text-align: right">0.863</td>
      <td style="text-align: right">44.2%</td>
    </tr>
  </tbody>
</table>

<p>This is a large improvement in both behavior preservation and feature utilization under higher training budget, which strengthens the undertraining caveat for low-\(k\) settings.</p>

<hr />

<h2 id="next-steps-high-leverage">Next Steps (High Leverage)</h2>

<ol>
  <li><strong>Bridge model at all depths:</strong> add Pythia-160M at early/mid/late to get three-point depth-resolved comparisons.</li>
  <li><strong>Token-budget sensitivity:</strong> extend low-\(k\) checks to include 410M (and 160M) at 50M/100M, and add early/late anchors, to separate undertraining from intrinsic difficulty and equalize alive fractions across scale.</li>
  <li><strong>Deflated PR and anisotropy controls:</strong> mean subtraction, top-eigen removal, to tighten the geometry story.</li>
  <li><strong>Sensitivity-weighted distortions:</strong> Fisher/Hessian approximations to predict loss impact better than \(R^2\).</li>
  <li><strong>Legibility evaluation:</strong> SAEBench, MDL-style probing, to connect fidelity to human-usable features.</li>
</ol>

<hr />

<h2 id="repro-minimal">Repro (Minimal)</h2>

<p>Fast2 results: <code class="language-plaintext highlighter-rouge">interpretability/workspace/results/k_scaling_early-mid-late_fast2/</code></p>

<p>Fast3 results: <code class="language-plaintext highlighter-rouge">interpretability/workspace/results/k_scaling_mid_fast3/</code></p>

<p>50M-token mid-layer \(k=8\) anchor (in-tree): <code class="language-plaintext highlighter-rouge">interpretability/interpretability/workspace/results/budget_check_m4_mid_k8_50M/</code></p>

<p>100M-token mid-layer \(k=32\) check: <code class="language-plaintext highlighter-rouge">/Users/oboh/Downloads/experiment_results.tar.gz</code> (metrics in <code class="language-plaintext highlighter-rouge">results/*.json</code>).</p>

<p>Regenerate the tables:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python3 interpretability/analyze_k_scaling_results.py <span class="se">\</span>
  interpretability/workspace/results/k_scaling_early-mid-late_fast2 <span class="se">\</span>
  <span class="nt">--include-legibility</span> <span class="se">\</span>
  <span class="nt">--markdown-out</span> interpretability/workspace/results/k_scaling_early-mid-late_fast2/writeup_ready_tables.md
</code></pre></div></div>

<hr />

<h2 id="references-non-exhaustive">References (Non-Exhaustive)</h2>

<ul>
  <li>Bricken et al. (2023). <em>Towards Monosemanticity: Decomposing Language Models With Dictionary Learning</em>. Anthropic.</li>
  <li>Cunningham, Ewart, Riggs, Huben and Sharkey (2023). <em>Sparse Autoencoders Find Highly Interpretable Features in Language Models</em>. ICLR 2024.</li>
  <li>Gao et al. (2024). <em>Scaling and Evaluating Sparse Autoencoders</em>. arXiv:2406.04093.</li>
  <li>Elhage et al. (2022). <em>Toy Models of Superposition</em>. arXiv:2209.10652.</li>
  <li>Finzi et al. (2026). <em>From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence</em>. arXiv:2601.03220.</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Weird</title><link href="https://obiohagwu.github.io/obiohagwu/2026/02/01/Weird.html" rel="alternate" type="text/html" title="Weird" /><published>2026-02-01T00:00:00+00:00</published><updated>2026-02-01T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/02/01/Weird</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/02/01/Weird.html"><![CDATA[<h1 id="taste-air">TASTE? AIR?</h1>

<p>Things are getting stranger. You feel it?</p>]]></content><author><name></name></author><summary type="html"><![CDATA[TASTE? AIR?]]></summary></entry><entry><title type="html">Hmm</title><link href="https://obiohagwu.github.io/obiohagwu/2025/12/31/Hmm.html" rel="alternate" type="text/html" title="Hmm" /><published>2025-12-31T00:00:00+00:00</published><updated>2025-12-31T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2025/12/31/Hmm</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2025/12/31/Hmm.html"><![CDATA[<hr />

<p>Hello. It’s been about 2 and a half years since I made one of these. It’s good to look back and reflect on life really. 
<br /></p>

<p>I started writing these initially on Substack, although I do have a bunch of unfinished posts, as is my signature style. 
<br /></p>

<p>I’m not sure what to say really. I do have a lot of new things going on vs when I was 17 in the heat of the pandemic dealing with all sorts of internal consternation and tumult. Things are materially much better now though in certain ways. Not absolutely.</p>

<p><img width="1980" height="1226" alt="image" src="https://github.com/user-attachments/assets/7d0bc4a1-3c7f-4d1d-a116-806ed8522119" /></p>

<hr />

<p>Well, I’m about to undertake my 2nd serious business venture. Actually I’d say 3rd. Aside from my current nascent AI/GRAPHICS service thing.
<br /></p>

<p>So I guess in regards to these little hobbies of mine I have made some material progress, although all this has been mostly within this year of 2025, and mostly within the second half of it as well.</p>

<hr />

<p>I think I may have been in some sort of unpleasant melancholic basin; I do believe I am making good step sizes and almost see the light. 
<br /></p>

<p>I think I see a way out. A way to win. Hopefully my sins don’t catch up to me before I can achieve actualization. As a true Catholic, I still can’t escape guilt-driven morality.
<br /></p>

<p>Happy new year to myself. Let’s see what can be done, amen.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[title: Hmm date: 2025-12-31 kind: diary —]]></summary></entry></feed>