<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://obiohagwu.github.io/obiohagwu/feed.xml" rel="self" type="application/atom+xml" /><link href="https://obiohagwu.github.io/obiohagwu/" rel="alternate" type="text/html" /><updated>2026-04-08T20:35:05+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/feed.xml</id><title type="html">obiohagwu</title><subtitle>Independent research notes on interpretability, representation geometry, and sequence and speech modeling.</subtitle><entry><title type="html">Topological Signal In Learned Odor Embeddings</title><link href="https://obiohagwu.github.io/obiohagwu/2026/04/08/Topological-signal-in-learned-odor-embeddings.html" rel="alternate" type="text/html" title="Topological Signal In Learned Odor Embeddings" /><published>2026-04-08T00:00:00+00:00</published><updated>2026-04-08T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/04/08/Topological-signal-in-learned-odor-embeddings</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/04/08/Topological-signal-in-learned-odor-embeddings.html"><![CDATA[<p>Got nerdsniped recently.</p>

<p>Q:</p>
<blockquote>
  <p>if a learned odor embedding appears to show nontrivial topology, does that signal survive metric choice, subsampling, null models, checkpoint variation, and comparison to ordinary chemical baselines?</p>
</blockquote>

<p>That is a much more defensible question than “what is the topology of odor space?”</p>

<p>Useful links:</p>

<ul>
  <li><a href="/obiohagwu/images/odor-topology/robust-but-not-unique-odor-topology.pdf">paper (PDF)</a></li>
  <li><a href="https://github.com/Obiohagwu/odor-topology">code and analysis repo</a></li>
</ul>

<hr />

<h3 id="what-i-actually-tested">What I Actually Tested</h3>

<p>The representation under audit was OpenPOM, the open implementation of the Principal Odor Map line of work.</p>

<p>The comparison stack was:</p>

<ul>
  <li>two primary OpenPOM checkpoints for matched direct comparisons</li>
  <li>all <code class="language-plaintext highlighter-rouge">10</code> released OpenPOM ensemble checkpoints for stability checks</li>
  <li>a curated <code class="language-plaintext highlighter-rouge">4,983</code>-molecule GoodScents/Leffingwell table</li>
  <li>a broader <code class="language-plaintext highlighter-rouge">5,862</code>-molecule GS/LF table</li>
  <li>a <code class="language-plaintext highlighter-rouge">1,600</code>-molecule non-overlap subset</li>
  <li>Morgan bit fingerprints</li>
  <li>Morgan count fingerprints</li>
  <li>a small RDKit physicochemical descriptor baseline</li>
</ul>

<p>The main target was persistent <code class="language-plaintext highlighter-rouge">H1</code>, not “all topology.” The whole point was to see whether any apparent loop-like structure was actually robust enough to survive the usual failure modes.</p>
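<p>The core robustness check can be sketched as a subsample-vs-null protocol. The sketch below is illustrative rather than the paper's actual code: in the real analysis the summary statistic would be an <code class="language-plaintext highlighter-rouge">H1</code> persistence summary from a TDA library such as ripser; here a simple radial-spread proxy stands in so the example runs with numpy alone, and all names are placeholders.</p>

```python
import numpy as np

def robustness_ratio(X, summary, n_sub=20, frac=0.5, seed=0):
    """Average a topological summary over random subsamples of X, divided by
    the same summary on matched nulls (each coordinate shuffled independently,
    which preserves marginals but destroys joint structure)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    real, null = [], []
    for _ in range(n_sub):
        sub = X[rng.choice(n, size=int(frac * n), replace=False)]
        real.append(summary(sub))
        shuffled = np.column_stack([rng.permutation(col) for col in sub.T])
        null.append(summary(shuffled))
    return float(np.mean(real) / np.mean(null))

# Toy stand-in for an H1 persistence summary: points on a loop sit at a
# near-constant distance from their centroid, shuffled nulls do not.
def radial_concentration(P):
    r = np.linalg.norm(P - P.mean(axis=0), axis=1)
    return 1.0 / (1.0 + r.std())

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
ratio = robustness_ratio(circle, radial_concentration)  # > 1 for loop-like data
```

<p>A ratio well above <code class="language-plaintext highlighter-rouge">1</code> under many subsamples is the kind of evidence the audit looks for; a ratio near <code class="language-plaintext highlighter-rouge">1</code> means the apparent structure does not beat its matched null.</p>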

<hr />

<h3 id="the-main-result">The Main Result</h3>

<p>The shortest honest summary is:</p>

<ul>
  <li>POM really does show reproducible <code class="language-plaintext highlighter-rouge">H1</code> signal</li>
  <li>that signal survives repeated subsampling, matched nulls, dataset changes, and checkpoint variation</li>
  <li>but it is <strong>not unique to POM</strong></li>
</ul>

<p>That last point matters. The strongest Morgan bit fingerprint baselines were at least as strong as, and often stronger than, POM on the raw robustness ratios used here.</p>

<p>So this is <strong>not</strong> a “we found hidden loops in odor space and POM wins” paper.</p>

<p>It is a much narrower and, I think, more useful result:</p>

<blockquote>
  <p>topological signal in learned odor representations is real enough to survive serious scrutiny, but current evidence does not justify treating that topology as uniquely favorable to POM.</p>
</blockquote>

<hr />

<h3 id="why-i-still-think-the-result-is-interesting">Why I Still Think The Result Is Interesting</h3>

<p>The part I find most interesting is not the trivial leaderboard question of which space gets the biggest persistence ratio.</p>

<p>The interesting part is the compression mismatch.</p>

<p>POM here is a <code class="language-plaintext highlighter-rouge">256</code>-dimensional dense learned embedding. The strongest chemical baselines are <code class="language-plaintext highlighter-rouge">2048</code>-bit sparse combinatorial encodings. Those are not the same kind of objects.</p>

<p>So we have:</p>
<ul>
  <li>sparse fingerprints preserve lots of raw combinatorial chemical detail, which can naturally produce strong persistent structure</li>
  <li>POM is much more compressed, so the fact that it still retains robust <code class="language-plaintext highlighter-rouge">H1</code> at all is nontrivial</li>
  <li>but that still does <strong>not</strong> prove that POM preserves <em>better</em> odor-relevant structure</li>
</ul>

<p>That distinction is basically the spine of the paper now.</p>

<p>If I push this line further, the next version would likely compare POM against <strong>matched-bottleneck chemical baselines</strong> rather than only against much larger sparse spaces. PCA-compressed fingerprints, random projections, or autoencoder-compressed baselines would make the fairness question much cleaner.</p>
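<p>A minimal version of one such matched-bottleneck baseline, assuming nothing beyond numpy and using random toy data in place of real fingerprints: PCA-compress <code class="language-plaintext highlighter-rouge">2048</code>-bit fingerprints down to the same <code class="language-plaintext highlighter-rouge">256</code> dimensions as the learned embedding before running the same topology pipeline.</p>

```python
import numpy as np

def pca_compress(F, k=256):
    """Project centered fingerprint rows onto their top-k principal axes,
    giving a dense k-dim baseline with the same bottleneck as the embedding."""
    Fc = F - F.mean(axis=0)
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
    return Fc @ Vt[:k].T

rng = np.random.default_rng(0)
fingerprints = (rng.random((500, 2048)) < 0.05).astype(float)  # toy sparse bits
Z = pca_compress(fingerprints, k=256)  # dense (500, 256) baseline
```

<p>Random projections would be an even cheaper control: they need no fitting and still give a fixed-width dense space to compare against.</p>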

<hr />

<h3 id="utility-beyond-descriptive-topology">Utility: Beyond descriptive topology?</h3>

<p>I also did a fairly narrow first utility pass, because I did not want the whole project to stop at “look, persistence diagrams.”</p>

<p>The utility question was:</p>

<blockquote>
  <p>do topology-derived local features explain neighborhood-level odor-label structure better than local geometry alone?</p>
</blockquote>

<p>The answer was a cautious yes, though in a limited and target-dependent way.</p>

<p>Local topology sometimes improved prediction of neighborhood ambiguity / entropy-style targets beyond local geometry alone. But again, the gains were not unique to POM: some ordinary chemistry baselines also seemed to benefit.</p>
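<p>The kind of target meant here can be made concrete. A numpy-only sketch of a neighborhood-entropy target, with all names illustrative rather than taken from the paper's code:</p>

```python
import numpy as np

def neighborhood_entropy(X, labels, k=10):
    """Per-point Shannon entropy of the label distribution among the k
    nearest neighbors: high entropy = an ambiguous odor neighborhood."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)              # exclude self from neighbors
    nn = np.argsort(D, axis=1)[:, :k]
    out = np.empty(len(X))
    for i, row in enumerate(nn):
        _, counts = np.unique(labels[row], return_counts=True)
        p = counts / counts.sum()
        out[i] = float(-(p * np.log(p)).sum())
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))              # toy embedding
labels = rng.integers(0, 5, size=100)      # toy odor labels
H = neighborhood_entropy(X, labels, k=10)
```

<p>The utility question is then whether local topological features predict <code class="language-plaintext highlighter-rouge">H</code> better than local geometric features (density, neighbor distances) alone.</p>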

<p>So the utility result is quite modest:</p>

<ul>
  <li>topology can add explanatory value</li>
  <li>the gains are conditional and not currently enough to claim that topology is a uniquely powerful odor-modeling advantage</li>
</ul>

<h3 id="clear-s">Clear Non-Claims</h3>

<p>I want to be very explicit about the non-claims here.</p>

<p>I am <strong>not</strong> claiming:</p>

<ul>
  <li>that this reveals the true topology of human odor perception</li>
  <li>that persistent <code class="language-plaintext highlighter-rouge">H1</code> implies clean interpretable circular perceptual dimensions</li>
  <li>that current learned odor embeddings uniquely beat strong chemical baselines</li>
  <li>that topology has already been shown to confer large practical gains for odor design</li>
</ul>

<p>The paper is strongest as a representation audit.</p>

<p>That is still worth something. In a lot of modern embedding work, people jump from “the space predicts well” to “the geometry must mean something deep.” This project is an attempt to put that move under pressure in a domain where the representations are interesting, but the claims can get fuzzy very quickly.</p>

<hr />

<h3 id="why">Why?</h3>
<p>I think there is real value in results that sharpen what a representation does and does not justify.</p>

<p>The cleanest positive claim from this line is mostly:</p>

<ul>
  <li>“compressed learned odor spaces can preserve real topological structure, but robust topology is not automatically a learned-embedding-only phenomenon”</li>
</ul>

<p>That is a stricter and less glamorous claim, but it is the one I can currently defend.</p>

<p>If you want the compact artifact version, the paper is here again:</p>

<ul>
  <li><a href="/obiohagwu/images/odor-topology/robust-but-not-unique-odor-topology.pdf">Topological signal in learned odor embeddings (PDF)</a></li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[Got nerdsniped recently.]]></summary></entry><entry><title type="html">Hmm1</title><link href="https://obiohagwu.github.io/obiohagwu/2026/03/28/Hmm1.html" rel="alternate" type="text/html" title="Hmm1" /><published>2026-03-28T00:00:00+00:00</published><updated>2026-03-28T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/03/28/Hmm1</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/03/28/Hmm1.html"><![CDATA[<p>I’m not entirely sure. Wouldn’t you also want to?</p>

<p>There’s nothing else to all this really, friends.</p>

<ul>
  <li>More personal Diary entry post.</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[I’m not entirely sure. Wouldn’t you also want to?]]></summary></entry><entry><title type="html">Completed 12k Run: Early Speech Emergence in a 21.9M OLMo-Hybrid Speech LM</title><link href="https://obiohagwu.github.io/obiohagwu/2026/03/14/Completed-12k-Run-Early-Speech-Emergence-in-a-21.9M-OLMo-Hybrid-Speech-LM.html" rel="alternate" type="text/html" title="Completed 12k Run: Early Speech Emergence in a 21.9M OLMo-Hybrid Speech LM" /><published>2026-03-14T00:00:00+00:00</published><updated>2026-03-14T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/03/14/Completed%2012k%20Run%20Early%20Speech%20Emergence%20in%20a%2021.9M%20OLMo-Hybrid%20Speech%20LM</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/03/14/Completed-12k-Run-Early-Speech-Emergence-in-a-21.9M-OLMo-Hybrid-Speech-LM.html"><![CDATA[<p>This is the follow-up to my earlier <code class="language-plaintext highlighter-rouge">1800</code>-step pilot. I resumed the same <code class="language-plaintext highlighter-rouge">21.9M</code> OLMo-hybrid speech codec LM on A100 and let it run all the way to <code class="language-plaintext highlighter-rouge">12000</code> steps.</p>

<p>The short version is that the run finished cleanly, validation kept improving the whole time, and the final samples are noticeably stronger than the earlier <code class="language-plaintext highlighter-rouge">15%</code>-budget checkpoint.</p>

<p>Final result:</p>

<ul>
  <li>best checkpoint: <code class="language-plaintext highlighter-rouge">step 12000</code></li>
  <li>EMA validation loss: <code class="language-plaintext highlighter-rouge">3.8207</code></li>
  <li>perplexity: <code class="language-plaintext highlighter-rouge">45.63</code></li>
  <li>dataset: <code class="language-plaintext highlighter-rouge">LJ Speech</code></li>
  <li>tokenizer: <code class="language-plaintext highlighter-rouge">EnCodec 24 kHz</code>, <code class="language-plaintext highlighter-rouge">8</code> codebooks</li>
  <li>hardware: <code class="language-plaintext highlighter-rouge">A100-SXM4-80GB</code></li>
</ul>

<p>Useful links:</p>

<ul>
  <li><a href="/obiohagwu/images/olmo-hybrid-speech/final_report.pdf">full technical report (PDF)</a></li>
  <li><a href="https://github.com/Obiohagwu/olmo-hybrid-speech">code</a></li>
  <li><a href="/obiohagwu/images/olmo-hybrid-speech/a100_eval_history_full.csv">full eval history CSV</a></li>
</ul>

<hr />

<h3 id="setup">Setup</h3>

<p>This is still an unconditional codec language model, not a text-conditioned TTS system yet.</p>

<table>
  <thead>
    <tr>
      <th>Item</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Model</td>
      <td>OLMo-hybrid speech LM</td>
    </tr>
    <tr>
      <td>Parameters</td>
      <td><code class="language-plaintext highlighter-rouge">21.9M</code></td>
    </tr>
    <tr>
      <td>Backbone</td>
      <td><code class="language-plaintext highlighter-rouge">8</code> layers = <code class="language-plaintext highlighter-rouge">6</code> Gated DeltaNet blocks + <code class="language-plaintext highlighter-rouge">2</code> attention blocks</td>
    </tr>
    <tr>
      <td>Width</td>
      <td><code class="language-plaintext highlighter-rouge">d_model=384</code>, <code class="language-plaintext highlighter-rouge">d_ff=1024</code></td>
    </tr>
    <tr>
      <td>Attention</td>
      <td><code class="language-plaintext highlighter-rouge">6</code> heads, <code class="language-plaintext highlighter-rouge">2</code> KV heads</td>
    </tr>
    <tr>
      <td>Hybrid schedule</td>
      <td>attention every <code class="language-plaintext highlighter-rouge">4th</code> block, final block forced to attention</td>
    </tr>
    <tr>
      <td>Data</td>
      <td><code class="language-plaintext highlighter-rouge">LJ Speech</code></td>
    </tr>
    <tr>
      <td>Tokenizer</td>
      <td><code class="language-plaintext highlighter-rouge">EnCodec 24 kHz</code>, <code class="language-plaintext highlighter-rouge">8</code> codebooks, vocab <code class="language-plaintext highlighter-rouge">1027</code></td>
    </tr>
    <tr>
      <td>Chunking</td>
      <td><code class="language-plaintext highlighter-rouge">8s</code> chunks</td>
    </tr>
    <tr>
      <td>Split</td>
      <td><code class="language-plaintext highlighter-rouge">12,624</code> train / <code class="language-plaintext highlighter-rouge">666</code> val</td>
    </tr>
    <tr>
      <td>Context</td>
      <td><code class="language-plaintext highlighter-rouge">1024</code> delayed steps</td>
    </tr>
    <tr>
      <td>Hardware</td>
      <td><code class="language-plaintext highlighter-rouge">A100-SXM4-80GB</code></td>
    </tr>
    <tr>
      <td>Runtime</td>
      <td><code class="language-plaintext highlighter-rouge">bf16</code>, fused AdamW, CUDA SDPA flash path on, fused FLA GDN path off</td>
    </tr>
    <tr>
      <td>Stable batch</td>
      <td>true batch <code class="language-plaintext highlighter-rouge">24</code>, grad accum <code class="language-plaintext highlighter-rouge">1</code></td>
    </tr>
    <tr>
      <td>Throughput</td>
      <td>about <code class="language-plaintext highlighter-rouge">18k tok/s</code></td>
    </tr>
  </tbody>
</table>

<p>One important systems caveat: this successful run still used the plain PyTorch recurrent fallback for the Gated DeltaNet blocks. The intended fused FLA recurrent kernel was unstable on the available pod stack, so this is a clean modeling result more than a clean kernel-stack result.</p>
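<p>For concreteness, the layer layout from the table above ("attention every 4th block, final block forced to attention") can be sketched as a small helper; the block-kind names are illustrative, not the repo's actual identifiers.</p>

```python
def layer_schedule(n_layers=8, attn_every=4):
    """8 layers = 6 Gated-DeltaNet-style blocks + 2 attention blocks:
    attention at every attn_every-th position, final block forced to attention."""
    kinds = ["attn" if (i + 1) % attn_every == 0 else "gdn" for i in range(n_layers)]
    kinds[-1] = "attn"  # final block forced to attention
    return kinds

schedule = layer_schedule()
# ['gdn', 'gdn', 'gdn', 'attn', 'gdn', 'gdn', 'gdn', 'attn']
```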

<hr />

<h3 id="validation-progression">Validation Progression</h3>

<p>The important part is that the run never really rolled over. It just kept getting better more slowly.</p>

<table>
  <thead>
    <tr>
      <th>Step</th>
      <th style="text-align: right">Train Loss</th>
      <th style="text-align: right">EMA Val Loss</th>
      <th style="text-align: right">PPL</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>200</td>
      <td style="text-align: right">5.0049</td>
      <td style="text-align: right">6.7878</td>
      <td style="text-align: right">886.99</td>
    </tr>
    <tr>
      <td>1000</td>
      <td style="text-align: right">4.1367</td>
      <td style="text-align: right">4.8510</td>
      <td style="text-align: right">127.87</td>
    </tr>
    <tr>
      <td>1800</td>
      <td style="text-align: right">4.2224</td>
      <td style="text-align: right">4.2847</td>
      <td style="text-align: right">72.58</td>
    </tr>
    <tr>
      <td>3200</td>
      <td style="text-align: right">4.2821</td>
      <td style="text-align: right">4.0500</td>
      <td style="text-align: right">57.40</td>
    </tr>
    <tr>
      <td>5200</td>
      <td style="text-align: right">3.7050</td>
      <td style="text-align: right">3.9194</td>
      <td style="text-align: right">50.37</td>
    </tr>
    <tr>
      <td>7400</td>
      <td style="text-align: right">3.8102</td>
      <td style="text-align: right">3.8551</td>
      <td style="text-align: right">47.23</td>
    </tr>
    <tr>
      <td>10000</td>
      <td style="text-align: right">3.8445</td>
      <td style="text-align: right">3.8267</td>
      <td style="text-align: right">45.91</td>
    </tr>
    <tr>
      <td>12000</td>
      <td style="text-align: right">3.7043</td>
      <td style="text-align: right"><strong>3.8207</strong></td>
      <td style="text-align: right"><strong>45.63</strong></td>
    </tr>
  </tbody>
</table>

<p>So the original <code class="language-plaintext highlighter-rouge">1800</code>-step pilot was real, but it was not the end of the useful training regime.</p>
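<p>As a sanity check on the table, the PPL column is just the exponential of the EMA validation loss (token-level cross-entropy in nats):</p>

```python
import math

# perplexity = exp(cross-entropy loss in nats)
losses = {200: 6.7878, 1800: 4.2847, 12000: 3.8207}
ppl = {step: math.exp(loss) for step, loss in losses.items()}
```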

<hr />

<h3 id="sample-progression">Sample Progression</h3>

<p>What I cared about most was whether this architecture could stay in a clearly speech-like regime and get cleaner with more training. It did.</p>

<p>Earlier pilot checkpoint: <code class="language-plaintext highlighter-rouge">step 1800</code></p>

<audio controls="" preload="none" src="/obiohagwu/images/olmo-hybrid-speech/step1800_sample1.wav"></audio>

<p>Mid-run checkpoint: <code class="language-plaintext highlighter-rouge">step 7400</code></p>

<audio controls="" preload="none" src="/obiohagwu/images/olmo-hybrid-speech/step7400_sample0.wav"></audio>

<p>Final-best checkpoint: <code class="language-plaintext highlighter-rouge">step 12000</code></p>

<p>Sample 1</p>

<audio controls="" preload="none" src="/obiohagwu/images/olmo-hybrid-speech/step12000_sample0.wav"></audio>

<p>Sample 2</p>

<audio controls="" preload="none" src="/obiohagwu/images/olmo-hybrid-speech/step12000_sample1.wav"></audio>

<p>Sample 3</p>

<audio controls="" preload="none" src="/obiohagwu/images/olmo-hybrid-speech/step12000_sample2.wav"></audio>

<p>These are still babbly and not semantically grounded. But they are much less in the “barely holding together” regime than the early pilot, and that matters.</p>

<hr />

<h3 id="what-i-think-this-says">What I Think This Says</h3>

<p>The narrow claim I am comfortable making is:</p>

<blockquote>
  <p>a small OLMo-hybrid / Gated-DeltaNet-style speech codec LM can learn enough local speech structure on a clean single-speaker corpus to produce stable, clearly voice-like audio, and it keeps improving well past the first emergence point.</p>
</blockquote>

<p>What I am <strong>not</strong> claiming:</p>

<ul>
  <li>that this is a finished TTS system</li>
  <li>that it beats a matched transformer baseline</li>
  <li>that the current no-FLA run reflects the architecture’s ideal efficiency</li>
  <li>that these samples are semantically meaningful speech</li>
</ul>

<p>This is still an architecture viability result. But it is a real one.</p>

<hr />

<h3 id="next-step">Next Step</h3>

<p>This is enough to justify moving into text conditioning.</p>

<p>The next question is whether I can inject text cleanly enough to get controllable reading rather than only speech-like babble.</p>

<p>So the plan from here is:</p>

<ul>
  <li>keep the current audio decoder</li>
  <li>add a small text encoder</li>
  <li>add cross-attention from selected decoder blocks into text states</li>
  <li>start with sentence-level LJ Speech transcripts before worrying about anything longer-form</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[This is the follow-up to my earlier 1800-step pilot. I resumed the same 21.9M OLMo-hybrid speech codec LM on A100 and let it run all the way to 12000 steps.]]></summary></entry><entry><title type="html">A 21.9M hybrid recurrent attention to learn speech on 1.8k steps</title><link href="https://obiohagwu.github.io/obiohagwu/2026/03/10/A-21.9M-hybrid-recurrent-attention-to-learn-speech-on-1.8k-steps.html" rel="alternate" type="text/html" title="A 21.9M hybrid recurrent attention to learn speech on 1.8k steps" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/03/10/A%2021.9M%20hybrid%20recurrent%20attention%20to%20learn%20speech%20on%201.8k%20steps</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/03/10/A-21.9M-hybrid-recurrent-attention-to-learn-speech-on-1.8k-steps.html"><![CDATA[<p>Ok, so I might have had a light W with this little speech experiment. Yes, Olmo type recurrent attention hybrids are pretty good.</p>

<p>I wanted a cheap answer to a simple question: can a tiny OLMo-hybrid style recurrent-attention language model learn enough codec structure to produce <strong>voice-like audio instead of static</strong>?</p>

<p>Seems like it did, haha.</p>

<p>By about step <code class="language-plaintext highlighter-rouge">1800</code>, the <code class="language-plaintext highlighter-rouge">21.9M</code> parameter OLMo-hybrid trained on <code class="language-plaintext highlighter-rouge">LJ Speech</code> with <code class="language-plaintext highlighter-rouge">EnCodec 24 kHz</code> tokens was already producing samples with a clear speaking voice. The outputs were still mildly babbly and not semantically coherent, but they were clearly past the “hiss / drone / decoder garbage” failure mode.</p>

<p>That happened at only about <code class="language-plaintext highlighter-rouge">15%</code> of the originally planned <code class="language-plaintext highlighter-rouge">12k</code> training budget.</p>

<hr />

<h3 id="setup">Setup</h3>

<table>
  <thead>
    <tr>
      <th>Item</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Model</td>
      <td>OLMo-hybrid speech LM</td>
    </tr>
    <tr>
      <td>Parameters</td>
      <td><code class="language-plaintext highlighter-rouge">21.9M</code></td>
    </tr>
    <tr>
      <td>Backbone</td>
      <td><code class="language-plaintext highlighter-rouge">8</code> layers = <code class="language-plaintext highlighter-rouge">6</code> Gated DeltaNet blocks + <code class="language-plaintext highlighter-rouge">2</code> attention blocks</td>
    </tr>
    <tr>
      <td>Width</td>
      <td><code class="language-plaintext highlighter-rouge">d_model=384</code>, <code class="language-plaintext highlighter-rouge">d_ff=1024</code></td>
    </tr>
    <tr>
      <td>Attention</td>
      <td><code class="language-plaintext highlighter-rouge">6</code> heads, <code class="language-plaintext highlighter-rouge">2</code> KV heads</td>
    </tr>
    <tr>
      <td>Hybrid schedule</td>
      <td>attention every <code class="language-plaintext highlighter-rouge">4th</code> block</td>
    </tr>
    <tr>
      <td>Data</td>
      <td><code class="language-plaintext highlighter-rouge">LJ Speech</code></td>
    </tr>
    <tr>
      <td>Tokenizer</td>
      <td><code class="language-plaintext highlighter-rouge">EnCodec 24 kHz</code>, <code class="language-plaintext highlighter-rouge">8</code> codebooks, vocab <code class="language-plaintext highlighter-rouge">1027</code></td>
    </tr>
    <tr>
      <td>Chunking</td>
      <td><code class="language-plaintext highlighter-rouge">8s</code> chunks</td>
    </tr>
    <tr>
      <td>Split</td>
      <td><code class="language-plaintext highlighter-rouge">12,624</code> train / <code class="language-plaintext highlighter-rouge">666</code> val</td>
    </tr>
    <tr>
      <td>Context</td>
      <td><code class="language-plaintext highlighter-rouge">1024</code> delayed steps</td>
    </tr>
    <tr>
      <td>Hardware</td>
      <td><code class="language-plaintext highlighter-rouge">A100-SXM4-80GB</code></td>
    </tr>
    <tr>
      <td>Runtime</td>
      <td><code class="language-plaintext highlighter-rouge">bf16</code>, fused AdamW, CUDA SDPA flash path on, fused FLA GDN path off</td>
    </tr>
    <tr>
      <td>Stable batch</td>
      <td>true batch <code class="language-plaintext highlighter-rouge">24</code>, grad accum <code class="language-plaintext highlighter-rouge">1</code></td>
    </tr>
    <tr>
      <td>Throughput</td>
      <td>about <code class="language-plaintext highlighter-rouge">17.9k tok/s</code></td>
    </tr>
  </tbody>
</table>

<p>One important note: this run did <strong>not</strong> use the intended fused Flash Linear Attention Gated DeltaNet kernel. The Triton backward kernel was unstable on my runpod stack and kept causing crashes, so I had the recurrent core fall back to the plain PyTorch scan implementation. That means the run was much slower than ideal for an A100, which is arguably an upside: with a working fused kernel, the same result should arrive even faster.</p>
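<p>For readers unfamiliar with what that fallback computes, here is a numpy sketch of one common gated delta rule formulation (state decay <code class="language-plaintext highlighter-rouge">alpha</code>, write strength <code class="language-plaintext highlighter-rouge">beta</code>). This illustrates the sequential scan the fused Triton kernel replaces; the exact FLA parameterization may differ.</p>

```python
import numpy as np

def gated_delta_scan(q, k, v, alpha, beta):
    """Sequential gated delta rule over T steps:
       S_t = alpha_t * S_{t-1} @ (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
       o_t = S_t @ q_t
    A fused kernel computes the same recurrence in chunked parallel form."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))
    I = np.eye(d_k)
    out = np.zeros((T, d_v))
    for t in range(T):
        S = alpha[t] * S @ (I - beta[t] * np.outer(k[t], k[t])) \
            + beta[t] * np.outer(v[t], k[t])
        out[t] = S @ q[t]
    return out

rng = np.random.default_rng(0)
T, d = 16, 8
k = rng.normal(size=(T, d))
k /= np.linalg.norm(k, axis=1, keepdims=True)  # unit keys keep the update contractive
o = gated_delta_scan(rng.normal(size=(T, d)), k, rng.normal(size=(T, d)),
                     rng.uniform(0.9, 1.0, T), rng.uniform(0.0, 0.5, T))
```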

<hr />

<h3 id="validation-trajectory">Validation Trajectory</h3>

<p>The A100 run also improved monotonically through every saved eval up to step <code class="language-plaintext highlighter-rouge">1800</code>.</p>

<table>
  <thead>
    <tr>
      <th>Step</th>
      <th style="text-align: right">EMA Val Loss</th>
      <th style="text-align: right">PPL</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>200</td>
      <td style="text-align: right">6.7878</td>
      <td style="text-align: right">886.99</td>
    </tr>
    <tr>
      <td>400</td>
      <td style="text-align: right">6.1530</td>
      <td style="text-align: right">470.13</td>
    </tr>
    <tr>
      <td>600</td>
      <td style="text-align: right">5.6474</td>
      <td style="text-align: right">283.54</td>
    </tr>
    <tr>
      <td>800</td>
      <td style="text-align: right">5.1973</td>
      <td style="text-align: right">180.78</td>
    </tr>
    <tr>
      <td>1000</td>
      <td style="text-align: right">4.8510</td>
      <td style="text-align: right">127.87</td>
    </tr>
    <tr>
      <td>1200</td>
      <td style="text-align: right">4.6169</td>
      <td style="text-align: right">101.18</td>
    </tr>
    <tr>
      <td>1400</td>
      <td style="text-align: right">4.4626</td>
      <td style="text-align: right">86.72</td>
    </tr>
    <tr>
      <td>1600</td>
      <td style="text-align: right">4.3585</td>
      <td style="text-align: right">78.14</td>
    </tr>
    <tr>
      <td>1800</td>
      <td style="text-align: right">4.2847</td>
      <td style="text-align: right">72.58</td>
    </tr>
  </tbody>
</table>

<p>This matters because my earlier M4 Mini run had already shown the model could get to “speech-like babble,” but it plateaued and got operationally ugly. The A100 run kept moving cleanly.</p>

<hr />

<h3 id="what-the-samples-sound-like">What The Samples Sound Like</h3>

<p>Qualitatively:</p>

<ul>
  <li>there is clearly a human voice present</li>
  <li>the outputs are not static</li>
  <li>they are not tonal droning failures</li>
  <li>they are still largely babble rather than intelligible text</li>
  <li>later checkpoints sound more stable than earlier ones, but sample quality variance is still real</li>
</ul>

<p>That is exactly the threshold I cared about for this pilot. I was not trying to get polished TTS from a <code class="language-plaintext highlighter-rouge">21.9M</code> unconditioned model; I wanted evidence that the hybrid architecture could learn speech codec structure at all.</p>

<p>It did.</p>

<p>Local sample bundles were saved for:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">step 1200</code></li>
  <li><code class="language-plaintext highlighter-rouge">step 1400</code></li>
  <li><code class="language-plaintext highlighter-rouge">step 1600</code></li>
  <li><code class="language-plaintext highlighter-rouge">step 1800</code></li>
</ul>

<p>and the best saved checkpoint by the end of the session was <code class="language-plaintext highlighter-rouge">step 1800</code>.</p>

<hr />

<h3 id="what-failed">What Failed</h3>

<p>The systems story was much worse than the learning story.</p>

<p>Main problems:</p>

<ul>
  <li>the fused FLA Gated DeltaNet kernel did not train cleanly on the pod software stack</li>
  <li>cheap pod infrastructure was volatile and restarted at bad times</li>
  <li>pod-local storage could not be trusted</li>
  <li>the run likely died right as <code class="language-plaintext highlighter-rouge">step 2000</code> was being saved</li>
</ul>

<p>That said, I did manage to pull the important checkpoints locally before the pod disappeared:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">1200</code></li>
  <li><code class="language-plaintext highlighter-rouge">1400</code></li>
  <li><code class="language-plaintext highlighter-rouge">1600</code></li>
  <li><code class="language-plaintext highlighter-rouge">1800</code></li>
  <li>latest <code class="language-plaintext highlighter-rouge">best.pt</code></li>
</ul>

<p>So the final result survived even though the pod did not.</p>

<hr />

<h3 id="significance">Significance?</h3>

<p>I believe the primary lesson is:</p>

<blockquote>
  <p>a tiny OLMo-hybrid style recurrent-attention LM can learn enough speech codec structure on a clean single-speaker corpus to produce clear voice-like audio surprisingly early in training.</p>
</blockquote>

<p>Even without:</p>

<ul>
  <li>a fused recurrent kernel</li>
  <li>a long training budget</li>
  <li>text conditioning</li>
  <li>semantic token hierarchies</li>
  <li>a polished speech decoder stack</li>
</ul>

<p>So my next step is fairly obvious: <strong>text conditioning</strong>. Once the model is already making convincing voice-like audio, the high-value problem shifts from unconditioned sampling to actual steering.</p>

<hr />

<h3 id="next-steps">Next Steps</h3>

<ul>
  <li>use the <code class="language-plaintext highlighter-rouge">1800</code> checkpoint as the main saved pilot result</li>
  <li>compare <code class="language-plaintext highlighter-rouge">1400 / 1600 / 1800</code> sample quality side-by-side</li>
  <li>add a matched transformer baseline later on real GPU access</li>
  <li>move to text-conditioned speech generation next</li>
  <li>revisit the fused GDN kernel on a more stable CUDA/Triton stack</li>
</ul>

<p>This run was messy to execute, but the core result was clean: the model learned real speech structure.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Ok, so I might have had a light W with this little speech experiment. Yes, Olmo type recurrent attention hybrids are pretty good.]]></summary></entry><entry><title type="html">OLMo Hybrid for Autoregressive Music Token Modeling: A 300M Single-A100 Baseline</title><link href="https://obiohagwu.github.io/obiohagwu/2026/03/06/OLMo-Hybrid-for-Autoregressive-Music-Token-Modeling-A-300M-Single-A100-Baseline.html" rel="alternate" type="text/html" title="OLMo Hybrid for Autoregressive Music Token Modeling: A 300M Single-A100 Baseline" /><published>2026-03-06T00:00:00+00:00</published><updated>2026-03-06T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/03/06/OLMo%20Hybrid%20for%20Autoregressive%20Music%20Token%20Modeling:%20A%20300M%20Single-A100%20Baseline</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/03/06/OLMo-Hybrid-for-Autoregressive-Music-Token-Modeling-A-300M-Single-A100-Baseline.html"><![CDATA[<p>Something really cool happened yesterday.</p>

<p>It’s really convenient that while I was working on alternate subquadratic architectures for music generation, OLMo hybrid just got dropped.</p>

<p>The funny, weird, cool thing is that we seem to have had similar (maybe obvious?) intuitions about the required architectural tweaks, haha, although they are much more competent and serious about their research and implementation.</p>

<p>But yeah, their OLMo 7B hybrid model is not necessarily for music, but the subquadratic scaling of the architecture it uses is really quite promising as a decoder for my DAC tokens.</p>

<p>I’m really excited to break ground on this.</p>

<p>Here is a more thorough view of my intentions:</p>

<p>Long-context music generation lives in an awkward middle ground for sequence models.</p>

<p>On one end, full attention is dependable and expressive, but expensive. On the other, purely recurrent models are efficient, but they can blur local detail or struggle to preserve the exact interactions that matter for music. For autoregressive music token modeling, that tradeoff matters a lot: music depends on both fine local structure and longer-range organization.</p>

<p>This post describes a new baseline I added to my existing music language modeling codebase: an <strong>OLMo-hybrid-style decoder for DAC-token music generation</strong>. The model uses a repeating <strong>3:1 layer schedule</strong> (three DeltaNet-style recurrent blocks followed by one full attention block) and combines that with <strong>RoPE</strong>, <strong>per-head QK RMS normalization</strong>, <strong>SwiGLU feed-forwards</strong>, <strong>grouped-query attention</strong>, and <strong>PyTorch scaled dot-product attention</strong> so it can take advantage of FlashAttention on A100 hardware.</p>

<p>The instantiated model lands at <strong>294.3M parameters</strong> and is designed as a <strong>single-A100 baseline</strong> for unconditional music modeling on <strong>44.1 kHz DAC tokens</strong>.</p>

<p>This is not a results post, lol. I’m posting this more as a methodology note: the goal is to define the architecture, training configuration, and evaluation plan for a new baseline in the repository.</p>

<h2 id="why-this-architecture-may-be-really-useful-for-music">Why this architecture may be really useful for music</h2>

<p>Autoregressive music token modeling needs to preserve local acoustic structure while also carrying information across long spans.</p>

<p>With DAC tokens, a 24-second clip turns into roughly <strong>2,000 autoregressive steps</strong> once you apply a MusicGen-style delay pattern over <strong>9 RVQ codebooks</strong>. That is long enough for quadratic attention to become costly, but not so long that abandoning exact attention entirely feels justified (PAIN!).</p>
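<p>As a quick sanity check on that step count (pure arithmetic, using the frame rate and codebook count quoted in this post):</p>

```python
# Back-of-envelope check on the "~2,000 autoregressive steps" figure,
# using the numbers from this post: ~86 DAC frames/s, 9 RVQ codebooks,
# and a MusicGen-style delay pattern (codebook k shifted by k steps).
frame_rate = 86
n_codebooks = 9
clip_seconds = 24

frames = clip_seconds * frame_rate        # base DAC frames
delayed = frames + (n_codebooks - 1)      # delay adds k_max = 8 extra steps
print(frames, delayed)  # 2064 2072, right around the 2048-step training cap
```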

<p>That makes hybrid architectures a lot more appealing from an efficiency point of view as we’ve already covered in prior experiments.</p>

<p>Recent OLMo Hybrid models suggest a useful compromise: let recurrent sequence layers do most of the work, but insert periodic full-attention layers so the model can refresh exact token-to-token interactions. For music, that seems especially natural. Musical sequences contain local events—chord color, rhythmic edges, timbral transitions—but also slower-moving structure such as motif reuse, phrasing, and section-level form.</p>

<p>The goal here is simple: define a practical <strong>300M-class hybrid baseline</strong> that can be trained on a single A100 and compared against the transformer, Mamba, hybrid, and memory-caching baselines already present in the codebase.</p>

<h2 id="tokenization">Tokenization</h2>

<p>Audio is represented using <strong>Descript Audio Codec (DAC)</strong> at <strong>44.1 kHz</strong>.</p>

<p>The codec emits <strong>9 RVQ codebooks</strong> at roughly <strong>86 frames per second</strong>, with a <strong>codebook size of 1024</strong>. As in MusicGen, the model uses a <strong>delay pattern</strong> that offsets codebook <em>k</em> by <em>k</em> steps, allowing all codebooks to be predicted in parallel within each autoregressive step.</p>

<p>The final vocabulary size is <strong>1027</strong>, after adding <strong>pad</strong>, <strong>BOS</strong>, and <strong>EOS</strong> tokens. During training, sequences are truncated to a maximum length of <strong>2048 timesteps</strong>.</p>
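<p>For concreteness, here is a minimal sketch of what a MusicGen-style delay pattern does to a codes tensor. The <code>PAD</code> id and function shape are placeholders, not the repository’s actual implementation:</p>

```python
import torch

# Sketch of a MusicGen-style delay pattern over RVQ codebooks: codebook k
# is shifted right by k steps so all codebooks can be predicted in
# parallel at each autoregressive step. PAD is a hypothetical pad id.
PAD = 1024  # placeholder pad token id

def apply_delay(codes: torch.Tensor) -> torch.Tensor:
    """codes: (n_codebooks, T) -> (n_codebooks, T + n_codebooks - 1)."""
    K, T = codes.shape
    out = torch.full((K, T + K - 1), PAD, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]  # shift codebook k right by k steps
    return out

codes = torch.arange(9 * 4).reshape(9, 4)  # toy (9 codebooks, 4 frames)
delayed = apply_delay(codes)
print(delayed.shape)  # torch.Size([9, 12])
```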

<h2 id="the-model">The model</h2>

<p>The architecture added to the repository is called <code class="language-plaintext highlighter-rouge">olmo_hybrid</code>.</p>

<p>At a high level, it is an OLMo-hybrid-style decoder with the following design:</p>

<ul>
  <li>a repeating <strong>3:1 block schedule</strong>, with three recurrent DeltaNet-style blocks followed by one full attention block</li>
  <li><strong>RoPE</strong> in attention layers</li>
  <li><strong>per-head RMS normalization</strong> on queries and keys before attention</li>
  <li><strong>SwiGLU</strong> feed-forward layers</li>
  <li><strong>grouped-query attention (GQA)</strong> with separate query-head and KV-head counts</li>
  <li><strong>PyTorch scaled dot-product attention</strong>, enabling the FlashAttention fast path on A100s when masks are full-length</li>
  <li><strong>no learned positional embeddings</strong> on this path</li>
</ul>

<p>Each DeltaNet-style recurrent block uses a standard pre-norm residual structure:</p>

<ol>
  <li>RMS normalization on the input</li>
  <li>a recurrent mixer with learned forget and update gates</li>
  <li>a residual connection</li>
  <li>a second RMS normalization</li>
  <li>a SwiGLU feed-forward layer</li>
  <li>a second residual connection</li>
</ol>

<p>The attention blocks mirror that same structure, replacing the recurrent mixer with RoPE-based attention.</p>
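<p>The 3:1 schedule reduces to a simple layout rule. The exact layer indexing here is my assumption, not necessarily the repo’s:</p>

```python
# Sketch of the repeating 3:1 block schedule: with an attention period
# of 4, every 4th block is full attention and the rest are
# DeltaNet-style recurrent blocks.
def block_types(n_layers: int, attention_period: int) -> list[str]:
    return [
        "attention" if (i + 1) % attention_period == 0 else "recurrent"
        for i in range(n_layers)
    ]

schedule = block_types(n_layers=22, attention_period=4)
print(schedule.count("recurrent"), schedule.count("attention"))  # 17 5
```

With 22 layers and period 4, this reproduces the 17 recurrent / 5 attention split of the 300M preset.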

<h2 id="current-implementation-status">Current implementation status</h2>

<p>One important detail: the attention path is aligned with the public OLMo recipe in terms of architecture and kernel choice, but the recurrent path is <strong>not</strong> a verbatim import of fused OLMo-core kernels, at least for my use case at this point.</p>

<p>Instead, it uses a <strong>native PyTorch DeltaNet-style approximation</strong> implemented inside the local codebase.</p>

<p>That is on purpose. The immediate goal is to test the architecture at music scale first, before investing in lower-level kernel work.</p>

<h2 id="the-300m-baseline-configuration">The 300M baseline configuration</h2>

<p>The repository should now have a preset called <code class="language-plaintext highlighter-rouge">music_olmo_hybrid_300m_a100</code>.</p>

<p>Here is the model configuration:</p>

<ul>
  <li><strong>Model dimension:</strong> 1024</li>
  <li><strong>Layers:</strong> 22</li>
  <li><strong>Attention heads:</strong> 16</li>
  <li><strong>KV heads:</strong> 4</li>
  <li><strong>Feed-forward width:</strong> 2816</li>
  <li><strong>Dropout:</strong> 0.1</li>
  <li><strong>Max sequence length:</strong> 2048</li>
  <li><strong>Attention period:</strong> 4</li>
  <li><strong>Total parameters:</strong> 294,263,072</li>
</ul>

<p>With an attention period of 4, the 22-layer model contains <strong>17 recurrent blocks</strong> and <strong>5 full-attention blocks</strong>. The grouped-query setup reduces KV projection cost while preserving full query resolution.</p>

<h2 id="training-plan">Training plan</h2>

<p>The default preset is designed for <strong>mixed-precision training on a single A100 GPU</strong>.</p>

<p>Initial training configuration:</p>

<ul>
  <li><strong>Dataset:</strong> FMA-Large DAC tokens</li>
  <li><strong>Optimizer:</strong> AdamW</li>
  <li><strong>Learning rate:</strong> 2e-4</li>
  <li><strong>Warmup:</strong> 2000 steps</li>
  <li><strong>Max steps:</strong> 200,000</li>
  <li><strong>Per-device batch size:</strong> 4</li>
  <li><strong>Gradient accumulation:</strong> 8</li>
  <li><strong>Effective batch size:</strong> 32 sequences</li>
  <li><strong>Precision:</strong> bfloat16 autocast on CUDA</li>
  <li><strong>Evaluation interval:</strong> 2000 steps</li>
  <li><strong>Checkpoint interval:</strong> 5000 steps</li>
</ul>

<p>For a <strong>40GB A100</strong>, the repository should also now expose <code class="language-plaintext highlighter-rouge">--grad_accum_steps</code>, which makes it easier to shift toward smaller microbatches without changing the effective batch size too aggressively.</p>

<h3 id="why-this-is-a-useful-baseline-and-the-actual-crux">Why this is a useful baseline and the ACTUAL crux</h3>

<p>This model is meant to answer a fairly narrow but important question:</p>

<blockquote>
  <p>Can an OLMo-hybrid-style decoder preserve more musical structure than the current transformer and Mamba-family baselines at roughly the same parameter scale?</p>
</blockquote>

<p>I think this is a worthwhile music baseline for three reasons.</p>

<p>First, it avoids a false choice between exact attention and recurrent state. Music benefits from both.</p>

<p>Second, it is practical. A <strong>294M model that trains on one A100</strong> is much easier to iterate on than a multi-billion-parameter reproduction of the full public OLMo stack.</p>

<p>Third, it is easy to ablate. The <strong>3:1 schedule</strong>, <strong>RoPE setup</strong>, <strong>GQA configuration</strong>, and <strong>feed-forward width</strong> can all be varied cleanly in follow-up experiments.</p>

<h3 id="evaluation-plan">Evaluation plan</h3>

<p>The current evaluation setup is below.</p>

<p>Primary metrics include:</p>

<ul>
  <li>validation loss and perplexity</li>
  <li>codebook-wise token accuracy</li>
  <li>long-range coherence across sliding windows</li>
  <li>Fréchet Audio Distance on decoded generations… maybe?</li>
  <li>listening-based inspection of rhythmic stability, harmonic continuity, and phrase development</li>
</ul>

<p>The comparison set should include:</p>

<ul>
  <li>the existing transformer baseline</li>
  <li>pure Mamba baselines</li>
  <li>earlier hybrid models already in the repo</li>
  <li>memory-caching variants where relevant</li>
</ul>

<h3 id="limitations">Limitations</h3>

<p>This is more of an implementation and experiment-design note than a completed empirical claim. No training results are reported here.</p>

<p>There are also two technical limitations worth being explicit about.</p>

<p>First, the recurrent mixer is a <strong>native PyTorch approximation</strong>, not the fused <strong>OLMo-core Gated DeltaNet</strong> implementation (I will definitely use this in subsequent runs).</p>

<p>Second, FlashAttention acceleration only applies to the <strong>attention blocks</strong>. The recurrent blocks remain sequential by design.</p>

<p>If this baseline performs well, the natural next step is to move the recurrent side closer to the public runtime stack.</p>

<h3 id="thoughts">Thoughts?</h3>

<p>The repository now contains an OLMo-hybrid-style music model that is large enough to be meaningful, but still small enough to train on a single A100. That makes it a strong next baseline.</p>

<p>The contribution here is seemingly straightforward: we have a concrete architecture, a decent preset, and a clear evaluation plan. Whether it actually beats the existing baselines is now an experimental question rather than an implementation gap.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Something really cool happened yesterday.]]></summary></entry><entry><title type="html">Janky memory cache over mamba hybrid for music?</title><link href="https://obiohagwu.github.io/obiohagwu/2026/03/04/Janky-memory-cache-over-mamba-hybrid-for-music.html" rel="alternate" type="text/html" title="Janky memory cache over mamba hybrid for music?" /><published>2026-03-04T00:00:00+00:00</published><updated>2026-03-04T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/03/04/Janky%20memory%20cache%20over%20mamba%20hybrid%20for%20music</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/03/04/Janky-memory-cache-over-mamba-hybrid-for-music.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']],
    displayMath: [['$$', '$$'], ['\\[', '\\]']]
  }
};
</script>

<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" async=""></script>

<style>
html, body {
  overflow-x: hidden;
  max-width: 100%;
}
table {
  display: block;
  overflow-x: auto;
  white-space: nowrap;
  max-width: 100%;
  font-size: 0.9em;
}
.post-content {
  overflow-x: hidden;
}
</style>

<p>So I’ve been attending house-type dance events more frequently. Not to make this too romantic, but I really enjoy music, and dancing, and like, house really is the perfect manifestation of that haha.</p>

<p>Anyway, I decided to try a janky re-implementation of this new memory-caching fix over RNN or linear-attention-type operations.</p>

<p><strong>Now to be more specific:</strong>
State space models are a cool, more efficient class of sequence modelers. They are pretty good at global coherence, and unlike self-attention with its quadratic burden, state space methods like Mamba are subquadratic: linear in the input sequence length.
Mamba processes sequences by maintaining a sort of compressed hidden, or for analogy’s sake, latent state. It holds a fixed-size summary of everything it’s seen. You can probably already see why this is an efficiency boost, but it might degrade over large enough corpora, or at least at some point be unable to easily decode local detail.
More specifically, that state gets updated with every new token, and old information gets gradually overwritten. For a 30-second music clip tokenized at 86 frames/second across 9 codebooks, that’s over 23,000 tokens. A lot of information to compress into one fixed-size state vector.</p>
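<p>A toy picture of that compression pressure, with an illustrative scalar decay standing in for Mamba’s input-dependent, per-channel dynamics (the gate values here are made up):</p>

```python
import torch

# Toy sketch of why a fixed-size recurrent state overwrites old info:
# ~23k tokens in a 30 s clip all funneled through one small state.
# The scalar forget/update gates are illustrative, not Mamba's values.
total_tokens = 30 * 86 * 9   # seconds * frames/s * codebooks = 23,220
d_state = 16
h = torch.zeros(d_state)
a, b = 0.95, 0.05            # hypothetical forget/update gates

for x in torch.randn(total_tokens, d_state):
    h = a * h + b * x        # old information decays at every step

# token 0's contribution is scaled by a**(total_tokens-1), i.e. ~0
print(total_tokens, 0.95 ** (total_tokens - 1) < 1e-300)  # 23220 True
```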

<p>So, imagine a scenario where the model could sort of <em>remember</em> what happened earlier? Not in the vague, compressed-state-vector sense. Actually remember it. Store specific checkpoints from earlier in the sequence and pull them back when they’re relevant?</p>

<p>Yeah, see below…</p>

<hr />

<h2 id="the-memory-caching-paper">The Memory Caching Paper</h2>

<p>Some days ago actually, Behrouz et al. published “Memory Caching: RNNs with Growing Memory” (arXiv:2602.24281). The idea is pretty sweet:</p>

<ol>
  <li><strong>Segment the sequence.</strong> Divide the input into fixed-size segments of S tokens.</li>
  <li><strong>Cache boundary states.</strong> At the end of each segment, save the model’s hidden state.</li>
  <li><strong>Retrieve when needed.</strong> For each new token, let the model query the cache – “which of my past states is most relevant right now?” – and inject that information via a learned gating mechanism.</li>
</ol>

<p>The retrieval uses what they call <strong>Gated Residual Memory (GRM)</strong>. For each position, the model computes a softmax distribution over all cached segments plus the current segment. The gates decide: how much should I trust my current computation vs. what I cached from 5 segments ago?</p>

<p>It’s a way of giving RNNs an explicit, growing memory bank without abandoning the efficiency of recurrence. Think of it as Mamba’s usual compressed state, supplemented by like a series of snapshots from earlier in the sequence.</p>
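<p>Here is a minimal sketch of that retrieval-plus-gating step as I read it. The shapes, the exact mixing form, and names like <code>W_u</code> and <code>gate_bias</code> are my guesses at the mechanism, not the paper’s code:</p>

```python
import torch

# Minimal GRM-style sketch: a softmax over cached segment summaries
# picks what to retrieve, and a sigmoid gate decides how much to mix
# it into the current computation. All names/shapes are assumptions.
def grm_mix(h_t, summaries, W_u, gate_bias):
    """h_t: (d,), summaries: (n_segments, d) -> mixed (d,), weights."""
    scores = summaries @ (W_u @ h_t)        # relevance of each cached segment
    weights = torch.softmax(scores, dim=0)  # distribution over the cache
    retrieved = weights @ summaries         # weighted memory readout
    g = torch.sigmoid(gate_bias)            # trust current vs. cached
    return (1 - g) * h_t + g * retrieved, weights

d = 8
out, w = grm_mix(torch.randn(d), torch.randn(5, d),
                 torch.eye(d), torch.tensor(-2.0))
print(out.shape, round(float(w.sum()), 3))  # torch.Size([8]) 1.0
```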

<p>And it seems they only tested it on linear attention and Titans and haven’t tried state space models. I mean, it is a pretty recent paper though lol.</p>

<p>I did want to know if it could work for music generation though. A bit of a redundant side project, but I was just bored and have extra RunPod creds.</p>

<hr />

<h2 id="mambas-hidden-state-seems-difficult-to-penetrate">Mamba’s Hidden state seems difficult to penetrate</h2>

<p>Here’s the thing about Mamba that makes this non-trivial.</p>

<p>The Memory Caching paper assumes you can grab the model’s hidden state at any point during the forward pass. For linear attention that’s fairly straightforward. The recurrent state is an explicit matrix $S_t \in \mathbb{R}^{d_k \times d_v}$ that you compute and can cache directly.</p>

<p>Mamba doesn’t work that way. Its fused CUDA kernel runs the entire selective scan inside GPU registers. The hidden state (a (d_inner, d_state) tensor per layer) is never materialized in accessible memory during training. You can’t just reach in and grab it.</p>

<p>So I decided not to rewrite the kernel and to just cache a proxy of the hidden state (I know, I know: janky, sloppy… haha, but this really is purely experimental).</p>

<hr />

<h3 id="the-implementations-and-the-primary">The implementations and the primary comparison</h3>

<p>I built two models to create a controlled comparison:</p>

<h3 id="mc-linear-attention-the-faithful-version">MC-Linear-Attention (the “faithful” version)</h3>

<p>Linear attention <em>does</em> expose its state matrices. So I built a version of MC that caches the actual recurrent state S_t at segment boundaries, matching the paper as closely as I could. When the model wants to retrieve cached context, it does <code class="language-plaintext highlighter-rouge">phi(q_t) @ S_i</code>, querying the cached state matrix with the current token’s feature-mapped key. This is about as close to the paper’s formulation as you can get.</p>
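<p>A small sketch of that state caching and retrieval. The ELU+1 feature map for <code>phi</code> and the tiny shapes are assumptions for illustration:</p>

```python
import torch

# Sketch of the "faithful" MC-LA path: the linear-attention state S_t is
# an explicit (d_k, d_v) matrix, so it can be checkpointed at segment
# boundaries and queried later with phi(q_t) @ S_i.
phi = lambda x: torch.nn.functional.elu(x) + 1  # positive feature map

d_k, d_v, seg = 8, 8, 32
S = torch.zeros(d_k, d_v)
cache = []
keys, vals = torch.randn(96, d_k), torch.randn(96, d_v)
for t in range(96):
    S = S + torch.outer(phi(keys[t]), vals[t])  # linear-attention update
    if (t + 1) % seg == 0:
        cache.append(S.clone())                 # cache at segment boundary

q_t = torch.randn(d_k)
readouts = [phi(q_t) @ S_i for S_i in cache]    # query each cached state
print(len(cache), readouts[0].shape)  # 3 torch.Size([8])
```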

<h3 id="mc-mamba-the-proxy-version">MC-Mamba (the “proxy” version)</h3>

<p>For Mamba, I cache the <strong>output activations</strong> at segment boundaries instead of hidden states. After each Mamba block processes the full sequence (preserving the fast CUDA kernel), I extract the output vector at positions S-1, 2S-1, 3S-1, etc. These d_model-dimensional vectors serve as compressed summaries of what the model computed at each segment boundary.</p>
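<p>A sketch of that boundary extraction (names and sizes are illustrative, not the repo’s):</p>

```python
import torch

# Sketch of the MC-Mamba proxy: instead of hidden states (locked inside
# the fused scan kernel), cache the block's *output activations* at
# segment boundaries, i.e. positions S-1, 2S-1, 3S-1, ... of the output.
def boundary_summaries(block_out: torch.Tensor, segment: int) -> torch.Tensor:
    """block_out: (T, d_model) -> (n_segments, d_model)."""
    T = block_out.shape[0]
    idx = torch.arange(segment - 1, T, segment)  # S-1, 2S-1, 3S-1, ...
    return block_out[idx]

out = torch.randn(1024, 512)          # toy Mamba-block output, (T, d_model)
summ = boundary_summaries(out, segment=128)
print(summ.shape)  # torch.Size([8, 512])
```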

<p>The GRM gating mechanism is identical in both: softmax over segment means from the input space, with a single learnable W_u projection per layer. The only difference is <em>what</em> gets cached and <em>how</em> it gets retrieved.</p>

<p>If MC works on linear attention but fails on Mamba, the proxy approach is the bottleneck, not the mechanism itself. That’s the controlled comparison.</p>

<hr />

<h3 id="train-setup">Train Setup</h3>

<p>Both models were trained on music tokenized with <strong>DAC</strong> (Descript Audio Codec) at 44.1kHz – 9 codebooks at ~86 tokens/second. The dataset is <strong>FMA-Large</strong>, a collection of 106,574 freely licensed tracks spanning a wide range of genres. Training ran on A100 80GB GPUs rented through RunPod.</p>

<p>The task is unconditional music generation (again, seriously GPU constrained even though I’m doing this): predict the next audio token across all 9 codebooks, autoregressively.</p>

<hr />

<h3 id="experiment-1-the-48m-model-proof-of-life">Experiment 1: The 48M Model (Proof of Life)</h3>

<p>Before going big, I ran a small-scale test to see if the MC mechanism could learn at all.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>MC-LA 48M</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Architecture</td>
      <td>MC-Linear-Attention</td>
    </tr>
    <tr>
      <td>Parameters</td>
      <td>48.6M (2.5% MC overhead)</td>
    </tr>
    <tr>
      <td>Dataset</td>
      <td>25k tracks (FMA subset)</td>
    </tr>
    <tr>
      <td>Segment size</td>
      <td>256 tokens</td>
    </tr>
    <tr>
      <td>Peak LR</td>
      <td>1e-4</td>
    </tr>
    <tr>
      <td>Batch size</td>
      <td>32</td>
    </tr>
    <tr>
      <td>Steps</td>
      <td>~24,700</td>
    </tr>
  </tbody>
</table>

<p>This was an early version of the code. Not optimized, some rough edges. I’m telling you this upfront because what happened next matters more than the absolute numbers.</p>

<h3 id="it-learned-haha-wtf">It learned, haha wtf.</h3>

<p>The val loss dropped steadily across the entire run:</p>

<table>
  <thead>
    <tr>
      <th>Step</th>
      <th>Val Loss</th>
      <th>Perplexity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2,000</td>
      <td>6.854</td>
      <td>947.5</td>
    </tr>
    <tr>
      <td>4,000</td>
      <td>6.746</td>
      <td>850.6</td>
    </tr>
    <tr>
      <td>8,000</td>
      <td>6.363</td>
      <td>579.8</td>
    </tr>
    <tr>
      <td>12,000</td>
      <td>6.139</td>
      <td>463.6</td>
    </tr>
    <tr>
      <td>16,000</td>
      <td>6.046</td>
      <td>422.3</td>
    </tr>
    <tr>
      <td>20,000</td>
      <td>5.997</td>
      <td>402.1</td>
    </tr>
    <tr>
      <td>24,000</td>
      <td>5.969</td>
      <td>391.3</td>
    </tr>
  </tbody>
</table>

<p>No instability. No divergence. Just a clean, steady decline over 24,000 steps.</p>

<p>But the really interesting signal was the <strong>GRM entropy</strong>.</p>

<p>GRM entropy measures how spread out the attention weights are across cached segments. High entropy means the model is attending uniformly and looking at everything equally, which is basically the same as looking at nothing in particular. Low entropy means the model is being <em>selective</em> and it has learned to focus on specific cached segments when they’re relevant.</p>

<p>The 48M model started with GRM entropy around 1.26 (roughly uniform over cached segments). By step 24,000 it had dropped to <strong>0.22</strong>.</p>
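<p>For a sense of scale on those entropy numbers, here is the entropy of a uniform versus a peaked distribution over a handful of cached segments (toy distributions, not the model’s actual weights):</p>

```python
import torch

# Reading the GRM entropy numbers: entropy of the softmax weights over
# cached segments. Uniform over ~3-4 segments gives ~1.2-1.4 nats;
# a sharply peaked (selective) distribution gives entropy near 0.
def entropy(weights: torch.Tensor) -> float:
    return float(-(weights * torch.log(weights + 1e-12)).sum())

uniform = torch.full((4,), 0.25)                 # attending to everything
peaked = torch.tensor([0.94, 0.02, 0.02, 0.02])  # selective retrieval
print(round(entropy(uniform), 2), round(entropy(peaked), 2))  # 1.39 0.29
```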

<p>The gating mechanism was learning. The model wasn’t just blindly averaging cached states. It seemed to be developing preferences for specifically <em>which</em> cached segment to retrieve at each position. That seems to align with the behavior the MC paper predicts: learned, position-dependent retrieval over a growing memory bank.</p>

<p>Was the audio actually good though? To be honest, with 48M parameters on 25k tracks, “good” is a stretch. But there surprisingly was a <em>semblance of something</em>. Like rhythmic patterns that held, frequency content that wasn’t just noise. Enough to warrant scaling up.</p>

<hr />

<h3 id="experiment-2-scaling-to-95m-where-things-went-wrong">Experiment 2: Scaling to 95M (Where Things Went Wrong)</h3>

<p>Encouraged by the 48M results, I scaled up. Bigger model, bigger data, improved codebase.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>MC-LA 95M</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Architecture</td>
      <td>MC-Linear-Attention</td>
    </tr>
    <tr>
      <td>Parameters</td>
      <td>95.1M (3.1% MC overhead)</td>
    </tr>
    <tr>
      <td>Dataset</td>
      <td>106k tracks (full FMA-Large)</td>
    </tr>
    <tr>
      <td>Segment size</td>
      <td>128 tokens</td>
    </tr>
    <tr>
      <td>Peak LR</td>
      <td>3e-4</td>
    </tr>
    <tr>
      <td>Batch size</td>
      <td>64-72</td>
    </tr>
    <tr>
      <td>MC start layer</td>
      <td>10 (only top half gets MC)</td>
    </tr>
  </tbody>
</table>

<p>I made several changes from the 48M run: halved the segment size (128 vs 256), tripled the learning rate (3e-4 vs 1e-4), only applied MC to the top 10 layers (leaving layers 0-9 as plain linear attention), and switched to the full 106k-track dataset.</p>

<p>I ran it twice. Both times, the same thing happened.</p>

<h3 id="gradient-explosions">Gradient explosions</h3>

<p>The gradient norms tell the story:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>step   2900 | grad_norm 18.98
step   3100 | grad_norm  5.55
step   3500 | grad_norm  3.95
step   3800 | grad_norm  2.82
step   3900 | grad_norm  3.83
step   4500 | grad_norm  5.10
step   4600 | grad_norm  5.41
step   5000 | grad_norm 13.76
step   5100 | grad_norm 15.64
step   5300 | grad_norm 14.42
step   5500 | grad_norm  8.41
</code></pre></div></div>

<p>Between the spikes, gradient norms sat around 0.15-0.20, which was pretty OK. But every few hundred steps, something in the MC pathway would fire and blow the norm up by two orders of magnitude. These spikes seem mechanistic rather than random, though they could just be a weird artifact of early training dynamics, I don’t know.</p>

<h3 id="the-gate-never-opened">The gate never opened</h3>

<p>The model logs a <code class="language-plaintext highlighter-rouge">gate</code> value representing the learned bias on the GRM gate. It initialized at -2.0, which corresponds to sigmoid(-2.0) ≈ 0.12: a very conservative starting point that barely lets cached context through.</p>

<p>Over 5,500 steps, the gate moved from -2.00 to -1.91.</p>

<p>That’s nothing. The gate was supposed to learn when to open and trust the cached memory. Instead, it stayed almost exactly where it started. The MC mechanism was architecturally present but functionally dormant.</p>

<h3 id="grm-entropy-stayed-high">GRM entropy stayed high</h3>

<p>Remember how the 48M model’s entropy dropped from 1.26 to 0.22? The 95M model’s GRM entropy sat at <strong>~2.4 the entire run</strong>. That’s near-uniform attention – the model never learned to selectively retrieve from its cache. It was looking at everything equally, which is the same as ignoring the cache entirely.</p>

<h3 id="the-loss-seems-fine-which-made-it-more-annoying">The loss seems fine (which made it more annoying)</h3>

<p>The <em>loss</em> was actually quite reasonable.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>step   2000 | EVAL: val_loss=6.6908
step   4000 | EVAL: val_loss=6.0978
</code></pre></div></div>

<p>The underlying linear attention was doing all the work while the cache sat there with its gate barely open, likely contributing little but gradient noise.</p>

<hr />

<h3 id="what-to-do">What to do?</h3>

<p>The 48M model does seem to work though.</p>

<p><strong>What I suspect:</strong></p>

<p>The core issue is a <strong>cold-start problem</strong>. The MC mechanism needs to learn useful retrieval patterns, but it starts nearly disabled (gate at 0.12) while the base model races ahead. By the time the gate bias might have wandered open, the base linear attention has already found a loss landscape that doesn’t need the cache. The MC gradients become noise: they push the gate around randomly without a clear signal, occasionally spiking hard enough to disrupt training.</p>

<p>The 48M model avoided this maybe because: (a) the lower learning rate gave the GRM more time to adapt gradually, (b) the smaller model capacity meant the base attention couldn’t “solve” the task on its own as easily, creating demand for the cached context, and (c) the larger segment size (256) meant fewer cache entries, making the softmax distribution easier to sharpen.</p>

<hr />

<h2 id="limitation-">Limitation :)</h2>

<p>To be clear, this was more of a learning-through-experimenting thing, and I’d definitely take it further if I had more GPU time.</p>

<p><strong>For 48M:</strong> It’s a smaller model, smaller dataset, lower learning rate, and an older version of the code. The declining GRM entropy is a genuinely positive signal, and by every metric the model learned selective retrieval. But I can’t rule out that the entropy drop is an artifact of the training dynamics rather than evidence that the cache is meaningfully improving generation quality (I’d need more compute and data to rule that out).</p>

<p><strong>I never ran a 48M model <em>without</em> MC for comparison to get a decent baseline:</strong> The 48M result stands alone. I can’t tell you how much of its final loss is attributable to MC vs. how much the base linear attention would have achieved on its own. That’s a gap I need to fill.</p>

<p><strong>The 95M runs were cut short.</strong> 5,500 steps isn’t always enough to declare failure, as I said earlier. It’s definitely possible the grad spikes would have settled and the gate would have eventually opened. But the trajectory wasn’t promising, and I didn’t want to burn GPU hours watching it not work.</p>

<p><strong>Three variables changed</strong> between the 48M and 95M runs: model size, dataset size, and learning rate. Any one of these could be the culprit. Proper experimental design would change one at a time. I didn’t, because GPU hours cost money, and a bit of impatience.</p>

<hr />

<h2 id="learnings">Learnings?</h2>

<p><strong>1. The mechanism works (when it works)</strong></p>

<p>The 48M experiment is proof that GRM-based memory caching <em>can</em> learn selective retrieval over cached segment states in a music generation context. The entropy dropping from 1.26 to 0.22 is not nothing: the model seemingly learns to look at specific past segments when generating new tokens. That’s the whole promise of MC.</p>

<p><strong>2. Initialization and learning rate matter more than I expected</strong></p>

<p>The gate bias at -2.0 combined with LR 3e-4 seems to be a bad combination. The MC paper doesn’t discuss initialization sensitivity because they’re working with linear attention models where the state matrices are already part of the standard forward pass. When you’re bolting MC onto an architecture as an <em>addition</em>, the optimization dynamics change. The MC pathway needs to be initialized assertively enough to receive gradient signal, but not so aggressively that it destabilizes the base model.</p>

<p><strong>3. The “proxy vs. faithful” comparison is still pending</strong></p>

<p>I haven’t actually trained MC-Mamba (the output-activation proxy version) yet. The experiments above are all MC-Linear-Attention. The original question I had, whether you can cache Mamba’s <em>output activations</em> and get useful retrieval, remains a bit open. The 48M result suggests the GRM mechanism itself works for music, so the next step is testing whether the proxy representation (boundary output vectors) is informative enough to retrieve from.</p>

<p><strong>4. Scale-up isn’t free</strong></p>

<p>Also learned the hard way that a mechanism that works at small scale can fail at large scale for reasons that have nothing to do with the mechanism’s fundamental validity. Initialization, learning rate schedules, interactions with other training hyperparameters, and so on can all obviously change the outcome. The 48M model’s success didn’t predict the 95M model’s failure.</p>

<hr />

<h2 id="maybe-next-steps">Maybe next steps</h2>

<p>Some ideas I’ll work on later.</p>

<hr />

<h3 id="outlook">Outlook</h3>

<p>I started this project because I thought recurrent models needed better long-range memory for music generation, and MC seemed like an elegant way to provide it. The results so far are… humbling.</p>

<p>The 48M experiment proved the concept can work, like a pretty good concept-level proof, although I messed up by trying to scale prematurely. The gate staying shut, the entropy staying flat, the grad norms spiking: these are all symptoms of general issues in my model’s optimization landscape. So the whitepill is that it might not necessarily be anything wrong with my architecture “design”.</p>

<p>What I have right now is a mechanism that learned selective memory retrieval on a small model, and a larger model where that same mechanism refused to engage because of mostly dumb mistakes on my part.</p>

<hr />

<p><em>All code is available in the <a href="https://github.com/Obiohagwu/mc-la-mamba">mc-mamba repository</a>. Training logs for all runs are included in <code class="language-plaintext highlighter-rouge">pod_runs/</code>.</em></p>

<p><em>If you’ve dealt with similar cold-start problems when adding auxiliary mechanisms to neural networks, I’d genuinely love to hear how you solved it. Sometimes the difference between “doesn’t work” and “works beautifully” is one initialization trick.</em></p>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Transport Geometry as a Testable Hypothesis?</title><link href="https://obiohagwu.github.io/obiohagwu/2026/02/22/Transport-Geometry-as-a-Testable-Hypothesis.html" rel="alternate" type="text/html" title="Transport Geometry as a Testable Hypothesis?" /><published>2026-02-22T00:00:00+00:00</published><updated>2026-02-22T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/02/22/Transport%20Geometry%20as%20a%20Testable%20Hypothesis</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/02/22/Transport-Geometry-as-a-Testable-Hypothesis.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']],
    displayMath: [['$$', '$$'], ['\\[', '\\]']]
  }
};
</script>

<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" async=""></script>

<style>
html, body {
  overflow-x: hidden;
  max-width: 100%;
}
table {
  display: block;
  overflow-x: auto;
  white-space: nowrap;
  max-width: 100%;
  font-size: 0.9em;
}
.post-content {
  overflow-x: hidden;
}
</style>

<p><em>mongoobi, Feb 2026</em></p>

<hr />

<p>Continuation of <a href="/2026/02/09/your-sae-looks-solved-your-model-disagrees.html">Part I</a>.</p>

<p>To summarize it though, in mid-layer Pythia residual streams at fixed sparsity budget $k$, $R^2$ can go up with model scale while patched CE loss gets worse. I called it the proxy gap.</p>

<p>This post is about what happens when you train longer, what that rules out, and where the explanation might live. I found some more seemingly interesting stuff. But, unbeknownst to naive me, this turns out to be a fairly active research area. I’ll expand more below.</p>

<hr />

<h2 id="the-sign-mismatch-was-real-but-temporary">The sign mismatch was real but temporary</h2>

<p>The most salient part of the previous post was the apparent sign flip in certain regimes. More specifically: $R^2$ says the 410M SAE is better, CE says the 70M SAE is better. That was on a 10M-token training budget.</p>

<p>I extended to 50M and 100M tokens with seed repeats. Mid-layer, $k=8$:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">tokens</th>
      <th style="text-align: right">seeds</th>
      <th style="text-align: right">$\Delta_{CE}$</th>
      <th style="text-align: right">95% CI</th>
      <th style="text-align: right">$\Delta_{R^2}$</th>
      <th style="text-align: right">mismatch rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">10M</td>
      <td style="text-align: right">4</td>
      <td style="text-align: right">-0.099</td>
      <td style="text-align: right">[-0.117, -0.082]</td>
      <td style="text-align: right">+0.150</td>
      <td style="text-align: right">100%</td>
    </tr>
    <tr>
      <td style="text-align: right">50M</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">+0.018</td>
      <td style="text-align: right">—</td>
      <td style="text-align: right">+0.108</td>
      <td style="text-align: right">0%</td>
    </tr>
    <tr>
      <td style="text-align: right">100M</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">+0.009</td>
      <td style="text-align: right">[+0.004, +0.013]</td>
      <td style="text-align: right">+0.096</td>
      <td style="text-align: right">0%</td>
    </tr>
  </tbody>
</table>

<p>Sign mismatch: gone by 50M. The larger model just needed more tokens to converge. Fair criticism of Part I, and I’m reporting it.</p>

<p>Some anomalies persist, though. The magnitudes at 100M are still off by roughly 10x: $R^2$ says the 410M SAE is substantially better ($\Delta = 0.096$), while CE says they’re basically tied ($\Delta = 0.009$). At $k=16$ the ratio is about 3x. I may be missing something, but this looks like a real issue: a systematic, scale-dependent distortion in the metric everyone defaults to.</p>

<p>So, <strong>H0 (pure optimization artifact) explains the sign flip but not the magnitude gap.</strong> Which means something else may be going on. Worth investigating.</p>

<hr />

<h2 id="three-spaces-and-the-mismatch-between-them">Three spaces and the mismatch between them</h2>

<p>Ok, so here’s my running mental model right now. SAE reconstruction lives in three different spaces, and they don’t necessarily agree about what “close” means.</p>

<p><strong>Activation space.</strong> This is where SAEs optimize. The error is Euclidean: $\lVert h - \hat{h} \rVert^2$. The metric that normalizes this is $R^2$. This space is shaped by the covariance of activations, so high-variance directions dominate.</p>

<p><strong>Probability-sensitive space.</strong> This is the real meat: it’s what downstream computation actually seems to care about. The natural local metric here is the pullback of the output Fisher through the downstream Jacobian:</p>

\[G_L = J_L^\top F_{out} J_L\]

<p>where $F_{out} = \text{diag}(p) - pp^\top$ for the output distribution. A perturbation that’s small in activation space can be large in this space if it points along a direction the model is sensitive to.</p>
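<p>For concreteness, here’s a minimal numpy sketch of the pullback metric for a toy linear readout $z = Wh$, where the downstream Jacobian is just $W$. Everything here (the dimensions, the readout) is hypothetical; a real model needs autograd Jacobians at the hookpoint:</p>

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pullback_fisher(W, h):
    """G_L = J^T F_out J. For a linear readout z = W h, the Jacobian J is W."""
    p = softmax(W @ h)
    F_out = np.diag(p) - np.outer(p, p)   # Fisher of the softmax output
    return W.T @ F_out @ W

# Toy instance: 8-dim hidden state, 20-token vocabulary (both made up).
rng = np.random.default_rng(0)
d, v = 8, 20
W = rng.standard_normal((v, d)) / np.sqrt(d)
h = rng.standard_normal(d)
G = pullback_fisher(W, h)   # symmetric PSD; defines "probability-sensitive" length
```

<p>A perturbation $\delta$ with small $\lVert\delta\rVert$ can still have large $\delta^\top G_L \delta$ if it points along a sensitive direction.</p>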

<p><strong>Attention-plan space.</strong> This one is more speculative, and I won’t go deep on it given my lack of sufficient background at the moment, but interesting recent work frames attention weights as entropy-regularized transport plans (Litman, 2025). If you buy that framing, then patching activations changes the transport plan, and the cost of that change isn’t Euclidean either. This is a second-wave thing; I plan to dive a lot deeper into it soon.</p>

<p>In this framing, the proxy gap is a gap between space 1 and space 2. $R^2$ measures fidelity in activation space; CE measures fidelity in probability-sensitive space. When the geometry of those two spaces diverges (which it seems to do more at larger scales, especially in mid-layer low-PR regimes), the metrics disagree.</p>

<p>This isn’t a new idea in the abstract. “Reconstruction isn’t behavior” is something people say and maybe usually know intuitively. What I’m trying to do is make it <em>testable</em> and <em>localized</em>: where exactly does it happen, how bad is it, and can you build a cheap proxy that tracks the right space?</p>

<hr />

<h2 id="swd-a-first-pass-bridge-metric">SWD: a first-pass bridge metric</h2>

<p>Before trying to estimate the full pullback Fisher (expensive, probably unstable at this scale), there’s a much simpler object. Take the CE gradient at the hookpoint:</p>

\[g_L = \nabla_{h_L} \mathcal{L}_{CE}\]

<p>and define sensitivity-weighted distortion:</p>

\[\text{SWD}_L = \mathbb{E}\left[(g_L^\top \delta h_L)^2\right]\]

<p>where $\delta h_L = \hat{h}_L - h_L$ is the reconstruction error.</p>

<p>$R^2$ weights error by the covariance structure of activations. SWD weights error by the sensitivity structure of the loss. If those two weightings diverge, $R^2$ is lying to you, and SWD should catch it.</p>
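<p>A minimal sketch of the estimator on toy arrays (the real version averages over eval tokens). The point: SWD scores the same Euclidean error magnitude very differently depending on its alignment with $g_L$:</p>

```python
import numpy as np

def swd(g, dh):
    """Sensitivity-weighted distortion: E[(g^T dh)^2] over tokens.
    g, dh: (n_tokens, d) arrays of CE gradients and reconstruction errors."""
    return np.mean(np.einsum('td,td->t', g, dh) ** 2)

# Toy contrast (synthetic): the loss only cares about dimension 0.
g = np.tile([1.0, 0.0], (100, 1))
dh_aligned = np.tile([0.1, 0.0], (100, 1))   # same ||dh||, along g
dh_orthog  = np.tile([0.0, 0.1], (100, 1))   # same ||dh||, orthogonal to g
# Identical MSE, opposite SWD verdicts: 0.01 for aligned, 0.0 for orthogonal.
```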

<p>The concrete test: across a grid of (model, $k$, seed) conditions, does $1/\text{SWD}$ correlate with $CE_{rec}$ better than $R^2$? I’m building a proxy leaderboard for $R^2$, cosine sim, $1/\text{NMSE}$, $1/\text{SWD}$, $1/\lvert g^\top\delta\rvert$ — ranked by Pearson and Spearman with bootstrap CIs.</p>
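<p>The ranking logic itself is small; here’s a sketch with purely hypothetical numbers (the actual Exp-B run adds bootstrap CIs, omitted here):</p>

```python
import numpy as np

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    # Rank transform (no tie handling; fine for illustration), then Pearson.
    rank = lambda v: np.argsort(np.argsort(np.asarray(v))).astype(float)
    return pearson(rank(x), rank(y))

def leaderboard(ce_rec, proxies):
    """proxies: {name: array}; rank each proxy by Spearman with CE_rec."""
    rows = [(name, pearson(v, ce_rec), spearman(v, ce_rec))
            for name, v in proxies.items()]
    return sorted(rows, key=lambda r: -r[2])

# Hypothetical scores purely to show the interface:
ce = np.array([0.34, 0.24, 0.58, 0.53, 0.73, 0.70])
board = leaderboard(ce, {
    "r2":    np.array([0.81, 0.96, 0.86, 0.97, 0.89, 0.94]),
    "1/swd": np.array([2.1, 1.4, 3.8, 3.5, 5.2, 4.9]),
})
# board[0] is the proxy that best tracks CE_rec (here "1/swd" by construction).
```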

<p>This is running on my local Mac right now, quite slowly. I’ll append the results, with accompanying edits, when the run is done in about 28 hours.</p>

<hr />

<h2 id="the-hypothesis-stack-right-now">The hypothesis stack right now</h2>

<p>I want to be precise about what claims are on the table and what gates them.</p>

<p><strong>H0 (optimization-only).</strong> The cross-scale CE gap shrinks to zero as SAE training budget increases. <em>Status: partially supported.</em> Sign mismatch gone. Magnitude gap not gone.</p>

<p><strong>H1 (residual intrinsic component).</strong> After convergence, a nonzero magnitude gap remains. <em>Status: supported at 100M tokens, but I’d want higher token budgets and more seeds to be confident.</em></p>

<p><strong>H2 (geometry-aware proxies explain residual).</strong> SWD or similar sensitivity-weighted metrics predict CE better than $R^2$. <em>Status: running.</em></p>

<p><strong>H3 (task-relevant dimensional mismatch).</strong> MI-derived task-relevant dimensionality diverges from geometric PR in exactly the regimes where proxy failure is worst. <em>Status: planned, contingent on H2.</em></p>

<p>The rule I set before writing any code: don’t interpret H2/H3 until H0 vs H1 is adjudicated, lol. Phase 2 data adjudicates H0 vs H1 in favor of H1 (magnitude gap persists). So H2 is now live.</p>

<hr />

<h2 id="where-this-is-going">Where this is going</h2>

<p>The decision tree, stated plainly:</p>

<p>If SWD outpredicts $R^2$ for CE, the sensitivity-geometry story has teeth and I have a cheap diagnostic. Proceed.</p>

<p>If SWD doesn’t beat $R^2$, the magnitude gap is real but the explanation isn’t sensitivity geometry. Maybe it’s purely an SST normalization artifact? Maybe the gap lives in higher-order terms that a linear sensitivity proxy can’t capture, sigh. Either way, the empirical characterization still stands as a fair contribution, and I’ll probably branch to the anisotropy/deflated-PR story or to pullback Fisher approximations as a second wave.</p>

<p>Second-wave experiments, if the first wave works:</p>

<ul>
  <li><strong>Pullback Fisher approximation.</strong> Approximate $G_L$ via subsampled Jacobians and diagonal $F_{out}$, compute Fisher-Euclidean deviation, test whether it adds predictive value beyond SWD.</li>
  <li><strong>MI critic transfer.</strong> Run separable vs hybrid MI estimators on $(h_L, \text{logits})$, compare inferred task-relevant $k^*$ profiles with proxy gap magnitude.</li>
</ul>

<p>I’m not pre-committing to these. They’re on the shelf if the simple version works and I want to push the mechanism story deeper.</p>

<hr />

<h2 id="rate-distortion-geometry">Rate-distortion-geometry</h2>

<p>One framing I keep coming back to is treating an SAE as a lossy code. The sparsity budget $k$ determines the rate. What changes is which distortion measure you evaluate against.</p>

<p>The strange part: in Euclidean distortion space, the SAE looks increasingly good with scale (high $R^2$), while in sensitivity-weighted distortion the picture may be different. The rate-distortion curve <em>depends on the distortion geometry</em>, and if you’re measuring in the wrong geometry, your curve is wrong.</p>

<p>This connects, loosely, to bounded-observer information frameworks (Finzi et al., 2026, on epiplexity). The common thread: the same representation can contain structure that’s “there” in an information-theoretic sense but not extractable by a given tool class under a given budget. The proxy gap is a concrete instance: $R^2$ says the structure is there, CE says your SAE didn’t extract it in a way that matters.</p>

<p>To be clear, I’m not claiming a formal reduction. I’m using this more as a lens.</p>

<hr />

<h2 id="concurrent-work-worth-reading-if-youre-really-interested">Concurrent work worth reading if you’re really interested:</h2>

<p><a href="https://arxiv.org/abs/2405.12241">Braun et al. (2024)</a> trained SAEs to minimize KL divergence instead of reconstruction error (“end-to-end SAEs”) and showed a Pareto improvement: more CE explained with fewer features. Their argument is basically “reconstruction objectives learn dataset structure, not computational structure.”</p>

<p>Same core problem, different angle. They built a better training objective; I’m trying to build a better evaluation diagnostic. The e2e paper is evidence that the problem is real: if a separate group independently builds an entire training pipeline to work around it, it isn’t an artifact of my setup. But most people in the interpretability community are using existing SAEs (SAELens, Anthropic’s published dictionaries), not retraining from scratch. A cheap post-hoc metric that flags “your $R^2$ is misleading here” is useful even in that world.</p>

<hr />

<h2 id="some-more-limitations">Some more limitations?</h2>

<p>Everything from Part I still applies (two model sizes, one decoder class, not ablation-normalized, late-layer confound). New:</p>

<ul>
  <li>Seed counts are low. 3-4 at anchor conditions. Not enough for robust seed-level CIs.</li>
  <li>The 50M row has one seed per $k$. Single data point.</li>
  <li>Exp-B (SWD vs classical proxies) hasn’t finished. I’m reporting setup and rationale, not results.</li>
  <li>I haven’t computed deflated PR yet. The canyon story is suggestive but might not survive deflation.</li>
</ul>

<hr />

<h2 id="repro">Repro</h2>

<ul>
  <li>Phase 2 data: <code class="language-plaintext highlighter-rouge">info-geo/outputs/phase2_repeat_analysis.md</code></li>
  <li>Token sweep: <code class="language-plaintext highlighter-rouge">info-geo/outputs/proxy_gap_lowk_10m_100m_report.md</code></li>
  <li>Exp-B script: <code class="language-plaintext highlighter-rouge">info-geo/run_expb.py</code> (standalone, runs on MPS/CUDA/CPU)</li>
  <li>Paper outline: <code class="language-plaintext highlighter-rouge">info-geo/full_paper_outline.md</code></li>
</ul>

<hr />

<h2 id="references">References</h2>

<ol>
  <li>Litman, E. (2025). <em>Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport</em>. arXiv:2508.08369.</li>
  <li>Park, K., et al. (2026). <em>The Information Geometry of Softmax: Probing and Steering</em>. arXiv:2602.15293.</li>
  <li>Gulati, P., et al. (2026). <em>Mutual Information and Task-Relevant Latent Dimensionality</em>. arXiv:2602.08105.</li>
  <li>Braun et al. (2024). <em>Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning</em>. arXiv:2405.12241.</li>
  <li>Gao, L., et al. (2024). <em>Scaling and Evaluating Sparse Autoencoders</em>. arXiv:2406.04093.</li>
  <li>Bricken et al. (2023). <em>Towards Monosemanticity</em>. Anthropic.</li>
  <li>Finzi et al. (2026). <em>From Entropy to Epiplexity</em>. arXiv:2601.03220.</li>
</ol>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Your SAE looks solved. Your model disagrees</title><link href="https://obiohagwu.github.io/obiohagwu/2026/02/09/Your-SAE-looks-solved.-Your-model-disagrees.html" rel="alternate" type="text/html" title="Your SAE looks solved. Your model disagrees" /><published>2026-02-09T00:00:00+00:00</published><updated>2026-02-09T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/02/09/Your%20SAE%20looks%20solved.%20Your%20model%20disagrees</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/02/09/Your-SAE-looks-solved.-Your-model-disagrees.html"><![CDATA[<script>
MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']],
    displayMath: [['$$', '$$'], ['\\[', '\\]']]
  }
};
</script>

<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" async=""></script>

<style>
html, body {
  overflow-x: hidden;
  max-width: 100%;
}
table {
  display: block;
  overflow-x: auto;
  white-space: nowrap;
  max-width: 100%;
  font-size: 0.9em;
}
.post-content {
  overflow-x: hidden;
}
</style>

<hr />

<p>This is a research log with a small argument:</p>

<ol>
  <li>If you’re using SAEs as interpretability tools or safety monitors, <strong>reconstruction fidelity alone is not a reliable acceptance test</strong> — behavioral metrics like patched CE loss or ablation-normalized CE recovery should be primary.</li>
  <li>In one common regime (mid-layer residual stream), \(R^2\) can be <strong>systematically inflated across scale at fixed sparsity</strong>, due to a concrete mechanism (activation variance scaling), while behavioral preservation gets worse.</li>
</ol>

<p>This is not a scaling-law fit and not evidence for a hard interpretability ceiling.</p>

<p><strong>Update note (Feb 21):</strong></p>

<ul>
  <li>The CI-overlap statement is now explicitly scoped to the Fast3 mid-only sweep.</li>
  <li>I added a new 50M-token in-tree anchor run for 70M mid-layer \(k=8\).</li>
</ul>

<hr />

<h2 id="brief-definitions">Brief Definitions</h2>

<p><strong>Sparse autoencoder (SAE).</strong> A learned lossy codec for activations: encode dense activations into a sparse vector of “features,” then decode back to the activation space.</p>

<p><strong>TopK SAE.</strong> An SAE where each token activates exactly <code class="language-plaintext highlighter-rouge">k</code> nonzero features (hard sparsity). In this post, the decoder is linear.</p>

<p><strong>Hookpoint.</strong> Where I patch the activation: <code class="language-plaintext highlighter-rouge">blocks.{L}.hook_resid_post</code> (residual stream post-block).</p>

<p><strong>Reconstruction fidelity (\(R^2\)).</strong> Variance explained in activation space:</p>

\[R^2 := 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}\]

<p>with mean-centered totals:</p>

\[\mathrm{SSE} = \sum \lVert a - \hat{a} \rVert^2, \quad \mathrm{SST} = \sum \lVert a - \mathbb{E}[a] \rVert^2.\]

<p><strong>Patched loss score (\(CE_{rec}\)).</strong> For each eval batch, let \(L_{orig}\) be original next-token cross-entropy and \(L_{recon}\) be CE after replacing the hookpoint activation with the SAE reconstruction. I report:</p>

\[CE_{rec} := 1 - \frac{L_{recon} - L_{orig}}{L_{orig}} = 2 - \frac{L_{recon}}{L_{orig}}.\]

<p>Notes (to be specific):</p>

<ul>
  <li>This is <strong>not</strong> ablation-normalized. It is <strong>not</strong> the same as “CE loss recovered” in Bricken et al. (2023) or SAE Lens’s <code class="language-plaintext highlighter-rouge">ce_loss_score</code>, which normalize against a zero-ablation baseline. Do not compare magnitudes across papers without converting.</li>
  <li>Because this metric normalizes by each model’s own \(L_{orig}\), cross-model comparisons can be affected by baseline-loss differences. I mainly use it here as a within-depth behavior preservation score and for sign patterns across \(k\).</li>
  <li>\(CE_{rec}=1\) means perfect preservation (\(L_{recon}=L_{orig}\)).</li>
  <li>\(CE_{rec}=0\) means loss doubled (\(L_{recon}=2 L_{orig}\)).</li>
  <li>It can be negative.</li>
</ul>
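<p>The score is a one-liner; a sketch with the anchor values from the notes above:</p>

```python
def ce_rec(l_orig, l_recon):
    """Patched-loss score: 1 - (L_recon - L_orig)/L_orig = 2 - L_recon/L_orig."""
    return 2.0 - l_recon / l_orig

assert ce_rec(3.0, 3.0) == 1.0   # perfect preservation
assert ce_rec(3.0, 6.0) == 0.0   # loss doubled
assert ce_rec(3.0, 7.0) < 0.0    # worse than doubled: negative
```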

<p><strong>Per-token MSE (<code class="language-plaintext highlighter-rouge">mse_mean</code>).</strong> Mean squared reconstruction error, averaged over model dimensions:</p>

\[\mathrm{MSE} := \mathbb{E}\left[\frac{1}{d_{model}}\lVert a - \hat{a} \rVert^2\right].\]

<p><strong>Cosine similarity (<code class="language-plaintext highlighter-rouge">cosine sim</code>).</strong> Mean tokenwise cosine between original and reconstructed activation vectors:</p>

\[\cos(a,\hat{a}) := \frac{a^\top \hat{a}}{\lVert a\rVert\,\lVert \hat{a}\rVert}, \quad \text{report } \mathbb{E}[\cos(a,\hat{a})].\]

<p><strong>Relative error norm (<code class="language-plaintext highlighter-rouge">relative error norm</code>).</strong> Mean tokenwise relative L2 error:</p>

\[\mathrm{RelErr} := \mathbb{E}\left[\frac{\lVert a - \hat{a}\rVert}{\lVert a\rVert + \varepsilon}\right], \quad \varepsilon=10^{-10}.\]
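<p>The reconstruction metrics above can be computed in one pass. A numpy sketch matching the definitions, with activations shaped <code class="language-plaintext highlighter-rouge">(n_tokens, d_model)</code>:</p>

```python
import numpy as np

def recon_metrics(a, a_hat, eps=1e-10):
    """All four metrics from the definitions; a, a_hat: (n_tokens, d_model)."""
    err = a - a_hat
    sse = np.sum(err ** 2)
    sst = np.sum((a - a.mean(axis=0)) ** 2)               # mean-centered totals
    a_n, ah_n = np.linalg.norm(a, axis=1), np.linalg.norm(a_hat, axis=1)
    return {
        "r2": 1 - sse / sst,
        "mse_mean": np.mean(np.sum(err ** 2, axis=1) / a.shape[1]),
        "cosine": np.mean(np.sum(a * a_hat, axis=1) / (a_n * ah_n)),
        "rel_err": np.mean(np.linalg.norm(err, axis=1) / (a_n + eps)),
    }

# Sanity check: a perfect reconstruction maxes every metric.
rng = np.random.default_rng(0)
a = rng.standard_normal((200, 16))
perfect = recon_metrics(a, a.copy())
```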

<p><strong>Alive fraction (<code class="language-plaintext highlighter-rouge">alive %</code>).</strong> Fraction of SAE features that fire at least once on the eval set (for TopK, “fires” means encoder output \(\neq 0\) on some token):</p>

\[\mathrm{alive} := \frac{\{i : \exists\,t \text{ s.t. } z_{t,i}\neq 0\}}{d_{sae}}.\]

<p><strong>Participation ratio (PR).</strong> A geometry diagnostic for effective dimensionality, computed from eigenvalues \(\lambda_i\) of the mean-centered covariance:</p>

\[PR := \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2}.\]

<p><strong>Rate proxy (bits/token).</strong> An explicit coding-budget proxy for TopK codes: entropy-coded active indices plus fixed-point values (8 bits/value in this sweep). This is not mutual information; it’s just a concrete budget.</p>
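<p>A sketch of these two budget/geometry diagnostics. The index cost here is a simplifying assumption (\(\log_2 d_{sae}\) bits per active index) standing in for the entropy-coded cost used in the sweep:</p>

```python
import numpy as np

def participation_ratio(a):
    """PR = (sum lam_i)^2 / sum lam_i^2, eigenvalues of the mean-centered covariance."""
    lam = np.linalg.eigvalsh(np.cov(a, rowvar=False))
    return float(lam.sum() ** 2 / (lam ** 2).sum())

def rate_proxy_bits(k, d_sae, bits_per_value=8):
    # Assumption: log2(d_sae) bits per active index, not true entropy coding.
    return k * (np.log2(d_sae) + bits_per_value)

rng = np.random.default_rng(0)
pr_iso = participation_ratio(rng.standard_normal((20000, 10)))  # close to 10: isotropic
bits = rate_proxy_bits(k=8, d_sae=32 * 512)                     # 8 * (14 + 8) = 176 bits/token
```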

<hr />

<h2 id="experiment-setup-still-quite-crude-atm">Experiment Setup (still quite crude atm)</h2>

<p>I trained a grid of TopK SAEs and evaluated them two ways: (1) patching reconstructions back into the model and measuring CE change, and (2) direct reconstruction metrics in activation space.</p>

<p><strong>Main sweep (Fast2):</strong> two models, three depths, six sparsity budgets = 36 SAEs.</p>

<p>Models:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">pythia-70m</code> (<code class="language-plaintext highlighter-rouge">d_model=512</code>, <code class="language-plaintext highlighter-rouge">n_layers=6</code>)</li>
  <li><code class="language-plaintext highlighter-rouge">pythia-410m</code> (<code class="language-plaintext highlighter-rouge">d_model=1024</code>, <code class="language-plaintext highlighter-rouge">n_layers=24</code>)</li>
</ul>

<p>Hookpoints (matched by rough relative depth; late is confounded). Due to intense compute constraints, I’ve been limited to only 2 models at the moment. Will follow up.</p>

<table>
  <thead>
    <tr>
      <th>depth regime</th>
      <th style="text-align: right">70M layer</th>
      <th style="text-align: right">410M layer</th>
      <th>caveat</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>early</td>
      <td style="text-align: right">L1</td>
      <td style="text-align: right">L4</td>
      <td>roughly pre-“canyon”</td>
    </tr>
    <tr>
      <td>mid</td>
      <td style="text-align: right">L3</td>
      <td style="text-align: right">L12</td>
      <td>inside the “canyon” regime</td>
    </tr>
    <tr>
      <td>late</td>
      <td style="text-align: right">L5</td>
      <td style="text-align: right">L20</td>
      <td>70M late is final block; 410M late is not</td>
    </tr>
  </tbody>
</table>

<p><strong>Supplementary sweep (Fast3):</strong> three models at mid depth only, \(k \in \{8, 16, 32, 64\}\) = 12 SAEs. Adds Pythia-160M (<code class="language-plaintext highlighter-rouge">d_model=768</code>) as a bridge point.</p>

<p>SAE class and budgets (both sweeps):</p>

<ul>
  <li>Decoder class: single-layer TopK SAE with <strong>linear decoder</strong>.</li>
  <li>Expansion: \(d_{sae} = 32 \cdot d_{model}\).</li>
  <li>Training budget: 10,000,000 tokens per SAE (budgeted pilot).</li>
  <li>Dataset: streaming <code class="language-plaintext highlighter-rouge">NeelNanda/pile-small-tokenized-2b</code>.</li>
  <li>Eval: 200 held-out sequences of length 256 (≈51k tokens) from the same stream (separated by skipping ahead in the stream), with 95% CIs from a bootstrap over sequences (5,000 resamples).</li>
</ul>

<hr />

<h2 id="the-number-that-started-this">The Number That Started This</h2>

<p>Mid-layer, \(k=8\) (Fast2):</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th style="text-align: right">layer</th>
      <th style="text-align: right">\(R^2\)</th>
      <th style="text-align: right">\(CE_{rec}\) [95% CI]</th>
      <th style="text-align: right">implied \(L_{recon}/L_{orig}\)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">0.807</td>
      <td style="text-align: right">0.340 [0.169, 0.469]</td>
      <td style="text-align: right">1.660</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">12</td>
      <td style="text-align: right">0.961</td>
      <td style="text-align: right">0.235 [0.050, 0.382]</td>
      <td style="text-align: right">1.765</td>
    </tr>
  </tbody>
</table>

<p>So the larger model looks “nearly solved” by variance explained, but patched loss is still much worse than baseline. A blunt translation of \(CE_{rec}=0.235\) is: “patching reconstruction increases loss by about 76.5%.”</p>

<p>Stats caution: the CIs overlap at \(k=8\), so do not treat this single row as decisive. The stronger evidence is the <strong>consistent sign across the full \(k\) sweep</strong> (in this 10M-token/SAE budgeted regime) below.</p>

<hr />

<h2 id="result-1-the-proxy-gap-is-depth-localized">Result 1: The Proxy Gap Is Depth-Localized</h2>

<p>Across depths, the relationship between reconstruction and patched loss behaves differently.</p>

<p>The most direct evidence is the mid-layer delta table. At every \(k\) in this <strong>10M-token/SAE budgeted sweep</strong> (Fast2), scaling from 70M to 410M <strong>increases</strong> \(R^2\) while <strong>decreasing</strong> \(CE_{rec}\):</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">\(k\)</th>
      <th style="text-align: right">\(\Delta R^2\) (410M minus 70M)</th>
      <th style="text-align: right">\(\Delta CE_{rec}\) (410M minus 70M)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">8</td>
      <td style="text-align: right">+0.154</td>
      <td style="text-align: right">-0.106</td>
    </tr>
    <tr>
      <td style="text-align: right">16</td>
      <td style="text-align: right">+0.112</td>
      <td style="text-align: right">-0.068</td>
    </tr>
    <tr>
      <td style="text-align: right">32</td>
      <td style="text-align: right">+0.084</td>
      <td style="text-align: right">-0.017</td>
    </tr>
    <tr>
      <td style="text-align: right">64</td>
      <td style="text-align: right">+0.057</td>
      <td style="text-align: right">-0.027</td>
    </tr>
    <tr>
      <td style="text-align: right">128</td>
      <td style="text-align: right">+0.036</td>
      <td style="text-align: right">-0.032</td>
    </tr>
    <tr>
      <td style="text-align: right">256</td>
      <td style="text-align: right">+0.017</td>
      <td style="text-align: right">-0.037</td>
    </tr>
  </tbody>
</table>

<p>All six rows show the same sign pattern: \(\Delta R^2 &gt; 0\), \(\Delta CE_{rec} &lt; 0\).</p>

<p>The gap is largest at low \(k\) and shrinks substantially by \(k \ge 32\), where the absolute CE delta drops below 0.04. This is consistent with the first ~8-16 features being where variance-capture and loss-sensitivity diverge most: at low budgets, the SAE prioritizes high-variance directions (inflating \(R^2\)) while missing loss-sensitive structure. At higher budgets, there’s enough capacity to cover both.</p>

<p>For reference, correlations between \(R^2\) and \(CE_{rec}\) from the full Fast2 grid:</p>

<table>
  <thead>
    <tr>
      <th>depth</th>
      <th style="text-align: right">corr pooling \(k\) and scale</th>
      <th style="text-align: right">corr within fixed \(k\) (scale-only)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>early</td>
      <td style="text-align: right">+0.907</td>
      <td style="text-align: right">+0.353</td>
    </tr>
    <tr>
      <td>mid</td>
      <td style="text-align: right">+0.474</td>
      <td style="text-align: right"><strong>-0.943</strong></td>
    </tr>
    <tr>
      <td>late</td>
      <td style="text-align: right">+0.702</td>
      <td style="text-align: right">+0.976 (confounded)</td>
    </tr>
  </tbody>
</table>

<p><strong>Important caveat:</strong> <code class="language-plaintext highlighter-rouge">corr within fixed k</code> is computed by subtracting the mean within each \(k\) (i.e. “demeaning by \(k\)”) and then correlating the residuals across all points in that depth (\(n=12\) here = 2 models \(\times\) 6 \(k\) values). With only two model sizes, treat it mainly as a <strong>sign diagnostic</strong> for how scaling moves \(R^2\) vs \(CE_{rec}\) at fixed \(k\), not as a stable statistic. The delta table above is the real evidence.</p>

<hr />

<h2 id="why-this-can-happen-no-alien-neuralese-required">Why This Can Happen (no alien neuralese required)</h2>

<p>The mid-layer sign flip (higher \(R^2\) but worse \(CE_{rec}\) at fixed \(k\)) is explainable with two mundane facts, plus one important confound.</p>

<h3 id="1-r2-is-variance-normalized-and-variance-scale-changes-with-model-size">1) \(R^2\) is variance-normalized, and variance scale changes with model size</h3>

<p>Mid-layer, \(k=8\):</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th style="text-align: right">mse mean</th>
      <th style="text-align: right">relative error norm</th>
      <th style="text-align: right">\(R^2\)</th>
      <th style="text-align: right">SSE/token</th>
      <th style="text-align: right">SST/token</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M</td>
      <td style="text-align: right">0.0854</td>
      <td style="text-align: right">0.509</td>
      <td style="text-align: right">0.807</td>
      <td style="text-align: right">43.7</td>
      <td style="text-align: right">226.7</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">0.1575</td>
      <td style="text-align: right">0.576</td>
      <td style="text-align: right">0.961</td>
      <td style="text-align: right">161.3</td>
      <td style="text-align: right">4131.1</td>
    </tr>
  </tbody>
</table>

<p>Here SSE/token = mse_mean times d_model and SST/token is backed out from \(R^2\) via \(R^2 = 1 - \mathrm{SSE}/\mathrm{SST}\). The 410M mid layer has about 18x larger mean-centered variance scale (SST/token), so variance explained can look great even when absolute errors are not small.</p>
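<p>The backed-out columns are a two-line computation, and running it reproduces the roughly 18x variance-scale gap:</p>

```python
def sse_per_token(mse_mean, d_model):
    return mse_mean * d_model

def sst_per_token(mse_mean, d_model, r2):
    # From R^2 = 1 - SSE/SST  =>  SST = SSE / (1 - R^2).
    return sse_per_token(mse_mean, d_model) / (1.0 - r2)

sst_70m  = sst_per_token(0.0854, 512, 0.807)    # ~226.5
sst_410m = sst_per_token(0.1575, 1024, 0.961)   # ~4135
ratio = sst_410m / sst_70m                      # ~18x
```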

<p><strong>Prediction:</strong> metrics that do <em>not</em> divide by SST (e.g. cosine similarity, relative error norm, raw MSE) should not show the same “looks solved” inflation. And they don’t. At mid-layer \(k=8\) (Fast3 three-model check):</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th style="text-align: right">\(R^2\)</th>
      <th style="text-align: right">cosine sim</th>
      <th style="text-align: right">relative error norm</th>
      <th style="text-align: right">\(CE_{rec}\)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M</td>
      <td style="text-align: right">0.810</td>
      <td style="text-align: right">0.856</td>
      <td style="text-align: right">0.506</td>
      <td style="text-align: right">0.372</td>
    </tr>
    <tr>
      <td>160M</td>
      <td style="text-align: right">0.907</td>
      <td style="text-align: right">0.869</td>
      <td style="text-align: right">0.485</td>
      <td style="text-align: right">0.331</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">0.961</td>
      <td style="text-align: right"><strong>0.809</strong></td>
      <td style="text-align: right"><strong>0.574</strong></td>
      <td style="text-align: right">0.269</td>
    </tr>
  </tbody>
</table>

<p>\(R^2\) improves monotonically with scale. Cosine similarity and relative error norm both show the 410M reconstruction is <strong>worse</strong>, consistent with \(CE_{rec}\). The proxy gap here is specific to variance-normalized reconstruction metrics like \(R^2\).</p>

<h3 id="2-loss-sensitivity-weights-directions-differently-than-covariance-geometry">2) Loss sensitivity weights directions differently than covariance geometry</h3>

<p>Let \(\hat{a} = a + \delta\) be the reconstruction. For downstream loss \(L(a)\), a first-order Taylor approximation gives:</p>

\[L(\hat{a}) - L(a) \approx g^\top \delta, \quad g := \nabla_a L.\]

<p>Reconstruction metrics weight directions by the activation distribution (covariance). Loss change weights directions by sensitivity \(g\) (and, beyond first order, curvature). If sensitivity mass lives in comparatively low-variance directions, you can have high \(R^2\) and still hurt loss.</p>

<p>This is the “proxy gap” mechanism in one sentence: <strong>MSE/variance and loss sensitivity are different measures on activation space.</strong></p>
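<p>A two-dimensional caricature of the mechanism: put almost all of the variance in a direction the loss ignores, let the “SAE” keep only that direction, and \(R^2\) looks nearly solved while the first-order loss shift stays large (all numbers synthetic):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
a = np.stack([10.0 * rng.standard_normal(n),    # high-variance, loss-insensitive
              0.5 * rng.standard_normal(n)],    # low-variance, loss-sensitive
             axis=1)
g = np.array([0.0, 4.0])                        # loss gradient lives in dim 1 only

a_hat = a.copy()
a_hat[:, 1] = 0.0                               # keep only the high-variance direction
delta = a_hat - a

r2 = 1 - np.sum(delta ** 2) / np.sum((a - a.mean(axis=0)) ** 2)   # high, ~0.9975
loss_shift_sq = np.mean((delta @ g) ** 2)                          # ~4.0, not small
```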

<h3 id="confound-fixed-sae-training-budget-undertrains-larger-saes">Confound: fixed SAE training budget undertrains larger SAEs</h3>

<p>In these sweeps, every SAE gets the same 10M-token training budget, while \(d_{sae} \propto d_{model}\). Empirically, alive fraction often falls with model size (especially early/mid at low \(k\)) under this fixed budget. That is consistent with larger SAEs being relatively more undertrained, which can degrade behavior preservation and direction-sensitive reconstruction metrics even when \(R^2\) looks strong. (See Limitations and the 100M-token check below.)</p>

<hr />

<h2 id="result-2-three-model-check-fast3-mid-layer-only">Result 2: Three-Model Check (Fast3, Mid-Layer Only)</h2>

<p>A separate mid-only sweep adds Pythia-160M as a third point:</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th style="text-align: right">\(d_{model}\)</th>
      <th style="text-align: right">\(k\)</th>
      <th style="text-align: right">\(R^2\)</th>
      <th style="text-align: right">\(CE_{rec}\) [95% CI]</th>
      <th style="text-align: right">cosine</th>
      <th style="text-align: right">alive %</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M</td>
      <td style="text-align: right">512</td>
      <td style="text-align: right">8</td>
      <td style="text-align: right">0.810</td>
      <td style="text-align: right">0.372 [0.232, 0.478]</td>
      <td style="text-align: right">0.856</td>
      <td style="text-align: right">15.2%</td>
    </tr>
    <tr>
      <td>160M</td>
      <td style="text-align: right">768</td>
      <td style="text-align: right">8</td>
      <td style="text-align: right">0.907</td>
      <td style="text-align: right">0.331 [0.122, 0.477]</td>
      <td style="text-align: right">0.869</td>
      <td style="text-align: right">12.9%</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">1024</td>
      <td style="text-align: right">8</td>
      <td style="text-align: right">0.961</td>
      <td style="text-align: right">0.269 [0.094, 0.404]</td>
      <td style="text-align: right">0.809</td>
      <td style="text-align: right">11.9%</td>
    </tr>
    <tr>
      <td>70M</td>
      <td style="text-align: right">512</td>
      <td style="text-align: right">16</td>
      <td style="text-align: right">0.855</td>
      <td style="text-align: right">0.580 [0.439, 0.671]</td>
      <td style="text-align: right">0.893</td>
      <td style="text-align: right">37.1%</td>
    </tr>
    <tr>
      <td>160M</td>
      <td style="text-align: right">768</td>
      <td style="text-align: right">16</td>
      <td style="text-align: right">0.924</td>
      <td style="text-align: right">0.538 [0.369, 0.653]</td>
      <td style="text-align: right">0.908</td>
      <td style="text-align: right">28.5%</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">1024</td>
      <td style="text-align: right">16</td>
      <td style="text-align: right">0.968</td>
      <td style="text-align: right">0.530 [0.394, 0.634]</td>
      <td style="text-align: right">0.883</td>
      <td style="text-align: right">25.4%</td>
    </tr>
    <tr>
      <td>70M</td>
      <td style="text-align: right">512</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.891</td>
      <td style="text-align: right">0.729 [0.622, 0.796]</td>
      <td style="text-align: right">0.920</td>
      <td style="text-align: right">63.2%</td>
    </tr>
    <tr>
      <td>160M</td>
      <td style="text-align: right">768</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.938</td>
      <td style="text-align: right">0.696 [0.569, 0.780]</td>
      <td style="text-align: right">0.932</td>
      <td style="text-align: right">51.1%</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">1024</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.974</td>
      <td style="text-align: right">0.723 [0.618, 0.801]</td>
      <td style="text-align: right">0.919</td>
      <td style="text-align: right">51.4%</td>
    </tr>
    <tr>
      <td>70M</td>
      <td style="text-align: right">512</td>
      <td style="text-align: right">64</td>
      <td style="text-align: right">0.922</td>
      <td style="text-align: right">0.836 [0.758, 0.882]</td>
      <td style="text-align: right">0.944</td>
      <td style="text-align: right">83.8%</td>
    </tr>
    <tr>
      <td>160M</td>
      <td style="text-align: right">768</td>
      <td style="text-align: right">64</td>
      <td style="text-align: right">0.951</td>
      <td style="text-align: right">0.809 [0.712, 0.871]</td>
      <td style="text-align: right">0.934</td>
      <td style="text-align: right">71.2%</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">1024</td>
      <td style="text-align: right">64</td>
      <td style="text-align: right">0.979</td>
      <td style="text-align: right">0.814 [0.730, 0.875]</td>
      <td style="text-align: right">0.901</td>
      <td style="text-align: right">65.0%</td>
    </tr>
  </tbody>
</table>

<p>Three observations:</p>

<ol>
  <li>
    <p><strong>At \(k=8\) and \(k=16\), the pattern holds across three models:</strong> \(R^2\) increases monotonically with scale while \(CE_{rec}\) decreases monotonically.</p>
  </li>
  <li>
    <p><strong>At \(k \ge 32\), monotonicity breaks for \(CE_{rec}\).</strong> At \(k=32\): 410M (\(CE_{rec}=0.723\)) recovers more behavior than 160M (0.696). At \(k=64\): 410M (0.814) slightly outperforms 160M (0.809). The proxy gap is a <strong>low-\(k\) phenomenon</strong>.</p>
  </li>
  <li>
    <p><strong>In Fast3 (mid-only), all pairwise CIs overlap at every \(k\).</strong> At \(k=8\) (the largest gap): 70M CI = [0.232, 0.478], 410M CI = [0.094, 0.404]. Per-batch variability is large — CE standard deviations range from 0.36 to 0.93 across conditions. Cohen’s \(d\) for the 70M-vs-410M difference at \(k=8\) is approximately 0.14. The consistent sign across \(k\) values is more informative than any individual comparison.</p>
  </li>
</ol>
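<p>The effect size quoted above is standard Cohen’s \(d\) with a pooled standard deviation; a self-contained sketch (toy arrays with the \(k=8\) means and a representative SD, not the sweep’s actual batches):</p>

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d with pooled (Bessel-corrected) standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# With per-batch SDs around 0.7, a mean gap of ~0.1 is a small effect.
rng = np.random.default_rng(0)
ce_70m = rng.normal(loc=0.372, scale=0.7, size=400)
ce_410m = rng.normal(loc=0.269, scale=0.7, size=400)
print(f"d = {cohens_d(ce_70m, ce_410m):.2f}")
```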

<p>For clarity: the CI overlap does <strong>not</strong> extend to Fast2 late depth, where 70M-vs-410M CE CIs are non-overlapping at \(k=32,64,128,256\).</p>

<hr />

<h2 id="result-3-there-is-a-measurable-k-tax-earlymid">Result 3: There Is a Measurable \(k^*\) Tax (Early/Mid)</h2>

<p>To make scaling concrete, define \(k^*(t)\) as the minimum \(k\) needed to reach \(CE_{rec} \ge t\) (linear interpolation on the Fast2 \(k\) grid; single-seed, no uncertainty estimate on the interpolation itself).</p>
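<p>The interpolation is nothing exotic; a sketch of the \(k^*\) computation on an illustrative grid (not the sweep’s values), assuming \(CE_{rec}\) increases monotonically in \(k\):</p>

```python
import numpy as np

def k_star(ks, ce_rec, target):
    """Minimum k reaching CE_rec >= target by linear interpolation on the
    (k, CE_rec) grid; returns None if the target is never reached.
    Assumes ce_rec is monotonically increasing in k (np.interp requires
    increasing x-coordinates)."""
    ks = np.asarray(ks, dtype=float)
    ce_rec = np.asarray(ce_rec, dtype=float)
    if ce_rec.max() < target:
        return None
    return float(np.interp(target, ce_rec, ks))

ks = [8, 16, 32, 64, 128]              # illustrative, not the paper's grid
ce = [0.34, 0.58, 0.73, 0.84, 0.91]
print(round(k_star(ks, ce, 0.85), 1))  # 73.1
```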

<table>
  <thead>
    <tr>
      <th>depth</th>
      <th style="text-align: right">target \(t\)</th>
      <th style="text-align: right">\(k^*\)(70M)</th>
      <th style="text-align: right">\(k^*\)(410M)</th>
      <th style="text-align: right">ratio</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>early</td>
      <td style="text-align: right">0.85</td>
      <td style="text-align: right">40.6</td>
      <td style="text-align: right">55.0</td>
      <td style="text-align: right">1.36</td>
    </tr>
    <tr>
      <td>early</td>
      <td style="text-align: right">0.90</td>
      <td style="text-align: right">62.0</td>
      <td style="text-align: right">106.9</td>
      <td style="text-align: right">1.72</td>
    </tr>
    <tr>
      <td>mid</td>
      <td style="text-align: right">0.85</td>
      <td style="text-align: right">71.2</td>
      <td style="text-align: right">98.0</td>
      <td style="text-align: right">1.38</td>
    </tr>
    <tr>
      <td>mid</td>
      <td style="text-align: right">0.90</td>
      <td style="text-align: right">116.5</td>
      <td style="text-align: right">187.6</td>
      <td style="text-align: right">1.61</td>
    </tr>
  </tbody>
</table>

<p>This is not a blow-up (at these scales), but it is also not zero. Under a fixed decoder class and a fixed tool-training budget, larger scale can require more active features to preserve loss. However, see the alive-fraction caveat in Limitations — this may partly reflect undertraining rather than intrinsic difficulty.</p>

<hr />

<h2 id="geometric-context-a-raw-pr-canyon">Geometric Context: A Raw PR “Canyon”</h2>

<p>Recall: this is diagnostic, not the primary claim.</p>

<p>I computed raw PR of <code class="language-plaintext highlighter-rouge">hook_resid_post</code> across layers.</p>
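<p>For concreteness, participation ratio is \(PR = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2\) over covariance eigenvalues; a sketch on synthetic data (note that “raw” PR conventions differ on centering):</p>

```python
import numpy as np

def participation_ratio(acts: np.ndarray) -> float:
    """PR = (sum lam)^2 / sum lam^2 over covariance eigenvalues.
    Ranges from 1 (one dominant direction) to d (isotropic)."""
    centered = acts - acts.mean(axis=0)   # conventions differ on centering
    lam = np.linalg.eigvalsh(np.cov(centered, rowvar=False))
    lam = np.clip(lam, 0.0, None)         # guard tiny negative eigenvalues
    return float(lam.sum() ** 2 / (lam ** 2).sum())

rng = np.random.default_rng(0)
iso = rng.normal(size=(5000, 64))         # isotropic: PR near d = 64
aniso = iso * np.r_[30.0, np.ones(63)]    # one dominant direction: PR near 1
print(round(participation_ratio(iso), 1), round(participation_ratio(aniso), 1))
```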

<p>Selected points:</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th style="text-align: right">layer</th>
      <th style="text-align: right">raw PR</th>
      <th style="text-align: right">PR/\(d_{model}\)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">51.914</td>
      <td style="text-align: right">0.101</td>
    </tr>
    <tr>
      <td>70M</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">3.961</td>
      <td style="text-align: right">0.008</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">4</td>
      <td style="text-align: right">111.967</td>
      <td style="text-align: right">0.109</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">5</td>
      <td style="text-align: right">2.045</td>
      <td style="text-align: right">0.002</td>
    </tr>
    <tr>
      <td>410M</td>
      <td style="text-align: right">12</td>
      <td style="text-align: right">1.177</td>
      <td style="text-align: right">0.001</td>
    </tr>
  </tbody>
</table>

<p>So both models show a sharp drop into a long low-PR band, with 410M exhibiting an especially extreme canyon (\(PR \approx 1\) for many layers).</p>

<p>Two important caveats:</p>

<ul>
  <li>Raw PR can be dominated by a leading direction. Standard practice is to “deflate” (remove) the top eigenvector and recompute PR. When you do that, the absolute PR values change a lot, but the “expand then compress” depth profile often remains. That’s a known phenomenon, and I’m not claiming novelty on it.</li>
  <li>In this post, PR is a <strong>warning label</strong>: “this is where variance-normalized reconstruction proxies are likely to lie.” The mid-layer proxy gap above happens inside this low-PR regime.</li>
</ul>
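<p>Deflation itself is cheap: project out the top covariance eigendirection(s) and recompute PR. A self-contained sketch on synthetic data, showing how one dominant direction can pin raw PR near 1 while the deflated PR recovers the bulk dimensionality:</p>

```python
import numpy as np

def pr(acts: np.ndarray) -> float:
    lam = np.linalg.eigvalsh(np.cov(acts - acts.mean(0), rowvar=False))
    lam = np.clip(lam, 0.0, None)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

def deflate_top(acts: np.ndarray, n_remove: int = 1) -> np.ndarray:
    """Remove the top-n covariance eigendirections (eigh sorts ascending)."""
    centered = acts - acts.mean(0)
    _, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    top = vecs[:, -n_remove:]
    return centered - (centered @ top) @ top.T

rng = np.random.default_rng(0)
acts = rng.normal(size=(5000, 64)) * np.r_[30.0, np.ones(63)]
print(round(pr(acts), 1), round(pr(deflate_top(acts)), 1))  # ~1.1 then ~63
```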

<p>I haven’t included deflated PR plots yet because my current RunPod environment was unstable (numpy/datasets stack broke). I’ll add deflated PR once it’s cleanly reproducible.</p>

<hr />

<h2 id="information-theoretic-framing">Information-Theoretic Framing</h2>

<p>If you squint, an SAE is a lossy code. Then “interpretability under budgets” starts looking like <strong>rate-distortion</strong>:</p>

<ul>
  <li>Rate: bits/token of your sparse code (I log a crude proxy).</li>
  <li>Distortion: either reconstruction distortion (\(1-R^2\)) or behavior distortion (\(1-CE_{rec}\)).</li>
</ul>
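<p>One plausible way to construct such a rate proxy (my construction for illustration, not necessarily the one logged in the sweeps): count the bits needed to name which \(k\) of \(d_{sae}\) features fire, plus a fixed bit width per nonzero coefficient.</p>

```python
import math

def rate_proxy_bits_per_token(k: int, d_sae: int, value_bits: int = 16) -> float:
    """Crude bits/token proxy for a TopK sparse code:
    log2 C(d_sae, k) index bits + value_bits per active coefficient."""
    index_bits = math.log2(math.comb(d_sae, k))
    return index_bits + k * value_bits

for k in (8, 32, 128):
    print(k, round(rate_proxy_bits_per_token(k, d_sae=4096), 1))
```

<p>Under this proxy, sweeping \(k\) moves you along a real rate axis, so the \(k^*\) tax above can be re-read as extra bits/token required to hit a loss target.</p>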

<p>The practical point is not that “mutual information is falling” (I did not measure MI). The point is that <strong>you can put a real budget axis on the x-axis</strong> instead of just reporting \(k\) or \(R^2\).</p>

<p>This also connects (loosely) to recent “bounded observer” information frameworks, such as Finzi et al. (2026) on epiplexity. The common theme is: the same object can contain structure that is “there,” but not extractable by a weaker observer/tool class under a budget. In this sweep, a fixed SAE class looks better by \(R^2\) but worse by patched loss in mid layers as scale increases, which is at least qualitatively consistent with “extractable structure for this tool class” degrading even when “variance structure” looks easy.</p>

<p>I am not claiming a formal reduction from SAE failure modes to epiplexity or time-bounded entropy. I’m using it as a framing: interpretability tools are observers with constraints, and proxy metrics can hide when you’re falling out of the extractable regime.</p>

<hr />

<h2 id="relevance">Relevance?</h2>

<p>Given that PR/anisotropy phenomena are already documented, any novel claim here is fairly narrow. Still, the sweeps show:</p>

<ul>
  <li>A depth-localized, reproducible proxy gap where \(R^2\) can improve with scale at fixed \(k\) while patched loss gets worse — <strong>strongest at low \(k\), closing by \(k \ge 32\)</strong>.</li>
  <li>A concrete mechanism (SST inflation in anisotropic layers) that <strong>predicts which reconstruction metrics are affected</strong>: \(R^2\) yes, cosine/relative error norm no — but the deeper point is that reconstruction metrics in general can diverge from behavioral ones.</li>
  <li>An operational “interpretability budget” object (\(k^*\) for loss targets, plus rate proxies) that moves with depth and scale.</li>
  <li>A concrete diagnostic (raw PR canyon) that flags where reconstruction-only evaluation is especially untrustworthy.</li>
</ul>

<p>If you already believe “reconstruction isn’t behavior,” you may find this useful as an attempt to make that belief operational, with a knob you can sweep and a failure mode you can reproduce.</p>

<p><strong>Practical methodology tweak:</strong> if you’re evaluating SAE quality, behavioral metrics like patched CE loss or ablation-normalized CE recovery (e.g. SAE Lens’s <code class="language-plaintext highlighter-rouge">ce_loss_score</code>) should be the acceptance test — not reconstruction fidelity alone. Among reconstruction metrics, cosine similarity or relative error norm are more robust than \(R^2\) in mid-layer residual streams, but they still measure reconstruction, not behavior.</p>
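<p>For reference, the ablation-normalized convention looks like this (a sketch of the standard formula; the exact SAE Lens implementation may differ in its choice of ablation baseline):</p>

```python
def ce_recovered(ce_clean: float, ce_patched: float, ce_ablate: float) -> float:
    """Ablation-normalized CE recovery: 1.0 = patching in the SAE
    reconstruction fully restores the clean loss; 0.0 = no better than
    ablating the hook (e.g. zero-ablation)."""
    return (ce_ablate - ce_patched) / (ce_ablate - ce_clean)

# Toy numbers: clean CE 3.0, zero-ablated CE 5.0, SAE-patched CE 3.2
# -> 90% of the ablation gap is recovered.
print(round(ce_recovered(3.0, 3.2, 5.0), 3))  # 0.9
```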

<hr />

<h2 id="limitations-things-a-reviewer-should-hit-me-for">Limitations (Things A Reviewer Should Hit Me For)</h2>

<ul>
  <li><strong>Two model sizes in the main sweep</strong> is not a scaling law fit. Fast3 adds a third mid-layer point but does not span depths.</li>
  <li><strong>One SAE training seed per condition.</strong> Bootstrap CIs reflect eval-batch variability, not training variability. In Fast3 mid-only, all pairwise CIs overlap at every \(k\); in Fast2 late depth, 70M-vs-410M CIs do not overlap at \(k=32,64,128,256\).</li>
  <li><strong>Fixed tool-training budget (10M tokens/SAE).</strong> These are budgeted curves, not best-achievable. Alive fraction <strong>often decreases with model size</strong> in early/mid at low \(k\) under this budget (with exceptions at higher \(k\) / different depths), consistent with larger SAEs being relatively more undertrained under the same token budget. The observed \(CE_{rec}\) gap may partly reflect differential undertraining rather than intrinsic representation difficulty. See the 100M-token check below.</li>
  <li><strong>Late depth is confounded</strong> (final block vs non-final), and late 70M runs show very low alive fractions consistent with undertraining.</li>
  <li><strong>My \(CE_{rec}\) definition is nonstandard.</strong> It is not the ablation-normalized “CE loss recovered” used in Bricken et al. (2023), Gao et al. (2024), and SAE Lens. Do not compare magnitudes without converting. See Definitions section.</li>
  <li><strong>The \(R^2\) inflation is specific to variance-normalized metrics.</strong> Cosine similarity and relative error norm agree with \(CE_{rec}\) that larger-model reconstructions are worse at fixed \(k\). But more broadly, reconstruction metrics and behavioral metrics can diverge — behavioral acceptance tests (patched CE, ablation-normalized CE recovery) are the more direct measure of what matters.</li>
</ul>

<h2 id="update-a-100m-token-mid-layer-k32-check-token-budget-sensitivity">Update: A 100M-Token Mid-Layer \(k=32\) Check (Token-Budget Sensitivity)</h2>

<p>To probe the “fixed token budget” confound, I retrained the mid-layer \(k=32\) TopK SAE for both models with a 10x larger training budget (100M tokens/SAE; SAE Lens v6.36.0, context size 256). Compared to the budgeted Fast2 runs:</p>

<table>
  <thead>
    <tr>
      <th>model</th>
      <th style="text-align: right">train tokens</th>
      <th style="text-align: right">\(k\)</th>
      <th style="text-align: right">\(CE_{rec}\)</th>
      <th style="text-align: right">implied \(L_{recon}/L_{orig}\)</th>
      <th style="text-align: right">alive %</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M (Fast2)</td>
      <td style="text-align: right">10M</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.731</td>
      <td style="text-align: right">1.269</td>
      <td style="text-align: right">61.7%</td>
    </tr>
    <tr>
      <td>410M (Fast2)</td>
      <td style="text-align: right">10M</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.714</td>
      <td style="text-align: right">1.286</td>
      <td style="text-align: right">50.1%</td>
    </tr>
    <tr>
      <td>70M (100M)</td>
      <td style="text-align: right">100M</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.898</td>
      <td style="text-align: right">1.102</td>
      <td style="text-align: right">94.7%</td>
    </tr>
    <tr>
      <td>410M (100M)</td>
      <td style="text-align: right">100M</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.920</td>
      <td style="text-align: right">1.080</td>
      <td style="text-align: right">94.1%</td>
    </tr>
  </tbody>
</table>

<p>This suggests the mid-layer \(k=32\) “scale hurts \(CE_{rec}\)” effect in the 10M-token sweep is at least partly an undertraining artifact: alive fraction jumps, behavior preservation improves sharply, and the scale ordering flips. This does <strong>not</strong> test the low-\(k\) proxy gap (where the original effect is strongest).</p>

<hr />

<h2 id="update-a-50m-token-mid-layer-k8-anchor-in-tree">Update: A 50M-Token Mid-Layer \(k=8\) Anchor (In-Tree)</h2>

<p>To add a low-\(k\) budget anchor in the same regime as the main headline, I trained a 70M mid-layer \(k=8\) SAE for 50M tokens (in-tree artifact):</p>

<p><code class="language-plaintext highlighter-rouge">interpretability/interpretability/workspace/results/budget_check_m4_mid_k8_50M/run_pythia-70m_L3_mid_k8.json</code></p>

<p>Compared to the 10M Fast2 anchor:</p>

<table>
  <thead>
    <tr>
      <th>model/depth/k</th>
      <th style="text-align: right">train tokens</th>
      <th style="text-align: right">\(CE_{rec}\) [95% CI]</th>
      <th style="text-align: right">\(R^2\)</th>
      <th style="text-align: right">alive %</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>70M mid \(k=8\) (Fast2)</td>
      <td style="text-align: right">10M</td>
      <td style="text-align: right">0.340 [0.169, 0.469]</td>
      <td style="text-align: right">0.807</td>
      <td style="text-align: right">13.2%</td>
    </tr>
    <tr>
      <td>70M mid \(k=8\) (new)</td>
      <td style="text-align: right">50M</td>
      <td style="text-align: right">0.670 [0.629, 0.706]</td>
      <td style="text-align: right">0.863</td>
      <td style="text-align: right">44.2%</td>
    </tr>
  </tbody>
</table>

<p>This is a large improvement in both behavior preservation and feature utilization under higher training budget, which strengthens the undertraining caveat for low-\(k\) settings.</p>

<hr />

<h2 id="next-steps-high-leverage">Next Steps (High Leverage)</h2>

<ol>
  <li><strong>Bridge model at all depths:</strong> add Pythia-160M at early/mid/late to get three-point depth-resolved comparisons.</li>
  <li><strong>Token-budget sensitivity:</strong> extend low-\(k\) checks to include 410M (and 160M) at 50M/100M, and add early/late anchors, to separate undertraining from intrinsic difficulty and equalize alive fractions across scale.</li>
  <li><strong>Deflated PR and anisotropy controls:</strong> mean subtraction, top-eigen removal, to tighten the geometry story.</li>
  <li><strong>Sensitivity-weighted distortions:</strong> Fisher/Hessian approximations to predict loss impact better than \(R^2\).</li>
  <li><strong>Legibility evaluation:</strong> SAEBench, MDL-style probing, to connect fidelity to human-usable features.</li>
</ol>

<hr />

<h2 id="repro-minimal">Repro (Minimal)</h2>

<p>Fast2 results: <code class="language-plaintext highlighter-rouge">interpretability/workspace/results/k_scaling_early-mid-late_fast2/</code></p>

<p>Fast3 results: <code class="language-plaintext highlighter-rouge">interpretability/workspace/results/k_scaling_mid_fast3/</code></p>

<p>50M-token mid-layer \(k=8\) anchor (in-tree): <code class="language-plaintext highlighter-rouge">interpretability/interpretability/workspace/results/budget_check_m4_mid_k8_50M/</code></p>

<p>100M-token mid-layer \(k=32\) check: <code class="language-plaintext highlighter-rouge">/Users/oboh/Downloads/experiment_results.tar.gz</code> (metrics in <code class="language-plaintext highlighter-rouge">results/*.json</code>).</p>

<p>Regenerate the tables:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python3 interpretability/analyze_k_scaling_results.py <span class="se">\</span>
  interpretability/workspace/results/k_scaling_early-mid-late_fast2 <span class="se">\</span>
  <span class="nt">--include-legibility</span> <span class="se">\</span>
  <span class="nt">--markdown-out</span> interpretability/workspace/results/k_scaling_early-mid-late_fast2/writeup_ready_tables.md
</code></pre></div></div>

<hr />

<h2 id="references-non-exhaustive">References (Non-Exhaustive)</h2>

<ul>
  <li>Bricken et al. (2023). <em>Towards Monosemanticity: Decomposing Language Models With Dictionary Learning</em>. Anthropic.</li>
  <li>Cunningham, Ewart, Riggs, Huben and Sharkey (2023). <em>Sparse Autoencoders Find Highly Interpretable Features in Language Models</em>. ICLR 2024.</li>
  <li>Gao et al. (2024). <em>Scaling and Evaluating Sparse Autoencoders</em>. arXiv:2406.04093.</li>
  <li>Elhage et al. (2022). <em>Toy Models of Superposition</em>. arXiv:2209.10652.</li>
  <li>Finzi et al. (2026). <em>From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence</em>. arXiv:2601.03220.</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Weird</title><link href="https://obiohagwu.github.io/obiohagwu/2026/02/01/Weird.html" rel="alternate" type="text/html" title="Weird" /><published>2026-02-01T00:00:00+00:00</published><updated>2026-02-01T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2026/02/01/Weird</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2026/02/01/Weird.html"><![CDATA[<h1 id="taste-air">TASTE? AIR?</h1>

<p>Things are getting stranger. You feel it?</p>]]></content><author><name></name></author><summary type="html"><![CDATA[TASTE? AIR?]]></summary></entry><entry><title type="html">Hmm</title><link href="https://obiohagwu.github.io/obiohagwu/2025/12/31/Hmm.html" rel="alternate" type="text/html" title="Hmm" /><published>2025-12-31T00:00:00+00:00</published><updated>2025-12-31T00:00:00+00:00</updated><id>https://obiohagwu.github.io/obiohagwu/2025/12/31/Hmm</id><content type="html" xml:base="https://obiohagwu.github.io/obiohagwu/2025/12/31/Hmm.html"><![CDATA[<hr />

<p>Hello. It’s been about 2 and a half years since I made one of these. It’s good to look back and reflect on life really. 
<br /></p>

<p>I started writing these initially on Substack, although I do have a bunch of unfinished posts, as is my signature style. 
<br /></p>

<p>I’m not sure what to say really. I do have a lot of new things going on vs when I was 17 in the heat of the pandemic dealing with all sorts of internal consternation and tumult. Things are materially much better now though in certain ways. Not absolutely.</p>

<p><img width="1980" height="1226" alt="image" src="https://github.com/user-attachments/assets/7d0bc4a1-3c7f-4d1d-a116-806ed8522119" /></p>

<hr />

<p>Well, I’m about to undertake my 2nd serious business venture. Actually I’d say 3rd. Aside from my current nascent AI/GRAPHICS service thing.
<br /></p>

<p>So I guess in regards to these little hobbies of mine I have made some material progress, although all this has been mostly within this year of 2025, and mostly within the second half of it as well.</p>

<hr />

<p>I think I may have been in some sort of unpleasant melancholic basin; I do believe I am making good step sizes and almost see the light. 
<br /></p>

<p>I think I see a way out. A way to win. Hopefully my sins don’t catch up to me before I can achieve actualization. As a true Catholic, I still can’t escape guilt-driven morality.
<br /></p>

<p>Happy new year to myself. Let’s see what can be done, amen.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[title: Hmm date: 2025-12-31 kind: diary —]]></summary></entry></feed>