Reading the Machine
Automating AI Understanding
Automated Interpretability: A 90-Minute Deep Dive into AI’s Own Mechanics
A recurring theme in AI development is the question of when systems will be able to meaningfully contribute to their own improvement. The recursive self-improvement hypothesis holds that this creates a compounding feedback loop. A recent 90-minute exercise in mechanistic interpretability gave me a concrete, if small, data point on where that stands today.
For more see jasonsteiner.xyz or find me on LinkedIn
Interpreting Inscrutable Numbers
Modern AI systems are composed of billions of numbers strung together into massively long equations. They are very difficult to interpret, yet understanding what is actually happening inside these models is central to AI alignment and safety, and to our ability to create systems that behave in ways we can predict and understand. Large models are not so much programmed as they are grown, and indeed the analogy to biology is common among interpretability researchers.
There is a natural curiosity I've had about this field for a while. It was indulged marvelously by the beautiful illustrations from Welch Labs in "The most complex model we actually understand," which breaks down in detail the idea of "grokking" in neural networks, described by Neel Nanda et al. in the seminal 2022 paper on the grokking behavior of a neural network trained to perform modular arithmetic. It's quite a beautiful video, paper, and phenomenon. I was interested in seeing what I could do.
An Even Simpler Model
Browsing X recently, I ran across a post from a Microsoft researcher who posed a simple question to an agentic coding system: "train the smallest possible transformer that can do 10-digit addition". The result was a surprisingly small 777-parameter network composed of a single attention layer and a hidden dimension of just 7. The post was written up in a paper that describes how the model was found, and the code was posted. This seemed like an interesting case to explore.
Like any good AI practitioner, I started with the standard tools. What followed was about 90 minutes of investigation that produced a rather detailed mechanistic interpretation of how this 777-parameter network actually performs 10-digit addition. Most of the work was done by the AI.
Apart from the actual learnings about the model and the process of investigating mechanistic interpretability, the most striking takeaways were the speed and autonomy of the analysis.
While this particular model is simple, it is quite easy to see how AI systems will begin (or already have begun) to write their own code, improve their own architectures, and understand their own mechanics.
The Basic Steps
My process was strikingly simple. I didn't even use a coding agent, just a back-and-forth with a Claude chat. It's a process I use to understand most new concepts in AI or software, and it goes roughly as follows:
If there is a GitHub repo, you can swap the "hub" in the URL for the word "ingest" (github.com becomes gitingest.com), and it will take you to a page that parses the whole repo into a single text digest. You can choose which file types to include or exclude if context windows become an issue. Do this for any repo you want to understand, "Copy All" into a chatbot, and start asking questions. It's incredibly useful.
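The URL swap is mechanical enough to script. A minimal sketch (the example repo path is a placeholder, not the actual repo from the post):

```python
def to_gitingest(repo_url: str) -> str:
    """Turn a GitHub repo URL into its gitingest equivalent
    by swapping 'github.com' for 'gitingest.com'."""
    return repo_url.replace("github.com", "gitingest.com", 1)

print(to_gitingest("https://github.com/user/repo"))
# → https://gitingest.com/user/repo
```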
I dropped the Nanda et al. paper on grokking and the paper from X into the context as well for good measure.
I asked what a mechanistic interpretability evaluation of the model described would look like.
I didn't have the trained model weights, so the LLM drafted a brief script that stripped the code of all unnecessary bells and whistles, leaving just a basic training setup for 10-digit addition. This ran for a few minutes, and I had a trained model (it's very small and simple).
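The actual training script came from the paper's repo, but the data side of such a setup is simple to sketch. A hedged illustration, assuming digit-level tokenization with zero-padded operands (the paper's real tokenization, e.g. digit ordering or separator tokens, may differ):

```python
import random

def make_example(n_digits: int = 10) -> tuple[list[int], list[int]]:
    """Sample one addition problem as digit sequences.
    The input is the two zero-padded operands' digits; the target is
    the (n_digits + 1)-digit sum, padded for a possible final carry."""
    a = random.randrange(10 ** n_digits)
    b = random.randrange(10 ** n_digits)
    inputs = [int(d) for d in f"{a:0{n_digits}d}"] + \
             [int(d) for d in f"{b:0{n_digits}d}"]
    target = [int(d) for d in f"{a + b:0{n_digits + 1}d}"]
    return inputs, target

xs, ys = make_example()
assert len(xs) == 20 and len(ys) == 11  # 10 + 10 input digits, 11 output digits
```

A training loop would then batch these sequences and fit the one-attention-layer model against the target digits.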
From there, it was a series of copy/paste code blocks and results into a Jupyter notebook to investigate the mechanics. This was directed and interpreted by Claude. Apart from my own learning process, I was largely an ancillary pair of meat hands.
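The probing loops Claude directed followed a common interpretability pattern: run inputs through the model while caching intermediate activations, then inspect the cache. A toy pure-Python illustration of that pattern (a real run would use forward hooks on the trained PyTorch model; the two "layers" here are hypothetical stand-ins):

```python
# Minimal activation-caching pattern: each "layer" is a function,
# and a wrapper records its output so it can be probed afterwards.
cache = {}

def record(name, fn):
    def wrapped(x):
        out = fn(x)
        cache[name] = out   # stash the intermediate activation
        return out
    return wrapped

# Toy two-stage "model": stand-ins for an attention layer and an FFN.
model = [record("attn", lambda x: [v * 2 for v in x]),
         record("ffn",  lambda x: [max(v, 0) for v in x])]

def forward(x):
    for layer in model:
        x = layer(x)
    return x

forward([1, -2, 3])
print(cache["attn"])  # intermediate activations are now inspectable
# → [2, -4, 6]
```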
The investigative process lasted for about an hour, and at the end, I asked it to produce a report. The full report is linked below — it’s technical, but readable, and walks through exactly how the model implements carry propagation through its 14 FFN neurons and a single attention head.
It's a 19-page, detailed, annotated description of how this 777-parameter model achieves 100% accuracy on 10-digit addition (on all samples tested). In proper meta-researcher fashion, I dropped the report back into the context and started asking clarifying questions and suggesting that certain sections be expanded for comprehension and clarity (largely my own). After a few iterations, it got better, and I learned an enormous amount.
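For reference, the computation the report says the network implements is, at bottom, schoolbook addition with carries. In plain Python, with digits least-significant first:

```python
def add_with_carries(a_digits, b_digits):
    """Schoolbook addition, least-significant digit first.
    Returns the sum digits and the carry at each position --
    the quantity central to the report's carry-propagation story."""
    carry, out, carries = 0, [], []
    for da, db in zip(a_digits, b_digits):
        s = da + db + carry
        out.append(s % 10)       # digit written at this position
        carry = s // 10          # carry passed to the next position
        carries.append(carry)
    out.append(carry)            # possible final carry digit
    return out, carries

# 19 + 87 = 106, digits least-significant first
print(add_with_carries([9, 1], [7, 8]))
# → ([6, 0, 1], [1, 1])
```

How a single attention head and 14 FFN neurons distribute this loop across positions is exactly what the report walks through.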
I won’t include all of the details of the report for brevity, but for interested readers, I would recommend the workflow above. I don’t vouch for the precision or absolute technical or interpretative accuracy of the report, but it reads coherently to me.
What Does This Imply?
I had originally set out on a brief exploration to see whether this 777-parameter toy model would be a suitable test bed to learn about mechanistic interpretability. I knew the general direction and areas I wanted to probe, but I had never actually done it before. I had figured it would be at least a few days, if not a week or more, to get my head around the model, the basic concepts of interpretability, and the experimental setup. This was not the case.
The speed of learning — not just writing code but getting up to speed on anything — continues to surprise me even after two years of watching it compound. A similar pattern showed up in my recent experience with agentic coding, described below.
What’s distinct here is the subject matter. This wasn’t coding a pipeline or standing up a website. This was an AI system analyzing the internal mechanics of another AI system — coherently, precisely, and without much direction from me. The theoretical case for why this matters — recursive self-improvement and evolutionary systems — is something I laid out earlier this year, also below.
Had I not been present, an agentic system could have made substantial progress to generate the model, design the experiments, interpret the results, and produce the report autonomously. The human in the loop was largely there to learn and occasionally catch errors or hallucinations. That is a different kind of milestone than benchmark performance — it is a qualitative shift in what these systems can do with their own internals.