For more like this see: jasonsteiner.xyz
The Growing Capabilities of AI Agents
Two weeks ago, Anthropic published an article on how they designed their multi-agent research system for Claude 4 [1]. It provided a detailed view of their overall architecture and some of the key lessons learned. In general, the ideas were not surprising and were in line with designing agents from a managerial perspective rather than a tactical one. This is the single most differentiating feature of this new type of software design. Just as Andrej Karpathy famously said that “The hottest new programming language is English”, it is also appropriate to say that:
The hottest skill for agent design is management.
This is categorically different from what has existed in software design and architecture in the past, and it is enabling new possibilities in the automation and scaling of human resources, which I have written about previously.
This article is about lessons in building complex agents for biomedical research and strategy, observations for enterprise implementations, future potential, and thoughts on AI research.
Lessons and Terminology
I have built several different types of agents of varying degrees of capability. This article will discuss building a generalist biomedical agent that performs a wide range of tasks, from specialized technical and computational analysis to research and strategic business synthesis. Several observations and lessons resonate with the work that Anthropic recently published and are worth pointing out alongside my own experiences developing in this new category of AI engineering.
A good place to start is with some definitions. Common terms that often get confounded include agent, orchestrator, multi-agent, and LLM. These are some of the working definitions that I have found useful:
Agent — an agent is a piece of software that has a non-deterministic workflow moderated by LLMs. This is different from just using LLMs. LLMs can be used for a wide range of deterministic workflows such as data extraction from documents, but these are not agents. Agents have decision capabilities that determine the actual workflow.
Orchestrator — an orchestrator is an agent that manages the workflows of other agents. It can be thought of as the manager of a team. The most straightforward way of doing this is to assign tasks, gather responses, integrate them, and reallocate new tasks. This is often done synchronously, for example in step-wise iterations, and is similar in spirit to the all-gather and all-reduce operations in distributed computing.
Multi-Agent — a multi-agent system can be thought of as an asynchronous communication system between multiple agents. Instead of a step-wise process of an orchestrator, these systems have agents that pass messages to each other dynamically. Multi-agent systems can be significantly more complex to design.
LLM — an LLM by itself is not an agent, and it is a misnomer to refer to it as such. This extends even to using multiple “prompt” roles, such as an “LLM as a judge”, which is also not inherently agentic unless the evaluation results in a non-deterministic workflow outcome. It is very possible to use many LLMs in a workflow in a non-agentic manner.
In practice, most agents today fall into the orchestrator category. There are developments in genuine multi-agent architectures, like Google’s Agent2Agent framework [2], but these are still early in development.
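To make the orchestrator pattern concrete, here is a minimal sketch of the synchronous assign / gather / integrate loop described above. The `call_llm` and `run_subagent` helpers are hypothetical placeholders, not a real API; the point is that the LLM's planning output, rather than fixed control flow, determines whether another round happens.

```python
# Minimal synchronous orchestrator loop (sketch, not production code).
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a chat-completion call to any LLM provider."""
    raise NotImplementedError

def run_subagent(task: str) -> str:
    """Hypothetical sub-agent: in practice it might search, run code, or query tools."""
    return call_llm(f"You are a focused research sub-agent. Task: {task}")

def orchestrate(goal: str, max_rounds: int = 3) -> str:
    findings: list[str] = []
    for _ in range(max_rounds):
        # 1. Assign: ask the orchestrator LLM to decompose the goal into tasks.
        plan = call_llm(
            f"Goal: {goal}\nFindings so far: {findings}\n"
            "List up to 3 next tasks, one per line, or reply DONE."
        )
        if plan.strip() == "DONE":  # the LLM, not the code, decides when to stop
            break
        tasks = [t.strip() for t in plan.splitlines() if t.strip()]

        # 2. Gather: run each sub-agent (sequentially here; could be parallel).
        results = [run_subagent(t) for t in tasks]

        # 3. Integrate: compress results before the next round, akin to an all-reduce.
        findings.append(call_llm("Summarize these results:\n" + "\n".join(results)))

    # Final synthesis across all rounds.
    return call_llm(f"Write a final answer to '{goal}' using:\n" + "\n".join(findings))
```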
Lessons from Agent Development and LLM Coding
Agent development is about architecture and the best way to approach it is to start with an architecture diagram of the pieces and how they talk to each other. Once this architecture is developed, LLMs can generate vast quantities of code to populate the necessary methods and they are quite good at providing a reasonable starting point for the big blocks.
But this process breaks down very quickly in practice, and the result is often globs of code that don’t really work. Trying to get LLMs to correct large blocks of code often produces even more unnecessary code that then takes time to sort through and figure out.
In my experience, LLM coding is best characterized as code curation rather than code generation. There are several benefits: LLMs will generate nicely formatted code, tests, and helper functions, but they will also excessively complicate, rewrite, and confuse large pieces of code that then take time to sort through. The experience has been one of a gardener, taking a tangle of code, sorting through it, curating, trimming, and deleting.
That said, the process is surprisingly effective.
Injecting Content into the Architecture
The central element of any agentic process is the content that flows through it, and it is particularly important to understand that content closely. For example, research agents gather text from the web or other sources, code-writing agents generate outputs and errors, and so on. All of this information gets placed into various internal variables that are passed between methods as the agent progresses. Understanding the actual content that moves around is critical for several reasons.
Tactically, token management is critical, not just from a cost perspective but also for API rate limits if APIs are being used. It is very easy for a naive agent to rapidly accumulate content in these seemingly innocuous variables and break workflows.
Strategically, the information in those variables, and how it is structured, is the agent’s valuable content. Creating effective message content, including strategic summaries where necessary, considering the impact of truncating content and the potential for lost information, and ensuring that information synthesis across different inputs, such as iteration rounds, is informative are all critical aspects that determine the agent’s performance.
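As a concrete example of the tactical side, below is a small sketch of keeping accumulated content under a token budget before it is handed to the next step, summarizing the overflow rather than silently truncating it. It assumes the `tiktoken` tokenizer purely for counting, and reuses the hypothetical `call_llm` helper from the orchestrator sketch above.

```python
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")  # any token counter works for budgeting

def num_tokens(text: str) -> int:
    return len(ENC.encode(text))

def fit_to_budget(chunks: list[str], budget: int = 8000) -> list[str]:
    """Keep the most recent chunks that fit the budget; compress the rest.

    Silently dropping or truncating content risks losing information, so the
    overflow is folded into a single summary chunk (via the hypothetical call_llm).
    """
    kept: list[str] = []
    used = 0
    for chunk in reversed(chunks):  # walk from newest to oldest
        cost = num_tokens(chunk)
        if used + cost > budget:
            overflow = chunks[: len(chunks) - len(kept)]
            summary = call_llm("Summarize concisely, keeping key facts:\n\n" + "\n\n".join(overflow))
            return [summary] + kept
        kept.insert(0, chunk)
        used += cost
    return kept
```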
Specialist vs Generalist Agents
Most of the agents familiar from the major AI labs are text- and image-based research agents that can search the web, or perhaps a company’s enterprise documents, and provide research summaries and synthesis. However, there is a growing space for domain-specific agents, such as in the life sciences. These agents can perform similar text- and image-based synthesis, but they can also use domain-specific tools, such as generative models for protein folding or chemical synthesis, and perform domain-specific technical analysis, such as handling particular data types, datasets, and computational platforms. The utility of these agents will be largely determined by the domain expertise that is injected into them as well as the resources they can access.
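One way to make “injected domain expertise” concrete is a registry of domain tools whose names and descriptions are exposed to the LLM so it can select among them at runtime. The tools below are hypothetical stubs; real implementations would wrap structure-prediction models, single-cell pipelines, cheminformatics libraries, and so on.

```python
from typing import Callable

# Hypothetical domain tools; real versions would wrap models or pipelines.
def predict_structure(sequence: str) -> str:
    raise NotImplementedError  # e.g. call a protein structure prediction model

def run_scrnaseq_qc(path: str) -> str:
    raise NotImplementedError  # e.g. run QC on a single-cell RNA-seq file

# name -> (description shown to the LLM, callable that does the work)
TOOLS: dict[str, tuple[str, Callable[[str], str]]] = {
    "predict_structure": ("Predict a protein structure from an amino-acid sequence.", predict_structure),
    "scrnaseq_qc": ("Run quality control on a single-cell RNA-seq dataset at a file path.", run_scrnaseq_qc),
}

def tool_menu() -> str:
    """Render tool names and descriptions for the agent's system prompt."""
    return "\n".join(f"- {name}: {desc}" for name, (desc, _) in TOOLS.items())

def dispatch(tool_name: str, argument: str) -> str:
    """Execute whichever tool the LLM selected, validated against the registry."""
    if tool_name not in TOOLS:
        return f"Unknown tool: {tool_name}"
    _, fn = TOOLS[tool_name]
    return fn(argument)
```

The domain expertise lives in which tools are registered and how their descriptions are written; the surrounding agent architecture can stay generic.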
In previous articles I have written about the construction of AI scientists — most of which have focused on aspects of reasoning over published literature content.
Below is an extension of that approach that incorporates domain-specific technical analysis.
Demo
Agent processing a single-cell RNA dataset. The agent identified the correct file and proceeded to analyze it. It dynamically wrote analysis code, automatically debugged it when necessary, and routed content to subsequent steps. Outputs are displayed at the end.
This agent also processes a range of other data types, writes code in multiple languages, and conducts integrative deep research reports.
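The “dynamically wrote analysis code, automatically debugged it when necessary” behavior reduces to a generate-execute-retry loop. Below is a stripped-down sketch, again using the hypothetical `call_llm` helper; note that it runs generated code directly with no sandboxing, which a real agent should not do.

```python
import subprocess
import sys
import tempfile

def write_and_run(task: str, max_attempts: int = 3) -> str:
    """Ask the LLM for analysis code, execute it, and feed errors back for repair."""
    prompt = f"Write a standalone Python script that does the following:\n{task}"
    for _ in range(max_attempts):
        code = call_llm(prompt)  # hypothetical LLM helper

        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            script_path = f.name

        result = subprocess.run(
            [sys.executable, script_path],
            capture_output=True, text=True, timeout=600,
        )
        if result.returncode == 0:
            return result.stdout  # success: route the output to the next step

        # Failure: hand the traceback back to the LLM and ask for a corrected script.
        prompt = (
            f"The script below failed.\n\nScript:\n{code}\n\n"
            f"Error:\n{result.stderr}\n\nReturn a corrected script only."
        )
    raise RuntimeError("Could not produce working code within the attempt limit")
```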
Technical Deep Research Report Example
Importantly, this is a generalist architecture that incorporates broad spectrum capabilities. The specific agent trajectories are query dependent.
I expect that in the future we may have a trifecta definition of “API,” with two-thirds of the definitions co-opted by the life sciences:
Application Programming Interface
Active Pharmaceutical Ingredient
Artificial Pharmaceutical Intelligence
Enterprise Agents
Companies that embrace the idea of agents will have advantages over those that do not; however, enterprise implementation of agents is a significant undertaking. The image below from Sierra, a company specializing in enterprise deployment of agents, provides some intuition for the pieces involved.
Reference [3]
For enterprises, the lifeblood of effective agents will be proprietary data, and accessing that data in structured, consistent, and reliable formats will be critical. The open sourcing of the Model Context Protocol (MCP) from Anthropic is providing an ecosystem to enable this at scale.
MCP is an app store for LLMs
To summarize what MCP enables, consider the following:
An enterprise has several data sources, which may include databases, cloud storage, enterprise SaaS resources, and others. Some of these resources come with programmatic ways of accessing their data, and some do not. Moreover, third-party providers may periodically update their documentation, endpoints, or other structures. For an enterprise building agents that depend on access to these data sources, this is a brittle architecture, and it often lacks middleware features like security and authorization. What MCP does is provide this intermediary. It offers a standardized way to present back-end information to an LLM so that the LLM can request resources. Importantly, the LLM does not need to construct custom API calls or concern itself with how the resource is actually accessed or processed on the back end; everything is mediated through a standard interface.
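To make this concrete, here is a minimal sketch of exposing one internal data source through MCP using the FastMCP helper from the official Python SDK (interface details may differ across SDK versions); the warehouse query itself is a hypothetical placeholder. The LLM client only sees the tool’s name, docstring, and typed parameters, never the back-end logic.

```python
# pip install mcp  (the Model Context Protocol Python SDK)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("enterprise-data")  # server name shown to connecting clients

@mcp.tool()
def query_sales(region: str, quarter: str) -> str:
    """Return sales figures for a region and quarter from the internal warehouse."""
    # Hypothetical back-end call; swapping databases or vendors changes nothing
    # about how the LLM requests this resource.
    return run_internal_warehouse_query(region, quarter)  # placeholder, not a real function

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default; MCP clients connect via their config
```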
The development of this protocol has enabled any enterprise platform to build and release an MCP server, giving all of their customers standardized, LLM-based access to their data. There has been a rapid explosion in this space: nearly 5,000 servers are searchable here as of the date of this article:
Reference [4]
Biology, Agents, AI
This article discusses two main ideas:
The importance of management skill in agent development
The development of domain-specific and enterprise agent adoption
The specific area that I am interested in is the biological sciences, and much of this Substack covers topics related to AI and Biology. This intersection is an evolving one, with the historical perspectives of computer science and engineering being distinct from those of biology and discovery. The essence of this distinction is whether humans have designed the system (and thus can engineer it) or must discover the system (where subsequent engineering is more challenging).
However, the development of LLMs and the research into their behavior have brought these two domains closer conceptually. Anthropic has been at the forefront of the field of interpretability and recently published a paper on the “Biology of a Large Language Model” [5], which presents LLMs as being “grown” rather than built. This idea, and the fact that we do not have a good understanding of how LLMs work, brings the research space of AI much closer to that of biology. The experimental space extends to agents and their behavior, which can span a surprisingly wide range depending on data, models, prompts, architecture, and more.
The Future of Knowledge Work
The maximalist argument of agent development is that it will consume knowledge work. I don’t think this is the case, but it will change knowledge work substantially. Service industries like consulting and the legal profession are already starting to see this. Companies like Harvey are developing domain-specific legal AI that has the potential to perform a large fraction of legal work. The strategic path these companies are taking can be described as moving from “software product” to “work product,” where the software product is a co-pilot support tool and the work product is the actual deliverable. In the legal profession specifically, the entire business is built around the billable hour, which is not particularly amenable to efficiency products, but other professions have different dynamics.
I think that the view that AI agents will consume knowledge work, however, is a limited one, particularly in high-end professional services. In these industries, clients are not strictly paying for work product — they are paying for confidence, accountability, and personal service. It is likely that these attributes of business will not be easily displaced by AI.
Agents will continue to power both individuals and enterprises with increasingly valuable leverage, but the human aspect of business likely has substantial staying power.
For more like this see: jasonsteiner.xyz
References
1. https://www.anthropic.com/engineering/built-multi-agent-research-system
2. https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/
3. https://www.linkedin.com/posts/brettaylor_its-never-been-easier-to-build-a-demo-ai-activity-7340771650281291778-L9m2
4. https://www.pulsemcp.com/servers
5. https://transformer-circuits.pub/2025/attribution-graphs/biology.html