There is no unique, correct answer in most cases. It is a matter of taste, depending on the circumstances... and the particular age you live in.... Gradually, you will develop your own taste, and along the way you may occasionally recognize that your taste may be the best one! It is the same as an art course.
-Richard Hamming, Methods of Mathematics Applied to Calculus, Probability, and Statistics
If you are an engineer or AI scientist working on the topics in this article, consider getting in touch for more details: jason@layerzero.bio (jasonsteiner.xyz)
Introduction & TLDR:
This article is about scientific taste and its importance in the future of AI scientists. Research is inherently about exploring the unknown, but novelty is not an inherent good. Perhaps the most important skill in scientific research is defining what to work on, and that involves taste. Current LLMs and agentic frameworks have no inherent taste when it comes to scientific inquiry. They are heavy on novelty, light on insight. They have very little intuition. This is a function of two things:
Their training data does not comprise a holistic world model
Their skill is largely derived from the quality of their human designers
The first point may resolve over time, particularly as real-world interaction data, such as robotics data, vision-language model (VLM) inputs, and other sensory streams, is incorporated into multimodal training sets.
The latter point is more challenging, at least at the moment. It might be argued that this is just another version of the “Bitter Lesson”1: that attempts to imbue AI systems with human design will ultimately be subsumed by scaling compute with simple rules. As I have written before, I think this is almost certainly true for any formalizable system, i.e., any system that is verifiable2. In open-ended tasks where creativity and novelty matter, however, subjectivity will still reign, and subjectivity involves taste. Indeed, recent LLM advances have demonstrated improvements in areas like math and coding but have lost ground in areas like creative writing.
When the internet, specifically Web 2.0, democratized individual self-expression, it was considered an inherent good. Yet it has since devolved into a junkyard of bots and content garbage, giving rise to the “dead internet” phenomenon. The same has happened in scientific publishing with the explosion of paper mills and junk science.
The rise of scientific agents and coding LLMs will add rocket fuel to this content generation. The central differentiator will be what to do, not what can be done.
This is the essence of taste.
Abundance of Agents
They are coming fast and furious. Just a few days ago, Future House released APIs and web interfaces for several of their new agents, largely focused on assessing scientific literature, along with an experimental version of a tools platform for chemistry3. They did not, however, release the underlying code (or at least any updates to their current repository4).
I do expect more academic institutions to release interesting and useful agentic code bases that can serve as helpful resources for researchers and builders. It is also very much the case, however, that AI coding assistants are already quite good at translating an agentic logic process into code.
This covers not just the mechanics of the code but also a fair bit of the additional infrastructure. For example, building Model Context Protocol (MCP) servers can be done by simply dropping in the specification from Anthropic, the API documentation for the resources of interest, and a narrative description of the methods you would like to perform on the data you retrieve. Sonnet 3.7 will do a pretty decent job of drafting a Python script that spins up a server and the function-calling schemas, so that your LLM can select a method and deliver the correctly formatted function specification to call it directly. Not everything is 100% out of the box, but it does a pretty good job, particularly for APIs with good documentation.
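As a concrete illustration, here is a minimal sketch of the kind of wiring involved: a tool schema in the style of Anthropic's function-calling format plus a small dispatcher that routes the model's structured call to a handler. The `search_publications` tool and its handler are hypothetical placeholders, and a real MCP server would follow the protocol specification rather than this simplified registry.

```python
# Minimal sketch: a tool schema the LLM can select from, plus a dispatcher
# that routes the model's structured tool call to a local handler.
# "search_publications" is a hypothetical placeholder tool.

import json

TOOL_SCHEMAS = [
    {
        "name": "search_publications",
        "description": "Search a literature index and return matching abstracts.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text search query"},
                "max_results": {"type": "integer", "description": "Number of hits to return"},
            },
            "required": ["query"],
        },
    }
]

def search_publications(query: str, max_results: int = 10) -> list[dict]:
    # In a real server this would call the API of interest (PubMed, etc.).
    return [{"title": f"Placeholder result for '{query}'", "rank": i} for i in range(max_results)]

HANDLERS = {"search_publications": search_publications}

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call ({"name": ..., "input": {...}}) to its handler."""
    handler = HANDLERS[tool_call["name"]]
    return json.dumps(handler(**tool_call["input"]))

# Example: the structured call an LLM might emit after reading TOOL_SCHEMAS.
print(dispatch({"name": "search_publications", "input": {"query": "ALS biomarkers", "max_results": 3}}))
```

The schema and the handler are the only pieces you really write by hand; the rest is wiring that a coding assistant can draft.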
Below is an example of an “LLM in a loop” executing a request to holistically outline the drug development landscape for ALS (an arbitrary choice of task). It builds queries and qualification criteria, continuously updates its strategic approach based on observations, calls a range of tools, retries on failures, stores memories, and has a host of other bells and whistles.
It’s a “vibe-coded” draft but gets a lot of the key points right. As I have written before (Ref 10), the basic architecture of agentic frameworks is not particularly complex to understand, and coding assistants can replicate most logical execution loops with high fidelity to provide the scaffolding. Agents are less a technical computer science or programming problem than a language problem: more like a manager instructing a team than an engineer architecting an algorithm.
Figure: Demo agent running a research task showing a variation on internal mechanics such as action storage, strategy iteration, tool calling, etc. This runs for a while so the whole clip is not displayed. Initial queries are arbitrary.
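For reference, the skeleton of such a loop is short. Below is a stripped-down sketch under some assumptions: `call_llm` is a hypothetical wrapper around whichever chat model you use, `dispatch` stands in for the tool registry, and the real draft layers retries, memory consolidation, and stopping criteria on top.

```python
# Stripped-down "LLM in a loop": decide, act via a tool, observe, update strategy.
# call_llm and dispatch are hypothetical stand-ins for the model wrapper and tool router.

def call_llm(prompt: str) -> dict:
    # Placeholder: in practice this calls the model and parses its JSON reply.
    return {"action": "stop"}

def dispatch(tool_call: dict) -> str:
    # Placeholder: in practice this routes the call to a registered tool.
    return "placeholder tool output"

def run_agent(objective: str, max_steps: int = 20) -> list[dict]:
    observations: list[dict] = []  # running memory of everything the agent has seen
    strategy = f"Initial plan for: {objective}"

    for step in range(max_steps):
        decision = call_llm(
            f"Objective: {objective}\nCurrent strategy: {strategy}\n"
            f"Recent observations: {observations[-5:]}\n"
            'Reply with JSON: a tool call, or {"action": "stop"} if finished.'
        )
        if decision.get("action") == "stop":
            break

        try:
            result = dispatch(decision)   # route the structured call to a tool
        except Exception as exc:          # failed tool calls become observations too
            result = f"error: {exc}"

        observations.append({"step": step, "call": decision, "result": result})
        # Let the model revise its strategy in light of the newest observation.
        update = call_llm(
            f"Objective: {objective}\nLatest observation: {observations[-1]}\n"
            "Rewrite the strategy in one short paragraph. Reply as JSON with key 'strategy'."
        )
        strategy = update.get("strategy", strategy)

    return observations
```

Everything interesting about the agent lives in those prompts and in the decisions about what to store and when to stop, not in the loop itself.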
A Few Practical Lessons Learned
Some of the key lessons learned in this quick prototype are the following:
The specific logic and architecture of how agents work will keep evolving, but it is not particularly hard to set up or iterate on. How to evaluate outputs, when to consolidate and summarize, how to set soft guidelines for tool selection, what to store in short- or long-term memory: these are all dials to turn in the logic, and each setting will produce different outputs. This will remain a constantly evolving experimental space.
Building tools is very straightforward provided you have a good schema for the inputs and outputs. They generally just go into a tool registry holding the descriptions and parameter schemas the LLM can select from. LLM tool-calling outputs are generally accurate in their formatting, but it is a good idea to provide robust parsing for variations to avoid errors (a small example follows this list). Overall, tools are not complicated, and most coding assistants can write quick drafts of many of them, either from scratch or from open-source repositories such as Genentech’s tool lists5.
In general, the construction of comprehensive scientific agents is more a software and engineering effort and less a scientific one — at least for the basic setups. This will suffice for basic agents whose primary purpose is to be a second “set of hands” — for example, agents that build and run computational biology workflows. However, for agents that aspire to be scientific co-pilots, or even more broadly to do autonomous science, the role of taste will become increasingly important. It will not just be a measure of what can be done, but what should be done.
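On the parsing point above: models sometimes wrap a tool call in markdown fences or surround it with prose, so a tolerant parser saves a lot of spurious failures. A minimal sketch (the fence-stripping heuristics are illustrative, not exhaustive):

```python
# Tolerant parsing of an LLM's tool-call reply: strip markdown fences, then fall
# back to extracting the first brace-delimited block from surrounding prose.

import json
import re

def parse_tool_call(reply: str) -> dict | None:
    """Return the first JSON object found in the reply, or None if nothing parses."""
    cleaned = re.sub(r"```(?:json)?", "", reply).strip()  # drop common code fences
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)  # scan for an embedded object
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None

# Example: a fenced reply with trailing commentary still parses cleanly.
print(parse_tool_call('```json\n{"name": "search_publications", "input": {"query": "ALS"}}\n``` Sounds good!'))
```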
On Scientific Taste
Perhaps the most important decision a researcher makes is what problem to work on. It is also among the most difficult. Many labs throw new researchers directly into the mix, assigning them to a project that is already ongoing or having them get hands-on tactical experience right away. I would argue, however, that it would be far more effective for new researchers to spend a disproportionate amount of their early time determining what they should do, not just what they can do. The cultivation of scientific taste is probably the single most important determinant of research success.
This concept of “taste” has also been gaining currency in the common AI vernacular. The context is that when people can build anything, what’s worth building? Andrej Karpathy muses on this in a recent vibe-coding session post6, and Rick Rubin, the legendary music producer who says he has barely any technical musical ability8, is oft-cited as the paragon of “taste” to which all vibe coders should aspire7.
If you don’t know Rick Rubin — look him up.
Figure: vibe coding nirvana
I had this experience recently with loveable.dev. With barely an instructive prompt on “AI Agents in Bio”, it spun up a rather convincing landing page complete with menu drop-downs, placeholder states, template code block examples, and more. It all looked quite plausible for a startup mockup. But I couldn’t help wondering how many people unfamiliar with how easily these tools generate superficially impressive outputs might be misled. A thousand and one beautiful landing pages, and not a single viable product. It takes “fake it till you make it” to a whole new level.
So, what is “taste”?
Taste is the recognition not of what can be done, but of what should be done.
Right now, LLMs don’t have a lot of taste. In the context of agents, taste is written into your prompts; it is embodied in your logic. This is where agents will be differentiated. There is certainly technical piping and detail in building tools and making sure they are valid, functional, and accurate, but that is largely just a wiring challenge.
However, LLMs are not necessarily great at choosing which tools to use and when, or at deciding what to do if a tool returns an unexpected error or failure. If, for example, a tool error is logged as an observation, and observations are used to update the strategy, the agent may spend an inordinate amount of effort trying to fix the error when it would be better to move on to the next step. How much time and how many resources to spend here is a core part of taste.
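One blunt but practical way to encode that judgment is an explicit retry budget per tool call, so a failing step degrades into an ordinary observation instead of consuming the whole run. A minimal sketch, with the budget numbers purely illustrative:

```python
# One blunt way to encode "when to give up": a per-tool retry budget.
# After the budget is exhausted the failure is recorded as an observation and the
# agent moves on, rather than letting the strategy loop keep chasing the same error.

def call_with_budget(tool, args: dict, max_attempts: int = 2) -> dict:
    last_error = ""
    for _ in range(max_attempts):
        try:
            return {"status": "ok", "result": tool(**args)}
        except Exception as exc:
            last_error = str(exc)
    # Surface the failure as data the loop can reason about, then continue.
    return {"status": "gave_up", "attempts": max_attempts, "error": last_error}
```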
In the world of general research agents, like the Deep Research agents that have been released by most of the major AI labs, this idea of taste is relatively straightforward. This is because the research progress is generally:
linguistically based
open-ended
This means that you can effectively use LLMs themselves as guardrails to keep the agent on track: for example, always checking against the original prompt to ensure that updates to the agent’s strategy stay in line with the original objective and do not deviate too far.
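A simple version of that guardrail is just another LLM call that scores each proposed strategy update against the original prompt. A sketch, where `judge_llm` is a hypothetical callable returning the model's raw text reply, and the 0-10 scale and threshold are arbitrary choices:

```python
# Guardrail sketch: before adopting a strategy update, ask a judge model whether
# it still serves the original objective. judge_llm is a hypothetical callable
# that sends a prompt and returns the model's raw text reply.

def on_track(judge_llm, objective: str, proposed_strategy: str, threshold: int = 6) -> bool:
    reply = judge_llm(
        "On a scale of 0 to 10, how well does the proposed strategy still serve the objective?\n"
        f"Objective: {objective}\nProposed strategy: {proposed_strategy}\n"
        "Reply with a single integer."
    )
    try:
        return int(reply.strip()) >= threshold
    except ValueError:
        return True  # if the judge reply is unparseable, fail open and keep going
```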
This is considerably more challenging with scientific agents for three notable reasons:
The tools are not always language-based — for example, tools may do things such as run code, or produce data, graphics, or other formats. (There are increasingly robust parsing tools that use VLMs/OCR and other techniques to effectively translate many data types to LLM consumable language formats).
The tool diversity, quality, and proper selection are considerably more challenging. This is potentially addressable with reinforcement learning for agent trajectory optimization.
Scientific insight or analysis is inherently a low-frequency event, which makes it less amenable to the stochastic training of LLMs. Gathering information is one thing; determining whether that information contains an as-yet-unknown discovery is a more nuanced challenge.
In short, developing “taste” to effectively train deep research agents on the web is more about assessing subjective good/bad whereas developing taste for effective scientific research is more about assessing objective true/false. The latter is considerably more difficult for LLMs.
Cultivating Taste
Cultivating taste and imbuing it into scientific agents has two parts — the prompting part and the reinforcement part.
It seems a bit silly, but the style and content of prompts, the way they are chained together, and the specific roles and details of each have a surprisingly large effect on the direction of an agent’s workflow. This is still very much in the realm of “craft”, and it is the primary place where scientific taste can be injected into an agent.
There is an idea that taste can be gathered through reinforcement learning, and to an extent this may be possible in certain use cases. However, RL needs one of two things to be effective: either objective feedback from the real world (e.g., experimental data) to determine whether its designs or outputs are directionally useful and valid, or another agent (or human) to qualify its outputs and establish the reward function. The former comes with real-world latencies in feedback, and the latter is a chicken-and-egg training scenario, at least if you want to train at scale.
The experience of using o3 or Deep Research is both quite impressive and somewhat flat. These models definitely accelerate the acquisition and synthesis of information and can appear strikingly insightful at times, especially in new domains. If we think of knowledge as an ocean, these models help us reach a bit deeper than we could before across the entire surface, but they do not provide much in the way of a true deep dive in any one area.
Autonomous Organizations
I recently wrote an article on Autonomous Organizations. One of the ideas behind it is that rising value will be attributed to human judgment and decision-making, coupled with fleets of agents performing different tasks.
This human judgment will ultimately be adopted and incorporated by agents in the form of prompt tuning, stored memories, and interactive engagement with human judges. We have already seen AI clones of individuals9 debating their real-life counterparts. While this particular example isn’t a paragon of virtue, it is a potentially fascinating avenue of “taste” transfer to AI systems. Who doesn’t love a good debate with themself?
An important observation in the world of computing is that when the task is not deterministic (for example, fraud detection or intelligence gathering), the best outcomes typically come from the combination of computers and humans. This will be the case in scientific research as well. The question of where this balance lies will be a question of how much human “taste” can be imbued into LLM agents.
Intuition is the latent space of experience
This post is a more literary view of AI agents. For more technical or detailed articles, the following are related:
If you are an engineer or AI scientist working on these topics, specifically in building scalable systems, consider getting in touch for more details: jason@layerzero.bio
References
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
https://platform.futurehouse.org/
https://github.com/Future-House
https://github.com/Genentech/BRAID
https://x.com/karpathy/status/1917961248031080455
https://hackernoon.com/vibe-coding-creativity-without-code
https://www.youtube.com/watch?v=h5EV-JCqAZc
https://www.popsci.com/technology/ai-clone-interview/