AI Accuracy for the Enterprise
Preamble: This article is a brief summary of a few conversations I have had lately with executives interested in AI implementations. The intent of this article is to provide a framework for understanding how AI systems—specifically those built on large language models (LLMs)—actually work, and to outline the core tactics that can improve their accuracy, reliability, and performance in enterprise settings.
for more see: jasonsteiner.xyz
Five Pillars
A key insight from these discussions is the significance of understanding LLM system mechanics. Technical implementation knowledge is not required, but grasping the core principles and techniques—as well as realistic expectations—is essential.
This article focuses primarily on domains where LLMs are the primary tool and accuracy is essential. Example domains would be law, healthcare, finance, regulatory fields, etc. Similar principles can be extended to more open-ended domains like scientific research, though the definitions of “accuracy” are more flexible in domains with unknowns.
Readers can consider this a collection of methods that should be used in coordination for robust system building. The five methods discussed are:
1. Model fine-tuning for semantics
2. Model fine-tuning for reasoning
3. Retrieval and its variations
4. Prompt engineering
5. Agentic architectures
Each of these steps contributes a different piece of the accuracy pie, and business leaders should understand them. They don’t need to be applied in a strict order, and each can be valuable on its own—but their combination is often what delivers enterprise-grade reliability.
Model Fine-Tuning - Level 1 Semantics
One of the first things a company may want to do is fine-tune a model. There are several approaches to this, but they share the same core concept. Any model you choose—GPT, Gemini, Claude, Qwen, or any variant—is essentially a set of “weights” trained primarily on internet data. These weights compute mathematical functions that estimate the probability of each token (a word or word fragment) in a given context. These probabilities are learned from vast corpora of general text across numerous domains. This is what makes it a “foundation model”: the probabilities are integrated across these broad domains.
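The next-token probability idea can be sketched with a toy softmax over model scores. The candidate tokens and logit values below are invented for illustration; a real model computes such scores over a vocabulary of tens of thousands of tokens.

```python
import math

def softmax(scores):
    """Convert raw model scores (logits) into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits a model might assign to candidate next tokens
# after a context like "The court ruled in" — values are invented.
vocab = ["favor", "the", "Paris", "banana"]
logits = [4.0, 2.0, 0.5, -3.0]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token:>8}: {p:.3f}")
```

The model samples from this distribution, which is why the same prompt can yield different outputs on different runs.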
These foundation models are pretty good for general-use chatbots, where the expectation is text that looks like a general conversation. They are not ideal for producing specific types of text, like legal or regulatory documentation. To resolve this, as a first step, companies may wish to consider creating a “fine-tuned” model: using the original model as a starting point and then training it on a high concentration of examples of a specific type. Training on this concentrated data causes the weights of the model to adjust slightly (rather than being relearned from scratch) so that the model becomes better at producing text that looks specifically like the concentrated training set. The training set can be anything: legal documents, a specific creative writing style, financial documents, etc. The resulting model is different from the original, but better at producing the type of language suited to the use case. This specialization comes at a cost: the model becomes less general and less suited for broad, open-ended tasks.
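Real fine-tuning adjusts neural-network weights by gradient descent on domain examples. The toy bigram counter below (all text invented) only illustrates the underlying principle: continued training on concentrated domain data shifts the model's output distribution toward the specialty.

```python
from collections import Counter, defaultdict

class BigramModel:
    """Toy 'language model': next-word probabilities from bigram counts.
    This is NOT how LLM fine-tuning is implemented; it only demonstrates
    the distribution-shifting effect of continued, concentrated training."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, corpus, weight=1):
        words = corpus.split()
        for prev, nxt in zip(words, words[1:]):
            self.counts[prev][nxt] += weight

    def most_likely_next(self, word):
        return self.counts[word].most_common(1)[0][0]

model = BigramModel()
# "Pretraining" on general text: after "the", "weather" is most common.
model.train("the weather is nice and the weather is warm the contract ends")
# "Fine-tuning" on concentrated legal text nudges the distribution.
model.train("the contract is binding the contract is signed", weight=3)
print(model.most_likely_next("the"))  # → contract
```

Before the second training pass the most likely word after "the" is "weather"; after it, "contract" dominates, mirroring how a fine-tuned model favors domain vocabulary.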
The idea of fine-tuning is that a model can be updated so that the text it produces looks like the desired text. However, text that looks right is not necessarily correct: the model mimics the form of the target domain, not its truth.
Model Fine-Tuning - Level 2 Reasoning
To produce more correct text — e.g., text that is logically consistent — it is possible to do additional fine-tuning specifically on correct statements. This is often called “chain of thought” or “reasoning” training and focuses on the long-range consistency of the text, not just semantic resemblance. It is another form of fine-tuning, but it requires training examples that are genuinely correct and consistent end to end. In the legal field, these could be complete logical case arguments used as end-to-end training examples. This is a more difficult form of fine-tuning, often because sufficiently long, high-quality training examples are scarce. In many real-world cases these examples are human-generated, which makes them difficult to acquire at scale; for theoretical domains like mathematics, correct reasoning traces can be generated algorithmically. This process is often implemented with reinforcement learning techniques described in more detail in this article:
The combination of a semantic fine-tuning and a reasoning fine-tuning can produce models that more accurately reflect both the vocabulary and the logical context of specific domains in a way that can dramatically improve the quality of the output.
Retrieval and Variations
For the enterprise, perhaps the most important contributor to accuracy is grounding in internal and external data sources, and retrieval is the most effective way of bringing that information into the model.
There are several different types of “retrieval” that LLMs can use. The original instantiation of Retrieval Augmented Generation (RAG) was in the context of vector databases for semantic embeddings. This is a component of the technology underpinning LLMs that is used to embed the semantic meanings of prompts and compare them to the semantic embeddings from a database of content. This matching is done using vector similarity math, which assesses how semantically similar two pieces of information are.
Vector-based RAG can be very effective for unstructured text or other data types, but it also has limitations. The methods of producing vector embeddings are numerous and heavily impact the quality of the retrieval. For example, some of the variables include what size of text you choose to embed — a sentence? a paragraph? a structured format field? Other parameters include what embedding model you use, as all RAG systems require the embedding models to match between the database and the queries, and this can be both a non-trivial cost and a latency issue depending on the size and precision of the models.
Limitations of vector RAG include long-context consistency. For example, you may embed in paragraph chunks, but a document may contain an earlier paragraph that nullifies a later one. In a retrieval context, an LLM would match the target paragraph but may not capture the prior negating one. These are all important considerations when designing and assessing the utility of vector-based RAG: the space of parameters that must be tuned for effective implementation is large.
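The retrieval math itself can be sketched in a few lines. The `embed` function below is a bag-of-words stand-in for a real embedding model, and the vocabulary and chunks are invented; only the cosine-similarity matching reflects how vector RAG actually scores candidates.

```python
import math

def embed(text):
    """Stand-in embedding: word counts over a tiny fixed vocabulary.
    A real system would use a trained embedding model; the retrieval
    math below (cosine similarity) is the same either way."""
    vocab = ["contract", "breach", "payment", "weather", "court"]
    words = text.lower().split()
    return [words.count(term) for term in vocab]

def cosine(a, b):
    """Cosine similarity: how aligned two vectors are, ignoring length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Paragraph-sized document chunks, embedded once and stored.
chunks = [
    "the court found a breach of contract",
    "payment was made under the contract",
    "the weather delayed the shipment",
]
db = [(c, embed(c)) for c in chunks]

# At query time, embed the query and return the closest chunk.
query = "was there a contract breach"
q_vec = embed(query)
best = max(db, key=lambda item: cosine(q_vec, item[1]))
print(best[0])  # → the court found a breach of contract
```

Note that a chunk elsewhere in the document saying "the prior finding of breach was overturned" would not be retrieved by this query unless it was semantically similar on its own, which is exactly the long-context limitation described above.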
A second major retrieval approach is the Model Context Protocol (MCP) standard, which can wrap many retrieval methods. MCP is a communication standard that maps arbitrary API endpoints to LLM-consumable formats, so that LLMs can do things like generate correct API calls to external resources or write correct SQL queries to fetch data.
MCPs can be thought of as an “app store” for LLMs where a diverse range of resources, data, business logic, pipelines, methods, etc., can be made readily available in a standard format for LLMs to use.
While a vector RAG application can also be wrapped in an MCP format, most MCP methods are used to provide a robust structure for deterministic data fetching or processing code. This type of data retrieval doesn’t rely on the “fuzziness” of semantic embeddings or other LLM stochasticity and, as such, is typically a very robust way of bringing grounded context into an LLM workflow. This assumes, of course, that the right queries are issued against the data, which naturally leads to prompt engineering.
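As a rough sketch of the idea (not the actual MCP wire protocol, which is JSON-RPC based), a capability can be exposed to an LLM as a described, deterministic, parameterized query rather than letting the model improvise data access. The table, tool name, and records here are all invented.

```python
import sqlite3

# Illustrative in-memory stand-in for an enterprise database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (id TEXT, title TEXT, year INTEGER)")
conn.execute("INSERT INTO cases VALUES ('C-101', 'Smith v. Jones', 2019)")
conn.execute("INSERT INTO cases VALUES ('C-102', 'Acme v. Widgets', 2021)")

# A tool definition in the spirit of MCP: a name, a description the LLM
# can read, declared parameters, and a deterministic implementation.
TOOLS = {
    "lookup_case": {
        "description": "Fetch a case record by its exact id.",
        "parameters": {"case_id": "string"},
        "run": lambda case_id: conn.execute(
            "SELECT id, title, year FROM cases WHERE id = ?", (case_id,)
        ).fetchone(),
    }
}

def call_tool(name, **kwargs):
    """What the host application does when the LLM emits a tool call."""
    return TOOLS[name]["run"](**kwargs)

print(call_tool("lookup_case", case_id="C-101"))  # → ('C-101', 'Smith v. Jones', 2019)
```

The LLM only chooses which tool to call and with what arguments; the query itself is fixed and parameterized, which is what makes this style of retrieval deterministic and grounded.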
Prompt Engineering
Prompt engineering has had waves of popularity in its relatively short existence, but it remains a critical part of the effective implementation of LLMs. There are several released examples of system prompts from major AI labs that show the level of detail that can go into a prompt1. Even prior to fetching additional outside information, a system prompt can routinely be several hundred words or longer and provide detailed few-shot examples of exactly how the model should produce outputs. The model will be much more likely to follow the format of the examples provided, which can add to the reliability of the output. For example, when using the MCP protocols described above, each of the methods offered by the MCP has a detailed example of exactly how that method is used, such that the LLM can mimic it. These types of highly detailed instructions are even more robust in the context of tools like Claude Skills2, which provide highly detailed instructions for precisely how to execute certain tasks.
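A minimal sketch of assembling a few-shot system prompt programmatically follows; the instructions and examples are invented placeholders, not a real production prompt.

```python
def build_system_prompt(instructions, examples):
    """Assemble a system prompt with few-shot examples so the model
    can mimic the exact output format."""
    parts = [instructions, "", "Examples:"]
    for ex in examples:
        parts.append(f"Input: {ex['input']}")
        parts.append(f"Output: {ex['output']}")
        parts.append("")
    return "\n".join(parts).strip()

# Invented instructions and examples for illustration only.
instructions = (
    "You are a contract-review assistant. "
    "Answer ONLY in the format shown in the examples."
)
examples = [
    {"input": "Does clause 4.2 allow termination?",
     "output": "CLAUSE: 4.2 | ANSWER: yes | CONFIDENCE: high"},
    {"input": "Is arbitration required by clause 9?",
     "output": "CLAUSE: 9 | ANSWER: no | CONFIDENCE: medium"},
]

prompt = build_system_prompt(instructions, examples)
print(prompt)
```

Because the examples pin down an exact output format, downstream code can parse the model's answers reliably, which is much harder when the format is left to chance.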
Prompt engineering, however, is still mostly a bespoke field of human-produced content. There are best practices, key phrases, and tricks that can be employed, but the space of possible variations in what a prompt includes and how it is drafted is effectively infinite, which makes it challenging to optimize. Tools like DSPy3 are being developed to help with this. DSPy is essentially an automated prompt engine: it allows users to define what they want a prompt to do and then iterates across a range of variations, “learning” the features of the prompts that produce the best LLM outcome (as measured by whatever task the LLM is supposed to perform). This automation can be a key resource for optimizing prompts without human guesswork or laborious intervention.
At the end of the day, however, even with all of the above techniques, the LLM ultimately is a stochastic (probabilistic) engine. While it can be prompted precisely, retrieve all of the accurate information it needs, and be fine-tuned for industry-specific content, the output still has elements of variability. To resolve this, it’s important to have validators.
Agentic Architectures
It is already common now to speak about agentic or multi-agent architectures where different parts of a system play different roles, and this is a key part of ensuring robustness and reliability of LLM systems. The ways to assemble the architectures are essentially infinite, but one simple example may be the following:
Consider that you are producing a legal document that references case law. There is a retrieval step at the outset, calling an external API to access relevant cases and add them to the context. The LLM then produces an output. A simple validator could accept that output, parse the referenced cases, and cross-reference their existence and content against the same API. This check makes the system robust against the initial LLM both fabricating cases out of whole cloth and generating inaccurate information about retrieved cases. Validation steps like this can be extended to any number of stages necessary to ensure the robustness of an output. Research consistently demonstrates that using multiple LLM systems to check and challenge each other results in more truthful and reliable outputs4.
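The validator step in this example can be sketched as a simple cross-reference; the case-id format and the retrieved set below are invented for illustration.

```python
import re

# The trusted record set returned by the retrieval API (invented ids).
RETRIEVED_CASES = {"C-101", "C-102", "C-205"}

def validate_citations(llm_output, known_cases):
    """Parse case ids cited in the draft and flag any that the
    retrieval step did not actually return (likely fabrications)."""
    cited = set(re.findall(r"\bC-\d+\b", llm_output))
    return sorted(cited - known_cases)

draft = "As held in C-101 and C-999, the clause is unenforceable."
fabricated = validate_citations(draft, RETRIEVED_CASES)
print(fabricated)  # → ['C-999']
```

Any flagged citation can then trigger a retry, a human review, or a second-pass LLM correction, turning a probabilistic generator into a checked pipeline.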
Conclusion
There is tremendous potential in the application of AI in the enterprise, and it’s critically important that business leaders and executives have a working understanding of the technology. While excitement is high at the moment, implementations will need to be deliberate and robust for long-term sustainability. There are reports that 95% of enterprise AI implementations have failed to deliver their projected ROI5, often for lack of effective prioritization, poor resource allocation, or overly ambitious expectations. In the long term, enterprises that take measured and strategic approaches to AI implementation will have substantial advantages.
More reading on this topic here:
References
1. https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools
2. https://www.anthropic.com/news/skills
3. https://dspy.ai/
4. https://arxiv.org/abs/2402.06782
5. https://www.forbes.com/sites/andreahill/2025/08/21/why-95-of-ai-pilots-fail-and-what-business-leaders-should-do-instead/