Tools: The Capability Boundaries of Agents

This is the ninth article in the series "Practices and Reflections on Harness Engineering in Enterprise Applications."

Tools Move Agents From Conversation to Work

When generative AI first entered software development, it was closer to an advisor inside a chat window. Humans had to copy requirements, code snippets, errors, and context into the chat, then copy the AI's suggestions, explanations, or example code back into the project. AI could help people think, but it had not truly entered the development environment.

Once AI gained basic tools, the situation changed noticeably. It could read files, modify files, search code, and run commands. Capabilities such as fs read, fs write, shell, and search may look basic, but they changed AI from something that merely answered questions into something that could participate directly in development.

Today's agentic development tools are stronger by default. Many already include file reading and writing, command execution, code search, patching, browser interaction, and similar capabilities. These tools make it easier for agents to understand projects and make development much smoother.

But that is still not enough.

In enterprise application development, a great deal of information and capability still lives outside the default tool set. An agent may not be able to see designs in Figma, read historical discussions in Slack or Google Drive, or understand context from an issue system. It can still try to "check" code style, "judge" whether tests are meaningful, or "infer" coverage, but when those tasks rely only on the model itself, the results are unstable and consume valuable context window capacity.

This is where tools come in.

In this article, tools mean engineering mechanisms outside the model that extend an agent's ability to perceive, act, and verify. They let the agent see what the model itself cannot see, perform actions the model cannot reliably perform, and verify model-generated results in a more dependable way.

Tools roughly fall into several categories.

The first category is tools that connect to external systems, often through MCP. Figma lets agents inspect designs. Playwright lets agents operate browsers. Slack or Google Drive lets agents reach context scattered across team collaboration tools.

The second category is traditional software engineering tools: git, Java, Gradle, Docker, lint, test runners, Jacoco, SonarQube. These are not inventions of the AI era, but in agentic development they become the foundation for stable execution and verification.

The third category is project-specific tools: skills, scripts, custom inspection scripts, generation scripts, status synchronization scripts. They usually come from recurring project problems: a check that is always needed, a report that is always generated, a state update that is always synchronized. Over time, these recurring needs are distilled into tools.

Tools Must Follow Agent Responsibility Boundaries

Tools provide capability, so it is easy to assume that more tools make an agent stronger.

But in Harness Engineering, more tools are not always better. Tool descriptions enter context. Tool results enter context. Tool selection itself consumes agent attention. The more tools an agent has, the easier it is for the agent to hesitate between tools it should not be using, or to bring irrelevant tool results into the current task.

More importantly, tools mean action capability. Reading files, writing files, running commands, accessing external systems, changing status, and triggering deployment are not neutral abilities. When an agent receives tools outside its responsibility, it is more likely to overstep boundaries and create uncontrolled outcomes.

Tools must therefore follow agent responsibility boundaries. An agent's responsibility defines what capabilities it should have, and also which tools it should not touch.

A Dev Agent needs development, build, runtime, and git-related tools because it turns requirements into code. A QA Agent needs test frameworks, Playwright, curl, and database tools because it verifies behavior. A Code Reviewer Agent needs lint, typecheck, static analysis, and custom inspection scripts because it reviews code quality and engineering risk. BA and TL Agents need browsers, Figma, Google Drive, and code-reading tools because they help humans analyze requirements, designs, and technical constraints. A PM Agent needs the ability to read state, organize reports, and dispatch agents because it helps humans manage workflow.
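One way to make this mapping explicit is a per-role allowlist that the harness consults before exposing a tool to an agent. The Python sketch below is only an illustration: the role keys, the tool identifiers, and the can_use helper are hypothetical names chosen for this example, not part of any particular agent framework.

```python
# Hypothetical role-to-tool allowlist. Role keys and tool identifiers are
# illustrative assumptions, not names from a real framework.
AGENT_TOOLS: dict[str, frozenset[str]] = {
    "dev": frozenset({"fs_read", "fs_write", "search", "shell", "git", "build"}),
    "qa": frozenset({"fs_read", "test_runner", "playwright", "curl", "db_query"}),
    "code_reviewer": frozenset(
        {"fs_read", "search", "lint", "typecheck", "static_analysis", "inspection_scripts"}
    ),
    "ba": frozenset({"fs_read", "search", "browser", "figma", "google_drive"}),
    "tl": frozenset({"fs_read", "search", "browser", "figma", "google_drive"}),
    "pm": frozenset({"state_read", "report", "dispatch_agent"}),
}


def tools_for(role: str) -> frozenset[str]:
    """Return the tools granted to a role; unknown roles get no tools."""
    return AGENT_TOOLS.get(role, frozenset())


def can_use(role: str, tool: str) -> bool:
    """True only if the tool is explicitly listed for the role."""
    return tool in tools_for(role)
```

With this table, can_use("dev", "git") is true while can_use("qa", "fs_write") is false, so the QA Agent's inability to modify code is a property of the configuration rather than of the prompt alone.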

Looking back at agent responsibility from the perspective of tools also makes it easier to understand why agents should not overstep. A QA Agent should not modify code, not only because that is outside its responsibility, but also because it lacks the Dev Agent's tools for development, build, commit, and implementation tracking. A Dev Agent should not take over QA verification, not only because that is outside its responsibility, but also because it lacks the QA Agent's tools for test design, browser verification, test data construction, and test reporting. A Code Reviewer Agent should not reimplement functionality, not only because that is outside its responsibility, but also because its tools are primarily for exposing code quality and engineering risk, not for pushing feature development forward.

If an agent does not have the tools needed to do another agent's work, it should not casually cross that responsibility boundary. Tools make responsibilities real, and they make boundaries concrete.

From this perspective, an agent is not only defined by prompt and context. It is also defined by the tools it can use. The same model, given different tools, becomes a different executor. Different responsibilities require different tools; different tools create different capability boundaries.

Tools Are Weapons in Adversarial Practice

For the same responsibility, an agent with tools and an agent without them perform very differently. Without tools, an agent can only rely on reading, reasoning, and self-description to judge a problem. With tools, many judgments become executable, repeatable, and verifiable engineering results.

This difference is most visible in adversarial practice. Dev Agents, Code Reviewer Agents, and QA Agents can all say they performed checks. But without tools, many checks remain subjective agent judgment. With tools, checks become more stable evidence.

The first characteristic of tools is efficiency. lint, typecheck, test runners, coverage, build, and Playwright can execute directly and return results. The agent does not need to repeatedly reason inside the context window. Tools save not only time, but also agent attention.

The second characteristic is precision. Different tools serve different goals: lint checks rules, SAST exposes security risk, Jacoco checks coverage, SonarQube exposes complexity, duplication, and code smells, Playwright verifies real interactions, and custom scripts check project-specific problems. A tool does not turn a clear inspection into a vague paragraph of advice.

The third characteristic is ruthlessness. Agents can explain and defend themselves. More importantly, agents can influence and persuade one another. A Dev Agent may package a meaningless test as "just a smoke test," "not important here," or "to be completed later." If another agent is only reading that explanation, it may be persuaded.

Tools are not persuaded. A meaningless test such as expect(true).toBe(true), or a test that only checks that an object is not null, remains meaningless in front of coverage, mutation testing, and custom inspection scripts. Missing coverage is missing coverage. A rule failure is a rule failure.
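A custom inspection script of this kind can be very small. The Python sketch below flags Jest-style assertions that compare a literal with itself, such as expect(true).toBe(true). It is a regex heuristic written for illustration, and the function name find_trivial_assertions is hypothetical. It catches only the most obvious pattern and would normally sit alongside coverage and mutation testing rather than replace them.

```python
import re

# Matches expect(<literal>).toBe(<same literal>) and the toEqual/toStrictEqual
# variants. The named backreference forces both sides to be the same literal.
TRIVIAL_ASSERTION = re.compile(
    r"""expect\(\s*(?P<lit>true|false|null|\d+|'[^']*'|"[^"]*")\s*\)
        \s*\.\s*(?:toBe|toEqual|toStrictEqual)\(\s*(?P=lit)\s*\)""",
    re.VERBOSE,
)


def find_trivial_assertions(source: str) -> list[tuple[int, str]]:
    """Return (line number, matched text) for each trivially true assertion."""
    findings: list[tuple[int, str]] = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        match = TRIVIAL_ASSERTION.search(line)
        if match:
            findings.append((line_no, match.group(0)))
    return findings
```

Because the script reports a line number and the exact matched text, a Dev Agent cannot argue that the assertion is "just a smoke test". The finding is either present in the report or it is not.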

This is the most important value of tools in adversarial practice: they turn part of quality into non-negotiable facts. Agents can discuss reasons and propose fixes, but they cannot explain their way around problems exposed by tools.

Of course, tools are weapons in adversarial practice, not adversarial practice itself. Tools expose signals, but those signals still need to be understood, judged, fixed, and distilled by agents and humans. Real adversarial practice still comes from checks made under different responsibilities, different standards, and different contexts.

Humans Prepare Tools and Manage Permissions

Tools do not automatically become part of a harness. Humans need to decide which capabilities should be delegated to tools, which knowledge should be distilled into skills, which repeated actions should become scripts, which tools should be given to which agents, and whether tool use should be allowed by default or require confirmation.

The first human responsibility is to equip agents with suitable tools. "Suitable" has two meanings. On one hand, the tools must be sufficient for the agent to complete its responsibility efficiently. On the other hand, the tools must not exceed the agent's responsibility boundary. If tools are insufficient, the agent can only guess, explain, or rely on unstable model judgment. If tools exceed the boundary, the agent may perform actions it should not perform and cannot be responsible for.

The second responsibility is to distill recurring knowledge and processes into tools. The tool does not always need to be written from scratch. It may be an existing tool that humans identify through experience and then integrate into the harness. If a check is always needed, it should become a script. If a report is always generated, it can become a skill. If a quality problem appears repeatedly, it may require a custom rule or inspection tool. If a step is repeatedly forgotten by agents, it should enter a handbook, workflow, or automation chain.
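As an example of turning a recurring check into a script, the Python sketch below reads a JaCoCo XML report and applies a minimum line-coverage threshold. It assumes the report-level totals appear as counter elements directly under the report root, each with type, missed, and covered attributes. The 80% default and the function names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def line_coverage(report_xml: str) -> float:
    """Return the report-level line coverage ratio from JaCoCo XML text."""
    root = ET.fromstring(report_xml)
    # Only direct children of <report> are report-level totals; nested
    # counters belong to packages, classes, and methods.
    for counter in root.findall("counter"):
        if counter.get("type") == "LINE":
            missed = int(counter.get("missed", "0"))
            covered = int(counter.get("covered", "0"))
            total = missed + covered
            return covered / total if total else 0.0
    raise ValueError("no report-level LINE counter found")


def coverage_gate(report_xml: str, minimum: float = 0.8) -> tuple[bool, str]:
    """Return (passed, message) for a minimum line-coverage threshold."""
    ratio = line_coverage(report_xml)
    passed = ratio >= minimum
    verdict = "PASS" if passed else "FAIL"
    return passed, f"{verdict}: line coverage {ratio:.1%}, minimum {minimum:.1%}"
```

An agent that calls this script receives a pass or fail result with a number attached, instead of spending context on estimating coverage by reading test files.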

The third responsibility is managing tool permissions. Permissions have at least two layers: whether a tool can be used at all, and within what scope it can be used. Reading files, writing files, running commands, accessing external systems, changing status, and triggering deployment all require different permission strategies. Some tools can be allowed by default, some must ask for confirmation, and some can only be used by specific agents or in specific workflow stages. Even for the same action, scope changes the meaning of permission: deleting generated temporary files inside the project is very different from deleting files outside the project; reading documents in an external system is very different from modifying production configuration. If tool permission is not bound to scope, the agent's capability boundary becomes too wide.
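The two layers, whether a tool may be used and within what scope, can be sketched as scope-bound rules with a deny-by-default fallback. The Python example below is a minimal illustration under stated assumptions: the Rule and Decision types, the tool names, and the /repo paths are all hypothetical, and the check is purely lexical, so it does not resolve symlinks.

```python
import posixpath
from dataclasses import dataclass
from enum import Enum
from pathlib import PurePosixPath


class Decision(Enum):
    ALLOW = "allow"      # permitted without asking
    CONFIRM = "confirm"  # a human must approve first
    DENY = "deny"        # never permitted


@dataclass(frozen=True)
class Rule:
    tool: str            # hypothetical tool identifier
    scope: str           # absolute path prefix the rule covers
    decision: Decision


def decide(rules: list[Rule], tool: str, target: str) -> Decision:
    """Return the most specific matching rule's decision; deny by default."""
    # Collapse ".." segments so a target cannot escape its scope lexically.
    path = PurePosixPath(posixpath.normpath(target))
    best_depth, best = -1, Decision.DENY
    for rule in rules:
        scope = PurePosixPath(rule.scope)
        in_scope = path == scope or scope in path.parents
        if rule.tool == tool and in_scope and len(scope.parts) > best_depth:
            best_depth, best = len(scope.parts), rule.decision
    return best


# Hypothetical policy: deleting generated files is allowed, deleting other
# project files needs confirmation, and anything else is denied.
EXAMPLE_RULES = [
    Rule("fs_delete", "/repo/build", Decision.ALLOW),
    Rule("fs_delete", "/repo", Decision.CONFIRM),
    Rule("fs_read", "/repo", Decision.ALLOW),
]
```

Under these example rules, deleting /repo/build/tmp/a.class is allowed, deleting /repo/src/Main.java requires confirmation, and deleting /etc/passwd is denied because no rule covers that scope.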

Tool permissions must therefore match responsibilities. Humans are not simply giving agents "more capability." They are preparing just enough capability for agents to complete their responsibilities, while keeping that capability inside controlled boundaries.

Back to Harness: Tools Are the Hands and Eyes of Agents

Back to harness: tools are the hands and eyes of agents. They let agents see what the model itself cannot see, and let agents perform actions the model itself cannot do, cannot do reliably, or should not do purely through model reasoning.

Tools provide agents with powerful capabilities. They let agents read information, modify projects, call systems, run checks, and verify results more efficiently. They also turn many things that would otherwise rely on model reasoning into more stable, more accurate, and more repeatable engineering actions.

But that power is also dangerous.

If the wrong tool is used by the wrong agent, at the wrong time, and within the wrong scope, the result can go out of control. An agent that should only read context may modify files it should not modify. An agent that should only verify behavior may cross the boundary and change implementation. A tool that should only run locally may touch real systems or real data.

A good harness does not give agents unlimited tools. It gives agents tools that are sufficient, handy, reliable, and controlled within their responsibility boundaries. Tools should make agents faster and more accurate, while preventing them from acting where they should not act.

Tools are not an accessory to the harness. They are an important part of how the harness controls agent action capability. They determine what agents can see and what agents can touch. They extend agent capability, and they also define agent boundaries.