Adversarial Development: Making Code Withstand Review and Verification

This is the fifth article in the series "Practices and Reflections on Harness Engineering in Enterprise Applications."

Why Writing Code Is Not Enough

After the Spec passes review and the analyzing -> implementing human gate opens, the agent can finally start implementation.

At this point, the problem has been defined, the boundaries have been confirmed, and the acceptance criteria have been written into the Spec. It may seem that all that remains is for the agent to write code according to the Spec.

Reality is usually not that simple.

After one round of implementation, the code may look fairly complete: the feature seems to run, the page opens, the API returns. But the real problems often hide in the details. The happy path may work while exception paths are unhandled. Permission branches may be missing. UI states may be incomplete. Data boundaries may not be covered. The implementation may bypass architectural constraints.

More commonly, there are no new tests at all. Or existing tests are already failing, and the agent has not really addressed them. The code may contain duplicated implementations, careless abstractions, code smells, or structural damage introduced just to make the current path work. As for the development report, it may be nothing more than "done," or there may be no record that future people or agents can understand and trace.

That may be enough for a demo, but enterprise applications cannot accept it. Enterprise applications face real users, real data, real failures, real permissions, real collaboration, and long-term maintenance. Code that merely appears to run, but cannot be reviewed, verified, traced, or explained, is not completion. It is risk.

This is not a new problem created by AI. Software engineering has long understood that writing code only produces change. It does not automatically make the change correct. Code must enter a feedback loop, accept review and testing, and expose problems through the shortest possible path.

AI makes code appear faster, but it also makes errors accumulate faster. In Harness Engineering, the implementation phase therefore cannot focus only on "letting the agent write code." It must also make the code withstand adversarial review and verification.

Why Agents Need Adversarial Roles

Adversarial practice is not an invention of the AI era, nor is it meant to create conflict. It comes from feedback mechanisms that already exist in software engineering: implementation needs review, behavior needs verification, and problems need to surface early.

If review and testing used to be part of the software engineering feedback loop performed by humans, then once the actor writing code becomes an agent, these adversarial checks naturally move to agents as well.

The most intuitive approach is to let one agent do everything: write code, inspect code, run tests, fix issues, and summarize results. This looks simple and fits the fantasy of "let AI complete the task by itself."

In practice, it quickly becomes unreliable.

The first problem is that an agent can easily convince itself. The agent that generated the code tends to explain its own implementation through the same assumptions that produced it. It can and should perform self-checks, but self-checks cannot replace an independent perspective. The implementer should not be the only one deciding whether its result is correct.

The second problem is that the Context Window quickly fills up. If implementation details, review findings, test output, repair history, and report material are all pushed into the same Context, the agent gradually loses focus on its current responsibility. It has to remember how the code was written, inspect architectural boundaries, verify user behavior, and track test results all at once. Eventually it may have seen everything, while focusing on nothing.

The third problem is that discovery and repair become mixed together. If an agent finds a problem during implementation, it often fixes it immediately. That may look efficient, but the problem, judgment, and repair process get flattened into one code change. Later people and agents struggle to know what was discovered, why it was a problem, what the repair was based on, whether it related to the Spec, and whether reverse sync was needed.

Finding and fixing problems are themselves valuable project experience. If they are not recorded, the traceability of the code is harmed.

So adversarial development must split not only perspectives, but also discovery, recording, and repair. In this practice, adversarial work is divided into roles: Dev Agent implements and performs basic self-checks; Code Reviewer checks whether the code conforms to the Spec and engineering constraints; QA Agent verifies whether behavior conforms to the Spec.

This split is not about imitating the organization of a human team. It preserves one of the most important quality mechanisms in software engineering: implementation cannot prove itself correct, and behavior must be verified independently. Different agents read the same Spec, but work with different responsibilities and different Contexts. That avoids self-persuasion and prevents a single Context from becoming overloaded.

Adversarial work is not mutual negation. It is mutual calibration. Its goal is not to create conflict, but to let code approach correctness through repeated checks under different standards.

The Adversarial Development Flow

The adversarial development flow does not invent new software engineering. It inherits ideas, processes, and tools that have already been validated: define how to verify, implement, review, test, and feed discovered problems back into implementation.

QA Agent: Bring Verifiability Into the Process First

Before implementation begins, QA Agent should enter the flow.

Test-first here does not mean narrow baby-step, red-green-refactor TDD. That practice is largely designed to reduce the cognitive load of human developers. For agents, overly fine-grained red-green cycles may increase process overhead instead. What is worth inheriting is the idea behind test-first: before implementation begins, bring "how to prove completion" into the process.

QA Agent is better suited than Dev Agent to do this first. It focuses more on testing principles, test granularity, and the choice of test doubles. It is also better positioned to translate business language into testing language before implementation details interfere: given which preconditions, when which action happens, what result should be observed.

These test designs can use Given-When-Then, and their granularity can be chosen according to the story: unit, component, in-process integration, e2e, or manual verification paths. The goal is not to write all complete tests upfront, but to establish the verification approach and test skeleton.

QA Agent is not a developer. It is more sensitive to testing principles and verification paths, but it is not suited to implement the tests. Its output here is only a test skeleton: what needs to be verified, how it should be verified, where test doubles are needed, and which scenarios must be covered. Dev Agent later turns those skeletons into runnable tests during implementation.
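As a concrete illustration, such a skeleton might look like the sketch below, assuming a Jest-style runner such as Vitest. The story ID, scenario names, and behaviors are invented for illustration; each placeholder carries its Given-When-Then intent as a comment, and Dev Agent later fills in the bodies.

```ts
// Hypothetical test skeleton produced by QA Agent for an order-creation story.
// Only structure and intent are defined here; Dev Agent turns these into runnable tests.
import { describe, it } from "vitest";

describe("Story-142: create order", () => {
  // Given an authenticated buyer, When they submit a valid cart,
  // Then an order is created and stock is reserved.
  it.todo("creates an order and reserves stock for a valid cart");

  // Given a buyer without the 'order:create' permission,
  // When they submit a cart, Then the API responds 403 and nothing is persisted.
  it.todo("rejects creation without the order:create permission");

  // Given a cart referencing an out-of-stock item,
  // When the order is submitted, Then the API responds 409 with a clear error.
  it.todo("fails with a conflict when an item is out of stock");
});
```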

Dev Agent: Implement Against Spec and Complete the Tests

Dev Agent should not write all code in one burst. It should implement requirement by requirement, enhancement by enhancement, following the Spec point by point.

For each completed point, Dev Agent should turn the corresponding test skeletons into runnable tests, run the relevant tests, and record progress. At the end of each point, it should commit the code and form a clear boundary for that code change. After all points are complete, it should run the full test suite to confirm that other functionality has not been broken.

The development process itself must not become a black box. Dev Agent needs to record which requirement or enhancement is being implemented, what has been completed, what remains unfinished, which tests were added, what issues were discovered, and what risks remain. Progress records are not decorative management fields. They are engineering material that makes implementation handoff, tracing, and retrospective analysis possible.

Dev Agent also needs to bind code changes to git commit ids. Otherwise, a progress record only says "what was done," but cannot precisely trace which code change corresponds to it. Commit ids create a traceable chain between requirement, implementation, tests, review findings, and repair records. If a problem needs to be investigated later, or a behavior needs to be rolled back, one can trace from the story to reports and then to the concrete code change.

Without commit ids, development records easily become another kind of chat summary. With commit ids, they connect to engineering facts.
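One possible shape for such a progress record, sketched in TypeScript; every field name here is an assumption for illustration, not a prescribed schema.

```ts
// Illustrative shape for a Dev Agent progress record; all names are assumptions.
interface ProgressRecord {
  story: string;              // e.g. "story-142"
  specClause: string;         // which requirement or enhancement this point implements
  commitId: string;           // git commit id that bounds this code change
  testsAdded: string[];       // test files or case names added for this point
  status: "done" | "in-progress" | "blocked";
  issuesFound: string[];      // problems discovered while implementing
  remainingRisks: string[];   // known gaps left for review or QA
}

const example: ProgressRecord = {
  story: "story-142",
  specClause: "REQ-3: reject orders without order:create permission",
  commitId: "a1b2c3d",
  testsAdded: ["order.create.permission.spec.ts"],
  status: "done",
  issuesFound: ["existing PermissionGuard did not cover batch endpoints"],
  remainingRisks: ["batch endpoint fix deferred to a follow-up story"],
};
```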

In practice, the implementation of one story often exceeds the Context capacity of a single agent. At that point, multiple Dev Agents may need to work in parallel or take turns. Whether they can work in parallel depends on whether the code has ordering dependencies. If obvious dependencies exist, implementation should proceed sequentially to avoid multiple agents changing code based on assumptions they cannot see in each other.

Look closely, and this is still a traditional development process: split work, implement, add tests, run tests, commit, record progress. The executor has changed to an agent.

Code Reviewer: Check Whether Code Conforms to Spec and Engineering Constraints

After one implementation round, the code should enter Code Review immediately.

What Code Reviewer does is still very traditional: check whether the implementation conforms to the Spec, whether code smell appears, whether implementation is duplicated, whether team conventions are violated, whether architectural boundaries are broken, whether unnecessary complexity has been introduced, whether tests are missing, and whether there are security, performance, or maintainability risks.

Tools should also enter this process. Lint, typecheck, SAST, dependency scanning, coverage, and build results can all serve as inputs to Code Reviewer. But tools are still weapons, not the adversarial process itself. The real judgment comes from Code Reviewer combining the Spec, code changes, and engineering constraints.
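A minimal sketch of how tool results might be collected as structured reviewer input, assuming a Node project whose lint and typecheck commands are `npx eslint .` and `npx tsc --noEmit`; adapt the commands to the actual toolchain.

```ts
// Sketch: gather machine-check results as structured input for Code Reviewer.
// The commands below are assumptions about the project's toolchain.
import { execSync } from "node:child_process";

interface ToolResult {
  tool: string;
  passed: boolean;
  output: string;
}

function run(tool: string, command: string): ToolResult {
  try {
    const output = execSync(command, { encoding: "utf8", stdio: "pipe" });
    return { tool, passed: true, output };
  } catch (err) {
    // execSync throws on a non-zero exit; capture whatever the tool printed.
    const e = err as { stdout?: string; stderr?: string };
    return { tool, passed: false, output: `${e.stdout ?? ""}${e.stderr ?? ""}` };
  }
}

// These results are attached to the review context; the judgment about what
// they mean is still made by Code Reviewer against the Spec and the diff.
const reviewerInputs: ToolResult[] = [
  run("lint", "npx eslint ."),
  run("typecheck", "npx tsc --noEmit"),
];
```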

There is no essential difference from traditional Code Review. The only change is that the feedback distance becomes shorter: write, review, fix. It is slower than pair programming, but much faster than traditional asynchronous Code Review.

QA Agent: Verify Whether Behavior Conforms to Spec

After Code Review, QA Agent still needs to verify whether behavior truly conforms to the Spec.

QA Agent's work is also traditional. It runs automated tests, starts services, uses curl, Playwright, or similar tools to verify APIs and pages, manually clicks through key paths when needed, observes visual effects and interaction states, and records discovered issues.
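For example, a behavior check against a running service might look like the following Playwright Test sketch; the URL, route, and UI text are hypothetical stand-ins for whatever the Spec defines.

```ts
// Sketch of a QA Agent behavior check with Playwright Test.
// The local URL and the expected texts are invented for illustration.
import { test, expect } from "@playwright/test";

test("order page shows an empty state for a user with no orders", async ({ page }) => {
  await page.goto("http://localhost:3000/orders");
  // Verify the state the Spec defines, not merely that the page loads.
  await expect(page.getByText("No orders yet")).toBeVisible();
  await expect(page.getByRole("button", { name: "Create order" })).toBeEnabled();
});
```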

The goal of QA is not to prove that "tests ran." It is to prove that behavior conforms to the Spec. Permission branches, exception paths, boundary data, state transitions, visual details, and real user paths all need verification. When QA finds a problem, it should not modify code directly. It should record the bug clearly: what the problem is, how to reproduce it, what was expected, what actually happened, and which Spec clause it relates to.
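A bug record carrying those elements could be shaped roughly like this; the field names are illustrative only, not a prescribed schema.

```ts
// Illustrative shape for a QA bug record; all field names are assumptions.
interface BugRecord {
  id: string;
  summary: string;           // what the problem is
  reproduction: string[];    // steps to reproduce it
  expected: string;          // behavior required by the Spec
  actual: string;            // behavior actually observed
  specClause: string;        // which Spec clause the behavior violates
  evidence: string[];        // logs, screenshots, command output
}
```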

Reports and Repair: Feed Adversarial Results Back Into Code

Reports are not the end of adversarial work. They are the input to the next repair cycle. The Code Review report and the QA report play a role similar to a "repair-phase spec": they explain what was found, why it is a problem, which Spec clause it relates to, and where the fix should return.

Adversarial reports also have long-term value: they let repeated problems settle into knowledge. If a certain kind of Code Review finding appears repeatedly, the project may need a new coding guideline, checklist, or handbook. If a certain kind of QA failure appears repeatedly, the Spec template, testing strategy, or component standard may need to change. Reports are not disposable material. They are an entry point through which project memory becomes knowledge.

Reports are also the basis for final human review. People should not have to re-review all code from scratch. They should use development reports, review reports, and test reports to understand the implementation path, risks, verification results, and remaining issues.

Look closely, and this flow contains nothing new. It is still test design, development, self-check, Code Review, QA verification, and defect repair. Harness Engineering reorganizes these practices into a structure agents can execute, hand off, and trace, while recording the judgments, problems, repairs, and results produced along the way.

These records have three layers of value. For Dev Agent, they are the input to the next repair cycle. For people, they are the basis for final review and acceptance. For the project, they are memory that can continue to settle. Without records, adversarial work is only a temporary inspection. With records, it becomes an engineering process that can repair, accept, and accumulate knowledge.

Human Gate: Final Confirmation and Problem Routing

The checks between agents can only serve as references. They cannot become the final decision. The in-review -> completed human gate must be confirmed by the person in the Human in the Loop.

This follows directly from the principle that authority and responsibility must match. The person is ultimately responsible for the system, so the final confirmation authority must belong to the person. More importantly, human experience, thinking, and judgment remain capabilities that agents do not possess and cannot replace. Agents can help discover issues, organize evidence, run verification, and answer questions, but they cannot decide for humans whether a result is truly acceptable.

This does not mean agents do nothing before the human gate. Quite the opposite: at this stage, agents should provide as much support and service as possible. They should prepare development reports, review reports, and test reports; start services; prepare test data; open frontend pages, Swagger pages, or other verification entry points; answer human questions; and provide logs, screenshots, command output, or git information when needed. The agent's job is to reduce the human verification cost as much as possible.

The human task is still heavy.

People need to read development reports, review reports, and test reports, inspect git history and commit messages, and understand the implementation path, risks, verification results, and remaining issues. They may read code when necessary, but should still avoid turning acceptance into a code-reading task. What matters most is functional behavior. The traditional PO sign-off should move forward into this stage: people verify the feature end to end on the running service, check functionality, boundaries, and exceptions, and perform any other verification they believe is necessary.

If everything looks correct, the person can confirm the story is done. The agent then performs closure work: update state, organize records, complete necessary reports, stop temporary services, clean up test data and temporary resources, and move the story from in-review to completed.

If a problem is found, the person at the gate is not deciding "can the agent just fix this?" The real question is "which layer should this problem return to?" Did the code deviate from the Spec? Was the Spec itself unclear? Has the scope of the story changed?

At this point, the agent should help the person write a human review report. This report is not ordinary chat feedback. It is the carrier through which human judgment enters the workflow: what problem was found, how to reproduce it, why it is unacceptable, which layer it should return to, and whether the Spec or story scope needs to change.

If it is only an implementation issue, the human review report can become a local Spec for the repair phase. If it is an analysis issue, the story must return to the analysis phase, the Spec must be updated, and confirmation must happen again. If it is a requirement change, it should be handled as a scope change.
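The routing decision itself can be made explicit. Here is a sketch that models the three return layers as a discriminated union; the type and field names are assumptions for illustration.

```ts
// Sketch: the three layers a problem can return to, as a discriminated union.
// All names are illustrative, not a prescribed schema.
type Routing =
  | { layer: "implementation"; repairSpec: string }        // review report becomes a local Spec for repair
  | { layer: "analysis"; specSectionsToRevisit: string[] } // story returns to analysis; Spec updated and re-confirmed
  | { layer: "scope"; changeRequest: string };             // handled as a scope change

const routing: Routing = {
  layer: "implementation",
  repairSpec: "Fix 403 handling on /orders to match Spec REQ-3",
};
```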

In other words, after a problem is found, the agent should not modify code directly based on one prompt, and humans should not manually patch the code either. The first bypasses problem analysis; the second breaks the discipline of Spec-Driven Development and loses the traceable evidence chain. The correct path is to analyze the root cause first, then decide which layer to return to: the repair report, the Spec, or the story scope.

This is one of the heaviest forms of human work under the Harness Engineering paradigm. One heavy point comes before implementation: writing the problem into a reliable Spec. The other comes after implementation: judging whether the result can truly be closed. Agents can serve this process, but they cannot, and could never, replace the final human judgment.

Back to Harness: Adversarial Practice Is a Continuation of Software Engineering

From the perspective of Harness Engineering, adversarial practice is not a new invention of the AI era.

It inherits ideas, processes, and tools that software engineering has validated for a long time: implementation needs review, behavior needs testing, problems need recording, repairs need tracing, and final results need human confirmation. Agentic development changes the executor and feedback speed, not these engineering principles.

In the past, these tasks were mainly done by people: people wrote code, people performed Code Review, people did QA, people organized reports, and people decided whether something could be delivered. Now many execution tasks can be delegated to agents: Dev Agent implements, Code Reviewer reviews, QA Agent verifies, and agents help organize reports and prepare acceptance environments. The feedback loop can therefore become shorter, and the process records can become more detailed.

But this does not mean code becomes trustworthy simply because it has been written. In enterprise applications, code must withstand multiple layers of adversarial pressure: it must align with the Spec, withstand Code Review, pass QA verification, be traceable through reports, and ultimately be accepted by the person in the Human in the Loop.

The goal of adversarial practice is not to have agents wear each other down, nor to make the process heavier. Its goal is to make AI speed continuously meet engineering feedback, so that problems surface through the shortest possible path and are recorded, fixed, and accumulated.

So adversarial practice in agentic development is, at its core, the natural continuation of software engineering on a new executor. AI makes code appear faster; adversarial practice makes code trustworthy.