7 Feedback, Verification, and Trust
After this chapter you can:
- turn feedback into evidence through fidelity, provenance, and an evidence chain;
- set escalation thresholds and commitment levels by reversibility and blast radius;
- name what can reject a result, and why rejection authority must be independent;
- review a claim with the trust checklist instead of by its tone.
Chapter 6 treated methods as roles inside a design loop. This chapter asks when the outputs of that loop should be believed. The answer is deliberately conservative: an Architecture 2.0 result is credible only when the feedback supporting it has been turned into evidence, the evidence can be audited, an independent authority can reject the result, and the level of human commitment matches the cost of being wrong.
This distinction is central to the lecture. A model can generate a plausible architecture description. A search method can find a strong proxy score. A surrogate can rank candidates. A tool-using agent can call simulators and summarize results. None of those actions creates trust by itself. Trust begins when the loop records what was measured, why it was relevant, how much it cost, what assumptions it used, where the feedback is weak, which alternatives failed, and what can say no.
The lighthouse prompt exposes the issue. A proposed low-power 64-bit RISC-V compute subsystem for XRBench under a 3 W, 3 nm-class low-power mobile envelope might look reasonable under an analytic proxy, promising under simulation, and broken under synthesis or timing. It might meet performance while missing an energy or thermal target. It might pass a benchmark but fail a real deployment scenario. Architecture 2.0 therefore needs an evidence discipline that is as explicit as its representations, environments, and methods.
7.1 Feedback Budget Ledger and Feedback Economics
The first trust problem is economic. Feedback is not free, uniform, or automatically useful. An architecture loop may have thousands of cheap proxy evaluations, hundreds of simulations, tens of synthesis or physical-design runs, a few emulation opportunities, and almost no chances to learn from silicon or deployment mistakes. It may also have scarce human review time, limited tool licenses, long queue delays, confidential workloads, and organizational deadlines. These limits shape which methods are appropriate and how much autonomy is acceptable.
This is why a loop needs a feedback budget ledger. The ledger is not accounting bureaucracy. It is the object that tells the method what kind of learning is possible. A Bayesian optimizer, reinforcement-learning policy, surrogate model, critic, or tool-using agent should behave differently when a feedback source takes milliseconds versus days, when a failed action is reversible versus costly, and when the signal is a rough proxy versus a signoff report. Table 7.1 gives the working form.
Feedback budget. A feedback budget records which evaluations, measurements, tool runs, human reviews, and deployment signals are available, what they cost, how long they take, how reversible they are, and what level of decision they can support.
| Budget item | What to record | Why it matters |
|---|---|---|
| Latency and cost | Runtime, queue time, dollar cost, tool hours, license limits, and human review time. | Determines whether the loop should search broadly, sample carefully, or mostly critique. |
| Signal quality | Fidelity level, metric definition, noise, determinism, coverage, and uncertainty. | Separates raw feedback from decision-grade evidence. |
| Sample budget | Number of possible runs at each fidelity, including failed runs and invalid candidates. | Forces sample-efficient methods and preserves negative traces. |
| Reversibility | Whether the action can be undone cheaply, re-run, patched, or rolled back. | Connects autonomy to risk. Reversible actions can tolerate weaker evidence than irreversible commitments. |
| Scarce attention | Expert review, debugging effort, validation bandwidth, security review, and integration time. | Prevents the loop from outsourcing cost to people whose time is the real bottleneck. |
The ledger also changes what a result means. A method that finds a good point after 10,000 cheap proxy evaluations has learned something different from a method that selects three candidates for expensive synthesis. A loop that records failures, timeouts, warnings, rejected candidates, and review notes has more evidence than a loop that records only the winning score. This is the connection to sample efficiency from Chapter 6: sample efficiency is not only about using fewer evaluations. It is about making each evaluation carry more architectural information.
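The rows of Table 7.1 and the negative traces described above can be sketched as a small record type. This is a minimal illustration, not a prescribed schema: the class names, field names, cost units, and the two example sources are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackSource:
    """One ledger row: a feedback source and its budget (hypothetical fields)."""
    name: str
    fidelity: str          # e.g. "proxy", "simulation", "synthesis"
    cost_per_run: float    # abstract cost units, not dollars
    latency_hours: float   # runtime plus queue time
    max_runs: int          # sample budget at this fidelity
    reversible: bool       # can the action be undone cheaply?


@dataclass(frozen=True)
class LedgerEntry:
    source: str
    candidate: str
    status: str            # "ok", "failed", "timeout", or "invalid"
    note: str = ""


class FeedbackLedger:
    def __init__(self, sources):
        self.sources = {s.name: s for s in sources}
        self.entries = []

    def runs_used(self, source):
        return sum(1 for e in self.entries if e.source == source)

    def record(self, source, candidate, status, note=""):
        # Failed and invalid runs still consume budget, so they are charged too.
        if self.runs_used(source) >= self.sources[source].max_runs:
            raise RuntimeError(f"sample budget exhausted for {source!r}")
        self.entries.append(LedgerEntry(source, candidate, status, note))

    def spent(self):
        return sum(self.sources[e.source].cost_per_run for e in self.entries)

    def negative_traces(self):
        # Everything that did not succeed is kept, not discarded.
        return [e for e in self.entries if e.status != "ok"]


# Hypothetical sources and runs, for illustration only.
ledger = FeedbackLedger([
    FeedbackSource("analytic_proxy", "proxy", 0.01, 0.001, 10_000, True),
    FeedbackSource("synthesis", "synthesis", 50.0, 12.0, 3, True),
])
ledger.record("analytic_proxy", "cand-A", "ok")
ledger.record("synthesis", "cand-A", "failed", "timing violation")
print(ledger.spent(), [e.note for e in ledger.negative_traces()])
```

The design choice worth noting is that the failed synthesis run is both charged against the budget and retained as a trace, so the ledger holds more than the winning score.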
One way to make that discipline concrete is to write the feedback budget and sample value explicitly:
\[ B = \sum_i n_i c_i, \qquad V_i \approx \frac{\Delta \mathrm{Conf}(d \mid e_i)}{c_i}. \]
Here, \(n_i\) is the number of evaluations of feedback type \(i\), \(c_i\) is the cost of one such evaluation, \(B\) is the total feedback budget, \(e_i\) is the evidence produced by that evaluation, \(\Delta \mathrm{Conf}(d \mid e_i)\) is the change in confidence for the decision \(d\), and \(V_i\) is the sample value of that evaluation: the confidence it adds per unit of cost. The equation is not a universal currency for evidence; it is a question the loop designer must answer before each stage: will another simulation, synthesis run, expert review, or deployment experiment change a decision enough to justify its cost? Chapter 8 makes the question concrete: it spends the proxy budget freely, spends the cycle-level budget carefully, and reserves a high-fidelity power check for the surviving candidate.
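As a worked illustration of the two expressions, the sketch below evaluates \(B\) and \(V_i\) for three feedback types. Every number is a hypothetical placeholder, and the confidence gain per run is an assumed judgment by the loop designer, not something the formula supplies.

```python
# Hypothetical numbers: (n_i runs, c_i cost per run, assumed confidence gain per run).
STAGES = {
    "proxy": (1000, 0.01, 0.0002),
    "cycle_sim": (40, 2.0, 0.01),
    "power_check": (1, 100.0, 0.30),
}


def total_budget(stages):
    """B = sum_i n_i * c_i."""
    return sum(n * c for n, c, _ in stages.values())


def sample_value(delta_conf, cost):
    """V_i ~= delta_conf / c_i: confidence gained per unit of cost."""
    return delta_conf / cost


B = total_budget(STAGES)
V = {name: sample_value(d, c) for name, (_, c, d) in STAGES.items()}
best = max(V, key=V.get)
print(f"B = {B:.1f} cost units; highest value per unit cost: {best}")
```

With these placeholder values the budget is 10 + 80 + 100 = 190 cost units and the proxy has the highest value per unit cost. That ranking is not fixed: once further proxy runs stop changing the decision, their assumed confidence gain falls toward zero and a costlier stage becomes the better purchase.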
7.2 Fidelity Ladders and Evidence Chains
Feedback becomes evidence only when it is tied to fidelity, provenance, uncertainty, and a decision. A simulator result is feedback. It becomes evidence when the workload, simulator version, configuration, random seed, assumptions, metric definition, failure status, and acceptance criterion are recorded. A synthesis report is feedback. It becomes evidence when the technology assumptions, constraints, tool versions, warnings, and comparison baseline are explicit.
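The difference between a simulator result as feedback and as evidence can be sketched as a record that knows which of its provenance fields are still empty. The field names below follow the list in the paragraph above; the class and method names are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SimulationEvidence:
    """A simulator result plus the context needed to treat it as evidence."""
    metric_value: float
    workload: Optional[str] = None
    simulator_version: Optional[str] = None
    configuration: Optional[str] = None
    random_seed: Optional[int] = None
    assumptions: Optional[str] = None
    metric_definition: Optional[str] = None
    failure_status: Optional[str] = None
    acceptance_criterion: Optional[str] = None

    def missing_fields(self):
        # `is None` rather than truthiness, so a seed of 0 counts as recorded.
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def is_evidence(self):
        return not self.missing_fields()


# A bare number is feedback; it is not yet evidence.
raw = SimulationEvidence(metric_value=1.8)
print(raw.is_evidence(), raw.missing_fields())
```

A complete record only shows that the context was written down. Whether the recorded workload or acceptance criterion is the right one remains a matter for review.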
Feedback regimes and evidence chain. Feedback regimes organize sources from cheap proxies to high-commitment checks. An evidence chain records how a claim moves across those regimes, preserving provenance, uncertainty, negative traces, rejection gates, and human decisions.
This idea is not new to engineering. Safety-critical fields formalize it as an assurance case: a structured argument that links a top-level claim to sub-claims, assumptions, and supporting evidence, often written in Goal Structuring Notation so that each inference and the evidence under it are explicit and reviewable (Kelly and Weaver 2004). An Architecture 2.0 evidence chain is an assurance case for a design decision: it states what is claimed, at what fidelity, on what evidence, and what could reject it. Naming the lineage helps, because that field already catalogs how such arguments fail: unstated assumptions, evidence that does not support the claimed scope, and confidence that outruns the proof.
Figure 7.1 shows the working model. A claim should move through a chain of increasingly costly feedback sources, with a rejection gate and evidence record at each stage.
Verification is used broadly in this chapter. It includes formal methods when formal properties are available, but it also includes type checks, interface checks, regression tests, baseline replay, simulator cross-checks, synthesis constraints, physical-design warnings, security review, workload coverage, and expert design review. The common requirement is independence: the check should be able to reject the claim, not merely restate the method’s output.
The regime view is not a simple ranking from false to true. Higher fidelity is not automatically truth if the wrong workload, objective, constraints, or baseline were used. A detailed physical-design result can still answer the wrong architecture question. A deployment signal can still be confounded by a software change. A benchmark can still be too narrow. The purpose of the feedback-regime view is therefore not to worship expensive tools. It is to make the path from weak feedback to stronger evidence explicit.
For the lighthouse prompt, low-fidelity feedback may be useful for eliminating obviously infeasible vector widths, memory sizes, or accelerator partitions. Simulation may then test workload behavior and data movement. Synthesis or emulation may expose timing, area, or power problems. Deployment-like traces or silicon evidence may reveal workload drift or integration effects. At each stage, the loop should ask whether the earlier conclusion survived, changed, or should be rejected.
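The staged questioning above can be sketched as a chain of rejection gates that records every stage and stops at the first rejection. The gate thresholds, the candidate's field names, and its values are hypothetical. In a real loop each gate would invoke its own tool instead of reading a precomputed number.

```python
def proxy_gate(c):
    ok = c["vector_bits"] <= 512
    return ok, ("vector width feasible" if ok else "vector width infeasible")


def simulation_gate(c):
    ok = c["sim_power_w"] <= 3.0
    return ok, ("modeled power within 3 W" if ok else "modeled power exceeds 3 W")


def synthesis_gate(c):
    ok = c["timing_slack_ns"] >= 0.0
    return ok, ("timing met" if ok else "negative timing slack")


def run_chain(candidate, stages):
    """Walk the stages in order, recording each outcome in the evidence chain."""
    chain = []
    for name, gate in stages:
        passed, reason = gate(candidate)
        chain.append({"stage": name, "passed": passed, "reason": reason})
        if not passed:
            break  # a rejection stops escalation; the failure stays in the record
    return chain


STAGES = [
    ("proxy", proxy_gate),
    ("simulation", simulation_gate),
    ("synthesis", synthesis_gate),
]

# Hypothetical candidate: fine under the proxy and simulation, broken at synthesis.
candidate = {"vector_bits": 256, "sim_power_w": 2.7, "timing_slack_ns": -0.05}
chain = run_chain(candidate, STAGES)
for step in chain:
    print(step)
```

The returned chain keeps the passing stages alongside the failing one, so a reviewer can see which earlier conclusions survived and where the claim was rejected.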
This framing is consistent with the quantitative tradition in computer architecture, where measurement, abstraction, and careful comparison are central (Hennessy and Patterson 2017). Architecture 2.0 adds a loop-level requirement: the evidence chain itself must be represented so that a compound method system and a human reviewer can inspect it.
7.3 Commitment Levels and Reversibility
Trust requirements should rise with commitment. A loop that generates a candidate simulator configuration can tolerate more automation than a loop that changes RTL, partitions a chiplet boundary, selects a package interface, or recommends a deployment policy. The difference is not whether AI is involved. The difference is rollback cost, blast radius, and who bears the consequence when the loop is wrong.
Table 7.2 gives a commitment-level view. The exact ordering will vary across organizations, but the pattern is stable: reversible exploration can use lighter evidence, while irreversible or high-blast-radius decisions require stronger evidence, independent rejection, and human ownership.
Escalation threshold. An escalation threshold is the stated condition under which a loop must stop relying on its current feedback source and move to stronger evidence, independent review, or explicit human approval.
The architect owns these thresholds because they depend on consequences, not only on model confidence. A proxy win may be enough to keep exploring. It is not enough to change a subsystem interface, waive a verification concern, or commit to a power claim. The loop should therefore state in advance which events force escalation: uncertainty outside the calibrated range, a benchmark coverage gap, a failed tool check, a security boundary, a high rollback cost, or a decision that would affect another team or product.
| Commitment | Example actions | Required discipline | Automation stance |
|---|---|---|---|
| Exploratory | Generate hypotheses, configs, candidate questions, or design cards. | Basic validity checks and provenance. | Broad assistance acceptable. |
| Experimental | Run simulator sweeps, tune compiler flags, select candidates for deeper study. | Workload versions, seeds, baseline, rejected candidates, and uncertainty. | Automation with review. |
| Implementation | Change RTL, generators, tool constraints, tests, or runtime interfaces. | Tool checks, regression tests, synthesis or integration evidence. | Bounded automation plus rejection gates. |
| Integration | Change subsystem interfaces, chiplet boundaries, memory contracts, or product-facing policies. | Cross-tool checks, compatibility, security, and explicit escalation. | Advisory or human-approved. |
| Irreversible | Mask-level choices, committed signoff decisions, fleet-wide rollouts, or customer-visible deployments. | Independent evidence chain, rejection authority, audit trail, and accountable human decision. | Human commitment dominates. |
This commitment structure keeps Architecture 2.0 from making a naive autonomy argument. Autonomy is not a virtue by itself. A loop may be highly automated for low-commitment exploration and deliberately conservative for high-commitment decisions. In fact, some of the most valuable near-term systems may not be systems that make final design choices. They may be systems that narrow a space, identify contradictions, preserve evidence, and prepare the human architect to make a better decision.
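The commitment levels of Table 7.2 and the escalation events listed earlier can be combined into a small decision function. This is a sketch under stated assumptions: the event identifiers are hypothetical names for the events in the text, and placing the human-approval line at the integration level is one reading of the table, not a rule the chapter fixes.

```python
from enum import IntEnum


class Commitment(IntEnum):
    EXPLORATORY = 1
    EXPERIMENTAL = 2
    IMPLEMENTATION = 3
    INTEGRATION = 4
    IRREVERSIBLE = 5


# Hypothetical identifiers for the escalation events named in the text.
ESCALATION_EVENTS = {
    "uncertainty_outside_calibrated_range",
    "benchmark_coverage_gap",
    "failed_tool_check",
    "security_boundary",
    "high_rollback_cost",
    "affects_other_team_or_product",
}


def decide(level, observed_events, human_approved=False):
    """Return (action, triggering_events) for a proposed step of the loop."""
    triggered = sorted(set(observed_events) & ESCALATION_EVENTS)
    if triggered:
        return "escalate", triggered
    # Assumption: integration-level and irreversible steps need explicit approval.
    if level >= Commitment.INTEGRATION and not human_approved:
        return "needs_human_approval", []
    return "proceed", []


print(decide(Commitment.EXPLORATORY, []))
print(decide(Commitment.EXPERIMENTAL, ["failed_tool_check"]))
print(decide(Commitment.INTEGRATION, []))
```

The function checks escalation events before commitment level, so a failed tool check forces escalation even during low-commitment exploration. It also omits the stronger evidence requirements that the table attaches to higher levels; those would need their own checks.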
7.5 Proxy Mismatch, Metric Gaming, and Calibration
The central failure mode is proxy mismatch. A loop optimizes the measurement it can see, while the architect cares about a broader objective. Instructions per cycle (IPC) may improve while energy, area, or tail latency worsens. A simulator metric may improve while synthesis exposes timing or power problems. A benchmark result may improve while the real workload distribution changes. A Pareto frontier may look convincing because all points were evaluated under the same flawed proxy. An agent may appear capable because it overfits the evaluation loop, not because it understands the architecture problem.
This is not a new problem created by AI. Benchmarks, simulators, cost models, and design rules have always been approximations. What changes in Architecture 2.0 is the speed and persistence with which a method can exploit the approximation. A human may notice that a score is improving for the wrong reason. A search method may happily continue. A compound agent may even produce a persuasive narrative for a proxy win unless the loop asks for calibration and counterevidence.
A configuration leads the design-space study for weeks on a cycle-level model: better IPC, lower modeled energy, a clean Pareto point. At synthesis the lead evaporates, because the model never charged for the timing and congestion the winning structure creates. The team did not pick a bad candidate; it trusted a proxy past the point the proxy could support. The cheap fix is a rule written before the search starts: a cycle-level win is a reason to escalate, never a reason to commit.
Calibration is therefore a trust requirement. Prediction models used in architecture have long needed validation against held-out data, higher-fidelity measurements, or carefully designed experiments (Lee and Brooks 2006; Ipek et al. 2006). The same principle applies to agentic loops. If a loop claims a candidate satisfies the 3 W lighthouse envelope, the evidence must show how the power estimate was calibrated, what workload region it covers, what uncertainty remains, and what higher-fidelity result could overturn the claim.
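One simple way to make the calibration requirement concrete is to compare proxy power estimates against higher-fidelity measurements on held-out points, then inflate any new estimate by the worst miss observed. The function names and all data values below are hypothetical. The bound is empirical: it covers only the workload region the calibration points sample and is not a statistical guarantee.

```python
def worst_underprediction_ratio(predicted_w, measured_w):
    """Largest measured/predicted ratio seen on held-out calibration points."""
    if len(predicted_w) != len(measured_w) or not predicted_w:
        raise ValueError("need matching, non-empty calibration data")
    return max(m / p for p, m in zip(predicted_w, measured_w))


def supports_claim(estimate_w, envelope_w, ratio):
    """True only if the estimate, inflated by the worst observed miss, fits."""
    return estimate_w * ratio <= envelope_w


# Hypothetical calibration set: proxy estimates vs. higher-fidelity power, in watts.
predicted = [2.0, 2.5, 2.8]
measured = [2.1, 2.75, 2.94]
ratio = worst_underprediction_ratio(predicted, measured)

ENVELOPE_W = 3.0  # the lighthouse power envelope
print(supports_claim(2.8, ENVELOPE_W, ratio))  # nominally under 3 W
print(supports_claim(2.6, ENVELOPE_W, ratio))
```

In this example the proxy has under-predicted by as much as about 10 percent, so a 2.8 W estimate does not support a 3 W claim even though it sits nominally inside the envelope, while a 2.6 W estimate does.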
Benchmark governance also matters. MLPerf is useful not just because it names benchmarks, but because it defines rules, versions, and submission practices that make results more interpretable across a changing field (Mattson et al. 2020). Architecture 2.0 needs a similar instinct for design loops: define the evaluation contract, preserve provenance, track versions, and treat benchmark changes as part of the evidence story.
7.6 Security, IP, and Confidentiality Boundaries
Industry readers will not trust an Architecture 2.0 workflow that ignores security, intellectual property, and confidentiality. Architecture state is often sensitive: RTL, design specifications, process assumptions, timing constraints, floorplans, tool logs, customer workloads, proprietary traces, compiler settings, design reviews, and deployment telemetry can all reveal valuable or restricted information.
This has a direct technical consequence. Security boundaries are part of the environment and evidence design. A loop must define what data can leave an organization, what must remain local, what can be summarized, which artifacts can be shared publicly, which logs should be redacted, and which agents or tools can access each class of information. The trust question is not only whether the method is accurate. It is whether the workflow preserves the constraints under which architecture work actually happens.
For community infrastructure, this means the field should distinguish between public artifacts and private state. Public benchmarks, datasets, papers, and gym environments can help bootstrap shared progress. Private workloads, proprietary RTL, product traces, and process-specific assumptions often cannot be released. Architecture 2.0 should support both. The design-loop card, environment contract, and evidence checklist should let a project describe what kind of evidence exists without forcing disclosure of sensitive material.
The practical rule is simple: do not make confidentiality invisible. If a claim depends on private data, say what class of data supports it, what auditing is possible, what cannot be disclosed, and what public proxy would be insufficient. That is more honest than pretending every architecture loop can be reproduced from public web data.
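Making confidentiality visible can start with an explicit policy table that the loop consults before passing an artifact to a tool or agent. The artifact classes and sharing levels below are hypothetical examples. The sketch also simplifies by allowing every internal destination, whereas a real workflow would restrict which internal agents and tools can read each class.

```python
from enum import Enum


class Sharing(Enum):
    LOCAL_ONLY = "local_only"      # must remain inside the organization
    SUMMARY_ONLY = "summary_only"  # may leave only as a redacted summary
    PUBLIC = "public"              # may be shared as-is


# Hypothetical artifact classes; a real project would define its own.
POLICY = {
    "proprietary_rtl": Sharing.LOCAL_ONLY,
    "customer_workload_trace": Sharing.LOCAL_ONLY,
    "tool_log": Sharing.SUMMARY_ONLY,
    "public_benchmark_result": Sharing.PUBLIC,
}


def may_send(artifact_class, external, is_summary=False):
    """Decide whether an artifact may be passed to a destination."""
    if not external:
        return True  # simplification: internal destinations are always allowed
    # Unknown classes default to the most restrictive rule.
    rule = POLICY.get(artifact_class, Sharing.LOCAL_ONLY)
    if rule is Sharing.PUBLIC:
        return True
    if rule is Sharing.SUMMARY_ONLY:
        return is_summary
    return False


print(may_send("proprietary_rtl", external=True))
print(may_send("tool_log", external=True, is_summary=True))
```

Defaulting unlisted classes to the most restrictive rule means a new artifact type stays local until someone classifies it.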
7.7 Evidence Disputes and the Trust Checklist
Evidence disputes are inevitable. One group may claim that a learning-based method improves a design flow. Another may argue that the baseline, workload, constraint set, tool version, or evaluation protocol was incomplete. A company may have private evidence that cannot be released. A paper may report a strong result but omit negative traces. A benchmark may reward behavior that matters less in deployment. These disputes should not be treated as distractions from Architecture 2.0. They are part of the field learning how to assign trust.
The anatomy of an evidence dispute is stable:
- claim;
- proxy;
- fidelity level;
- assumptions;
- workload coverage;
- provenance;
- counterevidence;
- rejection rule;
- final human decision.
Learned chip placement is the field’s most public worked example of this anatomy. A 2021 result reported that a reinforcement-learning method produced floorplans competitive with human experts in far less time (Mirhoseini et al. 2021). Independent groups then disputed the claim on exactly the axes above: the baselines, the released code, and the reproducibility of the protocol (Cheng et al. 2023). The point here is not to adjudicate that dispute, which remains unsettled. The point is that the disagreement was never about whether the model ran; it was about provenance, baselines, and what evidence could reject the result. The constructive response has been to build reproducible, end-to-end benchmarks that score placement by final power, performance, and area rather than an intermediate proxy (Wang et al. 2025). That is the anatomy doing its work: a contested claim becomes tractable once the loop’s evidence, baselines, and rejection rules are made explicit.
Table 7.3 turns that anatomy into a checklist. It is intended for reading papers, reviewing internal tools, evaluating student projects, and deciding whether an agentic workflow is ready for a more expensive commitment.
| Question | What a credible answer contains | Warning sign |
|---|---|---|
| What is the claim? | A bounded architecture task, objective, workload, and commitment level. | Vague claims of automation or improvement. |
| What feedback supports it? | Metrics, tool outputs, logs, reviews, and negative traces tied to a feedback budget. | Only the winning score is shown. |
| What is the fidelity? | Proxy, simulation, synthesis, emulation, signoff, deployment, or silicon level stated explicitly. | Treating all measurements as equivalent. |
| What is the provenance? | Workload versions, tool versions, configs, seeds, constraints, assumptions, and baselines. | Hidden scripts or unstated defaults. |
| What can reject it? | Tests, formal checks, simulators, signoff rules, deployment signals, or expert review. | No independent authority can say no. |
| Who commits? | A named human or process accepts, rejects, escalates, or revises the artifact. | The loop silently turns output into decision. |
This checklist gives the book one of its practical tests. A paper, tool, or project does not need to solve the whole lighthouse prompt to be valuable. It does need to say where it sits in the loop, what evidence it produces, what it cannot prove, and what can reject it. That is how Architecture 2.0 can remain ambitious without becoming credulous.
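The warning-sign column of Table 7.3 can be sketched as a review function over a claim record. The record keys (`claim`, `feedback`, `fidelity`, `provenance`, `rejectors`, `owner`) and the example record are hypothetical. The function flags missing entries only and cannot judge whether the entries that are present are adequate.

```python
def warning_signs(record):
    """Return the Table 7.3 warning signs that apply to a claim record."""
    signs = []
    if not record.get("claim"):
        signs.append("vague claim")
    # No non-"ok" entry means no negative traces were preserved.
    feedback = record.get("feedback", [])
    if not any(f.get("status") != "ok" for f in feedback):
        signs.append("only the winning score is shown")
    if not record.get("fidelity"):
        signs.append("fidelity not stated")
    if not record.get("provenance"):
        signs.append("hidden scripts or unstated defaults")
    if not record.get("rejectors"):
        signs.append("no independent authority can say no")
    if not record.get("owner"):
        signs.append("the loop silently turns output into decision")
    return signs


# Hypothetical record: a bounded claim with provenance, but nothing that can reject it.
record = {
    "claim": "candidate meets the 3 W envelope on the stated workload",
    "feedback": [{"metric": "power_w", "value": 2.7, "status": "ok"}],
    "fidelity": "simulation",
    "provenance": {"simulator_version": "v1", "seed": 0},
    "rejectors": [],
    "owner": None,
}
for sign in warning_signs(record):
    print(sign)
```

An empty result means only that no warning sign was detected. The reviewer still has to judge whether the stated fidelity matches the commitment level.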
The next chapter puts this discipline to work. It runs one loop end to end on the lighthouse prompt: generating candidates, escalating from proxy to simulation, rejecting on the power envelope, and stopping at an honest commitment level. The chapter after that generalizes the single loop into the patterns that recur across the stack, where the same ontology applies but the task, representation, environment, method role, feedback budget, evidence burden, and commitment level all change.
Before believing a result, ask:
- What fidelity produced this, and does the evidence match the commitment level?
- What independent authority can reject it, and what would force escalation?
- Who is accountable for the commitment if the result turns out wrong?