7 Feedback, Verification, and Trust

Author: Vijay Janapa Reddi

Affiliation: Harvard John A. Paulson School of Engineering and Applied Sciences

Published: June 25, 2026

What this chapter gives you

After this chapter you can:

turn feedback into evidence through fidelity, provenance, and an evidence chain;
set escalation thresholds and commitment levels by reversibility and blast radius;
name what can reject a result, and why rejection authority must be independent;
review a claim with the trust checklist instead of by its tone.

Chapter 6 treated methods as roles inside a design loop. This chapter asks when the outputs of that loop should be believed. The answer is deliberately conservative: an Architecture 2.0 result is credible only when the feedback supporting it has been turned into evidence, the evidence can be audited, an independent authority can reject the result, and the level of human commitment matches the cost of being wrong.

This distinction is central to the lecture. A model can generate a plausible architecture description. A search method can find a strong proxy score. A surrogate can rank candidates. A tool-using agent can call simulators and summarize results. None of those actions creates trust by itself. Trust begins when the loop records what was measured, why it was relevant, how much it cost, what assumptions it used, where the feedback is weak, which alternatives failed, and what can say no.

The lighthouse prompt exposes the issue. A proposed low-power 64-bit RISC-V compute subsystem for XRBench under a 3 W, 3 nm-class low-power mobile envelope might look reasonable under an analytic proxy, promising under simulation, and broken under synthesis or timing. It might meet performance while missing an energy or thermal target. It might pass a benchmark but fail a real deployment scenario. Architecture 2.0 therefore needs an evidence discipline that is as explicit as its representations, environments, and methods.

7.1 Feedback Budget Ledger and Feedback Economics

The first trust problem is economic. Feedback is not free, uniform, or automatically useful. An architecture loop may have thousands of cheap proxy evaluations, hundreds of simulations, tens of synthesis or physical-design runs, a few emulation opportunities, and almost no chances to learn from silicon or deployment mistakes. It may also have scarce human review time, limited tool licenses, long queue delays, confidential workloads, and organizational deadlines. These limits shape which methods are appropriate and how much autonomy is acceptable.

This is why a loop needs a feedback budget ledger. The ledger is not accounting bureaucracy. It is the object that tells the method what kind of learning is possible. A Bayesian optimizer, reinforcement-learning policy, surrogate model, critic, or tool-using agent should behave differently when a feedback source takes milliseconds versus days, when a failed action is reversible versus costly, and when the signal is a rough proxy versus a signoff report. Table 7.1 gives the working form.

Feedback budget. A feedback budget records which evaluations, measurements, tool runs, human reviews, and deployment signals are available, what they cost, how long they take, how reversible they are, and what level of decision they can support.

Table 7.1: Feedback budgets make learning economics explicit: The ledger records what feedback is available, what it costs, what evidence it can support, and when scarce human attention or irreversible action should limit automation.

Budget item	What to record	Why it matters
Latency and cost	Runtime, queue time, dollar cost, tool hours, license limits, and human review time.	Determines whether the loop should search broadly, sample carefully, or mostly critique.
Signal quality	Fidelity level, metric definition, noise, determinism, coverage, and uncertainty.	Separates raw feedback from decision-grade evidence.
Sample budget	Number of possible runs at each fidelity, including failed runs and invalid candidates.	Forces sample-efficient methods and preserves negative traces.
Reversibility	Whether the action can be undone cheaply, re-run, patched, or rolled back.	Connects autonomy to risk. Reversible actions can tolerate weaker evidence than irreversible commitments.
Scarce attention	Expert review, debugging effort, validation bandwidth, security review, and integration time.	Prevents the loop from outsourcing cost to people whose time is the real bottleneck.

The ledger also changes what a result means. A method that finds a good point after 10,000 cheap proxy evaluations has learned something different from a method that selects three candidates for expensive synthesis. A loop that records failures, timeouts, warnings, rejected candidates, and review notes has more evidence than a loop that records only the winning score. This is the connection to sample efficiency from Chapter 6: sample efficiency is not only about using fewer evaluations. It is about making each evaluation carry more architectural information.
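A minimal sketch of one ledger row may make the point about negative traces concrete. The class name, field names, statuses, and numbers below are illustrative assumptions, not a format defined by this book; the only idea taken from the text is that failed and invalid runs consume budget and stay in the record.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackSource:
    """One row of a feedback budget ledger (all field names are hypothetical)."""
    name: str
    fidelity: str          # e.g. "proxy", "simulation", "synthesis"
    cost_per_run: float    # any consistent unit; here, tool-hours
    max_runs: int          # sample budget at this fidelity
    reversible: bool       # can the action be undone or re-run cheaply?
    runs: list = field(default_factory=list)  # every outcome, including failures

    def record(self, candidate: str, status: str, note: str = "") -> None:
        """Log a run. Failed and invalid runs still consume budget."""
        if len(self.runs) >= self.max_runs:
            raise RuntimeError(f"{self.name}: sample budget exhausted")
        self.runs.append({"candidate": candidate, "status": status, "note": note})

    def spent(self) -> float:
        """Total cost charged so far at this fidelity."""
        return len(self.runs) * self.cost_per_run

    def negative_traces(self) -> list:
        """Runs that did not succeed; these are evidence, not noise."""
        return [r for r in self.runs if r["status"] != "ok"]


# Hypothetical usage: three synthesis runs are affordable, two are used.
synthesis = FeedbackSource("synthesis", "synthesis", cost_per_run=6.0,
                           max_runs=3, reversible=True)
synthesis.record("cand-A", "ok")
synthesis.record("cand-B", "timing_miss", note="negative slack reported")
print(synthesis.spent(), len(synthesis.negative_traces()))
```

A loop that keeps only the winning score would discard the `cand-B` row; under this sketch it remains available to a reviewer and still counts against the budget.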

One way to make that discipline concrete is to write the feedback budget and sample value explicitly:

\[ B = \sum_i n_i c_i, \qquad V_i \approx \frac{\Delta \mathrm{Conf}(d \mid e_i)}{c_i}. \]

Here, \(n_i\) is the number of evaluations of feedback type \(i\), \(c_i\) is the cost of one such evaluation, \(B\) is the total feedback budget, \(e_i\) is the evidence produced by that evaluation, and \(\Delta \mathrm{Conf}(d \mid e_i)\) is the change in confidence for the decision \(d\). The equation is not a universal currency for evidence; it is a question the loop designer must answer before each stage: will another simulation, synthesis run, expert review, or deployment experiment change a decision enough to justify its cost? Chapter 8 makes the question concrete: it spends the proxy budget freely, spends the cycle-level budget carefully, and reserves a high-fidelity power check for the surviving candidate.
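The two expressions can be evaluated directly once a designer is willing to write down rough numbers. The sketch below does that with invented values: the run counts, costs in tool-hours, and per-run confidence gains are all assumptions chosen for illustration, and estimating \(\Delta \mathrm{Conf}\) in practice is a judgment call that this code does not solve.

```python
# Hypothetical inputs: n runs at cost c (tool-hours) and an estimated
# confidence gain per run. None of these values come from the chapter.
sources = {
    "proxy":      {"n": 10_000, "c": 0.001, "delta_conf": 0.0001},
    "simulation": {"n": 200,    "c": 0.5,   "delta_conf": 0.02},
    "synthesis":  {"n": 10,     "c": 8.0,   "delta_conf": 0.40},
}

# B = sum_i n_i * c_i
B = sum(s["n"] * s["c"] for s in sources.values())

# V_i ~= delta_conf_i / c_i : confidence gained per unit of cost
V = {name: s["delta_conf"] / s["c"] for name, s in sources.items()}

print(f"total budget B = {B:.1f} tool-hours")
for name, value in sorted(V.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<11} V = {value:.3f} confidence per tool-hour")
```

With these made-up numbers a single synthesis run is worth more per tool-hour than a single simulation even though it costs sixteen times as much, which is the kind of comparison the equation is meant to prompt rather than settle.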

7.2 Fidelity Ladders and Evidence Chains

Feedback becomes evidence only when it is tied to fidelity, provenance, uncertainty, and a decision. A simulator result is feedback. It becomes evidence when the workload, simulator version, configuration, random seed, assumptions, metric definition, failure status, and acceptance criterion are recorded. A synthesis report is feedback. It becomes evidence when the technology assumptions, constraints, tool versions, warnings, and comparison baseline are explicit.
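One way to enforce that distinction is a completeness check that refuses to call a measurement evidence until its provenance is filled in. The sketch below is an assumption about how such a check might look; the field names mirror the list in the paragraph above, while the dictionary layout and the example values are hypothetical.

```python
# Field names follow the list in the text; the record layout is an assumption.
REQUIRED_PROVENANCE = (
    "workload", "tool_version", "config", "seed",
    "assumptions", "metric_definition", "status", "acceptance_criterion",
)


def missing_provenance(record: dict) -> list:
    """Return the provenance fields that are absent or empty."""
    return [k for k in REQUIRED_PROVENANCE if record.get(k) in (None, "", [], {})]


def is_evidence(record: dict) -> bool:
    """A measurement counts as evidence only when nothing is missing."""
    return not missing_provenance(record)


# A bare simulator number: feedback, not yet evidence.
raw_feedback = {"metric_definition": "frame latency (ms)", "value": 9.0}

# The same number with its provenance attached (placeholder values).
evidence = {
    "workload": "XRBench, version pinned by the project",
    "tool_version": "simulator build recorded by the project",
    "config": "configs/cand_a.yaml",
    "seed": 0,
    "assumptions": ["warm caches"],
    "metric_definition": "frame latency (ms)",
    "status": "ok",
    "acceptance_criterion": "latency <= deadline",
    "value": 9.0,
}

print(is_evidence(raw_feedback), missing_provenance(raw_feedback))
print(is_evidence(evidence))
```

Note that a seed of `0` is treated as present; only absent or empty fields are flagged.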

Feedback regimes and evidence chain. Feedback regimes organize sources from cheap proxies to high-commitment checks. An evidence chain records how a claim moves across those regimes, preserving provenance, uncertainty, negative traces, rejection gates, and human decisions.

This idea is not new to engineering. Safety-critical fields formalize it as an assurance case: a structured argument that links a top-level claim to sub-claims, assumptions, and supporting evidence, often written in Goal Structuring Notation so that each inference and the evidence under it are explicit and reviewable (Kelly and Weaver 2004). An Architecture 2.0 evidence chain is an assurance case for a design decision: it states what is claimed, at what fidelity, on what evidence, and what could reject it. Naming the lineage helps, because that field already catalogs how such arguments fail: unstated assumptions, evidence that does not support the claimed scope, and confidence that outruns the proof.

Kelly, Tim, and Rob Weaver. 2004. “The Goal Structuring Notation: A Safety Argument Notation.” Workshop on Assurance Cases, International Conference on Dependable Systems and Networks (DSN).

Figure 7.1 shows the working model. A claim should move through a chain of increasingly costly feedback sources, with a rejection gate and evidence record at each stage.

Figure 7.1: Evidence chains turn feedback into trust: A claim becomes more credible only as it moves through staged feedback sources, records provenance and uncertainty, and gives each stage authority to reject, revise, or escalate the result before human commitment increases.

Verification is used broadly in this chapter. It includes formal methods when formal properties are available, but it also includes type checks, interface checks, regression tests, baseline replay, simulator cross-checks, synthesis constraints, physical-design warnings, security review, workload coverage, and expert design review. The common requirement is independence: the check should be able to reject the claim, not merely restate the method’s output.

The regime view is not a simple ranking from false to true. Higher fidelity is not automatically truth if the wrong workload, objective, constraints, or baseline were used. A detailed physical-design result can still answer the wrong architecture question. A deployment signal can still be confounded by a software change. A benchmark can still be too narrow. The purpose of the feedback-regime view is therefore not to worship expensive tools. It is to make the path from weak feedback to stronger evidence explicit.

For the lighthouse prompt, low-fidelity feedback may be useful for eliminating obviously infeasible vector widths, memory sizes, or accelerator partitions. Simulation may then test workload behavior and data movement. Synthesis or emulation may expose timing, area, or power problems. Deployment-like traces or silicon evidence may reveal workload drift or integration effects. At each stage, the loop should ask whether the earlier conclusion survived, changed, or should be rejected.
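The staged questioning described above can be sketched as a chain of gates, each able to stop the candidate and each leaving a trace entry. The stage names follow the text; the gate functions, thresholds other than the 3 W envelope, and the candidate's numbers are hypothetical.

```python
from typing import Callable, NamedTuple


class Stage(NamedTuple):
    name: str
    gate: Callable[[dict], bool]  # True if the candidate survives this stage
    question: str                 # what the stage is checking


def run_chain(candidate: dict, stages: list) -> list:
    """Walk the stages in order, log every result, stop at the first rejection."""
    trace = []
    for stage in stages:
        passed = stage.gate(candidate)
        trace.append({"stage": stage.name, "passed": passed,
                      "question": stage.question})
        if not passed:
            break
    return trace


# Hypothetical gates for a lighthouse-style candidate.
stages = [
    Stage("proxy", lambda c: c["proxy_power_w"] <= 3.0,
          "analytic power within the 3 W envelope?"),
    Stage("simulation", lambda c: c["sim_latency_ms"] <= c["deadline_ms"],
          "simulated latency within the deadline?"),
    Stage("synthesis", lambda c: c["timing_slack_ns"] >= 0.0,
          "timing met after synthesis?"),
]

candidate = {"proxy_power_w": 2.4, "sim_latency_ms": 9.0,
             "deadline_ms": 11.0, "timing_slack_ns": -0.12}

trace = run_chain(candidate, stages)
for row in trace:
    print(row)
```

Here the candidate looks reasonable under the proxy, survives simulation, and is rejected at synthesis. The trace keeps all three entries, so the earlier passes are visible as conclusions that did not survive higher fidelity.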

This framing is consistent with the quantitative tradition in computer architecture, where measurement, abstraction, and careful comparison are central (Hennessy and Patterson 2017). Architecture 2.0 adds a loop-level requirement: the evidence chain itself must be represented so that a compound method system and a human reviewer can inspect it.

Hennessy, John L., and David A. Patterson. 2017. Computer Architecture: A Quantitative Approach. 6th ed. Morgan Kaufmann.

7.3 Commitment Levels and Reversibility

Trust requirements should rise with commitment. A loop that generates a candidate simulator configuration can tolerate more automation than a loop that changes RTL, partitions a chiplet boundary, selects a package interface, or recommends a deployment policy. The difference is not whether AI is involved. The difference is rollback cost, blast radius, and who bears the consequence when the loop is wrong.

Table 7.2 gives a commitment-level view. The exact ordering will vary across organizations, but the pattern is stable: reversible exploration can use lighter evidence, while irreversible or high-blast-radius decisions require stronger evidence, independent rejection, and human ownership.

Escalation threshold. An escalation threshold is the stated condition under which a loop must stop relying on its current feedback source and move to stronger evidence, independent review, or explicit human approval.

The architect owns these thresholds because they depend on consequences, not only on model confidence. A proxy win may be enough to keep exploring. It is not enough to change a subsystem interface, waive a verification concern, or commit to a power claim. The loop should therefore state in advance which events force escalation: uncertainty outside the calibrated range, a benchmark coverage gap, a failed tool check, a security boundary, a high rollback cost, or a decision that would affect another team or product.
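Stating the triggers in advance can be as simple as a function the loop calls before acting on a result. The sketch below encodes the six events listed above; the key names and limits are assumptions, and in practice the architect, not the method, would set the policy values.

```python
def escalation_reasons(event: dict, policy: dict) -> list:
    """Return every pre-stated condition that this event triggers."""
    reasons = []
    if event["uncertainty"] > policy["max_calibrated_uncertainty"]:
        reasons.append("uncertainty outside the calibrated range")
    missing = policy["required_workloads"] - event["covered_workloads"]
    if missing:
        reasons.append("benchmark coverage gap: " + ", ".join(sorted(missing)))
    if event["failed_checks"]:
        reasons.append("failed tool check: " + ", ".join(event["failed_checks"]))
    if event["crosses_security_boundary"]:
        reasons.append("security boundary")
    if event["rollback_cost_hours"] > policy["max_rollback_cost_hours"]:
        reasons.append("high rollback cost")
    if event["affects_other_teams"]:
        reasons.append("affects another team or product")
    return reasons


# Hypothetical policy, written before the search starts.
policy = {
    "max_calibrated_uncertainty": 0.15,
    "required_workloads": {"hand_tracking", "eye_tracking"},
    "max_rollback_cost_hours": 8.0,
}

proxy_win = {
    "uncertainty": 0.10, "covered_workloads": {"hand_tracking", "eye_tracking"},
    "failed_checks": [], "crosses_security_boundary": False,
    "rollback_cost_hours": 0.5, "affects_other_teams": False,
}
interface_change = dict(proxy_win, uncertainty=0.30,
                        covered_workloads={"hand_tracking"},
                        rollback_cost_hours=40.0, affects_other_teams=True)

print(escalation_reasons(proxy_win, policy))         # no triggers: keep exploring
print(escalation_reasons(interface_change, policy))  # several triggers: escalate
```

The function returns reasons rather than a single boolean so that the escalation itself becomes part of the evidence record.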

Table 7.2: Commitment level should govern autonomy: Reversible exploration can tolerate lighter evidence, while interface changes, product policies, signoff decisions, and deployments require stronger evidence, independent rejection, and explicit human ownership.

Commitment	Example actions	Required discipline	Automation stance
Exploratory	Generate hypotheses, configs, candidate questions, or design cards.	Basic validity checks and provenance.	Broad assistance acceptable.
Experimental	Run simulator sweeps, tune compiler flags, select candidates for deeper study.	Workload versions, seeds, baseline, rejected candidates, and uncertainty.	Automation with review.
Implementation	Change RTL, generators, tool constraints, tests, or runtime interfaces.	Tool checks, regression tests, synthesis or integration evidence.	Bounded automation plus rejection gates.
Integration	Change subsystem interfaces, chiplet boundaries, memory contracts, or product-facing policies.	Cross-tool checks, compatibility, security, and explicit escalation.	Advisory or human-approved.
Irreversible	Mask-level choices, committed signoff decisions, fleet-wide rollouts, or customer-visible deployments.	Independent evidence chain, rejection authority, audit trail, and accountable human decision.	Human commitment dominates.
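Table 7.2 can be read as a small policy table that a loop consults before acting. The sketch below is one possible encoding, with assumptions stated plainly: the enum ordering, the function names, and the choice to require explicit human approval from the integration level upward are an interpretation of the table, not a rule the book prescribes.

```python
from enum import IntEnum


class Commitment(IntEnum):
    EXPLORATORY = 1
    EXPERIMENTAL = 2
    IMPLEMENTATION = 3
    INTEGRATION = 4
    IRREVERSIBLE = 5


# Automation stance per level, paraphrased from Table 7.2.
STANCE = {
    Commitment.EXPLORATORY: "broad assistance acceptable",
    Commitment.EXPERIMENTAL: "automation with review",
    Commitment.IMPLEMENTATION: "bounded automation plus rejection gates",
    Commitment.INTEGRATION: "advisory or human-approved",
    Commitment.IRREVERSIBLE: "human commitment dominates",
}


def needs_human_approval(level: Commitment) -> bool:
    """Assumed cut-off: integration and above need an explicit human decision."""
    return level >= Commitment.INTEGRATION


def may_proceed(level: Commitment, gates_passed: bool,
                human_approved: bool = False) -> bool:
    """An action proceeds only if its checks passed and, where required,
    a human has approved it."""
    if not gates_passed:
        return False
    if needs_human_approval(level):
        return human_approved
    return True


print(STANCE[Commitment.EXPERIMENTAL])
print(may_proceed(Commitment.EXPLORATORY, gates_passed=True))
print(may_proceed(Commitment.IRREVERSIBLE, gates_passed=True))
print(may_proceed(Commitment.IRREVERSIBLE, gates_passed=True, human_approved=True))
```

Because the exact ordering varies across organizations, the cut-off in `needs_human_approval` is the line most likely to change when adapting this sketch.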

This commitment structure keeps Architecture 2.0 from making a naive autonomy argument. Autonomy is not a virtue by itself. A loop may be highly automated for low-commitment exploration and deliberately conservative for high-commitment decisions. In fact, some of the most valuable near-term systems may not be systems that make final design choices. They may be systems that narrow a space, identify contradictions, preserve evidence, and prepare the human architect to make a better decision.

7.4 Rejection Authority

A credible loop needs something with authority to say no. The rejecting authority might be a type checker, parser, simulator, formal tool, regression test, synthesis flow, cross-tool comparison, signoff rule, deployment signal, security policy, benchmark governance rule, or expert reviewer. What matters is that rejection is part of the loop interface, not an afterthought.

Rejection authority. A credible Architecture 2.0 loop should name what can say no: at least one independent authority that can reject a candidate before the loop treats it as an architectural commitment.

Rejection authority has three parts. First, the loop must know which check is being applied. Second, the loop must know what happens after rejection. Third, the loop must record the rejection as evidence. A simulator crash, failed build, invalid constraint, timing miss, benchmark violation, or expert objection is not merely an inconvenience. It is information about the boundary of the design space.

A compact way to write the commitment rule is

\[ \mathrm{Commit}_k(x) = \begin{cases} 1, & \mathrm{valid}(x) \land E_k(x) \ge \tau_k \land \rho_k(x) \le \rho_{\max,k}, \\ 0, & \text{otherwise}. \end{cases} \]

Here, \(x\) is the candidate, \(k\) is the commitment level, \(E_k(x)\) is the evidence available at that level, \(\tau_k\) is the evidence threshold, \(\rho_k(x)\) is the residual risk, and \(\rho_{\max,k}\) is the risk the architect or organization is willing to tolerate. The thresholds are policy and judgment choices, not magic constants learned by the method. If validity, evidence, or residual risk fails, the loop should reject, revise, or escalate instead of silently turning output into commitment. Chapter 8 runs this rule on the lighthouse prompt: the accelerator fails validity at the 3 W envelope, the vector CPU fails it at the latency deadline, and only the SoC block clears every term at the experimental commitment level.
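The commitment rule translates almost line for line into code. In the sketch below the function mirrors the three terms of the rule; the evidence scores, risk values, and thresholds are invented for illustration, and only the pass or fail pattern across the three candidates follows the description of Chapter 8 given above.

```python
def commit(valid: bool, evidence: float, tau: float,
           risk: float, risk_max: float) -> int:
    """Commit_k(x): 1 only if validity, evidence, and residual risk all pass."""
    return int(valid and evidence >= tau and risk <= risk_max)


def failed_terms(valid: bool, evidence: float, tau: float,
                 risk: float, risk_max: float) -> list:
    """Name the terms that blocked commitment, so rejection is explainable."""
    failed = []
    if not valid:
        failed.append("validity")
    if evidence < tau:
        failed.append("evidence below threshold")
    if risk > risk_max:
        failed.append("residual risk above tolerance")
    return failed


# Hypothetical thresholds for the experimental commitment level.
TAU, RISK_MAX = 0.6, 0.3

# Hypothetical scores; validity reflects the envelope and deadline checks.
candidates = {
    "accelerator": dict(valid=False, evidence=0.8, risk=0.1),  # over 3 W
    "vector CPU":  dict(valid=False, evidence=0.7, risk=0.2),  # misses deadline
    "SoC block":   dict(valid=True,  evidence=0.7, risk=0.2),
}

for name, c in candidates.items():
    decision = commit(tau=TAU, risk_max=RISK_MAX, **c)
    print(name, decision, failed_terms(tau=TAU, risk_max=RISK_MAX, **c))
```

Returning the failed terms alongside the binary decision keeps a rejection from collapsing into an unexplained zero.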

The response to rejection should be explicit. A candidate may be discarded. A representation may need a new field. An environment may need a validity check. A method may need a smaller action space. A workload may need a better coverage definition. A claim may need to be weakened. A human architect may need to escalate the decision. Without this response path, rejection becomes a log message rather than a learning signal.
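One way to keep rejection from decaying into a log message is to make the response a required field of the rejection record. The sketch below does this with an enum whose members paraphrase the responses listed above; the class names and example entries are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    DISCARD_CANDIDATE = "discard the candidate"
    ADD_REPRESENTATION_FIELD = "add a field to the representation"
    ADD_VALIDITY_CHECK = "add a validity check to the environment"
    SHRINK_ACTION_SPACE = "reduce the method's action space"
    IMPROVE_WORKLOAD_COVERAGE = "tighten the workload coverage definition"
    WEAKEN_CLAIM = "weaken the claim"
    ESCALATE_TO_HUMAN = "escalate to the human architect"


@dataclass(frozen=True)
class Rejection:
    candidate: str
    check: str          # which authority said no
    reason: str
    response: Response  # what the loop does next; cannot be omitted

    def __post_init__(self):
        if not isinstance(self.response, Response):
            raise TypeError("a rejection must name a Response")


def boundary_summary(log: list) -> Counter:
    """Count rejections per check: a rough map of where the design space ends."""
    return Counter(r.check for r in log)


# Hypothetical rejection log.
log = [
    Rejection("cand-B", "synthesis", "timing miss", Response.DISCARD_CANDIDATE),
    Rejection("cand-C", "simulator", "crash on invalid cache size",
              Response.ADD_VALIDITY_CHECK),
    Rejection("cand-D", "synthesis", "timing miss", Response.SHRINK_ACTION_SPACE),
]
print(boundary_summary(log))
```

Two timing misses at synthesis in a short log is itself a signal: the proxy feeding that stage may not be charging for something synthesis sees.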

Rejection authority also protects against polished but unsupported outputs. Tool-using agents can generate convincing summaries, plots, design reports, and review notes. Those artifacts are useful only if the loop can still reject them. In architecture, a beautiful explanation cannot overrule a failed timing check, an invalid workload, a missing baseline, or a security boundary.

The independence requirement grows sharper as verification itself becomes AI-assisted. Production verification platforms now use machine learning to triage failures, rank likely root causes, and direct coverage (Cadence Design Systems 2022). Such tools are valuable, but they move part of the rejection authority into a learned system, which forces a recursive version of this chapter’s question: is the authority that can reject a design independent of the system that produced it, or has the loop quietly made the generator and its judge the same model? A rejection authority that shares the generator’s blind spots is not an independent gate. It is a second opinion from the same witness.
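A first-order independence check can be written as a comparison of declared dependencies between the generator and each candidate judge. This is only a sketch under strong assumptions: the dependency categories, the manifests, and every model or tool name below are hypothetical, and shared blind spots can exist even when no declared dependency overlaps.

```python
def shared_dependencies(generator: dict, judge: dict) -> dict:
    """Return the declared dependencies that generator and judge have in common."""
    overlap = {}
    for kind in ("models", "training_data", "tools"):
        common = set(generator.get(kind, [])) & set(judge.get(kind, []))
        if common:
            overlap[kind] = sorted(common)
    return overlap


def is_independent(generator: dict, judge: dict) -> bool:
    """Independent, by this narrow test, if nothing declared is shared."""
    return not shared_dependencies(generator, judge)


# Hypothetical manifests.
generator = {"models": ["design-llm-v1"], "training_data": ["internal-rtl-corpus"]}
learned_triage = {"models": ["design-llm-v1"], "training_data": ["regression-logs"]}
timing_check = {"models": [], "tools": ["static-timing-flow"]}

print(shared_dependencies(generator, learned_triage))
print(is_independent(generator, timing_check))
```

Passing this check is necessary rather than sufficient. It catches the case where the generator and its judge are literally the same model, which is the failure the paragraph above warns about.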

Cadence Design Systems. 2022. Cadence Revolutionizes Verification Productivity with the Verisium AI-Driven Verification Platform. Cadence press release. https://www.businesswire.com/news/home/20220913005378/en/Cadence-Revolutionizes-Verification-Productivity-with-the-Verisium-AI-Driven-Verification-Platform.

7.5 Proxy Mismatch, Metric Gaming, and Calibration

The central failure mode is proxy mismatch. A loop optimizes the measurement it can see, while the architect cares about a broader objective. IPC may improve while energy, area, or tail latency worsens. A simulator metric may improve while synthesis exposes timing or power problems. A benchmark result may improve while the real workload distribution changes. A Pareto frontier may look convincing because all points were evaluated under the same flawed proxy. An agent may appear capable because it overfits the evaluation loop, not because it understands the architecture problem.

This is not a new problem created by AI. Benchmarks, simulators, cost models, and design rules have always been approximations. What changes in Architecture 2.0 is the speed and persistence with which a method can exploit the approximation. A human may notice that a score is improving for the wrong reason. A search method may happily continue. A compound agent may even produce a persuasive narrative for a proxy win unless the loop asks for calibration and counterevidence.

Field note: the win that vanished at signoff

A configuration leads the design-space study for weeks on a cycle-level model: better IPC, lower modeled energy, a clean Pareto point. At synthesis the lead evaporates, because the model never charged for the timing and congestion the winning structure creates. The team did not pick a bad candidate; it trusted a proxy past the point the proxy could support. The cheap fix is a rule written before the search starts: a cycle-level win is a reason to escalate, never a reason to commit.

Calibration is therefore a trust requirement. Prediction models used in architecture have long needed validation against held-out data, higher-fidelity measurements, or carefully designed experiments (Lee and Brooks 2006; Ipek et al. 2006). The same principle applies to agentic loops. If a loop claims a candidate satisfies the 3 W lighthouse envelope, the evidence must show how the power estimate was calibrated, what workload region it covers, what uncertainty remains, and what higher-fidelity result could overturn the claim.
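A small numerical sketch shows what calibrating a power estimate might involve. The held-out pairs below are invented, and inflating the proxy by its worst observed relative error is one deliberately conservative choice among many; it is not the method used in the cited papers.

```python
from statistics import mean

# Hypothetical held-out pairs: (proxy estimate in W, higher-fidelity estimate in W).
held_out = [(2.1, 2.4), (2.8, 3.1), (1.9, 2.0), (2.5, 2.9)]

# Relative error of the proxy, measured against the higher-fidelity value.
rel_errors = [(high - proxy) / high for proxy, high in held_out]
bias = mean(rel_errors)                   # > 0 means the proxy underestimates
worst = max(abs(e) for e in rel_errors)   # largest observed miss


def supports_claim(proxy_power_w: float, envelope_w: float,
                   worst_rel_error: float) -> bool:
    """Conservative check: assume the proxy is as wrong as it has ever been.

    If the proxy underestimates by a fraction e of the true value,
    then true = proxy / (1 - e).
    """
    upper_bound_w = proxy_power_w / (1.0 - worst_rel_error)
    return upper_bound_w <= envelope_w


print(f"bias = {bias:.3f}, worst relative error = {worst:.3f}")
print(supports_claim(2.5, 3.0, worst))  # margin survives the worst observed error
print(supports_claim(2.7, 3.0, worst))  # margin does not survive: escalate
```

The check says nothing about workload regions outside the held-out set, so a passing result still needs the coverage and uncertainty statements described above.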

Lee, Benjamin C., and David M. Brooks. 2006. “Accurate and Efficient Regression Modeling for Microarchitectural Performance and Power Prediction.” Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, 185–94.

Ipek, Engin, Sally A. McKee, Bronis R. de Supinski, Martin Schulz, and Rich Caruana. 2006. “Efficiently Exploring Architectural Design Spaces via Predictive Modeling.” Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, 195–206. https://doi.org/10.1145/1168857.1168882.

Benchmark governance also matters. MLPerf is useful not just because it names benchmarks, but because it defines rules, versions, and submission practices that make results more interpretable across a changing field (Mattson et al. 2020). Architecture 2.0 needs a similar instinct for design loops: define the evaluation contract, preserve provenance, track versions, and treat benchmark changes as part of the evidence story.

Mattson, Peter, Hanlin Tang, Gu-Yeon Wei, et al. 2020. “MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance.” IEEE Micro 40 (2): 8–16. https://doi.org/10.1109/MM.2020.2974843.

7.6 Security, IP, and Confidentiality Boundaries

Industry readers will not trust an Architecture 2.0 workflow that ignores security, intellectual property, and confidentiality. Architecture state is often sensitive: RTL, design specifications, process assumptions, timing constraints, floorplans, tool logs, customer workloads, proprietary traces, compiler settings, design reviews, and deployment telemetry can all reveal valuable or restricted information.

This has a direct technical consequence. Security boundaries are part of the environment and evidence design. A loop must define what data can leave an organization, what must remain local, what can be summarized, which artifacts can be shared publicly, which logs should be redacted, and which agents or tools can access each class of information. The trust question is not only whether the method is accurate. It is whether the workflow preserves the constraints under which architecture work actually happens.

For community infrastructure, this means the field should distinguish between public artifacts and private state. Public benchmarks, datasets, papers, and gym environments can help bootstrap shared progress. Private workloads, proprietary RTL, product traces, and process-specific assumptions often cannot be released. Architecture 2.0 should support both. The design-loop card, environment contract, and evidence checklist should let a project describe what kind of evidence exists without forcing disclosure of sensitive material.

The practical rule is simple: do not make confidentiality invisible. If a claim depends on private data, say what class of data supports it, what auditing is possible, what cannot be disclosed, and what public proxy would be insufficient. That is more honest than pretending every architecture loop can be reproduced from public web data.
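Making confidentiality visible can start with an explicit, default-deny access table. The sketch below assumes three sensitivity classes and a handful of artifact and tool names; all of them are hypothetical, and a real organization would derive the classes from its own policy.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 0        # may be shared publicly
    SUMMARY_ONLY = 1  # only redacted summaries may leave the organization
    LOCAL_ONLY = 2    # must remain local


# Hypothetical classification of artifacts.
ARTIFACTS = {
    "public_benchmark": Sensitivity.PUBLIC,
    "tool_log": Sensitivity.SUMMARY_ONLY,
    "proprietary_rtl": Sensitivity.LOCAL_ONLY,
    "customer_trace": Sensitivity.LOCAL_ONLY,
}

# Hypothetical clearance of agents and tools.
CLEARANCE = {
    "hosted_agent": Sensitivity.PUBLIC,
    "on_prem_simulator": Sensitivity.LOCAL_ONLY,
}


def may_access(tool: str, artifact: str) -> bool:
    """Default deny: unknown tools and unclassified artifacts get no access."""
    if tool not in CLEARANCE or artifact not in ARTIFACTS:
        return False
    return CLEARANCE[tool] >= ARTIFACTS[artifact]


print(may_access("hosted_agent", "public_benchmark"))
print(may_access("hosted_agent", "proprietary_rtl"))
print(may_access("on_prem_simulator", "customer_trace"))
print(may_access("hosted_agent", "unclassified_floorplan"))
```

A claim that rests on `customer_trace` can then state its data class and who was cleared to read it, without disclosing the trace itself.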

7.7 Evidence Disputes and the Trust Checklist

Evidence disputes are inevitable. One group may claim that a learning-based method improves a design flow. Another may argue that the baseline, workload, constraint set, tool version, or evaluation protocol was incomplete. A company may have private evidence that cannot be released. A paper may report a strong result but omit negative traces. A benchmark may reward behavior that matters less in deployment. These disputes should not be treated as distractions from Architecture 2.0. They are part of the field learning how to assign trust.

The anatomy of an evidence dispute is stable:

claim; proxy; fidelity level; assumptions; workload coverage; provenance; counterevidence; rejection rule; and final human decision.

Learned chip placement is the field’s most public worked example of this anatomy. A 2021 result reported that a reinforcement-learning method produced floorplans competitive with human experts in far less time (Mirhoseini et al. 2021). Independent groups then disputed the claim on exactly the axes above: the baselines, the released code, and the reproducibility of the protocol (Cheng et al. 2023). The point here is not to adjudicate that dispute, which remains unsettled. The point is that the disagreement was never about whether the model ran; it was about provenance, baselines, and what evidence could reject the result. The constructive response has been to build reproducible, end-to-end benchmarks that score placement by final power, performance, and area rather than an intermediate proxy (Wang et al. 2025). That is the anatomy doing its work: a contested claim becomes tractable once the loop’s evidence, baselines, and rejection rules are made explicit.

Mirhoseini, Azalia, Anna Goldie, Mustafa Yazgan, et al. 2021. “A Graph Placement Methodology for Fast Chip Design.” Nature 594 (7862): 207–12. https://doi.org/10.1038/s41586-021-03544-w.

Cheng, Chung-Kuan, Andrew B. Kahng, et al. 2023. “Assessment of Reinforcement Learning for Macro Placement.” Proceedings of the 2023 International Symposium on Physical Design (ISPD). https://doi.org/10.1145/3569052.3578926.

Wang, Zhihai, et al. 2025. “Benchmarking End-to-End Performance of AI-Based Chip Placement Algorithms.” Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2407.15026.

Table 7.3 turns that anatomy into a checklist. It is intended for reading papers, reviewing internal tools, evaluating student projects, and deciding whether an agentic workflow is ready for a more expensive commitment.

Table 7.3: Trust is a checklist, not a tone judgment: A credible Architecture 2.0 claim states its task, feedback, fidelity, provenance, rejection authority, and human commitment before it asks the reader to believe the result.

Question	What a credible answer contains	Warning sign
What is the claim?	A bounded architecture task, objective, workload, and commitment level.	Vague claims of automation or improvement.
What feedback supports it?	Metrics, tool outputs, logs, reviews, and negative traces tied to a feedback budget.	Only the winning score is shown.
What is the fidelity?	Proxy, simulation, synthesis, emulation, signoff, deployment, or silicon level stated explicitly.	Treating all measurements as equivalent.
What is the provenance?	Workload versions, tool versions, configs, seeds, constraints, assumptions, and baselines.	Hidden scripts or unstated defaults.
What can reject it?	Tests, formal checks, simulators, signoff rules, deployment signals, or expert review.	No independent authority can say no.
Who commits?	A named human or process accepts, rejects, escalates, or revises the artifact.	The loop silently turns output into decision.
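The checklist in Table 7.3 is small enough to run as a function over a claim record. The sketch below pairs each question with its warning sign, paraphrased from the table; the record keys and the example claim are assumptions made for illustration.

```python
# (record key, question, warning sign) paraphrased from Table 7.3.
CHECKLIST = [
    ("claim", "What is the claim?", "vague claim of automation or improvement"),
    ("feedback", "What feedback supports it?", "only the winning score is shown"),
    ("fidelity", "What is the fidelity?", "all measurements treated as equivalent"),
    ("provenance", "What is the provenance?", "hidden scripts or unstated defaults"),
    ("rejection_authority", "What can reject it?", "no independent authority can say no"),
    ("committer", "Who commits?", "the loop silently turns output into decision"),
]


def review(claim: dict) -> list:
    """Return the warning sign for every question the claim leaves unanswered."""
    return [f"{question} -> {warning}"
            for key, question, warning in CHECKLIST
            if not claim.get(key)]


# Hypothetical claim record with two gaps.
claim = {
    "claim": "reduce modeled energy on one workload at the experimental level",
    "feedback": ["cycle-level sweep", "rejected candidates log"],
    "fidelity": "simulation",
    "provenance": {"seed": 0, "baseline": "in-house reference config"},
    "rejection_authority": "",
    "committer": None,
}

for warning in review(claim):
    print(warning)
```

The output depends only on which fields are filled in, not on how confident the accompanying prose sounds, which is the sense in which trust is a checklist and not a tone judgment.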

This checklist gives the book one of its practical tests. A paper, tool, or project does not need to solve the whole lighthouse prompt to be valuable. It does need to say where it sits in the loop, what evidence it produces, what it cannot prove, and what can reject it. That is how Architecture 2.0 can remain ambitious without becoming credulous.

The next chapter puts this discipline to work. It runs one loop end to end on the lighthouse prompt: generating candidates, escalating from proxy to simulation, rejecting on the power envelope, and stopping at an honest commitment level. The chapter after that generalizes the single loop into the patterns that recur across the stack, where the same ontology applies but the task, representation, environment, method role, feedback budget, evidence burden, and commitment level all change.

Architect’s checkpoint

Before believing a result, ask:

What fidelity produced this, and does the evidence match the commitment level?
What independent authority can reject it, and what would force escalation?
Who is accountable for the commitment if the result turns out wrong?