10  What the Architect Owns

Author
Affiliation

Harvard John A. Paulson School of Engineering and Applied Sciences

Published

June 25, 2026

Tip: What this chapter gives you

After this chapter you can:

  • name the architectural responsibilities that cannot be delegated to a method;
  • answer the strongest objections to Architecture 2.0;
  • identify the community infrastructure the field still needs;
  • use the design-loop card as a teaching and review artifact.

The opening prompt asked for a low-power, 64-bit RISC-V-based compute subsystem for real-time mobile XR under a 3 W target in a 3 nm-class low-power mobile process. At the start of the lecture, that prompt was a provocation. It looked like a request for a future hardware foundation model. By this point, it should read differently. The prompt is not powerful because it is short. It is powerful because it exposes a missing design loop.

That loop includes workload definition, architecture representation, tool interfaces, compound method roles, feedback budgets, evidence chains, rejection gates, and human decisions. It also exposes the central conclusion of Architecture 2.0: AI systems do not eliminate the computer architect. They change what the architect must own.

The architect’s responsibility moves upward. Instead of owning every step of manual artifact construction, the architect owns the framing of the problem, the abstractions that make it tractable, the representations that make it legible, the evidence standards that make it believable, the rejection rules that make it safe to use, and the accountability boundary around the final decision. This is why Architecture 2.0 is not simply an automation story. It is a responsibility story.

10.1 Return to the Moonshot

Return to the lighthouse prompt:

Design a low-power, 64-bit RISC-V-based compute subsystem for an XRBench real-time mobile XR workload. Realize it as a vector-capable CPU, tightly coupled accelerator, or SoC block under a 3 W TDP target in a 3 nm-class low-power mobile process, and return a design-space report with evidence and rejected alternatives.

The prompt now has visible structure. The workload is not just “XR.” It is a workload distribution, a quality-of-experience target, a benchmark version, a set of traces, and a software stack. The contract is not just “64-bit RISC-V.” It includes ISA assumptions, vector behavior, memory ordering, compiler support, operating-system interfaces, and software compatibility. The architecture object is not just a processor. It might be a CPU, accelerator, SoC block, or compute subsystem with specific interfaces and integration constraints. The technology envelope is not just “3 W” or “3 nm-class.” It includes power, thermal behavior, process assumptions, physical design feasibility, verification burden, and deployment risk.

The requested deliverable also matters. A design-space report is different from RTL. RTL is different from a verified implementation. A verified implementation is different from a tapeout-ready design. Each step changes the evidence standard. The same prompt can support a brainstorming loop, a research prototype, a simulator-backed design-space exploration, or a high-commitment implementation workflow. Treating all of those deliverables as the same is one of the fastest ways to overclaim.

The lecture’s framework turns the prompt into a set of explicit questions: What task is being solved? What representation is available? What world model does the loop assume? What tools can the system act on? What method roles are allowed? What feedback is affordable? What evidence would make the result credible? What alternatives were rejected? Who can stop the loop? Who is accountable for the decision?
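The ten questions above can be held in a small record so that unanswered ones are visible at review time. The sketch below is illustrative only: the class name `DesignLoopCard`, its field names, and the helper `open_questions` are assumptions derived from the questions in this section, not the schema of the card in the appendix, and the example values are placeholders.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DesignLoopCard:
    """Hypothetical record: one field per question in the text above."""
    task: Optional[str] = None                   # What task is being solved?
    representation: Optional[str] = None         # What representation is available?
    world_model: Optional[str] = None            # What world model does the loop assume?
    tools: Optional[str] = None                  # What tools can the system act on?
    method_roles: Optional[str] = None           # What method roles are allowed?
    feedback_budget: Optional[str] = None        # What feedback is affordable?
    evidence: Optional[str] = None               # What evidence would make the result credible?
    rejected_alternatives: Optional[str] = None  # What alternatives were rejected?
    stop_authority: Optional[str] = None         # Who can stop the loop?
    accountable_owner: Optional[str] = None      # Who is accountable for the decision?


def open_questions(card: DesignLoopCard) -> List[str]:
    """Return the names of fields that are still blank or whitespace-only."""
    return [f.name for f in fields(card) if not (getattr(card, f.name) or "").strip()]


# Placeholder values for illustration only.
card = DesignLoopCard(
    task="XRBench real-time mobile XR under a 3 W TDP target",
    representation="parameterized RISC-V vector CPU configuration",
)
print(open_questions(card))
```

A card with open questions is not invalid; it simply tells the reviewer which parts of the loop are still implicit.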

That is the shift from prompt to loop. The prompt motivates the work. The loop makes the work inspectable.

10.2 Nondelegable Architectural Responsibilities

Nondelegable does not mean unaided. An architect can use models, agents, search procedures, simulators, compilers, profilers, EDA tools, benchmarks, and critics. The point is narrower and more important: the architect cannot transfer responsibility for the architectural judgment itself into a model.

Figure 10.1 separates assistable loop work from the accountability boundary. Agents may help represent state, generate candidates, evaluate proxies, call tools, critique results, summarize evidence, and preserve provenance. Those are substantial contributions. But they do not decide what problem matters, which abstraction is legitimate, what evidence is enough, which failure is acceptable, when to reject a result, or who answers for the consequences.

Human accountability boundary. The human accountability boundary is the line between work a loop may assist and commitments the architect still owns: intent, abstraction, evidence standards, rejection authority, deployment risk, and responsibility for consequences.

Figure 10.1: The architect owns the boundary of the loop: AI systems can assist many operations inside an Architecture 2.0 loop, but intent, abstraction, representation, evidence standards, rejection authority, accountability, and field-building remain architect-owned responsibilities.

Table 10.1 turns this claim into a practical review object. It is not meant to romanticize human judgment. Human judgment can be biased, inconsistent, and incomplete. The reason it remains central is that architecture decisions bind technical artifacts to organizational, economic, ethical, and deployment consequences. A model can help reason about those consequences, but it does not own them.

Table 10.1: Architect-owned responsibilities become explicit loop obligations: The architect must define intent, abstraction, constraints, evidence standards, rejection authority, escalation rules, and final commitment even when methods automate pieces of the loop.
Responsibility | Why it cannot be delegated | How AI can assist
Intent and constraints | The loop must serve a real architectural objective, not merely an available benchmark or proxy. | Elicit missing constraints, surface conflicts, and compare formulations.
Abstraction and representation | The encoded state determines what the loop can see, optimize, ignore, or falsely simplify. | Translate artifacts, organize traces, find gaps, and suggest structured schemas.
Evidence standard | A result is useful only if the evidence matches the commitment level and cost of being wrong. | Build evidence packets, track provenance, estimate uncertainty, and summarize rejected runs.
Escalation thresholds | The moment when proxy evidence is no longer enough depends on risk, reversibility, blast radius, and organizational context. | Detect threshold crossings, surface missing evidence, and route decisions to review.
Rejection and commitment | Someone must decide when a candidate is invalid, too risky, insufficiently supported, or ready to use. | Critique assumptions, flag rule violations, and compare alternatives.
Accountability and boundaries | Architecture choices affect users, teams, IP, security, cost, reliability, and long-lived systems. | Maintain audit trails, identify policy conflicts, and make tradeoffs explicit.
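The evidence-standard and escalation rows can be read as a policy the architect writes and the loop merely enforces. The following is a minimal sketch under stated assumptions: the four commitment levels come from the deliverables named earlier in the chapter, but the fidelity rungs, the `REQUIRED` mapping, and the function `review_action` are hypothetical and would differ from project to project.

```python
from enum import Enum, IntEnum


class Commitment(Enum):
    """Deliverables named in the chapter, from lowest to highest commitment."""
    DESIGN_SPACE_REPORT = "design-space report"
    RTL = "RTL"
    VERIFIED_IMPLEMENTATION = "verified implementation"
    TAPEOUT_READY = "tapeout-ready design"


class Fidelity(IntEnum):
    """Hypothetical evidence rungs; a real ladder is project-specific."""
    ANALYTICAL_MODEL = 0
    CYCLE_SIMULATION = 1
    RTL_SIMULATION = 2
    SIGNOFF_FLOW = 3


# Architect-owned policy (assumed values): minimum evidence per commitment.
REQUIRED = {
    Commitment.DESIGN_SPACE_REPORT: Fidelity.ANALYTICAL_MODEL,
    Commitment.RTL: Fidelity.CYCLE_SIMULATION,
    Commitment.VERIFIED_IMPLEMENTATION: Fidelity.RTL_SIMULATION,
    Commitment.TAPEOUT_READY: Fidelity.SIGNOFF_FLOW,
}


def review_action(commitment: Commitment, evidence: Fidelity) -> str:
    """Detect a threshold crossing; the decision itself stays with a person."""
    required = REQUIRED[commitment]
    if evidence >= required:
        return "route to architect for sign-off"
    return f"escalate: need {required.name}, have {evidence.name}"


print(review_action(Commitment.TAPEOUT_READY, Fidelity.RTL_SIMULATION))
print(review_action(Commitment.DESIGN_SPACE_REPORT, Fidelity.CYCLE_SIMULATION))
```

Note that neither branch commits anything. The function only reports whether the evidence meets the stated bar, which is the assistive role the table assigns to the loop.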

This boundary also clarifies the word agentic. The book is not arguing that every architecture workflow should become autonomous. Agentic systems are useful because they can act inside represented loops: call tools, maintain state, revise plans, use feedback, and coordinate method roles. But action is not ownership. As loops become more capable, the architect’s responsibility is not reduced; it becomes more explicit.

10.3 The Strongest Objections

A field-defining claim should survive its sharpest critics, so it is worth stating the strongest objections plainly.

The first objection is that Architecture 2.0 is just good experimental methodology with new names. Provenance, baselines, rejection criteria, and held-out validation are not new; careful architects have always done them. The response concedes the lineage and locates the difference. What changes is who acts inside the loop and how fast. When a human runs the loop, tacit judgment supplies the missing rigor: the architect remembers why a workload was excluded or distrusts a proxy on instinct. When a method runs the loop at machine speed, that tacit layer is gone, so the discipline must be made explicit and machine-readable. Architecture 2.0 is the claim that the methodology must become an engineered object, not a craft, once agents participate. The book borrows established forms deliberately: the evidence chain is an assurance case (Kelly and Weaver 2004), a discipline already mandatory in safety-critical domains such as automotive functional safety, where a safety case must argue that a design meets its requirements before it ships (International Organization for Standardization 2018); the fidelity ladder is multi-fidelity modeling (Peherstorfer et al. 2018). Naming the lineage is the point, not a weakness.

Kelly, Tim, and Rob Weaver. 2004. “The Goal Structuring Notation: A Safety Argument Notation.” Workshop on Assurance Cases, International Conference on Dependable Systems and Networks (DSN).
International Organization for Standardization. 2018. ISO 26262: Road Vehicles — Functional Safety. 2nd ed. ISO.
Peherstorfer, Benjamin, Karen Willcox, and Max Gunzburger. 2018. “Survey of Multifidelity Methods in Uncertainty Propagation, Inference, and Optimization.” SIAM Review 60 (3): 550–91. https://doi.org/10.1137/16M1082469.

The second objection is that AI for systems design is overhyped, and the most cited result, learned chip placement, is contested. That is true, and this book treats it as evidence rather than embarrassment. The placement dispute in Chapter 7 is exactly why the field needs reproducible, end-to-end benchmarks and explicit rejection authority. An honest framework predicts that some flagship results will not survive scrutiny; its value is a standard that makes the difference visible.

The third objection is that the design-loop card is process bureaucracy that will not survive a deadline. The card is not a mandatory form; it is a diagnostic. A team under deadline can complete it in a few minutes and learn whether its result rests on a proxy nobody validated or a baseline nobody documented. The cost of skipping that check is paid later, at higher commitment, where it is most expensive. The card earns its place only when it is shorter than the mistake it prevents.

None of these objections is fatal, but each sharpens the claim. Architecture 2.0 is not the assertion that AI will design chips. It is the narrower, more defensible claim that the design loop must become explicit enough to act on, judge, and trust.

10.4 Community Infrastructure for Architecture 2.0

Individual practice is not enough to move a field. A field moves when work can be compared, reproduced, criticized, taught, and extended. Architecture 2.0 therefore needs community infrastructure, not only better private workflows.

The infrastructure does not have to expose proprietary designs or internal company data. It can start with shared conventions: design-loop cards, environment schemas, provenance records, benchmark versions, negative traces, source packets, tool-interface descriptions, and review rubrics. These are small artifacts, but they change what the community can ask of a claim. They let reviewers ask not only “What result did you get?” but also “What loop produced it? What feedback did it use? What did it reject? What evidence supports the decision?”

Note: Research question

What is the smallest public evidence record that would let two Architecture 2.0 loops be compared without exposing proprietary workloads, RTL, process assumptions, or internal design reviews?

Benchmarks show why this matters. MLPerf is useful not only as a list of machine-learning performance tasks, but as an evolving benchmark ecosystem with rules, versions, submissions, and governance (Mattson et al. 2020). Architecture 2.0 needs a similar instinct for loops. ArchGym shows one way to make architecture design-space exploration more environment-like by defining interfaces between agents and architecture tools (Krishnan et al. 2023). QuArch shows one way to start from the paper corpus and ask whether models can answer and reason about architecture questions (Prakash et al. 2025b, 2025a). None of these is the whole field. Each is a partial infrastructure move.

Mattson, Peter, Hanlin Tang, Gu-Yeon Wei, et al. 2020. “MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance.” IEEE Micro 40 (2): 8–16. https://doi.org/10.1109/MM.2020.2974843.
Krishnan, Srivatsan et al. 2023. “ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design.” Proceedings of the 50th Annual International Symposium on Computer Architecture, ISCA ’23. https://doi.org/10.1145/3579371.3589049.
Prakash, Shvetank et al. 2025b. “QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture.” IEEE Computer Architecture Letters, ahead of print. https://doi.org/10.1109/LCA.2025.3541961.
Prakash, Shvetank et al. 2025a. QuArch: A Benchmark for Evaluating LLM Reasoning in Computer Architecture. https://arxiv.org/abs/2510.22087.
Chai, Zhuomin, Yuxiang Zhao, Yibo Lin, Wei Liu, Runsheng Wang, and Ru Huang. 2022. “CircuitNet: An Open-Source Dataset for Machine Learning Applications in Electronic Design Automation (EDA).” Science China Information Sciences, ahead of print. https://doi.org/10.1007/s11432-022-3571-8.
Wang, Zhihai et al. 2025. “Benchmarking End-to-End Performance of AI-Based Chip Placement Algorithms.” Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2407.15026.

Shared data and honest evaluation are starting to appear alongside these environments. CircuitNet assembles an open dataset of design-flow samples for machine learning in electronic design automation, directly attacking the field’s shortage of reproducible training and benchmarking data (Chai et al. 2022). ChiPBench attacks the evaluation problem: it scores AI placement methods by their effect on final power, performance, and area, and reports that strong intermediate proxy metrics often fail to translate into better final designs (Wang et al. 2025). That finding is the community’s version of this book’s proxy-mismatch warning, and it shows why measuring progress in Architecture 2.0 means scoring the loop’s end-to-end evidence rather than a surrogate.

The missing pieces are just as important. Architecture work rarely preserves negative traces: failed runs, rejected candidates, invalid configurations, stale benchmarks, bad proxies, tool errors, and ideas that were ruled out by expert judgment. Yet these traces are exactly what an agentic design loop needs to learn from the field’s failures rather than rediscover them. A community that records only successful artifacts teaches future systems a distorted view of architecture practice.
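One way to make negative traces preservable is to give them a fixed, small record shape that can be appended to a log. The sketch below assumes a hypothetical schema: the `TraceKind` categories mirror the list in the paragraph above, while the field names, the JSON Lines encoding, and the example entries are illustrative choices rather than an established community format.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Dict, List


class TraceKind(Enum):
    """Categories taken from the list of negative traces in the text."""
    FAILED_RUN = "failed_run"
    REJECTED_CANDIDATE = "rejected_candidate"
    INVALID_CONFIGURATION = "invalid_configuration"
    STALE_BENCHMARK = "stale_benchmark"
    BAD_PROXY = "bad_proxy"
    TOOL_ERROR = "tool_error"
    EXPERT_REJECTION = "expert_rejection"


@dataclass(frozen=True)
class NegativeTrace:
    """Hypothetical record for one thing the loop tried and did not keep."""
    kind: TraceKind
    candidate_id: str
    reason: str
    tool: str
    tool_version: str

    def to_record(self) -> dict:
        record = asdict(self)
        record["kind"] = self.kind.value  # store the plain string, not the enum
        return record


def to_jsonl(traces: List[NegativeTrace]) -> str:
    """Serialize traces as one JSON object per line."""
    return "\n".join(json.dumps(t.to_record(), sort_keys=True) for t in traces)


def count_by_kind(traces: List[NegativeTrace]) -> Dict[str, int]:
    """Summarize how often each kind of failure occurred."""
    return dict(Counter(t.kind.value for t in traces))


# Illustrative entries only; identifiers and versions are placeholders.
traces = [
    NegativeTrace(TraceKind.BAD_PROXY, "cand-017", "proxy win vanished at higher fidelity", "sim", "1.0"),
    NegativeTrace(TraceKind.TOOL_ERROR, "cand-018", "tool exited before producing a report", "sim", "1.0"),
]
print(to_jsonl(traces))
print(count_by_kind(traces))
```

The `reason` field matters most: a trace that records only that a candidate failed, and not why, gives a later loop little to learn from.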

Community infrastructure should also respect privacy and IP boundaries. The goal is not to force every organization to publish internal traces. The goal is to build schemas, examples, synthetic tasks, open benchmarks, redacted records, and teaching artifacts that make credible loop design discussable. That is how Architecture 2.0 can become a research area rather than a set of private demos.

10.5 Long-Horizon Challenge Tasks

Short demonstrations are useful for tool development, but they are too small to define the field. A model that writes a plausible RTL fragment, proposes a cache configuration, or summarizes a paper may be helpful without changing the architecture loop. The harder question is whether an AI-mediated system can participate in architecture work over the horizon on which architecture decisions actually mature: days, weeks, and months that span tool versions, workload updates, rejected alternatives, and design reviews.

Long-horizon architecture task. A long-horizon architecture task is a challenge in which a method or agent must maintain design state across multiple steps, act through valid tools or interfaces, gather feedback at appropriate fidelities, preserve rejected alternatives, expose uncertainty, and support a human architectural commitment over an extended design interval.

This framing changes what the community should ask for. The canonical challenge is not prompt-to-chip. It is prompt-to-loop: can a system preserve enough state, evidence, and rejection history that an architect can trust the next commitment? Table 10.2 sketches a starting set of tasks. They are deliberately stated as loop challenges rather than as single benchmark scores.

Table 10.2: Long-horizon challenges should test architecture loops, not single prompts: The field needs tasks that reward memory, valid action, feedback, rejection, evidence, and human commitment across time.
Challenge task | What makes it architectural | Success signal
Design-loop memory | The loop must preserve candidates, assumptions, tool outputs, failures, and decisions across a multi-step project. | A reviewer can reconstruct why a candidate advanced, changed, or was rejected.
Workload drift tracking | The workload is a moving object: benchmark version, model, software stack, trace mix, and deployment scenario can all change. | The loop detects when prior conclusions no longer support the current architectural claim.
Evidence-aware generation | Candidate generation is useful only when paired with the cheapest evidence path that can reject or advance the candidate. | The loop proposes both designs and the next evidence needed to trust or reject them.
Paper-to-loop reproduction | Architecture papers often omit scripts, traces, tool settings, negative results, or precise assumptions. | The system reconstructs the experiment loop, identifies missing artifacts, and states what would be needed to reproduce or falsify the claim.
Simulator trust calibration | Fast proxies are necessary but can mislead when the workload, model, or design point moves outside the calibrated region. | The loop knows when to trust a proxy, when to escalate fidelity, and when to invalidate a result.
Cross-stack co-design | Real wins often require coordinated changes across workload, compiler, mapping, memory, accelerator, runtime, and deployment policy. | The loop changes multiple layers while preserving valid interfaces and exposing which layer owns each assumption.
Negative-trace corpus | Failed mappings, invalid RTL, bad floorplans, stale benchmarks, and misleading proxy wins are architecture data. | Rejected alternatives become reusable evidence rather than disappearing from the record.
Design-review assistant | The highest-value output may be a review packet, not a design artifact. | The system prepares assumptions, risks, missing evidence, sensitivity checks, rejected alternatives, and escalation questions for human judgment.
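The simulator-trust and workload-drift rows share a mechanism: a proxy result is usable only inside the region where the proxy was calibrated. The sketch below shows one simple way to encode that, assuming a box-shaped calibrated region. The names `CalibratedRegion` and `proxy_decision`, the parameter names, and the version strings are all hypothetical, and a real calibration region would rarely be this simple.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CalibratedRegion:
    """Where a fast proxy was validated against a higher-fidelity reference."""
    workload_version: str
    bounds: Dict[str, Tuple[float, float]]  # parameter name -> (low, high), inclusive


def proxy_decision(region: CalibratedRegion,
                   workload_version: str,
                   design_point: Dict[str, float]) -> str:
    """Return 'trust', 'escalate', or 'invalidate' for a proxy result."""
    if workload_version != region.workload_version:
        # Workload drift: the calibration no longer covers this workload.
        return "invalidate"
    for name, value in design_point.items():
        if name not in region.bounds:
            return "escalate"  # parameter was never calibrated
        low, high = region.bounds[name]
        if not (low <= value <= high):
            return "escalate"  # design point left the calibrated box
    return "trust"


# Placeholder parameters and ranges for illustration only.
region = CalibratedRegion(
    workload_version="workload-v1",
    bounds={"l2_kib": (256, 2048), "vector_lanes": (2, 8)},
)
print(proxy_decision(region, "workload-v1", {"l2_kib": 512, "vector_lanes": 4}))
print(proxy_decision(region, "workload-v1", {"l2_kib": 4096, "vector_lanes": 4}))
print(proxy_decision(region, "workload-v2", {"l2_kib": 512, "vector_lanes": 4}))
```

The three outcomes map onto the table's success signal: trust the proxy, escalate fidelity, or invalidate the earlier result.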

These challenges also give Architecture 2.0 a way to remain architecture-centric. A generic AI benchmark can reward answer fluency. A long-horizon architecture challenge should reward state, interfaces, feedback economics, evidence quality, and rejection. That is where the computer architecture community has something specific to contribute.

10.6 From Capability to Standard

The path from a local capability to a field standard is gradual. A research group may first build a tool wrapper for its own simulator. A lab may then create an internal leaderboard. A workshop may define a shared task. A community may agree on benchmark versions, submission rules, provenance requirements, and reporting templates. Eventually, the field may decide that certain claims are not credible unless they expose the loop that produced them.

That progression should not be rushed. Premature standardization can freeze weak tasks, reward narrow metrics, and encourage benchmark gaming. But the opposite failure is also real: if every project defines its own task, wrapper, metric, and evidence standard, the field cannot accumulate knowledge. The right target is not a single universal benchmark. It is a family of interfaces, cards, tasks, and evidence conventions that make claims comparable without pretending all architecture work is the same.

A credible standard should make five things visible. First, the task should be replayable: another group can understand what was attempted and repeat it. Second, the result should be comparable: the metrics, workloads, and constraints have enough common structure to support interpretation. Third, the artifact should be versioned: benchmark, tool, model, and dataset versions are part of the claim. Fourth, the evidence should be auditable: assumptions, provenance, and rejected alternatives can be inspected. Fifth, the evaluation should resist leakage: a method should not succeed merely because it saw the benchmark, the answer, or the hidden rule during training.
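The five properties can be checked mechanically against a submission record, at least at the level of "is the information present at all." This is a rough sketch, not a proposed standard: the `Submission` fields, the `REQUIRED_VERSIONS` tuple, and the pass conditions in `missing_properties` are assumptions, and a presence check says nothing about whether the recorded information is correct.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Versions the text names as part of the claim.
REQUIRED_VERSIONS = ("benchmark", "tool", "model", "dataset")


@dataclass
class Submission:
    """Hypothetical record accompanying a claimed result."""
    task_description: str = ""
    metrics: Dict[str, float] = field(default_factory=dict)
    versions: Dict[str, str] = field(default_factory=dict)
    provenance: List[str] = field(default_factory=list)
    rejected_alternatives: List[str] = field(default_factory=list)
    held_out_evaluation: bool = False


def missing_properties(s: Submission) -> List[str]:
    """List which of the five properties the record fails to make visible."""
    checks = {
        "replayable": bool(s.task_description.strip()),
        "comparable": bool(s.metrics),
        "versioned": all(s.versions.get(k) for k in REQUIRED_VERSIONS),
        "auditable": bool(s.provenance and s.rejected_alternatives),
        "leakage-resistant": s.held_out_evaluation,
    }
    return [name for name, ok in checks.items() if not ok]


# Placeholder values for illustration only.
submission = Submission(
    task_description="cache sizing sweep for a fixed workload trace",
    metrics={"latency_ms": 11.0},
    versions={"benchmark": "v1", "tool": "v1"},
)
print(missing_properties(submission))
```

A reviewer could run such a check before reading the results, in the same spirit as the design-loop card: it flags what the claim leaves implicit.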

Architecture 2.0 will likely mature unevenly. Fast software loops may standardize earlier than RTL and physical-design loops. Workload and benchmark loops may become more public than industrial signoff loops. That unevenness is acceptable. What matters is that the field learns how to name the loop and state its evidence burden.

10.7 Education for Loop Designers

If Architecture 2.0 changes practice, it should also change education. Future architects still need the classical foundations: ISA, microarchitecture, memory systems, interconnects, compilers, operating systems, parallelism, power, reliability, and quantitative evaluation (Hennessy and Patterson 2017). They also need a new form of literacy: how to design, inspect, and govern architecture design loops.

Hennessy, John L., and David A. Patterson. 2017. Computer Architecture: A Quantitative Approach. 6th ed. Morgan Kaufmann.

This is not a replacement curriculum. It is an added layer. Students should learn how to describe a design-space-exploration problem as a loop. They should learn to distinguish feedback from evidence, a proxy from an objective, a benchmark score from a deployment claim, and a generated candidate from a defensible decision. They should learn to ask what an agent can act on, what it cannot see, what it might optimize incorrectly, what was rejected, and what human decision remains.

The design-loop card is the simplest classroom artifact. A paper discussion can ask students to fill out the card. A project proposal can require the card before implementation. A review exercise can compare two papers not only by results, but by loop quality: task clarity, representation coverage, environment validity, feedback cost, evidence chain, negative traces, rejection authority, and human decision. This makes Architecture 2.0 teachable without turning a course into a tool tutorial.

The broader educational goal is taste under automation. Students should learn when to trust a tool, when to instrument it, when to reject its output, when to escalate to higher fidelity, and when to step back because the task itself is wrong. Those are architectural skills.

10.8 The Architecture 3.0 Horizon

Architecture 3.0 is not the subject of this lecture, but it is useful to name the horizon carefully. If Architecture 2.0 is about designing the design loop, then Architecture 3.0 would begin when the loop itself becomes adaptive at community scale. Agents would not only generate candidates inside a fixed environment. They would help discover better representations, propose new tasks, improve tool interfaces, organize negative traces, calibrate evidence standards, and refine the community’s shared infrastructure.

The early signs are visible in industry, where electronic-design-automation vendors have begun packaging design assistants that chain tool steps with less human intervention between them. The durable way to read such systems is not by their feature lists, which will change, but as a shift in the partition of design autonomy: which decisions a human still makes, which the loop may make within stated bounds, and which the loop is never allowed to make alone. That partition, not the autonomy level a system claims to reach, is what the architect must keep designing (Janapa Reddi and Yazdanbakhsh 2025).

Janapa Reddi, Vijay, and Amir Yazdanbakhsh. 2025. “Architecture 2.0: Foundations of Artificial Intelligence Agents for Modern Computer System Design.” Computer 58 (2): 116–24. https://doi.org/10.1109/MC.2024.3521641.
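The partition of design autonomy described above can be written down as an explicit table rather than left implicit in a tool's behavior. The following sketch assumes three authority classes taken from the paragraph; the decision names in `PARTITION` and the `route` function are hypothetical examples, not a classification the text prescribes.

```python
from enum import Enum


class Authority(Enum):
    HUMAN_ONLY = "a human makes this decision"
    LOOP_WITHIN_BOUNDS = "the loop may decide within stated bounds"
    NEVER_LOOP_ALONE = "the loop may propose but never decide alone"


# Architect-owned partition (assumed entries for illustration).
PARTITION = {
    "set_power_target": Authority.HUMAN_ONLY,
    "sweep_cache_size": Authority.LOOP_WITHIN_BOUNDS,
    "advance_to_tapeout": Authority.NEVER_LOOP_ALONE,
}


def route(decision: str, within_bounds: bool) -> str:
    """Say who acts on a decision; unknown decisions default to a human."""
    authority = PARTITION.get(decision, Authority.HUMAN_ONLY)
    if authority is Authority.LOOP_WITHIN_BOUNDS:
        return "loop" if within_bounds else "human"
    if authority is Authority.NEVER_LOOP_ALONE:
        return "human reviews loop proposal"
    return "human"


print(route("sweep_cache_size", within_bounds=True))
print(route("sweep_cache_size", within_bounds=False))
print(route("advance_to_tapeout", within_bounds=True))
print(route("rename_clock_domain", within_bounds=True))
```

Defaulting unlisted decisions to a human is a deliberate choice in this sketch: autonomy is granted entry by entry rather than assumed.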

That horizon is plausible, but it should be treated with restraint. A loop that adapts itself can also adapt in the wrong direction. It can chase benchmarks, hide failures, overfit to available tools, or encode the biases of the traces it sees. The more adaptive the loop becomes, the more important the architect-owned boundary becomes.

The durable question is therefore not whether Architecture 3.0 will make architects unnecessary. The question is what form of judgment, accountability, and community governance is needed when the loop itself can change. That is why the final object in this lecture is not a prediction. It is a question of ownership.

10.9 The Architect’s Standing Obligation

The operational checklist already exists. The trust checklist in Chapter 7 and the design-loop card and rubric in Appendix B give it for a single claim and for a whole project, and this chapter does not reprint them. The closing point is narrower and harder to delegate: accountability. Every field on that card ultimately resolves to a person who answers for the commitment. The card makes the loop visible; the architect decides what the visible loop is allowed to do, and owns the consequences when it is wrong.

That bar is intentionally modest. It does not claim that every Architecture 2.0 project must solve every problem in the field. It asks for something more basic and more durable: make the loop visible, then keep a human accountable for it. Once the loop is visible, the community can critique it, improve it, teach it, compare it, and build on it.

That is the promise of Architecture 2.0. The field does not need to wait for a single model that designs a computer from a sentence. It can start by changing the unit of architectural practice: from isolated artifacts to represented, instrumented, evidence-bearing design loops. The architect still owns the judgment. The opportunity is to build loops worthy of that judgment.

References