Appendix C — Resource Catalog for Architecture 2.0 Loops

Affiliation: Harvard John A. Paulson School of Engineering and Applied Sciences

Published: June 25, 2026

This catalog is not a directory of links and it is not an endorsement list. Links change, tools age, and benchmark versions move. The stable question is what role a resource plays in the design loop. Does it provide workload state? Does it define valid actions? Does it return feedback? Does it expose evidence that can reject a candidate? Does it preserve provenance or negative traces?

The specific examples named below are a snapshot. The durable content is the role each resource plays; the current list of tools, datasets, and benchmarks is the kind of fast-moving record that belongs with the community forming around this topic, where a companion edition can keep it current without reprinting the book.

Architecture 2.0 resource. A resource is useful for Architecture 2.0 when it makes some part of the design loop explicit: task, representation, environment, method role, feedback, evidence, rejection, or human decision.

Table C.1 gives a first-pass catalog. The examples are deliberately representative, not exhaustive. A reader should use the table to ask what is missing from a loop before adding another model or tool.

Table C.1: Useful resources should be classified by loop role. A dataset, benchmark, harness, simulator, compiler, or card is valuable only if the loop records what it can and cannot support.

| Resource family | Examples | Loop role | Watch for |
|---|---|---|---|
| Architecture corpora and QA | Paper/manual corpora, DBLP spines, QuArch-style QA and reasoning data (Prakash et al. 2025b, 2025a). | Bootstrap architecture vocabulary, concepts, and literature-grounded reasoning. | Paper text rarely preserves simulator state, failed candidates, tool logs, or review judgment. |
| Workloads and benchmarks | XRBench, MLPerf, and maintained benchmark suites (Kwon et al. 2023; Mattson et al. 2020; Reddi et al. 2020). | Define workload state, scenarios, metrics, rules, and comparability. | Coverage, drift, update policy, and proxy validity must remain visible. |
| Evaluation harnesses and environments | ArchGym-style environments, benchmark harnesses, simulator wrappers, and tool-calling APIs (Krishnan et al. 2023). | Define valid actions, observations, feedback cost, logging, and rejection behavior. | A wrapper can hide tool semantics, unsupported actions, nondeterminism, and failure modes. |
| Mapping and DSE frameworks | Timeloop and MAESTRO-style mapping/dataflow tools (Parashar et al. 2019; Kwon et al. 2019). | Make architecture search spaces and constraints explicit enough to explore. | Fast feedback is still a model; calibration, workload scope, and invalid candidates matter. |
| Compiler, autotuning, and codegen resources | AutoTVM, Ansor, MLIR, and kernel-generation benchmarks (Chen et al. 2018; Zheng et al. 2020; Lattner et al. 2020; Ouyang et al. 2025). | Connect specialized hardware ideas to executable software paths. | A kernel, schedule, or IR result is not automatically a system-level architecture result. |
| Evidence and provenance artifacts | Design-loop cards, source packets, seeds, configs, tool logs, calibration records, and negative traces. | Make claims auditable, reproducible, rejectable, and teachable. | These records are often uncodified, private, or discarded because they are not publication artifacts. |

Prakash, Shvetank, et al. 2025b. “QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture.” IEEE Computer Architecture Letters, ahead of print. https://doi.org/10.1109/LCA.2025.3541961.
Prakash, Shvetank, et al. 2025a. QuArch: A Benchmark for Evaluating LLM Reasoning in Computer Architecture. https://arxiv.org/abs/2510.22087.
Kwon, Hyoukjun, et al. 2023. “XRBench: An Extended Reality (XR) Machine Learning Benchmark Suite for the Metaverse.” Proceedings of Machine Learning and Systems.
Mattson, Peter, Hanlin Tang, Gu-Yeon Wei, et al. 2020. “MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance.” IEEE Micro 40 (2): 8–16. https://doi.org/10.1109/MM.2020.2974843.
Reddi, Vijay Janapa, Christine Cheng, David Kanter, et al. 2020. “MLPerf Inference Benchmark.” 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 446–59. https://doi.org/10.1109/ISCA45697.2020.00045.
Krishnan, Srivatsan, et al. 2023. “ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design.” Proceedings of the 50th Annual International Symposium on Computer Architecture, ISCA ’23. https://doi.org/10.1145/3579371.3589049.
Parashar, Angshuman, Priyanka Raina, Yakun Sophia Shao, et al. 2019. “Timeloop: A Systematic Approach to DNN Accelerator Evaluation.” 2019 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS, 304–15. https://doi.org/10.1109/ISPASS.2019.00042.
Kwon, Hyoukjun, Prasanth Chatarasi, Michael Pellauer, Angshuman Parashar, Vivek Sarkar, and Tushar Krishna. 2019. “Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows: A Data-Centric Approach.” Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’52, 754–68. https://doi.org/10.1145/3352460.3358252.
Chen, Tianqi, Lianmin Zheng, Eddie Yan, et al. 2018. “Learning to Optimize Tensor Programs.” Advances in Neural Information Processing Systems 31.
Zheng, Lianmin, et al. 2020. “Ansor: Generating High-Performance Tensor Programs for Deep Learning.” 14th USENIX Symposium on Operating Systems Design and Implementation, 863–79.
Lattner, Chris, Mehdi Amini, Uday Bondhugula, et al. 2020. “MLIR: A Compiler Infrastructure for the End of Moore’s Law.” arXiv Preprint arXiv:2002.11054. https://doi.org/10.48550/arXiv.2002.11054.
Ouyang, Anne, Simon Guo, Simran Arora, et al. 2025. “KernelBench: Can LLMs Write Efficient GPU Kernels?” arXiv Preprint arXiv:2502.10517. https://doi.org/10.48550/arXiv.2502.10517.

C.1 Use The Catalog As A Loop Checklist

The catalog is most useful as a checklist. For a new Architecture 2.0 project, choose one resource for each role:

  • a workload or benchmark that defines the task boundary;
  • a representation that records the state the loop can read and change;
  • an environment or harness that defines valid actions and observations;
  • a feedback source with an explicit latency, fidelity, and cost model;
  • an evidence record that preserves configurations, assumptions, and negative traces;
  • a rejection rule and a human decision owner.

If one of these fields is missing, the loop may still be useful, but its claim should be bounded accordingly. A paper-reading agent can help with literature triage even if it cannot act on RTL. A simulator environment can support design-space exploration even if it cannot validate timing closure. A kernel-generation benchmark can reveal code-generation capability even if it does not prove system-level efficiency.
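One way to make the checklist concrete is to record it as a small data structure and let it report which roles are unfilled. The sketch below is illustrative only: the class name `LoopChecklist`, its field names, and the example values are assumptions made for this illustration, not part of any existing tool.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LoopChecklist:
    """Hypothetical record: one slot per loop role; None means unfilled."""
    workload: Optional[str] = None        # task boundary
    representation: Optional[str] = None  # state the loop can read and change
    environment: Optional[str] = None     # valid actions and observations
    feedback: Optional[str] = None        # source plus latency/fidelity/cost notes
    evidence: Optional[str] = None        # configs, assumptions, negative traces
    rejection: Optional[str] = None       # rejection rule and decision owner

    def missing_roles(self) -> list[str]:
        """Return the names of roles that have no resource assigned."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def claim_scope(self) -> str:
        """Summarize how the loop's claim should be bounded."""
        missing = self.missing_roles()
        if not missing:
            return "all roles filled"
        return "claim bounded: no resource for " + ", ".join(missing)


# Illustrative values for a paper-reading agent used for literature triage.
triage = LoopChecklist(
    workload="literature triage over a paper corpus",
    representation="paper text and citation metadata",
    evidence="retrieved passages and query log",
    rejection="reviewer discards unsupported summaries",
)
print(triage.missing_roles())  # ['environment', 'feedback']
print(triage.claim_scope())
```

The point of the sketch is that an unfilled role does not block the project; it appears in the stated scope of the claim instead of being silently omitted.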

C.2 Missing Infrastructure

The most important future resources are not only larger corpora. Architecture needs shared records of design-loop state:

  • negative-trace repositories that preserve failed candidates and reasons;
  • environment schemas that state actions, observations, costs, and invalid states;
  • benchmark-update protocols that record drift and version changes;
  • confidentiality-preserving ways to share tool traces and design reviews;
  • standard design-loop cards for papers, artifacts, and class projects.

These resources would make the field more teachable and more cumulative. They would also make AI-assisted architecture work easier to evaluate, because the community could ask whether a method improved the loop rather than merely whether it produced a plausible artifact.
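As a rough illustration of two items on this list, the sketch below pairs a minimal environment schema with a negative-trace log. Every name in it (`EnvironmentSchema`, `NegativeTrace`, `TraceLog`, the `toy-cache-sim` environment and its parameters) is hypothetical and chosen for illustration; a shared community format would need far more detail, including versions, tool configuration, and calibration records.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EnvironmentSchema:
    """Hypothetical schema: what a harness accepts, reports, and costs."""
    name: str
    actions: dict          # parameter name -> list of allowed values
    observations: list     # metrics the environment reports
    cost_per_eval_s: float  # rough wall-clock cost of one evaluation

    def invalid_reason(self, candidate: dict):
        """Return why a candidate is invalid, or None if it is valid."""
        for key, value in candidate.items():
            if key not in self.actions:
                return f"unsupported parameter: {key}"
            if value not in self.actions[key]:
                return f"value {value!r} not allowed for {key}"
        missing = sorted(set(self.actions) - set(candidate))
        if missing:
            return "missing parameters: " + ", ".join(missing)
        return None


@dataclass
class NegativeTrace:
    """A rejected candidate with the reason and minimal provenance."""
    environment: str
    candidate: dict
    reason: str
    seed: int


@dataclass
class TraceLog:
    traces: list = field(default_factory=list)

    def check(self, schema: EnvironmentSchema, candidate: dict, seed: int) -> bool:
        """Validate a candidate; keep a trace when it is rejected."""
        reason = schema.invalid_reason(candidate)
        if reason is not None:
            self.traces.append(NegativeTrace(schema.name, candidate, reason, seed))
            return False
        return True

    def to_json(self) -> str:
        return json.dumps([asdict(t) for t in self.traces], indent=2)


# Toy environment and candidates, invented for this example.
schema = EnvironmentSchema(
    name="toy-cache-sim",
    actions={"cache_kb": [16, 32, 64], "ways": [1, 2, 4]},
    observations=["miss_rate", "energy_nj"],
    cost_per_eval_s=2.5,
)
log = TraceLog()
ok = log.check(schema, {"cache_kb": 32, "ways": 4}, seed=0)
bad = log.check(schema, {"cache_kb": 48, "ways": 4}, seed=0)
print(ok, bad)
print(log.to_json())
```

Serializing the rejected candidate alongside its reason and seed is what turns a discarded run into a record that someone else can inspect or reuse.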

References