Appendix D — Architecture 2.0 Resource Directory
These links are not a general computer-architecture directory. A resource earns space here only if it helps an Architecture 2.0 loop name a task, represent state, expose actions, return feedback, preserve evidence, or reject a result. Use the list as a starting point and check current versions before relying on any benchmark, dataset, simulator, or tool.
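The inclusion rule above can be written down as a small record check. The following is a minimal sketch, not a schema used by this directory: `LoopRole`, `ResourceEntry`, `earns_space`, `needs_recheck`, and the 180-day threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import FrozenSet, Optional


class LoopRole(Enum):
    """The six loop roles named in the inclusion rule (hypothetical enum)."""
    NAME_TASK = "name a task"
    REPRESENT_STATE = "represent state"
    EXPOSE_ACTIONS = "expose actions"
    RETURN_FEEDBACK = "return feedback"
    PRESERVE_EVIDENCE = "preserve evidence"
    REJECT_RESULT = "reject a result"


@dataclass(frozen=True)
class ResourceEntry:
    """One directory entry; all field names are assumptions for illustration."""
    name: str
    roles: FrozenSet[LoopRole]
    last_checked: Optional[date] = None  # when the version was last verified

    def earns_space(self) -> bool:
        # An entry belongs in the directory only if it serves at least one role.
        return len(self.roles) > 0

    def needs_recheck(self, today: date, max_age_days: int = 180) -> bool:
        # Never-checked entries, or entries checked too long ago, need review.
        if self.last_checked is None:
            return True
        return (today - self.last_checked).days > max_age_days


# Hypothetical entries, not real resources from this directory.
simulator = ResourceEntry(
    name="example-simulator",
    roles=frozenset({LoopRole.RETURN_FEEDBACK, LoopRole.REPRESENT_STATE}),
    last_checked=date(2025, 1, 1),
)
survey = ResourceEntry(name="example-survey", roles=frozenset())
```

Under these assumptions, `simulator.earns_space()` is true and `survey.earns_space()` is false, and any entry without a `last_checked` date is flagged for a version check before use.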
D.1 Architecture 2.0 Framing
Course: CS249r: Architecture 2.0. Course material on agentic AI for computer systems design.
Essay: Architecture 2.0 gymnasium essay. Framing for the data-centric gymnasium argument.
Foundations article: IEEE Computer article. Covers the foundations of AI agents for modern computer system design.
Workshop: ISCA 2026 Architecture 2.0 workshop. Preview-edition workshop page and community entry point.
Portal: Architecture 2.0 resource site. Living collection of Architecture 2.0 links and materials.
D.2 Architecture Reasoning and Design-Problem Benchmarks
QuArch: quarch.ai. Architecture question-answering and reasoning benchmark. Use it to test whether a model can reason over architecture concepts, but not as evidence that a loop can act through tools or reject candidates.
CVDP benchmark: NVlabs/cvdp_benchmark and Hugging Face dataset. Comprehensive Verilog design problems for RTL design and verification. Use it when the loop claim involves HDL generation, test harnesses, simulation feedback, or verification failures.
VerilogEval: NVlabs/verilog-eval. Specification-to-RTL and Verilog code-generation benchmark with executable checks. Use it for method-role claims about RTL generation and compile/test feedback, not for system-level architecture claims.
KernelBench: ScalingIntelligence/KernelBench. GPU-kernel generation benchmark with correctness and performance evaluation. Use it to study software-loop feedback, code generation, and performance evidence before claiming architecture-level benefit.
D.3 Architecture Environments and Design-Space Exploration
ArchGym: Architecture Gym. Environment interface for ML-assisted architecture design. Use it as a concrete example of actions, observations, costs, and feedback in a bounded architecture loop.
Timeloop: NVlabs/timeloop. Mapping, modeling, and code-generation tool for tensor workloads on accelerator architectures. Use it for dataflow, mapping, memory hierarchy, and accelerator design-space loops.
Accelergy: Accelergy. Energy-estimation infrastructure for accelerators. Use it when a loop needs an explicit energy feedback source and calibration boundary.
MAESTRO: maestro-project/maestro. Analytical cost model for DNN dataflows and tiling. Use it as a fast-feedback model whose limits must be recorded before higher commitment.
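The entries in this section share a vocabulary of actions, observations, costs, feedback, and recorded model limits. The toy loop below sketches how those pieces fit together. It is not the API of ArchGym, Timeloop, Accelergy, or MAESTRO: `ToyTilingEnv`, `explore`, the `1 / tile` cost proxy, and the 4096-byte buffer are all assumptions made up for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ToyTilingEnv:
    """Hypothetical environment: choose a square tile size under a buffer budget."""
    buffer_bytes: int = 4096
    bytes_per_element: int = 4
    # The proxy model's limits travel with every record it produces.
    model_limits: str = "analytical proxy; ignores bank conflicts and DRAM timing"
    log: list = field(default_factory=list)

    ACTIONS = (8, 16, 32, 64)  # candidate tile edge lengths (the action space)

    def step(self, tile: int) -> dict:
        """Apply one action and return an evidence record with cost and feasibility."""
        footprint = tile * tile * self.bytes_per_element
        feasible = footprint <= self.buffer_bytes
        # Made-up cost proxy: larger tiles reuse more data, so cost falls as 1/tile.
        cost = 1.0 / tile if feasible else math.inf
        record = {
            "action": tile,
            "observation": {"footprint_bytes": footprint},
            "cost": cost,
            "feasible": feasible,
            "model_limits": self.model_limits,
        }
        self.log.append(record)  # preserve evidence, including rejected candidates
        return record


def explore(env: ToyTilingEnv):
    """Try every action; return the cheapest feasible record, or None."""
    records = [env.step(tile) for tile in env.ACTIONS]
    feasible = [r for r in records if r["feasible"]]
    return min(feasible, key=lambda r: r["cost"]) if feasible else None


env = ToyTilingEnv()
best = explore(env)
rejected = [r["action"] for r in env.log if not r["feasible"]]
```

In this sketch the loop keeps the rejected candidate in the log instead of discarding it, and each record carries the proxy model's stated limits, so a later and more expensive evaluation can see what the fast model did not cover.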
D.4 Full-System Simulation and Hardware/Software Harnesses
gem5: gem5 simulator. Modular computer-system simulator. Use it for architecture feedback that needs workload execution, timing behavior, and reproducible simulator state.
FireSim: FireSim. FPGA-accelerated full-system simulation. Use it when the loop needs stronger hardware/software feedback than a software-only proxy can provide.
Chipyard: Chipyard docs. Integrated framework for generating and evaluating hardware systems. Use it when the loop must connect generators, RTL, simulation, and implementation artifacts.
D.5 Physical-Design and EDA Evidence
OpenROAD: OpenROAD project. Open-source RTL-to-GDS flow. Use it for loops that need physical-design feedback, timing/area/power evidence, or signoff-adjacent rejection.
CircuitNet: CircuitNet. VLSI CAD dataset for machine-learning applications in EDA. Use it for cross-stage prediction and physical-design learning claims, while preserving tool provenance and task scope.
ChiPBench: AI chip-placement benchmark. Benchmark that evaluates AI-based chip-placement algorithms by their end-to-end physical-design impact. Use it when placement evidence must be tied to downstream physical metrics, not only intermediate scores.
D.6 Workload and Benchmark Governance
XRBench: XRBench paper. Extended-reality machine-learning benchmark suite. Use it as a workload anchor for mobile-XR architecture loops, including scenario definition and workload coverage questions.
MLCommons benchmarks: MLCommons benchmarks. Benchmark governance and reporting infrastructure. Use it as a model for workload versions, run rules, comparability, and community-maintained evidence rather than as a generic performance leaderboard.
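The governance concerns named above (workload versions, run rules, comparability) can be illustrated with a small check that refuses to compare two results unless their reporting context matches. This is a sketch under stated assumptions, not the MLCommons submission schema: `RunRecord`, its fields, and `comparability_issues` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical result record; field names are illustrative only."""
    workload: str
    workload_version: str
    run_rules: str
    metric: str
    value: float


def comparability_issues(a: RunRecord, b: RunRecord) -> List[str]:
    """Return the reasons two runs cannot be compared; empty means comparable."""
    issues = []
    for name in ("workload", "workload_version", "run_rules", "metric"):
        left, right = getattr(a, name), getattr(b, name)
        if left != right:
            issues.append(f"{name} differs: {left!r} vs {right!r}")
    return issues


# Made-up records for illustration; the values are not real benchmark results.
run_a = RunRecord("example-workload", "v1.0", "closed", "latency_ms", 12.5)
run_b = RunRecord("example-workload", "v1.1", "closed", "latency_ms", 11.0)
issues = comparability_issues(run_a, run_b)
```

Here `issues` contains a single entry naming the `workload_version` mismatch, so the apparently lower latency of `run_b` would not count as evidence of improvement until both runs use the same workload version.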