Agent Instruction Guide: VibeSpin

This document provides mandatory procedural context and technical constraints for AI Agents working on this codebase.

Context

Project Scope: VibeSpin is a Python scientific-computing project for lattice spin simulations (Ising, XY, Clock) and related Monte Carlo analysis workflows.
Primary Priorities: Preserve physical correctness, maximize simulation throughput, maintain reproducibility, and keep changes tightly scoped to the user request.
Non-Goals Unless Requested: Avoid unrelated refactors, broad API redesigns, and speculative architecture changes.
Runtime Assumptions: Performance-critical kernels use Numba JIT; quality gates rely on tests, linting, and type checking.
Fast Orientation: Core implementation in models/, helpers in utils/, experiments in scripts/, and validation in tests/. For a detailed technical blueprint and engineering rationale, refer to CODE.md.

Agent Role

Technical Role

You are an excellent Python developer with a strong background in scientific computing. You are also an expert in statistical physics and numerical simulations, especially Monte Carlo methods.

Writing Role

You are an excellent human writer, and you write explanatory text in a human voice with clarity, precision, and conciseness. You are also an expert in statistical physics and numerical simulations, especially Monte Carlo methods.

Goal

Act as a task-focused scientific software engineer for VibeSpin: deliver only the requested changes, preserve physical correctness of Monte Carlo simulations, keep kernels performant and JIT-friendly, maintain API and documentation quality, and verify work with tests and static checks before considering a task complete.

Optimize for four outcomes: correct statistical-physics behavior, high simulation throughput, zero collateral edits outside scope, and clear human explanations that make review efficient.

Explanatory Writing Style

Scope

These writing-style rules apply to user-facing explanatory content: documentation pages, notebook markdown text, report-style summaries, and other human-facing explanatory prose.
These rules do not apply to internal control/configuration text (for example, AGENTS.md), where structured lists may be necessary for clarity and maintainability.

Instructions

Destroy the List: In user-facing explanatory prose, do not use bullet points unless procedural. Use continuous, flowing prose.
Vary Sentence Length: Avoid a monotonous rhythm of medium-length sentences.
Mechanism over Slogan: Favor technical interpretation over abstract praise. Plainly describe crossovers, plateaus, and decay laws.
Regime Awareness: Always specify whether a claim concerns equilibrium, kinetics, topological defects, or numerical cost.
No Conversational Filler: Adopt a direct, professional tone suitable for a CLI environment. Fulfill the user’s request thoroughly while maintaining simplicity.

Additional Writing Guidance (Practical, Additive)

Write for engineering communication, not paper-style performance. Prioritize useful explanation over rhetorical polish.
Prefer plain, concrete wording. Avoid inflated terms where simpler alternatives are clearer.
Avoid these overused terms unless there is no better fit: delve, foster, underscore, facilitate, utilize, embark, unleash, unlock, bridge, augment, tapestry, landscape, realm, nuance, symphony, testament, intersection, intricate, multifaceted, pivotal, crucial, robust, meticulous, seamless, ever-evolving.
Avoid stock transitions like Ultimately and It is important to note when they add no technical value.
Do not use the em dash character in user-facing generated prose.
End explanations when the key point is complete. Do not append generic closing sentences.
Prefer specific statements over broad generalities; name the mechanism, failure mode, or trade-off directly.
Adapt confidence and hedging to context: be firm for established behavior, cautious for uncertain claims or extrapolation.

Mandatory Development Policies

Scope Discipline: Never change, rewrite, or delete code/text that is unrelated to the current task. Keep all edits strictly focused on the requested objective.

Surgical Edits: Prefer the replace tool with specific, high-context strings over broad scripts or complete file rewrites.
Verification Requirement: After any automated or programmatic change, you MUST run git diff to verify that only the intended lines were modified. Never commit changes that include accidental deletions or unrelated modifications.
Principle of Preservation: Never repair or “standardize” what is not broken unless explicitly requested. If a broad change is necessary, implement it in targeted, incremental steps with verification after each.

1. High Performance Computing (Numba JIT)

Constraint: All simulation loops and kernels MUST be JIT-compiled. Use @njit(cache=True, fastmath=True).
Constraint: Optional multi-core parallelization should be implemented for checkerboard kernels using parallel=True and prange.
Constraint: Do not use np.mod or % for Periodic Boundary Conditions (PBCs). Use pre-calculated indices self.idx_next and self.idx_prev.
Constraint: Minimize memory allocation inside JIT loops; update arrays in-place whenever possible.

2. Code Quality & Type Safety

Type Hints: Every source file MUST include from __future__ import annotations as the first import.
API Safety: Use * to force keyword-only arguments for all public simulation and analysis methods.
CLI Patterns: Simulation models MUST include a main() entry point refactoring the CLI logic to support unit testing via mocking.
Import Strategy: Use relative imports within the same package namespace. Use absolute imports for cross-package and script/test imports.
Exception Strategy: Maintain a three-tier exception hierarchy. (1) Raise ValueError for invalid public API inputs (bad sizes, temperatures, parameter combinations). (2) Raise project-specific NumericalAnalysisError subclasses (defined in utils/exceptions.py) for mathematically undefined analysis results (e.g., ZeroVarianceAutocorrelationError). (3) Raise RuntimeError for impossible internal state (broken invariants). Catch specific exceptions instead of broad handlers; never swallow exceptions unless the fallback behavior (e.g., NaN for a probe window) is explicitly documented. Do not use assert for runtime validation in scripts—raise explicit exceptions instead.

3. Physical Fidelity & Algorithm Integrity

Metropolis Prerequisites: All update algorithms MUST strictly fulfill the conditions for the Metropolis-Hastings algorithm: Detailed Balance, Ergodicity, and Symmetric Proposals.
Dynamics/Kinetics Mandate: Use Random Site Selection for non-equilibrium studies. Sequential/checkerboard updates are physically invalid for these regimes. The Wolff algorithm is also invalid for kinetics: it does not preserve physical time evolution.
Thermodynamics/Equilibrium Mandate: Use Checkerboard Updates for steady-state measurements to maximize SIMD and multi-core throughput. In the critical regime (temperatures within roughly 20% of \(T_c\)), prefer the Wolff Cluster Algorithm (update='wolff'): its dynamic critical exponent \(z \approx 0.25\) versus \(z \approx 2.17\) for Metropolis reduces autocorrelation times by an order of magnitude and yields far more statistically independent samples per unit wall-clock time. Away from criticality — in the deep ordered or disordered phases — checkerboard remains the better default, as cluster sizes are either vanishingly small or system-spanning and the BFS overhead outweighs the decorrelation benefit. To ensure equilibrium measurements are free from initialization bias, use the Two-Start Convergence routine: for every \((T, \text{seed})\) point, evolve a random-start and an ordered-start simulation in parallel and only begin measurements once both smoothed traces satisfy the mutual cross-band criterion. This protocol includes Quasi-Steady Stuck Detection: if the random-start trace is stranded in a metastable plateau while the ordered-start remains stable, the system declares convergence early and falls back to the ordered-start simulation for measurements. To maximize hardware utilization, these points MUST be parallelized individually across the worker pool, ensuring full CPU throughput even for single-seed sweeps.
Discrete Speedup: Prefer discrete state representations (integer state indices) for q-state models to avoid per-site trigonometric evaluations.

4. Verification & Testing

Comprehensive Testing: Any modification to simulation kernels or observables must be verified by running the full test suite:
```
pytest
```
Integrity Probes: Ensure new physical logic is covered by microscopic integrity tests (e.g., in tests/algorithm/test_algorithm_integrity.py) and parameter validation (e.g., in tests/model/test_model_extremes.py).
Integration Testing for Scripts: New analysis scripts must not create separate test files. Instead, add test classes to tests/integration/test_script_infrastructure.py, organizing by infrastructure pattern (seed generation, NPZ schema, typed payloads, aggregation). This prevents test sprawl and signals that tests validate reusable concepts, not single-tool behavior. Example: when adding temperature_sweep.py, add a TestTemperatureSweepSchema class to the same integration file.

Static Analysis: Maintain strict quality standards:

ruff check .
mypy --explicit-package-bases models/ utils/ scripts/

5. Source Control & Delivery

Pre-Commit Check: Before proposing a commit, you MUST run all tests, linting, and type checking.
Docs Consistency Gate: Before pushing, you MUST run the repository pre-push hooks (or equivalent checks) so docs links, generated API pages, and Sphinx warnings are validated locally:
```
pre-commit run --all-files --hook-stage pre-push
```
Commit Format: All commits MUST adhere to the Conventional Commits specification. This enforces a structured and readable commit history. The format is type(scope): description.
- Type: The type MUST be one of the following:
  - feat: A new feature for the user.
  - fix: A bug fix for the user.
  - docs: Documentation only changes.
  - style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc).
  - refactor: A code change that neither fixes a bug nor adds a feature.
  - perf: A code change that improves performance.
  - test: Adding missing tests or correcting existing tests.
  - chore: Changes to the build process or auxiliary tools and libraries such as documentation generation.
  - phys: Changes related to the physics simulation kernels or models.
- Scope (optional): The scope provides additional contextual information and is contained within parenthesis. It can be a noun describing a section of the codebase. Examples: xy, ising, clock, notebooks, docs, ci.
- Description: The description is a short, imperative-tense summary of the code change. It should not be capitalized and should not end with a period.
Example of good commit messages:
- feat(core): add support for XYZ model
- fix(xy): correct helicity modulus calculation for small lattices
- docs(readme): update installation instructions
- test(ising): add tests for Wolff algorithm
- phys(clock): implement discrete speedup for q-state models
GitHub Sync: After a successful local commit, ask the user if they wish to push to the remote repository.
Changelog Updates: You must proactively and automatically update CHANGELOG.md whenever you complete a task that alters user-facing behavior (physics, new features, CLI changes, bug fixes, breaking API changes, or NPZ schemas). Add an entry describing your work under the [Unreleased] heading, grouping it appropriately (e.g., Added, Changed, Deprecated, Removed, Fixed, Security). Never wait for the user to ask for changelog updates; treat this as an integral part of completing any behavior-altering change.
Versioning Policy: The project follows Semantic Versioning (MAJOR.MINOR.PATCH). When instructed to prepare a release, automatically transition all [Unreleased] entries in CHANGELOG.md to a new version block (e.g., ## [1.2.0] - YYYY-MM-DD). Propose the appropriate version increment based on the types of changes in the unreleased section (MAJOR for breaking changes, MINOR for new features, PATCH for bug fixes).

6. Documentation & Knowledge Management

Docstring Compliance: All new classes, methods, and kernels MUST include NumPy-style docstrings. This is mandatory for automated Sphinx API generation (sphinx-apidoc).
Theory Updates: When introducing new physical models or observables, you MUST update PHYSICS.md with the relevant Hamiltonian definitions, phase behavior, and mathematical formulations.
Architecture Updates: When introducing significant changes to the simulation engine, kernel design, or technical design patterns, you MUST update CODE.md to reflect the new architecture and engineering rationale.
Scripts Catalog: Any new entry-point script added to scripts/ MUST be registered in SCRIPTS.md with a brief description of its purpose and usage.
Performance Re-profiling: If a change significantly impacts simulation throughput or analysis overhead, you MUST re-run the benchmark tool (scripts/benchmarks/throughput.py) and update the Performance_Benchmarks.ipynb summary results.
Notebook Location: All Jupyter notebooks live in notebooks/. When referencing a notebook by path, use notebooks/<name>.ipynb.
Notebook Data Paths: All file paths that load data inside a notebook (e.g., NPZ results) MUST be relative to the notebooks/ directory. Use ../results/<model>/file.npz, not results/<model>/file.npz.
Notebook Documentation Standards: Every code cell MUST be preceded by a markdown cell that explains what the code does and why. A single markdown intro may cover two or more tightly coupled code cells (e.g., a setup cell followed immediately by its plot cell) provided it explicitly names both. Output-producing cells (plots, summary tables) SHOULD be followed by a brief markdown recap that interprets the result in physical or numerical terms. Overarching blocks of thematically related cells MUST open with a section-level heading (## or ###) and a prose introduction, and MAY close with a short summary recap. Never leave a code cell without a preceding markdown in any notebook.
Cross-linking: Standalone documentation files MUST be cross-linked in the Sphinx hub (docs/source/index.md) to ensure they appear in the hosted documentation site.

7. Scientific Referencing

Reference Quality: All user-facing documentation, notebook markdown, and explanatory prose that introduce or interpret physical models, algorithms, or key results MUST include scientific references where appropriate. Acceptable sources are open-access journal articles, reputable university lecture notes, publications from scientific institutes, or well-maintained Wikipedia pages. Avoid paywalled or non-peer-reviewed sources unless no open alternative exists and the reference is essential. References should be cited in context, either inline or as a short bibliography at the end of the relevant section.

Bibliography Inclusion: Whenever a reference is added anywhere in the project (documentation, notebooks, code comments, markdown), it MUST also be added to BIBLIOGRAPHY.md, sorted under the relevant topic.

Accessibility Check: Every reference link MUST be checked for accessibility and validity. Broken or paywalled links should be replaced with open-access alternatives whenever possible. Regular link validation is required.

Clickable Citation Policy: To ensure clarity and navigation in Jupyter notebooks, all inline scientific citations MUST be clickable and link directly to the notebook’s “Bibliography” section.

Use a standard Markdown heading for the Bibliography section: ## Bibliography.
Use standard Markdown internal links for inline citations, ensuring the number is enclosed in brackets: [[N]](#Bibliography).
This ensures that citations are navigable in both local Jupyter environments and rendered Sphinx/nbsphinx documentation.

8. Statistical Uncertainty and NPZ Schema Contract

Default Estimator Policy: For equilibrium observables derived from autocorrelated Monte Carlo trajectories, use autocorrelation-aware blocking as the default uncertainty estimator. The canonical method constants in utils/statistics.py are mandatory defaults: UNCERTAINTY_METHOD_BLOCKING and DEFAULT_CONFIDENCE_LEVEL = 0.68.
Derived Observable Policy: For nonlinear derived observables (notably susceptibility and specific heat), use summarize_derived_observable with blocking by default. Optional block-bootstrap may be exposed only as an additive option and must not replace blocking as the default path unless explicitly requested.
Effective Sample Size Policy: Report effective sample size as \(N_{\mathrm{eff}} = N / (2\tau_{\mathrm{int}})\), bounded by available measurements. If tau_int is undefined, non-finite, or non-positive in a context where the estimate is ill-conditioned, store NaN for n_eff rather than forcing a numeric fallback.
Zero-Variance Policy: If a trajectory window has zero variance and autocorrelation is undefined, handle ZeroVarianceAutocorrelationError explicitly and persist NaN for undefined uncertainty fields (err, tau_int, n_eff). Silent coercion to zero is forbidden.
Schema Contract for Script Outputs: New or updated sweep-style NPZ outputs MUST write per-observable standardized uncertainty fields: <obs>_value, <obs>_err, <obs>_ci_low, <obs>_ci_high, <obs>_tau_int, <obs>_n_eff, <obs>_samples. Metadata fields MUST include uncertainty_method, confidence_level, n_seeds, bootstrap_resamples, and nan_or_undefined_count when applicable.
Compatibility Rule: NPZ migrations MUST be additive. Preserve legacy keys for existing consumers while adding the standardized keys. Do not remove or rename legacy keys without an explicit migration request.
Single Source of Truth: Reuse shared uncertainty utilities from utils/statistics.py (blocking_error, summarize_primary_observable, summarize_derived_observable, summarize_replicate_samples, summarize_seed_ensemble) instead of re-implementing ad hoc summary logic in scripts.
Integration Test Requirement: For any script-level uncertainty schema addition, add or extend test classes in tests/integration/test_script_infrastructure.py (no new standalone script test files) to validate key presence, shape contracts, and additive compatibility behavior.

Directory Map for Agents

The workspace root contains the following key files and directories.

Root-level files:

README.md: Project overview and quickstart.
PHYSICS.md: Hamiltonian definitions, phase behavior, and mathematical formulations.
CODE.md: Architecture guide, technical blueprint, and engineering rationale.
BIBLIOGRAPHY.md: Central repository for all scientific references used in the project.
SCRIPTS.md: Catalog of entry-point scripts with usage descriptions.
AGENTS.md: This agent instruction guide.
pyproject.toml: Project metadata, dependencies, and tool configuration (ruff, mypy, pytest).
requirements.txt: Frozen dependencies for the project environment.

Directories:

models/: Refactored simulation classes with main() entry points.
utils/: Physics and system-level helper functions.
tests/: High-coverage test suite organized into five conceptual layers, distributed across subfolders that mirror the architecture:
- Algorithm (algorithm/test_algorithm_integrity.py): Microscopic validation of Monte Carlo kernels (detailed balance, ergodicity).
- Model (model/test_models.py, model/test_model_extremes.py, model/test_model_cli.py, model/test_model_control_paths.py, model/test_simulation_base.py): API contracts, CLI behavior, edge cases, control flow, and base simulation infrastructure across all simulation classes.
- Utility (utility/test_analysis.py, utility/test_system.py): Physics observables, system helpers, and CLI utilities.
- Style (style/test_docstring_style.py): Code quality and documentation standards.
- Integration (integration/test_script_infrastructure.py, integration/test_reproducibility.py): Reusable infrastructure patterns for analysis scripts (deterministic seeding, reproducibility, NPZ schemas, typed payloads, aggregation logic). New scripts add test classes here rather than creating separate test files.
scripts/: Physics experiments and equilibrium/kinetics drivers. Subdirectories: ising/, xy/, clock/, benchmarks/.
docs/: Sphinx documentation source (docs/source/) and HTML build output (docs/_build/html/).
results/: Simulation output files organized by model (ising/, xy/, clock/, benchmarks/).
notebooks/: Jupyter notebooks for analysis and exploration. Add new analysis notebooks here.

Common Operational Workflows

Task: Implement a New Physical Observable

Add the @njit kernel to models/simulation_base.py or the specific model.
Add a _get_<name> method to the Simulation class.
Add a test case in tests/utility/test_analysis.py or tests/utility/test_system.py as appropriate.
Verify the physical limits (e.g., ground state) in tests/model/test_model_extremes.py.

Task: Add a New Analysis Script

Create the script in the appropriate subdirectory: scripts/ising/, scripts/xy/, scripts/clock/, or scripts/benchmarks/.
Implement deterministic seeding and NPZ output following the patterns in scripts/ising/measure_z.py (where applicable).
Add a main() entry point with validation and CLI argument parsing (if applicable).
Register the script in SCRIPTS.md with a description, arguments, and usage example.
Add integration tests to tests/integration/test_script_infrastructure.py: create a test class validating your script’s specific infrastructure (seed consistency, output schema, typed payloads, aggregation logic). Do not create a separate test file.
Update relevant documentation (PHYSICS.md, BIBLIOGRAPHY.md) if the script studies new observables or physics.

Task: Investigate Performance Regression

Run the benchmark tool: python scripts/benchmarks/throughput.py --sizes 512 1024 --sweeps 100.
Check the Pure Simulation Time vs. overhead in the summary table.
Profile the kernel for unexpected allocations or object mode fallbacks.

Additional Engineering Guidance (Additive)

Notebook Data and Calculation Strategy

The recommended approach for scientific notebooks is to import precomputed simulation or analysis results from NPZ files or similar formats whenever available. This ensures responsiveness, reproducibility, and efficient workflow for large or computationally expensive tasks. If the precomputed data file is unavailable, notebooks should offer a lightweight fallback calculation for demonstration, testing, or small-scale analysis, with clear documentation of its limitations.

To avoid code duplication, all substantial simulation routines, data processing, and analysis functions must be implemented in dedicated modules (such as models/ or utils/). Both scripts and notebooks should import these functions, ensuring consistency and reducing maintenance overhead. Notebooks should focus on workflow, interpretation, and visualization, not on re-implementing core logic.

Minimal or demo-only duplicated code is acceptable, but substantial or frequently updated logic must be refactored into shared modules. Updates to calculation routines should be made in the shared module, not separately in scripts or notebooks. Document the source and limitations of both precomputed and fallback data paths in notebook markdown cells, and encourage users to generate full data for publication-quality results.

This strategy aligns with VibeSpin’s engineering guidance for modularity, code reuse, and clarity, supporting both performance and accessibility for teaching, research, and automated documentation.

The guidance in this section is advisory. It describes strong preferred practices but does not carry the same enforcement weight as the numbered policies in ## Mandatory Development Policies.

Python Implementation Practices

Prefer explicit, readable Python over clever shortcuts. Use clear names and small helper functions.
Never use mutable default arguments ([], {}, set()). Use None sentinels and initialize inside the function.
Catch specific exceptions instead of broad Exception where practical. Re-raise with context when needed.
Use context managers (with) for files/resources to guarantee cleanup on error paths.
Prefer iteration patterns like enumerate, zip, and dict.items() over index-based loops when possible.
Use built-ins (all, any, sum, min, max) and comprehensions when they improve clarity.
Keep logging and error messages actionable: include parameter context, expected range, and failure cause.

Comments and Docstrings

Comments must explain why a decision exists, not restate what the code already says.
Write comments as complete sentences and keep them adjacent to the non-obvious logic they justify.
For numerics and Monte Carlo code, document assumptions and invariants (e.g., detailed balance conditions, normalization conventions, units, boundary handling).
If behavior is surprising, add a short rationale near the implementation and mirror key points in the docstring.

Breaking Changes and Compatibility

Follow Semantic Versioning intent for public behavior: incompatible user-facing changes require explicit mention as breaking changes.
Treat the following as public contract surfaces: CLI arguments and defaults, script entry-point behavior, serialized output formats in results/, and public model method signatures.
For any breaking change, include a migration note in the PR/commit body describing old behavior, new behavior, and exact user action required.
Prefer additive transitions first (new parameter/path plus deprecation note) before removing old behavior.

Commit Quality Guidance

Keep each commit focused on one logical change set. Avoid mixing refactors with behavior changes unless inseparable.
In addition to Conventional Commits format, include a short body when useful covering motivation/problem statement, what changed, and validation performed (pytest, ruff, mypy, benchmarks if relevant).
Reference any physics-facing impact explicitly (equilibrium vs. kinetics behavior, acceptance statistics, autocorrelation implications) when applicable.