Running Workflows#
This page explains the shared mechanics of running JaQMC workflows: how to do a quick debug run, how to launch a real training job, how resume and evaluation work, and what files each command writes.
This page is for users running built-in apps such as jaqmc molecule, jaqmc solid, and
jaqmc hall. If you are building a new app/workflow implementation, see
Writing Workflows.
In command examples, replace <app> with your system command name such as molecule,
solid, or hall.
Common Tasks#
Most users arrive here with one of four goals. Start with the recipe that matches what you want to do, then use the later sections as reference.
Fast debug run#
Use this when you want to confirm that the workflow starts, writes outputs, and makes progress without waiting for a full calculation.
Pick your normal system input: CLI flags or one or more
--ymlfiles.Set an explicit save path so the short run does not mix with other outputs.
Cut the training budget down to something that finishes quickly.
If your app has a pretrain stage, shorten that stage too.
Lower
workflow.batch_sizefor speed, for example to256or512.If needed, reduce model size with your app’s
wf.*knobs.
jaqmc <app> train ... \
workflow.save_path=./runs/<name>-debug \
train.run.iterations=100
What success looks like:
workflow.save_pathis created and containsconfig.yaml.The logs show increasing step numbers.
Reported losses and energies stay finite.
A final checkpoint is written.
If the run fails to initialize or produces NaNs, continue in Troubleshooting.
Production baseline#
Use this when you are launching a real optimization run on a workstation or cluster.
Start from the recommended settings on your system page and its config reference.
Set a stable, descriptive
workflow.save_path.Tune the optimization budget first: usually
pretrain.run.iterationsandtrain.run.iterations.Keep
workflow.batch_sizeat its default unless you have evidence that the training signal is too noisy or your hardware budget requires a change.Leave lower-level knobs alone until you have a specific reason to touch them.
jaqmc <app> train ... workflow.save_path=./runs/<name>-prod
For most production runs, the highest-value decisions are how long to train, whether to keep pretraining, and whether you need a different optimizer or wavefunction. Everything else is usually a second-order adjustment.
Fit larger runs in memory#
If a run does not fit in device memory, first decide whether you can change the walker
count. Lowering workflow.batch_size reduces memory, but it also increases variance and
can hurt optimization quality. To keep the same number of walkers, set vmap_chunk_size
on a local estimator so the work is evaluated in smaller pieces. That trades speed for
lower peak memory.
The tightest part of the model varies by system, but the kinetic energy estimator is often the right place to start:
jaqmc <app> train ... \
estimators.energy.kinetic.vmap_chunk_size=128
Use the largest chunk size that still fits. Smaller chunks cut peak memory further, but they add overhead.
Resume, branch, or evaluate#
Use this when you already have a run on disk and want to continue it, fork from it, or measure final observables from its checkpoints.
Resume training in the same directory by reusing workflow.save_path and increasing the
training budget:
jaqmc <app> train ... \
workflow.save_path=./runs/<name>-prod \
train.run.iterations=<larger_value>
Branch from an existing run into a new output directory by setting both save and restore paths:
jaqmc <app> train ... \
workflow.save_path=./runs/<name>-v2 \
workflow.restore_path=./runs/<name>-prod \
train.run.iterations=<larger_value>
Run evaluation in a separate directory by pointing workflow.source_path at the training
run you want to evaluate:
jaqmc <app> evaluate ... \
workflow.save_path=./runs/<name>-eval \
workflow.source_path=./runs/<name>-prod \
run.iterations=1000
If you want multiple evaluation variants, give each one its own workflow.save_path and
reuse the same workflow.source_path.
Tip
These three path settings do different jobs:
workflow.save_path: where the current command writes outputsworkflow.restore_path: where the current command looks for its own checkpoints; if unset, it defaults toworkflow.save_pathworkflow.source_path: training run or checkpoint used byevaluateto load trained parameters, walker data, and sampler state
When you rerun a command with the same workflow.save_path, JaQMC resumes from the
latest checkpoint already in that directory. Use workflow.restore_path only when you
want to start writing into a different output directory while restoring state from
somewhere else.
Reporting checklist#
Use this checklist before sharing final numbers:
Prefer evaluation outputs over in-training values for final reported observables.
Estimate uncertainty from per-step evaluation data, not from a single digest value.
Keep the resolved config and run provenance with the result.
Record any non-default estimator, optimizer, or wavefunction settings.
Make sure comparisons across runs use the same observable definitions and units.
For detailed analysis workflows, see Training Statistics and Analyzing Evaluations.
Paths, Outputs, And Checkpoints#
All outputs for a command go into workflow.save_path. If you do not set it explicitly,
JaQMC creates a timestamped directory under runs/. If the current working directory is inside the source repo, that means
the repo-level runs/ directory; outside the repo, it means ./runs/ in your current
working directory.
Typical training output looks like this. If your workflow does not include a pretrain
stage, the pretrain_* files are simply absent:
save_path/
├── config.yaml
├── pretrain_ckpt_NNNNNN.npz
├── pretrain_stats.h5
├── pretrain_stats.csv
├── train_ckpt_NNNNNN.npz
├── train_stats.h5
└── train_stats.csv
Typical evaluation output looks like this:
save_path/
├── config.yaml
├── evaluation_ckpt_NNNNNN.npz
├── evaluation_stats.h5
├── evaluation_stats.csv # optional, depends on configured writers
└── evaluation_digest.npz
A few details are worth keeping straight:
*_ckpt_*.npzfiles are resumable checkpoints for that stage.*_stats.h5stores per-step statistics in a machine-friendly format.*_stats.csvis writer-dependent, so it may or may not be present.evaluation_digest.npzis the compact summary produced at the end of evaluation.
Evaluation loads from workflow.source_path. If that path is a directory, JaQMC restores
the latest train_ckpt_*.npz it finds there. If it is a specific checkpoint file, JaQMC
uses that file directly.
Checkpoint saves follow one rule: the last iteration is always saved, and intermediate
checkpoints are written only when both save_time_interval and save_step_interval are
satisfied. With the defaults (600 seconds and every 1000 steps), that means a fast
job may skip step-1000 checkpointing if 10 minutes have not passed yet, but it still
writes a final checkpoint when the run ends.
For final observables, use evaluation output rather than training logs. In practice,
evaluation_digest.npz is the quick summary, while evaluation_stats.h5 is the file you
need for uncertainty analysis from per-step data. For a full walkthrough, see
Analyzing Evaluations.
If you are implementing a new workflow or app rather than running an existing one, continue with Writing Workflows for the framework-side wiring model.