A toolkit for transforming molecular dynamics (MD) trajectories into rich graph representations, sampling
random and self-avoiding walks, learning node embeddings, and visualizing residue interaction networks (RINs). SAWNERGY
keeps the full workflow — from cpptraj output to skip-gram embeddings (node2vec approach) — inside Python, backed by efficient Zarr-based archives and optional GPU acceleration.
```shell
pip install sawnergy
```

Optional: for GPU training, install PyTorch separately (e.g., `pip install torch`).
Note: RIN building requires `cpptraj` (AmberTools). Ensure it is discoverable via `$PATH` or the `CPPTRAJ`
environment variable. The easiest route is to install AmberTools via Conda and activate the environment; SAWNERGY will then find the `cpptraj` executable on its own.
## Recent changes

- Linear layers now use the `(out_features, in_features)` weight layout; transpose the `(D, V)` out weights to `(V, D)` for embedding access.
- `Walker.sample_walks(..., in_parallel=True)` now accepts `max_parallel_workers`, so you can lower the worker count below `os.cpu_count()` when sharing machines or reserving cores for other workloads.
- `configure_logging()` now documents the correct defaults, and an optional `force=True` clears existing handlers before installing fresh ones, which is useful for scripts/notebooks that reconfigure logging multiple times.
- `ArrayStorage` is easier to introspect: `__repr__` plus `list_blocks()` let you quickly inspect the stored datasets when debugging archives or working interactively.
- `displayed_nodes` (and related selectors) now reject non-integer inputs before converting to 0-based indices, and edge coordinate buffers are only materialized when an edge layer is requested, reducing unnecessary copies when plotting nodes only.
- Internal `np.bincount`/normalization work has been tidied up, and `locate_cpptraj()` now de-duplicates candidate paths before probing to avoid repeated `cpptraj -h` calls.
- `SGNS_Torch` is no longer deprecated.
- `SG_Torch` and `SG_PureML` no longer use biases.
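The layout note above means the output weight matrix is stored as `(D, V)`; for per-node embedding lookup you want one row per vocabulary entry, i.e. a simple transpose. A NumPy sketch (the shapes are illustrative, not taken from the library):

```python
import numpy as np

D, V = 8, 100             # embedding dim, vocabulary size (illustrative)
W_out = np.zeros((D, V))  # stored in (out_features, in_features) layout
embeddings = W_out.T      # (V, D): row i is the vector for node i
```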
### `SGNS_Torch`

`sawnergy.embedding.SGNS_Torch` currently produces noisy embeddings in practice. The issue likely stems from weight initialization, although the root cause has not been conclusively determined. The class and its `__init__` docstring carry a deprecation notice; constructing the class emits a `DeprecationWarning` and logs a warning. Prefer `SG_Torch` (plain Skip-Gram with full softmax) or the PureML backends `SGNS_PureML` / `SG_PureML`.

Embeddings can be trained with negative sampling (`objective="sgns"`) or plain Skip-Gram (`objective="sg"`), using either PureML (default) or PyTorch. When rendering figures from scripts, pass `show=True`.

## Pipeline

```
MD Trajectory + Topology
          │
          ▼
     RINBuilder
          │  → RIN archive (.zip/.zarr) → Visualizer (display/animate RINs)
          ▼
       Walker
          │  → Walks archive (RW/SAW per frame)
          ▼
      Embedder
          │  → Embedding archive (frame × vocab × dim)
          ▼
    Downstream ML
```
Each stage consumes the archive produced by the previous one. Metadata embedded in the archives ensures frame order,
node indexing, and RNG seeds stay consistent across the toolchain.
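Because each archive is a Zarr store zipped into a plain ZIP file, its root attributes can be inspected with the standard library alone: Zarr v3 keeps group metadata, including the `attributes` mapping, in a `zarr.json` entry. The snippet below fabricates a tiny stand-in archive to show the pattern; with a real archive you would open e.g. `RIN_demo.zip` instead.

```python
import io
import json
import zipfile

# Fabricate a minimal stand-in for a Zarr v3 .zip archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("zarr.json", json.dumps({
        "zarr_format": 3,
        "node_type": "group",
        "attributes": {"frame_range": [1, 100], "molecule_of_interest": 1},
    }))

# Read the root attributes back without any Zarr dependency.
with zipfile.ZipFile(buf) as zf:
    meta = json.loads(zf.read("zarr.json"))
print(meta["attributes"])
```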


## Components

- `sawnergy.rin.RINBuilder`: drives the `cpptraj` executable to extract per-frame interaction data; manages `cpptraj` execution and batch processing, and keeps temporary stores tidy via `ArrayStorage.compress_and_cleanup`.
- `sawnergy.visual.Visualizer`: renders RINs with Matplotlib (Agg fallback in headless environments) and offers convenient color palettes via `visualizer_util`.
- `sawnergy.walks.Walker`: samples random and self-avoiding walks; backs shared state with `walker_util.SharedNDArray` so multiple processes can sample without copying; writes `(time, walk_id, length+1)` tensors (1-based node indices) alongside metadata such as `walk_length`, `walks_per_node`, and the RNG scheme.
- `sawnergy.embedding.Embedder`: trains per-frame node embeddings; select `model_base="pureml"` or `"torch"` with per-backend overrides supplied through `model_kwargs`; choose the embedding `kind` (`"in"`, `"out"`, or `"avg"`) from `embed_frame` and `embed_all`; `embed_all` targets an output archive.
- `sawnergy.sawnergy_util`: `ArrayStorage`, a thin wrapper over Zarr v3 with helpers for chunk management, attribute coercion to JSON, and transparent compression to `.zip` archives; plus functional helpers (`elementwise_processor`, `compose_steps`, etc.), temporary file management, logging, and runtime utilities.
- `sawnergy.logging_util.configure_logging`: configures rotating file/console logging consistently across scripts.

## Archive layout

| Archive | Key datasets (name → shape, dtype) | Important attributes (root attrs) |
|---|---|---|
| RIN | `ATTRACTIVE_transitions` → (T, N, N), float32 • `REPULSIVE_transitions` → (T, N, N), float32 (optional) • `ATTRACTIVE_energies` → (T, N, N), float32 (optional) • `REPULSIVE_energies` → (T, N, N), float32 (optional) • `COM` → (T, N, 3), float32 | `time_created` (ISO) • `com_name` = "COM" • `molecule_of_interest` (int) • `frame_range` = (start, end) inclusive • `frame_batch_size` (int) • `prune_low_energies_frac` (float in [0, 1]) • `attractive_transitions_name` / `repulsive_transitions_name` (dataset names or None) • `attractive_energies_name` / `repulsive_energies_name` (dataset names or None) |
| Walks | `ATTRACTIVE_RWs` → (T, N·num_RWs, L+1), int32 (optional) • `REPULSIVE_RWs` → (T, N·num_RWs, L+1), int32 (optional) • `ATTRACTIVE_SAWs` → (T, N·num_SAWs, L+1), int32 (optional) • `REPULSIVE_SAWs` → (T, N·num_SAWs, L+1), int32 (optional) • Note: node IDs are 1-based. | `time_created` (ISO) • `seed` (int) • `rng_scheme` = "SeedSequence.spawn_per_batch_v1" • `num_workers` (int) • `in_parallel` (bool) • `batch_size_nodes` (int) • `num_RWs` / `num_SAWs` (ints) • `node_count` (N) • `time_stamp_count` (T) • `walk_length` (L) • `walks_per_node` (int) • `attractive_RWs_name` / `repulsive_RWs_name` / `attractive_SAWs_name` / `repulsive_SAWs_name` (dataset names or None) • `walks_layout` = "time_leading_3d" |
| Embeddings | `FRAME_EMBEDDINGS` → (T, N, D), float32 | `created_at` (ISO) • `frame_embeddings_name` = "FRAME_EMBEDDINGS" • `time_stamp_count` = T • `node_count` = N • `embedding_dim` = D • `model_base` = "torch" or "pureml" • `embedding_kind` = "in", "out", or "avg" |
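The `rng_scheme = "SeedSequence.spawn_per_batch_v1"` attribute records how per-batch random streams are derived from the archive-level `seed`. The exact batching is internal to `Walker`, but the underlying NumPy mechanism looks like this: spawning children from one `SeedSequence` yields independent yet fully reproducible streams.

```python
import numpy as np

root = np.random.SeedSequence(123)  # the archive-level seed
batch_seeds = root.spawn(4)         # one child sequence per batch/worker
rngs = [np.random.default_rng(s) for s in batch_seeds]
draws = [int(r.integers(0, 1000)) for r in rngs]

# Re-spawning from the same root seed reproduces the exact same streams.
replay = [int(np.random.default_rng(s).integers(0, 1000))
          for s in np.random.SeedSequence(123).spawn(4)]
print(draws == replay)  # True
```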
### Notes

- T equals the number of frame batches written (i.e., `frame_range` swept in steps of `frame_batch_size`).
- `ATTRACTIVE`/`REPULSIVE_energies` are pre-normalized absolute energies (written only when `keep_prenormalized_energies=True`), whereas `ATTRACTIVE`/`REPULSIVE_transitions` are the row-wise L1-normalized versions used for sampling.
- Every archive also stores `array_chunk_size_in_block`, `array_shape_in_block`, and `array_dtype_in_block` (dicts keyed by dataset name).
- `alpha` and `num_negative_samples` apply to SGNS only and are ignored for `objective="sg"`.

## Quick start

```python
import logging
from pathlib import Path

from sawnergy.logging_util import configure_logging
from sawnergy.rin import RINBuilder
from sawnergy.walks import Walker
from sawnergy.embedding import Embedder

configure_logging("./logs", file_level=logging.WARNING, console_level=logging.INFO)

# 1. Build a Residue Interaction Network archive
rin_path = Path("./RIN_demo.zip")
rin_builder = RINBuilder()
rin_builder.build_rin(
    topology_file="system.prmtop",
    trajectory_file="trajectory.nc",
    molecule_of_interest=1,
    frame_range=(1, 100),
    frame_batch_size=10,
    prune_low_energies_frac=0.85,
    output_path=rin_path,
    include_attractive=True,
    include_repulsive=False,
)

# 2. Sample walks from the RIN
walker = Walker(rin_path, seed=123)
walks_path = Path("./WALKS_demo.zip")
walker.sample_walks(
    walk_length=16,
    walks_per_node=100,
    saw_frac=0.25,
    include_attractive=True,
    include_repulsive=False,
    time_aware=False,
    output_path=walks_path,
    in_parallel=False,
)
walker.close()

# 3. Train embeddings per frame (PyTorch backend)
import torch

embedder = Embedder(walks_path, seed=999)
embeddings_path = embedder.embed_all(
    RIN_type="attr",
    using="merged",
    num_epochs=10,
    negative_sampling=False,
    window_size=4,
    device="cuda" if torch.cuda.is_available() else "cpu",
    model_base="torch",
    output_path="./EMBEDDINGS_demo.zip",
)
print("Embeddings written to", embeddings_path)
```
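The `*_transitions` datasets described above are the row-wise L1-normalized counterparts of the energy matrices. A NumPy sketch of that normalization, with a guard for all-zero rows (e.g. fully pruned residues; how the real builder treats such rows is not shown here):

```python
import numpy as np

E = np.abs(np.array([
    [0.0, 2.0, 2.0],
    [1.0, 0.0, 3.0],
    [0.0, 0.0, 0.0],   # a fully pruned row
]))
row_sums = E.sum(axis=1, keepdims=True)
# Divide each row by its sum; leave all-zero rows at zero.
T = np.divide(E, row_sums, out=np.zeros_like(E), where=row_sums > 0)
print(T.sum(axis=1))   # rows sum to 1 (or 0 where nothing survived pruning)
```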
For the PureML backend, set `model_base="pureml"` and pass the optimizer / scheduler classes inside `model_kwargs`.
## Visualizing RINs

```python
from sawnergy.visual import Visualizer

v = Visualizer("./RIN_demo.zip")
v.build_frame(
    1,
    node_colors="rainbow",
    displayed_nodes="ALL",
    displayed_pairwise_attraction_for_nodes="DISPLAYED_NODES",
    displayed_pairwise_repulsion_for_nodes="DISPLAYED_NODES",
    show_node_labels=True,
    show=True,
)
```
Visualizer lazily loads datasets and works even in headless environments (falls back to the Agg backend).
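The headless fallback mentioned above follows a common Matplotlib pattern: select the non-interactive Agg backend before any figure is created when no display is available. A generic sketch of that pattern (not SAWNERGY's exact code):

```python
import os
import matplotlib

# In a headless POSIX session (no DISPLAY set), fall back to the
# non-interactive Agg backend so figures can still be rendered to files.
if os.name == "posix" and not os.environ.get("DISPLAY"):
    matplotlib.use("Agg")
```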
To visualize embeddings, use the `Visualizer` from the embedding subpackage:

```python
from sawnergy.embedding import Visualizer

viz = Visualizer("./EMBEDDINGS_demo.zip", normalize_rows=True)
viz.build_frame(1, show=True)
```
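`normalize_rows=True` presumably rescales each embedding vector to unit length so that frames with different magnitudes remain comparable; this reading of the flag is an assumption, but the operation itself is a standard row normalization:

```python
import numpy as np

E = np.array([[3.0, 4.0],
              [0.0, 0.0],   # guard against zero vectors
              [1.0, 0.0]])
norms = np.linalg.norm(E, axis=1, keepdims=True)
# Zero rows are left untouched instead of dividing by zero.
E_unit = E / np.where(norms == 0.0, 1.0, norms)
```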
## Tips

- With `time_aware=True`, provide `stickiness` and `on_no_options` when calling `Walker.sample_walks`.
- Call `Walker.close()` (or use a context manager) to release shared-memory segments.
- Choose `model_base="pureml"` or `"torch"` (defaults to `"pureml"`) and pass optimizer / scheduler overrides through `model_kwargs`.
- Use `ArrayStorage` directly to peek into archives, append arrays, or manage metadata.

## Project layout

```
├── sawnergy/
│   ├── rin/              # RINBuilder and cpptraj integration helpers
│   ├── walks/            # Walker class and shared-memory utilities
│   ├── embedding/        # Embedder + SG/SGNS backends (PureML / PyTorch)
│   ├── visual/           # Visualizer and palette utilities
│   │
│   ├── logging_util.py
│   └── sawnergy_util.py
│
└── README.md
```
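The tip about releasing shared memory fits Python's standard closing pattern: `contextlib.closing` guarantees `close()` runs even if the body raises, for any object that exposes a `close()` method. The demo below uses a stand-in class rather than a real `Walker` so it runs without an archive:

```python
from contextlib import closing

class FakeWalker:
    """Stand-in for sawnergy.walks.Walker: anything with close() works."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

w = FakeWalker()
with closing(w):
    pass  # sample walks here
print(w.closed)  # True: close() ran when the block exited
```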
SAWNERGY builds on the AmberTools cpptraj ecosystem, NumPy, Matplotlib, Zarr, and PyTorch (optional, for GPU acceleration; the PureML backend works out of the box).
Big thanks to the upstream communities whose work makes this toolkit possible.