Steering
Edit the RFD3/RF3 residual stream during inference to push or suppress structural features.
Modes
| Mode | Source | Operation |
|---|---|---|
| sae_feature | α · W_dec[feature_id] | Add/ablate a single SAE dictionary feature |
| raw_diff | coeff · (mean_pos − mean_neg) | Add/ablate a pre-computed mean-diff vector |
Both reduce to the same operation: add a fixed direction to the hook output. Ablation projects the direction out.
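The shared operation can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the project's hook code: the function names `steer` and `ablate`, the toy shapes, and the choice of an orthogonal projection for ablation are all hypothetical.

```python
import numpy as np

def steer(hidden, direction, scale):
    """Add scale * direction to every position of the hook output."""
    return hidden + scale * direction

def ablate(hidden, direction):
    """Remove the component of each position that lies along direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ unit, unit)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))   # toy hook output, shape (tokens, d_model)
w_dec = rng.normal(size=(16, 8))   # hypothetical SAE decoder, (n_features, d_model)

# sae_feature: the direction is one decoder row, scaled by alpha.
steered = steer(hidden, w_dec[3], scale=5.0)

# raw_diff: the direction is mean_pos - mean_neg, scaled by coeff.
mean_pos, mean_neg = rng.normal(size=8), rng.normal(size=8)
steered_diff = steer(hidden, mean_pos - mean_neg, scale=3.0)

# Ablation: after projecting out, nothing is left along the direction.
ablated = ablate(hidden, w_dec[3])
```

In this sketch the only difference between the two modes is where the direction vector comes from; a scale of zero leaves the hook output unchanged.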
Steering config schema
# @package steering
block12:
  - mode: sae_feature        # or raw_diff
    sae_path: outputs/sae/.../final.pt
    feature_id: 639
    alpha: 5.0
    type: steer              # steer | ablate
    apply_at_steps: all      # all | [10, 20] | {start: 50}

For raw_diff:
block12:
  - mode: raw_diff
    vector_path: outputs/steering/vectors/haz_minus_ben/block12.pt
    coeff: 3.0
    apply_at_steps: all

Recipes
Recipe A — SAE feature
# Single-feature SAE steer along the top hazard direction
# (block12 feature #639, AUROC 0.815, PLA2-correlated)
saffron steer model=rfd3 \
hooks=rfd3_partial \
steering=sae_block12_f639 \
inputs=tutorials/steering/configs/steer_block12_f639.json \
out_dir=outputs/steering/runs/benign_inputs_f639_alpha5

Null-steer sanity check
With α=0, saffron steer must reproduce the output of saffron collect bit-for-bit. Run this check before trusting any steering result.
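Once the null run has finished, one way to test the bit-for-bit claim is to compare the two output directories byte by byte. The sketch below is an assumption about how that could be done, not a tool shipped with saffron: `diff_runs` and `file_digest` are hypothetical helpers, and the layout of the output directories is not specified here. Files that embed timestamps or logs will differ even when the tensors match, so those may need an array-level comparison instead.

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def diff_runs(reference_dir, candidate_dir):
    """Return relative paths that differ in bytes or exist in only one run."""
    ref, cand = Path(reference_dir), Path(candidate_dir)
    ref_files = {p.relative_to(ref) for p in ref.rglob("*") if p.is_file()}
    cand_files = {p.relative_to(cand) for p in cand.rglob("*") if p.is_file()}
    mismatched = sorted(ref_files ^ cand_files)  # present on one side only
    for rel in sorted(ref_files & cand_files):
        if file_digest(ref / rel) != file_digest(cand / rel):
            mismatched.append(rel)
    return mismatched
```

An empty list from `diff_runs(collect_dir, null_check_dir)` means every file matched; any returned path is a file to inspect before interpreting steered runs.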
saffron steer model=rfd3 \
hooks=rfd3_partial \
steering=null_block12_f639 \
inputs=tutorials/steering/configs/null_block12_f639.json \
out_dir=outputs/steering/runs/null_check

Compute a new direction vector
saffron compute_steering_vector \
inputs=tutorials/steering/configs/diff_haz_minus_ben.yaml \
out_dir=outputs/steering/vectors/haz_minus_ben

# diff_haz_minus_ben.yaml
activations_h5: outputs/collect/train/activations/activations.h5
labels_csv: data_pipelines/safeprotein/labels.csv
positive_label: 1
negative_label: 0
hooks: [block6, block8, block12]
normalize: unit
out_dir: outputs/steering/vectors/haz_minus_ben

Outputs: block6.pt, block8.pt, block12.pt, meta.json.
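For intuition, the per-hook computation can be sketched as a mean difference followed by normalization. This is a minimal illustration, assuming activations have already been pooled to one row per sample and that `normalize: unit` means scaling to unit L2 norm; the function `mean_diff_vector` is hypothetical and does not reflect how the command reads the HDF5 file or labels CSV.

```python
import numpy as np

def mean_diff_vector(activations, labels, positive_label=1,
                     negative_label=0, normalize="unit"):
    """Return mean(positive rows) - mean(negative rows), optionally unit-norm."""
    activations = np.asarray(activations, dtype=np.float64)
    labels = np.asarray(labels)
    mean_pos = activations[labels == positive_label].mean(axis=0)
    mean_neg = activations[labels == negative_label].mean(axis=0)
    diff = mean_pos - mean_neg
    if normalize == "unit":
        norm = np.linalg.norm(diff)
        if norm > 0:  # leave an all-zero difference untouched
            diff = diff / norm
    return diff

# Toy data: two positive and two negative samples with d_model = 2.
acts = [[1.0, 0.0], [3.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
vec = mean_diff_vector(acts, labels=[1, 1, 0, 0])
```

With a unit-norm vector, the coeff in a raw_diff steering config directly sets the size of the shift added at the hook.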
Adding a custom steering config
Create sae/src/sae/configs/steering/my_steer.yaml:
# @package steering
block12:
  - mode: sae_feature
    sae_path: outputs/sae/.../final.pt
    feature_id: 42
    alpha: 3.0
    apply_at_steps: all

Then: saffron steer model=rfd3 hooks=rfd3_partial steering=my_steer inputs=... out_dir=...