Steering

Edit the RFD3/RF3 residual stream during inference to push or suppress structural features.

Modes

| Mode | Source | Operation |
| --- | --- | --- |
| sae_feature | α · W_dec[feature_id] | Add/ablate a single SAE dictionary feature |
| raw_diff | coeff · (mean_pos − mean_neg) | Add/ablate a pre-computed mean-diff vector |

Both modes reduce to the same operation on a fixed direction: steering adds the scaled direction to the hook output, and ablation projects that direction out of it.
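The two operations can be sketched in NumPy as follows. This is a minimal illustration, not the saffron implementation: the function names `steer` and `ablate`, the `[n_features, d_model]` layout of `W_dec`, and the choice to normalize the direction before projecting are all assumptions made for the example.

```python
import numpy as np

def steer(h: np.ndarray, direction: np.ndarray, scale: float) -> np.ndarray:
    """Add scale * direction to every activation vector in h (shape [..., d_model])."""
    return h + scale * direction

def ablate(h: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of h along direction (orthogonal projection)."""
    unit = direction / np.linalg.norm(direction)
    coeff = h @ unit                    # projection coefficient per activation vector
    return h - coeff[..., None] * unit

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))             # toy hook output: 4 tokens, d_model = 8
W_dec = rng.normal(size=(16, 8))        # hypothetical SAE decoder, [n_features, d_model]
d = W_dec[3]                            # sae_feature mode: one decoder row as the direction

steered = steer(h, d, 5.0)              # analogous to alpha = 5.0
ablated = ablate(h, d)                  # no component left along d
```

A `raw_diff` vector would take the place of `W_dec[3]` here, with `coeff` in the role of `alpha`.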


Steering config schema

```yaml
# @package steering
block12:
  - mode: sae_feature      # or raw_diff
    sae_path: outputs/sae/.../final.pt
    feature_id: 639
    alpha: 5.0
    type: steer            # steer | ablate
    apply_at_steps: all    # all | [10, 20] | {start: 50}
```

For raw_diff:

```yaml
block12:
  - mode: raw_diff
    vector_path: outputs/steering/vectors/haz_minus_ben/block12.pt
    coeff: 3.0
    apply_at_steps: all
```
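The three forms accepted by `apply_at_steps` (`all`, an explicit list, or a `{start: N}` mapping) could be interpreted along these lines. The helper `applies_at` is hypothetical, and the semantics shown (the list is an exact set of step indices, `start` is inclusive) are assumptions rather than documented behavior.

```python
from collections.abc import Mapping, Sequence
from typing import Union

StepSpec = Union[str, Sequence[int], Mapping[str, int]]

def applies_at(spec: StepSpec, step: int) -> bool:
    """Return True if steering should be applied at this diffusion step."""
    if isinstance(spec, str):
        if spec == "all":
            return True
        raise ValueError(f"unknown apply_at_steps value: {spec!r}")
    if isinstance(spec, Mapping):
        # Assumed meaning: apply from step `start` onward, inclusive.
        return step >= spec["start"]
    # Assumed meaning: apply only at the listed steps.
    return step in spec

active_steps = [s for s in range(60) if applies_at({"start": 50}, s)]
```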

Recipes

```bash
# Single-feature SAE steer along the top hazard direction
# (block12 feature #639, AUROC 0.815, PLA2-correlated)
saffron steer model=rfd3 \
  hooks=rfd3_partial \
  steering=sae_block12_f639 \
  inputs=tutorials/steering/configs/steer_block12_f639.json \
  out_dir=outputs/steering/runs/benign_inputs_f639_alpha5
```

Null-steer sanity check

α=0 must reproduce saffron collect bit-for-bit. Run this before trusting any steering result.

```bash
saffron steer model=rfd3 \
  hooks=rfd3_partial \
  steering=null_block12_f639 \
  inputs=tutorials/steering/configs/null_block12_f639.json \
  out_dir=outputs/steering/runs/null_check
```
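Once both runs have finished, a bit-for-bit comparison is stricter than a tolerance check such as `np.allclose`. The sketch below assumes the two outputs have already been loaded as NumPy arrays; how saffron stores them is not covered here, and `bit_identical` is a hypothetical helper.

```python
import numpy as np

def bit_identical(a: np.ndarray, b: np.ndarray) -> bool:
    """True only if dtype, shape, and every underlying byte match."""
    return a.dtype == b.dtype and a.shape == b.shape and a.tobytes() == b.tobytes()

# Toy stand-ins for the `collect` baseline and the alpha = 0 steer run.
baseline = np.arange(6, dtype=np.float32).reshape(2, 3)
null_run = baseline.copy()

# A run that differs by the smallest representable float32 step in one entry.
perturbed = baseline.copy()
perturbed[0, 0] = np.nextafter(np.float32(0), np.float32(1))

same = bit_identical(baseline, null_run)
drifted = bit_identical(baseline, perturbed)
close_anyway = np.allclose(baseline, perturbed)
```

The perturbed array passes `np.allclose` but fails the byte comparison, which is why a tolerance check is not a substitute for this sanity test.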

Compute a new direction vector

```bash
saffron compute_steering_vector \
  inputs=tutorials/steering/configs/diff_haz_minus_ben.yaml \
  out_dir=outputs/steering/vectors/haz_minus_ben
```

```yaml
# diff_haz_minus_ben.yaml
activations_h5: outputs/collect/train/activations/activations.h5
labels_csv: data_pipelines/safeprotein/labels.csv
positive_label: 1
negative_label: 0
hooks: [block6, block8, block12]
normalize: unit
out_dir: outputs/steering/vectors/haz_minus_ben
```

Outputs: block6.pt, block8.pt, block12.pt, meta.json.
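For one hook, the computation behind each output vector can be sketched as a difference of class means followed by unit normalization. This is an illustration only: it assumes activations have already been pooled to one vector per example, and the function `mean_diff_vector` is hypothetical, not part of saffron.

```python
import numpy as np

def mean_diff_vector(acts: np.ndarray, labels: np.ndarray,
                     positive_label: int = 1, negative_label: int = 0,
                     normalize: str = "unit") -> np.ndarray:
    """Return mean_pos - mean_neg over rows of acts, optionally scaled to unit norm."""
    pos = acts[labels == positive_label]
    neg = acts[labels == negative_label]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one example of each label")
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    if normalize == "unit":
        norm = np.linalg.norm(diff)
        if norm == 0:
            raise ValueError("class means are identical; direction is undefined")
        diff = diff / norm
    return diff

# Toy data: two positive and two negative examples in a 2-dimensional space.
acts = np.array([[2.0, 0.0], [4.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
labels = np.array([1, 1, 0, 0])
vec = mean_diff_vector(acts, labels)
```

With `normalize: unit`, the `coeff` in a `raw_diff` steering config sets the full magnitude of the shift, since the stored direction has length 1.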


Adding a custom steering config

Create sae/src/sae/configs/steering/my_steer.yaml:

```yaml
# @package steering
block12:
  - mode: sae_feature
    sae_path: outputs/sae/.../final.pt
    feature_id: 42
    alpha: 3.0
    apply_at_steps: all
```

Then: saffron steer model=rfd3 hooks=rfd3_partial steering=my_steer inputs=... out_dir=...
