AI Safety Wiki - Index

A personal knowledge base for AI safety research.

Overview

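This wiki tracks entities, concepts, and sources related to AI safety and alignment. It is maintained incrementally as new sources are ingested.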

Still to create: DeepMind, MIRI, Redwood Research, etc.

Page	Summary	Updated
thomas-jiralerspong	Anthropic researcher, model diffing research	2026-04-08
trenton-bricken	Anthropic Alignment Science, interpretability	2026-04-08

Still to create: Stuart Russell, Nick Bostrom, Paul Christiano, etc.

No entries yet. Pages to create: CAIS, GovAI, etc.

Still to create: outer alignment, deceptive alignment, specification gaming, etc.

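Page	Summary	Updated
mechanistic-interpretability	Understanding models by analyzing their internal mechanisms	2026-04-08
steering	Influencing behavior by modifying internal activations	2026-04-08
model-diffing	Comparing models to find behavioral differences	2026-04-08
crosscoders	Tool for comparing across different model architectures	2026-04-08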

Still to create: RLHF, constitutional AI, interpretability, scalable oversight, etc.

Page	Summary	Updated
functional-emotions	Emotion-like representations that influence model behavior	2026-04-07
persona-selection	How models adopt and maintain a consistent persona	2026-04-07

Still to create: emotion vectors, world models, situational awareness, etc.

Page	Title	Authors	Date	Updated
diff-tool-ai-behavioral-differences	A "diff" tool for AI: finding behavioral differences in new models	Jiralerspong, Bricken	2026-02	2026-04-08
emotion-concepts-function	Emotion concepts and their function in large language models	Anthropic Interpretability	2026-03-31	2026-04-07

See wiki/raw/articles/ for source documents.

No entries yet. Create synthesis pages for literature reviews, comparisons, and similar work.

See log for recent activity.