The Future of AI Safety

Published on February 9, 2026

As AI systems become more capable, ensuring they align with human values is paramount. In this post, we'll explore the current challenges in alignment research, from interpretability to robust reward modeling.

Key topics include: Inner Alignment, Outer Alignment, and Mechanistic Interpretability.