Hypothesis: Solving mechanistic interpretability could allow us to greatly increase human intelligence.
Reasoning:
- A model that solves a problem better than any human must implement algorithms for that problem that are more capable than our own.
- With mechanistic interpretability, we could discover these more competent algorithms within deep learning models.
- Once we have extracted these algorithms, we could learn them ourselves (updating our software) or modify the brain to implement them (updating our hardware).
- Thus, interpretability could enable us to advance our own intelligence in step with the most capable models, not by relying on superintelligent systems but by learning from them and modifying ourselves.
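The extract-and-learn loop described above can be caricatured in a toy sketch (entirely hypothetical: a least-squares linear fit stands in for a deep model, and reading its weights stands in for mechanistic interpretability):

```python
# Toy sketch: "train" a tiny model on data generated by a hidden rule,
# then recover that rule from the model's parameters and reimplement it
# by hand, mirroring the learn-then-extract step in the bullets above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0          # hidden ground-truth rule: y = 3x + 2

# Fit a linear "model" (weight + bias) by least squares.
A = np.hstack([X, np.ones((200, 1))])
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

# "Interpret" the model: read its learned algorithm out of the weights
# and reimplement it as an explicit rule we now understand and own.
def extracted_rule(x):
    return w * x + b

print(round(w, 3), round(b, 3))  # recovers the coefficients of 3x + 2
```

Real mechanistic interpretability operates on vastly larger and less transparent models, but the shape of the claim is the same: competence learned by a model can, in principle, be read out and transferred back to us.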
Conclusion: This may be the best path to robust, lasting safety: not only aligning models to us, but coevolving with them.