Exploring Feature Universality in Large Language Models Using Sparse Autoencoders

This summary explores the concept of feature universality in large language models (LLMs) using sparse autoencoders (SAEs), as presented in "Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models" (Lan et al., 2024). The research aims to determine if different LLMs develop similar internal representations of concepts within their intermediate layers.

Key Points

  • The study utilizes SAEs to disentangle complex LLM activations into more interpretable feature spaces, addressing the challenge of polysemanticity in individual neurons. This "dictionary learning" approach allows for easier comparison of features across different models.
  • Researchers employed representational-space similarity metrics, specifically Singular Value Canonical Correlation Analysis (SVCCA) and Representational Similarity Analysis (RSA), to compare SAE feature spaces across different LLMs. They also developed a method that pairs features across models by the correlation of their activations, sidestepping the permutation and rotation ambiguities inherent in comparing high-dimensional spaces.
  • Experiments comparing Pythia and Gemma model variants revealed statistically significant similarities in SAE feature spaces, particularly within middle layers. Further analysis showed that semantically related feature subspaces (e.g., related to emotions or time) exhibited even stronger similarity across models.
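The two comparison steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names, the max-correlation matching rule, and the 99% variance-retention threshold are assumptions chosen for clarity.

```python
import numpy as np

def pair_features_by_correlation(acts_a, acts_b):
    """Pair each SAE feature in model A with its most correlated feature in model B.

    acts_a: (n_tokens, n_features_a), acts_b: (n_tokens, n_features_b),
    both recorded over the same token stream.
    Returns, for each A-feature, the index of its best match in B.
    """
    # Standardize columns so a dot product yields a Pearson correlation.
    za = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    zb = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = za.T @ zb / acts_a.shape[0]   # (n_a, n_b) correlation matrix
    return corr.argmax(axis=1)           # greedy max-correlation matching

def svcca_similarity(X, Y, var_kept=0.99):
    """Mean canonical correlation between two activation matrices (SVCCA)."""
    def svd_reduce(M, keep):
        # Center, then keep the top singular directions covering `keep` variance.
        M = M - M.mean(0)
        U, S, _ = np.linalg.svd(M, full_matrices=False)
        k = np.searchsorted(np.cumsum(S**2) / np.sum(S**2), keep) + 1
        return U[:, :k]                  # orthonormal, whitened directions
    Ux, Uy = svd_reduce(X, var_kept), svd_reduce(Y, var_kept)
    # With orthonormal inputs, canonical correlations are the
    # singular values of Ux^T Uy.
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(rho.mean())
```

Because SVCCA is invariant to rotations of the feature basis, two activation matrices that differ only by a permutation or rotation of features score near 1, which is exactly why it suits cross-model comparison after pairing.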

Conclusion

The research provides strong evidence for feature universality across different LLMs by demonstrating significant similarities in their SAE-derived feature spaces. This suggests that diverse LLMs learn similar internal representations of concepts, particularly within their middle layers. These findings have implications for LLM interpretability, transfer learning, and AI safety research.
