
Concept Space Alignment in Multilingual LLMs

Qiwei Peng and Anders Søgaard


This week, we will be reading Concept Space Alignment in Multilingual LLMs in my ongoing effort to dive deeper into state-of-the-art multilingual LLM research… for my internship at A*STAR.

Before reading the paper, I was already intrigued by the abstract. When I first learned about embeddings in LLMs, it was mentioned briefly that the embeddings are language-agnostic, which means embeddings of the same word in different languages are the same. I thought it was cool then, and this paper aims to investigate that claim in depth.


Summary

At its heart, the authors want to know the characteristics of the embedding space. Imagine you have the word ‘teacher’ in English and its Indonesian counterpart ‘guru’. If a multilingual model truly shares the same semantic structure across languages, then the embedding for ‘teacher’ should lie near ‘guru’ in the embedding space, even without explicit supervision. Why does this matter? Because it would imply:

  1. Zero-shot retrieval.
    • We would be able to retrieve translations directly, without the need for parallel corpora at inference time. No more running a separate machine-translation system (see the small retrieval sketch after this list).
  2. Multilingual model insight.
    • We would know that multilingual pre-training is effectively enforcing a shared conceptual map across the different languages.
    • This would open doors to further optimization in pre-training multilingual LLMs.
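To make the zero-shot retrieval idea concrete, here is a tiny sketch of what translation retrieval by nearest neighbour would look like if the shared-space claim holds. The names (`retrieve_translation`, `emb_en`, `id_vocab`, `id_embs`) are my own placeholders, not anything from the paper.

```python
import numpy as np

def retrieve_translation(query_vec, target_vocab, target_embs, k=1):
    """Return the k target-language words whose embeddings are closest (by cosine) to query_vec."""
    target_norm = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    sims = target_norm @ query_norm            # cosine similarity to every target word
    return [target_vocab[i] for i in np.argsort(-sims)[:k]]

# Hypothetical usage: if the space is truly shared, the nearest Indonesian
# neighbour of the English ‘teacher’ embedding should be ‘guru’.
# retrieve_translation(emb_en["teacher"], id_vocab, id_embs, k=5)
```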

The problem is, full embedding spaces are messy. We need to quantify this messiness and gauge whether we can find a pattern that maps one embedding space onto another. For that, we turn to Procrustes analysis.

Procrustes Analysis

Without the linear algebra jargon and lingo, Procrustes analysis asks: given two sets of shapes, what linear transformation (translation, rotation and uniform scaling) can map one set onto the other? Specifically, here are the steps (which I took from this page):

  1. Pick $n$ translation pairs $\{(c_i^{(s)}, c_i^{(t)})\}_{i=1}^{n}$, where
    • $c_i^{(s)}$ is the source concept (e.g. English ‘teacher’)
    • $c_i^{(t)}$ is the target concept (e.g. Indonesian ‘guru’)
  2. Form data matrices $X$ and $Y$
    • Let
      $$X = \begin{pmatrix}x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top\end{pmatrix}\in \mathbb{R}^{n\times d},\quad Y = \begin{pmatrix}y_1^\top \\ y_2^\top \\ \vdots \\ y_n^\top\end{pmatrix}\in \mathbb{R}^{n\times d}$$

      where each row of $X$ matches the corresponding row of $Y$

  3. Find the mapping matrix $W \in \mathbb{R}^{d\times d}$, where $W^\top W = I$
    • This $W$ should solve

      $$W^* \;=\; \underset{W \in O_d}{\arg\min}\,\bigl\|W\,X^\top - Y^\top\bigr\|_F^2 \;=\; \underset{W \in O_d}{\arg\min}\,\bigl\|X\,W^\top - Y\bigr\|_F^2$$

      Don’t fret; this is very similar to the method of least squares in regression analysis, but with matrices and constrained to orthogonal transformations

    • Usually such a matrix would be found iteratively (just like in regression analysis), but here we have a closed-form way to find $W$ using the singular value decomposition (SVD)
  4. Reducing to an SVD problem
    • Let’s first define the $d \times d$ cross-covariance matrix
      $$M = Y^\top X \in \mathbb{R}^{d\times d}$$

      We want to compute the SVD of $M$:

      $$M = U\Sigma V^\top$$
  5. It is proven (see the Wikipedia link) that the optimal $W$ is therefore
    $$W^* = U V^\top$$

    You can imagine $V^\top$ rotating the source space into a common basis, and $U$ rotating that basis into the target space.

  6. So now, for every new source embedding $x$, we can get the predicted target embedding $x_{\text{aligned}}$ from
    $$x_{\text{aligned}} = W^*x$$

    The next step is comparing $x_{\text{aligned}}$ with the actual target embedding $y$: we retrieve the top-$k$ nearest neighbours of $x_{\text{aligned}}$ in the target space and check whether $y$ is among them. We then report precision@k as our evaluation metric. (A minimal NumPy sketch of the whole pipeline follows this list.)
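To make the procedure concrete, here is a minimal NumPy sketch of the whole pipeline: fit $W^*$ on training pairs via SVD, map held-out source embeddings, and score precision@k. This is my own sketch, not the authors' code; the random matrices are placeholders for whatever paired embeddings you actually extract.

```python
import numpy as np

def fit_procrustes(X, Y):
    """Solve W* = argmin_{W^T W = I} ||W X^T - Y^T||_F^2 in closed form via SVD."""
    M = Y.T @ X                      # d x d cross-covariance matrix
    U, _, Vt = np.linalg.svd(M)      # M = U Sigma V^T
    return U @ Vt                    # W* = U V^T

def precision_at_k(X_test, Y_test, W, k=1):
    """Fraction of source embeddings whose true translation is among the top-k neighbours."""
    aligned = X_test @ W.T                                     # each row is W* x
    aligned /= np.linalg.norm(aligned, axis=1, keepdims=True)
    targets = Y_test / np.linalg.norm(Y_test, axis=1, keepdims=True)
    sims = aligned @ targets.T                                 # cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]                   # indices of nearest targets
    hits = (top_k == np.arange(len(X_test))[:, None]).any(axis=1)
    return hits.mean()

# Placeholder data with the right shapes: n paired, d-dimensional embeddings.
rng = np.random.default_rng(0)
X_train, Y_train = rng.normal(size=(500, 64)), rng.normal(size=(500, 64))
X_test, Y_test = rng.normal(size=(100, 64)), rng.normal(size=(100, 64))

W = fit_procrustes(X_train, Y_train)
print("precision@5:", precision_at_k(X_test, Y_test, W, k=5))
```

Note that `fit_procrustes` is entirely closed-form: one matrix product and one SVD, no iterative optimization.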

Experimental Setup

  • Models: 10 LLMs, varying in architecture (decoder-only and encoder-decoder) and size (7B to 70B).
  • Languages: 7 languages in total: Indo-European languages (English, French, Romanian), non-Indo-European languages in Latin script (Basque and Finnish), and non-Indo-European languages in non-Latin scripts (Japanese and Thai).
  • Embeddings (a rough extraction sketch follows this list)
    • Vanilla: Last-token hidden state.
    • Prompt-based: Final hidden state from the instruction template (“Summarize concept X in one [Lang] word”).
  • Metrics: Precision@k, measured before and after Procrustes alignment.
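Out of curiosity about how the vanilla embeddings might be obtained, here is a rough sketch of extracting a last-token hidden state with Hugging Face transformers. The model name, the prompt wording, and the choice of the last layer are my own illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model name is an assumption for illustration; the paper evaluates several LLMs.
MODEL_NAME = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Return the last layer's hidden state at the final token position."""
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1][0, -1, :]   # shape: (hidden_dim,)

# Vanilla embedding: the word on its own.
vec_vanilla = embed("teacher")

# Prompt-based embedding: the word wrapped in an instruction template,
# paraphrasing the paper's “Summarize concept X in one [Lang] word” prompt.
vec_prompt = embed("Summarize the concept 'teacher' in one Indonesian word:")
```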

Key Findings

  • Let’s first demystify Figure 2 in the paper.
    • The blue bar refers to the prompt-based embedding, while the orange bar refers to the vanilla embedding.
    • The bars define the upper-bound performance: you are using a $W^*$ calculated on the same test set you evaluate on, so the results are obviously the best case.
    • The black dashed lines refer to the zero-shot version of the above: you use a $W^*$ calculated on a completely different set of pairs.
    • The red dashed line is the precision@1 of the raw embeddings, without any Procrustes alignment.
  1. Even before alignment, especially for vanilla embeddings, large models achieve high precision@1.
    • The ceiling is highest for vanilla word embeddings in LLaMA2-13B, which suggests near-isomorphism between monolingual concept spaces.
  2. Prompt-based embeddings are less linear, however.
    • This suggests that the preceding prompt tokens may corrupt the partial isomorphism seen with vanilla embeddings.
    • From the gap between the dashed lines, prompt-based embeddings also show worse zero-shot degradation.
  3. Performance is best for Indo-European languages and lowest for non-Indo-European languages with non-Latin scripts.
  4. Abstract vs physical concepts.
    • Abstract terms (e.g. “justice”) achieve higher precision@k than concrete objects (e.g. “car”), likely because abstract ideas appear more frequently across different contexts and domains.

Final Words

The paper evaluates concept alignment in multilingual LLMs using the well-known, tried-and-tested method of Procrustes analysis. The results show that multilingual LLMs have high-quality, linear concept alignment across different languages. However, the ability to generalize depends on the language, the type of embedding, and even the type of word.


My Thoughts

I am impressed by the methods used throughout the paper. I initially thought they would just use vanilla cosine similarity between two embeddings to measure distance, but their use of Procrustes analysis and the precision@k metric is definitely more robust.

I do feel like there is a lot of future work that can be done, for example investigating the different types of words in greater depth, instead of just a binary split between abstract and physical concepts. Is it true that the better alignment for abstract concepts can be attributed solely to frequency?

Another interesting direction might be to update these findings with more recent LLMs, or even to move from words to audio. Would the linearity hold in different modalities? Why are non-Indo-European languages less linear? Could it be due to a lack of resources?

Food for thought. Otherwise, a good read!