LS-Merge: Merging Language Models in Latent Space

1KAIST 2DeepAuto.ai 3University of Oxford 4Slater Labs
*Equal contribution · Corresponding author
ICLR 2026
LS-Merge pipeline

LS-Merge encodes pretrained LLM weights into a smooth latent space with a transformer-based VAE, aligns heterogeneous representations via projection and Optimal Transport, and decodes merged latent codes back into parameters. This enables robust merging across model sizes, architectures, and families beyond the scope of standard weight-space methods.

Abstract

Model merging in weight space is an efficient way to reuse pretrained models, but existing methods typically assume matching architectures or sizes, making heterogeneous merges brittle or infeasible. We tackle these issues with a transformer-based variational autoencoder (VAE) trained in a two-stage compression curriculum with structured layer-aware chunking. To align heterogeneous models, we introduce a dimensionality-matching projection that allows interpolation between models of different sizes. Empirically, latent-space interpolation is consistently more robust than direct weight-space averaging and yields stronger downstream performance when merging models of different sizes.

The Problem: Weight-Space Merging Hits a Wall

Figure: Model A (d = 4096, 26 layers) vs. Model B (d = 2048, 36 layers): the layer stacks differ in both width and depth, so their weights cannot be averaged element-wise.

Width mismatch: W_A ∈ ℝ^{n×4096} vs. W_B ∈ ℝ^{n×2048} — element-wise averaging is undefined.

Depth mismatch: 26 layers vs. 36 layers — no natural 1-to-1 correspondence.

Family mismatch: Gemma and LLaMA have fundamentally different weight layouts.

Standard weight-space methods (Model Soup, SLERP, Task Arithmetic, TIES, DARE) all require identical architectures.

Key Idea: Merge in Latent Space

Instead of averaging weights directly, LS-Merge first maps pretrained model parameters into a shared latent space learned by a transformer-based VAE. Merging is then performed as interpolation between aligned latent representations, and the merged latent code is decoded back into a valid set of model weights.

This design is architecture-agnostic — models of different widths and depths map to the same latent dimensionality. It enables self-merging, where multiple latent samples from a single model's posterior are combined to improve robustness without requiring a second model. And because the latent manifold is smooth, interpolation avoids the weight-space barriers that make direct averaging brittle.

Figure: weights W₁ and W₂ are encoded (E₁, E₂) into a shared latent space as z₁ and z₂, aligned and merged into zα, then decoded (D) into the merged model.

Methodology

LS-Merge maps pretrained model weights into a shared latent space, merges them there, and decodes the merged representation back into a usable set of model parameters. The pipeline consists of preprocessing, latent encoding, alignment and interpolation, and decoding.

Figure: (1) chunking w → X ∈ ℝ^{n×c}; (2) encoding to shared latent codes z_a, z_b; (3) alignment and merging via OT, interpolation, or self-merge; (4) decoding to the merged model W̃. Incompatible weights are never averaged directly — they are mapped to a shared manifold, aligned, and decoded back.

Step 1: Preprocessing & Chunking

Each selected parameter tensor is flattened into a vector, zero-padded to a multiple of the chunk size c, and partitioned into n fixed-size chunks. This gives models with different tensor shapes a unified sequence format.

parameter tensor → flatten → w ∈ ℝ^L → zero-pad → X ∈ ℝ^{n×c}
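The flatten–pad–chunk step above can be sketched in NumPy; the chunk size and helper name are illustrative, not the paper's implementation:

```python
import numpy as np

def chunk_weights(tensor, c):
    """Flatten a parameter tensor, zero-pad to a multiple of the
    chunk size c, and split into n fixed-size chunks (n x c).
    Returns the chunk matrix X and the original length L, which is
    needed later to strip the padding."""
    w = np.asarray(tensor, dtype=np.float32).ravel()  # w in R^L
    L = w.size
    n = -(-L // c)                                    # ceil(L / c)
    padded = np.zeros(n * c, dtype=np.float32)
    padded[:L] = w
    return padded.reshape(n, c), L

# e.g. a 3x5 weight matrix with chunk size 4 -> 4 chunks of length 4
X, L = chunk_weights(np.ones((3, 5)), c=4)
```

Because every tensor becomes an (n × c) sequence, models with entirely different tensor shapes end up in the same input format for the encoder.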

Step 2: Latent Encoding

Each chunk is linearly embedded to dimension d, then a 6-layer Transformer encoder maps the sequence into a compact latent code.

X ∈ ℝ^{n×c} → X_emb ∈ ℝ^{n×d} → z = E_θ(X_emb) ∈ ℝ^{d_z}

Training uses a two-stage β-VAE curriculum: first as a deterministic autoencoder, then with KL regularization enabled to avoid posterior collapse.
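Such a curriculum can be expressed as a KL-weight schedule; a minimal sketch, where the step counts and β_max are placeholder values, not the paper's hyperparameters:

```python
def beta_schedule(step, stage1_steps=10_000, warmup_steps=5_000, beta_max=1e-4):
    """Two-stage curriculum: stage 1 trains a deterministic autoencoder
    (beta = 0, reconstruction loss only); stage 2 linearly ramps the KL
    weight up to beta_max so the posterior stays informative instead of
    collapsing. All constants here are illustrative."""
    if step < stage1_steps:            # stage 1: plain autoencoder
        return 0.0
    ramp = min(1.0, (step - stage1_steps) / warmup_steps)
    return beta_max * ramp             # stage 2: KL annealing
```

The total loss at each step would then be reconstruction + beta_schedule(step) · KL.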

Step 3: Alignment & Latent Merging

Same-architecture models are merged by direct latent interpolation. For heterogeneous models, LS-Merge first applies dimensionality matching and then aligns latent distributions with Optimal Transport before interpolation.

zλ = (1−λ)za + λzb

In self-merging, multiple posterior samples from a single model are merged in latent space.

Step 4: Decoding & Reconstruction

The decoder maps the merged latent code back to a chunk sequence, which is concatenated, un-padded, and reshaped into the original parameter layout.

z_λ → X̃ = D_φ(z_λ) ∈ ℝ^{n×c} → un-pad + reshape → W̃_λ
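Assuming the chunking of Step 1 recorded the original length L and tensor shape, decoding ends with an un-pad and reshape that exactly inverts it. A NumPy sketch with a stand-in for the decoder output:

```python
import numpy as np

def unchunk_weights(X, L, shape):
    """Invert the chunking step: concatenate the decoded chunks,
    drop the zero padding after position L, and restore the
    original parameter tensor shape."""
    return X.reshape(-1)[:L].reshape(shape)

# stand-in for a decoded chunk sequence of a 3x5 tensor, chunk size 4
w = np.arange(15, dtype=np.float32).reshape(3, 5)
padded = np.zeros(16, dtype=np.float32)
padded[:15] = w.ravel()
X = padded.reshape(4, 4)              # "decoder output"
w_rec = unchunk_weights(X, L=15, shape=(3, 5))
```

Applying this per tensor reassembles a complete, loadable state dict for the merged model.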

Evidence for a Latent Weight Manifold

LLM weights exhibit heavy-tailed distributions (high kurtosis) and low intrinsic dimensionality (sharp PCA spectral decay). This means they concentrate near a low-dimensional non-linear manifold — which a VAE can learn. Linear methods like PCA completely fail because the manifold is curved.
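Both diagnostics are straightforward to compute; a NumPy sketch (the estimator choices are ours, not the paper's):

```python
import numpy as np

def excess_kurtosis(w):
    """Fourth standardized moment minus 3; values > 0 indicate
    heavier tails than a Gaussian."""
    w = w - w.mean()
    return (w**4).mean() / (w**2).mean() ** 2 - 3.0

def pca_evr(W, k):
    """Fraction of variance captured by the top-k principal
    components of a (samples x features) weight matrix; a value
    near 1 for small k signals low intrinsic dimensionality."""
    Wc = W - W.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Wc, compute_uv=False)
    var = s**2
    return var[:k].sum() / var.sum()
```

A sharp decay of `pca_evr` with k says the weights sit near a low-dimensional set, but it says nothing about curvature, which is why a non-linear VAE rather than PCA is needed to model it.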


PCA explained-variance ratios for self-attention k_proj matrices in Llama-3.2-3B-it and Gemma-3 (1B, 4B). In all cases, a few leading components capture most of the variance, indicating a shared low-rank manifold that a VAE can exploit for compression.

Latent Alignment for Heterogeneous Models

When two models come from the same architecture family, their latent representations often lie on overlapping regions of the manifold, so direct interpolation is already meaningful. For heterogeneous models, however, the latent distributions can be disjoint. LS-Merge therefore aligns them before interpolation using dimensionality-matching projection and Optimal Transport.
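One generic way to realize the OT step is entropic regularization (Sinkhorn iterations) followed by a barycentric projection of source latents onto the target cloud. This sketch assumes equal-dimensional latents (i.e., after dimensionality matching); the regularization strength and iteration count are illustrative, and this is a stand-in for, not necessarily the paper's exact, alignment procedure:

```python
import numpy as np

def sinkhorn_align(Zs, Zt, eps=0.05, iters=200):
    """Align source latents Zs to target latents Zt: solve entropic OT
    with Sinkhorn iterations, then map each source point to the
    barycenter of its transport-plan row over the target points."""
    ns, nt = len(Zs), len(Zt)
    C = ((Zs[:, None, :] - Zt[None, :, :]) ** 2).sum(-1)  # squared distances
    K = np.exp(-C / (eps * C.mean()))     # eps scaled by mean cost for stability
    a = np.full(ns, 1.0 / ns)             # uniform source marginal
    b = np.full(nt, 1.0 / nt)             # uniform target marginal
    v = np.ones(nt)
    for _ in range(iters):                # Sinkhorn scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]       # transport plan
    return (P / P.sum(axis=1, keepdims=True)) @ Zt  # barycentric projection
```

After this projection, the aligned source latents live in the target's region of the manifold, so the plain interpolation formula from Step 3 applies.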

Figure: homogeneous models (two fine-tunes of the same base) occupy overlapping latent regions, so linear interpolation works directly; heterogeneous models (e.g. Gemma-1B vs. LLaMA-1B) occupy disjoint regions and require OT alignment first.
t-SNE visualizations: same-family fine-tunes overlap in latent space (homogeneous); cross-architecture models form disjoint clusters (heterogeneous); after OT alignment, both are embedded in a shared manifold.

Experimental Results

We evaluate LS-Merge in three regimes: self-merging a single model, merging multiple LoRA experts on a shared base, and transferring across architectures and families. In all cases, latent-space merging improves or matches carefully designed baselines.

Self-Merging: Enhancing a Single Model

| Model | MMLU | MMLU-pro | HellaSwag | GSM8k |
|---|---|---|---|---|
| Gemma-3-4B-it | 53.10 | 20.90 | 47.40 | 29.90 |
| + VAE recon. | 54.10 | 20.80 | 49.03 | 31.27 |
| + LS-Merge | 54.20 | 21.02 | 50.10 | 32.20 |
| Gemma-3-1B-it | 32.20 | 7.10 | 28.70 | 16.90 |
| + VAE recon. | 32.60 | 7.60 | 28.57 | 16.77 |
| + LS-Merge | 35.13 | 10.30 | 31.16 | 17.50 |

Self-merging improves a single Gemma checkpoint by ≈4% on average over both the base model and its VAE reconstruction, especially for the smaller 1B model.

Expert Merging: LoRA Experts on Gemma-7B-it

| Method | MMLU | MMLU-pro | HellaSwag | GSM8k | TruthfulQA | NLGraph | K-Cross | AbstainQA |
|---|---|---|---|---|---|---|---|---|
| Best Expert | 45.7 | 14.3 | 46.6 | 26.1 | 32.4 | 51.7 | 32.7 | -10.8 |
| Uniform Soup | 49.7 | 19.4 | 54.0 | 7.9 | 31.2 | 47.5 | 29.6 | -0.1 |
| SLERP | 52.5 | 18.8 | 50.4 | 25.5 | 28.7 | 49.8 | 30.0 | -0.2 |
| Greedy Soup | 50.8 | 22.1 | 54.6 | 23.9 | 31.9 | 52.9 | 28.0 | 3.3 |
| DARE-TIES | 49.1 | 18.8 | 53.7 | 7.3 | 28.2 | 52.8 | 29.0 | 1.4 |
| LS-Merge (lerp) | 54.7 | 21.6 | 58.1 | 28.1 | 33.0 | 53.1 | 35.6 | 2.0 |
| LS-Merge (soup) | 56.0 | 22.2 | 60.1 | 24.2 | 32.0 | 56.1 | 35.2 | 4.0 |

LS-Merge consistently outperforms all weight-space baselines across 8 benchmarks.

Cross-Architecture Merging


Gemma-4B → Gemma-1B: latent-space interpolation transfers strength from a larger source model into a smaller target without catastrophic degradation.


Performance peaks for modest interpolation coefficients (λ between 0.05 and 0.20), confirming that small injections of source structure are most beneficial.

Cross-Family: LLaMA-1B → Gemma-1B

| Strategy | WinoGrande | ARC-C | HellaSwag |
|---|---|---|---|
| Base | 56.83 | 42.78 | 49.07 |
| OT only | 51.13 | 34.25 | 48.50 |
| OT + interp. | 57.75 | 43.34 | 50.10 |

After aligning LLaMA-1B to Gemma-1B, a small interpolation weight (λ ≈ 0.1) already improves the Gemma-1B target across all three benchmarks.

PCA vs. VAE (Gemma-1B)

| Method | Ratio | MMLU | ARC-C |
|---|---|---|---|
| Base | 1.0× | 41.4 | 42.4 |
| PCA | 1.6× | 25.5 | 27.7 |
| VAE | 1.6× | 39.9 | 41.6 |
| PCA | 4.0× | 24.1 | 25.9 |
| VAE | 4.0× | 39.8 | 42.8 |

PCA collapses at all ratios. Non-linear encoding is a geometric necessity.

Which Layers to Merge?

| Layers | WinoGrande | ARC-C | MMLU |
|---|---|---|---|
| MLP only | 56.84 | 43.89 | 41.02 |
| Attn only | 56.67 | 40.23 | 39.80 |
| Both | 57.75 | 43.34 | 42.10 |

Merging both attention and MLP blocks yields the strongest overall performance, especially on MMLU, compared to merging either component alone.

vs. Activation-Based Methods (Llama-2-13B)

| Method | MMLU | IFEval | GSM8k |
|---|---|---|---|
| Task Arith. | 52.2 | 25.1 | 4.2 |
| AIM (acts.) | 54.2 | 32.0 | 46.2 |
| LS-Merge | 55.1 | 36.4 | 44.1 |

LS-Merge uses only weights, yet matches activation-based AIM.

BibTeX

@inproceedings{soro2026lsmerge,
  title={{LS}-Merge: Merging Language Models in Latent Space},
  author={Bedionita Soro and Aoxuan Silvia Zhang and Bruno Andreis and Jaehyeong Jo and Song Chong and Sung Ju Hwang},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=VSDV0SWwOC}
}