LS-Merge: Merging Language Models in Latent Space

1KAIST 2DeepAuto.ai 3University of Oxford 4Slater Labs
*Equal contribution · Corresponding author
ICLR 2026
LS-Merge pipeline

LS-Merge encodes pretrained LLM weights into a smooth latent space with a transformer-based VAE, aligns heterogeneous representations via projection and Optimal Transport, and decodes merged latent codes back into parameters. This enables robust merging across model sizes, architectures, and families beyond the scope of standard weight-space methods.

Abstract

Model merging in weight space is an efficient way to reuse pretrained models, but existing methods typically assume matching architectures or sizes, making heterogeneous merges brittle or infeasible. We tackle these issues with a transformer-based variational autoencoder (VAE) trained in a two-stage compression curriculum with structured layer-aware chunking. To align heterogeneous models, we introduce a dimensionality-matching projection that allows interpolation between models of different sizes. Empirically, latent-space interpolation is consistently more robust than direct weight-space averaging and yields stronger downstream performance when merging models of different sizes.

The Problem: Weight-Space Merging Hits a Wall

Figure: Model A (d = 4096, 26 layers) vs. Model B (d = 2048, 36 layers): the layer stacks differ in both width and depth, so their weights cannot be averaged element-wise.

Width mismatch: W_A ∈ ℝ^{n×4096} vs. W_B ∈ ℝ^{n×2048} — element-wise averaging is undefined.

Depth mismatch: 26 layers vs. 36 layers — no natural 1-to-1 correspondence.

Family mismatch: Gemma and LLaMA have fundamentally different weight layouts.

Standard weight-space methods (Model Soup, SLERP, Task Arithmetic, TIES, DARE) all require identical architectures.

Key Idea: Merge in Latent Space

Instead of averaging weights directly, LS-Merge first maps pretrained model parameters into a shared latent space learned by a transformer-based VAE. Merging is then performed as interpolation between aligned latent representations, and the merged latent code is decoded back into a valid set of model weights.

This design is architecture-agnostic — models of different widths and depths map to the same latent dimensionality. It enables self-merging, where multiple latent samples from a single model's posterior are combined to improve robustness without requiring a second model. And because the latent manifold is smooth, interpolation avoids the weight-space barriers that make direct averaging brittle.

Figure: weights W₁ and W₂ are encoded (E₁, E₂) into a shared latent space as z₁ and z₂, aligned and merged into zα, then decoded (D) into the merged model.

Methodology

LS-Merge maps pretrained model weights into a shared latent space, merges them there, and decodes the merged representation back into a usable set of model parameters. The pipeline consists of preprocessing, latent encoding, alignment and interpolation, and decoding.

Figure: (1) chunking w → X ∈ ℝ^{n×c}; (2) encoding to shared latent codes z_a, z_b; (3) alignment and merging via OT, interpolation, or self-merge; (4) decoding to the merged model W̃. Incompatible weights are never averaged directly — they are mapped to a shared manifold, aligned, and decoded back.

Step 1: Preprocessing & Chunking

Each selected parameter tensor is flattened into a vector, zero-padded to a multiple of the chunk size c, and partitioned into n fixed-size chunks. This gives models with different tensor shapes a unified sequence format.

parameter tensor → flatten → w ∈ ℝ^L → zero-pad → X ∈ ℝ^{n×c}
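The flatten–pad–chunk step above can be sketched in NumPy; the chunk size and helper name are illustrative, not the paper's implementation:

```python
import numpy as np

def chunk_weights(tensor, c):
    """Flatten a parameter tensor, zero-pad to a multiple of the
    chunk size c, and split into n fixed-size chunks (n x c).
    Returns the chunk matrix X and the original length L, which is
    needed later to strip the padding."""
    w = np.asarray(tensor, dtype=np.float32).ravel()  # w in R^L
    L = w.size
    n = -(-L // c)                                    # ceil(L / c)
    padded = np.zeros(n * c, dtype=np.float32)
    padded[:L] = w
    return padded.reshape(n, c), L

# e.g. a 3x5 weight matrix with chunk size 4 -> 4 chunks of length 4
X, L = chunk_weights(np.ones((3, 5)), c=4)
```

Because every tensor becomes an (n × c) sequence, models with entirely different tensor shapes end up in the same input format for the encoder.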

Step 2: Latent Encoding

Each chunk is linearly embedded to dimension d, then a 6-layer Transformer encoder maps the sequence into a compact latent code.

X ∈ ℝ^{n×c} → X_emb ∈ ℝ^{n×d} → z = E_θ(X_emb) ∈ ℝ^{d_z}

Training uses a two-stage β-VAE curriculum: first as a deterministic autoencoder, then with KL regularization enabled to avoid posterior collapse.
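Such a curriculum can be expressed as a KL-weight schedule; a minimal sketch, where the step counts and β_max are placeholder values, not the paper's hyperparameters:

```python
def beta_schedule(step, stage1_steps=10_000, warmup_steps=5_000, beta_max=1e-4):
    """Two-stage curriculum: stage 1 trains a deterministic autoencoder
    (beta = 0, reconstruction loss only); stage 2 linearly ramps the KL
    weight up to beta_max so the posterior stays informative instead of
    collapsing. All constants here are illustrative."""
    if step < stage1_steps:            # stage 1: plain autoencoder
        return 0.0
    ramp = min(1.0, (step - stage1_steps) / warmup_steps)
    return beta_max * ramp             # stage 2: KL annealing
```

The total loss at each step would then be reconstruction + beta_schedule(step) · KL.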

Step 3: Alignment & Latent Merging

Same-architecture models are merged by direct latent interpolation. For heterogeneous models, LS-Merge first applies dimensionality matching and then aligns latent distributions with Optimal Transport before interpolation.

zλ = (1−λ)za + λzb

In self-merging, multiple posterior samples from a single model are merged in latent space.

Step 4: Decoding & Reconstruction

The decoder maps the merged latent code back to a chunk sequence, which is concatenated, un-padded, and reshaped into the original parameter layout.

z_λ → X̃ = D_φ(z_λ) ∈ ℝ^{n×c} → un-pad + reshape → W̃_λ
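Assuming the chunking of Step 1 recorded the original length L and tensor shape, decoding ends with an un-pad and reshape that exactly inverts it. A NumPy sketch with a stand-in for the decoder output:

```python
import numpy as np

def unchunk_weights(X, L, shape):
    """Invert the chunking step: concatenate the decoded chunks,
    drop the zero padding after position L, and restore the
    original parameter tensor shape."""
    return X.reshape(-1)[:L].reshape(shape)

# stand-in for a decoded chunk sequence of a 3x5 tensor, chunk size 4
w = np.arange(15, dtype=np.float32).reshape(3, 5)
padded = np.zeros(16, dtype=np.float32)
padded[:15] = w.ravel()
X = padded.reshape(4, 4)              # "decoder output"
w_rec = unchunk_weights(X, L=15, shape=(3, 5))
```

Applying this per tensor reassembles a complete, loadable state dict for the merged model.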

Evidence for a Latent Weight Manifold

LLM weights exhibit heavy-tailed distributions (high kurtosis) and low intrinsic dimensionality (sharp PCA spectral decay). This means they concentrate near a low-dimensional non-linear manifold — which a VAE can learn. Linear methods like PCA completely fail because the manifold is curved.
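Both diagnostics are straightforward to compute; a NumPy sketch (the estimator choices are ours, not the paper's):

```python
import numpy as np

def excess_kurtosis(w):
    """Fourth standardized moment minus 3; values > 0 indicate
    heavier tails than a Gaussian."""
    w = w - w.mean()
    return (w**4).mean() / (w**2).mean() ** 2 - 3.0

def pca_evr(W, k):
    """Fraction of variance captured by the top-k principal
    components of a (samples x features) weight matrix; a value
    near 1 for small k signals low intrinsic dimensionality."""
    Wc = W - W.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Wc, compute_uv=False)
    var = s**2
    return var[:k].sum() / var.sum()
```

A sharp decay of `pca_evr` with k says the weights sit near a low-dimensional set, but it says nothing about curvature, which is why a non-linear VAE rather than PCA is needed to model it.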


PCA explained-variance ratios for self-attention k_proj matrices in Llama-3.2-3B-it and Gemma-3 (1B, 4B). In all cases, a few leading components capture most of the variance, indicating a shared low-rank manifold that a VAE can exploit for compression.

Latent Alignment for Heterogeneous Models

When two models come from the same architecture family, their latent representations often lie on overlapping regions of the manifold, so direct interpolation is already meaningful. For heterogeneous models, however, the latent distributions can be disjoint. LS-Merge therefore aligns them before interpolation using dimensionality-matching projection and Optimal Transport.
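One generic way to realize the OT step is entropic regularization (Sinkhorn iterations) followed by a barycentric projection of source latents onto the target cloud. This sketch assumes equal-dimensional latents (i.e., after dimensionality matching); the regularization strength and iteration count are illustrative, and this is a stand-in for, not necessarily the paper's exact, alignment procedure:

```python
import numpy as np

def sinkhorn_align(Zs, Zt, eps=0.05, iters=200):
    """Align source latents Zs to target latents Zt: solve entropic OT
    with Sinkhorn iterations, then map each source point to the
    barycenter of its transport-plan row over the target points."""
    ns, nt = len(Zs), len(Zt)
    C = ((Zs[:, None, :] - Zt[None, :, :]) ** 2).sum(-1)  # squared distances
    K = np.exp(-C / (eps * C.mean()))     # eps scaled by mean cost for stability
    a = np.full(ns, 1.0 / ns)             # uniform source marginal
    b = np.full(nt, 1.0 / nt)             # uniform target marginal
    v = np.ones(nt)
    for _ in range(iters):                # Sinkhorn scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]       # transport plan
    return (P / P.sum(axis=1, keepdims=True)) @ Zt  # barycentric projection
```

After this projection, the aligned source latents live in the target's region of the manifold, so the plain interpolation formula from Step 3 applies.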

Figure: homogeneous models (two fine-tunes of the same base) occupy overlapping latent regions, so linear interpolation works directly; heterogeneous models (e.g. Gemma-1B vs. LLaMA-1B) occupy disjoint regions and require OT alignment first.
t-SNE visualizations: same-family fine-tunes overlap in latent space (homogeneous); cross-architecture models form disjoint clusters (heterogeneous); after OT alignment, both are embedded in a shared manifold.

Experimental Results

We evaluate LS-Merge in three regimes: self-merging a single model, merging multiple LoRA experts on a shared base, and transferring across architectures and families. In all cases, latent-space merging improves or matches carefully designed baselines.

Self-Merging: Enhancing a Single Model

| Model | MMLU | MMLU-pro | HellaSwag | GSM8k |
|---|---|---|---|---|
| Gemma-3-4B-it | 53.10 | 20.90 | 47.40 | 29.90 |
| + VAE recon. | 54.10 | 20.80 | 49.03 | 31.27 |
| + LS-Merge | 54.20 | 21.02 | 50.10 | 32.20 |
| Gemma-3-1B-it | 32.20 | 7.10 | 28.70 | 16.90 |
| + VAE recon. | 32.60 | 7.60 | 28.57 | 16.77 |
| + LS-Merge | 35.13 | 10.30 | 31.16 | 17.50 |

Self-merging improves a single Gemma checkpoint by ≈4% on average over both the base model and its VAE reconstruction, especially for the smaller 1B model.

Expert Merging: LoRA Experts on Gemma-7B-it

| Method | MMLU | MMLU-pro | HellaSwag | GSM8k | TruthfulQA | NLGraph | K-Cross | AbstainQA |
|---|---|---|---|---|---|---|---|---|
| Best Expert | 45.7 | 14.3 | 46.6 | 26.1 | 32.4 | 51.7 | 32.7 | -10.8 |
| Uniform Soup | 49.7 | 19.4 | 54.0 | 7.9 | 31.2 | 47.5 | 29.6 | -0.1 |
| SLERP | 52.5 | 18.8 | 50.4 | 25.5 | 28.7 | 49.8 | 30.0 | -0.2 |
| Greedy Soup | 50.8 | 22.1 | 54.6 | 23.9 | 31.9 | 52.9 | 28.0 | 3.3 |
| DARE-TIES | 49.1 | 18.8 | 53.7 | 7.3 | 28.2 | 52.8 | 29.0 | 1.4 |
| LS-Merge (lerp) | 54.7 | 21.6 | 58.1 | 28.1 | 33.0 | 53.1 | 35.6 | 2.0 |
| LS-Merge (soup) | 56.0 | 22.2 | 60.1 | 24.2 | 32.0 | 56.1 | 35.2 | 4.0 |

LS-Merge consistently outperforms all weight-space baselines across 8 benchmarks.

Cross-Architecture Merging


Gemma-4B → Gemma-1B: latent-space interpolation transfers strength from a larger source model into a smaller target without catastrophic degradation.


Performance peaks for modest interpolation coefficients (λ between 0.05 and 0.20), confirming that small injections of source structure are most beneficial.

Cross-Family: LLaMA-1B → Gemma-1B

| Strategy | WinoGrande | ARC-C | HellaSwag |
|---|---|---|---|
| Base | 56.83 | 42.78 | 49.07 |
| OT only | 51.13 | 34.25 | 48.50 |
| OT + interp. | 57.75 | 43.34 | 50.10 |

After aligning LLaMA-1B to Gemma-1B, a small interpolation weight (λ ≈ 0.1) already improves the Gemma-1B target across all three benchmarks.

PCA vs. VAE (Gemma-1B)

| Method | Ratio | MMLU | ARC-C |
|---|---|---|---|
| Base | 1.0× | 41.4 | 42.4 |
| PCA | 1.6× | 25.5 | 27.7 |
| VAE | 1.6× | 39.9 | 41.6 |
| PCA | 4.0× | 24.1 | 25.9 |
| VAE | 4.0× | 39.8 | 42.8 |

PCA collapses at all ratios. Non-linear encoding is a geometric necessity.

Which Layers to Merge?

| Layers | WinoGrande | ARC-C | MMLU |
|---|---|---|---|
| MLP only | 56.84 | 43.89 | 41.02 |
| Attn only | 56.67 | 40.23 | 39.80 |
| Both | 57.75 | 43.34 | 42.10 |

Merging both attention and MLP blocks yields the strongest overall performance, especially on MMLU, compared to merging either component alone.

vs. Activation-Based Methods (Llama-2-13B)

| Method | MMLU | IFEval | GSM8k |
|---|---|---|---|
| Task Arith. | 52.2 | 25.1 | 4.2 |
| AIM (acts.) | 54.2 | 32.0 | 46.2 |
| LS-Merge | 55.1 | 36.4 | 44.1 |

LS-Merge uses only weights, yet matches activation-based AIM.

BibTeX

@inproceedings{soro2026lsmerge,
  title={{LS}-Merge: Merging Language Models in Latent Space},
  author={Bedionita Soro and Aoxuan Silvia Zhang and Bruno Andreis and Jaehyeong Jo and Song Chong and Sung Ju Hwang},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=VSDV0SWwOC}
}