Crystal Structure Prediction • Foundation Models • Superconductors

Siamese Foundation Models for Crystal Structure Prediction

DAO unites a structure generator (DAO-G) and an energy predictor (DAO-P) in a pretrain–finetune framework, achieving state-of-the-art crystal structure prediction and over 2000× speedup versus DFT software on real-world superconductors.

Liming Wu1,2,3†   Wenbing Huang1,2,3†✉   Rui Jiao4,5   Jianxing Huang6   Liwei Liu6   Yipeng Zhou6  
Hao Sun1,2,3   Yang Liu4,5   Fuchun Sun4   Yuxiang Ren7✉   Ji-Rong Wen1,2,3✉

1Gaoling School of Artificial Intelligence, Renmin University of China  
2Beijing Key Laboratory of Research on Large Models and Intelligent Governance  
3Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE  
4Department of Computer Science and Technology, Tsinghua University  
5Institute for AI Industry Research, Tsinghua University  
6Advanced Computing and Storage Lab, Huawei Technologies  
7School of Intelligence Science and Technology, Nanjing University

Motivation & Overview

Predicting crystal structures from chemical compositions is a fundamental challenge in materials discovery — analogous to protein folding but with far more complex 3D geometries.

High Computational Cost

Traditional CSP methods — first-principles calculations, stochastic sampling, and evolutionary optimization — are inherently limited by high computational costs and poor scalability with system complexity.

🧩

Limited Generalizability

Existing deep generative models are trained on small, domain-specific datasets, which limits their generalizability to unseen structures and yields unsatisfactory performance on widely recognized CSP benchmarks such as MPTS-52.

🔭

Missing CSP-Specific Foundation Models

Prior crystal foundation models target either force-field prediction (GNoME, MACE-MP-0) or general-purpose generation (MatterGen); none is designed and thoroughly investigated specifically for CSP.

Our Solution: Siamese Foundation Models

We propose Diffusion-based Crystal Omni (DAO), a pretrain–finetune framework comprising two complementary foundation models: DAO-G for generating stable crystal structures and DAO-P for predicting energy and assisting DAO-G. Both are built upon Crysformer, a geometric graph Transformer that respects the O(3) and periodic translation symmetries of crystal structures.

DAO Framework Overview

The DAO framework: pretraining pipeline and downstream validation of DAO-G and DAO-P.

Key Contributions

Six principal advances that collectively push CSP forward.

1

Siamese Foundation Model Framework

First foundation model framework specifically designed for CSP, comprising DAO-G (generator) and DAO-P (predictor) that synergistically cooperate: DAO-P relaxes data and guides generation for DAO-G, while DAO-G augments structural data for DAO-P.

2

CrysDB: ~940K Crystal Pretraining Dataset

Curated from Materials Project and OQMD, comprising ~940K entries of stable and unstable crystals with energy annotations, enabling large-scale pretraining with rigorous deduplication to prevent data leakage.

3

Two-Stage Pretraining with Dataset Relaxation

Stage I pretrains DAO-G on all crystals; Stage II refines on a dataset where unstable structures are relaxed by DAO-P using L-BFGS, mitigating bias toward unstable energy landscapes.

4

Energy-Guided Sampling via Boltzmann Distribution

DAO-P provides energy-based guidance during DAO-G's sampling, steering generated structures toward lower-energy, more thermodynamically stable configurations using a principled exponential energy loss.

5

SOTA on CSP Benchmarks

Pretraining consistently improves performance across multiple backbone architectures. DAO-G (Crysformer + FlowMM) achieves the best Match Rates of 74.17% on MP-20 and 42.01% on MPTS-52.

6

Real-World Superconductor Validation

On Cr6Os2, DAO achieves 100% match rate with RMSE 0.0012 and over 2000× speedup per iteration vs. DFT. DAO-P predicts critical temperatures with errors as low as 0.04 K.

CrysDB Statistics

Statistics of CrysDB: source distribution, stable/unstable proportions, and feature distributions.

Method / Framework

DAO's pretrain–finetune pipeline with two Siamese foundation models and a two-stage pretraining strategy.

1

CrysDB Construction

CrysDB compiles ~940K crystal entries from the Materials Project (94,779 entries) and OQMD (848,105 entries), each with 3–30 atoms and Ehull < 1.0 eV/atom. After deduplication against downstream benchmarks, the final CrysDB contains 919,258 entries; the OQMD portion is 29% stable and 71% unstable, while the MP portion is 55% stable and 45% unstable.
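
To make the filtering and deduplication step concrete, here is a minimal sketch assuming pymatgen's StructureMatcher; the exact tooling and tolerances used to build CrysDB are not specified on this page, so treat the helper below as illustrative.

# Sketch of the CrysDB filtering/deduplication step (illustrative only).
from pymatgen.core import Structure
from pymatgen.analysis.structure_matcher import StructureMatcher

matcher = StructureMatcher()  # default tolerances; CrysDB's actual settings may differ

def keep_entry(structure: Structure, e_hull: float, benchmark_structures: list) -> bool:
    """Apply the size/energy filters described above and drop entries that
    duplicate downstream benchmark test structures (to avoid data leakage)."""
    if not (3 <= len(structure) <= 30):   # 3-30 atoms per cell
        return False
    if e_hull >= 1.0:                     # keep Ehull < 1.0 eV/atom
        return False
    return not any(matcher.fit(structure, ref) for ref in benchmark_structures)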

2

Stage I: Pretrain DAO-G on Full CrysDB

DAO-G is pretrained via a diffusion process (DiffCSP) to predict lattice noise and fractional coordinate scores. Training on both stable and unstable crystals enables learning from a broader distribution. Simultaneously, DAO-P is pretrained with a mix-supervised loss: the diffusion CSP loss (self-supervised) plus an exponential energy loss (supervised) that provably converges to ground-truth intermediate energies under Boltzmann-constrained modeling.
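
As a rough illustration of the mix-supervised objective, the sketch below combines a DiffCSP-style denoising loss with a Boltzmann-factor regression term; the helper names (csp_diffusion_loss, predict_energy), the weighting, and the exact form of the exponential term are assumptions for illustration, not the paper's implementation.

import torch

def mix_supervised_loss(model, batch, beta=1.0, lambda_energy=0.5):
    # Self-supervised term: denoising loss over lattice noise and fractional
    # coordinate scores, as in DiffCSP (assumed helper).
    loss_diffusion = model.csp_diffusion_loss(batch)

    # Supervised term: regress Boltzmann factors instead of raw energies so that
    # low-energy (more stable) structures dominate the training signal.
    e_pred = model.predict_energy(batch)        # assumed helper, shape [batch]
    e_true = batch["energy_per_atom"]           # ground-truth energy labels
    loss_energy = torch.mean((torch.exp(-beta * e_pred) - torch.exp(-beta * e_true)) ** 2)

    return loss_diffusion + lambda_energy * loss_energy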

3

Dataset Relaxation via DAO-P

DAO-P predicts energy gradients (force fields) for unstable structures (0.08 < Ehull ≤ 0.5 eV/atom) and relaxes them toward more stable configurations using the L-BFGS optimizer — replacing expensive DFT calculations with a fast ML-based alternative.
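
A minimal sketch of this relaxation loop, assuming a predict_energy_and_forces callable that wraps DAO-P and ignoring lattice degrees of freedom for brevity; the actual pipeline may relax lattices and fractional coordinates jointly.

import numpy as np
from scipy.optimize import minimize

def relax_positions(positions, predict_energy_and_forces):
    """Relax atomic positions by minimizing the ML-predicted energy with L-BFGS."""
    shape = positions.shape  # (n_atoms, 3)

    def objective(x):
        energy, forces = predict_energy_and_forces(x.reshape(shape))
        # L-BFGS needs the gradient of the energy, i.e. the negative of the forces.
        return energy, -forces.reshape(-1)

    result = minimize(objective, positions.reshape(-1), jac=True, method="L-BFGS-B")
    return result.x.reshape(shape)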

4

Stage II: Refine DAO-G on Relaxed Dataset

Pretraining of DAO-G continues on the relaxed dataset with a reduced learning rate, refining the denoising process on the improved data and mitigating bias toward unstable regions.

5

Energy-Guided Sampling

During generation, DAO-P steers the sampling of DAO-G via energy guidance: ∇_{M_t} log p_t(M_t) = ∇_{M_t} log q_t(M_t) − β ∇_{M_t} E_t(M_t, t). The Boltzmann-weighted distribution promotes thermodynamically stable structures.
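
In code, the guidance amounts to shifting the generator's score by the predictor's energy gradient at every reverse step. The sketch below is illustrative: generator.score and predictor.energy are assumed interfaces, and the real sampler applies this jointly to lattices and fractional coordinates following the DiffCSP/FlowMM update rules.

import torch

def guided_score(M_t, t, generator, predictor, beta):
    # Unconditional score from DAO-G: ∇ log q_t(M_t)
    score = generator.score(M_t, t)                  # assumed helper

    # Energy gradient from DAO-P, obtained via autograd: ∇ E_t(M_t, t)
    M_req = M_t.detach().requires_grad_(True)
    energy = predictor.energy(M_req, t).sum()        # assumed helper
    grad_energy = torch.autograd.grad(energy, M_req)[0]

    # Boltzmann-weighted guidance: ∇ log p_t = ∇ log q_t − β ∇ E_t
    return score - beta * grad_energy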

6

Finetune for Downstream Tasks

DAO-G is directly finetuned for CSP without architecture modification. DAO-P is finetuned for energy/property prediction with specialized heads across eight distinct datasets.

Crysformer Architecture

Both DAO-G and DAO-P are built on Crysformer, a geometric graph Transformer with four modules: (1) an embedding module with CGCNN embeddings and Fourier-Transform-based invariant edge features; (2) an invariant graph attention module with separate parametric networks for keys, values, and edge features; (3) a gated addition module for flexible residual connections; (4) noise and energy prediction heads. Crysformer ensures O(3) equivariance for noise output and O(3) invariance for energy output, along with periodic translation invariance — critical symmetries for crystal structures.
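
For intuition, the Fourier-transform edge embedding can be sketched as follows: both endpoints of an edge shift by the same amount under a periodic translation, so their fractional-coordinate difference is unchanged, and sine/cosine features at integer frequencies remove the remaining mod-1 ambiguity. The number of frequencies and the normalization here are illustrative assumptions.

import torch

def fourier_edge_features(frac_i, frac_j, num_freq=8):
    """Periodic-translation-invariant edge features from fractional coordinates.

    frac_i, frac_j: [num_edges, 3] fractional coordinates of edge endpoints.
    Returns: [num_edges, 3 * 2 * num_freq] features.
    """
    diff = frac_j - frac_i                               # unchanged under periodic translation
    k = torch.arange(1, num_freq + 1, dtype=diff.dtype, device=diff.device)
    angles = 2 * torch.pi * diff.unsqueeze(-1) * k       # [num_edges, 3, num_freq]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1).flatten(start_dim=1)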

Crysformer Architecture

Crysformer: embedding, invariant graph attention, gated addition, and prediction heads.

Experiments & Results

Evaluation on two well-recognized CSP benchmarks: MP-20 (≤20 atoms, 45,231 crystals) and MPTS-52 (≤52 atoms, 40,476 crystals).

74.17%
Best Match Rate on MP-20 (1-shot)
42.01%
Best Match Rate on MPTS-52 (1-shot)
919K
Deduplicated CrysDB Entries
2000×
Speedup vs. DFT (per iteration)

CSP Performance (1-shot) on MP-20 & MPTS-52

Category | Model | Size | MP-20 MR (%) ↑ | MP-20 RMSE ↓ | MPTS-52 MR (%) ↑ | MPTS-52 RMSE ↓
Non-Pretrained | CDVAE | – | 33.90 | 0.1045 | 5.34 | 0.2106
Non-Pretrained | DiffCSP | – | 51.49 | 0.0631 | 12.19 | 0.1786
Non-Pretrained | EquiCSP | – | 57.39 | 0.0510 | 14.85 | 0.1169
Non-Pretrained | FlowMM | – | 61.39 | 0.0560 | 17.54 | 0.1726
Non-Pretrained | Crysformer + DiffCSP | – | 51.55 | 0.0915 | 17.65 | 0.1428
Pretrained | DiffCSP | 12.3M | 51.23 | 0.0552 | 18.50 | 0.0825
Pretrained | DiffCSP-large | 26.2M | 64.04 | 0.0433 | 30.77 | 0.0640*
Pretrained | MatterGen | 25.3M | 67.40 | 0.0332* | 30.28 | 0.0703
Pretrained | FlowMM-large | 25.7M | 69.95 | 0.0378 | 33.78 | 0.0951
Pretrained | Crysformer + DiffCSP (DAO-G Stage I) | 25.2M | 65.60 | 0.0411 | 32.52 | 0.0731
Pretrained | Crysformer + FlowMM | 25.2M | 74.17* | 0.0400 | 42.01* | 0.1083

* marks the best value in each column. All pretrained models are trained on CrysDB. Results are averaged over three runs.
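
For reference, Match Rate and RMSE in this line of work are typically computed with pymatgen's StructureMatcher; the sketch below uses the tolerances common in the CSP literature (ltol=0.3, stol=0.5, angle_tol=10), which may differ from the paper's exact evaluation settings.

from pymatgen.analysis.structure_matcher import StructureMatcher

matcher = StructureMatcher(ltol=0.3, stol=0.5, angle_tol=10.0)

def match_rate_and_rmse(predictions, ground_truths):
    """predictions, ground_truths: parallel lists of pymatgen Structure objects."""
    rms_values = []
    for pred, gt in zip(predictions, ground_truths):
        result = matcher.get_rms_dist(pred, gt)   # None when the structures do not match
        if result is not None:
            rms_values.append(result[0])          # normalized RMS displacement
    match_rate = len(rms_values) / len(ground_truths)
    rmse = sum(rms_values) / len(rms_values) if rms_values else float("nan")
    return match_rate, rmse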

Ablation Studies

Ablation studies: two-stage pretraining, polymorph generation, energy guidance, and stability rates.

Key Findings

  • Impact of Pretraining: Large-scale pretraining boosts DAO-G's Match Rate from 51.55% to 65.60% on MP-20. FlowMM also benefits substantially from pretraining.
  • Efficacy of Crysformer: DAO-G outperforms DiffCSP-large across nearly all metrics. Crysformer + FlowMM also surpasses FlowMM-large in Match Rate.
  • Advantage on Larger Systems: While MatterGen slightly outperforms DAO-G (Stage I) on MP-20, DAO-G achieves a higher MR on MPTS-52 (32.52% vs. 30.28%), demonstrating better scaling to systems with more atoms.
  • Flow Matching Advantage: Replacing diffusion with flow matching yields the best Match Rates of 74.17% on MP-20 and 42.01% on MPTS-52.

Ablation: Two-Stage Pretraining & Energy Guidance

📊

Stage I vs. Stage I+II

Including unstable data in pretraining (Stage I) outperforms stable-only pretraining. Adding Stage II (data relaxation) further improves MR and reduces RMSE on MP-20, and significantly reduces RMSE variance on MPTS-52.

🔋

Energy Guidance Benefits

Energy-guided sampling increases stability rate from 85.99% → 87.42% on MP-20 and 73.75% → 75.05% on MPTS-52. It reduces RMSE on MPTS-52 (0.0695 → 0.0688).

🔮

Polymorph Generation

DAO-G generates all polymorphs in 72.2%, 54.5%, and 81.8% of the 2-, 3-, and 4-polymorph cases, respectively. For Ni6O2F10 (4 conformations), all four are matched, with RMSEs of 0.0063, 0.0305, 0.0309, and 0.0049.

DAO-P Energy Prediction Accuracy (Zero-Shot)

Without finetuning on MP-20 or MPTS-52, DAO-P achieves MAEs of 0.0260 eV/atom on MP-20 and 0.0514 eV/atom on MPTS-52 test sets — accuracy considered acceptable for materials science. DAO-P also achieves SOTA results on four out of eight crystal property prediction datasets.

Real-World Superconductor Analysis

Validating DAO on three real-world superconductors unseen during pretraining and finetuning: Cr6Os2, Zr16Rh8O4, and Zr16Pd8O4.

Superconductor Results

Superconductor experiments: structure prediction, Tc estimation, and speed comparison with DFT.

100%
Match Rate on Cr6Os2 (20-shot)
0.0012
RMSE on Cr6Os2 (20-shot best)
0.04 K
Tc Error on Zr16Pd8O4

Cr6Os2 (A15 Structure)

DAO-G achieves 100% Match Rate and RMSE = 0.0012 over 20 runs. The DFT-computed Ehull of the generated structure is 0.02918 eV/atom vs. 0.02916 eV/atom for the experimental structure, a difference of only 0.00002 eV/atom.

Although unstable Cr6Os2 structures existed in pretraining data, DAO-G generates the stable superconducting structure — not merely memorizing training examples.

By comparison, the Quantum ESPRESSO (QE) optimizer reaches 75% MR with an average RMSE of 0.1310 and is >2000× slower per iteration.

Zr16Pd8O4 (η-carbide, Fd3̄m)

Features rigid Wyckoff site occupancy and geometrically frustrated stella quadrangula lattices. DAO-G generates the structure with RMSE = 0.0172. Ehull difference: 0.0003 eV/atom.

Zr16Rh8O4

A minor lattice change (~0.5%) substituting Rh for Pd significantly affects superconducting properties (Tc: 2.73 K → 3.73 K). DAO-G resolves this with RMSE = 0.0212.

DAO-P Tc errors: 2.02 K (Cr6Os2), 0.26 K (Zr16Rh8O4), 0.04 K (Zr16Pd8O4).

Data Augmentation Improves Tc Prediction

Using DAO-G to generate structures for 748 superconductors without structural data consistently improves DAO-P's Tc prediction across all 5 cross-validation folds, reducing the average MAE (in log K) from 0.761 → 0.714.
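
A schematic of the 5-fold evaluation behind these numbers, with the MAE computed on log-scaled critical temperatures; train_tc_model and predict_tc stand in for DAO-P finetuning and inference and are assumptions for illustration.

import numpy as np
from sklearn.model_selection import KFold

def cross_validated_mae(features, log_tc, train_tc_model, predict_tc, n_splits=5, seed=0):
    """Average MAE (in log K) over K folds; `features` may or may not include
    DAO-G-generated structures, enabling an augmented vs. non-augmented comparison."""
    maes = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(features):
        model = train_tc_model(features[train_idx], log_tc[train_idx])   # assumed helper
        predictions = predict_tc(model, features[test_idx])              # assumed helper
        maes.append(np.mean(np.abs(predictions - log_tc[test_idx])))
    return float(np.mean(maes))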

Speed Advantage

QE optimizer: ~138 minutes over 38 iterations (≈218 s per iteration). DAO-G: 1000 sampling iterations in 1.5 minutes (≈0.09 s per iteration), over 2000× faster per iteration.

Conclusion & Impact

DAO demonstrates the significant potential of Siamese foundation models for advancing materials science research and development.

🎯

SOTA Crystal Structure Prediction

The pretrained Crysformer + FlowMM variant of DAO-G achieves state-of-the-art results on both MP-20 and MPTS-52, with Match Rates of 74.17% and 42.01%. Pretraining consistently benefits multiple backbone architectures.

🔄

Synergistic Generator–Predictor Interaction

DAO-P enhances DAO-G via dataset relaxation and energy-guided sampling; DAO-G augments structural data for DAO-P when structural information is unavailable.

🧪

Practical Superconductor Analysis

DAO accurately predicts structures and critical temperatures for real-world superconductors, outperforming DFT in both efficiency and accuracy — a promising step toward designing novel high-temperature superconductors.

Future Directions

  • Scaling to larger systems: Expanding the pretraining dataset beyond 30 atoms could improve MPTS-52 performance (currently 42.01% MR in 1-shot, 46.78% in 20-shot).
  • Advanced generative models: Integrating more advanced generative approaches into the pretraining process.
  • Novel superconductor design: Moving beyond structure prediction and Tc estimation toward property-guided design of novel high-temperature superconductors.

BibTeX Citation

If you find our work useful, please consider citing:

@article{wu2026dao,
	title = {Siamese foundation models for crystal structure prediction},
	issn = {2041-1723},
	doi = {10.1038/s41467-026-72362-3},
	journal = {Nature Communications},
	author = {Wu, Liming and Huang, Wenbing and Jiao, Rui and Huang, Jianxing and Liu, Liwei and Zhou, Yipeng and Sun, Hao and Liu, Yang and Sun, Fuchun and Ren, Yuxiang and Wen, Ji-Rong},
	year = {2026},
}

Contact

If you have any questions, feedback, or collaboration ideas, feel free to reach out:

📧 Email: wlm155@126.com