CAO Bin 曹斌

I am working in AI4CM (Artificial Intelligence for Computational Materials), with a focus on crystallography and spectroscopy. My research spans physics-based diffraction pattern simulation, machine learning representations in spectrum-based sequence models, and graph-based modeling of crystal structures. You can learn more on my personal website: www.caobin.asia.

🔬 Research Focus

  • Crystal structure representation for downstream property prediction and material generation.
  • Spectral modeling and representation for crystal structure and symmetry identification.

In parallel, I promote active learning in materials science by developing Bgolearn, a Bayesian global optimization package tailored for materials design. Working with experimental teams, I continuously improve Bgolearn to connect mature ML techniques with real-world scientific workflows.

🌍 Open Science Advocacy

I am passionate about open science and firmly support the unrestricted sharing of knowledge. To that end, I openly release the code and datasets from my research to ensure transparency, reproducibility, and community benefit.

🎓 Academic Path

I completed my MPhil under the supervision of Prof. Zhang Tong-yi and have been pursuing a PhD with him at The Hong Kong University of Science and Technology (Guangzhou) since 2023.

📝 Selected Publications (co-/first author)

MGE advances 2025

Optimize the quantum yield of G‐quartet‐based circularly polarized luminescence materials via active learning strategy‐BgoFace

Tianliang Li, Lifei Chen (co-first author), Bin Cao (co-first author), Siyuan Liu, …, Tong-Yi Zhang, Lingyan Feng

BgoFace

  • This work developed an integrated AL software, BgoFace, which satisfies most material property optimization requirements. The application of BgoFace (with default settings) successfully accelerated the discovery of G4-based CPL materials, achieving results within six iterations and synthesizing 24 experimental groups. The final QY nearly doubled the initial best QY in the training dataset.
arXiv 2025

Materials Generation in the Era of Artificial Intelligence: A Comprehensive Survey

Zhixun Li, Bin Cao (co-first author), Rui Jiao (co-first author), Liang Wang (co-first author), Ding Wang, Yang Liu, Dingshuo Chen, Jia Li, Qiang Liu, Yu Rong, Liang Wang, Tong-Yi Zhang, Jeffrey Xu Yu

MatGen

  • We first organize various types of materials and illustrate multiple representations of crystalline materials. We then provide a detailed summary and taxonomy of current AI-driven materials generation approaches. Furthermore, we discuss the common evaluation metrics and summarize open-source codes and benchmark datasets. Finally, we conclude with potential future directions and challenges in this fast-growing field.
Aggregate 2025

Interpretable Active Learning Identifies Iron-Doped Carbon Dots With High Photothermal Conversion Efficiency for Antitumor Synergistic Therapy

Tianliang Li, Bin Cao(co-first author), Yitong Wang, Lixing Lin, …, Lingyan Feng, Tong-yi Zhang

SHAP-AL

  • We apply an interpretable AL strategy to efficiently optimize the photothermal conversion efficiency (PCE) of carbon dots (CDs) in photothermal therapy (PTT). Using this approach, we successfully synthesized iron-doped CDs (Fe-CDs) with PCE exceeding 78.7% after only 16 experimental trials over four iterations.
ICLR 2025

SimXRD-4M: Big Simulated X-ray Diffraction Data and Crystal Symmetry Classification Benchmark

Bin Cao, Yang Liu, Zinan Zheng, Ruifeng Tan, Jia Li, Tong-yi Zhang

SimXRD-4M

  • We developed a novel XRD simulation method that incorporates comprehensive physical interactions, resulting in a high-fidelity database.
SMALL 2024

Machine Learning-Engineered Nanozyme System for Synergistic Anti-Tumor Ferroptosis/Apoptosis Therapy

Tianliang Li, Bin Cao(co-first author), Tianhao Su, …, Lingyan Feng, Tong-yi Zhang

TCGPR+Bgolearn

  • A novel ML model, termed the sequential backward Tree-Classifier for Gaussian Process Regression (TCGPR), is proposed to improve data pattern recognition following the divide-and-conquer principle.
JMI 2024

CGWGAN: crystal generative framework based on Wyckoff generative adversarial network

Tianhao Su, Bin Cao(co-first author), Shunbo Hu, Musen Li, Tong-yi Zhang

Crystal Generative Framework

  • In this work, we present a crystal generative framework based on Wyckoff generative adversarial network (CGWGAN) to efficiently discover novel crystals.
arXiv 2024

SimXRD-4M: Big Simulated X-ray Diffraction Data Accelerate the Crystalline Symmetry Classification

Bin Cao, Yang Liu, Zinan Zheng, Ruifeng Tan, Jia Li, Tong-yi Zhang

Database & Benchmark

  • This work presents a large open-source dataset of powder XRD patterns designed for symmetry identification. It assesses 21 existing ML models, summarizes the characteristics of XRD sequence data, and provides suggestions for the further development of ML models best suited for analyzing XRD patterns.
IUCrJ 2024

Crystallographic Phase Identifier of a Convolutional Self-Attention Neural Network (CPICANN) on Powder Diffraction Patterns

Shouyang Zhang, Bin Cao (co-first), Tianhao Su, Yue Wu, Zhenjie Feng, Jie Xiong, Tong-Yi Zhang

Phase

  • In this work, we developed a machine learning phase identifier that achieved excellent performance within a relatively small scope.
M&D 2024

Active Learning Accelerates the Discovery of High Strength and High Ductility Lead-Free Solder Alloys

B Cao, T Su, S Yu, T Li, T Zhang, Z Dong, TY Zhang

Project

  • To facilitate materials informatics development, all active learning algorithms were made open-source in our designed framework, Bgolearn…
NPJ 2024

MLMD: a programming-free AI platform to predict and design materials

Jiaxuan Ma, Bin Cao (co-first author), Shuya Dong, Yuan Tian, Menghuan Wang, Jie Xiong, Sheng Sun

Project

  • We developed MLMD, an AI platform for materials design. It can discover novel materials with high-potential advanced properties end-to-end, using model inference and surrogate optimization, and it works even under data scarcity through active learning.
NPJ 2023

Divide and conquer: Machine learning accelerated design of lead-free solder alloys with high strength and high ductility

Qinghua Wei, Bin Cao (co-first author), Hao Yuan (co-first author), Youyang Chen, Kangdong You, Shuting Yu, Tixin Yang, Ziqiang Dong, Tong-Yi Zhang

Project

  • The data are generally small in size and big in noise, while the design space is huge; a newly developed data preprocessing algorithm, named the Tree-Classifier for Gaussian Process Regression (TCGPR)…
JMI 2022

(Cover paper) Domain knowledge-guided interpretive machine learning: formula discovery for the oxidation behavior of ferritic-martensitic steels in supercritical water

Bin Cao, Shuang Yang, Ankang Sun, Ziqiang Dong, Tong-Yi Zhang

Project

  • In this study, we propose a domain knowledge-guided interpretive machine learning strategy and demonstrate it by studying the oxidation behavior of ferritic-martensitic steels in supercritical water…