Publications

Selected publication list

(Underline: equal contribution, *: corresponding author)

Concept-Guided Tokenization: Closing the Gap Between Reconstruction and Generation
Yunqiao Yang, Haokun Lin, Guanzhong Wu, Ying Wei*
Forty-third International Conference on Machine Learning (ICML), 2026
pdf / code

Existing image tokenizers preserve low-level pixel details well but lack explicit semantic guidance, leading to a reconstruction-generation trade-off in downstream generation. This work proposes ConceptTok, which integrates text only at the encoder and uses sparse autoencoders to decompose pre-trained vision-language features into a disentangled concept space, guiding the tokenizer to predict top-K activated concept indices. ConceptTok achieves strong reconstruction-generation balance with 1.39 rFID / 2.65 gFID on ImageNet and 2.85 rFID / 10.73 gFID on COCO-30k.
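As a rough illustration of the top-K concept targets mentioned above, the sketch below picks the indices of the K largest activations from a sparse-autoencoder code vector. The function name `top_k_concepts` and the toy activations are hypothetical and are not taken from the ConceptTok code.

```python
def top_k_concepts(activations, k):
    """Return indices of the k largest activations, strongest first."""
    if not 0 < k <= len(activations):
        raise ValueError("k must be between 1 and the number of concepts")
    # Sort concept indices by activation value, largest first.
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    return order[:k]

# Hypothetical sparse code over six concepts.
sae_code = [0.0, 2.3, 0.0, 0.7, 5.1, 0.0]
print(top_k_concepts(sae_code, k=2))  # [4, 1]
```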

RetrOrchestrator: A Multi-Step Retrosynthesis Agent Dynamically Orchestrating Single-Step Transition Models
Liao Chang, Luotian Yuan, Yiping Ke, Ying Wei*
Forty-third International Conference on Machine Learning (ICML), 2026
pdf / code

Existing multi-step retrosynthesis planners commit to a single single-step retrosynthesis (SSR) model throughout the entire search, ignoring the pronounced skill disparity of SSR models across molecule states. This work introduces RetrOrchestrator, an LLM-powered agent that reformulates retrosynthesis planning as a POMDP and dynamically selects an SSR model as a tool at each step, trained via a scaffold-aware GRPO algorithm. RetrOrchestrator achieves a state-of-the-art 94.21% success rate on Retro*-190 and establishes a new Pareto front in success rate versus query cost.
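For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage computation used in standard GRPO: rewards from several rollouts of the same prompt are standardized within the group. It does not show the scaffold-aware variant described above; the function name, the reward values, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled plans of the same target molecule
# (1.0 = route found, 0.0 = search failed).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Rollouts that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed.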

Plug-and-Play Compositionality for Boosting Continual Learning with Foundation Models
Weiduo Liao, Fei Han, Hisao Ishibuchi*, Qingfu Zhang*, Ying Wei*
Fourteenth International Conference on Learning Representations (ICLR), 2026
pdf / code (oral)

Vision learners often suffer catastrophic forgetting because they recognize classes by comparison rather than as compositions of representative concepts, an issue that persists even with foundation-model-based continual learners. This work proposes CompSLOT, a universal concept-level framework that extracts semantically meaningful slots from ImageNet-pretrained vision transformers and distills concept-based sample similarity into the classifier via method-agnostic self-supervision. CompSLOT consistently boosts diverse continual learners, especially when current tasks contain few classes.

Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models
Zhaoyi Li, Jiatong Li, Gangwei Jiang, Linqi Song*, Defu Lian*, Ying Wei*
Fourteenth International Conference on Learning Representations (ICLR), 2026
pdf / code

Chain-of-thought reasoning suffers sharp performance drops in reasoning hop generalization, where the required number of reasoning steps exceeds training distributions, yet the internal cause remains unclear. This work identifies that errors concentrate at a few critical token positions due to erroneous processing heads (ep heads) — attention heads that amplify incorrect trajectories while suppressing correct ones. We propose a lightweight test-time intervention that dynamically deactivates ep heads, consistently improving hop generalization across tasks and LLMs.

A³E: Towards Compositional Model Editing
Hongming Piao, Hao Wang, Dapeng Oliver Wu*, Ying Wei*
Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025
pdf / code

Existing model editing methods are evaluated one edit at a time, overlooking compositional model editing (CME) where multiple edits must be jointly integrated to answer multifaceted questions. This work benchmarks CME and identifies three undesirable issues — knowledge loss, incorrect preceding, and knowledge sinking. We propose A³E, which adaptively combines and regularizes pre-trained knowledge during edit training, and adaptively merges multiple edits during edit composing, improving composability by at least 22.45% without sacrificing single-edit performance.

Curriculum Model Merging: Harmonizing Chemical LLMs for Enhanced Cross-Task Generalization
Baoyi He, Luotian Yuan, Ying Wei*, Fei Wu*
Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025
pdf / code

Existing chemical LLMs are typically tailored to narrow tasks, and merging them into a unified model is uniquely difficult due to significant disparities among chemical experts and a highly imbalanced distribution across downstream tasks. This work proposes Curriculum Model Merging (CMM), which progressively merges expert chemical LLMs in a moderate and continual manner to harmonize inconsistencies while preserving domain-specific expertise. CMM outperforms state-of-the-art merging methods by 29.03% in overall average performance, generalizing robustly across prediction and generative tasks.

Data Selection Matters: Towards Robust Instruction Tuning of Large Multimodal Models
Xu Yang, Chen Liu*, Ying Wei*
Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025
pdf / code

Selecting a compact subset of visual instruction-following data is an effective alignment strategy for large multimodal models, but both full-data training and existing selection methods tend to inherit dataset biases such as position bias and spurious correlations. This work introduces ARDS, a robustness-aware selection framework that first identifies worst-case evaluation subgroups via task-specific perturbations, then prioritizes samples semantically closer to these subgroups. ARDS substantially boosts both robustness and data efficiency, with robust mixtures that transfer effectively to larger architectures.

What Makes a Good Reasoning Chain? Uncovering Structural Patterns in Long Chain-of-Thought Reasoning
Gangwei Jiang, Yahui Liu, Zhaoyi Li, Victoria W., Fuzheng Zhang, Linqi Song*, Ying Wei*, Defu Lian*
The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
pdf / code

While Long Chain-of-Thought (LCoT) reasoning has unlocked expert-level LLM performance, how the internal structure of these reasoning chains drives final-answer correctness remains underexplored. This work presents LCoT2Tree, an automated framework that converts sequential LCoTs into hierarchical tree structures and analyzes them via graph neural networks. The extracted structural patterns (exploration, backtracking, verification) serve as stronger predictors of final performance than length-based features, and improve downstream applications such as Best-of-N decoding.

Automatic Expert Discovery in LLM Upcycling via Sparse Interpolated Mixture-of-Experts
Shengzhuang Chen, Ying Wei*, Jonathan Richard Schwarz*
Sixty-third Annual Meeting of the Association for Computational Linguistics (ACL), 2025
pdf / code (oral)

Existing dense LLMs lack efficient mechanisms for specializing across multiple domains without excessive computational overhead. This work proposes SIMoE, an end-to-end method that fine-tunes dense LLMs into sparse, domain-specialized MoEs. SIMoE identifies structurally sparse domain-specific experts and learns a router network for input-dependent expert merging. This achieves superior generalization and state-of-the-art instruction-tuning performance while maintaining optimal computational efficiency.

Come Together, But Not Right Now: A Progressive Strategy to Boost Low-Rank Adaptation
Zhan Zhuang, Xiequn Wang, Wei Li, Yulong Zhang, Qiushi Huang, Shuhao Chen, Xuehao Wang, Yanbin Wei, Yuhe Nie, Kede Ma, Yu Zhang*, Ying Wei*
Forty-second International Conference on Machine Learning (ICML), 2025
pdf / code

Current LoRA fine-tuning often traps LoRAs near suboptimal initializations, limiting model generalization and downstream adapter manipulation (merging/pruning). This work introduces CoTo, which progressively increases LoRA activation during training. This stochastic method promotes better optimization, stability, and exploration. Theoretically, CoTo supports dropout stability and linear mode connectivity, measuring LoRAs' contributions through cooperative-game analysis. Experiments show CoTo improves performance, merging accuracy, pruning robustness, and training efficiency, and is compatible with diverse LoRA methods.

Reaction Graph: Towards Reaction-Level Modeling for Chemical Reactions with 3D Structures
Yingzhao Jian, Yue Zhang, Ying Wei, Hehe Fan, Yi Yang
Forty-second International Conference on Machine Learning (ICML), 2025
pdf / code

Current AI approaches excel in single-molecule property predictions but largely overlook modeling intermolecular interactions, especially chemical reactions. This paper introduces Reaction Graph (RG), a unified representation capturing both reactants and products' molecular structures and interatomic relationships in chemical reactions, explicitly incorporating 3D molecular information. RG significantly improves performance on reaction classification, condition prediction, and yield prediction tasks, achieving state-of-the-art accuracy across multiple datasets.

Unlocking the Power of Function Vectors for Characterizing and Mitigating Catastrophic Forgetting in Continual Instruction Tuning
Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian*, Ying Wei*
Thirteenth International Conference on Learning Representations (ICLR), 2025
pdf / code (oral)

Current studies on catastrophic forgetting in LLMs mainly analyze a single training sequence and overlook how different tasks and model architectures influence forgetting, lacking a deeper understanding of underlying mechanisms. This work introduces function vectors (FVs) to interpret and quantify forgetting in LLMs, revealing that CF stems from biased function activation rather than task overwriting. Based on this insight, the authors propose an FV-guided training method with regularization to stabilize FVs and reduce forgetting. The approach is theoretically grounded and empirically validated across four benchmarks.

SD-LoRA: Scalable Decoupled Low-Rank Adaptation for Class Incremental Learning
Yichen Wu, Hongming Piao, Long-Kai Huang, Renzhen Wang, Hanspeter Pfister, Deyu Meng, Kede Ma*, Ying Wei*
Thirteenth International Conference on Learning Representations (ICLR), 2025
pdf / code (oral)

The paper tackles the scalability limitations of existing continual learning methods with foundation models—specifically, the need to expand prompt/LoRA pools or store task data, which becomes inefficient as the number of tasks increases. The paper introduces SD-LoRA, a scalable continual learning method that avoids storing past data or expanding model components. It separates how LoRA learns direction and magnitude to improve stability and flexibility. The method supports efficient end-to-end training and inference, and performs strongly across multiple benchmarks and foundation models.
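One way to picture the separation of direction and magnitude is sketched below: the low-rank product B @ A is normalized to unit Frobenius norm (the direction) and then rescaled by a separate scalar (the magnitude). This is an assumption-laden toy in pure Python, not the SD-LoRA implementation; `decoupled_update` and the example matrices are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def frobenius_norm(m):
    return math.sqrt(sum(v * v for row in m for v in row))

def decoupled_update(b, a, magnitude):
    """Scale the unit-norm direction of the low-rank product B @ A."""
    delta = matmul(b, a)
    norm = frobenius_norm(delta)
    if norm == 0.0:
        raise ValueError("B @ A is zero, so it has no direction")
    return [[magnitude * v / norm for v in row] for row in delta]

B = [[1.0], [2.0]]          # 2 x 1 factor
A = [[3.0, 4.0]]            # 1 x 2 factor
update = decoupled_update(B, A, magnitude=0.5)
print(frobenius_norm(update))  # equals the chosen magnitude up to rounding
```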

CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models
Shengzhuang Chen, Yikai Liao, Xiaoxiao Sun, Kede Ma*, Ying Wei*
Thirteenth International Conference on Learning Representations (ICLR), 2025
pdf / code

The paper addresses two key issues in continual learning with foundation models: data contamination risks in large-scale pre-training datasets, and the static, overly simplistic nature of standard benchmarks that fail to reflect real-world complexities, leading to poor robustness and overfitting. It introduces CLDyB, a dynamic benchmarking framework that generates challenging task sequences via tree-search to better evaluate continual learning methods. CLDyB identifies performance gaps in current methods, offers insight into their strengths and weaknesses, and provides the community with reusable, high-difficulty benchmarks for more robust CL evaluation.

Time-Varying LoRA: Towards Effective Cross-Domain Fine-Tuning of Diffusion Models
Zhan Zhuang, Yulong Zhang, Xuehao Wang, Jiangang Lu, Ying Wei*, Yu Zhang*
Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
pdf / code

This paper introduces Terra, a novel Time-varying low-rank adapter that offers a fine-tuning framework for domain flow generation. Terra constructs a continuous parameter manifold via a time variable, with its expressive power theoretically analyzed. This domain flow generation framework flexibly supports both unsupervised domain adaptation and domain generalization, achieving state-of-the-art performance by generating interpolated domains with varying styles to bridge the gap between source and target domains.

Learning Where to Edit Vision Transformers
Yunqiao Yang, Long-Kai Huang, Shengzhuang Chen, Kede Ma, Ying Wei*
Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
pdf / code

This work addresses the gap in model editing for vision models by (1) curating two benchmarks that existing pre-trained ViTs struggle to predict correctly, and (2) correcting the predictive errors of ViTs, particularly those arising from subpopulation shifts. We propose a learning-to-learn approach that identifies a small set of critical parameters for editing in response to erroneous samples, with the locations of these parameters determined by a hypernetwork. By simulating the edit process and explicitly optimizing for edit success, the hypernetwork is trained to output reliable and generalizable editing locations. Additionally, the sparsity constraint imposed on the hypernetwork ensures that edits are localized, without distorting irrelevant parameters.

DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs
Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun*, Ying Wei*
Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
pdf / code (oral)

This work tackles Massive Outliers (outlier activations) in LLMs that lead to significant performance degradation in low-bit quantization. We introduce DuQuant, a novel approach that utilizes rotation and permutation transformations to more effectively mitigate massive outliers in addition to normal ones. DuQuant outperforms the state-of-the-art baselines across various sizes and types of LLMs on multiple tasks, even with 4-bit weight-activation quantization.
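To see why a rotation can help, the toy sketch below applies the same 2-D rotation to an activation vector and a weight vector. The rotation spreads a single large outlier across both channels while leaving their inner product unchanged. This only illustrates the general idea of orthogonal transformations; it is not how DuQuant constructs its rotation or permutation matrices, and all values are made up.

```python
import math

def rotate(vec, theta):
    """Apply a 2-D rotation, which is an orthogonal transformation."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return [c * x - s * y, s * x + c * y]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

activation = [10.0, 0.0]    # one channel carries a large outlier
weight = [0.3, -0.2]
theta = math.pi / 4

rot_act = rotate(activation, theta)
rot_w = rotate(weight, theta)

print(max(abs(v) for v in rot_act))                   # about 7.07 instead of 10
print(dot(activation, weight), dot(rot_act, rot_w))   # both about 3.0
```

A smaller peak magnitude means a low-bit quantization grid wastes less of its range on one extreme value.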

Mixture of Adversarial LoRAs: Boosting Robust Generalization in Meta-tuning
Xu Yang, Chen Liu, Ying Wei*
Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
pdf / code

This work introduces AMT, an Adversarial Meta-Tuning methodology designed to enhance the robust generalization of pre-trained models for out-of-domain (OOD) few-shot learning. The core innovation of AMT is a robust LoRAPool, which consists of LoRAs meta-tuned with dual perturbations on both inputs and singular values/vectors across varying robustness levels. Extensive evaluations demonstrate that AMT significantly outperforms previous state-of-the-art methods across a range of OOD few-shot image classification tasks.

Mitigating the Language Mismatch and Repetition Issues in LLM-based Machine Translation via Model Editing
Weichuan Wang, Zhaoyi Li, Defu Lian, Chen Ma, Linqi Song*, Ying Wei*
The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
pdf

This work targets two major translation errors encountered by current LLMs — language mismatch and repetition — through model editing methods. We find that direct application of localization-based edits either yields limited impact or negatively affects general translation quality. To address this, we refine the identified components by intersecting localization results across languages, filtering out irrelevant information. Experiments show our approach effectively reduces these errors while preserving or improving overall translation quality.

Understanding and Patching Compositional Reasoning in LLMs
Zhaoyi Li, Gangwei Jiang, Hong Xie, Linqi Song, Defu Lian*, Ying Wei*
Sixty-second Annual Meeting of the Association for Computational Linguistics (ACL) Findings, 2024
pdf / code

This paper is among the first to reveal that, in LLMs, implicit reasoning results indeed surface within middle layers and play a causative role in shaping the final explicit reasoning results. These findings enable us to develop CREME, a lightweight method to patch errors in compositional reasoning via editing the located MHSA modules. Our empirical evidence attests to CREME’s effectiveness, paving the way for autonomously and continuously enhancing compositional reasoning capabilities in LLMs.

Benchmarking and Improving Compositional Generalization of Multi-aspect Controllable Text Generation
Tianqi Zhong, Zhaoyi Li, Quan Wang, Linqi Song, Ying Wei, Defu Lian, Zhendong Mao*
Sixty-second Annual Meeting of the Association for Computational Linguistics (ACL), 2024
pdf / code

This paper proposes CompMCTG, a benchmark encompassing diverse multi-aspect labeled datasets and a crafted three-dimensional evaluation protocol to holistically evaluate the compositional generalization of multi-aspect controllable text generation (MCTG) approaches. It also proposes Meta-MCTG, a meta-learning-inspired framework that mitigates the noticeable performance drop of existing MCTG approaches in compositional generalization.

Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts
Shengzhuang Chen, Jihoon Tack, Yunqiao Yang, Yee Whye Teh, Jonathan Richard Schwarz, Ying Wei*
Forty-first International Conference on Machine Learning (ICML), 2024
pdf / code

This paper addresses the so-far limited success of meta-tuning, especially on out-of-domain (OOD) tasks, where meta-tuning is a subsequent optimization stage for foundation models that attempts to harness the best of both parameter-efficient fine-tuning and meta-learning. The proposed approach, Sparse MetA-Tuning (SMAT), is trained to automatically isolate subsets of pre-trained parameters for meta-tuning on each task; it successfully overcomes OOD sensitivity and delivers on the promise of enhancing the transfer abilities of vision foundation models beyond parameter-efficient fine-tuning.

One Meta-tuned Transformer is What You Need for Few-shot Learning
Xu Yang, Huaxiu Yao, Ying Wei*
Forty-first International Conference on Machine Learning (ICML), 2024
pdf / code (spotlight)

This paper introduces MetaFormer, a new meta-tuning framework exclusively based on attention. MetaFormer enhances the few-shot learning capacity of vision transformers by integrating both sample and task relationships into the model, which includes Masked Sample Attention for embedding sample relationships and Patch-grained Task Attention for encapsulating task relationships. MetaFormer demonstrates coherence and compatibility with off-the-shelf pre-trained vision transformers and shows significant improvements in both inductive and transductive few-shot learning scenarios.

Mitigating Catastrophic Forgetting in Online Continual Learning by Modeling Previous Task Interrelations
Yichen Wu, Hong Wang, Peilin Zhao, Yefeng Zheng, Ying Wei*, Long-Kai Huang*
Forty-first International Conference on Machine Learning (ICML), 2024
pdf / code

This work reformulates replay-based continual learning methods as a unified framework, upon which we design a Pareto-Optimized CL algorithm (POCL) that leverages Pareto optimization to capture the interrelationship among previously learned tasks. POCL thus effectively enhances the overall performance of past tasks while ensuring the performance of the current task, further alleviating catastrophic forgetting.

Federated Continual Learning via Prompt-based Dual Knowledge Transfer
Hongming Piao, Yichen Wu, Dapeng Wu, Ying Wei*
Forty-first International Conference on Machine Learning (ICML), 2024
pdf / code

This paper introduces the Prompt-based Knowledge Transfer FCL (PKT-FCL) algorithm that promotes positive knowledge transfer across tasks and clients, which has been overlooked before in federated continual learning. PKT-FCL not only reduces communication costs but also addresses privacy concerns through a novel approach for prompt generation and aggregation, showing superior performance in comprehensive experimental evaluations.

MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
Haokun Lin, Haoli Bai, Zhili Liu, Lu Hou, Muyi Sun, Linqi Song, Ying Wei*, Zhenan Sun*
IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024
pdf / code

This paper addresses the challenge of deploying large vision-language pre-trained models on platforms with limited computational resources by introducing a new metric, Module-wise Pruning Error (MoPE), which quantifies the impact of module removal on cross-modal task performance. Utilizing the MoPE metric, we propose a unified pruning framework that applies to both pre-training and fine-tuning stages, effectively compressing vision-language models while preserving their performance capabilities.

Meta Continual Learning Revisited: Implicitly Enhancing Online Hessian Approximation via Variance Reduction
Yichen Wu, Long-Kai Huang, Renzhen Wang, Deyu Meng, Ying Wei*
Twelfth International Conference on Learning Representations (ICLR), 2024 (Outstanding Honorable Mention / oral)
pdf / code

This study revisits Meta-Continual Learning (Meta-CL) and, for the first time, bridges Meta-CL with regularization-based methods. Concretely, Meta-CL implicitly approximates the Hessian in an online manner, which enjoys the benefits of timely adaptation but meanwhile suffers from high variance induced by random memory buffer sampling. We are thus motivated to combine the best of both worlds by proposing Variance Reduced Meta-CL (VR-MCL), which achieves both timely and accurate Hessian approximation.

Gradual Domain Adaptation via Gradient Flow
Zhan Zhuang, Yu Zhang*, Ying Wei*
Twelfth International Conference on Learning Representations (ICLR), 2024 (spotlight)
pdf / code

To address the challenge of ineffective intermediate domains in gradual domain adaptation (GDA), this work explores gradient flow to generate label-preserving intermediate domains, thereby enabling a fine-tuning method for GDA. We employ the Wasserstein gradient flow of the Kullback–Leibler divergence to transport samples from the source to the target domain. To simulate the dynamics, we utilize the Langevin algorithm. Since the Langevin algorithm disregards label information and introduces diffusion noise, we introduce classifier-based and sample-based potentials to avoid label switching and dramatic deviations in the sampling process.
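As background for the Langevin step mentioned above, here is a minimal sketch of the plain (unadjusted) Langevin algorithm on a one-dimensional toy potential. It omits the classifier-based and sample-based potentials described in the summary; the potential, step size, and function name are assumptions chosen only for illustration.

```python
import math
import random

def langevin_sample(grad_potential, x0, step, n_steps, rng):
    """Unadjusted Langevin: x <- x - step * grad U(x) + sqrt(2 * step) * noise."""
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step * grad_potential(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples

# Toy potential U(x) = x^2 / 2, whose target density is a standard normal.
rng = random.Random(0)
samples = langevin_sample(lambda x: x, x0=3.0, step=0.05, n_steps=20000, rng=rng)
kept = samples[2000:]                      # discard burn-in
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(round(mean, 2), round(var, 2))
```

The drift term pulls samples toward low-potential regions while the noise term keeps them spread out, so the chain's mean and variance should land near 0 and 1 for this potential.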

Active Retrosynthetic Planning Aware of Route Quality
Luotian Yuan, Yemin Yu, Ying Wei*, Yongwei Wang, Zhihua Wang, Fei Wu*
Twelfth International Conference on Learning Representations (ICLR), 2024
pdf / code

This study addresses the long-standing challenge of route quality evaluation in retrosynthetic planning through an Active Retrosynthetic Planning (ARP) framework that requires only minimal annotation from chemists. ARP, which remains compatible with established retrosynthetic planners, trains an actor that decides whether to query the quality of a reaction and relies on a critic to estimate the value of a molecule with its preceding reaction quality as input. On both the benchmark and an expert dataset, ARP outperforms the existing state-of-the-art approach by 6.2% in route quality while reducing the query cost by 12.8%.

RetroOOD: Understanding Out-of-Distribution Generalization in Retrosynthesis Prediction
Yemin Yu, Luotian Yuan, Ying Wei*, Hanyu Gao, Xinhai Ye, Zhihua Wang, Fei Wu
Thirty-eighth Annual AAAI Conference on Artificial Intelligence (AAAI), 2024
pdf / code

Despite steady progress of existing retrosynthesis methods on standard benchmarks, our understanding of them under the premise of distribution shifts remains stagnant. This study fills in the gap by (1) formally sorting out two types of distribution shifts in retrosynthesis prediction, (2) constructing two groups of benchmark datasets, (3) conducting comprehensive experiments to reveal the limitations of previous in-distribution evaluation and state-of-the-art methods, and (4) proposing two model-agnostic techniques that can improve the OOD generalization of arbitrary off-the-shelf retrosynthesis prediction algorithms.

Towards Anytime Fine-tuning: Continually Pre-trained Language Models with Hypernetwork Prompts
Gangwei Jiang, Caigao Jiang, Siqiao Xue, James Y. Zhang, Jun Zhou, Defu Lian*, Ying Wei*
2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
pdf

This study first investigates "anytime fine-tuning" effectiveness of existing continual learning approaches, concluding with unanimously decreased performance on unseen domains. To this end, we propose a prompt-guided continual pre-training method, where we train a hypernetwork to generate domain-specific prompts by both agreement and disagreement losses. Our method achieves improvements of 3.57% and 3.4% on two real-world datasets (including domain shift and temporal shift), respectively.

Secure Out-of-Distribution Task Generalization with Energy-Based Models
Shengzhuang Chen, Long-Kai Huang, Jonathan Richard Schwarz, Yilun Du, Ying Wei*
Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023
pdf / code

In this work, we propose a single coherent framework named Energy-Based Meta-Learning (EBML) that supports both detection and adaptation of OOD tasks, while remaining compatible with off-the-shelf meta-learning backbones. EBML learns to characterize any arbitrary meta-training task distribution with the composition of two expressive neural-network-based energy functions. We deploy the sum of the two energy functions, being proportional to the joint distribution of a task, as a reliable score for detecting OOD tasks; during meta-testing, we adapt the OOD task to in-distribution tasks by energy minimization.

Does Continual Learning Meet Compositionality? New Benchmarks and An Evaluation Framework
Weiduo Liao, Ying Wei*, Mingchen Jiang, Qingfu Zhang*, Hisao Ishibuchi*
Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023
Track on Datasets and Benchmarks
pdf / code

We present two vision benchmarks, namely Compositional GQA (CGQA) and Compositional OBJects365 (COBJ), along with a novel evaluation framework called Compositional Few-Shot Testing (CFST). Comprehensive empirical results on systematicity, productivity, and substitutivity aspects of compositional generalization demonstrate that current continual learning techniques do exhibit somewhat favorable compositionality in their learned feature extractors, while future research on modularity is urgently needed.

Concept-wise Fine-tuning Matters in Preventing Negative Transfer
Yunqiao Yang, Long-Kai Huang, Ying Wei*
IEEE/CVF International Conference on Computer Vision (ICCV), 2023
pdf / code

We propose a Concept-wise fine-Tuning (Concept-Tuning) approach which refines feature representations at the level of patches, with each patch encoding a concept. Concept-Tuning minimizes the negative impacts of rare features and spuriously correlated features in a pre-trained model by (1) maximizing the mutual information between examples in the same category with regard to a slice of rare features (a patch) and (2) applying front-door adjustment via attention neural networks in channels and feature slices (patches).

Learning to Substitute Spans towards Improving Compositional Generalization
Zhaoyi Li, Ying Wei*, Defu Lian*
Sixty-first Annual Meeting of the Association for Computational Linguistics (ACL), 2023 (oral)
pdf / code

This work introduces a compositional data augmentation approach that incurs additional compositional inductive bias to pre-trained models. We first propose a novel compositional augmentation strategy dubbed Span Substitution (SpanSub) that enables multi-grained composition of substantial substructures in the whole training set. Over and above that, we introduce the Learning to Substitute Span (L2S2) framework which empowers the learning of span substitution probabilities in SpanSub in an end-to-end manner by maximizing the loss of neural sequence models, so as to outweigh those challenging compositions with elusive concepts and novel surroundings.

Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective
Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, Kede Ma
IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023
pdf

We develop a general and automated multitask learning scheme for blind image quality assessment to exploit auxiliary knowledge from other tasks, in a way that the model parameter sharing and the loss weighting are determined automatically. Specifically, we first describe all candidate label combinations (from multiple tasks) using a textual template, and compute the joint probability from the cosine similarities of the visual-textual embeddings in CLIP. Predictions of each task can be inferred from the joint distribution, and optimized by carefully designed loss functions.

Learning Chemical Rules of Retrosynthesis with Pre-training
Yinjie Jiang, Ying Wei*, Fei Wu*, Zhengxing Huang, Kun Kuang, Zhihua Wang
Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI), 2023
pdf

In the burgeoning research area of AI-aided retrosynthesis, we propose a pre-training solution to address the pronounced remaining issue regarding template-free methods, i.e., failing to conform to chemical rules. Concretely, we enforce the atom conservation rule via a molecule reconstruction pre-training task, and the reaction rule that dictates reaction centers via a reaction type guided contrastive pre-training task. Our empirical results show that the pre-training solution significantly boosts the single-step retrosynthesis accuracies.

Adversarial Task Up-sampling for Meta-learning
Yichen Wu, Long-Kai Huang*, Ying Wei*
36th Conference on Neural Information Processing Systems (NeurIPS), 2022 (spotlight)
pdf / code

This work, named Adversarial Task Up-sampling (ATU), advances the augmentation of sufficiently imaginary meta-training tasks with a task-correctness guarantee by up-sampling meta-training tasks from the task manifold via a task up-sampling network. ATU also generates tasks that can maximally contribute to the latest meta-learner by maximizing an adversarial loss.

Improving Task-Specific Generalization in Few-Shot Learning via Adaptive Vicinal Risk Minimization
Long-Kai Huang, Ying Wei*
36th Conference on Neural Information Processing Systems (NeurIPS), 2022 (spotlight)
pdf

This work focuses on improving task-specific generalization in the meta-testing stage, where we derive the vicinal loss function that approximates the true task distribution with an aggregation of per-sample Gaussian-like vicinal distributions. We estimate the statistical parameters of the vicinal distribution for each training sample by (1) initiating a random walk from the sample and (2) computing the weighted mean and variance of the unlabeled data visited by the walk. The proposed method outperforms state-of-the-art few-shot learning baselines on four benchmarks.

GRASP: Navigating Retrosynthetic Planning with Goal-driven Policy
Yemin Yu, Ying Wei*, Kun Kuang, Zhengxing Huang, Huaxiu Yao, Fei Wu*
36th Conference on Neural Information Processing Systems (NeurIPS), 2022
pdf / code

Retrosynthetic planning, which aims to find synthetic pathways of a target molecule through a sequential decision-making process on a set of feasible reactions, occupies a crucial position in synthetic chemistry and, accordingly, drug discovery. This work, named the Goal-dRiven Actor-critic retroSynthetic Planning (GRASP) framework, first (1) formulates retrosynthetic planning within a reinforcement learning framework, which enjoys more efficient and accurate value estimation of a molecule, and (2) achieves goal-driven retrosynthesis navigation toward a user-demand objective.

Frustratingly Easy Transferability Estimation
Long-Kai Huang, Junzhou Huang, Qiang Yang, Ying Wei*
39th International Conference on Machine Learning (ICML), 2022
pdf / code

We propose a simple (10 lines of code), efficient (a single pass over the examples of a target task), yet effective (on 26 pre-trained models and 16 downstream tasks) transferability measure named TransRate for fine-tuning pre-trained models. TransRate measures the mutual information between the features of target examples extracted by a pre-trained model and their labels, which we estimate using the coding rate.
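The idea of estimating mutual information through coding rates can be illustrated with a minimal sketch. It assumes the coding rate takes the form R(Z) = 0.5 * logdet(I + Z^T Z / (n * eps)) and that the class-conditional term weights each class by its share of the examples; the function names, the default `eps`, and these normalization choices are assumptions made for illustration and may differ from the released code.

```python
import numpy as np

def coding_rate(Z, eps=1e-4):
    """Coding rate of features Z (n x d): 0.5 * logdet(I + Z^T Z / (n * eps))."""
    n, d = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (Z.T @ Z) / (n * eps))
    return 0.5 * logdet

def transrate(Z, y, eps=1e-4):
    """Sketch of a TransRate-style score: R(Z) - sum_c (n_c / n) * R(Z_c).

    Z: (n, d) features of target examples from a pre-trained model.
    y: (n,) integer class labels of the same examples.
    """
    Z = Z - Z.mean(axis=0, keepdims=True)  # center features once, globally
    n = Z.shape[0]
    rate_all = coding_rate(Z, eps)
    rate_given_y = 0.0
    for c in np.unique(y):
        Zc = Z[y == c]  # features belonging to class c
        rate_given_y += (Zc.shape[0] / n) * coding_rate(Zc, eps)
    return rate_all - rate_given_y
```

Under these assumptions the score is larger when classes occupy different feature subspaces and close to zero when labels are unrelated to the features.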

The Role of Deconfounding in Meta-learning
Yinjie Jiang, Zhengyu Chen, Kun Kuang*, Luotian Yuan, Xinhai Ye, Zhihua Wang, Fei Wu*, Ying Wei*
39th International Conference on Machine Learning (ICML), 2022
pdf

This work offers a novel causal perspective on meta-learning, through which we explain the memorization effect as a confounder and frame previous anti-memorization methods as different deconfounder approaches. Derived from the causal inference principle of front-door adjustment, we propose two frustratingly easy but effective deconfounder algorithms.

Artificial Intelligence for Retrosynthesis Prediction
Yinjie Jiang, Yemin Yu, Ming Kong, Yu Mei, Luotian Yuan, Zhengxing Huang, Kun Kuang, Zhihua Wang, Huaxiu Yao, James Zou, Connor W. Coley, Ying Wei*
Engineering, 2022
pdf

In recent years, there has been a dramatic rise in interest in retrosynthesis prediction with AI techniques. This survey describes the current landscape of AI-driven retrosynthesis prediction, including (1) formal definitions of the retrosynthesis problem, (2) the outstanding research challenges therein, (3) related AI techniques and recent progress that enable retrosynthesis prediction, (4) a novel landscape that provides a comprehensive categorization of different retrosynthesis prediction components, (5) how AI reshapes each component, and (6) promising areas for future research.

Disentangling Task Relations for Few-shot Text Classification via Self-Supervised Hierarchical Task Clustering
Juan Zha, Zheng Li, Ying Wei, Yu Zhang
2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
pdf

This work proposes Self-Supervised Hierarchical Task Clustering (SS-HTC) to improve few-shot text classification (FSTC). SS-HTC customizes cluster-specific knowledge by dynamically organizing heterogeneous tasks into different clusters at hierarchical levels, and also disentangles the underlying relations between tasks to improve interpretability. Extensive experiments on five public FSTC benchmark datasets demonstrate the effectiveness of SS-HTC.

Self-supervised Text Erasing with Controllable Image Synthesis
Gangwei Jiang, Shiyao Wang, Tiezheng Ge, Yuning Jiang, Ying Wei, Defu Lian
30th ACM International Conference on Multimedia (MM), 2022
pdf

This work studies a novel self-supervised text erasing framework to alleviate the heavy reliance on costly annotations. Specifically, we propose a style-aware image synthesis function that generates synthetic images with diversely styled text, and a policy network that controls the synthesis mechanisms to bridge the text style gap between synthetic and real-world data. We have also constructed a new dataset called PosterErase.

Meta-learning with an Adaptive Task Scheduler
Huaxiu Yao, Yu Wang, Ying Wei*, Peilin Zhao, Mehrdad Mahdavi, Defu Lian, Chelsea Finn
35th Conference on Neural Information Processing Systems (NeurIPS), 2021
pdf / code

This work develops an adaptive task scheduler for meta-learning when meta-training tasks are limited in number and likely to be detrimental because of noise or imbalance. For the first time, we design a neural scheduler that decides which meta-training tasks to use next, and we train the scheduler to optimize the generalization capacity of the meta-knowledge to unseen tasks. We show that such a scheduler theoretically improves the optimization landscape and empirically outperforms conventional schedulers (including the commonly adopted random sampling).

Functionally Regionalized Knowledge Transfer for Low-resource Drug Discovery
Huaxiu Yao, Ying Wei*, Long-Kai Huang, Ding Xue, Junzhou Huang, Zhenhui Li
35th Conference on Neural Information Processing Systems (NeurIPS), 2021
pdf

This paper seeks to remedy the lack of labeled compounds with activities (ADMET properties) in virtual screening (lead optimization) of drugs by transferring knowledge from previous assays, namely in-vivo experiments, collected by different laboratories and against various target proteins. We propose a functionally regionalized meta-learning algorithm with architectural compositional capability, which accommodates wildly different assays while capturing the relationships between them.

MetaTS: Meta Teacher-Student Network for Multilingual Sequence Labeling with Minimal Supervision
Zheng Li, Danqing Zhang, Tianyu Cao, Ying Wei, Yiwei Song, Bing Yin
2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
pdf

We explore multilingual sequence labeling with a single unified model for multiple languages and minimal supervision. Specifically, we resort to the teacher-student framework to leverage large amounts of multilingual unlabeled data. We propose a meta teacher-student (MetaTS) network that allows the teacher to dynamically adapt its pseudo-annotation strategies based on the student's feedback on the generated pseudo-labeled data of each language.

Meta-learning Hyperparameter Performance Prediction with Neural Processes
Ying Wei, Peilin Zhao, Junzhou Huang
38th International Conference on Machine Learning (ICML), 2021
pdf / code

We transfer knowledge from historical hyperparameter optimization (HPO) trials on other datasets to speed up HPO on a huge dataset where even a single trial is costly. The proposed meta-learning algorithm is the first to introduce neural processes (NPs) as a surrogate model, which enables the simultaneous transfer of trial observations, parameters of NPs, and initial hyperparameter configurations.

Improving Generalization in Meta-learning via Task Augmentation
Huaxiu Yao, Long-Kai Huang, Linjun Zhang, Ying Wei*, Li Tian, James Zou, Junzhou Huang, Zhenhui Li
38th International Conference on Machine Learning (ICML), 2021
pdf / arXiv / code

This work addresses the meta-overfitting problem by augmenting as many tasks as possible. Concretely, we propose two criteria for valid task augmentation and two task augmentation methods that satisfy them. Theoretical studies and empirical results both demonstrate that the proposed task augmentation strategies significantly mitigate meta-overfitting. The strategies also remain compatible with any advanced meta-learning algorithm.

Learn to Cross-lingual Transfer with Meta Graph Learning Across Heterogeneous Languages
Zheng Li, Mukul Kumar, William Headden, Bing Yin, Ying Wei, Yu Zhang, Qiang Yang
2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
pdf

This work focuses on the problem of cross-lingual transfer (CLT). For each CLT task, we formulate the transfer process as information propagation over a dynamic graph. More importantly, we improve the transfer effectiveness by extracting meta-knowledge such as propagation strategies from previous CLT experiences.

Self-Supervised Graph Transformer on Large-Scale Molecular Data
Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, Junzhou Huang
34th Annual Conference on Neural Information Processing Systems (NeurIPS), 2020
pdf

We propose a novel framework, GROVER, for effective molecular representation, which is a crucial prerequisite in AI-driven drug design and discovery. GROVER learns to characterize molecules with rich semantic features from enormous unlabeled molecular data, using carefully designed self-supervised tasks at the node, edge, and graph levels. Besides, GROVER itself is more expressive, integrating Message Passing Networks into the Transformer-style architecture.

Adversarial Sparse Transformer for Time Series Forecasting
Sifan Wu, Xi Xiao, Qianggang Ding, Peilin Zhao, Ying Wei, Junzhou Huang
34th Annual Conference on Neural Information Processing Systems (NeurIPS), 2020
pdf

Existing time series forecasting methods fail to either capture stochasticity of data or forecast for a long time horizon due to error accumulation. In this work, we are motivated to address the two issues with a novel time series forecasting model. The model, Adversarial Sparse Transformer (AST), based on GAN, adopts a sparse Transformer as the generator to learn a sparse attention map for forecasting and meanwhile takes a discriminator to improve the prediction performance at a sequence level.

Collaborative Unsupervised Domain Adaptation for Medical Image Diagnosis
Yifan Zhang, Ying Wei, Qingyao Wu, Peilin Zhao, Shuaicheng Niu, Mingkui Tan, Junzhou Huang
IEEE Transactions on Image Processing, 2020
pdf

We aim to improve unsupervised domain adaptation for medical image diagnosis from two perspectives: denoising noisy annotations that arise from limited expertise, and differentiating the adaptation difficulty of images that have significant discrepancies. The proposed method, harnessing the collective intelligence of two peer networks, achieves these goals via a noise co-adaptation layer and a transferability-aware weight for each image.

TranSlider: Transfer Ensemble Learning from Exploitation to Exploration
Kuo Zhong, Ying Wei*, Chun Yuan, Haoli Bai, Junzhou Huang
Twenty-sixth ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2020
pdf

Learning a strategy dictating what and where to transfer is key to avoiding negative transfer, yet the strategy always suffers from overfitting in light of limited annotations in a target domain. For the first time, we propose transfer ensemble learning to solve the problem. We generate a spectrum of models with decreasing transferability, ranging from pure exploitation of the source model to unconstrained exploration for the target domain.

Graph Few-shot Learning via Knowledge Transfer
Huaxiu Yao, Chuxu Zhang, Ying Wei*, Meng Jiang, Suhang Wang, Junzhou Huang, Nitesh Chawla, Zhenhui Li
Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2020
pdf

For the first time, we attack the problem of semi-supervised node classification by transferring the knowledge learned from historical graphs. We propose a novel meta-learning algorithm on graphs instead of i.i.d. data. We learn a transferable metric space for node similarity, where two embedding functions encoding both local and global structures are learned from previous graphs.

Transferable Neural Processes for Hyperparameter Optimization
Ying Wei, Peilin Zhao, Huaxiu Yao, Junzhou Huang
The Meta Learning Workshop at NeurIPS, 2019
arXiv

Conventional hyperparameter optimization (HPO) algorithms require a considerable number of hyperparameter evaluation trials, which impedes their success in wider applications where a single trial on a huge dataset is often costly. We therefore speed up HPO by transferring knowledge from historical HPO trials on other datasets. The proposed meta-learning algorithm introduces dataset-aware attention to identify the most similar datasets, and is the first to transfer trial observations, neural process parameters, and initial hyperparameter configurations collectively from these datasets.

Transferable End-to-End Aspect-based Sentiment Analysis with Selective Adversarial Learning
Zheng Li, Xin Li, Ying Wei, Lidong Bing, Yu Zhang, Qiang Yang
2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
pdf / code

Jointly extracting aspects and sentiments for sentiment classification requires a considerable number of labeled sentences, which are highly labor-intensive to obtain. We alleviate the problem via unsupervised domain adaptation from a sufficiently labeled domain. We propose a novel selective adversarial learning method to learn correlation vectors between aspects and sentiments and attentively transfer them across domains.

From Whole Slide Imaging to Microscopy: Deep Microscopy Adaptation Network for Histopathology Cancer Image Classification
Yifan Zhang, Hanbo Chen, Ying Wei, Peilin Zhao, Jiezhang Cao, Xinjuan Fan, Xiaoying Lou, Hailing Liu, Jinlong Hou, Xiao Han, Jianhua Yao, Qingyao Wu, Mingkui Tan, Junzhou Huang
22nd International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2019
pdf

This work is the first to empower digital pathology image classification directly based on microscopy images. Specifically, we resort to unsupervised domain adaptation from whole slide images to remedy the lack of annotated microscopy images. In addition to inter-domain discrepancy, the proposed method resolves intra-domain discrepancy and class imbalance via entropy minimization and sample re-weighting, respectively.

Hierarchically Structured Meta-learning
Huaxiu Yao, Ying Wei*, Junzhou Huang, Zhenhui Li
36th International Conference on Machine Learning (ICML), 2019
pdf / code

We tackle a critical challenge in meta-learning, namely task uncertainty and heterogeneity, where tasks may originate from wildly different distributions. We propose a well-motivated meta-learning algorithm with hierarchical task clustering. It not only alleviates task heterogeneity via knowledge customization to different clusters of tasks, but also preserves knowledge generalization among a cluster of similar tasks.

Learning from Multiple Cities: A Meta-Learning Approach for Spatial-Temporal Prediction
Huaxiu Yao, Yiding Liu, Ying Wei, Xianfeng Tang, Zhenhui Li
The Web Conference (WWW), 2019
pdf / code

This work improves spatial-temporal prediction tasks such as traffic prediction for cities with only limited training data collected over a short period. The improvement is attributed to the knowledge transferred from other cities with sufficient data covering long periods. We are the first to introduce the meta-learning paradigm into spatial-temporal prediction, and formulate the transferable knowledge as both short-term and long-term spatial-temporal patterns, which are represented as model parameters and an explicit memory, respectively.

Exploiting Coarse-to-Fine Task Transfer for Aspect-Level Sentiment Classification
Zheng Li, Ying Wei, Yu Zhang, Xiang Zhang, Xin Li
33rd AAAI Conference on Artificial Intelligence (AAAI), 2019
pdf / code

We aim at identifying sentiment towards aspect terms in a sentence, while annotating sentences in this case is prohibitively expensive. Innovatively, we leverage knowledge from more easily accessible sentences whose sentiment is annotated to aspect categories. We propose a multi-granularity alignment network to achieve domain adaptation, which resolves both aspect granularity inconsistency and feature discrepancy between domains.

Learning to Multitask
Yu Zhang, Ying Wei, Qiang Yang
32nd Annual Conference on Neural Information Processing Systems (NeurIPS), 2018
pdf

This work is the pioneer in automatically identifying an effective multitask model for a multitask problem, empowered by a groundbreaking learning to multitask framework.

Transfer Learning via Learning to Transfer
Ying Wei, Yu Zhang, Junzhou Huang, Qiang Yang
35th International Conference on Machine Learning (ICML), 2018 (long talk)
pdf

This work opens a new door to improve transfer learning effectiveness. We propose a groundbreaking learning to transfer framework to automatically optimize what and how to transfer across domains, by taking advantage of previous transfer learning experiences.

Hierarchical Attention Transfer Network for Cross-domain Sentiment Classification
Zheng Li, Ying Wei, Yu Zhang, Qiang Yang
32nd AAAI Conference on Artificial Intelligence (AAAI), 2018
pdf / code

We aim to improve cross-domain sentiment classification from two perspectives: discovering domain-invariant emotion words of higher quality for knowledge transfer, and capturing domain-specific emotion words for sentiment classification. The proposed hierarchical attention transfer network achieves the two goals with a hierarchical attention mechanism and a non-pivots network, respectively.

Transferable Contextual Bandit for Cross-Domain Recommendation
Bo Liu, Ying Wei, Yu Zhang, Qiang Yang
32nd AAAI Conference on Artificial Intelligence (AAAI), 2018
pdf

Though contextual bandits effectively solve the exploitation-exploration dilemma in recommendation systems, they suffer from over-exploration in the cold-start scenario. This work is the first to alleviate the problem by transferring knowledge from other domains. We propose a transferable contextual bandit policy which transfers observations to improve user interest estimation for exploitation and thus accelerates the exploration.

Learning to Transfer
Ying Wei, Yu Zhang, Qiang Yang
arXiv

Highly motivated by human beings' capabilities to reflect on transfer learning experiences, we propose a novel transfer learning framework to learn meta-knowledge from historical transfer learning experiences and apply the meta-knowledge to automatically optimize what to transfer in the future.

End-to-End Adversarial Memory Network for Cross-domain Sentiment Classification
Zheng Li, Yu Zhang, Ying Wei, Qiang Yang
26th International Joint Conference on Artificial Intelligence (IJCAI), 2017
pdf / code

This work focuses on cross-domain sentiment classification, e.g., sentiment classification of book reviews by transferring knowledge from electronics product reviews. The key here is to identify domain-invariant emotion words as the transferable knowledge. We are the first to automatically learn domain-invariant emotion words by introducing an end-to-end adversarial memory network and offer a direct visualization of them.

Deep Neural Networks for High Dimension, Low Sample Size Data
Bo Liu, Ying Wei, Yu Zhang, Qiang Yang
26th International Joint Conference on Artificial Intelligence (IJCAI), 2017
pdf

We address the problems of overfitting and high-variance gradients when training deep neural networks on high dimension but low sample size data, such as genetic data for phenotype prediction in bioinformatics. We propose a deep neural pursuit network which alleviates overfitting by selecting a subset of features and reduces variance by averaging the gradients over multiple dropouts.

Heterogeneous Translated Hashing: A Scalable Solution towards Multi-modal Similarity Search
Ying Wei, Yangqiu Song, Yi Zhen, Bo Liu, Qiang Yang
ACM Transactions on Knowledge Discovery from Data (TKDD), 10(4):36, 2016
pdf

This work provides a theoretical analysis and guarantee for the scalable heterogeneous translated hashing method which is proposed to build the correspondence between heterogeneous domains.

Transfer Knowledge between Cities
Ying Wei, Yu Zheng, Qiang Yang
22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2016
pdf

We propose the first principled approach to transfer knowledge between domains, each of which comprises multiple modalities of datasets. We conduct a case study of air quality prediction -- borrowing knowledge from the cities with sufficient annotations and data to the cities with either scarce annotations or insufficient data in any modality. The proposed method formulates the transferable knowledge as semantically related dictionaries for multiple modalities learned from a source domain and labeled examples.

Instilling Social to Physical: Co-Regularized Heterogeneous Transfer Learning
Ying Wei, Yin Zhu, Cane Wing-ki Leung, Yangqiu Song, Qiang Yang
30th AAAI Conference on Artificial Intelligence (AAAI), 2016
pdf

This work is the first to transfer knowledge from social media posts to sensors in the physical world to improve ubiquitous computing tasks such as activity recognition. We propose a co-regularized heterogeneous transfer learning model to discover the transferable feature representations that bridge two domains with heterogeneous representation structures, co-regularized by both correspondence and labels.

Scalable Heterogeneous Translated Hashing
Ying Wei, Yangqiu Song, Yi Zhen, Bo Liu, Qiang Yang
20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2014
Best Paper Finalist
pdf

Knowledge transfer between domains that lie in heterogeneous feature spaces but have no access to explicit correspondence is almost impossible. This work is the pioneer in using hashing to build the correspondence between such domains. The proposed method simultaneously learns hash functions embedding heterogeneous domains into different Hamming spaces, and a translator aligning these spaces.