About me
I am an MS student at the University of Waterloo, advised by Wenhu Chen. Concurrently, I am working as a part-time research assistant at TIGER Lab.
News
- 10/2024: Introducing MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks!
- 09/2024: Released MMMU Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark.
- 07/2024: Graduated from Zhejiang University with a Bachelor’s degree in Computer Science!🎉
- 06/2024: Thrilled to announce 📽️VideoScore, the first-ever fine-grained and reliable evaluator/reward model for text-to-video generation tasks! Check out our paper and demo now!
- 06/2024: Released II-Bench, an image implication understanding benchmark for multimodal LLMs.
- 06/2024: Released MMLU-Pro, a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models’ capabilities.
- 02/2024: Our MMMU Benchmark was accepted to CVPR’24 as a Best Paper Finalist (0.2%)!
- 01/2024: Released a new paper, “A Comprehensive Study of Knowledge Editing for Large Language Models,” along with a new benchmark, KnowEdit!
- 11/2023: Released MMMU Benchmark for massive perception, knowledge, and reasoning evaluation on large multimodal models.
- 09/2023: Started my Mathematics Exchange Program at the University of Waterloo!
Selected Publications
See full list in Publications.
Xiang Yue*, Yuansheng Ni*, Kai Zhang*, Tianyu Zheng*, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. In CVPR’24, Best Paper Finalist (0.2%) [Paper] [Project Page] [Code] [Data]
Xiang Yue*, Tianyu Zheng*, Yuansheng Ni*, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark. arXiv preprint [Paper] [Project Page] [Code] [Data]
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. In NeurIPS’24 D&B, Spotlight [Paper] [Data]
Contact
Email: yuansheng.[LAST_NAME]@uwaterloo.ca or yuansheng[LAST_NAME]11@gmail.com