Seungone Kim

Ph.D. Student at CMU LTI

seungone@cmu.edu

About

Hello! I am a Ph.D. student at the Language Technologies Institute at Carnegie Mellon University, co-advised by Graham Neubig and Sean Welleck. I am also a research intern at FAIR (Meta) (2025 - 2026). I obtained my M.S. in AI at KAIST AI, where I was fortunate to be advised by Minjoon Seo. Prior to that, I was a research intern at NAVER AI Lab and LG AI Research, and did my B.S. in CS at Yonsei University.

My research focuses on LLM Evaluation and AI for Science. In particular, I aim to develop better (1) evaluation benchmarks that systematically identify weaknesses in AI scientists and AI reviewers and (2) synthetic data generation, post-training, and inference methods for solving the most challenging problems in science and engineering domains with AI scientists.

I host weekly office hours to discuss research projects or talk about Ph.D. applications. Please sign up at this Calendly Link!

News

May 2026 Try out our CMU Paper Reviewer service!

May 2026 Our CMU Paper Reviewer and Soohak benchmark papers are released! More AI for Science papers to come out soon :)

April 2026 I've reached 3,000 citations!

April 2026 Our Evaluation-time Scaling paper is accepted at ACL 2026!

April 2026 Our RefineBench, CoT Encyclopedia, OptimalThinkingBench, and VideoJudge papers are accepted at ICLR 2026!

December 2025 I've reached 2,000 citations!

December 2025 Our RefineBench paper was selected for the Best Runner-Up Paper at the Multi-turn Interaction LLM Workshop (@ NeurIPS 2025)!

April 2025 I've reached 1,000 citations!

April 2025 Our BiGGen Bench paper was selected for the Best Paper Award at NAACL 2025!

October 2024 I received the NEC Student Research Fellowship, which will thankfully support my research on harnessing synthetic data to improve LLMs!

March 2024 I was admitted to the Carnegie Mellon University Language Technologies Institute as a Ph.D. student.

December 2023 Our FLASK paper was selected for the Honorable Mention Award at the Workshop on Instruction Tuning and Instruction Following (@ NeurIPS 2023)!

October 2022 I was admitted to KAIST AI as an M.S. student.

Education

Language Technologies Institute, Carnegie Mellon University Sep. 2024 - Present

Ph.D. in Computer Science (Advisors: Graham Neubig and Sean Welleck)

KAIST AI Mar. 2023 - Aug. 2024

M.S. in Artificial Intelligence (Advisor: Minjoon Seo)

Yonsei University Mar. 2018 - Feb. 2023

B.S. in Computer Science

Publications

Preprints

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

Seungone Kim, Dongkeun Yoon, Kiril Gashteovski, Juyoung Suk, Jinheon Baek, Pranjal Aggarwal, Ian Wu, Viktor Zaverkin, Spase Petkoski, Daniel R Schrider, Ilija Dukovski, Francesco Santini, Biljana Mitreska, Yong Jeong, Kyeongha Kwon, Young Min Sim, Dragana Manasova, Arthur Porto, Biljana Mojsoska, Makoto Takamoto, Marko Shuntov, Ruoqi Liu, Hyunjoo Jenny Lee, Niyazi Ulas Dinç, Yehhyun Jo, Sunkyu Han, Chungwoo Lee, Huishan Li, Esther HR Tsai, Ergun Simsek, Khushboo Shafi, Yeonseung Chung, Jihye Park, Aleksandar Shulevski, Henrik Christiansen, Yoosang Son, Elly Knight, Amanda Montoya, Jeongyoun Ahn, Christian Langkammer, Heera Moon, Changwon Yoon, Nikola Stikov, Mooseok Jang, Edward Choi, Junhan Kim, Yeon Sik Jung, Woo Youn Kim, Jae Kyoung Kim, Ishraq Md Anjum, Hyun Uk Kim, Drew Bridges, Carolin Lawrence, Xiang Yue, Alice Oh, Akari Asai, Sean Welleck, Graham Neubig

Preprint Under Review

Paper Code

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park, Haneul Yoo, Jaewon Cho, Junghun Park, Changyoon Lee, Kyochul Jang, Jaeyeon Kim, Eunsu Kim, Woojin Cho, ,

Preprint Under Review

Paper

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Guijin Son, Seungone Kim, Catherine Arnett, Hyunwoo Ko, Hyein Lee, Hyeonah Kang, Jiang Longxi, Jin Yun, JungYup Lee, Kyungmin Lee, Sam Yoosuk Kim, Sang Park, Seunghyeok Hong, SeungJae Lee, Seungyeop Yi, Shinae Shin, SunHye Bok, Sunyoung Shin, Yonghoon Ji, Youngtaek Kim, Hanearl Jung, Akari Asai, Graham Neubig, Sean Welleck, Youngjae Yu, Alexander B Ivanov, Boboev Muhammadjon, Chaeyoung Han, Christian Stump, Dmitrii Karp, Dohyun Kwon, DoYong Kwon, Duk-Soon Oh, Giovanni Resta, Greta Panova, Huiyun Noh, Hyungryul Baik, Hyungsun Bae, Inomov Mashrafdzhon, Jeewon Kim, Ji Eun Lee, Jiaqi Liu, Jieui Kang, Jimin Kim, Jon-Lark Kim, Junseo Yoon, Junwoo Jo, Kibeom Kim, Kiwoon Kwon, Mario Kummer, Max Mercer, Minjun Kim, Nahyun Lee, Ng Ze-An, Rafał Marcin Łochowski, Raphaël Lachièze-Rey, Ruichen Zhang, Sejin Park, Seonguk Seo, Shin Jaehoon, Taewoong Eom, Yeachan Park, Yongseok Jang, Youchan Oh, Zhaoyang Wang, Zoltán Kovács

Preprint Under Review

Paper

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

Anmol Agarwal, Natalie Neamtu, Pranjal Aggarwal, Seungone Kim, Jannis Limperg, Cedric Flamant, Kanna Shimizu, Bryan Parno, Sean Welleck

Preprint Under Review

Paper

Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim, Ilia Kulikov, Jack Lanchantin, Xian Li, Tianjian Li, Bo Liu, Graham Neubig, Anaelia Ovalle, Swarnadeep Saha, Sainbayar Sukhbaatar, Sean Welleck, Jason Weston, Chenxi Whitehouse, Adina Williams, Jing Xu, Ping Yu, Weizhe Yuan, Jingyu Zhang, Wenting Zhao

Preprint Under Review

Paper

SPICE: Self-play in corpus environments improves reasoning

Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston

Preprint Under Review

Paper

Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability

Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury

Preprint Under Review

Paper

FREESON: Retriever-Free Retrieval-Augmented Reasoning via Corpus-Traversing MCTS

Chaeeun Kim, Seungone Kim

Preprint Under Review

Paper

2026

Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

Seungone Kim*, Ian Wu*, Jinu Lee*, Xiang Yue, Seongyun Lee, Mingyeong Moon, Kiril Gashteovski, Carolin Lawrence, Julia Hockenmaier, Graham Neubig, Sean Welleck

ACL 2026 Findings

Paper Code

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue

ICML 2026

Paper Code

RefineBench: Evaluating Refinement Capability of Language Models via Checklists

Young-Jun Lee*, Seungone Kim*, Byung-Kwan Lee, Minkyeong Moon, Yechan Hwang, Jong Myoung Kim, Graham Neubig, Sean Welleck, Ho-Jin Choi

ICLR 2026

Paper Code

The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

Seongyun Lee*, Seungone Kim*, Minju Seo, Yongrae Jo, Dongyoung Go, Hyeonbin Hwang, Jinho Park, Xiang Yue, Sean Welleck, Graham Neubig, Moontae Lee, Minjoon Seo

ICLR 2026

Paper Code

OptimalThinkingBench: Evaluating over and underthinking in LLMs

Pranjal Aggarwal, Seungone Kim, Jack Lanchantin, Sean Welleck, Jason Weston, Ilia Kulikov, Swarnadeep Saha

ICLR 2026

Paper

VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding

Abdul Waheed, Zhen Wu, Dareen Alharthi, Seungone Kim, Bhiksha Raj

ICLR 2026

Paper

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristia, Lucía Tormo-Banuelos, Seungone Kim

ICLR 2026 VerifAI-2 Workshop

Paper Code

2025

Reasoning Models Better Express Their Confidence

Dongkeun Yoon, Seungone Kim, Sohee Yang, Sunkyoung Kim, Soyeon Kim, Yongil Kim, Eunbi Choi, Yireun Kim, Minjoon Seo

NeurIPS 2025

Paper Code

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo

NeurIPS 2025 (Spotlight)

Paper Code

Measuring Sycophancy of Language Models in Multi-turn Dialogues

Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu, Jinho D. Choi

EMNLP 2025

Paper Code

M-Prometheus: A Suite of Open Multilingual LLM Judges

Jose Pombal, Dongkeun Yoon, Patrick Fernandes, Ian Wu, Seungone Kim, Ricardo Rei, Graham Neubig, Andre F.T. Martins

COLM 2025

Paper

Let's Predict Sentence by Sentence

Hyeonbin Hwang, Byeongguk Jeon, Seungone Kim, Jiyeon Kim, Hoyeon Chang, Sohee Yang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo

COLM 2025 RAM2 Workshop (Oral)

Paper

Evaluating Language Models as Synthetic Data Generators

Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig

ACL 2025

Paper Code

LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation

Eunsu Kim, Juyoung Suk, Seungone Kim, Niklas Muennighoff, Dongkwan Kim, Alice Oh

ACL 2025 Findings

Paper Code

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo

NAACL 2025 (Best Paper Award)

Paper Code

KMMLU: Measuring Massive Multitask Language Understanding in Korean

Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi, Cheonbok Park, Kang Min Yoo, Stella Biderman

NAACL 2025

Paper Code

Bridging the Data Provenance Gap Across Text, Speech, and Video

Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi LI, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester James Validad Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma, Xuhui Zhou, Caiming Xiong, Luis Villa, Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara

ICLR 2025

Paper

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig

ICLR 2025

Paper Code

Better Instruction-Following Through Minimum Bayes Risk

Ian Wu, Patrick Fernandes, Amanda Bertsch, Seungone Kim, Sina Pakazad, Graham Neubig

ICLR 2025 (Spotlight)

Paper

2024

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Seungone Kim*, Juyoung Suk*, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo

EMNLP 2024

Paper Code

Self-Explore to Avoid the Pit: Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards

Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, Minjoon Seo

EMNLP 2024 Findings

Paper Code

Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models

Hyungjoo Chae, Yeonghyeon Kim, Seungone Kim, Kai Tzu-iunn Ong, Beong-woo Kwak, Moohyeon Kim, Seonghwan Kim, Taeyoon Kwon, Jiwan Chung, Youngjae Yu, Jinyoung Yeo

EMNLP 2024

Paper Code

Consent in Crisis: The Rapid Decline of the AI Data Commons

Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kush Tiwary, Lester James Validad Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Vu Minh Chien, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Sandy Pentland

NeurIPS 2024

Paper Code

Aligning to Thousands of Preferences via System Prompt Generalization

Seongyun Lee*, Sue Hyun Park*, Seungone Kim, Minjoon Seo

NeurIPS 2024

Paper Code

Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, Prithviraj Ammanabrolu

NeurIPS 2024 AFM Workshop (Oral)

Paper Code

Prometheus-Vision: Vision-Language Model as a Judge for Fine-grained Evaluation

Seongyun Lee*, Seungone Kim*, Sue Hyun Park, Geewook Kim, Minjoon Seo

ACL 2024 Findings

Paper Code

LangBridge: Multilingual Reasoning without Multilingual Supervision

Dongkeun Yoon, Joel Jang, Sungdong Kim, Seungone Kim, Sheikh Shafayat, Minjoon Seo

ACL 2024

Paper Code

Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?

Guijin Son*, Sangwon Baek, Sangdae Nam, Ilgyun Jeong, Seungone Kim*

ACL 2024

Paper Code

Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Seungone Kim*, Jamin Shin*, Yejin Cho*, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, Minjoon Seo

ICLR 2024

Paper Code

FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo

ICLR 2024 (Spotlight)

Paper Code

2023

The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-tuning

Seungone Kim*, Se June Joo*, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, Minjoon Seo

EMNLP 2023

Paper Code

Exploring the Benefits of Training Expert Language Models over Instruction Tuning

Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung Kim, Lajanugen Logeswaran, Moontae Lee, Kyungjae Lee, Minjoon Seo

ICML 2023

Paper Code

CoTEVer: Chain of Thought Prompting Annotation Toolkit for Explanation Verification

Seungone Kim, Se June Joo, Yul Jang, Hyungjoo Chae, Jinyoung Yeo

EACL 2023 Demo Track

Paper Code

2022

Mind the Gap! Injecting Commonsense Knowledge for Abstractive Dialogue Summarization

Seungone Kim*, Se June Joo*, Hyungjoo Chae*, Chaehyeong Kim, Seung-won Hwang, Jinyoung Yeo

COLING 2022

Paper Code

(* indicates equal contribution)

Vitæ

Full CV in PDF.

FAIR @ Meta May. 2026 - Dec. 2026

Research Intern (Mentor: Jason Weston)
TBD

FAIR @ Meta May. 2025 - Dec. 2025

Research Intern (Mentors: Ilia Kulikov, Jason Weston)
Worked on developing a synthetic dataset that improves the reasoning capabilities of LMs.

CMU LTI Aug. 2024 - Present

Ph.D. in Computer Science (Advisors: Graham Neubig and Sean Welleck)
Working on LLM Evaluation and AI for Science.

AML Lab @ LG AI Research Jan. 2024 - Jun. 2024

Research Intern (Mentor: Kyungjae Lee)
Worked on building a comprehensive NLG benchmark that could match the granularity of human evaluation.

Language Lab @ Naver AI Lab Mar. 2023 - Dec. 2023

Research Intern (Mentor: Jamin Shin)
Worked on building an open-source evaluator LM & VLM that could potentially replace GPT-4 and GPT-4V evaluation.

KAIST AI Mar. 2023 - Jul. 2024

M.S. in Artificial Intelligence (Advisor: Minjoon Seo)
Worked on developing evaluator LMs & VLMs and Chain-of-Thought fine-tuning. Early Graduation (3 semesters).

LK Lab @ KAIST AI Jul. 2022 - Feb. 2023

Research Intern (Mentor: Joel Jang)
Worked on developing expert LMs that can generalize to novel tasks.

Yonsei University Mar. 2018 - Feb. 2023

B.S. in Computer Science
Early Graduation (7 semesters).

Last updated on April 24, 2025

URL: https://seungonekim.github.io/
