About Me

I am currently an Applied Scientist at Amazon Ads Foundational Model & Agent, working on multi-agent systems and post-training data for LLM agents. I co-develop A-Evolve, an open-source infrastructure for continual agent and model improvement — evolvers that mutate harnesses and search post-training recipes across SWE-bench Verified, Terminal-Bench, OSWorld, and ARC-AGI.

Before this, I was at Apple AIML (2022–2025), where I worked on Siri’s open-domain knowledge extraction — knowledge graph construction, entity linking evaluation, and data quality measurement for production knowledge systems. I interned at Fidelity on dialogue summarization, and volunteered with Hugging Face on the GEM benchmark, designing human evaluation criteria for multilingual NLG.

I received my PhD in Information Science from Syracuse University, where I worked on narrative understanding and human-AI collaboration. My work, published across AI and HCI venues (NAACL, ACL, EMNLP, ICML, IJCAI, CHI, CSCW, TOCHI), has consistently focused on the data and evaluation side of intelligent systems.

Research Interests: Data and Evaluation for LLM Agents and Reasoning Models.

News

Our paper Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents is now available as an arXiv preprint! [arXiv:2605.30621]
Lin, M., Wu, J., Wang, Z., Shi, Z., Sang, Y., He, B., Liu, Z., Wei, T., Wu, Z., Zhang, Z., Wang, D., Zhang, X., Dumoulin, B., Xie, C., Zhou, Y., Wang, S., & Lu, H. (2026).

Our paper Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs is now available as an arXiv preprint! [arXiv:2605.17558]
Lu, Y., Wang, Z., Lu, Y., Sang, Y., Gesi, J., Tang, X., … & Wang, D. (2026).

Our paper Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents has been accepted to ACL 2026!
Wang, Z., Lu, Y., Zhang, Y., Chen, P., Dong, Z., Huang, J., … & Wang, D. (2026).

Our paper In-Context Sampling Strategy for Reliable LLM Prompting (arXiv version will be updated soon) has been accepted to NAACL 2024!

Our paper FLEEK: Factual Error Detection and Correction with Evidence Retrieved from External Knowledge has been accepted to EMNLP 2023! [PDF].

Our paper Malicious Selling Strategies in E-Commerce Livestream: A Case Study of Alibaba’s Taobao and ByteDance’s Douyin has been accepted to TOCHI! [PDF].

Happy to have my first student workshop paper! Our paper Machine Narrative Comprehension in a Fictional Characters Personality Prediction Task has been accepted to NAACL SRW 2022!

Our survey A Survey of Machine Narrative Reading Comprehension Assessments has been accepted to the IJCAI-ECAI 2022 Survey Track (acceptance rate 18%) [PDF].

Our paper TVShowGuess: Character Comprehension in Stories as Speaker Guessing [PDF, Github] has been accepted to NAACL 2022!

Yisi Sang
