Research
I am deeply interested in the intersection of deep learning, systems, and formal methods/programming languages. My research goal is to develop principled techniques that enhance the effectiveness, scalability, and reliability of AI systems, particularly in software development and safety-critical domains.
I'm currently researching:
- Structured Output Generation with Large Language Models (LLMs)
- Embedding Models, Re-Ranking, and Retrieval-Augmented Generation
- LLMs for Software Development
- Reinforcement Learning from Human Feedback (RLHF) and Preference-Based Reinforcement Learning (PBRL)
- Formal Verification and Adversarial Robustness of Deep Neural Networks
Relational Verification Leaps Forward with RABBit
Tarun Suresh*,
Debangshu Banerjee*,
Gagandeep Singh
NeurIPS 2024
[Paper][PDF][Code]
We introduce new gradient-based joint optimization algorithms and combine them in RABBit, a scalable branch-and-bound verifier that precisely verifies relational properties of deep neural networks, such as robustness against universal adversarial perturbations (UAPs).
Incremental Randomized Smoothing Certification
Shubham Ugare,
Tarun Suresh,
Debangshu Banerjee,
Sasa Misailovic,
Gagandeep Singh
ICLR 2024
[Paper][PDF][Code]
We present IRS, the first probabilistic approach for 5x faster robustness re-certification of deep neural networks after model compression (pruning, quantization) or fine-tuning.
Is Watermarking LLM Generated Code Robust?
Tarun Suresh,
Shubham Ugare,
Gagandeep Singh,
Sasa Misailovic
Tiny ICLR 2024 (Oral Presentation)
[Paper][PDF][Code]
We present the first study of the robustness of existing watermarking techniques on code generated by large language models, and propose a parsing-based algorithm that easily removes these watermarks via semantics-preserving transformations of the code.
CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking
Tarun Suresh*,
Revanth Gangi Reddy*,
Yifei Xu,
Zach Nussbaum,
Andriy Mulyar,
Brandon Duderstadt,
Heng Ji
Under Review
[Blog Post][Paper Coming Soon!][Code]
We introduce CoRNStack, a large-scale, high-quality contrastive training dataset for code that spans multiple programming languages. We demonstrate that contrastive training of embedding models using CoRNStack leads to state-of-the-art performance across a variety of code retrieval tasks.
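Contrastive training of embedding models of this kind typically optimizes an InfoNCE-style objective with in-batch negatives. A minimal sketch (the exact loss, temperature, and any hard-negative mining used with CoRNStack may differ; the function names are ours):

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def info_nce_loss(query_embs, code_embs, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives: for query i, the i-th
    code embedding is the positive pair, and every other code embedding
    in the batch serves as a negative."""
    losses = []
    for i, q in enumerate(query_embs):
        logits = [dot(q, c) / temperature for c in code_embs]
        log_denom = math.log(sum(math.exp(z) for z in logits))
        losses.append(log_denom - logits[i])  # -log softmax at the positive
    return sum(losses) / len(losses)
```

Minimizing this loss pulls each query embedding toward its matching code snippet and pushes it away from the other snippets in the batch, which is why the quality and diversity of the (query, code) pairs matter so much.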
SynCode: LLM Generation with Grammar Augmentation
Shubham Ugare,
Tarun Suresh,
Hangoo Kang,
Sasa Misailovic,
Gagandeep Singh
Under Review
[Paper][PDF][Code]
SynCode is a novel framework for grammar-guided generation with Large Language Models (LLMs) that scales to general-purpose programming languages and has soundness and completeness guarantees. SynCode reduces syntax errors by 96-100% across various languages (JSON, Python, Go) and enables 1.5x-10x faster LLM inference than existing approaches.
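The core idea of grammar-guided generation can be illustrated with a toy sketch (the `a*b*` grammar and function names below are illustrative, not SynCode's API): at each decoding step, mask out every token that cannot extend the partial output into a valid prefix of the grammar, then sample from what remains.

```python
NEG_INF = float("-inf")


def is_valid_prefix(s):
    # Toy grammar: any number of 'a's followed by any number of 'b's.
    seen_b = False
    for ch in s:
        if ch == "b":
            seen_b = True
        elif ch != "a" or seen_b:
            return False
    return True


def constrain_logits(logits, vocab, partial):
    """Set the logit of any token to -inf if appending it to the partial
    output cannot yield a valid prefix of the grammar."""
    return [z if is_valid_prefix(partial + tok) else NEG_INF
            for z, tok in zip(logits, vocab)]
```

Because masking only ever removes grammatically impossible tokens, a decoder using it never emits a syntactically invalid program (soundness); efficiently computing this mask for real programming-language grammars and LLM tokenizers is where the actual framework's machinery comes in.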
IterGen: Iterative Structured LLM Generation
Shubham Ugare,
Rohan Gumaste,
Tarun Suresh,
Gagandeep Singh,
Sasa Misailovic
Under Review
[Paper][PDF]
IterGen is a framework that lets users define and enforce semantic constraints on LLM outputs, efficiently backtracking and re-sampling during generation until these constraints are satisfied. IterGen improves LLM-generated SQL accuracy by 18% and eliminates LLM privacy leakage.
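The backtrack-and-resample loop can be sketched as a depth-first search over token choices (a much-simplified sketch: the greedy candidate ordering and the toy arithmetic constraint below are ours for illustration, not IterGen's API):

```python
def generate_with_backtracking(candidates_fn, check, target_len):
    """Depth-first generation: at each position, take the highest-preference
    candidate token that keeps the semantic check satisfied; when no
    candidate works, backtrack one position and try the next option there."""
    out, tried = [], [0]              # tried[-1]: next candidate index to try
    while len(out) < target_len:
        cands = candidates_fn(out)
        i = tried[-1]
        while i < len(cands) and not check(out + [cands[i]]):
            i += 1
        if i == len(cands):           # dead end: backtrack
            if not out:
                return None           # no valid sequence exists
            out.pop()
            tried.pop()
            tried[-1] += 1            # resume at the next candidate
        else:
            out.append(cands[i])
            tried[-1] = i
            tried.append(0)
    return out


# Toy semantic constraint: running sum stays <= 11, and a complete
# 3-token sequence must have an even sum.
def sum_check(seq):
    s = sum(seq)
    return s <= 11 and (len(seq) < 3 or s % 2 == 0)
```

In a real generation loop, `candidates_fn` would rank tokens by model probability and `check` would run the user's semantic constraint (e.g., column names in generated SQL must exist in the schema); the point of the framework is making the backtracking step cheap.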
Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa,
Bhrugu Bharathi,
Long Phan,
Andy Zhou,
Alice Gatti,
Tarun Suresh,
Maxwell Lin,
Justin Wang,
Rowan Wang,
Ron Arel,
Andy Zou,
Dawn Song,
Bo Li,
Dan Hendrycks,
Mantas Mazeika
Under Review
[Paper][PDF]
We develop a method, called TAR, for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after thousands of steps of fine-tuning. In extensive evaluations and red teaming analyses, we find that our method greatly improves tamper-resistance while preserving benign capabilities.
Two-Step Offline Preference-Based Reinforcement Learning with Constrained Actions
Yinglun Xu,
Tarun Suresh,
Rohan Gumaste,
David Zhu,
Ruirui Li,
Zhengyang Wang,
Haoming Jiang,
Xianfeng Tang,
Qingyu Yin,
Monica Xiao Cheng,
Qi Zheng,
Chao Zhang,
Gagandeep Singh
Under Review
[Paper][PDF]
To address the risk of reward hacking and the complexity of reinforcement learning in preference-based reinforcement learning, we develop a novel two-step learning method called PRC. The high-level idea is to restrict the reinforcement learning agent to a constrained action space that excludes out-of-distribution state-actions, which are unreliable and increase the complexity of the reinforcement learning problem in the second step.
Towards Continuous Verification of DNNs
Shubham Ugare,
Debangshu Banerjee,
Tarun Suresh,
Gagandeep Singh,
Sasa Misailovic
WFML @ ICML 2023
[Paper][PDF][Code]
We propose efficient deterministic formal verifiers to speed up DNN re-verification after pruning, quantization, or fine-tuning.