Selected Projects 2024-2025
USC + Amazon Center on Secure & Trusted Machine Learning
Learning Predictive Models using Imperfect Data: An Integrated Continuous-Discrete Approach
PI and Co-PI: Andres Gomez & Johannes Royset, Daniel J. Epstein Department of Industrial & Systems Engineering
Capital One Fellow Student: Jad Soucar (PhD Student)
In marketing contexts, multi-touch attribution (MTA) aims to assign credit to the sequence of observed advertisements that influenced a customer's decision to make a purchase. Existing state-of-the-art models often rely on opaque black-box predictors with post-hoc attribution (e.g., Shapley values), which can be difficult to interpret and operationalize. We propose SDMTA, a novel interpretable state- and time-dependent MTA framework that explicitly models how advertising exposures accumulate and decay in a customer's latent purchase propensity. To efficiently solve the resulting optimization problem, we propose a multi-block penalty algorithm that employs a dynamic-programming-based splitting scheme and a knowledge distillation step, enabling computational tractability at scale. On synthetic data with known ground truth, the proposed method is robust to noise and recovers accurate purchase patterns. On a large real-world dataset provided by a leading financial services provider, the proposed approach matches or outperforms black-box methods from the literature while preserving white-box attribution.
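The core modeling idea, exposures that accumulate in and then decay from a latent purchase propensity, can be sketched as follows. The exponential decay rate, unit exposure weights, and the link function mapping accumulated influence into a probability are illustrative placeholders, not the SDMTA formulation itself:

```python
import math

def propensity(exposures, t, decay=0.5):
    """Latent purchase propensity at time t: each earlier exposure (time s,
    weight w) contributes w * exp(-decay * (t - s)), so influence accumulates
    across touches and decays with elapsed time (illustrative only)."""
    score = sum(w * math.exp(-decay * (t - s)) for s, w in exposures if s <= t)
    return 1.0 - math.exp(-score)  # map the accumulated score into (0, 1)

# Two unit-weight ad exposures at times 0 and 2, evaluated at t = 3.
p = propensity([(0, 1.0), (2, 1.0)], t=3)
```

Under a model of this shape, each touch's marginal contribution to the propensity is explicit, which is what makes white-box attribution possible without post-hoc explainers.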
Publications & Conferences:
Submitted: Jad Soucar, Andres Gomez, Johannes Royset, Kaland Mishra, Swapnil Shinde, Pranab Mohanty (2025). Interpretable State and Time Dependent Multi-Touch Attribution. Submitted to ICML 2026.
Reliable AI Predictions from Imperfect Financial Datasets: A Conformal Inference Approach
PI: Matteo Sesia, Assistant Professor of Data Sciences and Operations, Marshall School of Business, with a joint appointment in the Computer Science Department
Capital One Fellow Student: Yanfei Zhou (PhD Student)
Financial institutions increasingly rely on machine learning and AI for many problems, including credit risk assessment, fraud detection, and personalized services. However, these models often lack reliable uncertainty estimates, leading to overconfident predictions and reduced interpretability, particularly when training data are imperfect, contaminated, or evolving. Our research addresses this challenge through conformal inference, a statistical framework for principled uncertainty quantification in complex predictive models. To date, this project has produced seven research papers: four journal submissions currently under review or major revision, two peer-reviewed conference publications, and one working paper in preparation. These works develop new conformal methods that remain valid under realistic data imperfections, including structured label noise, imbalanced classes, and contaminated reference datasets. As part of this research, we established an ongoing collaboration with Capital One researchers to develop uncertainty-aware classification methods that remain reliable under distribution shift, a critical challenge for deployed fraud detection and risk models facing changing economic conditions and customer behavior. This joint work is currently being prepared for submission to a top machine learning conference.
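To make the core mechanism concrete, here is a minimal split-conformal sketch for regression, the textbook starting point of conformal inference rather than any of the project's specific methods: a held-out calibration set turns a point prediction from any model into an interval with a finite-sample marginal coverage guarantee.

```python
import numpy as np

def split_conformal_interval(cal_residuals, y_hat, alpha=0.1):
    """Split conformal prediction in its simplest form: the finite-sample
    (1 - alpha) quantile of absolute residuals on a held-out calibration
    set widens a point prediction into an interval whose marginal coverage
    holds regardless of the underlying model."""
    n = len(cal_residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(cal_residuals, level, method="higher")
    return y_hat - q, y_hat + q

# Ten calibration residuals 1..10; widen the point prediction 5.0.
lo, hi = split_conformal_interval(np.arange(1.0, 11.0), y_hat=5.0, alpha=0.1)
```

The project's contribution is to keep guarantees of this flavor valid when the calibration data themselves are imperfect (label noise, contamination, distribution shift), where this vanilla recipe breaks down.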
Publications & Conferences:
M. Bashari, M. Sesia, Y. Romano. ‘Robust Conformal Outlier Detection under Contaminated Reference Data.’ Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025. (https://arxiv.org/abs/2502.04807)
M. Sesia, V. Svetnik. ‘Conformal Survival Bands for Risk Screening under Right-Censoring.’ Proceedings of the 14th Symposium on Conformal and Probabilistic Prediction with Applications (COPA), 2025. (https://arxiv.org/abs/2505.04568)
Open SkyAI Conference, 09/04/2025, Chicago, IL.
ICML Expo Workshop on AI in Finance, 07/14/2025, Vancouver, Canada.
Knowledge Graph-Enhanced RAG for Financial GenAI System
PI: Viktor Prasanna, Professor of Electrical Engineering and Computer Science, Viterbi
Capital One Fellow Student: Yuxin Yang (PhD Student)
Generative AI, which leverages large language models (LLMs), has recently become a powerful tool across various domains. Yet it faces several challenges, such as hallucination (i.e., generating irrelevant or fabricated content), low interpretability of the model's decision-making, and difficulty handling complex questions that require synthesizing information from multiple sources. These limitations are especially problematic in finance, where accuracy and contextual understanding are critical. While retrieval-augmented generation (RAG) has been introduced to mitigate some of these challenges, traditional RAG systems often fall short in scenarios requiring multi-hop logical reasoning and contextual understanding, mainly because they rely on retrieving only the top-k most relevant data for a given query. To overcome these limitations, we propose integrating knowledge graph retrieval within RAG systems. Storing data in knowledge graphs enables traversal of interconnected data points, supporting multi-hop reasoning and capturing contextual information. This approach is particularly beneficial for financial services that handle complex questions requiring contextual understanding and multi-hop reasoning over multiple data sources, such as risk assessment, fraud detection, and investment advisory. By addressing these limitations of existing RAG systems, our proposed method aims to improve the quality and reliability of AI-generated responses in financial services.
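The multi-hop advantage over top-k retrieval can be illustrated with a toy traversal: a bounded breadth-first walk over a knowledge graph surfaces facts that are only reachable through intermediate entities. The entity and relation names below are hypothetical, and a real system would rank and verbalize the retrieved triples before prompting the LLM:

```python
from collections import deque

def multi_hop_context(graph, start, max_hops=2):
    """Gather (head, relation, tail) facts within max_hops of the query
    entity via breadth-first traversal; plain top-k similarity retrieval
    would miss relations that only appear along multi-hop paths."""
    seen, frontier, facts = {start}, deque([(start, 0)]), []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for rel, nbr in graph.get(node, []):
            facts.append((node, rel, nbr))
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return facts

# Hypothetical two-hop chain: the sanction is invisible at one hop.
kg = {
    "AcmeCorp": [("subsidiary_of", "HoldCo")],
    "HoldCo": [("sanctioned_by", "RegulatorX")],
}
facts = multi_hop_context(kg, "AcmeCorp")
```

Here a risk-assessment query about "AcmeCorp" only reaches the sanction fact by traversing through "HoldCo", which is exactly the relational structure that top-k document retrieval tends to lose.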
Publications & Conferences:
Submitted: Yuxin Yang, Gangda Deng, Ömer Faruk Akgül, Nima Chitsazan, Yash Govilkar, Akasha Tigalappanavara, Shi-Xiong Zhang, Sambit Sahu, Viktor Prasanna. SPARC-RAG: Adaptive Sequential–Parallel Scaling with Context Management for Retrieval-Augmented Generation. ACL, 2026 (under review).
Yuxin Yang is a third-year PhD student; poster presentation at the USC–Capital One CREDIF Research Symposium 2025.
Privacy Preserving Synthetic Data Generation with Differentially Private Reward Models
PI and Co-PI: Sai Praneeth Karimireddy & Robin Jia, Assistant Professors of Computer Science, Viterbi
Capital One Fellow Students: Amin Banayeeanzade and Deqing Fu (PhD Students)
Our work focuses on techniques for leveraging private datasets without leaking private information. In our main project, we have developed EpsVec, a new technique for generating synthetic data that matches the distribution of a private dataset without leaking private information; this synthetic data can be freely shared to facilitate collaboration and research. EpsVec works by creating private “dataset vectors” that capture the difference between private data and public priors. We steer language model generation with these vectors to close the distributional gap between synthetic data and private data. We construct these dataset vectors in a differentially private manner, yielding strong theoretical guarantees that our method does not leak private information. Compared to other differentially private techniques for synthetic data generation, EpsVec more closely matches the private data in distributional similarity and scales much better, i.e., it can generate large datasets when other methods cannot. We have also developed empirical techniques to audit the privacy leakage of in-context learning methods. In-context learning uses a small number of examples as a prompt to “teach” a language model how to perform a task. If these examples contain private information, extra care is required to ensure that the model does not leak this private information. We create the first framework to empirically measure worst-case leakage of private information from in-context examples. We find that existing in-context learning approaches often substantially leak private information and have a poor tradeoff between privacy and utility, signaling an important direction for future work.
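The privacy ingredient behind constructions of this kind can be sketched with the standard Gaussian mechanism: clip each private record's embedding to bound its individual influence, compute the mean shift from a public prior, and add noise calibrated to that per-record sensitivity. This is a hypothetical illustration of the general recipe, not the EpsVec construction itself; the function name and parameters are placeholders.

```python
import numpy as np

def private_dataset_vector(private_emb, public_emb, clip=1.0, sigma=1.0, seed=0):
    """Gaussian-mechanism sketch of a 'dataset vector': clip each private
    embedding to bound any one record's influence, take the mean shift from
    a public prior, then add noise scaled to the clipped per-record
    sensitivity. (Illustrative recipe, not the paper's construction.)"""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(private_emb, axis=1, keepdims=True)
    clipped = private_emb * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    delta = clipped.mean(axis=0) - public_emb.mean(axis=0)
    noise = rng.normal(0.0, sigma * clip / len(private_emb), size=delta.shape)
    return delta + noise

# Four private records and two public records, each a 3-d embedding.
v = private_dataset_vector(np.ones((4, 3)), np.zeros((2, 3)))
```

Because the released vector is the output of a differentially private mechanism, anything derived from it afterward, such as steering language model generation, inherits the same privacy guarantee by post-processing.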
Publications & Conferences:
Jacob Choi, Shuying Cao, Xingjian Dong, Wang Bill Zhu, Robin Jia, Sai Praneeth Karimireddy. ContextLeak: Auditing Leakage in Private In-Context Learning Methods. Submitted to ACL Rolling Review.
Amin Banayeeanzade, Qingchuan Yang, Deqing Fu, Spencer Hong, Erin Babinsky, Alfy Samuel, Anoop Kumar, Robin Jia, Sai Praneeth Karimireddy. EPSVec: Efficient and Private Synthetic Text Generation via Dataset Vectors. In preparation for submission to ICML 2026
Revolutionizing Financial Fraud Detection: Advanced Graph-based Anomaly Detection with LLM Integration
PI and Co-PI: Yue Zhao & Jieyu Zhao, Assistant Professors, Department of Computer Science, Viterbi
Capital One Fellow Students: Haoyan Xu, Li Li, Ziyi Liu (PhD Students; each student receives one semester of support)
This project studies graph-based anomaly detection methods for financial applications such as fraud detection and transaction monitoring, with a particular focus on integrating large language models (LLMs). The research addresses three core challenges in financial graph anomaly detection: limited labeled data, difficulty in model selection across heterogeneous graphs, and the need for interpretable and trustworthy results. The project investigates label-efficient graph anomaly detection techniques, automated model selection without access to evaluation labels, and LLM-based explanation and reasoning mechanisms to improve interpretability. The methods are designed to support realistic financial settings, including text-attributed graphs and complex relational structures commonly observed in financial systems. The work also emphasizes reproducibility and practical deployment considerations relevant to industry use cases.
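For intuition only, graph anomaly detection in its most stripped-down form scores each node by how far a structural statistic deviates from the graph-wide norm. The degree z-score below is a toy unsupervised baseline, far simpler than the learned graph representations and LLM reasoning the project studies:

```python
import numpy as np

def degree_anomaly_scores(adj):
    """Toy unsupervised baseline: score each node (e.g., an account) by the
    z-score of its transaction degree against the graph-wide mean. Real
    financial graphs need richer structure-, attribute-, and text-aware
    models, which is what this project develops."""
    deg = adj.sum(axis=1)
    return np.abs(deg - deg.mean()) / (deg.std() + 1e-12)

# Star graph: account 0 transacts with every other account.
adj = np.array([[0, 1, 1, 1],
                [1, 0, 0, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 0]])
scores = degree_anomaly_scores(adj)
```

The baseline is fully label-free, which is why label efficiency, model selection without evaluation labels, and interpretable explanations become the binding constraints once one moves to stronger detectors.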
Publications & Conferences:
Xu, Haoyan; Qian, Ruizhi; Yao, Zhengtao; Liu, Ziyi; Li, Li; Li, Yuqi; Li, Yanshu; Zheng, Wenqing; Rosa, Daniele; Barcklow, Daniel; Kumar, Senthil; Zhao, Jieyu; Zhao, Yue. LLM-Powered Text-Attributed Graph Anomaly Detection via Retrieval-Augmented Reasoning. Submitted to ACL (under review). (arXiv preprint: 2511.17584) (This paper is a joint collaboration with Capital One researchers and directly aligns with the project’s goals on LLM-assisted graph anomaly detection.)
Li, Yuangang; Shen, Yiqing; Nian, Yi; Gao, Jiechao; Wang, Ziyi; Yu, Chenxiao; Li, Shawn; Wang, Jie; Hu, Xiyang; Zhao, Yue. Mitigating Hallucinations in Large Language Models via Causal Reasoning. Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI), 2026. (Acknowledges support from this award; not a direct collaboration with Capital One researchers.) https://arxiv.org/abs/2508.12495
Li, Shawn; Qu, Jiashu; Zhou, Yuxiao; Qin, Yuehan; Yang, Tiankai; Zhao, Yue. Treble Counterfactual VLMs: A Causal Approach to Hallucination. Findings of the Association for Computational Linguistics (EMNLP), 2025. (Acknowledges support from this award; not a direct collaboration with Capital One researchers.) https://aclanthology.org/2025.findings-emnlp.1000/
Li, Yuangang; Li, Jiaqi; Xiao, Zhuo; Yang, Tiankai; Nian, Yi; Hu, Xiyang; Zhao, Yue. NLP-ADBench: NLP Anomaly Detection Benchmark. Findings of the Association for Computational Linguistics (EMNLP), 2025. (Acknowledges support from this award; not a direct collaboration with Capital One researchers.) https://aclanthology.org/2025.findings-emnlp.133/
Disentangling and Explaining Model-level Subjectivity with Diverse Large Language Models in Multi-agent Decision-making
PI: Shrikanth Narayanan, Professor and Nikias Chair in Engineering
Capital One Fellow Student: Georgios Chochlakis (PhD Student)
We collected the EVALUATE (Ecologically-Valid Affective and Logical Understanding of Ad Trust and Emotion) dataset, a multimodal corpus curated for studying how advertisements influence trust and emotion, with a focus on economic decisions. We also evaluated the quality of LLM inference and evaluation on subjective tasks.
Publications & Conferences:
Published: Marcus Ma, Georgios Chochlakis, Niyantha Maruthu Pandiyan, Jesse Thomason, and Shrikanth Narayanan. 2025. Large Language Models Do Multi-Label Classification Differently. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China. Association for Computational Linguistics. https://aclanthology.org/2025.emnlp-main.126/
Published: Georgios Chochlakis, Peter Wu, Arjun Bedi, Marcus Ma, Kristina Lerman, and Shrikanth Narayanan. 2025. Humans Hallucinate Too: Language Models Identify and Correct Subjective Annotation Errors With Label-in-a-Haystack Prompts. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China. Association for Computational Linguistics. https://aclanthology.org/2025.emnlp-main.993/
Submitted: Georgios Chochlakis, Jackson Trager, Vedant Jhaveri, Nikhil Ravichandran, Alexandros Potamianos, and Shrikanth Narayanan. Semantic F1 Scores: Fair Evaluation Under Fuzzy Class Boundaries. Under review.
Capital One Fellow student status: PhD Candidate (thesis proposal completed, 2025).



