Consistency Gap under Retrieval Corruption: Stress-Testing RAG Robustness with Adversarial Evidence
DOI: https://doi.org/10.70088/5tpptt45

Keywords: retrieval-augmented generation, robustness testing, consistency gap, adversarial evidence, retrieval contamination

Abstract

Retrieval-Augmented Generation (RAG) enhances the factual accuracy and contextual awareness of large language models by incorporating external knowledge sources into the generation process. The reliability of these systems, however, depends critically on the quality of the documents returned at the retrieval stage. When retrieval results are contaminated, whether inadvertently or maliciously, with adversarial evidence (documents that are highly relevant to the user's query yet factually incorrect), the model's output frequently exhibits systematic drift. This phenomenon produces what we term a "consistency gap": the model struggles to reconcile its internal parametric knowledge with the flawed external context. This paper takes retrieval contamination as its primary research setting and constructs a comprehensive stress-testing framework based on adversarial evidence to systematically examine how erroneous documents affect the logical consistency of model outputs. Extensive empirical evaluation yields three main findings. First, under retrieval contamination, models show a markedly asymmetric sensitivity to incorrect evidence, often prioritizing flawed retrieved text over accurate internal knowledge. Second, decoding strategies differ substantially in robustness, indicating that generation parameters can either mitigate or exacerbate the problem. Third, the magnitude of the observed consistency gap correlates significantly with an elevated risk of severe hallucination. This work thus provides new analytical perspectives and testing methodologies for evaluating, benchmarking, and improving the resilience of retrieval-augmented generation systems in real-world applications.
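The contamination stress test described in the abstract can be sketched in miniature as follows. This is an illustrative toy, not the paper's implementation: the `answer` function below is a hypothetical stand-in for a full RAG pipeline (in this sketch it simply trusts the top retrieved document, mimicking the evidence-prioritizing behavior the paper reports), and `consistency_gap` measures how often the generated answer flips when clean evidence is swapped for adversarial evidence.

```python
def answer(query: str, docs: list[str]) -> str:
    """Hypothetical RAG stand-in: generate by fully trusting the top
    retrieved document. Replace with a real retriever + generator to
    run the test against an actual system."""
    return docs[0] if docs else "I don't know."


def consistency_gap(queries: list[str],
                    clean_docs: list[list[str]],
                    adversarial_docs: list[list[str]]) -> float:
    """Fraction of queries whose answer changes when the retrieved
    evidence is replaced with adversarial (relevant-but-wrong) documents.
    0.0 = fully robust to contamination; 1.0 = every answer flips."""
    flips = sum(
        answer(q, clean) != answer(q, bad)
        for q, clean, bad in zip(queries, clean_docs, adversarial_docs)
    )
    return flips / len(queries)


if __name__ == "__main__":
    queries = ["What is the boiling point of water?"]
    clean = [["Water boils at 100 degrees Celsius."]]
    contaminated = [["Water boils at 50 degrees Celsius."]]
    # The toy model follows whatever evidence it is given,
    # so swapping in adversarial documents flips every answer.
    print(consistency_gap(queries, clean, contaminated))
```

A real evaluation would compare answers semantically (e.g., by exact-match on extracted facts or an entailment judge) rather than by string inequality, and would sweep decoding parameters to probe the robustness differences the paper reports.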
Copyright (c) 2026 Juyi Yang (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.