Practice of Mobile Application LLM System Driven by End-Edge-Cloud Collaboration

Authors

  • Xin Yu, TikTok Pte. Ltd., Singapore 048583, Singapore

DOI:

https://doi.org/10.70088/b0av0404

Keywords:

end-edge-cloud collaboration, large-small model coordination, mobile LLM system, privacy-preserving inference, cost-aware routing, on-device intelligence

Abstract

Large language models (LLMs) have rapidly become a general-purpose capability layer for mobile applications, yet "cloud-only LLM" deployment faces persistent bottlenecks in privacy-sensitive data access, inference cost, and real-time reliability. This paper presents a practical system design for a mobile application LLM stack driven by end-edge-cloud collaboration and coordinated "large-small model" execution. We summarize why a single large model cannot adequately address (i) user-level data richness and privacy constraints on-device, (ii) the high marginal cost of cloud inference at scale, and (iii) responsiveness and stability requirements under variable networks. We propose an architecture that assigns personalized, latency-critical, and privacy-preserving functions to on-device small models and local runtimes; delegates cacheable, low-latency coordination and retrieval services to the edge; and reserves cloud LLMs for complex reasoning and generation. We further describe orchestration mechanisms, routing policies, and optimization techniques, including context condensation, selective retrieval, speculative execution, and feedback-driven adaptation. We report system-level results for latency, cloud token reduction, and robustness under network degradation, and conclude that end-edge-cloud collaboration can improve user experience while materially reducing cloud-side cost and expanding privacy-respecting capability coverage.
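To make the division of labor concrete, the sketch below illustrates one way the cost-aware routing described in the abstract could be expressed. It is a minimal sketch under stated assumptions: the Tier and Request names, the complexity score, and every threshold are illustrative inventions for this page, not the paper's actual policy.

# Hypothetical cost-aware router for an end-edge-cloud LLM stack.
# All names, weights, and thresholds are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    DEVICE = "on-device small model"
    EDGE = "edge coordination/retrieval service"
    CLOUD = "cloud LLM"


@dataclass
class Request:
    privacy_sensitive: bool   # touches user-level on-device data
    est_complexity: float     # 0..1, e.g. from a lightweight classifier
    latency_budget_ms: int    # UX deadline for the first token
    network_rtt_ms: float     # measured round-trip time to the cloud
    cacheable: bool           # answer can be served from an edge cache


def route(req: Request) -> Tier:
    # Privacy-preserving functions stay on-device regardless of cost.
    if req.privacy_sensitive:
        return Tier.DEVICE
    # Degraded networks or tight deadlines rule out the cloud path.
    if req.network_rtt_ms > req.latency_budget_ms * 0.5:
        return Tier.EDGE if req.cacheable else Tier.DEVICE
    # Cacheable, low-complexity traffic is cheapest at the edge.
    if req.cacheable and req.est_complexity < 0.3:
        return Tier.EDGE
    # Reserve expensive cloud tokens for genuinely complex requests.
    if req.est_complexity >= 0.6:
        return Tier.CLOUD
    return Tier.DEVICE

Under this sketch, a complex request on a healthy network routes to the cloud LLM, while the same request on a degraded network falls back to the device or edge, consistent with the robustness behavior the abstract claims.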

Published

25 February 2026

Issue

Vol. 3 No. 1 (2026)

Section

Article

How to Cite

Yu, X. (2026). Practice of Mobile Application LLM System Driven by End-Edge-Cloud Collaboration. Artificial Intelligence and Digital Technology, 3(1), 62-68. https://doi.org/10.70088/b0av0404