Optimization of Scalable Machine Learning Pipelines for Big Data Analytics in Distributed Systems
DOI:
https://doi.org/10.70088/r0a1xh16
Keywords:
machine learning, distributed systems, big data, scalability, optimization, performance
Abstract
This paper proposes an approach to optimizing machine learning pipelines in distributed systems, with the goal of improving scalability and performance for big data analytics. The approach addresses key challenges such as data partitioning, load balancing, resource management, and fault tolerance. Experimental results demonstrate improvements in throughput, latency, scalability, and resource utilization, with up to a 43% increase in throughput and a 35% reduction in resource consumption. The optimized pipeline performs better as dataset sizes and node counts increase, and it also exhibits improved fault tolerance and cost efficiency. The study offers insights for making machine learning pipelines in distributed environments more efficient and effective for large-scale data processing and analysis.