AIBrix: Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure arxiv.org/abs/2504.03648 .DC .AI

We introduce AIBrix, a cloud-native, open-source framework designed to optimize and simplify large-scale LLM deployment in cloud environments. Unlike traditional cloud-native stacks, AIBrix follows a co-design philosophy, ensuring every layer of the infrastructure is purpose-built for seamless integration with inference engines like vLLM. AIBrix introduces several key innovations to reduce inference costs and enhance performance, including high-density LoRA management for dynamic adapter scheduling, LLM-specific autoscalers, and prefix-aware, load-aware routing. To further improve efficiency, AIBrix incorporates a distributed KV cache, boosting token reuse across nodes, leading to a 50% increase in throughput and a 70% reduction in inference latency. AIBrix also supports a unified AI runtime, which streamlines model management while maintaining vendor-agnostic engine compatibility. For large-scale multi-node inference, AIBrix employs hybrid orchestration -- leveraging Kubernetes for coarse-grained scheduling and Ray for fine-grained execution -- to balance efficiency and flexibility. Additionally, an SLO-driven GPU optimizer dynamically adjusts resource allocations, optimizing heterogeneous serving to maximize cost efficiency while maintaining service guarantees. Finally, AIBrix enhances system reliability with AI accelerator diagnostic tools, enabling automated failure detection and mock-up testing to improve fault resilience. AIBrix is available at https://github.com/vllm-project/aibrix.
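
The idea of prefix-aware, load-aware routing mentioned in the abstract can be illustrated with a small sketch. This is not AIBrix code: the `Replica` class, its fields, and the scoring rule (longest cached prefix first, lowest load as tie-break) are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical view of one inference replica (not an AIBrix type)."""
    name: str
    active_requests: int = 0
    cached_prompts: list = field(default_factory=list)  # token-id sequences


def shared_prefix_len(a, b):
    """Length of the common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(prompt, replicas):
    """Prefer the longest cached prefix; break ties by lowest load."""
    def score(r):
        best = max((shared_prefix_len(prompt, c) for c in r.cached_prompts), default=0)
        return (-best, r.active_requests)
    return min(replicas, key=score)


replicas = [
    Replica("a", active_requests=3, cached_prompts=[[1, 2, 3, 4]]),
    Replica("b", active_requests=1, cached_prompts=[[1, 2, 9]]),
    Replica("c", active_requests=0),
]
chosen = pick_replica([1, 2, 3, 7], replicas)
print(chosen.name)
```

A production router would presumably weigh prefix reuse against load rather than rank them strictly; the lexicographic ordering here is only the simplest way to combine the two signals.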

arXiv.org

Diagnostic Method for Hydropower Plant Condition-based Maintenance combining Autoencoder with Clustering Algorithms arxiv.org/abs/2504.03649 .AI .LG .NE

The French company EDF uses supervisory control and data acquisition systems in conjunction with a data management platform to monitor hydropower plants, allowing engineers and technicians to analyse the time-series collected. Depending on the strategic importance of the monitored hydropower plant, the number of time-series collected can vary greatly, making it difficult to generate valuable information from the extracted data. In an attempt to provide an answer to this particular problem, a condition detection and diagnosis method combining clustering algorithms and autoencoder neural networks for pattern recognition has been developed and is presented in this paper. First, a dimension reduction algorithm is used to create a 2- or 3-dimensional projection that allows the users to identify unsuspected relationships between datapoints. Then, a collection of clustering algorithms regroups the datapoints into clusters. For each identified cluster, an autoencoder neural network is trained on the corresponding dataset. The aim is to measure the reconstruction error between each autoencoder model and the measured values, thus creating a proximity index for each state discovered during the clustering stage.
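
The last step, one model per cluster with the reconstruction error used as a proximity index, might be sketched as follows. This is a simplified illustration and not the authors' method: it swaps the neural autoencoder for a rank-k linear autoencoder (equivalent to PCA), assumes the cluster labels are already known, and uses synthetic data. All function names are hypothetical.

```python
import numpy as np


def fit_linear_autoencoder(X, k):
    """Fit a rank-k linear encoder/decoder (PCA) on the rows of X."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions; keep the top k.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]


def reconstruction_error(x, model):
    """Squared error between x and its encode/decode round trip."""
    mean, components = model
    code = (x - mean) @ components.T
    x_hat = code @ components + mean
    return float(np.sum((x - x_hat) ** 2))


rng = np.random.default_rng(0)
# Two synthetic operating states, each lying near a different line in 3-D.
t = rng.normal(size=(200, 1))
state0 = t @ np.array([[1.0, 0.0, 0.0]]) + 0.01 * rng.normal(size=(200, 3))
state1 = t @ np.array([[0.0, 1.0, 0.0]]) + 5.0 + 0.01 * rng.normal(size=(200, 3))

# One model per cluster (cluster labels assumed given here).
models = [fit_linear_autoencoder(s, k=1) for s in (state0, state1)]

new_point = np.array([2.0, 0.0, 0.0])  # lies on state 0's line
errors = [reconstruction_error(new_point, m) for m in models]
closest_state = int(np.argmin(errors))
print(closest_state, errors)
```

The lower the error against a given cluster's model, the closer the measurement is to that operating state; a point that reconstructs poorly under every model would suggest a state not seen during clustering.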

Echo: Efficient Co-Scheduling of Hybrid Online-Offline Tasks for Large Language Model Serving arxiv.org/abs/2504.03651 .DC .AI .LG

Large language models have been widely deployed in various applications, encompassing both interactive online tasks and batched offline tasks. Given the burstiness and latency sensitivity of online tasks, over-provisioning resources is common practice. This allows for the integration of latency-insensitive offline tasks during periods of low online load, enhancing resource utilization. However, strategically serving online and offline tasks through a preemption mechanism fails to fully leverage the flexibility of offline tasks and suffers from KV cache recomputation and irregular workloads. In this paper, we introduce Echo, a collaborative online-offline task serving system, including a scheduler, a KV cache manager, and estimation toolkits. The scheduler and KV cache manager work tightly to maximize the throughput of offline tasks, while the estimator further predicts execution time to ensure online task SLOs. The scheduler leverages the batch information of the last iteration to reduce the search space for finding the optimal schedule. The KV cache manager sets the priority of the KV cache based on the type of tasks and the opportunity for prefix sharing to reduce recomputation. Finally, the estimation toolkits predict the execution time, future memory consumption, and the throughput of offline tasks to guide the scheduler, KV cache manager, and the system deployer. Evaluation based on real-world workloads demonstrates that Echo can increase offline task throughput by up to 3.3×, while satisfying online task SLOs.

PointSplit: Towards On-device 3D Object Detection with Heterogeneous Low-power Accelerators arxiv.org/abs/2504.03654 .DC .AI .CV

Running deep learning models on resource-constrained edge devices has drawn significant attention due to its fast response, privacy preservation, and robust operation regardless of Internet connectivity. While these devices already cope with various intelligent tasks, the latest edge devices that are equipped with multiple types of low-power accelerators (i.e., both mobile GPU and NPU) can bring another opportunity; a task that used to be too heavy for an edge device in the single-accelerator world might become viable in the upcoming heterogeneous-accelerator world. To realize the potential in the context of 3D object detection, we identify several technical challenges and propose PointSplit, a novel 3D object detection framework for multi-accelerator edge devices that addresses the problems. Specifically, our PointSplit design includes (1) 2D semantics-aware biased point sampling, (2) parallelized 3D feature extraction, and (3) role-based group-wise quantization. We implement PointSplit on TensorFlow Lite and evaluate it on a customized hardware platform comprising both mobile GPU and EdgeTPU. Experimental results on representative RGB-D datasets, SUN RGB-D and ScanNet V2, demonstrate that PointSplit on a multi-accelerator device is 24.7 times faster with similar accuracy compared to the full-precision, 2D-3D fusion-based 3D detector on a GPU-only device.

Memory and Bandwidth are All You Need for Fully Sharded Data Parallel arxiv.org/abs/2504.03655 .DC .LG

Transformer models have revolutionized a wide spectrum of disciplines, especially in language processing. The recent success has proven that model size scalability is crucial for achieving superior performance metrics. However, training large transformer models is challenging even on modern hardware with powerful GPUs and high-speed interconnects. Existing studies primarily focus on optimizing model training distribution strategies to minimize memory footprint and enhance training speed, often overlooking the scalability challenges related to model size and hardware constraints. To address this oversight, we thoroughly investigate computational, memory, and network demands of training large transformers using the Fully Sharded Data Parallel (FSDP) distributed strategy across different hardware clusters. We explore the intricate relationships between model size and hardware setups to identify configurations that ensure maximum model and hardware efficiency, effective sequence length management, and optimal training throughput. A significant finding of our study is the critical interplay of the cluster's connection bandwidth and GPU memory size compared to the computational performance of GPUs. This interplay limits training efficiency, underscoring the role of both hardware characteristics as a possible bottleneck. By integrating theoretical analysis with simulations and empirical tests, we demonstrate how hardware limitations affect training efficacy, identifying key hardware thresholds and the impact of network connectivity. Our findings prompt a reassessment of training strategies guiding users on the way to finding hardware-optimal FSDP configurations, enhancing training efficiency for large-scale transformer models.
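
To give a feel for why GPU memory size becomes a threshold under FSDP, here is a back-of-the-envelope sketch. It is not taken from the paper: it assumes the commonly cited 16 bytes of model state per parameter for mixed-precision Adam, assumes that state is sharded evenly across GPUs, and ignores the transient unsharded layer held during all-gather. The function names and the `activation_bytes` parameter are hypothetical.

```python
def fsdp_state_bytes_per_gpu(n_params, n_gpus, bytes_per_param=16):
    """Model-state bytes per GPU when weights, gradients and optimizer state are fully sharded.

    bytes_per_param=16 assumes mixed-precision Adam: 2 (fp16 weights)
    + 2 (fp16 gradients) + 12 (fp32 master weights, momentum, variance).
    """
    return n_params * bytes_per_param / n_gpus


def fits(n_params, n_gpus, gpu_mem_bytes, activation_bytes=0):
    """True if sharded model state plus an activation budget fits in one GPU."""
    needed = fsdp_state_bytes_per_gpu(n_params, n_gpus) + activation_bytes
    return needed <= gpu_mem_bytes


GiB = 1024 ** 3
per_gpu = fsdp_state_bytes_per_gpu(7e9, 8)  # 7B parameters on 8 GPUs
print(per_gpu / GiB, fits(7e9, 8, 80 * GiB), fits(70e9, 8, 80 * GiB))
```

Under these assumptions a 7B-parameter model leaves most of an 80 GiB GPU free for activations on 8 GPUs, while a 70B-parameter model does not fit at all, so more GPUs (and therefore more inter-GPU traffic) are needed.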

Comparative Analysis of Lightweight Kubernetes Distributions for Edge Computing: Performance and Resource Efficiency arxiv.org/abs/2504.03656 .DC

Edge computing environments increasingly rely on lightweight container orchestration platforms to manage resource-constrained devices. This paper provides an empirical analysis of five lightweight Kubernetes distributions (KDs) (k0s, k3s, KubeEdge, OpenYurt, and Kubernetes (k8s)), focusing on their performance and resource efficiency in edge computing scenarios. We evaluated key metrics such as CPU, memory, disk usage, throughput, and latency under varying workloads, utilizing a testbed of Intel NUCs and Raspberry Pi devices. Our results demonstrate significant differences in performance: k3s exhibited the lowest resource consumption, while k0s and k8s excelled in data plane throughput and latency. Under heavy stress scenarios, k3s and k0s accomplished the same workloads faster than the other distributions. OpenYurt offered balanced performance, suitable for hybrid cloud-edge use cases, but was less efficient in terms of resource usage and scalability compared to k0s, k3s and k8s. KubeEdge, although feature-rich for edge environments, exhibited higher resource consumption and lower scalability. These findings offer valuable insights for developers and operators selecting an appropriate KD based on specific performance and resource efficiency requirements for edge computing environments.

Curvature-Constrained Vector Field for Motion Planning of Nonholonomic Robots arxiv.org/abs/2504.02852 .SY .RO .SY

Vector fields are advantageous in handling nonholonomic motion planning as they provide reference orientation for robots. However, additionally incorporating curvature constraints becomes challenging, due to the interconnection between the design of the curvature-bounded vector field and the tracking controller under underactuation. In this paper, we present a novel framework to co-develop the vector field and the control laws, guiding the nonholonomic robot to the target configuration with curvature-bounded trajectory. First, we formulate the problem by introducing the target positive limit set, which allows the robot to converge to or pass through the target configuration, depending on different dynamics and tasks. Next, we construct a curvature-constrained vector field (CVF) via blending and distributing basic flow fields in workspace and propose the saturated control laws with a dynamic gain, under which the tracking error's magnitude decreases even when saturation occurs. Under the control laws, kinematically constrained nonholonomic robots are guaranteed to track the reference CVF and converge to the target positive limit set with bounded trajectory curvature. Numerical simulations show that the proposed CVF method outperforms other vector-field-based algorithms. Experiments on Ackermann UGVs and semi-physical fixed-wing UAVs demonstrate that the method can be effectively implemented in real-world scenarios.

Mapping Technological Futures: Anticipatory Discourse Through Text Mining arxiv.org/abs/2504.02853 .SI .CL .CY

The volatility and unpredictability of emerging technologies, such as artificial intelligence (AI), generate significant uncertainty, which is widely discussed on social media. This study examines anticipatory discourse surrounding technological futures by analysing 1.5 million posts from 400 key opinion leaders (KOLs) published on the X platform (from 2021 to 2023). Using advanced text mining techniques, including BERTopic modelling, sentiment, emotion, and attitude analyses, the research identifies 100 distinct topics reflecting anticipated tech-driven futures. Our findings emphasize the dual role of KOLs in framing "present futures" -- optimistic visions of transformative technologies like AI and IoT -- and influencing "future presents", where these projections shape contemporary societal and geopolitical debates. Positive emotions such as Hope dominate, outweighing Anxiety, particularly in topics like "Machine Learning, Data Science, and Deep Learning," while discussions around "Climate Change" and "War, Ukraine, and Trump People" elicit Anxiety. By framing technologies as solutions to societal challenges, KOLs act as mediators of societal narratives, bridging imagined futures and current realities. These insights underscore their pivotal role in directing public attention with emerging technologies during periods of heightened uncertainty, advancing our understanding of anticipatory discourse in technology-mediated contexts.

The epistemic dimension of algorithmic fairness: assessing its impact in innovation diffusion and fair policy making arxiv.org/abs/2504.02856 .CY .SI

Algorithmic fairness is an expanding field that addresses a range of discrimination issues associated with algorithmic processes. However, most works in the literature focus on analyzing it only from an ethical perspective, focusing on moral principles and values that should be considered in the design and evaluation of algorithms, while disregarding the epistemic dimension related to knowledge transmission and validation. However, this aspect of algorithmic fairness should also be included in the debate, as it is crucial to introduce a specific type of harm: an individual may be systematically excluded from the dissemination of knowledge due to the attribution of a credibility deficit/excess. In this work, we specifically focus on characterizing and analyzing the impact of this credibility deficit or excess on the diffusion of innovations on a societal scale, a phenomenon driven by individual attitudes and social interactions, and also by the strength of mutual connections. Indeed, discrimination might shape the latter, ultimately modifying how innovations spread within the network. In this light, to incorporate, also from a formal point of view, the epistemic dimension in innovation diffusion models becomes paramount, especially if these models are intended to support fair policy design. For these reasons, we formalize the epistemic properties of a social environment, by extending the well-established Linear Threshold Model (LTM) in an epistemic direction to show the impact of epistemic biases in innovation diffusion. Focusing on the impact of epistemic bias in both open-loop and closed-loop scenarios featuring optimal fostering policies, our results shed light on the pivotal role the epistemic dimension might have in the debate of algorithmic fairness in decision-making.
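
For readers unfamiliar with the Linear Threshold Model, a minimal simulation is sketched below. The base rule (a node adopts once the summed weight of its adopting in-neighbours reaches its threshold) is the standard LTM. The `credibility` multiplier is only a hypothetical stand-in for a credibility deficit or excess; the abstract does not specify how the paper formalizes it.

```python
def linear_threshold(in_weights, thresholds, seeds, credibility=None):
    """Run the Linear Threshold Model until no further node adopts.

    in_weights[v] maps each in-neighbour u of v to the influence weight w(u, v).
    credibility[u] (hypothetical, default 1.0) rescales how much u's adoption counts.
    """
    credibility = credibility or {}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, nbrs in in_weights.items():
            if v in active:
                continue
            pressure = sum(
                w * credibility.get(u, 1.0) for u, w in nbrs.items() if u in active
            )
            if pressure >= thresholds[v]:
                active.add(v)
                changed = True
    return active


# Chain a -> b -> c with equal weights and thresholds.
graph = {"a": {}, "b": {"a": 0.6}, "c": {"b": 0.6}}
thresholds = {"a": 0.5, "b": 0.5, "c": 0.5}

unbiased = linear_threshold(graph, thresholds, {"a"})
biased = linear_threshold(graph, thresholds, {"a"}, credibility={"a": 0.5})
print(sorted(unbiased), sorted(biased))
```

In this toy chain, halving the credibility attributed to the seed node stops the innovation from spreading at all, which is the kind of effect an epistemic bias could have on diffusion.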

Optimizing Humor Generation in Large Language Models: Temperature Configurations and Architectural Trade-offs arxiv.org/abs/2504.02858 .CL .LG

Large language models (LLMs) demonstrate increasing capabilities in creative text generation, yet systematic evaluations of their humor production remain underexplored. This study presents a comprehensive analysis of 13 state-of-the-art LLMs across five architectural families, evaluating their performance in generating technically relevant humor for software developers. Through a full factorial design testing 715 unique configurations of temperature settings and prompt variations, we assess model outputs using five weighted criteria: humor quality, domain relevance, concept originality, tone precision, and delivery efficiency. Our methodology employs rigorous statistical analysis including ANOVA, correlation studies, and quadratic regression to identify optimal configurations and architectural influences. Results reveal significant performance variations across models, with certain architectures achieving 21.8% superiority over baseline systems. Temperature sensitivity analysis demonstrates that 73% of models achieve peak performance at lower stochasticity settings (<= 0.5), though optimal ranges vary substantially by architecture. We identify distinct model clusters: compact high-performers maintaining efficiency-quality balance versus verbose specialists requiring longer outputs for marginal gains. Statistical validation confirms model architecture explains 38.7% of performance variance, with significant correlations between humor quality and concept originality. The study establishes practical guidelines for model selection and configuration, demonstrating how temperature adjustments and architectural considerations impact humor generation effectiveness. These findings advance understanding of LLM capabilities in creative technical writing and provide empirically validated configuration strategies for developers implementing humor-generation systems.
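
As an illustration of how quadratic regression can locate an optimal temperature, the sketch below fits a parabola to score-versus-temperature data and reads off its vertex. The data are synthetic and the variable names are hypothetical; nothing here reproduces the paper's measurements or its exact analysis.

```python
import numpy as np

# Synthetic scores peaking at temperature 0.4 (illustrative only).
temps = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
scores = 1.0 - (temps - 0.4) ** 2

# Fit score = a*T^2 + b*T + c; polyfit returns the highest-degree coefficient first.
a, b, c = np.polyfit(temps, scores, deg=2)

# A concave fit (a < 0) has its maximum at the vertex T* = -b / (2a).
best_temp = -b / (2 * a) if a < 0 else None
print(best_temp)
```

With real, noisy scores the vertex can fall outside the tested range, in which case the best tested endpoint is the safer choice.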

AI-Enhanced Resilience in Power Systems: Adversarial Deep Learning for Robust Short-Term Voltage Stability Assessment under Cyber-Attacks arxiv.org/abs/2504.02859 .SY .SY

In the era of Industry 4.0, ensuring the resilience of cyber-physical systems against sophisticated cyber threats is increasingly critical. This study proposes a pioneering AI-based control framework that enhances short-term voltage stability assessments (STVSA) in power systems under complex composite cyber-attacks. First, by incorporating white-box and black-box adversarial attacks with Denial-of-Service (DoS) perturbations during training, composite adversarial attacks are implemented. Second, the application of Spectral Normalized Conditional Wasserstein Generative Adversarial Network with Gradient Penalty (SNCWGAN-GP) and Fast Gradient Sign Method (FGSM) strengthens the model's resistance to adversarial disturbances, improving data quality and training stability. Third, an assessment model based on Long Short-Term Memory (LSTM)-enhanced Graph Attention Network (L-GAT) is developed to capture dynamic relationships between the post-fault dynamic trajectories and electrical grid topology. Experimental results on the IEEE 39-bus test system demonstrate the efficacy and superiority of the proposed method in composite cyber-attack scenarios. This contribution is pivotal to advancing AI-based resilient control strategies for nonlinear dynamical systems, marking a substantial enhancement in the security of cyber-physical systems.
