A Study on the Matching Rate of Dance Movements Using 2D Skeleton Detection and 3D Pose Estimation: Why Is SEVENTEEN's Performance So Bita-Zoroi (Perfectly Synchronized)? arxiv.org/abs/2503.19917 .CV

SEVENTEEN is a K-pop group with an unusually large number of members (13 in total) and one of the largest height differences between its tallest and shortest members among K-pop groups. Despite their numbers and physical differences, their dance performances exhibit unparalleled unity in the K-pop industry. Their dance synchronization rate is reputed to be 90% or even 97%, but there is little concrete data to substantiate this figure. In this study, we analyzed SEVENTEEN's dance performances using videos available on YouTube. We applied 2D skeleton detection and 3D pose estimation to evaluate joint angles, body part movements, and jumping and crouching motions, and to investigate the factors contributing to their performance unity. The analysis revealed exceptionally high consistency in the movement direction of body parts, as well as in ankle and head positions during jumping movements and head position during crouching movements. These findings suggest that SEVENTEEN's high synchronization rate can be attributed to consistent movement direction and synchronized ankle and head heights during jumping and crouching movements.
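
As a rough illustration of this kind of analysis, the sketch below computes a joint angle from 2D keypoints and a simple cross-dancer agreement score for movement direction; the keypoint layout and the agreement metric are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by 2D keypoints a-b-c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def movement_direction_agreement(keypoints):
    """keypoints: (dancers, frames, joints, 2) array of 2D positions.
    Returns mean pairwise cosine similarity of per-frame joint velocities,
    a rough proxy for how consistently body parts move in the same direction."""
    vel = np.diff(keypoints, axis=1)  # frame-to-frame displacement per joint
    unit = vel / (np.linalg.norm(vel, axis=-1, keepdims=True) + 1e-8)
    sims = []
    n = unit.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            sims.append(np.mean(np.sum(unit[i] * unit[j], axis=-1)))
    return float(np.mean(sims))
```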

Unifying Structural Proximity and Equivalence for Enhanced Dynamic Network Embedding arxiv.org/abs/2503.19926 .SI .LG

Dynamic network embedding methods transform nodes in a dynamic network into low-dimensional vectors while preserving network characteristics, facilitating tasks such as node classification and community detection. Several embedding methods have been proposed to capture structural proximity among nodes in a network, where densely connected communities are preserved, while others have been proposed to preserve structural equivalence among nodes, capturing their structural roles regardless of their relative distance in the network. However, most existing methods that aim to preserve both network characteristics mainly focus on static networks, and those designed for dynamic networks do not explicitly account for inter-snapshot structural properties. This paper proposes a novel unifying dynamic network embedding method that simultaneously preserves both structural proximity and equivalence while considering inter-snapshot structural relationships in a dynamic network. Specifically, to define structural equivalence in a dynamic network, we use temporal subgraphs, known as dynamic graphlets, to capture how a node's neighborhood structure evolves over time. We then introduce a temporal-structural random walk to flexibly sample time-respecting sequences of nodes, considering both their temporal proximity and similarity in evolving structures. The proposed method is evaluated on node classification using five real-world networks, where it outperforms benchmark methods, demonstrating its effectiveness and flexibility in capturing various aspects of a network.
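
A minimal sketch of what a temporal-structural random walk could look like, assuming numeric edge timestamps and a precomputed structural-similarity function (e.g., derived from dynamic-graphlet counts); the paper's actual transition probabilities may differ.

```python
import random

def temporal_structural_walk(events, sim, start, t0, length, alpha=0.5):
    """Time-respecting random walk sketch.
    events: dict node -> list of (neighbor, timestamp) edges
    sim: function (u, v) -> structural similarity in [0, 1] (assumed precomputed)
    alpha: trade-off between temporal proximity and structural similarity.
    Illustrative only; the paper's transition law may be defined differently."""
    walk, node, t = [start], start, t0
    for _ in range(length - 1):
        cand = [(v, ts) for v, ts in events.get(node, []) if ts >= t]  # respect time
        if not cand:
            break
        # favor temporally close edges and structurally similar neighbors
        w = [alpha / (1 + ts - t) + (1 - alpha) * sim(node, v) for v, ts in cand]
        node, t = random.choices(cand, weights=w, k=1)[0]
        walk.append(node)
    return walk
```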

Unlocking Health Insights with SDoH Data: A Comprehensive Open-Access Database and SDoH-EHR Linkage Tool arxiv.org/abs/2503.19928 .SI

Background: Social determinants of health (SDoH) play a crucial role in influencing health outcomes, accounting for nearly 50% of modifiable health factors and bringing to light critical disparities among disadvantaged groups. Despite the significant impact of SDoH, existing data resources often fall short in terms of comprehensiveness, integration, and usability. Methods: To address these gaps, we developed an extensive Exposome database and a corresponding web application, aimed at enhancing data usability and integration with electronic health records (EHRs) to foster personalized and informed healthcare. We created a robust database consisting of a wide array of SDoH indicators and an automated linkage tool designed to facilitate effortless integration with EHRs. We emphasized a user-friendly interface to cater to researchers, clinicians, and public health professionals. Results: The resultant Exposome database and web application offer an extensive data catalog with enhanced usability features. The automated linkage tool has demonstrated efficiency in integrating SDoH data with EHRs, significantly improving data accessibility. Initial deployment has confirmed scalability and robust spatial data relationships, facilitating precise and contextually relevant healthcare insights. Conclusion: The development of an advanced Exposome database and linkage tool marks a significant step toward enhancing the accessibility and usability of SDoH data. By centralizing and integrating comprehensive SDoH indicators with EHRs, this tool empowers a wide range of users to access high-quality, standardized data. This resource will have a lasting impact on personalized healthcare and a more equitable health landscape.
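
The abstract does not specify the linkage tool's schema, but area-level SDoH-to-EHR linkage is typically a join on a geographic key such as a census tract. A hypothetical sketch with invented file and column names:

```python
import pandas as pd

# Hypothetical files and columns; the actual tool's schema is not described
# in the abstract. Census tract is a common spatial key for this enrichment.
ehr = pd.read_csv("ehr_patients.csv", dtype={"census_tract": str})
sdoh = pd.read_csv("sdoh_indicators.csv", dtype={"census_tract": str})

# Attach area-level SDoH indicators to each patient record.
linked = ehr.merge(sdoh, on="census_tract", how="left", validate="many_to_one")
print(linked[["patient_id", "median_income", "food_insecurity_rate"]].head())
```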

Robust Object Detection of Underwater Robot based on Domain Generalization arxiv.org/abs/2503.19929 .CV .LG

Object detection aims to obtain the location and category of specific objects in a given image, which comprises two tasks: classification and localization. In recent years, researchers have applied object detection to underwater robots equipped with vision systems to complete tasks including seafood harvesting, fish farming, biodiversity monitoring, and so on. However, the diversity and complexity of underwater environments bring new challenges to object detection. First, aquatic organisms tend to live together, which leads to severe occlusion. Second, aquatic organisms are good at hiding themselves, often taking on colors similar to the background. Third, varying water quality and changeable, extreme lighting conditions lead to distorted, low-contrast, blue- or green-tinted images from the underwater camera, resulting in domain shift, and deep models are generally vulnerable to domain shift. Fourth, the movement of the underwater robot blurs the captured images and stirs up sediment, resulting in low visibility of the water. This paper investigates the problems brought by the underwater environment mentioned above and aims to design a high-performance and robust underwater object detector.
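
One common domain-generalization tactic for this setting is to randomize underwater-style degradations (color cast, low contrast, blur) at training time so the detector sees many simulated domains. A sketch with illustrative parameter ranges, not the paper's actual recipe:

```python
import torch
from torchvision import transforms

# Rough domain-randomization pipeline simulating underwater degradations.
# Parameter ranges are illustrative assumptions, not the paper's settings.
underwater_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.5,
                           saturation=0.5, hue=0.1),   # blue/green cast, low contrast
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 3.0)),  # turbidity/motion blur
])

img = torch.rand(3, 480, 640)       # stand-in for a camera frame in [0, 1]
augmented = underwater_aug(img)
```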

Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation arxiv.org/abs/2503.19937 .CV .AI

Text-to-image generation has become increasingly popular, but achieving the desired images often requires extensive prompt engineering. In this paper, we explore how to decode textual prompts from reference images, a process we refer to as image reverse prompt engineering. This technique enables us to gain insights from reference images, understand the creative processes of great artists, and generate impressive new images. To address this challenge, we propose a method known as automatic reverse prompt optimization (ARPO). Specifically, our method refines an initial prompt into a high-quality prompt through an iteratively imitative gradient prompt optimization process: 1) generating a recreated image from the current prompt to instantiate its guidance capability; 2) producing textual gradients, which are candidate prompts intended to reduce the difference between the recreated image and the reference image; 3) updating the current prompt with textual gradients using a greedy search method to maximize the CLIP similarity between the prompt and the reference image. We compare ARPO with several baseline methods, including handcrafted techniques, gradient-based prompt tuning methods, image captioning, and data-driven selection methods. Both quantitative and qualitative results demonstrate that our ARPO converges quickly to generate high-quality reverse prompts. More importantly, we can easily create novel images with diverse styles and content by directly editing these reverse prompts. Code will be made publicly available.
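
The three-step loop above maps naturally onto code. Below is a pseudocode-level sketch where generate_image, propose_edits (the textual-gradient step), and clip_similarity are hypothetical callables standing in for the text-to-image model, the prompt editor, and a CLIP scorer:

```python
def arpo(reference_image, prompt, generate_image, propose_edits,
         clip_similarity, steps=10, n_candidates=8):
    """Sketch of the ARPO loop; the injected callables are hypothetical
    stand-ins, and step sizes/counts are illustrative."""
    for _ in range(steps):
        recreated = generate_image(prompt)                 # 1) instantiate guidance
        candidates = propose_edits(prompt, recreated,      # 2) textual gradients
                                   reference_image, n=n_candidates)
        # 3) greedy update: keep whichever prompt best matches the reference
        best = max(candidates + [prompt],
                   key=lambda p: clip_similarity(generate_image(p),
                                                 reference_image))
        if best == prompt:
            break                                          # no improving candidate
        prompt = best
    return prompt
```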

Continual Learning With Quasi-Newton Methods arxiv.org/abs/2503.19939 .IV .LG

Catastrophic forgetting remains a major challenge when neural networks learn tasks sequentially. Elastic Weight Consolidation (EWC) attempts to address this problem by introducing a Bayesian-inspired regularization loss to preserve knowledge of previously learned tasks. However, EWC relies on a Laplace approximation where the Hessian is simplified to the diagonal of the Fisher information matrix, assuming uncorrelated model parameters. This overly simplistic assumption often leads to poor Hessian estimates, limiting its effectiveness. To overcome this limitation, we introduce Continual Learning with Sampled Quasi-Newton (CSQN), which leverages Quasi-Newton methods to compute more accurate Hessian approximations. CSQN captures parameter interactions beyond the diagonal without requiring architecture-specific modifications, making it applicable across diverse tasks and architectures. Experimental results across four benchmarks demonstrate that CSQN consistently outperforms EWC and other state-of-the-art baselines, including rehearsal-based methods. CSQN reduces EWC's forgetting by 50 percent and improves its performance by 8 percent on average. Notably, CSQN achieves superior results on three out of four benchmarks, including the most challenging scenarios, highlighting its potential as a robust solution for continual learning.
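
To make the contrast concrete: EWC's penalty is a diagonal quadratic form around the old optimum, while a quasi-Newton method can supply a richer (e.g., low-rank) curvature estimate. A minimal sketch assuming flattened parameter vectors; CSQN's actual construction of the approximation differs in detail:

```python
import torch

def ewc_penalty(params, star, fisher_diag, lam):
    """Classic EWC: diagonal-Fisher quadratic penalty around the old optimum."""
    return 0.5 * lam * sum(
        (f * (p - s) ** 2).sum() for p, s, f in zip(params, star, fisher_diag))

def low_rank_quadratic_penalty(theta, theta_star, U, lam):
    """Quadratic penalty under a low-rank curvature estimate H ~ U @ U.T,
    the kind of Hessian approximation a sampled quasi-Newton method can
    provide (illustrative, not CSQN's exact construction).
    theta, theta_star: flattened (d,) parameter vectors; U: (d, k) factor."""
    d = theta - theta_star
    return 0.5 * lam * (U.t() @ d).pow(2).sum()  # d^T U U^T d without forming H
```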

A Real-Time Human Action Recognition Model for Assisted Living arxiv.org/abs/2503.18957 .CV

Ensuring the safety and well-being of elderly and vulnerable populations in assisted living environments is a critical concern. Computer vision presents an innovative and powerful approach to predicting health risks through video monitoring, employing human action recognition (HAR) technology. However, real-time prediction of human actions with high performance and efficiency is a challenge. This research proposes a real-time human action recognition model that combines a deep learning model and a live video prediction and alert system, in order to predict falls, staggering, and chest pain for residents in assisted living. Six thousand RGB video samples from the NTU RGB+D 60 dataset were selected to create a dataset with four classes: Falling, Staggering, Chest Pain, and Normal, with the Normal class comprising 40 daily activities. Transfer learning was applied to train four state-of-the-art HAR models on a GPU server, namely UniFormerV2, TimeSformer, I3D, and SlowFast. Results of the four models are presented in this paper based on class-wise and macro performance metrics, inference efficiency, model complexity, and computational costs. TimeSformer is proposed for developing the real-time human action recognition model, leveraging its leading macro F1 score (95.33%), recall (95.49%), and precision (95.19%), along with significantly higher inference throughput than the others. This research provides insights to enhance the safety and health of the elderly and people with chronic illnesses in assisted living environments, fostering sustainable care, smarter communities, and industry innovation.
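
A plausible transfer-learning setup for the TimeSformer step, using the Hugging Face implementation; the checkpoint name, input shape, and training details here are assumptions rather than the paper's exact recipe:

```python
import torch
from transformers import TimesformerForVideoClassification

# Swap the Kinetics-400 head for a 4-way head matching the paper's classes.
# Checkpoint and preprocessing are assumptions, not the paper's stated setup.
model = TimesformerForVideoClassification.from_pretrained(
    "facebook/timesformer-base-finetuned-k400",
    num_labels=4,                      # Falling, Staggering, Chest Pain, Normal
    ignore_mismatched_sizes=True,      # replace the 400-way classifier head
)

video = torch.rand(2, 8, 3, 224, 224)  # (batch, frames, channels, H, W)
labels = torch.tensor([0, 3])
out = model(pixel_values=video, labels=labels)
out.loss.backward()                     # fine-tune on the four-class dataset
```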

Advancing Deep Learning through Probability Engineering: A Pragmatic Paradigm for Modern AI arxiv.org/abs/2503.18958 .PR .ML .AI

Recent years have witnessed the rapid progression of deep learning, pushing us closer to the realization of AGI (Artificial General Intelligence). Probabilistic modeling is critical to many of these advancements, as it provides a foundational framework for capturing data distributions. However, as the scale and complexity of AI applications grow, traditional probabilistic modeling faces escalating challenges: high-dimensional parameter spaces, heterogeneous data sources, and evolving real-world requirements often render classical approaches insufficiently flexible. This paper proposes a novel concept, Probability Engineering, which treats the already-learned probability distributions within deep learning as engineering artifacts. Rather than merely fitting or inferring distributions, we actively modify and reinforce them to better address the diverse and evolving demands of modern AI. Specifically, Probability Engineering introduces novel techniques and constraints to refine existing probability distributions, improving their robustness, efficiency, adaptability, or trustworthiness. We showcase this paradigm through a series of applications spanning Bayesian deep learning, Edge AI (including federated learning and knowledge distillation), and Generative AI (such as text-to-image generation with diffusion models and high-quality text generation with large language models). These case studies demonstrate how probability distributions, once treated as static objects, can be engineered to meet the diverse and evolving requirements of large-scale, data-intensive, and trustworthy AI systems. By systematically expanding and strengthening the role of probabilistic modeling, Probability Engineering paves the way for more robust, adaptive, efficient, and trustworthy deep learning solutions in today's fast-growing AI era.
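
One of the simplest instances of treating a learned distribution as an engineering artifact is post-hoc temperature scaling of a trained classifier's predictive distribution; the paper's techniques go well beyond this, but it illustrates the mindset:

```python
import torch

def temperature_scale(logits, T):
    """Rescale a trained classifier's logits: T > 1 flattens the predictive
    distribution, T < 1 sharpens it. A minimal example of adjusting an
    already-learned distribution post hoc (e.g., for calibration)."""
    return torch.softmax(logits / T, dim=-1)

logits = torch.tensor([[4.0, 1.0, 0.5]])
print(temperature_scale(logits, 1.0))   # raw, possibly overconfident distribution
print(temperature_scale(logits, 2.5))   # engineered, better-spread distribution
```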

Unifying EEG and Speech for Emotion Recognition: A Two-Step Joint Learning Framework for Handling Missing EEG Data During Inference arxiv.org/abs/2503.18964 .SD .AI

Computer interfaces are advancing towards using multiple modalities to enable better human-computer interactions. The use of automatic emotion recognition (AER) can make these interactions natural and meaningful, thereby enhancing the user experience. Though speech is the most direct and intuitive modality for AER, it is not reliable because it can be intentionally faked by humans. On the other hand, physiological modalities like EEG are more reliable and impossible to fake. However, EEG is infeasible in realistic usage scenarios because it requires a specialized recording setup. In this paper, one of our primary aims is to ride on the reliability of the EEG modality to facilitate robust AER on the speech modality. Our approach uses both modalities during training to reliably identify emotion at the time of inference, even in the absence of the more reliable EEG modality. We propose a two-step joint multi-modal learning approach (JMML) that exploits both the intra- and inter-modal characteristics to construct emotion embeddings that enrich the performance of AER. In the first step, intra-modal learning is done independently on the individual modalities using JEC-SSL. This is followed by inter-modal learning using the proposed extended variant of the deep canonically correlated cross-modal autoencoder (E-DCC-CAE). The approach learns the joint properties of both modalities by mapping them into a common representation space such that the modalities are maximally correlated. These emotion embeddings hold properties of both modalities, thereby enhancing the performance of the ML classifier used for AER. Experimental results show the efficacy of the proposed approach. To the best of our knowledge, this is the first attempt to combine speech and EEG with a joint multi-modal learning approach for reliable AER.
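
A structural sketch of the inter-modal step, assuming fixed-size speech and EEG feature vectors and using a cosine term as a stand-in for the CCA-style correlation objective; dimensions and losses are illustrative, not the paper's exact E-DCC-CAE formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAE(nn.Module):
    """Sketch in the spirit of E-DCC-CAE: encode speech and EEG features
    into a shared space, decode each modality back, and encourage the two
    embeddings to be maximally correlated."""
    def __init__(self, d_speech=128, d_eeg=64, d_shared=32):
        super().__init__()
        self.enc_s = nn.Sequential(nn.Linear(d_speech, 64), nn.ReLU(),
                                   nn.Linear(64, d_shared))
        self.enc_e = nn.Sequential(nn.Linear(d_eeg, 64), nn.ReLU(),
                                   nn.Linear(64, d_shared))
        self.dec_s = nn.Linear(d_shared, d_speech)
        self.dec_e = nn.Linear(d_shared, d_eeg)

    def forward(self, xs, xe):
        zs, ze = self.enc_s(xs), self.enc_e(xe)
        recon = (F.mse_loss(self.dec_s(zs), xs) +
                 F.mse_loss(self.dec_e(ze), xe))
        # cosine proxy for the canonical-correlation objective
        corr = F.cosine_similarity(zs, ze, dim=-1).mean()
        return recon - corr, zs, ze   # loss, plus emotion embeddings
```

At inference, only enc_s is needed, which is how the scheme tolerates missing EEG.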

MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via Reasoning Agentic Workflow arxiv.org/abs/2503.18968 .AI

Developing reliable AI systems to assist human clinicians in multi-modal medical diagnosis has long been a key objective for researchers. Recently, Multi-modal Large Language Models (MLLMs) have gained significant attention and achieved success across various domains. With strong reasoning capabilities and the ability to perform diverse tasks based on user instructions, they hold great potential for enhancing medical diagnosis. However, directly applying MLLMs to the medical domain still presents challenges. They lack detailed perception of visual inputs, limiting their ability to perform quantitative image analysis, which is crucial for medical diagnostics. Additionally, MLLMs often exhibit hallucinations and inconsistencies in reasoning, whereas clinical diagnoses must adhere strictly to established criteria. To address these challenges, we propose MedAgent-Pro, an evidence-based reasoning agentic system designed to achieve reliable, explainable, and precise medical diagnoses. This is accomplished through a hierarchical workflow: at the task level, knowledge-based reasoning generates reliable diagnostic plans for specific diseases following retrieved clinical criteria, while at the case level, multiple tool agents process multi-modal inputs, analyze different indicators according to the plan, and provide a final diagnosis based on both quantitative and qualitative evidence. Comprehensive experiments on both 2D and 3D medical diagnosis tasks demonstrate the superiority and effectiveness of MedAgent-Pro, while case studies further highlight its reliability and interpretability. The code is available at https://github.com/jinlab-imvr/MedAgent-Pro.
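
A structural sketch of the two-level workflow, with plan_for, tool_agents, and summarize_evidence as hypothetical stand-ins for the criteria-grounded planner, the per-indicator tool agents, and the final decision step (the real interfaces live in the linked repository):

```python
def diagnose(case, disease, plan_for, tool_agents, summarize_evidence):
    """Hierarchical sketch: a task-level plan grounded in clinical criteria,
    then case-level tool agents gathering evidence per indicator. All
    callables are hypothetical placeholders, not the repository's API."""
    plan = plan_for(disease)              # task level: criteria-grounded plan
    evidence = {}
    for indicator in plan:                # case level: one tool agent per indicator
        agent = tool_agents[indicator]
        evidence[indicator] = agent(case)  # quantitative/qualitative finding
    return summarize_evidence(evidence, plan)
```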
