Synthetic Data Generation of Body Motion Data by Neural Gas Network for Emotion Recognition arxiv.org/abs/2503.14513 eess.IV cs.CV cs.AI

In the domain of emotion recognition using body motion, the primary challenge lies in the scarcity of diverse and generalizable datasets. Automatic emotion recognition uses machine learning and artificial intelligence techniques to recognize a person's emotional state from various data types, such as text, images, sound, and body motion. Body motion poses unique challenges, as many factors, such as age, gender, ethnicity, personality, and illness, affect its appearance, leading to a lack of diverse and robust datasets specifically for emotion recognition. To address this, Synthetic Data Generation (SDG) methods such as Generative Adversarial Networks (GANs) and Variational Auto Encoders (VAEs) offer potential solutions, though these methods are often complex. This research introduces a novel application of the Neural Gas Network (NGN) algorithm for synthesizing body motion data while optimizing diversity and generation speed. By learning the topology of the skeletal structure, the NGN fits its neurons, or gas particles, to the body joints. These gas particles, which later form the skeletal structure, are then used to synthesize new body postures; concatenating postures across frames yields the final synthetic body motion. We compared our generated dataset against others generated by GANs, VAEs, and another benchmark algorithm, using metrics such as Fréchet Inception Distance (FID) and Diversity, and we continued the evaluation with classification metrics such as accuracy, precision, and recall. Joint-related features, or kinematic parameters, were extracted, and the system assessed model performance against unseen data. Our findings demonstrate that the NGN algorithm produces more realistic and emotionally distinct body motion data, and does so faster than existing methods.
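The core mechanism of fitting gas particles to joint positions can be sketched with the classic neural gas update of Martinetz and Schulten, which the NGN name refers to. Everything below, including the toy stick-figure skeleton and all parameter values, is an illustrative sketch rather than the paper's actual configuration:

```python
import numpy as np

def neural_gas_fit(joints, n_units=10, n_iters=300, eps=0.5, lam=2.0, seed=0):
    """Fit neural-gas units to a cloud of 2D joint positions.

    Classic rank-based update: for each sample, every unit moves toward
    it, scaled by exp(-rank / lam), so the closest units move the most.
    The step size is annealed linearly over iterations.
    """
    rng = np.random.default_rng(seed)
    units = joints[rng.integers(0, len(joints), n_units)].astype(float)
    for t in range(n_iters):
        x = joints[rng.integers(0, len(joints))]
        # Rank units by distance to the sample (rank 0 = closest).
        ranks = np.argsort(np.argsort(np.linalg.norm(units - x, axis=1)))
        step = eps * (1 - t / n_iters)
        units += step * np.exp(-ranks / lam)[:, None] * (x - units)
    return units

# Toy "skeleton": 7 joints of a stick figure, jittered over 100 frames.
base = np.array([[0, 0], [0, 1], [0, 2], [-1, 1.5], [1, 1.5],
                 [-0.5, -1], [0.5, -1]], dtype=float)
rng = np.random.default_rng(1)
frames = base[None] + 0.05 * rng.standard_normal((100, *base.shape))
samples = frames.reshape(-1, 2)
units = neural_gas_fit(samples, n_units=7)
print(units.shape)  # (7, 2)
```

The fitted units settle inside the cloud of jittered joint positions; repeating the fit per frame and stacking the resulting postures would give a synthetic motion sequence in the spirit the abstract describes.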

AI Work Quantization Model: Closed-System AI Computational Effort Metric arxiv.org/abs/2503.14515 cs.PF cs.CE

The rapid adoption of AI-driven automation in IoT environments, particularly in smart cities and industrial systems, necessitates a standardized approach to quantify AI's computational workload. Existing methodologies lack a consistent framework for measuring AI computational effort across diverse architectures, posing challenges in fair taxation models and energy-aware workload assessments. This study introduces the Closed-System AI Computational Effort Metric, a theoretical framework that quantifies real-time computational effort by incorporating input/output complexity, execution dynamics, and hardware-specific performance factors. The model ensures comparability between AI workloads across traditional CPUs and modern GPU/TPU accelerators, facilitating standardized performance evaluations. Additionally, we propose an energy-aware extension to assess AI's environmental impact, enabling sustainability-focused AI optimizations and equitable taxation models. Our findings establish a direct correlation between AI workload and human productivity, where 5 AI Workload Units equate to approximately 60 to 72 hours of human labor, exceeding a full-time workweek. By systematically linking AI computational effort to human labor, this framework enhances the understanding of AI's role in workforce automation, industrial efficiency, and sustainable computing. Future work will focus on refining the model through dynamic workload adaptation, complexity normalization, and energy-aware AI cost estimation, further broadening its applicability in diverse AI-driven ecosystems.
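The stated correspondence (5 AI Workload Units ≈ 60 to 72 hours of human labor) implies 12 to 14.4 hours per unit. A trivial conversion helper makes the arithmetic explicit; the function name and the assumption of linear scaling are ours, not the paper's:

```python
def awu_to_hours(awu, low=12.0, high=14.4):
    """Estimated human-labor range (hours) for a given AI Workload Unit count.

    Derived from the stated correspondence 5 AWU ~ 60-72 hours,
    i.e. 12-14.4 hours per unit. Linear scaling is an assumption here.
    """
    return awu * low, awu * high

lo, hi = awu_to_hours(5)
print(lo, hi)  # 60.0 72.0
```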

Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control arxiv.org/abs/2503.14517 cs.CV cs.AI

Speech-driven 3D talking face methods should offer both accurate lip synchronization and controllable expressions. Previous methods solely adopt discrete emotion labels to globally control expressions throughout sequences, which limits flexible fine-grained facial control within the spatiotemporal domain. We propose a diffusion-transformer-based 3D talking face generation model, Cafe-Talk, which simultaneously incorporates coarse- and fine-grained multimodal control conditions. Nevertheless, the entanglement of multiple conditions challenges achieving satisfying performance. To disentangle speech audio and fine-grained conditions, we employ a two-stage training pipeline. Specifically, Cafe-Talk is initially trained using only speech audio and coarse-grained conditions. Then, a proposed fine-grained control adapter gradually adds fine-grained instructions represented by action units (AUs) without degrading speech-lip synchronization. To disentangle coarse- and fine-grained conditions, we design a swap-label training mechanism, which enables the dominance of the fine-grained conditions. We also devise a mask-based classifier-free guidance (CFG) technique to regulate the occurrence and intensity of fine-grained control. In addition, a text-based detector is introduced with text-AU alignment to enable natural language user input and further support multimodal control. Extensive experimental results prove that Cafe-Talk achieves state-of-the-art lip-synchronization and expressiveness performance, and that its fine-grained control is well received in user studies. Project page: https://harryxd2018.github.io/cafe-talk/
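A mask-based CFG of this kind can be pictured as gating the guidance term with a per-frame (or per-region) mask, so fine-grained control fires only where and as strongly as the mask says. The sketch below shows this generic idea on toy arrays; it is not Cafe-Talk's exact formulation:

```python
import numpy as np

def masked_cfg(eps_uncond, eps_cond, mask, scale=2.5):
    """Mask-gated classifier-free guidance (illustrative sketch).

    Standard CFG blends unconditional and conditional predictions:
        eps = eps_uncond + scale * (eps_cond - eps_uncond)
    Here the guidance term is additionally gated by a mask in [0, 1],
    regulating both the occurrence and intensity of the control.
    """
    return eps_uncond + scale * mask * (eps_cond - eps_uncond)

# Toy example: 4 frames x 3 features; control only the last two frames.
eps_u = np.zeros((4, 3))
eps_c = np.ones((4, 3))
mask = np.array([0.0, 0.0, 1.0, 0.5])[:, None]
out = masked_cfg(eps_u, eps_c, mask, scale=2.0)
print(out[:, 0])  # [0. 0. 2. 1.]
```

With mask entries of 0 the output falls back to the unconditional prediction; fractional entries scale the control intensity down smoothly.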

ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis arxiv.org/abs/2503.14526 cs.CV cs.GR cs.RO

Vision-language-action (VLA) models present a promising paradigm by training policies directly on real robot datasets like Open X-Embodiment. However, the high cost of real-world data collection hinders further data scaling, thereby restricting the generalizability of VLAs. In this paper, we introduce ReBot, a novel real-to-sim-to-real approach for scaling real robot datasets and adapting VLA models to target domains, addressing the last-mile deployment challenge in robot manipulation. Specifically, ReBot replays real-world robot trajectories in simulation to diversify manipulated objects (real-to-sim), and integrates the simulated movements with inpainted real-world background to synthesize physically realistic and temporally consistent robot videos (sim-to-real). Our approach has several advantages: 1) it enjoys the benefit of real data to minimize the sim-to-real gap; 2) it leverages the scalability of simulation; and 3) it can generalize a pretrained VLA to a target domain with fully automated data pipelines. Extensive experiments in both simulation and real-world environments show that ReBot significantly enhances the performance and robustness of VLAs. For example, in SimplerEnv with the WidowX robot, ReBot improved the in-domain performance of Octo by 7.2% and OpenVLA by 21.8%, and out-of-domain generalization by 19.9% and 9.4%, respectively. For real-world evaluation with a Franka robot, ReBot increased the success rates of Octo by 17% and OpenVLA by 20%. More information can be found at: https://yuffish.github.io/rebot/

SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders arxiv.org/abs/2503.14530 cs.CV cs.AI

Unlearning methods for vision-language models (VLMs) have primarily adapted techniques from large language models (LLMs), relying on weight updates that demand extensive annotated forget sets. Moreover, these methods perform unlearning at a coarse granularity, often leading to excessive forgetting and reduced model utility. To address these issues, we introduce SAUCE, a novel method that leverages sparse autoencoders (SAEs) for fine-grained and selective concept unlearning in VLMs. Briefly, SAUCE first trains SAEs to capture high-dimensional, semantically rich sparse features. It then identifies the features most relevant to the target concept for unlearning. During inference, it selectively modifies these features to suppress specific concepts while preserving unrelated information. We evaluate SAUCE on two distinct VLMs, LLaVA-v1.5-7B and LLaMA-3.2-11B-Vision-Instruct, across two types of tasks: concrete concept unlearning (objects and sports scenes) and abstract concept unlearning (emotions, colors, and materials), encompassing a total of 60 concepts. Extensive experiments demonstrate that SAUCE outperforms state-of-the-art methods by 18.04% in unlearning quality while maintaining comparable model utility. Furthermore, we investigate SAUCE's robustness against widely used adversarial attacks, its transferability across models, and its scalability in handling multiple simultaneous unlearning requests. Our findings establish SAUCE as an effective and scalable solution for selective concept unlearning in VLMs.
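The inference-time step, encoding an activation into sparse features, ablating the concept-linked features, and decoding back, can be illustrated with a plain ReLU autoencoder. The weight shapes, the linear encoder/decoder, and the hard zeroing below are assumptions for illustration, not SAUCE's actual architecture:

```python
import numpy as np

def suppress_concept(h, W_enc, b_enc, W_dec, b_dec, concept_idx):
    """Illustrative SAE-based concept ablation on one hidden activation.

    Encode into sparse features (ReLU), zero the features tied to the
    target concept, and decode back into the model's activation space.
    """
    f = np.maximum(0.0, h @ W_enc + b_enc)   # sparse feature activations
    f[..., concept_idx] = 0.0                # ablate concept features
    return f @ W_dec + b_dec                 # reconstructed activation

rng = np.random.default_rng(0)
d, m = 8, 32                                 # model dim, SAE dictionary size
W_enc = rng.standard_normal((d, m)) * 0.1
W_dec = rng.standard_normal((m, d)) * 0.1
h = rng.standard_normal(d)
# Feature indices [3, 17] stand in for features identified as concept-relevant.
h_clean = suppress_concept(h, W_enc, np.zeros(m), W_dec, np.zeros(d), [3, 17])
print(h_clean.shape)  # (8,)
```

Because only the selected dictionary features are zeroed, the remaining features, and hence unrelated information, pass through the reconstruction untouched, which is the selectivity the abstract emphasizes.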

A Comprehensive Study of IPTV: Challenges, Opportunities, and Future Trends arxiv.org/abs/2503.13450 eess.IV eess.SP cs.NI

IPTV (Internet Protocol Television) is a transformative approach to delivering audio and video services through high-speed Internet networks, enabling direct access to television content via home computers or set-top boxes. Despite its promising advantages, including flexibility, interactivity, and bundled services such as triple play (voice, Internet, and TV) and quadruple play (adding mobile services), IPTV is still in its development phase. Key challenges include achieving a Quality of Service (QoS) comparable to traditional broadcasters, addressing limited bandwidth, and overcoming a lack of standardization among service providers. This paper explores the technical, operational, and consumer-oriented aspects of IPTV. It discusses data compression techniques, protocols like IGMP and RTSP, and the role of advanced codecs like H.264 in ensuring efficient data transmission. The study also examines the distinctions between IPTV and open-network Internet TV, the importance of security and privacy, and the emergence of new business opportunities through targeted advertising and interactive services. Although IPTV is unlikely to completely replace traditional broadcasting, it is poised to play an important role in shaping the future of television by offering personalized, secure, and scalable viewing experiences.
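Channel selection in IPTV typically maps onto the IGMP mechanics the paper mentions: the set-top box joins the multicast group carrying the chosen channel, and the network delivers that stream. A minimal sketch of building the membership request a client would pass to setsockopt; the group address is illustrative, not a real service:

```python
import ipaddress
import socket
import struct

def igmp_membership_request(group, iface="0.0.0.0"):
    """Build the ip_mreq structure used to join an IPv4 multicast group.

    Passing this to setsockopt(IPPROTO_IP, IP_ADD_MEMBERSHIP, mreq) makes
    the kernel emit an IGMP membership report -- the mechanism an IPTV
    set-top box uses to subscribe to a channel's stream.
    """
    addr = ipaddress.ip_address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    # ip_mreq: 4-byte group address followed by 4-byte local interface.
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))

mreq = igmp_membership_request("239.1.1.1")
print(len(mreq))  # 8
```

Switching channels is then a leave on the old group (IP_DROP_MEMBERSHIP) followed by a join on the new one, which is why channel-change latency is tied to IGMP behavior.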

Digital audiovisual archives in humanities arxiv.org/abs/2503.13452 cs.DL

This report, authored in 2003, presents an innovative approach to the management and utilization of audiovisual archives in the humanities and social sciences. Developed by the research team ESCoM, under the auspices of the Maison des Sciences de l'Homme (MSH) in Paris, this program predated platforms like YouTube and was groundbreaking in its vision for the digital preservation, segmentation, and classification of audiovisual content. Its objectives included creating a heritage of scientific knowledge, developing advanced tools for its annotation and reuse, and facilitating the dissemination of specialized research to a broad audience.

At its core, the report outlines the development of an integrated environment that allows users to index, annotate, and classify audiovisual segments through personalized ontologies and thematic grids. The proposed methods rely on cutting-edge concepts, such as semantic web technologies, knowledge representation, and conceptual graph editing, to enable researchers and educators to create tailored archives and new multimedia resources. This forward-thinking approach aligns with modern practices of content reuse and republication, demonstrating a vision well ahead of its time.

The program also emphasizes the importance of segmenting and indexing audiovisual materials based on user-defined criteria, enabling researchers to identify and highlight specific thematic or conceptual elements within a vast pool of data. By facilitating this level of granularity, the system supports personalized academic and professional applications, including multimedia presentations, educational resources, and research dissemination. It introduces tools such as enhanced media players, ontology builders, and annotation editors to make this process accessible and collaborative.

Finally, the report discusses the Opales project, a collaborative initiative that exemplifies this innovative framework. The project developed a prototype environment integrating tools for creating "hyper-documents" and supporting multilingual, multi-platform content dissemination. Despite the technological and methodological challenges of the time, the report's vision of interactive, richly annotated audiovisual archives has set the stage for the development of contemporary digital knowledge ecosystems. Its emphasis on semantic representation and user-centric customization continues to resonate in the digital humanities today.

E-Semiotics

E-Semiotics is a conceptual and practical framework for designing, developing, and managing digital information and knowledge products. It applies semiotic principles to digital environments, focusing on the structural, contextual, and narrative organization of information. Central to E-Semiotics is the concept of "scenario building," which acts as a template or guide for creating and maintaining digital products and services, ensuring usability, adaptability, and efficiency.

This approach distinguishes itself from traditional semiotics by addressing the unique features of digital media, such as interactivity, hypertextuality, and modularity. It requires a dual competency in semiotics and technology, making it particularly relevant for developing interactive digital products like e-learning systems, digital libraries, and web portals. E-Semiotics also integrates seamlessly with knowledge management, offering conceptual models and technological tools to optimize the storage, retrieval, and dissemination of information.

The methodology includes both a semiotic approach, which focuses on understanding the structural and contextual dimensions of information, and a technological approach, which ensures interoperability, reusability, and scalability of digital tools. It has broad applications in areas such as multi-support publishing, semantic web development, and the creation of dynamic websites and web services. These applications empower organizations, particularly small and medium-sized ones, to leverage digital technologies without extensive technical expertise.

E-Semiotics faces challenges like conceptual complexity and economic barriers, but its potential lies in democratizing access to digital tools and fostering innovation. It bridges the gap between theory and practice, offering scalable solutions that respond to evolving user needs. This framework is poised to play a critical role in the digital transformation of communication and knowledge systems, supporting organizations in adapting to the demands of a rapidly changing digital landscape.

Modeling and Analysis of Non-Terrestrial Networks by Spherical Stochastic Geometry arxiv.org/abs/2503.13455 cs.NI

Non-terrestrial networks (NTNs) are anticipated to be indispensable in extending coverage and enabling global communication access in next-generation wireless networks. With the extensive deployment of non-terrestrial platforms, evaluating the performance of NTN-enabled communication systems becomes a challenging task. Spherical stochastic geometry (SG) is a recently proposed analytical framework that has garnered increasing attention. Due to its suitability for modeling large-scale dynamic topologies and its ability to provide an analytical framework for interference analysis and low-complexity performance evaluation, spherical SG has been widely applied in NTN performance analysis. This paper surveys the modeling and analysis of NTNs based on spherical SG. We begin by introducing the spherical SG framework, detailing its history and development. Next, we categorize existing spherical SG models into three types based on orbital modeling methods and provide algorithm implementations for common models. Furthermore, we investigate the accuracy and necessity of spherical modeling through case studies. On the topology level, concepts such as association strategy, central angle, zenith angle, contact angle, and availability probability are introduced, with simple derivations provided. On the channel level, we detail the modeling of large-scale fading, small-scale fading, and beam gain for different channel links. Finally, we discuss several advanced topics that have not been fully explored but have strong motivation and research potential, and we predict future research directions.
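The topology-level quantities the survey lists are tied together by elementary geometry: for a user on the Earth's surface and a satellite at a given central angle, the law of cosines on the Earth-center/user/satellite triangle yields the slant distance, and a second application yields the zenith angle at the user. A small sketch, with an illustrative 550 km LEO shell (Starlink-like numbers, not taken from the paper):

```python
import math

R_E = 6371.0  # mean Earth radius, km

def slant_distance(phi, r_s, r_e=R_E):
    """User-to-satellite distance for central angle phi (law of cosines)."""
    return math.sqrt(r_e**2 + r_s**2 - 2 * r_e * r_s * math.cos(phi))

def zenith_angle(phi, r_s, r_e=R_E):
    """Zenith angle at the user for a satellite at central angle phi.

    From the law of cosines applied at the user vertex:
        r_s^2 = r_e^2 + d^2 + 2 r_e d cos(theta_zenith)
    """
    d = slant_distance(phi, r_s, r_e)
    return math.acos((r_s**2 - r_e**2 - d**2) / (2 * r_e * d))

r_s = R_E + 550.0  # orbital radius of a 550 km altitude shell
print(round(slant_distance(0.0, r_s), 1))              # 550.0
print(round(math.degrees(zenith_angle(0.0, r_s)), 1))  # 0.0
```

At central angle zero the satellite is directly overhead (slant distance equals the altitude, zenith angle zero); as the central angle grows, both increase until the satellite drops below the horizon, which is what availability-probability derivations in spherical SG build on.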

Qoto Mastodon