Predictive Process Monitoring: a comparison survey between different types of event logs arxiv.org/abs/2504.16933 .SE

Multifaceted Evaluation of Audio-Visual Capability for MLLMs: Effectiveness, Efficiency, Generalizability and Robustness arxiv.org/abs/2504.16936 .AS .MM .CV .SD

Multi-modal large language models (MLLMs) have recently achieved great success in processing and understanding information from diverse modalities (e.g., text, audio, and visual signals). Despite their growing popularity, there remains a lack of comprehensive evaluation measuring the audio-visual capabilities of these models, especially in diverse scenarios (e.g., distribution shifts and adversarial attacks). In this paper, we present a multifaceted evaluation of the audio-visual capability of MLLMs, focusing on four key dimensions: effectiveness, efficiency, generalizability, and robustness. Through extensive experiments, we find that MLLMs exhibit strong zero-shot and few-shot generalization abilities, enabling them to achieve strong performance with limited data. However, their success relies heavily on the vision modality, so performance degrades when visual input is corrupted or missing. Additionally, while MLLMs are susceptible to adversarial samples, they demonstrate greater robustness than traditional models. The experimental results and our findings provide insights into the audio-visual capabilities of MLLMs, highlighting areas for improvement and offering guidance for future research.
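
To make the robustness dimension concrete, the following is a minimal sketch of a corruption-sweep protocol: score a model on clean inputs, under increasing visual noise, and with the visual stream removed. The model interface, corruption type, and severity levels are invented for illustration and are not the paper's benchmark code.

```python
# Hedged sketch of a robustness sweep over the visual modality.
# The model is any callable classifier: model(audio, frames) -> label.
import numpy as np

rng = np.random.default_rng(0)

def corrupt(frames: np.ndarray, severity: float) -> np.ndarray:
    """Add zero-mean Gaussian noise to frames with values in [0, 1]."""
    return np.clip(frames + rng.normal(0.0, severity, frames.shape), 0.0, 1.0)

def accuracy(model, audio, frames, labels) -> float:
    preds = np.array([model(a, f) for a, f in zip(audio, frames)])
    return float((preds == np.asarray(labels)).mean())

def robustness_report(model, audio, frames, labels) -> dict:
    report = {"clean": accuracy(model, audio, frames, labels)}
    for sev in (0.1, 0.3, 0.5):  # increasing corruption severity (assumed)
        noisy = [corrupt(f, sev) for f in frames]
        report[f"noise@{sev}"] = accuracy(model, audio, noisy, labels)
    # Missing-modality case: blank out the visual stream entirely.
    blank = [np.zeros_like(f) for f in frames]
    report["no_vision"] = accuracy(model, audio, blank, labels)
    return report
```

Under a protocol like this, a large gap between the clean and no_vision scores would mirror the paper's finding that performance leans heavily on the vision modality.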

A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions arxiv.org/abs/2504.16939 .AI .CL

Flexibility of German gas-fired generation: evidence from clustering empirical operation arxiv.org/abs/2504.16943 .CY .LG

Burning some myths on privacy properties of social networks against active attacks arxiv.org/abs/2504.16944 .CO .SI

MobileCity: An Efficient Framework for Large-Scale Urban Behavior Simulation arxiv.org/abs/2504.16946 .SI .AI

Surveillance Disguised as Protection: A Comparative Analysis of Sideloaded and In-Store Parental Control Apps arxiv.org/abs/2504.16087 .CR .CY

Launching Insights: A Pilot Study on Leveraging Real-World Observational Data from the Mayo Clinic Platform to Advance Clinical Research arxiv.org/abs/2504.16090 .CY

Background: Artificial intelligence (AI) is transforming healthcare, yet translating AI models from theoretical frameworks to real-world clinical applications remains challenging. The Mayo Clinic Platform (MCP) was established to address these challenges by providing a scalable ecosystem that integrates real-world, multi-modal data from multiple institutions, advanced analytical tools, and secure computing environments to support clinical research and AI development. Methods: In this study, we conducted four research projects leveraging MCP's data infrastructure and analytical capabilities to demonstrate its potential in facilitating real-world evidence generation and AI-driven clinical insights. Utilizing MCP's tools and environment, we facilitated efficient cohort identification, data extraction, and subsequent statistical or AI-powered analyses. Results: The results underscore MCP's role in accelerating translational research by offering de-identified, standardized real-world data and facilitating AI model validation across diverse healthcare settings. Compared to Mayo's internal Electronic Health Record (EHR) data, MCP provides broader accessibility, enhanced data standardization, and multi-institutional integration, making it a valuable resource for both internal and external researchers. Conclusion: Looking ahead, MCP is well-positioned to transform clinical research through its scalable ecosystem, effectively bridging the divide between AI innovation and clinical deployment. Future investigations will build upon this foundation, further exploring MCP's capacity to advance precision medicine and enhance patient outcomes.
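
As a rough illustration of the cohort-identification step mentioned in the methods, here is a minimal pandas sketch. The table layout, ICD-10 code, and lab threshold are invented stand-ins; MCP's actual schema and query tooling are not reproduced here.

```python
# Hedged sketch: select a cohort from de-identified diagnosis and lab tables.
import pandas as pd

diagnoses = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "icd10": ["E11.9", "I10", "E11.9", "J45.0"],  # toy de-identified records
})
labs = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "hba1c": [8.1, 5.4, 7.2, 5.9],
})

# Cohort: patients with a type 2 diabetes code and HbA1c >= 7.0,
# ready for downstream statistical or AI-powered analysis.
cohort = (
    diagnoses[diagnoses["icd10"] == "E11.9"]
    .merge(labs, on="patient_id")
    .query("hba1c >= 7.0")
)
print(cohort["patient_id"].tolist())  # -> [1, 3]
```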

Audio and Multiscale Visual Cues Driven Cross-modal Transformer for Idling Vehicle Detection arxiv.org/abs/2504.16102 .CV .RO

Idling vehicle detection (IVD) supports real-time systems that reduce pollution and emissions by dynamically messaging drivers to curb excess idling behavior. In computer vision, IVD has become an emerging task that leverages video from surveillance cameras and audio from remote microphones to localize and classify vehicles in each frame as moving, idling, or engine-off. As with other cross-modal tasks, the key challenge lies in modeling the correspondence between audio and visual modalities, which differ in representation but provide complementary cues -- video offers spatial and motion context, while audio conveys engine activity beyond the visual field. The previous end-to-end model, which uses a basic attention mechanism, struggles to align these modalities effectively, often missing vehicle detections. To address this issue, we propose AVIVDNetv2, a transformer-based end-to-end detection network. It incorporates a cross-modal transformer with global patch-level learning, a multiscale visual feature fusion module, and decoupled detection heads. Extensive experiments show that AVIVDNetv2 improves mAP by 7.66 over the disjoint baseline and 9.42 over the E2E baseline, with consistent AP gains across all vehicle categories. Furthermore, AVIVDNetv2 outperforms the state-of-the-art method for sounding object localization, establishing a new performance benchmark on the AVIVD dataset.
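
The following is a minimal PyTorch sketch of patch-level cross-modal attention in the spirit of the description above, with visual patch tokens querying audio tokens so each spatial location can absorb engine-activity cues from outside the visual field. The dimensions, token counts, and block layout are assumptions, not AVIVDNetv2's published architecture.

```python
# Hedged sketch of one cross-modal transformer block (visual queries audio).
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # Each visual patch token attends over the audio tokens.
        fused, _ = self.attn(query=visual, key=audio, value=audio)
        x = self.norm1(visual + fused)       # residual + norm, transformer-style
        return self.norm2(x + self.ffn(x))

block = CrossModalBlock()
vis = torch.randn(2, 196, 256)  # e.g., 14x14 visual patch tokens per frame
aud = torch.randn(2, 64, 256)   # e.g., audio spectrogram tokens
print(block(vis, aud).shape)    # torch.Size([2, 196, 256])
```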

CUBETESTERAI: Automated JUnit Test Generation using the LLaMA Model arxiv.org/abs/2504.15286 .SE .AI

This paper presents an approach to automating JUnit test generation for Java applications using the Spring Boot framework, leveraging the LLaMA (Large Language Model Meta AI) model to enhance the efficiency and accuracy of the testing process. The resulting tool, called CUBETESTERAI, includes a user-friendly web interface and the integration of a CI/CD pipeline using GitLab and Docker. These components streamline the automated test generation process, allowing developers to generate JUnit tests directly from their code snippets with minimal manual intervention. The final implementation executes the LLaMA models through RunPod, an online GPU service, which also enhances the privacy of our tool. Using the advanced natural language processing capabilities of the LLaMA model, CUBETESTERAI is able to generate test cases that provide high code coverage and accurate validation of software functionalities in Java-based Spring Boot applications. Furthermore, it efficiently manages resource-intensive operations and refines the generated tests to address common issues like missing imports and handling of private methods. By comparing CUBETESTERAI with state-of-the-art tools, we show that our approach consistently achieves competitive and, in many cases, better code coverage on different real-life Java programs.
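
As a rough illustration of the generate-then-refine loop described above, here is a hedged Python sketch. The prompt wording, the `generate` callable (standing in for a LLaMA model served on RunPod), and the import-repair rule are assumptions, not CUBETESTERAI's actual implementation.

```python
# Hedged sketch: prompt an LLM for a JUnit test, then repair missing imports.
from typing import Callable

PROMPT = """You are a Java testing assistant.
Write a JUnit 5 test class for the following Spring Boot code.
Return only compilable Java code.

{code}
"""

def fix_missing_imports(test_source: str) -> str:
    """Refinement pass: prepend common JUnit imports the model may omit."""
    required = [
        "import org.junit.jupiter.api.Test;",
        "import static org.junit.jupiter.api.Assertions.*;",
    ]
    missing = [imp for imp in required if imp not in test_source]
    return "\n".join(missing + [test_source]) if missing else test_source

def generate_junit_test(java_snippet: str,
                        generate: Callable[[str], str]) -> str:
    raw = generate(PROMPT.format(code=java_snippet))
    return fix_missing_imports(raw)
```

In a pipeline like the one described, the refined test would then be compiled and run in the GitLab/Docker CI stage, with code coverage measured per run.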
