Show newer

Are We There Yet? A Measurement Study of Efficiency for LLM Applications on Mobile Devices arxiv.org/abs/2504.00002 .PF .AI .HC .NI

Are We There Yet? A Measurement Study of Efficiency for LLM Applications on Mobile Devices

Recent advancements in large language models (LLMs) have prompted interest in deploying these models on mobile devices to enable new applications without relying on cloud connectivity. However, the efficiency constraints of deploying LLMs on resource-limited devices present significant challenges. In this paper, we conduct a comprehensive measurement study to evaluate the efficiency tradeoffs between mobile-based, edge-based, and cloud-based deployments for LLM applications. We implement AutoLife-Lite, a simplified LLM-based application that analyzes smartphone sensor data to infer user location and activity contexts. Our experiments reveal that: (1) Only small-size LLMs (<4B parameters) can run successfully on powerful mobile devices, though they exhibit quality limitations compared to larger models; (2) Model compression is effective in lower the hardware requirement, but may lead to significant performance degradation; (3) The latency to run LLMs on mobile devices with meaningful output is significant (>30 seconds), while cloud services demonstrate better time efficiency (<10 seconds); (4) Edge deployments offer intermediate tradeoffs between latency and model capabilities, with different results on CPU-based and GPU-based settings. These findings provide valuable insights for system designers on the current limitations and future directions for on-device LLM applications.

arXiv.org

LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration arxiv.org/abs/2504.00010 .LG .GR .MA

LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration

Text-to-image generation (T2I) has become a key area of research with broad applications. However, existing methods often struggle with complex spatial relationships and fine-grained control over multiple concepts. Many existing approaches require significant architectural modifications, extensive training, or expert-level prompt engineering. To address these challenges, we introduce \textbf{LayerCraft}, an automated framework that leverages large language models (LLMs) as autonomous agents for structured procedural generation. LayerCraft enables users to customize objects within an image and supports narrative-driven creation with minimal effort. At its core, the system includes a coordinator agent that directs the process, along with two specialized agents: \textbf{ChainArchitect}, which employs chain-of-thought (CoT) reasoning to generate a dependency-aware 3D layout for precise instance-level control, and the \textbf{Object-Integration Network (OIN)}, which utilizes LoRA fine-tuning on pre-trained T2I models to seamlessly blend objects into specified regions of an image based on textual prompts without requiring architectural changes. Extensive evaluations demonstrate LayerCraft's versatility in applications ranging from multi-concept customization to storytelling. By providing non-experts with intuitive, precise control over T2I generation, our framework democratizes creative image creation. Our code will be released upon acceptance at github.com/PeterYYZhang/LayerCraft

arXiv.org

Medical Reasoning in LLMs: An In-Depth Analysis of DeepSeek R1 arxiv.org/abs/2504.00016 .CL

Medical Reasoning in LLMs: An In-Depth Analysis of DeepSeek R1

Integrating large language models (LLMs) like DeepSeek R1 into healthcare requires rigorous evaluation of their reasoning alignment with clinical expertise. This study assesses DeepSeek R1's medical reasoning against expert patterns using 100 MedQA clinical cases. The model achieved 93% diagnostic accuracy, demonstrating systematic clinical judgment through differential diagnosis, guideline-based treatment selection, and integration of patient-specific factors. However, error analysis of seven incorrect cases revealed persistent limitations: anchoring bias, challenges reconciling conflicting data, insufficient exploration of alternatives, overthinking, knowledge gaps, and premature prioritization of definitive treatment over intermediate care. Crucially, reasoning length correlated with accuracy - shorter responses (<5,000 characters) were more reliable, suggesting extended explanations may signal uncertainty or rationalization of errors. While DeepSeek R1 exhibits foundational clinical reasoning capabilities, recurring flaws highlight critical areas for refinement, including bias mitigation, knowledge updates, and structured reasoning frameworks. These findings underscore LLMs' potential to augment medical decision-making through artificial reasoning but emphasize the need for domain-specific validation, interpretability safeguards, and confidence metrics (e.g., response length thresholds) to ensure reliability in real-world applications.

arXiv.org

SandboxEval: Towards Securing Test Environment for Untrusted Code arxiv.org/abs/2504.00018 .CR .LG

SandboxEval: Towards Securing Test Environment for Untrusted Code

While large language models (LLMs) are powerful assistants in programming tasks, they may also produce malicious code. Testing LLM-generated code therefore poses significant risks to assessment infrastructure tasked with executing untrusted code. To address these risks, this work focuses on evaluating the security and confidentiality properties of test environments, reducing the risk that LLM-generated code may compromise the assessment infrastructure. We introduce SandboxEval, a test suite featuring manually crafted test cases that simulate real-world safety scenarios for LLM assessment environments in the context of untrusted code execution. The suite evaluates vulnerabilities to sensitive information exposure, filesystem manipulation, external communication, and other potentially dangerous operations in the course of assessment activity. We demonstrate the utility of SandboxEval by deploying it on an open-source implementation of Dyff, an established AI assessment framework used to evaluate the safety of LLMs at scale. We show, first, that the test suite accurately describes limitations placed on an LLM operating under instructions to generate malicious code. Second, we show that the test results provide valuable insights for developers seeking to harden assessment infrastructure and identify risks associated with LLM execution activities.

arXiv.org

ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding arxiv.org/abs/2504.00019 .CL .AI .SE

ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding

Language models (LMs) have become a staple of the code-writing toolbox. Their pre-training recipe has, however, remained stagnant over recent years, barring the occasional changes in data sourcing and filtering strategies. In particular, research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling between syntax and semantics, has been noticeably sparse, especially compared with corresponding efforts in natural language LMs. In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond the surface-form syntax and enhance their pre-training sample efficiency. To this end, we compile ObscuraX, a dataset of approximately 55M source and obfuscated code pairs in seven languages. Subsequently, we pre-train ObscuraCoder models, ranging in size from 255M to 2.8B parameters, on a 272B-token corpus that includes ObscuraX and demonstrate that our obfuscation-based pre-training recipe leads to consistent improvements in Code-LMs' abilities compared to both vanilla autoregressive pre-training as well as existing de-obfuscation (DOBF) objectives. ObscuraCoder demonstrates sizeable gains across multiple tests of syntactic and semantic code understanding, along with improved capabilities in multilingual code completion, multilingual code commit summarization, and multi-purpose library-oriented code generation.

arXiv.org

Provenance of Adaptation in Scientific and Business Workflows -- Literature Review arxiv.org/abs/2503.22685 .SE .CR

Provenance of Adaptation in Scientific and Business Workflows -- Literature Review

In the world of science new technology have opened up the possibility to rely on advanced computational methods and models to conduct and produce scientific research. An important aspect of scientific and business workflows is provenance - which refers to the information describing the production, history or lineage of an end product, which can also be data, digitalized processes and other not tangible artifacts. While there are already systems, tools and standards to capture provenance of data and workflows the provenance of adaptations/changes in workflows has not been addressed yet. In this paper we carry out a literature review to establish the state of the art on this topic and present our methodology and findings. Our findings confirm that provenance of adaptation has not been addressed adequately in the fields of business and scientific workflows. The two fields also have different motivation for recording the lineage of data or processes. While scientific workflows are interested in reproducibility and visualization, business workflows solutions are indirectly connected to compliance, exception handling and analysis. The adaptive nature of workflows in both fields is not reflected in the research on process provenance yet, as our results show. The use of standard provenance standards is also not wide spread.

arXiv.org

Truth in Text: A Meta-Analysis of ML-Based Cyber Information Influence Detection Approaches arxiv.org/abs/2503.22686 .CR .LG

Truth in Text: A Meta-Analysis of ML-Based Cyber Information Influence Detection Approaches

Cyber information influence, or disinformation in general terms, is widely regarded as one of the biggest threats to social progress and government stability. From US presidential elections to European Union referendums and down to regional news reporting of wildfires, lies and post-truths have normalized radical decision-making. Accordingly, there has been an explosion in research seeking to detect disinformation in online media. The frontier of disinformation detection research is leveraging a variety of ML techniques such as traditional ML algorithms like Support Vector Machines, Random Forest, and Naïve Bayes. Other research has applied deep learning models including Convolutional Neural Networks, Long Short-Term Memory networks, and transformer-based architectures. Despite the overall success of such techniques, the literature demonstrates inconsistencies when viewed holistically which limits our understanding of the true effectiveness. Accordingly, this work employed a two-stage meta-analysis to (a) demonstrate an overall meta statistic for ML model effectiveness in detecting disinformation and (b) investigate the same by subgroups of ML model types. The study found the majority of the 81 ML detection techniques sampled have greater than an 80\% accuracy with a Mean sample effectiveness of 79.18\% accuracy. Meanwhile, subgroups demonstrated no statistically significant difference between-approaches but revealed high within-group variance. Based on the results, this work recommends future work in replication and development of detection methods operating at the ML model level.

arXiv.org

CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation arxiv.org/abs/2503.22688 .SE .PL

CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation

Large Language Models (LLMs) have demonstrated exceptional performance in code generation tasks and have become indispensable programming assistants for developers. However, existing code generation benchmarks primarily assess the functional correctness of code generated by LLMs in single-turn interactions, offering limited insight into their capabilities to generate code that strictly follows users' instructions, especially in multi-turn interaction scenarios. In this paper, we introduce \bench, a benchmark for evaluating LLMs' instruction-following capabilities in interactive code generation. Specifically, \bench incorporates nine types of verifiable instructions aligned with the real-world software development requirements, which can be independently and objectively validated through specified test cases, facilitating the evaluation of instruction-following capability in multi-turn interactions. We evaluate nine prominent LLMs using \bench, and the experimental results reveal a significant disparity between their basic programming capability and instruction-following capability, particularly as task complexity, context length, and the number of dialogue rounds increase.

arXiv.org

From Occurrence to Consequence: A Comprehensive Data-driven Analysis of Building Fire Risk arxiv.org/abs/2503.22689 .data-an .AP .LG

From Occurrence to Consequence: A Comprehensive Data-driven Analysis of Building Fire Risk

Building fires pose a persistent threat to life, property, and infrastructure, emphasizing the need for advanced risk mitigation strategies. This study presents a data-driven framework analyzing U.S. fire risks by integrating over one million fire incident reports with diverse fire-relevant datasets, including social determinants, building inventories, weather conditions, and incident-specific factors. By adapting machine learning models, we identify key risk factors influencing fire occurrence and consequences. Our findings show that vulnerable communities, characterized by socioeconomic disparities or the prevalence of outdated or vacant buildings, face higher fire risks. Incident-specific factors, such as fire origins and safety features, strongly influence fire consequences. Buildings equipped with fire detectors and automatic extinguishing systems experience significantly lower fire spread and injury risks. By pinpointing high-risk areas and populations, this research supports targeted interventions, including mandating fire safety systems and providing subsidies for disadvantaged communities. These measures can enhance fire prevention, protect vulnerable groups, and promote safer, more equitable communities.

arXiv.org

Fragile Mastery: Are Domain-Specific Trade-Offs Undermining On-Device Language Models? arxiv.org/abs/2503.22698 .CL

Fragile Mastery: Are Domain-Specific Trade-Offs Undermining On-Device Language Models?

The application of on-device language models (ODLMs) on resource-constrained edge devices is a multi-dimensional problem that strikes a fine balance between computational effectiveness, memory, power usage, and linguistic capacity across heterogeneous tasks. This holistic study conducts a thorough investigation of the trade-offs between domain-specific optimization and cross-domain robustness, culminating in the proposal of the Generalized Edge Model (GEM), a new architecture that aims to balance specialization and generalization in a harmonious manner. With a rigorous experimental approach testing 47 well-chosen benchmarks in eight domains--healthcare, law, finance, STEM, commonsense, conversational AI, multilingual, and domain-adaptive tasks--we show that conventional optimization techniques decrease target task perplexity by 18-25% but result in a precipitous decline in general-task performance with F1 scores decreasing by 12-29%, as reported by Liu et al. GEM employs a Sparse Cross-Attention Router (SCAR) to dynamically allocate computation to a variable number of computing resources with a cross-domain F1 accuracy of 0.89 on less than 100ms latency across Raspberry Pi 4, Pixel 6, iPhone 13, and bespoke custom neural processing units (NPUs). Compared to GPT-4 Lite, GEM enhances the general-task level by 7% with respect and parity in domain-specific performance. We propose three new measurement tools--Domain Specialization Index (DSI), Generalization Gap (GG), and Cross-Domain Transfer Ratio (CDTR)--which show strong correlation between model compression intensity and brittleness.

arXiv.org

CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation arxiv.org/abs/2503.22708 .AI .CL

CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation

Despite the surge of interest in autonomous scientific discovery (ASD) of software artifacts (e.g., improved ML algorithms), current ASD systems face two key limitations: (1) they largely explore variants of existing codebases or similarly constrained design spaces, and (2) they produce large volumes of research artifacts (such as automatically generated papers and code) that are typically evaluated using conference-style paper review with limited evaluation of code. In this work we introduce CodeScientist, a novel ASD system that frames ideation and experiment construction as a form of genetic search jointly over combinations of research articles and codeblocks defining common actions in a domain (like prompting a language model). We use this paradigm to conduct hundreds of automated experiments on machine-generated ideas broadly in the domain of agents and virtual environments, with the system returning 19 discoveries, 6 of which were judged as being both at least minimally sound and incrementally novel after a multi-faceted evaluation beyond that typically conducted in prior work, including external (conference-style) review, code review, and replication attempts. Moreover, the discoveries span new tasks, agents, metrics, and data, suggesting a qualitative shift from benchmark optimization to broader discoveries.

arXiv.org
Show older
Qoto Mastodon

QOTO: Question Others to Teach Ourselves
An inclusive, Academic Freedom, instance
All cultures welcome.
Hate speech and harassment strictly forbidden.