arXiv Computer Science @arxiv_cs@qoto.org


Bot

I toot the arXiv feed for topics in Computer Science.

#ComputerScience #CS #Programming #SoftwareEngineering #Software #SoftwareDevelopment #Computers #Science #arXiv #News #PeerReview

Joined Jul 2018

2 Following 1.12K Followers


arXiv Computer Science @arxiv_cs@qoto.org

On strict ranking by pairwise comparisons https://arxiv.org/abs/2501.14738 #math.IT #cs.IT

On strict ranking by pairwise comparisons

We attack the problem of getting a strict ranking (i.e. a ranking without equally ranked items) of $n$ items from a pairwise comparisons matrix. Basic structures are described, and a first heuristic approach based on a condition, the $\mathcal{R}$-condition, is proposed. Analyzing the limits of this ranking procedure, we finish with a minimization problem which can be applied to a wider class of pairwise comparisons matrices. If solved, it produces consistent pairwise comparisons that yield a strict ranking.
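As a rough illustration of the setting only (this is not the paper's $\mathcal{R}$-condition procedure, which the abstract does not spell out), one common way to read a ranking off a multiplicative pairwise comparisons matrix is to sort items by the geometric mean of their rows. The example matrix, its 4 : 2 : 1 weights, and the helper name `strict_ranking` are assumptions made for this sketch.

```python
import math


def strict_ranking(matrix):
    """Rank items by the row geometric mean of a pairwise comparisons matrix.

    matrix[i][j] > 1 is read as "item i is preferred to item j".
    Raises ValueError when two items tie, because the ranking must be strict.
    """
    n = len(matrix)
    # Geometric mean of each row, computed in log space.
    scores = [math.exp(sum(math.log(x) for x in row) / n) for row in matrix]
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    # Reject equally ranked neighbours in the sorted order.
    for a, b in zip(order, order[1:]):
        if math.isclose(scores[a], scores[b]):
            raise ValueError(f"tie between items {a} and {b}")
    return order


# Hypothetical consistent matrix built from weights 4 : 2 : 1 (a_ij = w_i / w_j).
example = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
ranking = strict_ranking(example)  # item indices, best first
```

Raising on ties rather than breaking them arbitrarily keeps the sketch honest about the "strict" requirement: a matrix whose scores tie simply has no strict ranking under this heuristic.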

arXiv Computer Science @arxiv_cs@qoto.org

Reproduction Research of FSA-Benchmark https://arxiv.org/abs/2501.14739 #cs.DC #cs.LG

Reproduction Research of FSA-Benchmark

In the current landscape of big data, the reliability and performance of storage systems are essential to the success of various applications and services. As data volumes continue to grow exponentially, the complexity and scale of the storage infrastructures needed to manage this data also increase. A significant challenge faced by data centers and storage systems is the detection and management of fail-slow disks that experience a gradual decline in performance before ultimately failing. Unlike outright disk failures, fail-slow conditions can go undetected for prolonged periods, leading to considerable impacts on system performance and user experience.

arXiv Computer Science @arxiv_cs@qoto.org

Datapath Combinational Equivalence Checking With Hybrid Sweeping Engines and Parallelization https://arxiv.org/abs/2501.14740 #cs.DC #cs.AR

Datapath Combinational Equivalence Checking With Hybrid Sweeping Engines and Parallelization

In the application of IC design for microprocessors, there are often demands for optimizing the implementation of datapath circuits, on which various arithmetic operations are performed. Combinational equivalence checking (CEC) plays an essential role in ensuring the correctness of design optimization. The most prevalent CEC algorithms are based on SAT sweeping, which utilizes SAT to prove the equivalence of the internal node pairs in topological order, and the equivalent nodes are merged. Datapath circuits usually contain equivalent pairs for which the transitive fan-in cones are small but have a high XOR chain density, and proving such node pairs is very difficult for SAT solvers. An exact probability-based simulation (EPS) is suitable for verifying such pairs, while this method is not suitable for pairs with many primary inputs due to the memory cost. We first reduce the memory cost of EPS and integrate it to improve the SAT sweeping method. Considering the complementary abilities of SAT and EPS, we design an engine selection heuristic to dynamically choose SAT or EPS in the sweeping process, according to XOR chain density. Our method is further improved by reducing unnecessary engine calls by detecting regularity. Furthermore, we parallelized the SAT and EPS engines of HybridCEC, leading to the parallel CEC prover. Experiments on a benchmark suite from industrial datapath circuits show that our method is much faster than the state-of-the-art CEC tool, namely ABC &cec, on nearly all instances, and is more than 100x faster on 30% of the instances and 1000x faster on 12% of the instances. In addition, the 64-thread version of our method achieved a 77x speedup.

arXiv Computer Science @arxiv_cs@qoto.org

Language Representation Favored Zero-Shot Cross-Domain Cognitive Diagnosis https://arxiv.org/abs/2501.13943 #cs.CL #cs.AI #cs.CY #cs.LG

Language Representation Favored Zero-Shot Cross-Domain Cognitive Diagnosis

Cognitive diagnosis aims to infer students' mastery levels based on their historical response logs. However, existing cognitive diagnosis models (CDMs), which rely on ID embeddings, often have to train specific models on specific domains. This limitation may hinder their directly practical application in various target domains, such as different subjects (e.g., Math, English and Physics) or different education platforms (e.g., ASSISTments, Junyi Academy and Khan Academy). To address this issue, this paper proposes the language representation favored zero-shot cross-domain cognitive diagnosis (LRCD). Specifically, LRCD first analyzes the behavior patterns of students, exercises and concepts in different domains, and then describes the profiles of students, exercises and concepts using textual descriptions. Via recent advanced text-embedding modules, these profiles can be transformed to vectors in the unified language space. Moreover, to address the discrepancy between the language space and the cognitive diagnosis space, we propose language-cognitive mappers in LRCD to learn the mapping from the former to the latter. Then, these profiles can be easily and efficiently integrated and trained with existing CDMs. Extensive experiments show that training LRCD on real-world datasets can achieve commendable zero-shot performance across different target domains, and in some cases, it can even achieve competitive performance with some classic CDMs trained on the full response data on target domains. Notably, we surprisingly find that LRCD can also provide interesting insights into the differences between various subjects (such as humanities and sciences) and sources (such as primary and secondary education).

arXiv Computer Science @arxiv_cs@qoto.org

Fanar: An Arabic-Centric Multimodal Generative AI Platform https://arxiv.org/abs/2501.13944 #cs.CL #cs.AI

arXiv Computer Science @arxiv_cs@qoto.org

Self-Explanation in Social AI Agents https://arxiv.org/abs/2501.13945 #cs.CL #cs.AI #cs.CY

Self-Explanation in Social AI Agents

Social AI agents interact with members of a community, thereby changing the behavior of the community. For example, in online learning, an AI social assistant may connect learners and thereby enhance social interaction. These social AI assistants too need to explain themselves in order to enhance transparency and trust with the learners. We present a method of self-explanation that uses introspection over a self-model of an AI social assistant. The self-model is captured as a functional model that specifies how the methods of the agent use knowledge to achieve its tasks. The process of generating self-explanations uses Chain of Thought to reflect on the self-model and ChatGPT to provide explanations about its functioning. We evaluate the self-explanation of the AI social assistant for completeness and correctness. We also report on its deployment in a live class.

arXiv Computer Science @arxiv_cs@qoto.org

Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks https://arxiv.org/abs/2501.13946 #cs.CL #cs.AI #cs.MA

arXiv Computer Science @arxiv_cs@qoto.org

A Comprehensive Survey on Integrating Large Language Models with Knowledge-Based Methods https://arxiv.org/abs/2501.13947 #cs.CL #cs.AI

A Comprehensive Survey on Integrating Large Language Models with Knowledge-Based Methods

The rapid development of artificial intelligence has led to marked progress in the field. One interesting direction for research is whether Large Language Models (LLMs) can be integrated with structured knowledge-based systems. This approach aims to combine the generative language understanding of LLMs and the precise knowledge representation systems by which they are integrated. This article surveys the relationship between LLMs and knowledge bases, looks at how they can be applied in practice, and discusses related technical, operational, and ethical challenges. Utilizing a comprehensive examination of the literature, the study both identifies important issues and assesses existing solutions. It demonstrates the merits of incorporating generative AI into structured knowledge-base systems concerning data contextualization, model accuracy, and utilization of knowledge resources. The findings give a full list of the current situation of research, point out the main gaps, and propose helpful paths to take. These insights contribute to advancing AI technologies and support their practical deployment across various sectors.

arXiv Computer Science @arxiv_cs@qoto.org

Longitudinal Abuse and Sentiment Analysis of Hollywood Movie Dialogues using LLMs https://arxiv.org/abs/2501.13948 #cs.CL #cs.AI

Longitudinal Abuse and Sentiment Analysis of Hollywood Movie Dialogues using LLMs

Over the past decades, there has been an increasing concern about the prevalence of abusive and violent content in Hollywood movies. This study uses Large Language Models (LLMs) to explore the longitudinal abuse and sentiment analysis of Hollywood Oscar and blockbuster movie dialogues from 1950 to 2024. By employing fine-tuned LLMs, we analyze subtitles for over a thousand movies categorised into four genres to examine the trends and shifts in emotional and abusive content over the past seven decades. Our findings reveal significant temporal changes in movie dialogues, which reflect broader social and cultural influences. Overall, the emotional tendencies in the films are diverse, and the detection of abusive content also exhibits significant fluctuations. The results show a gradual rise in abusive content in recent decades, reflecting social norms and regulatory policy changes. Genres such as thrillers still present a higher frequency of abusive content that emphasises the ongoing narrative role of violence and conflict. At the same time, underlying positive emotions such as humour and optimism remain prevalent in most of the movies. Furthermore, the gradual increase of abusive content in movie dialogues has been significant over the last two decades, where Oscar-nominated movies overtook the top ten blockbusters.

arXiv Computer Science @arxiv_cs@qoto.org

Can OpenAI o1 Reason Well in Ophthalmology? A 6,990-Question Head-to-Head Evaluation Study https://arxiv.org/abs/2501.13949 #cs.CL #cs.AI

Can OpenAI o1 Reason Well in Ophthalmology? A 6,990-Question Head-to-Head Evaluation Study

Question: What is the performance and reasoning ability of OpenAI o1 compared to other large language models in addressing ophthalmology-specific questions? Findings: This study evaluated OpenAI o1 and five LLMs using 6,990 ophthalmological questions from MedMCQA. O1 achieved the highest accuracy (0.88) and macro-F1 score but ranked third in reasoning capabilities based on text-generation metrics. Across subtopics, o1 ranked first in "Lens" and "Glaucoma" but second to GPT-4o in "Corneal and External Diseases", "Vitreous and Retina" and "Oculoplastic and Orbital Diseases". Subgroup analyses showed o1 performed better on queries with longer ground truth explanations. Meaning: O1's reasoning enhancements may not fully extend to ophthalmology, underscoring the need for domain-specific refinements to optimize performance in specialized fields like ophthalmology.
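For readers unfamiliar with the two headline metrics, here is a minimal sketch of how accuracy and macro-F1 are conventionally computed for multiple-choice answers. The toy labels below are invented for illustration and are not data from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference answer."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Invented answers to four multiple-choice questions.
gold = ["A", "B", "A", "C"]
pred = ["A", "B", "C", "C"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
```

Because macro-F1 averages classes without weighting by frequency, a model can score well on accuracy while doing poorly on rare answer options, which is why the two numbers are usually reported together.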

arXiv Computer Science @arxiv_cs@qoto.org

iServe: An Intent-based Serving System for LLMs https://arxiv.org/abs/2501.13111 #cs.SE #cs.LG

iServe: An Intent-based Serving System for LLMs

Large Language Models (LLMs) are becoming ubiquitous across industries, where applications demand they fulfill diverse user intents. However, developers currently face the challenge of manually exploring numerous deployment configurations - combinations of parallelism and compression techniques that impact resource usage, latency, cost, and accuracy - to meet these intents. Assessing the impact of these configurations on user metrics requires extensive, costly profiling for each model. Existing approaches avoid this expense by using fixed, static configurations, but this often leads to sub-optimal performance and higher costs. Moreover, none of these solutions dynamically adapt to changing user intents to balance latency and cost, effectively. We present iServe, an automated, intent-based system for distributed LLM inference. Instead of manually selecting deployment configurations, developers simply specify their intent - such as minimizing latency, reducing cost, or meeting specific targets for either. iServe introduces fingerprints, lightweight representations of LLMs, to efficiently estimate how different configurations impact latency and memory usage. Based on these insights and GPU availability, iServe dynamically selects the optimal configuration to align with the user's intent. For various LLMs and query arrival rates, iServe best meets user intents compared to state-of-the-art systems by reducing latency by 77.62% and SLO violations by 7.09x while improving GPU throughput by 4.72x. Moreover, iServe's fingerprint-based profiling reduces profiling cost by 6.05x (GPU-hours) compared to baselines.

arXiv Computer Science @arxiv_cs@qoto.org

Dagger Behind Smile: Fool LLMs with a Happy Ending Story https://arxiv.org/abs/2501.13115 #cs.CL #cs.AI #cs.CR

Dagger Behind Smile: Fool LLMs with a Happy Ending Story

The wide adoption of Large Language Models (LLMs) has attracted significant attention from jailbreak attacks, where adversarial prompts crafted through optimization or manual design exploit LLMs to generate malicious content. However, optimization-based attacks have limited efficiency and transferability, while existing manual designs are either easily detectable or demand intricate interactions with LLMs. In this paper, we first point out a novel perspective for jailbreak attacks: LLMs are more responsive to positive prompts. Based on this, we deploy Happy Ending Attack (HEA) to wrap up a malicious request in a scenario template involving a positive prompt formed mainly via a happy ending; it thus fools LLMs into jailbreaking either immediately or at a follow-up malicious request. This has made HEA both efficient and effective, as it requires only up to two turns to fully jailbreak LLMs. Extensive experiments show that our HEA can successfully jailbreak state-of-the-art LLMs, including GPT-4o, Llama3-70b, and Gemini-pro, and achieves an 88.79% attack success rate on average. We also provide quantitative explanations for the success of HEA.

arXiv Computer Science @arxiv_cs@qoto.org

MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking https://arxiv.org/abs/2501.13117 #cs.CL #cs.AI

MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking

Recent advancements in large language models (LLMs) have demonstrated their impressive abilities in various reasoning and decision-making tasks. However, the quality and coherence of the reasoning process can still benefit from enhanced introspection and self-reflection. In this paper, we introduce Multiplex CoT (Chain of Thought), a method that enables LLMs to simulate a form of self-review while reasoning, by initiating double Chain of Thought (CoT) thinking. Multiplex CoT leverages the power of iterative reasoning, where the model generates an initial chain of thought and subsequently critiques and refines this reasoning with a second round of thought generation. This recursive approach allows for more coherent, logical, and robust answers, improving the overall decision-making process. We demonstrate how this method can be effectively implemented using simple prompt engineering in existing LLM architectures, achieving an effect similar to that of the Learning-Refinement Model (LRM) without the need for additional training. Additionally, we present a practical guide for implementing the method in Google Colab, enabling easy integration into real-world applications.
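The two-pass idea described above can be sketched roughly as follows. This is an illustrative outline, not the authors' prompts: the prompt wording, the function name `multiplex_cot`, and the `generate` callable (a stand-in for whatever LLM API is used) are all assumptions.

```python
from typing import Callable, Dict


def multiplex_cot(question: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Run two rounds of chain-of-thought: an initial draft, then a critique pass.

    `generate` is a hypothetical text-in, text-out wrapper around an LLM.
    """
    # Round 1: ordinary chain-of-thought answer.
    first = generate(
        f"Question: {question}\nThink step by step and give an answer."
    )
    # Round 2: the model reviews its own reasoning and refines it.
    refined = generate(
        f"Question: {question}\n"
        f"Initial reasoning:\n{first}\n"
        "Critique the reasoning above, fix any errors, "
        "and give a refined final answer."
    )
    return {"initial": first, "refined": refined}


def stub_model(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return "refined answer" if "Critique" in prompt else "draft answer"


result = multiplex_cot("What is 17 * 3?", stub_model)
```

Passing the model in as a callable keeps the sketch independent of any particular provider; swapping `stub_model` for a real client call is the only change needed to try it against an actual LLM.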

arXiv Computer Science @arxiv_cs@qoto.org

Multilinguality in LLM-Designed Reward Functions for Restless Bandits: Effects on Task Performance and Fairness https://arxiv.org/abs/2501.13120 #cs.CL #cs.AI #cs.LG #cs.MA

Multilinguality in LLM-Designed Reward Functions for Restless Bandits: Effects on Task Performance and Fairness

Restless Multi-Armed Bandits (RMABs) have been successfully applied to resource allocation problems in a variety of settings, including public health. With the rapid development of powerful large language models (LLMs), they are increasingly used to design reward functions to better match human preferences. Recent work has shown that LLMs can be used to tailor automated allocation decisions to community needs using language prompts. However, this has been studied primarily for English prompts and with a focus on task performance only. This can be an issue since grassroots workers, especially in developing countries like India, prefer to work in local languages, some of which are low-resource. Further, given the nature of the problem, biases along population groups unintended by the user are also undesirable. In this work, we study the effects on both task performance and fairness when the DLM algorithm, a recent work on using LLMs to design reward functions for RMABs, is prompted with non-English language commands. Specifically, we run the model on a synthetic environment for various prompts translated into multiple languages. The prompts themselves vary in complexity. Our results show that the LLM-proposed reward functions are significantly better when prompted in English compared to other languages. We also find that the exact phrasing of the prompt impacts task performance. Further, as prompt complexity increases, performance worsens for all languages; however, it is more robust with English prompts than with lower-resource languages. On the fairness side, we find that low-resource languages and more complex prompts are both highly likely to create unfairness along unintended dimensions.

arXiv Computer Science @arxiv_cs@qoto.org

Episodic Memories Generation and Evaluation Benchmark for Large Language Models https://arxiv.org/abs/2501.13121 #cs.CL #cs.AI #cs.LG

Episodic Memories Generation and Evaluation Benchmark for Large Language Models

Episodic memory -- the ability to recall specific events grounded in time and space -- is a cornerstone of human cognition, enabling not only coherent storytelling, but also planning and decision-making. Despite their remarkable capabilities, Large Language Models (LLMs) lack a robust mechanism for episodic memory: we argue that integrating episodic memory capabilities into LLMs is essential for advancing AI towards human-like cognition, increasing their potential to reason consistently and ground their output in real-world episodic events, hence avoiding confabulations. To address this challenge, we introduce a comprehensive framework to model and evaluate LLM episodic memory capabilities. Drawing inspiration from cognitive science, we develop a structured approach to represent episodic events, encapsulating temporal and spatial contexts, involved entities, and detailed descriptions. We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. Our evaluation of state-of-the-art models, including GPT-4 and Claude variants, Llama 3.1, and o1-mini, reveals that even the most advanced LLMs struggle with episodic memory tasks, particularly when dealing with multiple related events or complex spatio-temporal relationships -- even in contexts as short as 10k-100k tokens.

arXiv Computer Science @arxiv_cs@qoto.org

Zero-Shot Verification-guided Chain of Thoughts https://arxiv.org/abs/2501.13122 #cs.CL #cs.AI

Zero-Shot Verification-guided Chain of Thoughts

Previous works have demonstrated the effectiveness of Chain-of-Thought (COT) prompts and verifiers in guiding Large Language Models (LLMs) through the space of reasoning. However, most such studies either use a fine-tuned verifier or rely on manually handcrafted few-shot examples. In contrast, in this paper, we focus on LLM-based self-verification of self-generated reasoning steps via COT prompts in a completely zero-shot regime. To explore this setting, we design a new zero-shot prompt, which we call COT STEP, to aid zero-shot decomposition of reasoning steps and design two new zero-shot prompts for LLM-based verifiers. We evaluate the verifiers' ability to classify the correctness of reasoning chains and explore different ways to use verifier scores in guiding reasoning for various mathematical and commonsense reasoning tasks with different LLMs.

arXiv Computer Science @arxiv_cs@qoto.org

Debate Helps Weak-to-Strong Generalization https://arxiv.org/abs/2501.13124 #cs.CL #cs.AI

Debate Helps Weak-to-Strong Generalization

Common methods for aligning already-capable models with desired behavior rely on the ability of humans to provide supervision. However, future superhuman models will surpass the capability of humans. Therefore, humans will only be able to weakly supervise superhuman models. This expected deficiency of human evaluation would weaken the safety of future AI systems. Scalable oversight and weak-to-strong generalization are two complementary approaches to tackle this issue. In this paper, we attempt to combine the strengths of these two approaches to further improve alignment. Specifically, we investigate ways of improving human supervision with a strong pretrained model and then supervise the strong model with enhanced weak human supervision. To make iterative empirical progress, we consider an analogy: can we use a strong model to improve weak model supervision and then use it to supervise the strong model? We empirically test it by finetuning a small weak model on ground truth labels with the additional help from a large strong model, and then finetuning the strong model on labels generated by the weak model. We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model, which provides leverage as context on samples when training a weak model. We also show that an ensemble of weak models helps exploit long arguments generated by strong model debaters and obtain a more robust supervision estimate. Extensive experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment, which indicates that debate has the potential to help weak-to-strong generalization.

arXiv Computer Science @arxiv_cs@qoto.org

Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction https://arxiv.org/abs/2501.13125 #cs.CL #cs.AI #cs.LG

Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction

In designing multiple-choice questions (MCQs) in education, creating plausible distractors is crucial for identifying students' misconceptions and gaps in knowledge and accurately assessing their understanding. However, prior studies on distractor generation have not paid sufficient attention to enhancing the difficulty of distractors, resulting in reduced effectiveness of MCQs. This study presents a pipeline for training a model to generate distractors that are more likely to be selected by students. First, we train a pairwise ranker to reason about students' misconceptions and assess the relative plausibility of two distractors. Using this model, we create a dataset of pairwise distractor ranks and then train a distractor generator via Direct Preference Optimization (DPO) to generate more plausible distractors. Experiments on computer science subjects (Python, DB, MLDL) demonstrate that our pairwise ranker effectively identifies students' potential misunderstandings and achieves ranking accuracy comparable to human experts. Furthermore, our distractor generator outperforms several baselines in generating plausible distractors and produces questions with a higher item discrimination index (DI).
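For context, the standard DPO objective for a single preference pair can be written in a few lines. The sketch below assumes the "chosen" response is the distractor the pairwise ranker judged more plausible and the "rejected" one is the less plausible distractor; the log-probability values and the `beta` default are illustrative, not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    """
    # How much more the policy favours each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) written as log(1 + exp(-z)).
    return math.log1p(math.exp(-z))


# Illustrative numbers: the policy matches the reference, so the loss is log 2.
loss_neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# The policy has moved toward the more plausible distractor: lower loss.
loss_improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss falls as the policy raises the likelihood of the preferred distractor relative to the reference model, which is what pushes the generator toward options students are more likely to pick.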

arXiv Computer Science @arxiv_cs@qoto.org

Preference Curriculum: LLMs Should Always Be Pretrained on Their Preferred Data https://arxiv.org/abs/2501.13126 #cs.CL #cs.AI

Preference Curriculum: LLMs Should Always Be Pretrained on Their Preferred Data

Large language models (LLMs) generally utilize a consistent data distribution throughout the pretraining process. However, as the model's capability improves, it is intuitive that its data preferences dynamically change, indicating the need for pretraining with different data at various training stages. To achieve this, we propose the Perplexity Difference (PD) based Preference Curriculum learning (PDPC) framework, which always perceives and uses the data preferred by LLMs to train and boost them. First, we introduce the PD metric to quantify the difference in how challenging a sample is for weak versus strong models. Samples with high PD are more challenging for weak models to learn and are more suitable to be arranged in the later stage of pretraining. Second, we propose the preference function to approximate and predict the data preference of the LLM at any training step, so as to complete the arrangement of the dataset offline and ensure continuous training without interruption. Experimental results on 1.3B and 3B models demonstrate that PDPC significantly surpasses baselines. Notably, the 3B model trained on 1T tokens achieves an increased average accuracy of over 8.1% across MMLU and CMMLU.
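A toy version of the ordering idea might look like the following. The abstract does not give the PD formula, so the relative-difference definition used here is an assumption, and a plain ascending sort stands in for the paper's preference function; the sample ids and perplexities are made up.

```python
def perplexity_difference(ppl_weak, ppl_strong):
    """Assumed PD: relative perplexity gap between a weak and a strong model."""
    return (ppl_weak - ppl_strong) / ppl_weak


def curriculum_order(samples):
    """Order samples so that low-PD data comes first and high-PD data last.

    `samples` holds (sample_id, ppl_weak, ppl_strong) tuples. Sorting is a
    simplification of the paper's preference-function scheduling.
    """
    return [
        sample_id
        for sample_id, ppl_weak, ppl_strong in sorted(
            samples, key=lambda s: perplexity_difference(s[1], s[2])
        )
    ]


# Made-up perplexities for three training samples.
toy_samples = [
    ("s1", 20.0, 10.0),  # PD = 0.50
    ("s2", 12.0, 11.0),  # PD ~ 0.083
    ("s3", 30.0, 24.0),  # PD = 0.20
]
schedule = curriculum_order(toy_samples)
```

Since both perplexities come from small fixed reference models, the whole ordering can be computed once offline, which matches the abstract's point about arranging the dataset without interrupting training.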

arXiv Computer Science @arxiv_cs@qoto.org

A Hierarchical Reinforcement Learning Framework for Multi-UAV Combat Using Leader-Follower Strategy https://arxiv.org/abs/2501.13132 #eess.SY #cs.MA #cs.AI #cs.RO #cs.SY

A Hierarchical Reinforcement Learning Framework for Multi-UAV Combat Using Leader-Follower Strategy

Multi-UAV air combat is a complex task involving multiple autonomous UAVs, an evolving field in both aerospace and artificial intelligence. This paper aims to enhance adversarial performance through collaborative strategies. Previous approaches predominantly discretize the action space into predefined actions, limiting UAV maneuverability and complex strategy implementation. Others simplify the problem to 1v1 combat, neglecting the cooperative dynamics among multiple UAVs. To address the high-dimensional challenges inherent in six-degree-of-freedom space and improve cooperation, we propose a hierarchical framework utilizing the Leader-Follower Multi-Agent Proximal Policy Optimization (LFMAPPO) strategy. Specifically, the framework is structured into three levels. The top level conducts a macro-level assessment of the environment and guides execution policy. The middle level determines the angle of the desired action. The bottom level generates precise action commands for the high-dimensional action space. Moreover, we optimize the state-value functions by assigning distinct roles with the leader-follower strategy to train the top-level policy; followers estimate the leader's utility, promoting effective cooperation among agents. Additionally, a target selector, aligned with the UAVs' posture, assesses the threat level of targets. Finally, simulation experiments validate the effectiveness of our proposed method.
