arXiv Computer Science @arxiv_cs@qoto.org

1.12K Followers

Bot

I toot the arXiv feed for topics in Computer Science.

#ComputerScience #CS #Programming #SoftwareEngineering #Software #SoftwareDevelopment #Computers #Science #arXiv #News #PeerReview

Joined Jul 2018

2 Following 1.12K Followers

Posts Posts and replies Media

arXiv Computer Science @arxiv_cs@qoto.org

Data Analytics with Differential Privacy. (arXiv:2311.16104v1 [cs.LG]) http://arxiv.org/abs/2311.16104

Data Analytics with Differential Privacy

Differential privacy is the state-of-the-art definition for privacy, guaranteeing that any analysis performed on a sensitive dataset leaks no information about the individuals whose data are contained therein. In this thesis, we develop differentially private algorithms to analyze distributed and streaming data. In the distributed model, we consider the particular problem of learning -- in a distributed fashion -- a global model of the data, that can subsequently be used for arbitrary analyses. We build upon PrivBayes, a differentially private method that approximates the high-dimensional distribution of a centralized dataset as a product of low-order distributions, utilizing a Bayesian Network model. We examine three novel approaches to learning a global Bayesian Network from distributed data, while offering the differential privacy guarantee to all local datasets. Our work includes a detailed theoretical analysis of the distributed, differentially private entropy estimator which we use in one of our algorithms, as well as a detailed experimental evaluation, using both synthetic and real-world data. In the streaming model, we focus on the problem of estimating the density of a stream of users, which expresses the fraction of all users that actually appear in the stream. We offer one of the strongest privacy guarantees for the streaming model, user-level pan-privacy, which ensures that the privacy of any user is protected, even against an adversary that observes the internal state of the algorithm. We provide a detailed analysis of an existing, sampling-based algorithm for the problem and propose two novel modifications that significantly improve it, both theoretically and experimentally, by optimally using all the allocated "privacy budget."

arXiv Computer Science @arxiv_cs@qoto.org

A Privacy-preserving Central Bank Ledger for Central Bank Digital Currency. (arXiv:2311.16105v1 [cs.CR]) http://arxiv.org/abs/2311.16105

A Privacy-preserving Central Bank Ledger for Central Bank Digital Currency

Retail central bank digital currency (rCBDC) is seen as a key upgrade of the monetary system in the 21st century. However, privacy concerns are the main impediment to rCBDC's development and roll-out. On the one hand, the rights of people to keep their transactions private should be protected, including against central bank surveillance. On the other hand, the central bank needs to ensure that no over-issuance of money or other frauds occur, demanding a certain form of knowledge of rCBDC transactions to safeguard against malicious users. This work focuses on rCBDC architectures based on the unspent transaction output (UTXO) data model and tackles the research problem of preserving a sufficient degree of privacy for UTXO transaction records while allowing the central bank to verify their correctness. User privacy is not adequately addressed in the UTXO-based rCBDC architectures. Using evolving public keys as pseudonyms to hide the real identities of users only solves the privacy issue partially. Some information could still be leaked out. This work investigates techniques to address the shortcomings of the pseudonym approach. First, a Pedersen commitment scheme is applied to hide the transaction values of a UTXO transaction while allowing the central bank to verify that no over-issuance of rCBDC has occurred in the transaction.This work uses a Schnorr signature to prove no over-issuance of money, which reduces overheads and enables a non-interactive proof. Then, Coinjoin is applied to aggregate UTXO transactions from different users into one larger UTXO transaction to obfuscate the payer-payee relationship while preserving the correctness of the amount of money flow. This work applies k-anonymity to analyse the privacy guarantee of Coinjoin. By modelling the transaction traffic by a Poisson process, the trade-off between anonymity and transaction confirmation time of Coinjoin is analysed.

arXiv Computer Science @arxiv_cs@qoto.org

Nonparametric Spatio-Temporal Joint Probabilistic Data Association Coupled Filter and Interfering Extended Target Tracking. (arXiv:2311.16106v1 [cs.CV]) http://arxiv.org/abs/2311.16106

Nonparametric Spatio-Temporal Joint Probabilistic Data Association Coupled Filter and Interfering Extended Target Tracking

Extended target tracking estimates the centroid and shape of the target in space and time. In various situations where extended target tracking is applicable, the presence of multiple targets can lead to interference, particularly when they maneuver behind one another in a sensor like a camera. Nonetheless, when dealing with multiple extended targets, there's a tendency for them to share similar shapes within a group, which can enhance their detectability. For instance, the coordinated movement of a cluster of aerial vehicles might cause radar misdetections during their convergence or divergence. Similarly, in the context of a self-driving car, lane markings might split or converge, resulting in inaccurate lane tracking detections. A well-known joint probabilistic data association coupled (JPDAC) filter can address this problem in only a single-point target tracking. A variation of JPDACF was developed by introducing a nonparametric Spatio-Temporal Joint Probabilistic Data Association Coupled Filter (ST-JPDACF) to address the problem for extended targets. Using different kernel functions, we manage the dependency of measurements in space (inside a frame) and time (between frames). Kernel functions are able to be learned using a limited number of training data. This extension can be used for tracking the shape and dynamics of nonparametric dependent extended targets in clutter when targets share measurements. The proposed algorithm was compared with other well-known supervised methods in the interfering case and achieved promising results.

arXiv Computer Science @arxiv_cs@qoto.org

An Innovative Design of Substitution Box Using Trigonometric Transformation. (arXiv:2311.16107v1 [cs.CR]) http://arxiv.org/abs/2311.16107

An Innovative Design of Substitution Box Using Trigonometric Transformation

As the number of hacking events and cyber threats keeps going up, it is getting harder and harder to communicate securely and keep personal information safe on the Internet. Cryptography is a very important way to deal with these problems because it can secure data by changing it from one form to another. In this study, we show a new, lightweight algorithm that is based on trigonometric ideas and offers a lot of security by making it less likely that cryptanalysis will work. The performance of our suggested algorithm is better than that of older methods like the Hill cipher, Blowfish, and DES. Even though traditional methods offer good security, they may have more work to do, which slows them down. The suggested algorithm tries to close this gap by offering a solution based on trigonometric ideas that are both fast and safe. The main goal of this study is to come up with a small but strong encryption algorithm that cannot be broken by cryptanalysis and keeps Internet communication safe. We want to speed up the coding process without making it less secure by using trigonometric principles. The suggested algorithm uses trigonometric functions and operations to create non-linearity and confusion, making it resistant to both differential and linear cryptanalysis. We show that the suggested algorithm is more secure and faster than traditional methods like the Hill cipher, Blowfish, and DES by doing a lot of research and testing. Combining trigonometric ideas with a simple design makes it workable for real world uses and offers a promising way to protect data on the Internet.

arXiv Computer Science @arxiv_cs@qoto.org

Picogrid: An experimental platform for prosumer microgrids. (arXiv:2311.16108v1 [cs.NI]) http://arxiv.org/abs/2311.16108

Picogrid: An experimental platform for prosumer microgrids

The Microgrid paradigm is gaining momentum as one of the key pieces of technology for expanding clean energy access and improving energy resilience. Most of the interest in this pertains to distinct entities that either generate electricity or act as loads, i.e., distinct producers and consumers. Remote community microgrids and emerging transactive energy service models with interconnected prosumers do not clearly fit into this paradigm. Notwithstanding various publications that present concepts and simulations, there has been a dearth of experimental platforms to study them, due to practical challenges. This paper presents the `Picogrid' - an experimental platform particularly designed for dc prosumer microgrids. It is a low-power, low-cost hardware platform that enables interconnecting multiple prosumer entities in a bench-top setup. Each prosumer sends data to a cloud dashboard and can receive set points for optimal operation from a remote computer system, lending itself to use in a virtual lab setup. The platform enables implementation of custom power profiles based on real-world generation and demand datasets. Features of the platform are demonstrated using simulation and experimental results.

arXiv Computer Science @arxiv_cs@qoto.org

Transfer Learning between Motor Imagery Datasets using Deep Learning -- Validation of Framework and Comparison of Datasets. (arXiv:2311.16109v1 [cs.CV]) http://arxiv.org/abs/2311.16109

Transfer Learning between Motor Imagery Datasets using Deep Learning -- Validation of Framework and Comparison of Datasets

We present a simple deep learning-based framework commonly used in computer vision and demonstrate its effectiveness for cross-dataset transfer learning in mental imagery decoding tasks that are common in the field of Brain-Computer Interfaces (BCI). We investigate, on a large selection of 12 motor-imagery datasets, which ones are well suited for transfer, both as donors and as receivers. Challenges. Deep learning models typically require long training times and are data-hungry, which impedes their use for BCI systems that have to minimize the recording time for (training) examples and are subject to constraints induced by experiments involving human subjects. A solution to both issues is transfer learning, but it comes with its own challenge, i.e., substantial data distribution shifts between datasets, subjects and even between subsequent sessions of the same subject. Approach. For every pair of pre-training (donor) and test (receiver) dataset, we first train a model on the donor before training merely an additional new linear classification layer based on a few receiver trials. Performance of this transfer approach is then tested on other trials of the receiver dataset. Significance. First, we lower the threshold to use transfer learning between motor imagery datasets: the overall framework is extremely simple and nevertheless obtains decent classification scores. Second, we demonstrate that deep learning models are a good option for motor imagery cross-dataset transfer both for the reasons outlined in the first point and because the framework presented is viable in online scenarios. Finally, analysing which datasets are best suited for transfer learning can be used as a reference for future researchers to determine which to use for pre-training or benchmarking.

arXiv Computer Science @arxiv_cs@qoto.org

Community Battery Energy Storage Systems for Enhancing Distribution System Operation: A Multi-objective Optimization Approach. (arXiv:2311.16110v1 [cs.NI]) http://arxiv.org/abs/2311.16110

Community Battery Energy Storage Systems for Enhancing Distribution System Operation: A Multi-objective Optimization Approach

The growing penetration of distributed energy resources (DERs) in distribution networks (DNs) raises new operational challenges, particularly in terms of reliability and voltage regulation. In response to these challenges, we introduce an innovative DN operation framework with multi-objective optimization, leveraging community battery energy storage systems (C-BESS). The proposed framework targets two key operational objectives: first, to minimize voltage deviation, which is a concern for a distribution network service provider (DNSP), and second, to maximize the utilization of DERs on the demand side. Recognizing the conflicting nature of these objectives, we utilize C-BESS to enhance the system's adaptability to dynamically adjust DN operations. The multi-objective optimization problem is solved using the non-dominated sorting genetic algorithm-II (NSGA-II). Case studies using real-world data are conducted to validate the effectiveness of the proposed framework. The results show significant improvements in voltage regulation and DER utilization, demonstrating the potential of C-BESS in enabling more reliable DN operation. Our findings contribute to the ongoing discourse on the role of C-BESS in DN operation enhancement and DER integration.

arXiv Computer Science @arxiv_cs@qoto.org

Deep Explainability: Spin-Geometrical Neural Meta-Structures. (arXiv:2311.16111v1 [q-bio.NC]) http://arxiv.org/abs/2311.16111

Deep Explainability: Spin-Geometrical Neural Meta-Structures

We face up to the challenge of explainability in multimodal artificial intelligence. At the nexus of neuroscience-inspired and quantum computing, interpretable and transparent spin-geometrical meta-architectures for early fusion of large-scale, heterogeneous, graph-structured data are envisioned, harnessing recent evidence for relativistic quantum neural coding of (co-)behavioral states in the self-organizing brain, under competitive, multidimensional dynamics. The designs draw on a self-dual classical description - via special Clifford-Lipschitz operations - of spinorial quantum states within registers of at most 16 qubits for efficient encoding of exponentially large neural structures. Formally 'trained', Lorentz neural architectures with precisely one lateral layer of exclusively inhibitory interneurons accounting for anti-modalities, as well as their co-architectures with intra-layer connections are highlighted. In principle, the approach accommodates the fusion of up to 16 time-invariant interconnected (anti-)modalities and the explicit recognition of underlying multidimensional patterns. Comprehensive insights are expected to be gained through applications to multimodal big data, under real-world scenarios.

arXiv Computer Science @arxiv_cs@qoto.org

Co-learning synaptic delays, weights and adaptation in spiking neural networks. (arXiv:2311.16112v1 [cs.NE]) http://arxiv.org/abs/2311.16112

Co-learning synaptic delays, weights and adaptation in spiking neural networks

Spiking neural networks (SNN) distinguish themselves from artificial neural networks (ANN) because of their inherent temporal processing and spike-based computations, enabling a power-efficient implementation in neuromorphic hardware. In this paper, we demonstrate that data processing with spiking neurons can be enhanced by co-learning the connection weights with two other biologically inspired neuronal features: 1) a set of parameters describing neuronal adaptation processes and 2) synaptic propagation delays. The former allows the spiking neuron to learn how to specifically react to incoming spikes based on its past. The trained adaptation parameters result in neuronal heterogeneity, which is found in the brain and also leads to a greater variety in available spike patterns. The latter enables to learn to explicitly correlate patterns that are temporally distanced. Synaptic delays reflect the time an action potential requires to travel from one neuron to another. We show that each of the co-learned features separately leads to an improvement over the baseline SNN and that the combination of both leads to state-of-the-art SNN results on all speech recognition datasets investigated with a simple 2-hidden layer feed-forward network. Our SNN outperforms the ANN on the neuromorpic datasets (Spiking Heidelberg Digits and Spiking Speech Commands), even with fewer trainable parameters. On the 35-class Google Speech Commands dataset, our SNN also outperforms a GRU of similar size. Our work presents brain-inspired improvements to SNN that enable them to excel over an equivalent ANN of similar size on tasks with rich temporal dynamics.

arXiv Computer Science @arxiv_cs@qoto.org

BAGEL: Backdoor Attacks against Federated Contrastive Learning. (arXiv:2311.16113v1 [cs.CR]) http://arxiv.org/abs/2311.16113

BAGEL: Backdoor Attacks against Federated Contrastive Learning

Federated Contrastive Learning (FCL) is an emerging privacy-preserving paradigm in distributed learning for unlabeled data. In FCL, distributed parties collaboratively learn a global encoder with unlabeled data, and the global encoder could be widely used as a feature extractor to build models for many downstream tasks. However, FCL is also vulnerable to many security threats (e.g., backdoor attacks) due to its distributed nature, which are seldom investigated in existing solutions. In this paper, we study the backdoor attack against FCL as a pioneer research, to illustrate how backdoor attacks on distributed local clients act on downstream tasks. Specifically, in our system, malicious clients can successfully inject a backdoor into the global encoder by uploading poisoned local updates, thus downstream models built with this global encoder will also inherit the backdoor. We also investigate how to inject backdoors into multiple downstream models, in terms of two different backdoor attacks, namely the \textit{centralized attack} and the \textit{decentralized attack}. Experiment results show that both the centralized and the decentralized attacks can inject backdoors into downstream models effectively with high attack success rates. Finally, we evaluate two defense methods against our proposed backdoor attacks in FCL, which indicates that the decentralized backdoor attack is more stealthy and harder to defend.

arXiv Computer Science @arxiv_cs@qoto.org

Cross-layer scheme for low latency multiple description video streaming over Vehicular Ad-hoc NETworks (VANETs). (arXiv:2311.13603v1 [cs.CV]) http://arxiv.org/abs/2311.13603

Cross-layer scheme for low latency multiple description video streaming over Vehicular Ad-hoc NETworks (VANETs)

There is nowadays a growing demand in vehicular communications for real-time applications requiring video assistance. The new state-of-the-art high-efficiency video coding (HEVC) standard is very promising for real-time video streaming. It offers high coding efficiency, as well as dedicated low delay coding structures. Among these, the all intra (AI) coding structure guarantees minimal coding time at the expense of higher video bitrates, which therefore penalizes transmission performances. In this work, we propose an original cross-layer system in order to enhance received video quality in vehicular communications. The system is low complex and relies on a multiple description coding (MDC) approach. It is based on an adaptive mapping mechanism applied at the IEEE 802.11p standard medium access control (MAC) layer. Simulation results in a realistic vehicular environment demonstrate that for low delay video communications, the proposed method provides significant video quality improvements on the receiver side.

arXiv Computer Science @arxiv_cs@qoto.org

Breathing Life Into Sketches Using Text-to-Video Priors. (arXiv:2311.13608v1 [cs.CV]) http://arxiv.org/abs/2311.13608

Breathing Life Into Sketches Using Text-to-Video Priors

A sketch is one of the most intuitive and versatile tools humans use to convey their ideas visually. An animated sketch opens another dimension to the expression of ideas and is widely used by designers for a variety of purposes. Animating sketches is a laborious process, requiring extensive experience and professional design skills. In this work, we present a method that automatically adds motion to a single-subject sketch (hence, "breathing life into it"), merely by providing a text prompt indicating the desired motion. The output is a short animation provided in vector representation, which can be easily edited. Our method does not require extensive training, but instead leverages the motion prior of a large pretrained text-to-video diffusion model using a score-distillation loss to guide the placement of strokes. To promote natural and smooth motion and to better preserve the sketch's appearance, we model the learned motion through two components. The first governs small local deformations and the second controls global affine transformations. Surprisingly, we find that even models that struggle to generate sketch videos on their own can still serve as a useful backbone for animating abstract representations.

arXiv Computer Science @arxiv_cs@qoto.org

An Analysis on the Effects of Evolving the Monte Carlo Tree Search Upper Confidence for Trees Selection Policy on Unimodal, Multimodal and Deceptive Landscapes. (arXiv:2311.13609v1 [cs.NE]) http://arxiv.org/abs/2311.13609

An Analysis on the Effects of Evolving the Monte Carlo Tree Search Upper Confidence for Trees Selection Policy on Unimodal, Multimodal and Deceptive Landscapes

Monte Carlo Tree Search (MCTS) is a best-first sampling method employed in the search for optimal decisions. The effectiveness of MCTS relies on the construction of its statistical tree, with the selection policy playing a crucial role. A selection policy that works particularly well in MCTS is the Upper Confidence Bounds for Trees, referred to as UCT. The research community has also put forth more sophisticated bounds aimed at enhancing MCTS performance on specific problem domains. Thus, while MCTS UCT generally performs well, there may be variants that outperform it. This has led to various efforts to evolve selection policies for use in MCTS. While all of these previous works are inspiring, none have undertaken an in-depth analysis to shed light on the circumstances in which an evolved alternative to MCTS UCT might prove advantageous. Most of these studies have focused on a single type of problem. In sharp contrast, this work explores the use of five functions of different natures, ranging from unimodal to multimodal and deceptive functions. We illustrate how the evolution of MCTS UCT can yield benefits in multimodal and deceptive scenarios, whereas MCTS UCT is robust in all of the functions used in this work.

arXiv Computer Science @arxiv_cs@qoto.org

TRIDENT: The Nonlinear Trilogy for Implicit Neural Representations. (arXiv:2311.13610v1 [cs.CV]) http://arxiv.org/abs/2311.13610

TRIDENT: The Nonlinear Trilogy for Implicit Neural Representations

Implicit neural representations (INRs) have garnered significant interest recently for their ability to model complex, high-dimensional data without explicit parameterisation. In this work, we introduce TRIDENT, a novel function for implicit neural representations characterised by a trilogy of nonlinearities. Firstly, it is designed to represent high-order features through order compactness. Secondly, TRIDENT efficiently captures frequency information, a feature called frequency compactness. Thirdly, it has the capability to represent signals or images such that most of its energy is concentrated in a limited spatial region, denoting spatial compactness. We demonstrated through extensive experiments on various inverse problems that our proposed function outperforms existing implicit neural representation functions.

arXiv Computer Science @arxiv_cs@qoto.org

Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning. (arXiv:2311.13612v1 [cs.CV]) http://arxiv.org/abs/2311.13612

Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning

Over the past year, a large body of multimodal research has emerged around zero-shot evaluation using GPT descriptors. These studies boost the zero-shot accuracy of pretrained VL models with an ensemble of label-specific text generated by GPT. A recent study, WaffleCLIP, demonstrated that similar zero-shot accuracy can be achieved with an ensemble of random descriptors. However, both zero-shot methods are un-trainable and consequently sub-optimal when some few-shot out-of-distribution (OOD) training data is available. Inspired by these prior works, we present two more flexible methods called descriptor and word soups, which do not require an LLM at test time and can leverage training data to increase OOD target accuracy. Descriptor soup greedily selects a small set of textual descriptors using generic few-shot training data, then calculates robust class embeddings using the selected descriptors. Word soup greedily assembles a chain of words in a similar manner. Compared to existing few-shot soft prompt tuning methods, word soup requires fewer parameters by construction and less GPU memory, since it does not require backpropagation. Both soups outperform current published few-shot methods, even when combined with SoTA zero-shot methods, on cross-dataset and domain generalization benchmarks. Compared with SoTA prompt and descriptor ensembling methods, such as ProDA and WaffleCLIP, word soup achieves higher OOD accuracy with fewer ensemble members. Please checkout our code: github.com/Chris210634/word_soups

arXiv Computer Science @arxiv_cs@qoto.org

Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning. (arXiv:2311.13613v1 [cs.CV]) http://arxiv.org/abs/2311.13613

Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning

Dataset pruning aims to construct a coreset capable of achieving performance comparable to the original, full dataset. Most existing dataset pruning methods rely on snapshot-based criteria to identify representative samples, often resulting in poor generalization across various pruning and cross-architecture scenarios. Recent studies have addressed this issue by expanding the scope of training dynamics considered, including factors such as forgetting event and probability change, typically using an averaging approach. However, these works struggle to integrate a broader range of training dynamics without overlooking well-generalized samples, which may not be sufficiently highlighted in an averaging manner. In this study, we propose a novel dataset pruning method termed as Temporal Dual-Depth Scoring (TDDS), to tackle this problem. TDDS utilizes a dual-depth strategy to achieve a balance between incorporating extensive training dynamics and identifying representative samples for dataset pruning. In the first depth, we estimate the series of each sample's individual contributions spanning the training progress, ensuring comprehensive integration of training dynamics. In the second depth, we focus on the variability of the sample-wise contributions identified in the first depth to highlight well-generalized samples. Extensive experiments conducted on CIFAR and ImageNet datasets verify the superiority of TDDS over previous SOTA methods. Specifically on CIFAR-100, our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.

arXiv Computer Science @arxiv_cs@qoto.org

HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data. (arXiv:2311.13614v1 [cs.CV]) http://arxiv.org/abs/2311.13614

HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data

Multi-modal Large Language Models (MLLMs) tuned on machine-generated instruction-following data have demonstrated remarkable performance in various multi-modal understanding and generation tasks. However, the hallucinations inherent in machine-generated data, which could lead to hallucinatory outputs in MLLMs, remain under-explored. This work aims to investigate various hallucinations (i.e., object, relation, attribute hallucinations) and mitigate those hallucinatory toxicities in large-scale machine-generated visual instruction datasets. Drawing on the human ability to identify factual errors, we present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm. We use our framework to identify and eliminate hallucinations in the training data automatically. Interestingly, HalluciDoctor also indicates that spurious correlations arising from long-tail object co-occurrences contribute to hallucinations. Based on that, we execute counterfactual visual instruction expansion to balance data distribution, thereby enhancing MLLMs' resistance to hallucinations. Comprehensive experiments on hallucination evaluation benchmarks show that our method successfully mitigates 44.6% hallucinations relatively and maintains competitive performance compared to LLaVA.The source code will be released at \url{https://github.com/Yuqifan1117/HalluciDoctor}.

arXiv Computer Science @arxiv_cs@qoto.org

HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation. (arXiv:2311.13615v1 [cs.CV]) http://arxiv.org/abs/2311.13615

HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation

Human pose estimation in complicated situations has always been a challenging task. Many Transformer-based pose networks have been proposed recently, achieving encouraging progress in improving performance. However, the remarkable performance of pose networks is always accompanied by heavy computation costs and large network scale. In order to deal with this problem, this paper proposes a High-Efficiency Vision Transformer for Human Pose Estimation (HEViTPose). In HEViTPose, a Cascaded Group Spatial Reduction Multi-Head Attention Module (CGSR-MHA) is proposed, which reduces the computational cost through feature grouping and spatial degradation mechanisms, while preserving feature diversity through multiple low-dimensional attention heads. Moreover, a concept of Patch Embedded Overlap Width (PEOW) is defined to help understand the relationship between the amount of overlap and local continuity. By optimising PEOW, our model gains improvements in performance, parameters and GFLOPs. Comprehensive experiments on two benchmark datasets (MPII and COCO) demonstrate that the small and large HEViTPose models are on par with state-of-the-art models while being more lightweight. Specifically, HEViTPose-B achieves 90.7 PCK@0.5 on the MPII test set and 72.6 AP on the COCO test-dev2017 set. Compared with HRNet-W32 and Swin-S, our HEViTPose-B significantly reducing Params ($\downarrow$62.1%,$\downarrow$80.4%,) and GFLOPs ($\downarrow$43.4%,$\downarrow$63.8%,). Code and models are available at \url{here}.

arXiv Computer Science @arxiv_cs@qoto.org

Online Video Quality Enhancement with Spatial-Temporal Look-up Tables. (arXiv:2311.13616v1 [eess.IV]) http://arxiv.org/abs/2311.13616

Online Video Quality Enhancement with Spatial-Temporal Look-up Tables

Low latency rates are crucial for online video-based applications, such as video conferencing and cloud gaming, which make improving video quality in online scenarios increasingly important. However, existing quality enhancement methods are limited by slow inference speed and the requirement for temporal information contained in future frames, making it challenging to deploy them directly in online tasks. In this paper, we propose a novel method, STLVQE, specifically designed to address the rarely studied online video quality enhancement (Online-VQE) problem. Our STLVQE designs a new VQE framework which contains a Module-Agnostic Feature Extractor that greatly reduces the redundant computations and redesign the propagation, alignment, and enhancement module of the network. A Spatial-Temporal Look-up Tables (STL) is proposed, which extracts spatial-temporal information in videos while saving substantial inference time. To the best of our knowledge, we are the first to exploit the LUT structure to extract temporal information in video tasks. Extensive experiments on the MFQE 2.0 dataset demonstrate that our STLVQE achieves a satisfactory performance-speed trade-off.

arXiv Computer Science @arxiv_cs@qoto.org

Boosting3D: High-Fidelity Image-to-3D by Boosting 2D Diffusion Prior to 3D Prior with Progressive Learning. (arXiv:2311.13617v1 [cs.CV]) http://arxiv.org/abs/2311.13617

Boosting3D: High-Fidelity Image-to-3D by Boosting 2D Diffusion Prior to 3D Prior with Progressive Learning

We present Boosting3D, a multi-stage single image-to-3D generation method that can robustly generate reasonable 3D objects in different data domains. The point of this work is to solve the view consistency problem in single image-guided 3D generation by modeling a reasonable geometric structure. For this purpose, we propose to utilize better 3D prior to training the NeRF. More specifically, we train an object-level LoRA for the target object using original image and the rendering output of NeRF. And then we train the LoRA and NeRF using a progressive training strategy. The LoRA and NeRF will boost each other while training. After the progressive training, the LoRA learns the 3D information of the generated object and eventually turns to an object-level 3D prior. In the final stage, we extract the mesh from the trained NeRF and use the trained LoRA to optimize the structure and appearance of the mesh. The experiments demonstrate the effectiveness of the proposed method. Boosting3D learns object-specific 3D prior which is beyond the ability of pre-trained diffusion priors and achieves state-of-the-art performance in the single image-to-3d generation task.

Bot

I toot the arXiv feed for topics in Computer Science.

#ComputerScience #CS #Programming #SoftwareEngineering #Software #SoftwareDevelopment #Computers #Science #arXiv #News #PeerReview

Joined Jul 2018