Software Engineering
See recent articles
Showing new listings for Thursday, 26 February 2026
- [1] arXiv:2602.21251 [pdf, html, other]
-
Title: AgenticTyper: Automated Typing of Legacy Software Projects Using Agentic AIComments: Accepted at ICSE 2026 Student Research Competition (SRC)Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Programming Languages (cs.PL)
Legacy JavaScript systems lack type safety, making maintenance risky. While TypeScript can help, manually adding types is expensive. Previous automated typing research focuses on type inference but rarely addresses type checking setup, definition generation, bug identification, or behavioral correctness at repository scale. We present AgenticTyper, a Large Language Model (LLM)-based agentic system that addresses these gaps through iterative error correction and behavior preservation via transpilation comparison. Evaluation on two proprietary repositories (81K LOC) shows that AgenticTyper resolves all 633 initial type errors in 20 minutes, reducing manual effort from one working day.
- [2] arXiv:2602.21568 [pdf, html, other]
-
Title: From Ad-Hoc Scripts to Orchestrated Pipelines: Architecting a Resilient ELT Framework for Developer Productivity MetricsSubjects: Software Engineering (cs.SE)
Developer Productivity Dashboards are essential for visualizing DevOps performance metrics such as Deployment Frequency and Change Failure Rate (DORA). However, the utility of these dashboards is frequently undermined by data reliability issues. In early iterations of our platform, ad-hoc ingestion scripts (Cron jobs) led to "silent failures," where data gaps went undetected for days, eroding organizational trust. This paper reports on our experience migrating from legacy scheduling to a robust Extract-Load-Transform (ELT) pipeline using Directed Acyclic Graph (DAG) orchestration and Medallion Architecture. We detail the operational benefits of decoupling data extraction from transformation, the necessity of immutable raw history for metric redefinition, and the implementation of state-based dependency management. Our experience suggests that treating the metrics pipeline as a production-grade distributed system is a prerequisite for sustainable engineering analytics.
- [3] arXiv:2602.21611 [pdf, html, other]
-
Title: Structurally Aligned Subtask-Level Memory for Software Engineering AgentsSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents. Recent work has further explored augmenting these agents with memory mechanisms to support long-horizon reasoning. However, these approaches typically operate at a coarse instance granularity, treating the entire problem-solving episode as the atomic unit of storage and retrieval. We empirically demonstrate that instance-level memory suffers from a fundamental granularity mismatch, resulting in misguided retrieval when tasks with similar surface descriptions require distinct reasoning logic at specific stages. To address this, we propose Structurally Aligned Subtask-Level Memory, a method that aligns memory storage, retrieval, and updating with the agent's functional decomposition. Extensive experiments on SWE-bench Verified demonstrate that our method consistently outperforms both vanilla agents and strong instance-level memory baselines across diverse backbones, improving mean Pass@1 over the vanilla agent by +4.7 pp on average (e.g., +6.8 pp on Gemini 2.5 Pro). Performance gains grow with more interaction steps, showing that leveraging past experience benefits long-horizon reasoning in complex software engineering tasks.
- [4] arXiv:2602.21641 [pdf, html, other]
-
Title: Uncertainty Modeling for SysML v2Subjects: Software Engineering (cs.SE)
Uncertainty is inherent in modern engineered systems, including cyber-physical systems, autonomous systems, and large-scale software-intensive infrastructures (such as microservice-based systems) operating in dynamic and partially observable environments. The recent publication of Precise Semantics for Uncertainty Modeling (PSUM) by the Object Management Group represents the first standardized specification for uncertainty modeling within the Model-Based Systems Engineering (MBSE) community, providing formally defined semantics for representing and reasoning about uncertainty in models. In parallel, the second version of Systems Modeling Language (SysML v2) was released as the next-generation systems modeling language, offering improved semantic rigor and reusability, yet lacking native constructs aligned with PSUM for first-class uncertainty representation. This paper proposes a systematic extension of SysML v2 that incorporates the PSUM metamodel into its modeling framework. The extension enables explicit specification of indeterminacy sources, structured characterization of uncertainties, and consistent propagation of uncertainty within system models, while preserving conformance with SysML v2 syntax and semantics. We validate the approach through seven case studies. Results demonstrate that the proposed extension (PSUM-SysMLv2) is expressive and applicable for uncertainty-aware MBSE, and potentially enables uncertainty and uncertainty propagation analyses.
- [5] arXiv:2602.21681 [pdf, html, other]
-
Title: AkiraRust: Re-thinking LLM-aided Rust Repair Using a Feedback-guided Thinking SwitchRenshuang Jiang, Yichong Wang, Pan Dong, Xiaoxiang Fang, Zhenling Duan, Tinglue Wang, Yuchen Hu, Jie Yu, Zhe JiangComments: 7 pages, 11 figures, accepted to DACSubjects: Software Engineering (cs.SE)
Eliminating undefined behaviors (UBs) in Rust programs requires a deep semantic understanding to enable accurate and reliable repair. While existing studies have demonstrated the potential of LLMs to support Rust code analysis and repair, most frameworks remain constrained by inflexible templates or lack grounding in executable semantics, resulting in limited contextual awareness and semantic incorrectness. Here, we present AkiraRust, an LLM-driven repair and verification framework that incorporates a finite-state machine to dynamically adapt its detection and repair flow to runtime semantic conditions. AkiraRust introduces a dual-mode reasoning strategy that coordinates fast and slow thinking across multiple agents. Each agent is mapped to an FSM state, and a waveform-driven transition controller manages state switching, rollback decisions, and semantic check pointing, enabling context-aware and runtime-adaptive repair. Experimental results show that AkiraRust achieves about 92% semantic correctness and delivers a 2.2x average speedup compared to SOTA.
- [6] arXiv:2602.21697 [pdf, html, other]
-
Title: EditFlow: Benchmarking and Optimizing Code Edit Recommendation Systems via Reconstruction of Developer FlowsComments: Accepted at OOPSLA 2026 (Proc. ACM Program. Lang., Vol. 10, OOPSLA1)Subjects: Software Engineering (cs.SE)
Large language models (LLMs) for code editing have achieved remarkable progress, yet recent empirical studies reveal a fundamental disconnect between technical accuracy and developer productivity. Despite their strong benchmark performance, developers complete tasks 19% slower when using AI assistance, with over 68.81% of recommendations disrupting their mental flow. This misalignment stems from the use of static commit snapshots that lack temporal information, causing models to optimize for end results rather than the incremental, context-sensitive steps that align with developers' natural reasoning process.
To bridge this gap, we present EditFlow, which benchmarks and optimizes subsequent code edit recommendation systems through the reconstruction of developer editing flows. EditFlow addresses three key challenges. First, collecting edit-order data that reflects developers' flow is inherently difficult: manual annotation introduces prohibitive overhead, while development logs capture only single trajectories instead of all plausible editing flows. Second, benchmarking recommendation performance against developers' ongoing editing flow requires a digital-twin-like simulation that can faithfully simulate the editing process. Third, existing heterogeneous systems vary drastically in scale and architecture, posing challenges for developing a unified optimization strategy that endows all models with mental-flow awareness regardless of design or capability.
...... - [7] arXiv:2602.21734 [pdf, html, other]
-
Title: Proto-ML: An IDE for ML Solution PrototypingComments: To be published at 3rd International Workshop on Integrated Development Environments (IDE '26), April 12--18, 2026, Rio de Janeiro, BrazilSubjects: Software Engineering (cs.SE)
Prototyping plays a critical role in the development of machine learning (ML) solutions, yet existing tools often provide limited support for effective collaboration and knowledge reuse among stakeholders. This paper introduces Proto-ML, an IDE designed to strengthen ML prototyping workflows. By addressing key deficiencies such as insufficient stakeholder involvement, limited cross-project knowledge reuse, and fragmented tool support, Proto-ML offers a unified framework that enables structured documentation of prototyping activities and promotes knowledge sharing across projects.
The Proto-ML IDE consists of three extension bundles: prototype implementation, analysis, and knowledge management. These extensions support tasks ranging from evaluating prototype quality against defined criteria to incorporating stakeholder perspectives throughout the development process. Preliminary user feedback suggests that Proto-ML can increase prototyping efficiency and foster more transparent and reusable ML solution development. - [8] arXiv:2602.21800 [pdf, html, other]
-
Title: An Evaluation of Context Length Extrapolation in Long Code via Positional Embeddings and Efficient AttentionSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
The rapid advancement of large language models (LLMs) has led to a significant increase in automated tools in the software engineering, capable of performing various code-related tasks such as code generation, completion, and translation. Despite these advancements, its effectiveness is constrained by fixed context lengths, limiting its ability to generalize across long, domain-specific code sequences. To address this challenge, we investigate zero-shot, inference-only methods aimed at improving position encodings and optimizing attention mechanisms. Our goal is to provide a thorough analysis of current approaches that facilitate context length extrapolation in code, particularly in the context of long code completion tasks.
- [9] arXiv:2602.21806 [pdf, html, other]
-
Title: An Empirical Study of Bugs in Modern LLM Agent FrameworksSubjects: Software Engineering (cs.SE)
LLM agents have been widely adopted in real-world applications, relying on agent frameworks for workflow execution and multi-agent coordination. As these systems scale, understanding bugs in the underlying agent frameworks becomes critical. However, existing work mainly focuses on agent-level failures, overlooking framework-level bugs. To address this gap, we conduct an empirical study of 998 bug reports from CrewAI and LangChain, constructing a taxonomy of 15 root causes and 7 observable symptoms across five agent lifecycle stages: 'Agent Initialization','Perception', 'Self-Action', 'Mutual Interaction' and 'Evolution'. Our findings show that agent framework bugs mainly arise from 'API misuse', 'API incompatibility', and 'Documentation Desync', largely concentrated in the 'Self-Action' stage. Symptoms typically appear as 'Functional Error', 'Crash', and 'Build Failure', reflecting disruptions to task progression and control flow.
- [10] arXiv:2602.21833 [pdf, html, other]
-
Title: From Restructuring to Stabilization: A Large-Scale Experiment on Iterative Code Readability Refactoring with Large Language ModelsSubjects: Software Engineering (cs.SE)
Large language models (LLMs) are increasingly used for automated code refactoring tasks. Although these models can quickly refactor code, the quality may exhibit inconsistencies and unpredictable behavior. In this article, we systematically study the capabilities of LLMs for code refactoring with a specific focus on improving code readability.
We conducted a large-scale experiment using GPT5.1 with 230 Java snippets, each systematically varied and refactored regarding code readability across five iterations under three different prompting strategies. We categorized fine-grained code changes during the refactoring into implementation, syntactic, and comment-level transformations. Subsequently, we investigated the functional correctness and tested the robustness of the results with novel snippets.
Our results reveal three main insights: First, iterative code refactoring exhibits an initial phase of restructuring followed by stabilization. This convergence tendency suggests that LLMs possess an internalized understanding of an "optimally readable" version of code. Second, convergence patterns are fairly robust across different code variants. Third, explicit prompting toward specific readability factors slightly influences the refactoring dynamics.
These insights provide an empirical foundation for assessing the reliability of LLM-assisted code refactoring, which opens pathways for future research, including comparative analyses across models and a systematic evaluation of additional software quality dimensions in LLM-refactored code. - [11] arXiv:2602.21997 [pdf, html, other]
-
Title: Enhancing LLM-Based Test Generation by Eliminating Covered CodeComments: 9 pages, 4 figures, supplementary material includedSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Automated test generation is essential for software quality assurance, with coverage rate serving as a key metric to ensure thorough testing. Recent advancements in Large Language Models (LLMs) have shown promise in improving test generation, particularly in achieving higher coverage. However, while existing LLM-based test generation solutions perform well on small, isolated code snippets, they struggle when applied to complex methods under test. To address these issues, we propose a scalable LLM-based unit test generation method. Our approach consists of two key steps. The first step is context information retrieval, which uses both LLMs and static analysis to gather relevant contextual information associated with the complex methods under test. The second step, iterative test generation with code elimination, repeatedly generates unit tests for the code slice, tracks the achieved coverage, and selectively removes code segments that have already been covered. This process simplifies the testing task and mitigates issues arising from token limits or reduced reasoning effectiveness associated with excessively long contexts. Through comprehensive evaluations on open-source projects, our approach outperforms state-of-the-art LLM-based and search-based methods, demonstrating its effectiveness in achieving high coverage on complex methods.
- [12] arXiv:2602.22020 [pdf, html, other]
-
Title: Detecting UX smells in Visual Studio Code using LLMsComments: 4 pages, 2 figures, 1 table, 3rd International Workshop on Integrated Development Environments (IDE 2026)Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Integrated Development Environments shape developers' daily experience, yet the empirical study of their usability and user experience (UX) remains limited. This work presents an LLM-assisted approach to detecting UX smells in Visual Studio Code by mining and classifying user-reported issues from the GitHub repository. Using a validated taxonomy and expert review, we identified recurring UX problems that affect the developer experience. Our results show that the majority of UX smells are concentrated in informativeness, clarity, intuitiveness, and efficiency, qualities that developers value most.
- [13] arXiv:2602.22076 [pdf, other]
-
Title: Visual Milestone Planning in a Hybrid Development ContextComments: 15 pages, Presented at QUATIC 2023Journal-ref: Visual Milestone Planning in a Hybrid Development Context. In: Fernandes, J.M., Travassos, G.H., Lenarduzzi, V., Li, X. (eds) Quality of Information and Communications Technology, 2023Subjects: Software Engineering (cs.SE)
This paper explains the Visual Milestone Planning (VMP) method using an agile vocabulary to facilitate its adoption by agile practitioners as a front end for a hybrid development process. VMP is a visual and collaborative planning approach which promotes a shared understanding of the work approach and commitment through the direct manipulation by team members of the reified planning constructs involved in the development of the plan. Once the product backlog has been established and relevant milestones identified, a novel construct called the milestone planning matrix is used to document the allocation of product backlog items to milestones. The milestones due dates are later determined by grouping sticky notes representing the work to be performed into time-boxes called work packages and accommodating them on a resource and time scaled scheduling canvas very much as it would be done in a Tetris game.
- [14] arXiv:2602.22124 [pdf, html, other]
-
Title: SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering AgentsPatrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt, Zijian Wang, John Yang, Samuel ThompsonSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Small language models (SLMs) offer compelling advantages in cost, latency, and adaptability, but have so far lagged behind larger models on long-horizon software engineering tasks such as SWE-bench, where they suffer from pervasive action looping and low resolution rates. We introduce SWE-Protégé, a post-training framework that reframes software repair as an expert-protégé collaboration problem. In SWE-Protégé, an SLM remains the sole decision-maker while learning to selectively seek guidance from a strong expert model, recognize stalled states, and follow through on expert feedback. Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration. We lightly post-train Qwen2.5-Coder-7B-Instruct to achieve 42.4% Pass@1 on SWE-bench Verified, a +25.4% improvement over the prior SLM state of the art, while using expert assistance sparsely (~4 calls per task and 11% of total tokens).
New submissions (showing 14 of 14 entries)
- [15] arXiv:2602.21253 (cross-list from quant-ph) [pdf, html, other]
-
Title: A Physics-Informed Neuro-Fuzzy Framework for Quantum Error AttributionSubjects: Quantum Physics (quant-ph); Software Engineering (cs.SE)
As quantum processors scale beyond 100 qubits, distinguishing software bugs from stochastic hardware noise becomes a critical diagnostic challenge. We present a neuro-fuzzy framework that addresses this attribution problem by combining Adaptive Neuro-Fuzzy Inference Systems (ANFIS) with physics-grounded feature engineering. We introduce the Bhattacharyya Veto, a hard physical constraint grounded in the Data Processing Inequality that prevents the classifier from attributing topologically impossible output distributions to noise. Validated on IBM's 156-qubit Heron r2 processor (ibm_fez) across 105 circuits spanning 17 algorithm families, the framework achieves 89.5% effective accuracy (+/- 5.9% CI). The system implements a safe failure mode, flagging 14.3% of ambiguous cases for manual review rather than forcing low-confidence predictions. We resolve key ambiguities -- such as distinguishing correct Grover amplification from bug-induced collapse -- and identify fundamental limits of single-basis diagnostics, including a Z-basis blind spot where phase-flip errors remain statistically invisible. This work establishes a robust, interpretable diagnostic layer that prevents error mitigation techniques from being applied to logically flawed circuits.
- [16] arXiv:2602.21265 (cross-list from cs.CL) [pdf, html, other]
-
Title: ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool ReasoningComments: Conference : Submitted to ICML 2026. 8 pages (+ abstract 16 pages), 5 figuresSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
We introduce \ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution. It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability. \ToolMATH provides actionable diagnostic evidence of failure modes in tool-augmented agents, helping identify the control mechanisms required for robustness. \ToolMATH roughly contains 8k questions and 12k tools; we provide an additional hard-set \ToolMATHHard with questions and tools. Our evaluation reveals that the key failure factor is due to the inability to reason, leading to the accumulation of intermediate results' errors and constrain later decisions. Tool-list redundancy do not simply add noise, but amplify small early deviations into irreversible execution drift. The benchmark highlights that when the intended capability is missing, distractor tools can sometimes serve as partial substitutes in solution paths, yet they can also mislead models into ungrounded tool trajectories. Finally, comparisons between tool-use protocols emphasize that improvements come less from local action selection and more from long-range plan coherence and disciplined use of observations.
Cross submissions (showing 2 of 2 entries)
- [17] arXiv:2506.08980 (replaced) [pdf, html, other]
-
Title: Towards Better Code Generation: Adaptive Decoding with Uncertainty GuidanceComments: 21 pages, 7 figuresSubjects: Software Engineering (cs.SE)
The success of code synthesis using large language models (LLMs) depends heavily on navigating critical decision points during the decoding process. Standard uniform strategies, such as greedy decoding, often fall short because they fail to distinguish between deterministic steps and those characterized by high logical ambiguity. Our empirical analysis identifies a recurring failure mode: "logic drift" caused by the model's inability to correctly rank viable candidates during high-uncertainty intervals, even when the ground-truth token is available.
To resolve this, we present AdaDec, a framework that introduces a selective pause-then-rerank mechanism into the decoding pipeline. Unlike static methods, AdaDec utilizes learned, model-specific entropy thresholds to identify when the model is "confused" and dynamically triggers a lookahead-based evaluation to re-score candidate tokens.
Across benchmarks including HumanEval+, MBPP+, and DevEval, AdaDec achieves significant performance breakthroughs, boosting Pass@1 accuracy by up to 20.9% absolute over greedy decoding. The framework not only surpasses traditional Beam Search and specialized methods like AdapT in terms of reliability but also maintains high inference efficiency by intervening only at the most consequential steps. These results suggest that uncertainty-aware adaptive strategies are key to making LLM-driven code generation both robust and practical. - [18] arXiv:2506.10056 (replaced) [pdf, html, other]
-
Title: Pareto Optimal Code GenerationComments: 29 pages, 6 figures, code released here: this https URLSubjects: Software Engineering (cs.SE); Programming Languages (cs.PL)
Generate-then-rank is the dominant test-time scaling (TTS) paradigm for code generation, but scaling accuracy by sampling and executing more candidates makes comprehensive verification a major computational bottleneck. This creates an inherent trade-off between accuracy and compute that, despite its importance to TTS, is often ignored. Specifically, faster but noisier signals, such as outcome reward models (ORMs), are dismissed as suboptimal. We frame verifier selection as a Pareto optimization problem and empirically map the accuracy-throughput frontier across signals, including the full test suite, heuristics for selective execution, and ORMs, across four Python benchmarks. We show that ORMs are most effective at optimizing the Pareto curve when pruning is used in the generate-then-rank pipeline--known as staged verification--where lightweight filters remove obviously incorrect solutions, including candidates with small syntactic or character-level bugs, before expensive verification. Our pruning analysis shows that eliminating incorrect yet highly ranked candidates (often character-level bugs) prevents wasted compute on incorrect tokens. We find that ORMs with staged verification shift the Pareto frontier outward, achieving 11.64x higher throughput at a cost of 8.26% accuracy relative to full test-suite verification.
- [19] arXiv:2507.02376 (replaced) [pdf, html, other]
-
Title: On the Inference (In-)Security of Vertical Federated Learning: Efficient Auditing against Inference Tampering AttackSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Vertical Federated Learning (VFL) is an emerging distributed learning paradigm for cross-silo collaboration without accessing participants' data. However, existing VFL work lacks a mechanism to audit the inference correctness of the data party. The malicious data party can modify the local data and model to mislead the joint inference results. To exploit this vulnerability, we design a novel Vertical Federated Inference Tampering (VeFIT) attack, allowing the data party to covertly tamper with the local inference and mislead results on the task party's final prediction. VeFIT can decrease the task party's inference accuracy by an average of 34.49%. Existing defense mechanisms can not effectively detect this attack, and the detection performance is near random guessing. To mitigate the attack, we further design a Vertical Federated Inference Auditing (VeFIA) framework. VeFIA helps the task party to audit whether the data party's inferences are executed as expected during large-scale online inference. VeFIA does not leak the data party's privacy nor introduce additional latency. The core design is that the task party can use the inference results from a framework with Trusted Execution Environments (TEE) and the coordinator to validate the correctness of the data party's computation results. VeFIA guarantees that, as long as the proportion of inferences attacked by VeFIT exceeds 5.4%, the task party can detect the malicious behavior of the data party with a probability of 99.99%, without any additional online overhead. VeFIA's random sampling validation of VeFIA achieves 100% positive predictive value, negative predictive value, and true positive rate in detecting VeFIT. We further validate VeFIA's effectiveness in terms of privacy protection and scalability on real-world datasets. To the best of our knowledge, this is the first paper discussing the inference auditing problem towards VFL.
- [20] arXiv:2507.17691 (replaced) [pdf, html, other]
-
Title: CASCADE: LLM-Powered JavaScript Deobfuscator at GoogleComments: To appear in ICSE-SEIP 2026Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Programming Languages (cs.PL)
Software obfuscation, particularly prevalent in JavaScript, hinders code comprehension and analysis, posing significant challenges to software testing, static analysis, and malware detection. This paper introduces CASCADE, a novel hybrid approach that integrates the advanced coding capabilities of Gemini with the deterministic transformation capabilities of a compiler Intermediate Representation (IR), specifically JavaScript IR (JSIR). By employing Gemini to identify critical prelude functions, the foundational components underlying the most prevalent obfuscation techniques, and leveraging JSIR for subsequent code transformations, CASCADE effectively recovers semantic elements like original strings and API names, and reveals original program behaviors. This method overcomes limitations of existing static and dynamic deobfuscation techniques, eliminating hundreds to thousands of hardcoded rules while achieving reliability and flexibility. CASCADE is already deployed in Google's production environment, demonstrating substantial improvements in JavaScript deobfuscation efficiency and reducing reverse engineering efforts.
- [21] arXiv:2509.11787 (replaced) [pdf, html, other]
-
Title: CodeCureAgent: Automatic Classification and Repair of Static Analysis WarningsSubjects: Software Engineering (cs.SE); Multiagent Systems (cs.MA)
Static analysis tools are widely used to detect bugs, vulnerabilities, and code smells. Traditionally, developers must resolve these warnings manually. Because this process is tedious, developers sometimes ignore warnings, leading to an accumulation of warnings and a degradation of code quality. This paper presents CodeCureAgent, an approach that harnesses LLM-based agents to automatically analyze, classify, and repair static analysis warnings. Unlike previous work, our method does not follow a predetermined algorithm. Instead, we adopt an agentic framework that iteratively invokes tools to gather additional information from the codebase (e.g., via code search) and edit the codebase to resolve the warning. CodeCureAgent detects and suppresses false positives, while fixing true positives when identified. We equip CodeCureAgent with a three-step heuristic to approve patches: (1) build the project, (2) verify that the warning disappears without introducing new warnings, and (3) run the test suite. We evaluate CodeCureAgent on a dataset of 1,000 SonarQube warnings found in 106 Java projects and covering 291 distinct rules. Our approach produces plausible fixes for 96.8% of the warnings, outperforming state-of-the-art baseline approaches by 29.2%-34.0% in plausible-fix rate. Manual inspection of 291 cases reveals a correct-fix rate of 86.3%, showing that CodeCureAgent can reliably repair static analysis warnings. The approach incurs LLM costs of about 2.9 cents (USD) and an end-to-end processing time of about four minutes per warning. We envision CodeCureAgent helping to clean existing codebases and being integrated into CI/CD pipelines to prevent the accumulation of static analysis warnings.
- [22] arXiv:2602.20292 (replaced) [pdf, other]
-
Title: Quantifying the Expectation-Realisation Gap for Agentic AI SystemsComments: 10 pages, no figures; added glossarySubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Agentic AI systems are deployed with expectations of substantial productivity gains, yet rigorous empirical evidence reveals systematic discrepancies between pre-deployment expectations and post-deployment outcomes. We review controlled trials and independent validations across software engineering, clinical documentation, and clinical decision support to quantify this expectation-realisation gap. In software development, experienced developers expected a 24% speedup from AI tools but were slowed by 19% -- a 43 percentage-point calibration error. In clinical documentation, vendor claims of multi-minute time savings contrast with measured reductions of less than one minute per note, and one widely deployed tool showed no statistically significant effect. In clinical decision support, externally validated performance falls substantially below developer-reported metrics. These shortfalls are driven by workflow integration friction, verification burden, measurement construct mismatches, and systematic variation in who benefits and who does not. The evidence motivates structured planning frameworks that require explicit, quantified benefit expectations with human oversight costs factored in.
- [23] arXiv:2602.20610 (replaced) [pdf, other]
-
Title: SpecMind: Cognitively Inspired, Interactive Multi-Turn Framework for Postcondition InferenceSubjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Specifications are vital for ensuring program correctness, yet writing them manually remains challenging and time-intensive. Recent large language model (LLM)-based methods have shown successes in generating specifications such as postconditions, but existing single-pass prompting often yields inaccurate results. In this paper, we present SpecMind, a novel framework for postcondition generation that treats LLMs as interactive and exploratory reasoners rather than one-shot generators. SpecMind employs feedback-driven multi-turn prompting approaches, enabling the model to iteratively refine candidate postconditions by incorporating implicit and explicit correctness feedback, while autonomously deciding when to stop. This process fosters deeper code comprehension and improves alignment with true program behavior via exploratory attempts. Our empirical evaluation shows that SpecMind significantly outperforms state-of-the-art approaches in both accuracy and completeness of generated postconditions.
- [24] arXiv:2406.11935 (replaced) [pdf, html, other]
-
Title: A Problem-Oriented Perspective and Anchor Verification for Code OptimizationComments: ICLR 2026Subjects: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Large Language Models (LLMs) have shown remarkable capabilities in solving various programming tasks, such as code generation. However, their potential for code optimization, particularly in performance enhancement, remains largely unexplored. This paper investigates the capabilities of LLMs in optimizing code for minimal execution time, addressing a critical gap in current research. The recently proposed code optimization methods construct program optimization pairs based on iterative submissions from the same programmer for the same problem. However, this approach confines LLMs to local performance improvements, neglecting global algorithmic innovation. To overcome this limitation, we adopt a completely different perspective by reconstructing the optimization pairs into a problem-oriented approach. This allows for the integration of various ideas from multiple programmers tackling the same problem. Furthermore, we observe that code optimization presents greater challenges compared to code generation, often accompanied by "optimization tax". Recognizing the inherent trade-offs in correctness and efficiency, we introduce a novel anchor verification framework to mitigate this "optimization tax". Ultimately, the problem oriented perspective combined with the anchor verification framework significantly enhances both the correct optimization ratio and speedup to new levels.
- [25] arXiv:2602.16291 (replaced) [pdf, html, other]
-
Title: A Calculus of InheritanceSubjects: Programming Languages (cs.PL); Software Engineering (cs.SE)
Just as the $\lambda$-calculus uses three primitives (abstraction, application, variable) as the foundation of functional programming, inheritance-calculus uses three primitives (record, definition, inheritance) as the foundation of declarative programming. It trivially embeds the $\lambda$-calculus, although the entire semantics rests solely on naive set theory; as a consequence, all constructs including inheritance are inherently commutative, idempotent, and associative; the linearization problem of multiple inheritance does not arise. This induces a fully abstract semantics of the lazy $\lambda$-calculus with respect to Böhm tree equivalence~\cite{barendregt1984lambda}. Inheritance-calculus is distilled from MIXINv2, a practical implementation in which we observed further emergent phenomena: the same code acts as different function colors~\cite{nystrom2015color}; ordinary arithmetic yields the relational semantics of logic programming~\cite{vanemden1976semantics}; self-reference resolves to multiple targets; and programs are immune to the Expression Problem~\cite{wadler1998expression}. This makes inheritance-calculus strictly more expressive than the $\lambda$-calculus in both common sense and Felleisen's sense~\cite{felleisen1991expressive}. These properties suggest applications to configuration languages, dependency injection, object-oriented programming, composable effect systems, modular software architectures, file-system-as-compiler, general-purpose programming, and no-code development.