The Hitchhiker's Guide to Program Analysis, Part III: Mostly Harmless LLMs
Abstract.
LLMs are increasingly used in bug analysis to reason about code and judge whether a potential bug can be triggered in realistic execution contexts, with recent work showing promising empirical results. However, empirical effectiveness does not make a plausible model-generated rationale sufficient for discharging warnings. This distinction is especially important for no-bug decisions: dismissing a report or warning requires establishing that the reported error state is unreachable in the program context being analyzed, not merely offering a plausible explanation for why it may not occur. We argue that program-behavior reasoning should be grounded in formal analysis, rather than performed directly by LLMs.
We present Evident, a bug analysis system that separates LLM assistance from program-behavior reasoning, delegating the latter to backend analysis. Given a warning specifying the reported location and data flow, Evident uses an LLM only to construct a warning-specific analysis harness. Evident then validates the harness before invoking the backend. The backend performs the harness-relative check: whether the reported error state is unreachable under the constructed harness and its assumptions. We evaluate Evident on 200 real Android kernel driver warnings from two existing static detectors. Evident correctly classifies 151 cases (76%), including discharging 111 false alarms, without discharging any confirmed bug in the dataset; the remaining cases are either unresolved or conservatively retained as potential bugs. Evident also rediscovers a confirmed vulnerability overlooked by both prior LLM-based filtering and manual triage.
1. Introduction
Static analysis tools are widely used to find bugs in large code bases such as the Linux kernel, but they report many false positives, and recent work therefore uses LLMs to filter these warnings (Du et al., 2026; Li et al., 2024, 2025a; Lekssays et al., 2025). The task fits LLMs well on the surface: whether a warning is a real bug may require checking API usage, possible reaching values, caller-side checks, and project-specific idioms. Recent systems report promising accuracy when labeling warnings as true bugs or false positives.
The broad use of LLMs in warning triage makes the basis for no-bug decisions unclear. A model may give a plausible explanation of program behavior, but that explanation in natural language does not show how the program actually executes. Finding a bug is existential: one feasible execution reaching the error state is enough, and the claim can be confirmed by exhibiting that execution. Discharging a warning as a false alarm is universal: the error state must be unreachable in the program context being analyzed. Existing evaluations measure whether the final label matches ground truth (Wen et al., 2024; Li et al., 2025a), so a pipeline can look accurate on a benchmark while quietly dismissing real vulnerabilities with plausible but unchecked reasoning.
We therefore advocate a simple principle: LLMs may help analyze bugs, but decisions about program behavior should be grounded in formal methods. Directly applying high-precision formal analysis to the full program, however, is often impractical in large systems such as the Linux kernel (Lawall et al., 2025; Chen et al., 2024b). Whether the reported error state is reachable may depend on subsystem interfaces, indirect calls, and state established across entry points, context that is difficult to model precisely at whole-program scale. What is needed instead is a small executable context that preserves enough of this environment for the reported error state to be checked by formal analysis.
We instantiate this principle by using the model for context construction, not for the verdict. Existing LLM-based triage systems already rely on the model to inspect surrounding code, follow relevant definitions and call chains, and decide which context matters to judge a warning (Li et al., 2025a, 2024). In a program-analysis setting, this work can be treated as implicit construction of an analysis context. Evident makes that construction explicit while removing the model's final verdict: the LLM must externalize the selected context as an executable program artifact, and the final decision is made by formal analysis of that artifact, in our case abstract interpretation with Frama-C/Eva. Thus, the model's contribution enters the pipeline as analyzable code rather than as a natural-language judgment.
This shift creates a new obligation: a verdict on the generated program must transfer to the original program. For discharge, the requirement is directional: the artifact should over-approximate the original warning-relevant executions with respect to the reported error state. If it is too narrow, for example by fixing a user-controlled length to a constant, it can silently mask the overflow being checked. Proving this over-approximation directly would require the same whole-program context that the artifact is meant to avoid. We therefore screen generated programs with checkable necessary conditions. If a program fails these checks, it is discarded rather than used to support any verdict.
We instantiate this idea in Evident for taint-style warnings in Android Linux kernel drivers. These warnings report that data from an untrusted source may reach a sensitive operation in a way that triggers an error state, such as an out-of-bounds access, an integer overflow, or an invalid pointer dereference. Each warning provides the source, the reported path, the sink, and the reported error state to be checked.
Given such a warning, Evident constructs a smaller executable C program for the backend analyzer. We call this program a warning-specific analysis harness. (We use harness in the program-analysis sense: a self-contained program for analysis, rather than the recent usage in which a “harness” orchestrates an LLM.) The harness invokes a selected driver entry point, initializes the state needed to exercise the reported path, and initializes warning-dependent inputs with type-derived abstract values. Before backend analysis, Evident validates the generated harness using minimum admission checks: the harness must make the reported sink analyzable by the backend, reach relevant checkpoints on the reported path, and keep warning-dependent inputs unconstrained within their program types. Only harnesses accepted by validation are used for the final backend analysis.
We evaluate Evident on 200 real Android kernel-driver warnings reported by two existing static detectors. Evident produces correct results on 151 cases and discharges 111 false alarms without suppressing any confirmed bug. The remaining 49 cases are conservative failures: 28 non-bug warnings remain alarmed, and 21 cases are unresolved. Evident also rediscovers a real bug that had been overlooked by both prior LLM-based filtering and manual triage. These results show that LLM-generated harnesses can make warning discharge practical when their use is mediated by validation and formal backend analysis.
This paper makes three contributions:
- A principled use of LLMs in bug analysis. We articulate and implement a role for LLMs that keeps reconstruction but discards the verdict. Evident uses the model to externalize warning-specific program context as an executable analysis harness, while the final no-bug decision is made by formal analysis of that harness rather than by the model's judgment.
- Validation for LLM-generated analysis harnesses. We introduce validation checks that enforce necessary preservation requirements before backend analysis. These checks do not prove full equivalence to the original program, but define the admission boundary under which a generated harness may support a no-bug decision. Harnesses that fail validation are rejected rather than used to support a verdict.
- Empirical results on Android kernel-driver warnings. We evaluate Evident on 200 Android kernel-driver warnings from two existing static detectors. Evident produces correct results on 151 cases (76%) and discharges 111 false alarms without suppressing any confirmed bug. The remaining 49 cases are conservative failures: 28 non-bugs remain alarmed, and 21 cases are unresolved. It also rediscovers a confirmed vulnerability previously overlooked by both model-decided filtering and manual triage.
2. Motivation
We motivate Evident with a real Android kernel-driver bug that was overlooked by both prior LLM-based filtering and manual triage, but corresponds to an existing vendor patch (1): a negative value passed through the user-controlled ucontrol structure could become a positive bound and lead to an overflow. This example illustrates two failure modes in LLM-assisted bug analysis. First, a model can wrongly discharge a real bug with a plausible rationale based on an apparent sanity check. Second, even when the LLM is used only to construct the analysis harness, the generated harness can be inadmissible if it reconstructs a benign execution that overconstrains the attacker-controlled input.
2.1. A Plausible Discharge Can Be Wrong
Figure 1 shows a warning that is easy to dismiss when viewed locally. The loop bound is checked against the size of items, and the write occurs immediately after the check. An LLM-based filter, or a human reviewer reading only the local snippet, can therefore give a convincing no-bug rationale: the guard at Line 17 appears to bound the loop, so the out-of-bounds warning looks like a false alarm.
The local rationale misses the type of the checked value. The value is not an int, but a 64-bit long field in a control object defined by ALSA (the Advanced Linux Sound Architecture), the Linux kernel's audio subsystem. A negative long can pass the guard and then be narrowed to a positive 32-bit int. For example, a negative long whose low 32 bits encode a positive value above 16 becomes that positive value after the narrowing conversion. Thus, the guard does not establish the condition needed to make the loop safe, namely n <= 16, and the write to items[i] can go out of bounds.
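The narrowing effect can be reproduced in a few lines of C. The following is a minimal sketch, not code from the driver: the bound 16 comes from the text, while the function names and the sample value are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define ITEMS_LEN 16  /* array size implied by the safety condition n <= 16 */

/* The guard as a local reader sees it, evaluated on the 64-bit value. */
int guard_accepts(int64_t wide) {
    return wide <= ITEMS_LEN;
}

/* Narrowing to 32 bits keeps the low 32 bits on common ABIs. The final cast
   is implementation-defined for results above INT32_MAX; GCC and Clang wrap. */
int32_t narrow_to_int(int64_t wide) {
    uint32_t low = (uint32_t)(uint64_t)wide;  /* well-defined modular reduction */
    return (int32_t)low;
}

/* True when a value passes the guard yet yields a loop bound above the array size. */
int guard_is_bypassed(int64_t wide) {
    return guard_accepts(wide) && narrow_to_int(wide) > ITEMS_LEN;
}

/* Illustrative value (an assumption): -(2^32) + 100 is negative as a 64-bit
   integer, yet its low 32 bits encode 100. */
void demo_narrowing(void) {
    int64_t sample = -4294967196LL;
    assert(guard_accepts(sample));
    assert(narrow_to_int(sample) == 100);
    assert(guard_is_bypassed(sample));
}
```

Small negative values such as -1 pass the guard but stay negative after narrowing, so only part of the negative range defeats the check; a discharge argument has to rule out that part explicitly.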
This example shows why a plausible local explanation is not enough to discharge a warning. The visible check and the reported write are adjacent, but the value's type and admissible range are determined by a framework object outside the local snippet. A no-bug decision must account for those facts over the attacker-controlled input range; otherwise it can miss executions that reach the reported error state.
2.2. From Model Judgment to Analysis Context
The bug above is not beyond program analysis. A sufficiently precise analysis, e.g., a path-sensitive value analysis such as Frama-C/Eva (Cuoq et al., 2012), or symbolic execution such as KLEE (Cadar et al., 2008), can detect the reported error state once the relevant code is placed in an analyzable context. Locally, the required analysis is straightforward: keep the user-controlled value unconstrained within its program type, follow the guard, model the narrowing conversion, and check the resulting array bound.
The difficulty is bringing this reasoning to bear in a real Linux kernel driver. Running the analyzer directly on the whole driver module is not realistic: starting from driver registration or module initialization would force the analyzer to explore a large kernel-driven state space before reaching the warned sink. This is why symbolic-execution systems often rely on selected entry points, environment models, under-constrained execution, or selective exploration to make analysis tractable (Ramos and Engler, 2015; Chipounov et al., 2011). Nor does the warning itself provide the information needed to make the warned behavior analyzable. In our setting, a warning typically identifies only the bug location and a coarse source-to-sink taint flow.
A static analysis harness fills this gap by supplying the warning-specific context that the backend analyzer needs. It initializes the driver and framework state needed to reach the warned sink and then invokes a selected entry point, keeping warning-dependent values unconstrained within their program types. Such context could in principle be supplied by manually modeling each subsystem or by writing deterministic harness generators. Evident uses an LLM as a practical way to recover this context from driver and framework conventions that are costly to model manually across subsystems. The LLM therefore only helps construct the harness.
This creates a validation problem. As Figure 1 indicates, an LLM-generated harness may look plausible because it reconstructs a benign execution of the driver. By concretizing under-specified inputs or state according to normal use, it can overconstrain the analysis and remove executions that reach the reported error state. Evident therefore validates the generated harness against the reported warning before allowing it to enter backend analysis.
2.3. The Need for Harness Validation
Even if the LLM is not asked to decide whether the warning is real, its harness can still make the warning disappear by changing the analyzed context. In the motivating example, a plausible harness might replace the attacker-controlled value with the range suggested by the local guard. Such a harness, as shown in Figure 1, could initialize the attacker-controlled value with Frama_C_interval(0, 16), where Frama_C_interval(a,b) introduces an abstract integer value constrained to the interval [a,b] for Frama-C/Eva.
This example shows that harness construction can fail even when the final decision is delegated to a formal backend. If the LLM fixes the warning-dependent input to [0,16], the backend is essentially analyzing a different program context in which the negative long values have already been removed. A no-bug result on such a harness would only show that the reported error state is unreachable after the input domain has been narrowed.
The needed check is concrete in this example: the tainted input reported by the detector must be initialized with an abstract value over its program type, not replaced by a benign concrete range chosen by the LLM. Evident therefore validates the harness before backend analysis and rejects harnesses that concretize or over-constrain the warning-dependent input.
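The effect of narrowing the input domain can be illustrated with a toy interval model. This is a sketch under stated assumptions: `struct interval` and its helpers are hypothetical stand-ins for abstract values, not Frama-C/Eva's API, and the negative sample is an illustrative 64-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Toy closed interval over 64-bit integers (a stand-in for an abstract value). */
struct interval { int64_t lo, hi; };

int interval_contains(struct interval d, int64_t v) {
    return d.lo <= v && v <= d.hi;
}

/* A harness domain is admissible for an input only if it covers the whole
   domain that the program type allows. */
int interval_covers(struct interval harness, struct interval program) {
    return harness.lo <= program.lo && program.hi <= harness.hi;
}

void demo_domains(void) {
    struct interval type_wide = { INT64_MIN, INT64_MAX };  /* every 64-bit long value */
    struct interval narrowed  = { 0, 16 };                 /* range suggested by the guard */
    int64_t negative_input = -4294967196LL;                /* illustrative attacker value */

    assert(interval_contains(type_wide, negative_input));
    assert(!interval_contains(narrowed, negative_input));  /* execution silently removed */
    assert(!interval_covers(narrowed, type_wide));         /* such a harness is rejected */
    assert(interval_covers(type_wide, type_wide));
}
```

A no-bug result computed over `narrowed` says nothing about `negative_input`, which is why the check compares the harness domain against the type-derived domain rather than against the guard.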
3. System Overview
Figure 1 also summarizes Evident. Evident starts from a warning produced by a front-end detector. The warning identifies an untrusted source, a sink, the reported error state, and a taint flow from the source toward the sink. From this warning, Evident constructs a smaller C program that exposes the reported behavior to a formal backend analyzer. We call this program a warning-specific analysis harness. In our implementation, the backend is Frama-C/Eva: Frama-C is a C program-analysis platform, and Eva is its abstract-interpretation-based value analysis plugin. The front-end detector is not fixed; any detector that provides the required warning fields can be used.
Definition. Fix a build of the analyzed program. A location $\ell$ is a statement in that build. An error state is a pair $e = (\ell, \varphi)$, where $\varphi$ is a predicate over the program state at $\ell$. An execution reaches $e$ if it visits $\ell$ in a state satisfying $\varphi$; $e$ is reachable in a program $P$ if some execution of $P$ reaches it.
A warning $w$ reports a source, a reported path, and an error state $e$, and claims that $e$ is reachable in the analyzed program $P$. Evident discharges $w$ by establishing the negation: no execution of $P$ reaches $e$.
A harness $H$ preserves $w$'s error state if reachability transfers from $P$ to $H$: whenever $e$ is reachable in $P$, it is reachable in $H$. Contrapositively, if $H$ preserves $w$'s error state, then a proof that $e$ is unreachable in $H$ justifies discharging $w$.
Scope. Evident targets warning discharge in open programs: code analyzed without a closed, affordable entry point. In such settings, a warning identifies a local error state, while the context needed to analyze it is separated from the natural program entry. A warning-specific harness supplies a local entry and enough context for backend analysis. We study kernel drivers because the gap between the reported path and the context needed for analysis is especially large: relevant context often comes from build configuration, framework state, and subsystem dispatch. Similar issues arise in libraries, event handlers, and mid-program entries in large services, although we do not evaluate them here.
We focus on taint-style warnings whose error states are local memory- or integer-safety predicates, including out-of-bounds accesses and integer overflows. Values are treated as attacker-controlled when marked by the warning or source specification. Evident does not decide end-to-end exploitability, model concurrent interleavings, or reason about properties requiring rich heap-shape or recursive data-structure invariants. Other error states can be supported when the warning supplies a backend-checkable predicate, for example through a function contract checked by Frama-C/Eva.
Outcomes. Evident produces three outcomes, summarized in Table 1. If backend analysis of an accepted harness shows the reported error state unreachable, Evident reports no bug and discharges the warning. If the backend analyzes an accepted harness but does not show the error state unreachable, Evident leaves the warning still alarmed. If no accepted backend result is obtained, Evident reports unresolved.
| System result | Ground truth | Evaluation label |
|---|---|---|
| No bug | Unreachable | Correct |
| No bug | Reachable | FN / incorrect discharge |
| Still alarmed | Reachable | Correct |
| Still alarmed | Unreachable | FP / conservative false alarm |
| Unresolved | Any | Reported separately |
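
The mapping in Table 1 can be written as a small lookup function. This is a minimal sketch; the enum and function names are assumptions introduced for illustration.

```c
#include <assert.h>

enum system_result { RESULT_NO_BUG, RESULT_STILL_ALARMED, RESULT_UNRESOLVED };
enum ground_truth  { TRUTH_UNREACHABLE, TRUTH_REACHABLE };
enum eval_label {
    LABEL_CORRECT,
    LABEL_FN_INCORRECT_DISCHARGE,
    LABEL_FP_CONSERVATIVE_ALARM,
    LABEL_REPORTED_SEPARATELY
};

/* Maps a system result and the ground truth to the evaluation label of Table 1. */
enum eval_label evaluation_label(enum system_result r, enum ground_truth t) {
    switch (r) {
    case RESULT_NO_BUG:
        return t == TRUTH_UNREACHABLE ? LABEL_CORRECT : LABEL_FN_INCORRECT_DISCHARGE;
    case RESULT_STILL_ALARMED:
        return t == TRUTH_REACHABLE ? LABEL_CORRECT : LABEL_FP_CONSERVATIVE_ALARM;
    default:
        return LABEL_REPORTED_SEPARATELY;  /* unresolved, regardless of ground truth */
    }
}

void demo_labels(void) {
    /* The only unsafe cell: a discharge of a reachable error state. */
    assert(evaluation_label(RESULT_NO_BUG, TRUTH_REACHABLE) == LABEL_FN_INCORRECT_DISCHARGE);
    assert(evaluation_label(RESULT_STILL_ALARMED, TRUTH_UNREACHABLE) == LABEL_FP_CONSERVATIVE_ALARM);
}
```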
4. Design
Algorithm 1 shows Evident's workflow at a high level. Given a warning $w$, Evident asks the LLM to construct a candidate harness $H$. The warning provides the information needed to construct that harness. For the motivating warning, this means constructing a context in which the reported out-of-bounds check at items[i] can be analyzed, while the value derived from ucontrol remains unconstrained within its long type.
Evident compiles and validates each candidate before backend analysis. Validation is an admission gate: candidates that fail the necessary requirements for the warning are rejected, and their diagnostics may be used to request another harness. Once a harness is accepted, the backend analyzes the reported error state under that harness and its recorded assumptions, producing one of the outcomes defined in §3.
| Requirement | Instance in Fig. 2 |
|---|---|
| Source | *ucontrol |
| Sink | write to items[i] |
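| Error state | out-of-bounds write at items[i] |
| Sink function | msm_routing_put_... |
| Value-flow witness | fragment within msm_routing_put_... |
| Source function | snd_ctl_ioctl |
| Call witness | snd_ctl_ioctl → … → snd_ctl_elem_write → msm_routing_put_... |
| Entry candidates | suffixes of the call witness |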
4.1. Preparing the Warning Input
Evident starts from the information reported by the front-end detector: where the tainted input comes from, where it reaches a warned operation, and what error state should be checked there. For harness construction, Evident splits this reported evidence into two parts: a value-flow witness that captures the warning-dependent computation to exercise, and a call witness that identifies candidate harness entries.
Value-flow witness. A value-flow witness $\pi$ refines the detector-reported taint flow into the program points that harness construction should exercise. It records how a warning-dependent value moves through assignments, conversions, parameter passing, and uses on its way to the warned operation. The witness may be interprocedural; in the motivating warning, the tainted control object passes through the ALSA framework before a value read from it becomes the driver callback's loop bound. For harness construction, Evident views the witness as fragments aligned with the call witness. We write $\pi|_{f_n}$ for the fragment inside the sink-side function $f_n$, as illustrated in Table 2.
For the case in Figure 2, the local value-flow witness is:
Call witness. As shown in Table 2, a call witness records how that sink-side function is reached from an external or framework entry point. We write the call witness as $f_1 \to f_2 \to \cdots \to f_n$, where $f_1$ is the outermost function reported on the path and $f_n$ is the sink-side function. In Figure 2, $f_1$ is snd_ctl_ioctl, an intermediate function is snd_ctl_elem_write, and $f_n$ is msm_routing_put_.... One further intermediate function, an ALSA framework routine, is omitted for clarity.
The call witness also induces a set of candidate harness entry points. Rather than forcing the harness to start at $f_1$, Evident considers suffixes of the call witness: $f_i \to \cdots \to f_n$ for $1 \le i \le n$. Each candidate chooses a different entry depth. Starting near $f_1$ preserves more framework behavior but requires more kernel state to be initialized. Starting closer to $f_n$ reduces the amount of framework state, but requires the harness to reconstruct the objects and assumptions that the skipped framework calls would have supplied.
The call witness is also not just a sequence of direct calls. In the real warning, snd_ctl_elem_write is an ALSA framework routine that receives the control value and dispatches to the driver callback through the control's put operation, effectively calling kctl->put(...). This indirect call is common in kernel frameworks. Therefore, the call witness tells Evident not only which functions are on the path, but also which framework object must be reconstructed and which operation binding must be made explicit in the harness.
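Enumerating entry candidates as suffixes of the call witness can be sketched as follows. The data layout and function names are assumptions for illustration; the witness lists only the functions named in the text and keeps the truncated callback name as written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct entry_candidate {
    size_t entry_index;  /* position of the chosen entry in the call witness */
    size_t remaining;    /* functions from the entry down to the sink side */
};

/* Writes one candidate per suffix of a witness of the given length. */
size_t enumerate_entry_candidates(size_t witness_len, struct entry_candidate *out) {
    for (size_t i = 0; i < witness_len; i++) {
        out[i].entry_index = i;
        out[i].remaining = witness_len - i;
    }
    return witness_len;
}

void demo_entry_candidates(void) {
    /* The framework routine omitted in the text is left out here as well. */
    const char *witness[] = { "snd_ctl_ioctl", "snd_ctl_elem_write", "msm_routing_put_..." };
    struct entry_candidate cands[3];
    size_t n = enumerate_entry_candidates(3, cands);

    assert(n == 3);
    /* First candidate keeps the most framework context. */
    assert(strcmp(witness[cands[0].entry_index], "snd_ctl_ioctl") == 0);
    /* Last candidate starts directly at the sink-side callback. */
    assert(cands[2].remaining == 1);
}
```

A candidate with a larger `entry_index` skips more framework code, so the harness must rebuild more of the objects that the skipped calls would have supplied, including any operation binding used for indirect dispatch.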
4.2. LLM-Guided Harness Construction
Task. The harness constructor is the only component in Evident that uses an LLM. Given the warning input and entry candidates from the previous step, it produces a candidate C harness $H$ for one suffix $f_i \to \cdots \to f_n$. The harness constructs the arguments and state needed to invoke $f_i$, keeps warning-dependent values unconstrained within their C types, and calls into $f_i$ so that the remaining call witness can reach the sink-side function $f_n$.
For the motivating warning, this task is not just to call the driver callback. If the harness starts from the ALSA framework routine snd_ctl_elem_write, it must construct the control objects needed for the framework dispatch and bind the driver's put operation so that the indirect call reaches msm_routing_put_.... If the harness starts closer to the driver callback, it must instead reconstruct the objects that the framework would have supplied, including the kcontrol and ucontrol arguments. The entry candidates expose this tradeoff to the constructor.
Figure 3 shows a simplified harness sketch for the motivating warning using a suffix that starts at the sink-side callback. The real generated harness contains additional declarations, stubs, and initialization code, but the excerpt shows the key obligations: it constructs the framework-supplied arguments (e.g., shift at Line 22 in Figure 2), keeps the warning-dependent value unconstrained within its C type, and invokes the selected entry.
Type-preserving abstract-value initialization. For warning-dependent values, Evident uses type-preserving abstract-value initialization: the harness initializes the value with an abstract value whose domain is derived from the C type of the lvalue, rather than from a concrete range chosen by the LLM. In the paper presentation, this is written as a single assignment x = abstract_value(x), where x is the lvalue being initialized and the abstract_value function is a harness primitive that introduces an Eva abstract value over the C type of x. The LLM chooses where the abstract value should be introduced, but it does not choose the value's type or range.
This design is important for kernel code, where warning-dependent values often appear through typedefs, nested structures, and unions. In the motivating warning, the loop bound is read through a framework-defined union field whose underlying type is long, even though the value is later stored in an int. Asking the LLM to write backend-specific interval expressions would require it to choose both a type and a range, which is exactly where an over-constrained harness can remove attacker-controlled values. With abstract_value, the LLM only places the abstract-value initialization at the relevant program location; the range remains determined by the program type and backend semantics.
Code navigation. Harness construction follows the pattern of tool-augmented coding agents: the LLM is not expected to infer the relevant driver and framework structure from a single prompt. Evident provides the warning input, entry candidates, relevant snippets, and abstract-input requirements, and lets the LLM request source facts through code-navigation tools. In our implementation, a lightweight tool layer routes these requests to clangd, which provides language-server queries such as go-to-definition, references, callers, callees, and type information (8; 18).
Repair loop. Harness construction is iterative. After each proposal, Evident tries to build the harness and run the available checks. Failures are returned to the LLM as structured diagnostics, and the LLM is asked to repair the harness under the same warning input. Some failures are ordinary compile errors, such as missing declarations, incompatible pointer types, or invalid field accesses. Others are validation failures, such as failing to keep a warning-dependent value unconstrained within its program type or generating a harness whose detector-reported intermediate checkpoints are not reachable. We describe validation in more detail in §4.3.
If no candidate harness passes within the attempt budget, Evident marks the case unresolved.
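The repair loop and its attempt budget can be sketched as below. This is a simplified illustration under assumptions: the callback type, status codes, and the scripted stub are hypothetical and stand in for the LLM, the build step, and the validation checks.

```c
#include <assert.h>

enum check_status { CHECK_OK, COMPILE_ERROR, VALIDATION_FAILURE };
enum build_result { HARNESS_ACCEPTED, CASE_UNRESOLVED };

/* Hypothetical callback: proposes or repairs a harness for the given attempt,
   given the previous diagnostic, and returns the status of build plus checks. */
typedef enum check_status (*attempt_fn)(int attempt, enum check_status previous);

enum build_result repair_loop(attempt_fn attempt, int budget, int *attempts_used) {
    enum check_status status = CHECK_OK;  /* no diagnostic before the first proposal */
    for (int i = 0; i < budget; i++) {
        status = attempt(i, status);
        if (status == CHECK_OK) {
            *attempts_used = i + 1;
            return HARNESS_ACCEPTED;
        }
        /* Otherwise the diagnostic in status is passed to the next attempt. */
    }
    *attempts_used = budget;
    return CASE_UNRESOLVED;  /* budget exhausted without an accepted harness */
}

/* Stub: fails to compile once, fails validation once, then passes. */
enum check_status scripted_attempt(int attempt, enum check_status previous) {
    (void)previous;
    if (attempt == 0) return COMPILE_ERROR;
    if (attempt == 1) return VALIDATION_FAILURE;
    return CHECK_OK;
}

void demo_repair_loop(void) {
    int used = 0;
    assert(repair_loop(scripted_attempt, 5, &used) == HARNESS_ACCEPTED);
    assert(used == 3);
}
```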
4.3. Harness Validation
Validation checks a generated harness against the warning input summarized in Table 2. It asks four questions: does the harness reach the reported sink; does it keep the warning-dependent value from the reported source unconstrained within its program type; does it execute the relevant fragment of the value-flow witness; and does it avoid fixing other state that can affect the reported error state to unjustified constants?
Abstract-input audit. The first validation check is tied to the warning-dependent source and the values derived from it. Evident audits assignments in the generated harness to catch concrete or overly narrow initialization of warning-dependent state. Values identified by the warning input must be introduced through the type-preserving form abstract_value. The audit also checks for aggregate initialization that silently fixes warning-relevant state, such as zeroing an object that contains a tainted value or a state field that controls whether the tainted value reaches the sink.
The audit distinguishes structural wiring from value restriction. A harness may need concrete assignments to allocate storage, connect pointer fields, or install callback tables. For example, the motivating harness may bind an ALSA control's put operation so that the framework dispatch reaches the driver callback. By contrast, numeric or state values that influence the reported error state must remain unconstrained within their program types unless the warning input or trusted setup code justifies a narrower constraint.
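The distinction between wiring and value restriction can be sketched as a pass over recorded harness assignments. The record layout and names below are assumptions for illustration, and the sketch omits the case where trusted setup code justifies a narrower constraint.

```c
#include <assert.h>
#include <stddef.h>

enum init_kind {
    INIT_WIRING,          /* allocate storage, connect pointers, install callbacks */
    INIT_ABSTRACT_VALUE,  /* type-preserving abstract_value initialization */
    INIT_CONCRETE         /* constant or model-chosen range */
};

struct harness_assignment {
    const char *description;
    enum init_kind kind;
    int warning_dependent;  /* nonzero if the warning input marks this state */
};

/* Returns the index of the first offending assignment, or -1 if the audit passes.
   Warning-dependent state must be introduced through abstract_value. */
int abstract_input_audit(const struct harness_assignment *a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (a[i].warning_dependent && a[i].kind != INIT_ABSTRACT_VALUE)
            return (int)i;
    }
    return -1;
}

void demo_audit(void) {
    struct harness_assignment ok[] = {
        { "bind the control's put operation", INIT_WIRING, 0 },
        { "tainted control value", INIT_ABSTRACT_VALUE, 1 },
    };
    struct harness_assignment bad[] = {
        { "bind the control's put operation", INIT_WIRING, 0 },
        { "tainted control value fixed to a guard-derived range", INIT_CONCRETE, 1 },
    };
    assert(abstract_input_audit(ok, 2) == -1);
    assert(abstract_input_audit(bad, 2) == 1);
}
```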
Target reachability and dead-sink. The second validation check is tied to the reported sink. Evident inserts Frama-C logging probes at the sink and at selected earlier hook points on the chosen entry path, then runs a lightweight Eva pass. A probe is observed only when Eva propagates a non-bottom abstract state to the instrumented statement.
Reachability of the sink is the normal acceptance condition. It shows that the harness has constructed enough entry context for analysis to arrive at the reported location. Missing reachability is treated as a harness failure by default, because it often reflects incomplete framework wiring rather than a genuine dead sink. In the motivating warning (Line 22 in Figure 2), the sink-side function reads through kcontrol->private_value; unless the harness constructs kcontrol and initializes that field (Lines 7-8 in Figure 3), Eva can lose the abstract state before the reported sink.
Evident accepts an unreachable reported sink only under a stricter dead-sink condition. First, an earlier hook in the sink-side function or on the selected entry path must be reached, so the harness is not simply dead. Second, the reported sink itself must remain unreachable. Third, the lightweight run must report no alarms before the sink. A dead-sink result is accepted only when the sink-side context is reached, the target branch is unreachable, and no earlier alarm explains the loss of reachability.
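The acceptance rule for sink reachability, including the stricter dead-sink condition, can be summarized as a small decision function. This is a sketch; the struct and enum names are assumptions, and the probe results are taken as given inputs rather than produced by an analyzer.

```c
#include <assert.h>

enum sink_verdict { SINK_REACHED, DEAD_SINK_ACCEPTED, HARNESS_REJECTED };

struct probe_report {
    int sink_reached;          /* probe at the reported sink observed */
    int earlier_hook_reached;  /* some earlier hook on the entry path observed */
    int alarms_before_sink;    /* alarms reported before the sink in the lightweight run */
};

enum sink_verdict check_target(struct probe_report r) {
    if (r.sink_reached)
        return SINK_REACHED;  /* normal acceptance condition */
    if (r.earlier_hook_reached && r.alarms_before_sink == 0)
        return DEAD_SINK_ACCEPTED;  /* context reached, sink dead, no earlier alarm */
    return HARNESS_REJECTED;  /* default: treat missing reachability as a harness failure */
}

void demo_check_target(void) {
    struct probe_report dead_harness = { 0, 0, 0 };
    struct probe_report dead_sink    = { 0, 1, 0 };
    assert(check_target(dead_harness) == HARNESS_REJECTED);
    assert(check_target(dead_sink) == DEAD_SINK_ACCEPTED);
}
```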
Checkpoint reachability. The third check is tied to the value-flow witness. Reaching the reported sink is not enough: a harness may reach the same program location without exercising the warning-dependent value reported by the detector. Evident therefore instruments selected checkpoints from the portion of the witness relevant to the chosen entry candidate and checks whether they are reachable with non-bottom abstract state.
These checkpoints may include the load of the warning-dependent value, a narrowing conversion, or an assignment to a loop-bound variable. The check does not prove the detector's taint relation or the full source-to-sink flow. It only serves as an admission check: an accepted harness must exercise the warning-relevant fragment of the reported witness, rather than merely reach the sink through an unrelated context.
Dependency audit. The final check considers state beyond the reported source. Such state may determine whether the reported sink is reached, whether the warning-dependent value reaches the sink, or what range it has when the reported error state is checked. For example, a guard such as if (g) x &= 0xff can sanitize a tainted value depending on the value of g. If g is a global variable and Eva starts the analysis with default zero-initialized globals, the backend may miss behaviors that remain possible in the original program.
Evident therefore runs a dependency analysis by Frama-C/From on the harnessed program and audits state that can influence reachability of the sink or the check of the error state. When such state is not fixed by trusted setup code, the harness must expose it through type-preserving abstract-value initialization. This check is especially important for dead-sink results: a proof that the sink is unreachable must not depend on model-supplied concrete values for guards or global state. It must follow from the code and build configuration under unconstrained warning-relevant state.
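The guard example from the text can be made concrete as follows. This is an illustrative sketch: the global `g`, the function name, and the sample value are assumptions, and the code runs concretely rather than under an analyzer.

```c
#include <assert.h>

/* Hypothetical global guard. C zero-initializes it, which is also the
   default an analysis may assume at entry. */
int g;

/* Value of the tainted input when the error state is checked. */
unsigned value_at_check(unsigned x) {
    if (g) x &= 0xff;  /* sanitization applies only when the guard is set */
    return x;
}

void demo_guard_dependency(void) {
    unsigned tainted = 0x1234;

    g = 0;
    assert(value_at_check(tainted) == 0x1234);  /* behavior seen under the default */

    g = 1;
    assert(value_at_check(tainted) == 0x34);    /* behavior unseen if g stays pinned to 0 */
}
```

Pinning `g` to either value explores only one of the two behaviors, so the range of the value at the check depends on state outside the reported source. Exposing `g` through abstract-value initialization lets the backend consider both.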
4.4. Backend Analysis
After a harness passes validation, Evident analyzes the accepted harness with Frama-C/Eva. The backend checks the reported error state under the program context encoded by the harness. Evident reports no bug only when the backend shows the error state unreachable on an accepted harness. Harness construction, LLM explanations, validation success, alarms, unknown results, timeouts, and analysis failures are not evidence for no bug.
We do not use Frama-C/Eva's -lib-entry mode as the main way to analyze driver entries. Although -lib-entry can abstract globals at analysis entry, drivers often contain large static object graphs and recursive objects that prevent analysis from reaching the warning-specific code. RQ4 evaluates a rule-guided variant that relies more heavily on this style of direct entry analysis; it leaves many cases unresolved.
Evident may analyze an accepted harness under multiple Frama-C/Eva precision settings. In our implementation, the lightweight run uses Eva precision level 1, while the full run uses level 11. The lightweight run is cheaper and can quickly discharge warnings whose reported error state is already unreachable; the full run spends more analysis effort to reduce alarms caused by imprecision. These settings change precision and cost, not the decision rule: a no-bug result requires some accepted backend run to show the error state unreachable. If accepted runs only alarm, return unknown, time out, or fail, the warning remains still alarmed or unresolved as defined in §3.
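The decision rule across precision settings can be sketched as a fold over run statuses. The names are hypothetical, and one mapping is an assumption not spelled out in the text: runs that complete with an alarm or an unknown result count as analyzed (still alarmed), whereas timeouts and failures alone leave the case unresolved.

```c
#include <assert.h>
#include <stddef.h>

enum run_status { RUN_UNREACHABLE, RUN_ALARM, RUN_UNKNOWN, RUN_TIMEOUT, RUN_FAILED };
enum outcome { OUT_NO_BUG, OUT_STILL_ALARMED, OUT_UNRESOLVED };

/* Combines the results of backend runs on one accepted harness. */
enum outcome decide(const enum run_status *runs, size_t n) {
    int analyzed = 0;
    for (size_t i = 0; i < n; i++) {
        if (runs[i] == RUN_UNREACHABLE)
            return OUT_NO_BUG;  /* one run proving unreachability suffices */
        if (runs[i] == RUN_ALARM || runs[i] == RUN_UNKNOWN)
            analyzed = 1;       /* assumption: counts as an obtained backend result */
    }
    return analyzed ? OUT_STILL_ALARMED : OUT_UNRESOLVED;
}

void demo_decide(void) {
    /* Lightweight run alarms, full run proves the error state unreachable. */
    enum run_status light_then_full[] = { RUN_ALARM, RUN_UNREACHABLE };
    assert(decide(light_then_full, 2) == OUT_NO_BUG);
}
```

Nothing other than `RUN_UNREACHABLE` can produce `OUT_NO_BUG`, which mirrors the rule that precision settings change cost, not the basis for discharge.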
5. Implementation
This section describes how Evident instantiates the design on Android kernel-driver code and Frama-C/Eva. We focus on the implementation details that matter for reproducibility: kernel build preparation, shims, validation instrumentation, and analyzer-output extraction.
Warning Inputs and Kernel Build Preparation. Evident is implemented and evaluated with two kinds of front-end warning sources. The first is an LLVM-based static analysis, represented by SUTURE in our evaluation (Zhang et al., 2021). This adapter imports warning reports from LLVM instruction traces and debug locations. The second is a SARIF-style SAST report, represented by a CodeQL-based detector built on CodeQL's taint-tracking library (Backhouse, 2018). This adapter imports the reported source, sink, and path nodes from the SARIF record. Both adapters produce the same internal warning record as shown in §4.1.
We run Evident on Android kernel driver source files from the warning datasets; Section 6 reports the exact versions. For each target file, the kernel is first built to recover include paths, configuration-dependent macros, and compiler options. Evident then preprocesses the file into a .i translation unit. This preprocessed file fixes the macro-expanded C program seen by Frama-C/Eva and serves as the artifact rewritten with harness code and validation instrumentation.
Harness Support and Kernel Shims. Generated harnesses are compiled and analyzed outside the full kernel environment. Evident therefore provides a small support layer for harness generation: shim headers, wrapper include paths, abstract-value initialization primitives, and selected kernel API models. During compilation of harnesses, Evident includes a shim header and places a wrapper include directory before the kernel headers. This setup simplifies kernel mechanisms that are irrelevant or difficult for C-level analysis, including compiler attributes, build-time assertions, dynamic-debug macros, selected atomic primitives, and configuration-specific helpers.
The shim layer implements the abstract-value initialization primitive used by generated harnesses. abstract_value(x) uses the compile-time type of x to choose a matching Frama-C/Eva expression and casts the result back to __typeof__(x). The generated harness specifies where a value should be left unconstrained; the shim determines the backend expression appropriate for that C type.
Some warning classes require a small C-level contract to make the reported error state visible to Frama-C/Eva. For the CodeQL-based detector that reports tainted accesses to copy_from_user-style APIs, Evident provides function contracts for the checked APIs. The contract states the destination-buffer validity condition checked by the warning. The copy contents are irrelevant for this warning class; the backend only needs to check whether the destination range may be invalid when the call is reached. For example, Evident uses the following contract for copy_from_user:
[Listing: function contract for copy_from_user]
Evident also provides pragmatic shims for kernel mechanisms that commonly block Frama-C/Eva. Allocation routines such as kmalloc are modeled with a bounded static heap that may fail or return storage. Other wrappers simplify architecture- or configuration-specific mechanisms such as current, READ_ONCE/WRITE_ONCE, common data-structure helpers, and selected kernel idioms. These shims define the analysis environment used by the harnesses.
Harnessing Harness Construction. Evident implements harness generation as a scaffolded, tool-augmented coding task. The prompt scaffold turns the warning input and design constraints into concrete generation requirements: the model must choose an entry candidate, construct the objects needed by that entry, initialize warning-dependent state with type-preserving abstract values through abstract_value, and leave the reported error state to Frama-C/Eva. Most importantly, it forbids LLMs from making any analysis decisions.
The scaffold uses a staged workflow. In the planning stage, the model does not emit the harness file. It either requests one source fact through the code-navigation tools or writes a construction plan that selects an entry candidate, identifies the objects that must be wired, and lists the fields that must remain unconstrained within their program types. In the generation stage, the model emits the C harness under that plan. Repair turns keep the warning input fixed and add only compiler or validation diagnostics.
The scaffold also encodes the separation between structural setup and warning-relevant state. Concrete assignments are allowed for object wiring, such as connecting pointer fields, installing callback operations, or constructing a small object graph needed by the chosen entry. Numeric values, indices, lengths, flags, and fields derived from the warning source must be introduced with abstract_value unless they are fixed by trusted setup code.
Aggregate zero-initialization (e.g., memset) is rejected (also enforced by validation), because it assigns values to fields that the LLM may not have inspected, including fields that can affect reachability, pointer validity, or the reported error state. In practice, leaving an aggregate uninitialized often produces more useful diagnostics: Frama-C/Eva reports the specific uninitialized field or invalid pointer that the harness must wire, rather than silently analyzing a zero-filled object graph. The generated harness therefore initializes only the fields needed to wire the selected entry or expose the warning-dependent value.
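As a rough illustration of how such a rejection rule could be checked, the sketch below scans harness text for aggregate zero-initialization. It is a textual heuristic under stated assumptions: `ZERO_INIT_PATTERNS` and `find_zero_init` are hypothetical names, treating `= {0}` as a second rejected form is our assumption, and the source does not say how Evident implements this check.

```python
import re

# Illustrative patterns only; a production check would inspect the parsed C.
ZERO_INIT_PATTERNS = [
    re.compile(r"\bmemset\s*\([^;]*,\s*0\s*,"),  # memset(&obj, 0, sizeof(obj))
    re.compile(r"=\s*\{\s*0?\s*\}\s*;"),         # struct s obj = {0};
]


def find_zero_init(harness_source):
    """Return (line_number, line) pairs that zero-initialize an aggregate."""
    hits = []
    for number, line in enumerate(harness_source.splitlines(), start=1):
        if any(p.search(line) for p in ZERO_INIT_PATTERNS):
            hits.append((number, line.strip()))
    return hits


harness = """struct snd_kcontrol kc;
memset(&kc, 0, sizeof(kc));
struct snd_ctl_elem_value uc = {0};
uc.id.index = abstract_value(uc.id.index);
"""
print(find_zero_init(harness))
```

Lines 2 and 3 of the sample are flagged, while the field-level `abstract_value` initialization on line 4 is left alone.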
Instrumentation for validation. Evident implements reachability checks by inserting Frama-C-visible logging probes into the preprocessed program. The instrumentation is source-line based. During preprocessing, Evident preserves #line directives, which map lines in the .i file back to the original kernel file and line number. Given a reported sink or a checkpoint from the value-flow witness, Evident uses this mapping to find the corresponding line in the preprocessed file.
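The line mapping can be sketched as follows, assuming GCC-style linemarkers (`# 42 "file.c"`) or `#line` directives in the preprocessed file. `build_line_map` is a hypothetical helper, not Evident's actual code.

```python
import re

# Matches both `# 42 "file.c" 2` and `#line 42 "file.c"`.
LINEMARKER = re.compile(r'^#\s*(?:line\s+)?(\d+)\s+"([^"]+)"')


def build_line_map(preprocessed_text):
    """Map (original_file, original_line) to 1-based line numbers in the .i file."""
    mapping = {}
    current_file, current_line = None, 0
    for i_line, text in enumerate(preprocessed_text.splitlines(), start=1):
        marker = LINEMARKER.match(text)
        if marker:
            # A marker gives the origin of the line that follows it.
            current_line = int(marker.group(1))
            current_file = marker.group(2)
            continue
        if current_file is not None:
            mapping.setdefault((current_file, current_line), []).append(i_line)
        current_line += 1
    return mapping


sample = "\n".join([
    '# 1 "drivers/demo.c"',
    "int a;",
    '# 10 "include/demo.h"',
    "int b;",
    '# 3 "drivers/demo.c"',
    "int f(void) {",
    "  return a + b;",
    "}",
])
line_map = build_line_map(sample)
print(line_map[("drivers/demo.c", 4)])
```

The map stores a list per original location because one source line can correspond to several lines of the preprocessed file.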
A sink probe inserts a unique Frama-C logging call at the reported sink, and a witness checkpoint probe inserts analogous logging calls at selected intermediate points from the value-flow witness. After instrumentation, Evident runs a lightweight Eva pass and parses the Frama-C output for the corresponding markers. A marker appears only if the instrumented location is reached with non-bottom abstract state under the generated harness. These probes check reachability of reported program points under the harness.
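The probe mechanism can be sketched textually. The sketch assumes probes are calls whose names begin with `Frama_C_show_each`, a prefix that Eva logs when it analyzes the call; the source does not state which logging mechanism Evident uses, and the marker namespace, helper names, and sample log line are all assumptions.

```python
import re

PROBE_PREFIX = "Frama_C_show_each_evident_"  # hypothetical marker namespace


def insert_probe(lines, index, marker_id):
    """Return a copy of `lines` with a probe call inserted before lines[index]."""
    probe = f"{PROBE_PREFIX}{marker_id}(0);"
    return lines[:index] + [probe] + lines[index:]


def reached_markers(eva_log):
    """Collect the marker ids that the analyzer reported in its log."""
    return set(re.findall(rf"{PROBE_PREFIX}(\w+)\s*:", eva_log))


body = ["int f(int i) {", "  return table[i];", "}"]
instrumented = insert_probe(body, 1, "sink_1")

# Assumed log shape; a marker absent from the log means the probe was not reached.
log = "[eva] demo.i:2: Frama_C_show_each_evident_sink_1: {0}\n[eva] done"
print(reached_markers(log))
```

A real implementation would also have to ensure the insertion point is a valid statement position in the preprocessed C file.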
Frama-C/Eva Result Extraction. Evident extracts results from Frama-C/Eva relative to the expected error state, rather than by counting all analyzer alarms. For each accepted harness, Evident matches Eva reports against the reported sink and the reported error state. The primary match is an alarm at the detector-provided sink location. To account for source-location drift introduced by preprocessing and instrumentation, Evident also accepts nearby reports in the same enclosing function when they involve the same warning-dependent value and the same class of error.
One practical complication is that Eva may stop propagating abstract states after an invalid state before the reported sink. In that case, the marker at the sink may never be reached even though the analysis has already found a relevant error on the same path. Evident records this restricted case as prior-invalid evidence only when the invalid state occurs before the sink, inside the sink-containing function, and matches the reported error state. Such evidence keeps the warning alarmed.
If no target-matched alarm or prior-invalid evidence is found, Evident consults the reachability result for the reported error state. A no-bug result is reported only when Eva shows the reported error state unreachable on the accepted harness. If the backend output cannot be matched to the warning, or analysis terminates before a trusted result is available, the case is marked unresolved.
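The extraction order described above can be sketched as follows. The record types, the `window` used to approximate "nearby", and the tri-state `error_state_unreachable` flag are all hypothetical simplifications, not Evident's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    function: str
    line: int
    error_class: str  # e.g. "index_bound"
    value: str        # warning-dependent lvalue


@dataclass(frozen=True)
class Report:
    function: str
    line: int
    error_class: str
    value: str
    stops_analysis: bool = False  # analyzer stopped propagating after this report


def classify(reports, target, error_state_unreachable, window=5):
    """Apply the extraction order: matched alarm, prior-invalid, reachability."""
    for r in reports:
        if r.function != target.function or r.error_class != target.error_class:
            continue
        at_sink = r.line == target.line
        nearby = abs(r.line - target.line) <= window and r.value == target.value
        prior_invalid = r.stops_analysis and r.line < target.line
        if at_sink or nearby or prior_invalid:
            return "still-alarmed"
    # Only an explicit unreachable result discharges the warning.
    if error_state_unreachable is True:
        return "no-bug"
    return "unresolved"


target = Target("put", 120, "index_bound", "idx")
print(classify([Report("put", 118, "index_bound", "idx")], target, True))
```

Alarm evidence is checked before the reachability result, so a matched report keeps the warning alarmed even when a reachability flag is available.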
Implementation size. The core implementation of Evident is approximately 22.2 KLOC of Python, excluding tests. This includes LLM orchestration, harness construction and Frama-C integration, warning validation, binding/dataflow resolution, and repository lookup. The artifact also includes 0.6 KLOC of Frama-C/Linux shim headers and 0.3 KLOC of YAML prompt templates.
6. Evaluation
We evaluate Evident on real Android Linux kernel driver warnings to answer four research questions:
- RQ1 (Effectiveness): How effectively does Evident analyze real warnings, and what accounts for its unresolved and incorrect cases?
- RQ2 (Model Sensitivity): How does the choice of LLM affect Evident's coverage and error profile under the same validation checks and backend-decision rule?
- RQ3 (Comparison with Model-Decided Triage): How does Evident compare with current triage practices (reasoning scaffolds and agentic systems) whose final verdicts are decided by LLMs?
- RQ4 (Design Variants): How do alternative harness-construction, abstract-input, and validation choices affect Evident's coverage and empirical error profile?
6.1. Experimental Setup
Dataset. We evaluate on 200 warnings drawn from the BugLens dataset (Li et al., 2025a), which includes Android Linux kernel driver warnings reported by two existing static detectors: SUTURE (Zhang et al., 2021) and CodeQL-OOB, a CodeQL-based out-of-bounds detector (Backhouse, 2018). The dataset contains 126 SUTURE warnings and 74 CodeQL-OOB warnings. RQ1 uses all 200 warnings. RQ2–RQ4 use the 126 SUTURE warnings, which are the subset used for model-sensitivity, comparison, and design-variant experiments.
For the SUTURE subset, we use the two main driver modules in the dataset, sound and i2c, because prior BugLens results show that these two modules cover all confirmed true bugs in that subset. The SUTURE benchmark cases, msm-sound and msm-i2c, are evaluated on the Android 10 Qualcomm Pixel kernel android-msm-coral-4.14-android10 (android-10.0.0_r0.56), Linux 4.14.150. The CodeQL-OOB cases use a separate Qualcomm MSM kernel tree with Linux 4.4.21-perf+.
Ground truth. Ground-truth labels were established from the labels and case analyses reported in the SUTURE and BugLens papers, followed by additional manual validation by the authors. In total, the ground-truth set contains 40 real bugs and 160 non-bugs. We treat these labels as the best available reference for this dataset, while noting that they are derived from prior reports and manual review rather than from exhaustive formal specifications.
Metrics. We report correctness against these labels. Correct results (Cor.) are the sum of true positives and true negatives: real bugs that remain alarmed, and non-bugs that are discharged as no-bug. False positives are non-bug warnings that Evident leaves alarmed, and false negatives are real bugs that Evident incorrectly discharges as no-bug. Accuracy is $(\mathit{TP} + \mathit{TN})/N$, where $N$ is the total number of warnings and includes unresolved cases. Unresolved cases are reported separately and do not count as correct, even though they do not discharge the warning.
Baselines. The first baseline directly asks Codex, instantiated with GPT-5.5 with the xhigh reasoning setting, to classify each warning given its report. Codex operates as a fully agentic baseline: it is given read-only access to the entire target kernel tree and may freely explore additional code context using its built-in shell and code-search capabilities (e.g., rg, cat). We disable web search, browser access, and all external tools, so its analysis relies solely on the model and the source code; it has no access to ground-truth labels or any intermediate results of Evident. The second family of baselines uses the BugLens workflow, which performs its own repository-level knowledge retrieval to gather code context. We evaluate both the original BugLens setting with o3-mini, using results reported in the BugLens paper, and a GPT-5.5 variant to keep the model consistent with the rest of our experiments.
Implementation. In the main experiment, Evident uses GPT-5.5 (gpt-5_5-2026-04-23) for harness scoping and Frama-C 32.0 (Germanium) for formal analysis, through the Eva and From plug-ins. The experiments were conducted on a machine running Ubuntu 22.04.5 LTS (Jammy), equipped with an Intel(R) Core(TM) i7-7820X CPU @ 3.60GHz and 64 GB of RAM. The OCaml version was 4.14.1. Each case is analyzed under a fixed per-stage timeout, with Eva runs capped at 15 minutes.
Cost and runtime. Cost is not a primary evaluation target, but Evident's LLM use is modest. With GPT-5.5, harness generation consumes about 27K uncached input tokens, 42K cached input tokens, and 11K output tokens per case attempt, costing about $0.50. This is well below the agentic Codex baseline in our runs ($1.20/case). BugLens uses a similar number of tokens to Evident; its cost is thus roughly equal under GPT-5.5, and lower in its original o3-mini configuration due to cheaper model pricing. Runtime is dominated by Frama-C/Eva: quick analysis times out at 300s and full analysis at 900s. Serial execution averages about 10 minutes per case, but cases are independent and parallelize naturally across workers.
6.2. RQ1: Effectiveness
| Src. | Warn. | Cor. | Acc. | FP | TP | FN | Unres. |
|---|---|---|---|---|---|---|---|
| SUTURE | 126 | 94 | 0.75 | 25 | 39 | 0 | 7 |
| CodeQL-OOB | 74 | 57 | 0.77 | 3 | 1 | 0 | 14 |
| Total | 200 | 151 | 0.76 | 28 | 40 | 0 | 21 |
Evident reaches a verdict on 179 of 200 warnings. Across all warnings, 151 results are correct (true positives or true negatives), for an accuracy of 75.5%. The 28 incorrect verdicts are false positives: non-bug warnings that Evident conservatively leaves undischarged. Evident produces no false negatives in the main experiment.
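The reported accuracies can be recomputed from the table counts. In the sketch below, `accuracy` is a hypothetical helper, and the true-negative counts are derived as Cor. minus TP because the table does not list them directly.

```python
def accuracy(tp, tn, fp, fn, unresolved):
    """Correct verdicts over all warnings; unresolved cases stay in the denominator."""
    total = tp + tn + fp + fn + unresolved
    return (tp + tn) / total


# Counts from the RQ1 table; TN = Cor. - TP.
suture = accuracy(tp=39, tn=94 - 39, fp=25, fn=0, unresolved=7)
codeql = accuracy(tp=1, tn=57 - 1, fp=3, fn=0, unresolved=14)
overall = accuracy(tp=40, tn=151 - 40, fp=28, fn=0, unresolved=21)
print(f"{suture:.3f} {codeql:.3f} {overall:.3f}")
```

The overall value is 151/200, which the table rounds to 0.76 and the text reports as 75.5%.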
| Residual reason | Count |
|---|---|
| Unresolved cases | |
| Unsupported analysis pattern | 9 |
| Missing framework or API model | 4 |
| Harness validation failure | 7 |
| Timeout on analysis | 1 |
| Incorrect analyses | |
| Missing non-local semantic relation | 16 |
| Missing framework object or state model | 12 |
| Total | 49 |
Residual cases. Table 4 breaks down the 21 unresolved cases and 28 false positives. Unresolved cases expose support boundaries in the current implementation. They arise from unsupported C or analysis patterns, missing models for kernel framework objects or helper APIs, validation failures after the repair budget is exhausted, or final-analysis timeout. Missing models occur when the harness cannot reconstruct required kernel state, such as data structures in the kernel whose effects are needed to reach or analyze the reported sink.
The 28 false positives are residual proof failures rather than missed bugs. In 16 cases, the backend lacks non-local relations needed to prove the warning infeasible, such as relations between a pointer and its length field, a payload cursor and its grammar, an enum index and table metadata, or a driver-specific numeric value and its protocol bound. In the remaining 12 cases, the harness or environment model leaves framework objects, global state, helper effects, or initialization-dependent fields too abstract. Eva therefore observes behaviors that are feasible in the abstract state but infeasible under the intended kernel or driver execution.
Overall, the residual cases show the empirical error profile in this dataset: all incorrect verdicts are false positives, and no false negatives are observed. For no-bug decisions, this means that Evident's failures are conservative in this experiment: unsupported or imprecise cases remain alarmed or unresolved rather than being discharged. The remaining errors point to missing kernel-framework models and missing non-local semantic relations, rather than to LLM failures.
6.3. RQ2: Model Sensitivity
| LLM | TN (no-bug) | FP (no-bug) | TP (real-bug) | FN (real-bug) | Unres. (all) | Unres. (bug) | Acc. |
|---|---|---|---|---|---|---|---|
| GPT-5.5 | 55 | 25 | 39 | 0 | 7 | 0 | 0.75 |
| Claude Sonnet 4.6 | 47 | 24 | 37 | 0 | 18 | 2 | 0.67 |
| GPT-5.4 mini | 40 | 18 | 27 | 0 | 41 | 12 | 0.53 |
| GLM-5.1 | 50 | 18 | 33 | 0 | 25 | 6 | 0.66 |
Table 5 evaluates Evident with LLMs of varying capability on the same 126-warning subset. It shows that model choice affects both correctness and coverage. GPT-5.5 obtains the highest accuracy, 0.75, and leaves the fewest cases unresolved. Lower-capability models produce fewer correct verdicts and leave more cases unresolved, including some real bugs. Across all models, however, these failures do not become incorrect no-bug results: every configuration has zero false negatives on this dataset.
The performance of GLM-5.1 (37), a leading open model, is also notable: despite using only 40B active parameters, it remains close to the closed models in this comparison. It achieves 0.66 accuracy and produces no false negatives, although it leaves more cases unresolved than GPT-5.5. This suggests that Evident is not tied to a single frontier proprietary model. A capable open model can construct enough accepted harnesses to recover many backend-supported verdicts.
The unresolved cases from weaker models mostly fail before final bug classification in two ways. Some fail during harness construction. In our implementation, the LLM first gathers code facts and then emits a single C harness; weaker models sometimes discover late that required definitions or object relationships are still missing, continue requesting lookup tools, and exhaust the attempt budget before producing a usable harness. Other cases produce a harness but fail admission validation. These failures come from invented local models, ill-typed object layouts, or incomplete reconstruction of the environment state needed to reach the reported bug location. Evident classifies both kinds of failures as unresolved rather than no-bug results.
These results highlight a robustness property of Evident's decision principle. A no-bug result is reported only when the harness is accepted by validation and the backend shows the reported error state unreachable. As a result, lower-capability models mainly reduce coverage by producing more failed or unresolved harnesses. They do not relax the backend-decision rule used to discharge warnings.
6.4. RQ3: Comparison with Model-Decided Triage
| Approach | TP | TN | FP | FN | Unres. | Acc. |
|---|---|---|---|---|---|---|
| GPT-5.5 + BugLens | 25 | 58 | 29 | 14 | 0 | 0.66 |
| o3-mini + BugLens | 22 | 64 | 23 | 17 | 0 | 0.68 |
| Codex (GPT-5.5 xhigh) | 21 | 55 | 32 | 18 | 0 | 0.60 |
| Evident | 39 | 55 | 25 | 0 | 7 | 0.75 |
Table 6 compares Evident with direct LLM-based filtering on the same 126-warning subset. For GPT-5.5 and o3-mini, we use the BugLens scaffold rather than a bare prompt (Li et al., 2025a). These numbers are not meant to reproduce the BugLens paper, because our task uses a different ground truth: we evaluate bug reachability, not security impact. A reachable signed overflow, out-of-bounds access, or invalid pointer dereference counts as a bug even if an LLM judges it unlikely to be exploitable. To align the comparison with our task, we remove BugLens's security-impact assessor and compare only the warning classification under our reachability-based ground truth.
The results show that LLM-only filtering can miss real bugs even with strong models and structured scaffolds. GPT-5.5 with the BugLens scaffold and the Codex agent both identify the motivating narrowing bug from Section 2, but still produce 14 and 18 FNs respectively; o3-mini produces 17 FNs. Many missed bugs follow a similar pattern: the model accepts a plausible framework or driver check as sufficient sanitization, while the actual code makes the check conditional, version-dependent, incomplete, or state-dependent. Prior LLM-filtering experiments also report sanitizer-overtrust failures (Li et al., 2025a).
The comparison highlights the consequence of where the final verdict is made. The model-decided baselines produce 14–18 false negatives on this subset: real bugs are sometimes dismissed as no-bug. We do not attribute these failures to a single cause, since they can arise from missing context, incorrect reasoning about guards, or version-specific framework assumptions. The important difference is procedural. In Evident, an LLM-generated rationale is never enough to discharge a warning. A no-bug result requires an accepted harness and a backend result showing the reported error state unreachable. On the same subset, Evident produces zero false negatives; its residual failures are 25 false positives and 7 unresolved cases.
6.5. RQ4: Design Variants
| Variants | Cor. | FP | FN | Unres. | Acc. |
|---|---|---|---|---|---|
| Full Evident | 94 | 25 | 0 | 7 | 0.75 |
| LLM-decided input | 94 | 11 | 7 | 14 | 0.75 |
| Abstract values w/o validation | 97 | 18 | 6 | 5 | 0.77 |
| Rule-guided harnesses | 29 | 17 | 0 | 80 | 0.23 |
| Best model-decided triage | 86 | 23 | 17 | 0 | 0.68 |
Table 7 evaluates design variants on the 126-warning subset. The variants test three parts of Evident's division of responsibility: how the harness is constructed, how warning-dependent values are represented, and whether generated harnesses are validated before backend analysis.
Rule-guided harnesses. This variant replaces Evident's LLM-guided context construction with a more rigid setup. The analysis provides the entry pattern and wiring rules, and the LLM only emits harness code consistent with them. Warning-dependent state is not introduced through Evident's abstract_value(x) interface, but is instead delegated to the backend's global abstraction at analysis entry. The variant is conservative but incomplete: it introduces no false negatives, but leaves 80 of 126 cases unresolved and finds only 18 of the 39 real bugs. Many failures arise before a useful backend result is obtained, because the rule-guided setup does not reconstruct enough kernel- and framework-managed state to reach the relevant callbacks or reported locations. This is the same obstacle that makes direct target analysis difficult in the first place.
LLM-decided input. This variant removes both the type-preserving abstract-value interface and the validation gate. It uses a simpler harness-construction prompt in which the LLM must choose how to initialize warning-dependent inputs, without the abstract_value(x) primitive used by Evident. The generated harness may therefore use concrete values, intervals, or backend-specific abstract expressions chosen by the model. This variant produces 7 false negatives and 14 unresolved cases. The false negatives occur when the generated harness restricts warning-dependent inputs too narrowly, allowing the backend to show absence under a model-chosen input space rather than under the C types of the warning-dependent lvalues.
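A toy interval check illustrates how a narrowed input space yields a misleading absence result. The table size and both ranges are hypothetical values chosen for illustration, not taken from the dataset.

```python
def may_overflow(index_range, table_len):
    """True if some index in the inclusive range falls outside the table."""
    low, high = index_range
    return low < 0 or high >= table_len


TABLE_LEN = 8                  # hypothetical driver table size
model_chosen = (0, 3)          # interval picked by the model
type_range_u8 = (0, 2**8 - 1)  # every value of an 8-bit unsigned lvalue

# Absence holds only inside the model-chosen interval.
print(may_overflow(model_chosen, TABLE_LEN))
# The full C type admits an out-of-bounds index.
print(may_overflow(type_range_u8, TABLE_LEN))
```

Both checks are correct for the input space they are given; the false negative comes from which input space the harness encodes.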
Abstract values without validation. This variant isolates the effect of validation more directly. It keeps the same harness-construction scaffold as Evident, including the type-preserving abstract_value(x) interface, but disables the validation checks before backend analysis. It obtains the highest aggregate accuracy in Table 7, 0.77, but still introduces 6 false negatives. These false negatives arise because an accepted-looking backend result can be produced on a harness that fails to exercise the reported warning. The full system rejects such harnesses before backend analysis; without validation, the backend may show absence only within an inadmissible generated context.
For reference, Table 7 also includes the best model-decided baseline from RQ3. It produces a label for every case, but incurs 17 false negatives. Together, the variants show that aggregate accuracy is not the only relevant measure for warning discharge. The simpler LLM-decided setup can narrow the checked input space; removing validation from the full scaffold can admit harnesses that do not exercise the reported warning; and rule-guided harness construction leaves most cases unresolved. The full configuration gives up a small amount of aggregate accuracy relative to the no-validation variant, but it eliminates the false negatives observed in the design variants while retaining much higher coverage than rule-guided harnessing.
6.6. Case Studies: Where Principled Use Helps
To understand how Evident differs from model-decided triage in practice, we examine four representative cases from our evaluation. These cases are not intended as capability limits for LLM-based agents. A more elaborate agent may avoid some of the same mistakes by retrieving more code, adding preprocessing, using stronger prompts, or majority voting.
The cases instead illustrate a structural difficulty of using the model for bug triage. Some program fragments cannot be judged from the local snippet alone; the answer may turn on a specific fact outside the snippet, such as how an object is constructed or how a function is called. Contemporary agents can typically retrieve any code snippets they want, but they have no mechanism that forces any specific code to be inspected. Furthermore, even if the right context is retrieved, it may still not be what drives the answer. Thus a no-bug decision can look the same whether it is supported by the program or produced by a plausible shortcut.
Evident avoids this difficulty by moving the decisive part of triage out of the model’s internal code reasoning. The final evidence no longer depends on whether an unobservable reasoning step actually happened, whether the right context was used, or whether a plausible explanation merely masked a guess.
Nested types. The motivating example shows that the exact value type is important, but that type may not be visible at the local expression. For example, resolving the type of snd_ctl_elem_value::id::index requires two dependent lookups, first into snd_ctl_elem_value and then into the embedded snd_ctl_elem_id. More generally, a field access through nested structures requires a chain of such lookups, one per nesting level.
In practice, we often observe that models do not strictly follow such lookup chains. They stop after one or two levels and form a conclusion from partial type information. The conclusion is sometimes correct, but it does not reveal whether the model actually resolved the relevant type or relied on a shortcut that happened to work.
Evident avoids making this nested type resolution part of the model's internal reasoning. Warning-dependent values are introduced as abstract values through their actual lvalue types, so the compiler resolves the full definition chain against the analyzed tree. The decisive fact is therefore checked outside the model rather than inferred from a partial inspection of the code.
Code-as-seen is not code-as-run. The code visible to the model is not always the code that executes. In one case, the LLM-only baseline reasoned about a target guarded by a complex compound condition, ~100 lines below the guard in a ~400-line function, apparently requiring nontrivial reasoning about framework versions and hardware support. In the analyzed build, however, one conjunct was compiled to a constant:
[Listing: the guard conjunct as compiled in the analyzed build]
With CONFIG_USE_Q6_32CH_SUPPORT disabled, the branch cannot execute regardless of the remaining conjuncts. The deciding fact is not in the function the model reads, so the baseline treats the dead branch as live code and reaches the wrong conclusion. Although preprocessing would remove this particular mismatch, it only resolves configuration choices expressed at the preprocessor level. It cannot eliminate branches that are dead because of ordinary C logic, such as a helper function's result or driver state, where the deciding fact lies in program semantics rather than in conditional compilation. Evident sidesteps the gap rather than asking the model to bridge it: the harness is compiled against the analyzed build configuration, so the code the backend analyzes is the code that actually executes under the given harness. The deciding fact never passes through the model.
Memory contamination from nearby software versions. Prior work raises the concern that models may have seen target code or patches during training (Ramos et al., 2025; Wu et al., 2024), which gives the models an unfair advantage. We observe the opposite failure mode in model-decided triage: the model remembers plausible facts from a different Linux kernel version. Later Linux kernels added ALSA-core input validation for control API values (Iwai, 2022); in some LLM-only runs on older Android kernels, the model reasoned as if such framework-level sanitization were already present, concluding that user-controlled indices had been canonicalized before reaching the driver callback. We do not claim a causal link to training exposure; the point is that parametric memory can bias the model toward version-mismatched framework behavior, and prompting for source evidence cannot guarantee that remembered facts are suppressed or distinguished from the analyzed version. Evident forces such assumptions to materialize: a remembered sanitizer must appear either as code, which must exist in the analyzed tree to compile and to reach the validation checkpoints, or as a constraint on warning-dependent inputs, which validation rejects outright. A version-mismatched belief therefore cannot remain an implicit assumption in the analysis.
Inherent nondeterminism. Feeding the model the same warning does not always yield the same triage decision. In our inspection, the same code was plausibly analyzed in opposite directions across runs: one response discarded the warning, while another declined to do so after emphasizing a different part of the same context. This instability is known in LLM-assisted triage: BugLens stabilizes judgments by majority voting, while LLift reports that repeated runs can produce different outcomes because of model randomness (Li et al., 2025a, 2024). Evident does not remove this nondeterminism but limits its effect: a wrong or incomplete harness leaves the case unresolved rather than producing an unsound discard, and an accepted no-bug decision must be backed by the full evidence chain. In model-decided triage, by contrast, the same randomness acts directly on the verdict.
7. Discussion & Limitations
Soundness Boundary. Evident does not claim end-to-end soundness for the original kernel driver. Its decision rule is narrower: a warning is discharged only when Frama-C/Eva shows the reported error state unreachable on an accepted harness. The remaining gap is whether that harness is an adequate context for the original warning. Evident reduces this gap with validation checks, but these checks are necessary admission criteria rather than a proof of full behavioral preservation.
The backend result also depends on the kernel environment models used to make the harness analyzable (kernel shims). These models are part of Evident's trusted computing base, not LLM-generated artifacts. In the current prototype, they are manually written and reviewed, but these assumptions appear as explicit C artifacts that can be inspected, tested, refined, or replaced by verified models.
The contribution is therefore not an end-to-end soundness theorem, but a clear trust boundary. Compared with a model-decided no-bug verdict, Evident moves the remaining assumptions into inspectable artifacts: the accepted harness, the validation rules, and the kernel environment models. The evaluation suggests that this boundary eliminates the false negatives observed in our model-decided baselines and design variants while still discharging many false alarms. Strengthening the kernel models and validation checks is future work.
Cost-Aware Deployment. Because the no-bug decision rule is fixed across models, model choice affects only how often Evident obtains an accepted harness. This suggests a practical deployment path: use cheaper or locally hosted models for an initial harness-construction pass, and escalate unresolved or validation-failed cases to stronger models when additional coverage is worth the cost. The GLM-5.1 result supports this direction: a capable open model recovers many backend-supported verdicts, although with lower coverage than GPT-5.5. We leave a systematic study of model cost, latency, and escalation policies to future work.
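One possible shape of such an escalation policy is sketched below. `triage_with_escalation`, the `run_evident` callback, and the stub used in the example are hypothetical; the sketch assumes validation-failed cases surface as unresolved.

```python
def triage_with_escalation(warning, models, run_evident):
    """Try models from cheapest to strongest; stop at the first resolved verdict."""
    verdict = "unresolved"
    for model in models:
        verdict = run_evident(warning, model)
        if verdict != "unresolved":
            return model, verdict
    return None, verdict


def stub_run(warning, model):
    """Stand-in for a full Evident run: only the strong model gets a harness accepted."""
    return "no-bug" if model == "strong" else "unresolved"


print(triage_with_escalation("warning-17", ["cheap", "strong"], stub_run))
```

Escalation changes only which model builds the harness; every verdict still comes from the same backend decision rule.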
Limitations and Future Work. Evident is most useful when a warning occurs in an open program whose relevant execution context is separated from the reported path. This is common in drivers, libraries, and large services analyzed from mid-program entries. The approach might be less compelling for small closed programs, or for code whose natural entry point is already affordable for formal analysis, because there is less missing context for a harness to reconstruct.
Our evaluation is limited to Android kernel drivers, a security-relevant target whose subsystem conventions may be familiar to modern LLMs. Results may not transfer directly to other driver stacks, proprietary frameworks, or less widely represented code bases. In such settings, Evident may construct fewer accepted harnesses, reducing accuracy under our scoring because more cases become unresolved. The goal of the design is therefore not only high accuracy but, more importantly, to make failures surface as unresolved or still-alarmed cases rather than as model-decided discharges. The no-bug decision rule remains fixed across code bases.
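A minimal sketch of how a fixed, model-independent decision rule routes failures to unresolved or still-alarmed outcomes is shown below. The enum names and the `decide` function are hypothetical, and mapping an inconclusive backend run (for example, a timeout) to an unresolved verdict is an assumption of this sketch rather than a documented detail of Evident.

```python
from enum import Enum
from typing import Optional


class Backend(Enum):
    """Possible backend outcomes for an accepted harness (hypothetical names)."""
    UNREACHABLE = "unreachable"    # error state shown unreachable under the harness
    REACHABLE = "reachable"        # backend found a path to the error state
    INCONCLUSIVE = "inconclusive"  # assumed: timeout or unsupported construct


class Verdict(Enum):
    DISCHARGED = "discharged"        # the only no-bug outcome
    STILL_ALARMED = "still-alarmed"  # warning conservatively retained
    UNRESOLVED = "unresolved"        # no decision could be supported


def decide(harness_accepted: bool, backend: Optional[Backend]) -> Verdict:
    """Map validation and backend results to a verdict.

    The function deliberately takes no model-produced opinion as input:
    a discharge requires an accepted harness and an UNREACHABLE backend result.
    """
    if not harness_accepted or backend is None:
        return Verdict.UNRESOLVED
    if backend is Backend.UNREACHABLE:
        return Verdict.DISCHARGED
    if backend is Backend.REACHABLE:
        return Verdict.STILL_ALARMED
    return Verdict.UNRESOLVED
```

Because the rule has no parameter for the model's own judgment, a weaker model or an unfamiliar code base can only shift cases from discharged to unresolved; it cannot produce a discharge that the backend did not support.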
Finally, validation remains incomplete: a generated harness may pass the current admission checks while omitting behavior needed for the original warning. Future work includes stronger checks on harness behavior, richer or verified kernel environment models, and evaluation on more diverse code bases.
8. Related Work
Reliability of LLM reasoning for security. Recent studies show that LLMs remain unreliable as direct vulnerability detectors: models can misclassify buggy and patched code, produce setting-dependent or non-deterministic results, and give incorrect rationales for security judgments (Steenhoek et al., 2025; Ullah et al., 2024; Khare et al., 2024; Chen et al., 2025; Fang et al., 2024). Evident addresses this unreliability by separating usefulness from authority: the model helps recover warning-relevant context and construct an analysis artifact, but a no-bug result requires backend analysis of an accepted harness rather than a model-produced security judgment.
LLM-based warning triage and false-positive filtering. Recent work has shown that LLMs can be effective at static-warning triage and false-positive filtering (Wen et al., 2024; Chen et al., 2024a; Li et al., 2025c, a, 2024). These systems improve over bare warning reports by supplying the model with code context, slices, paths, or structured reasoning scaffolds, and they report strong empirical results on warning classification. Similar ideas have also appeared in industrial SAST workflows, where LLMs classify findings as likely true or false positives (Maring and Qian, 2025; Du et al., 2026).
Evident follows the same insight that LLMs are useful for recovering warning-relevant context, but uses that context differently. Prior systems use the recovered context to support a model-produced triage verdict. Evident instead turns the recovered context into a warning-specific analysis harness; a no-bug decision is made only after backend analysis of an accepted harness.
Grounding LLMs with program analysis. Closest in spirit are systems that pair LLMs with analysis-backed checks. LLM4PFA (Du et al., 2025) uses path-constraint solving to test source-to-sink feasibility, and concurrent work uses LLM-synthesized harnesses with symbolic execution to find vulnerabilities confirmed by concrete replay (Shafiuzzaman et al., 2026). Other systems decompose code reasoning into analysis-backed subproblems: LLMDFA uses LLM-guided data-flow summaries with solver-backed path validation, while RepoAudit validates repository-level bug reports using data-flow facts and path-condition checks (Wang et al., 2024; Guo et al., 2025).
These systems show the value of using program analysis to check or constrain LLM-assisted reasoning. Evident applies this idea to warning discharge, where the LLM-generated context must itself be checked before backend results can support a no-bug decision.
LLM-based fuzz-driver generation. LLMs have been used to synthesize fuzzing harnesses and drivers for dynamic bug finding (Google, 2024; Lyu et al., 2024; Liu et al., 2025; Xu et al., 2025; Li et al., 2025b; Jeong et al., 2023). Evident also asks an LLM to generate a harness, but for a different purpose: the harness is a warning-specific artifact for backend analysis, not a driver for finding new crashes. Its validation is therefore tied to the reported warning rather than to compilation, coverage, or fuzzing yield.
9. Conclusion
Program analysis exists to analyze programs. This is almost tautological: judgments about program behavior should come from analysis, not from guesses about the code. Yet the current enthusiasm for LLMs makes this boundary easy to blur. This paper asks where LLMs should sit in bug analysis so that the tautology still holds. Our answer is to use the LLM to construct an analyzable representation of program context that would otherwise be difficult to expose to the backend. Evident demonstrates that this is not merely a principle: in real Android Linux kernel drivers, LLMs can carry much of the context-construction burden while program analysis remains responsible for deciding program behavior.
References
- [1] (2021) asoc: Add check to handle negative value passed for num_app_cfg_type. Note: https://git.codelinaro.org/clo/la/platform/vendor/opensource/audio-kernel/-/commit/33393c5a7eb2f752cd0fcfc12a209f81f57d688d Vendor patch Cited by: §2.
- K. Backhouse (2018) Stack buffer overflow in Qualcomm MSM 4.4 - Finding bugs with CodeQL. (en). External Links: Link Cited by: §5, §6.1.
- C. Cadar, D. Dunbar, and D. R. Engler (2008) KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In 8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008, December 8-10, 2008, San Diego, California, USA, Proceedings, R. Draves and R. van Renesse (Eds.), pp. 209–224. External Links: Link Cited by: §2.2.
- J. Chen, H. Xiang, Z. Zhao, L. Li, Y. Zhang, B. Ding, Q. Li, and S. Xiong (2024a) Utilizing precise and complete code context to guide llm in automatic false positive mitigation. arXiv preprint arXiv:2411.03079. Cited by: §8.
- J. Chen, Z. Pan, X. Hu, Z. Li, G. Li, and X. Xia (2025) Reasoning runtime behavior of a program with llm: how far are we?. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering, Cited by: §8.
- X. Chen, Z. Li, J. Zhang, and A. Burtsev (2024b) Veld: Verified Linux Drivers. In Proceedings of the 2nd Workshop on Kernel Isolation, Safety and Verification, Austin TX USA, pp. 23–30 (en). External Links: ISBN 979-8-4007-1301-9, Link, Document Cited by: §1.
- V. Chipounov, V. Kuznetsov, and G. Candea (2011) S2E: a platform for in-vivo multi-path analysis of software systems. ACM SIGPLAN Notices 46 (3), pp. 265–278. External Links: ISSN 0362-1340, Link, Document Cited by: §2.2.
- [8] clangd: A language server for C/C++. Note: https://clangd.llvm.org/ Accessed: 2026 Cited by: §4.2.
- P. Cuoq, F. Kirchner, N. Kosmatov, V. Prevosto, J. Signoles, and B. Yakobowski (2012) Frama-c: a software analysis perspective. In International conference on software engineering and formal methods, pp. 233–247. Cited by: §2.2.
- X. Du, J. Feng, Y. Zou, W. Xu, J. Ma, W. Zhang, S. Liu, X. Peng, and Y. Lou (2026) Reducing false positives in static bug detection with llms: an empirical study in industry. In Proceedings of the 48th IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP 2026. Cited by: §1, §8.
- X. Du, K. Yu, C. Wang, Y. Zou, W. Deng, Z. Ou, X. Peng, L. Zhang, and Y. Lou (2025) Minimizing false positives in static bug detection via llm-enhanced path feasibility analysis. arXiv preprint arXiv:2506.10322. Cited by: §8.
- C. Fang, N. Miao, S. Srivastav, J. Liu, N. Nazari, and H. Homayoun (2024) Large Language Models for Code Analysis: Do LLMs Really Do Their Job?. In 33rd USENIX Security Symposium (USENIX Security 24), Cited by: §8.
- Google (2024) OSS-Fuzz-Gen: LLM powered fuzzing via OSS-Fuzz. Note: https://github.com/google/oss-fuzz-gen Accessed: 2026-06-09 Cited by: §8.
- J. Guo, C. Wang, X. Xu, Z. Su, and X. Zhang (2025) RepoAudit: an autonomous llm-agent for repository-level code auditing. CoRR abs/2501.18160. External Links: Link, Document, 2501.18160 Cited by: §8.
- T. Iwai (2022) ALSA: control: Add input validation. Note: https://github.com/torvalds/linux/commit/f5e829f92a494a0b66d309497bab4e9d10d4ce3e Linux kernel commit, committed June 9, 2022 Cited by: §6.6.
- B. Jeong, J. Jang, H. Yi, J. Moon, J. Kim, I. Jeon, T. Kim, W. Shim, and Y. H. Hwang (2023) Utopia: automatic generation of fuzz driver using unit tests. In 2023 IEEE Symposium on Security and Privacy (SP), pp. 2676–2692. Cited by: §8.
- A. Khare, S. Dutta, Z. Li, A. Solko-Breslin, R. Alur, and M. Naik (2024) Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities. arXiv. Note: arXiv:2311.16169 [cs] External Links: Link, Document Cited by: §8.
- [18] Language Server Protocol. Note: https://microsoft.github.io/language-server-protocol/ Accessed: 2026 Cited by: §4.2.
- J. Lawall, K. Nishimura, and J. Lozi (2025) Should We Balance? Towards Formal Verification of the Linux Kernel Scheduler. In Static Analysis, R. Giacobazzi and A. Gorla (Eds.), Cham, pp. 194–215. External Links: ISBN 978-3-031-74776-2 Cited by: §1.
- A. Lekssays, H. Mouhcine, and K. Tran (2025) LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models. pp. 489–507. Cited by: §1.
- H. Li, Y. Hao, Y. Zhai, and Z. Qian (2024) Enhancing static analysis for practical bug detection: an llm-integrated approach. Proceedings of the ACM on Programming Languages (PACMPL) 8 (OOPSLA1). External Links: Document Cited by: §1, §1, §6.6, §8.
- H. Li, H. Zhang, K. Pei, and Z. Qian (2025a) Towards more accurate static analysis for taint-style bug detection in linux kernel. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering (ASE), External Links: Document Cited by: §1, §1, §1, §6.1, §6.4, §6.4, §6.6, §8.
- Y. Li, W. Yang, Y. Wang, J. Gao, S. Wang, Y. Xue, and L. Zhang (2025b) Scheduzz: constraint-based fuzz driver generation with dual scheduling. arXiv preprint arXiv:2507.18289. Cited by: §8.
- Z. Li, S. Dutta, and M. Naik (2025c) IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities. In The Thirteenth International Conference on Learning Representations (ICLR 2025), External Links: Link Cited by: §8.
- Y. Liu, J. Deng, X. Jia, Y. Wang, M. Wang, L. Huang, T. Wei, and P. Su (2025) PromeFuzz: a knowledge-driven approach to fuzzing harness generation with large language models. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pp. 1559–1573. Cited by: §8.
- Y. Lyu, Y. Xie, P. Chen, and H. Chen (2024) Prompt fuzzing for fuzz driver generation. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 3793–3807. Cited by: §8.
- C. Maring and K. Qian (2025) Using LLMs to filter out false positives from static code analysis. Note: Accessed: 2026-06-09 External Links: Link Cited by: §8.
- D. Ramos, C. Mamede, K. Jain, P. Canelas, C. Gamboa, and C. Le Goues (2025) Are Large Language Models Memorizing Bug Benchmarks?. In 2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), pp. 1–8. External Links: Link, Document Cited by: §6.6.
- D. A. Ramos and D. R. Engler (2015) Under-constrained symbolic execution: correctness checking for real code. In 24th USENIX Security Symposium, USENIX Security 15, Washington, D.C., USA, August 12-14, 2015, J. Jung and T. Holz (Eds.), pp. 49–64. External Links: Link Cited by: §2.2.
- M. Shafiuzzaman, A. Desai, W. Guo, and T. Bultan (2026) Guiding symbolic execution with static analysis and llms for vulnerability discovery. arXiv preprint arXiv:2604.06506. Cited by: §8.
- B. Steenhoek, M. M. Rahman, M. K. Roy, M. S. Alam, H. Tong, S. Das, E. T. Barr, and W. Le (2025) To Err is Machine: Vulnerability Detection Challenges LLM Reasoning. arXiv. Note: arXiv:2403.17218 [cs] External Links: Link, Document Cited by: §8.
- S. Ullah, M. Han, S. Pujar, H. Pearce, A. Coskun, and G. Stringhini (2024) LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks. In 2024 IEEE Symposium on Security and Privacy (SP), pp. 862–880. External Links: Document Cited by: §8.
- C. Wang, W. Zhang, Z. Su, X. Xu, X. Xie, and X. Zhang (2024) LLMDFA: analyzing dataflow in code with large language models. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang (Eds.), External Links: Link Cited by: §8.
- C. Wen, Y. Cai, B. Zhang, J. Su, Z. Xu, D. Liu, S. Qin, Z. Ming, and T. Cong (2024) Automatically Inspecting Thousands of Static Bug Warnings with Large Language Model: How Far Are We?. ACM Trans. Knowl. Discov. Data 18 (7), pp. 168:1–168:34. External Links: ISSN 1556-4681, Link, Document Cited by: §1, §8.
- Y. Wu, Z. Li, J. M. Zhang, and Y. Liu (2024) ConDefects: A Complementary Dataset to Address the Data Leakage Concern for LLM-Based Fault Localization and Program Repair. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, FSE 2024, New York, NY, USA, pp. 642–646. External Links: ISBN 979-8-4007-0658-5, Link, Document Cited by: §6.6.
- H. Xu, W. Ma, T. Zhou, Y. Zhao, K. Chen, Q. Hu, Y. Liu, and H. Wang (2025) Ckgfuzzer: llm-based fuzz driver generation enhanced by code knowledge graph. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), pp. 243–254. Cited by: §8.
- [37] (2026-06) Zai-org/GLM-5. Z.ai. Note: original-date: 2026-02-09T08:17:02Z External Links: Link Cited by: §6.3.
- H. Zhang, W. Chen, Y. Hao, G. Li, Y. Zhai, X. Zou, and Z. Qian (2021) Statically discovering high-order taint style vulnerabilities in OS kernels. In CCS '21: 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, Republic of Korea, November 15 - 19, 2021, Y. Kim, J. Kim, G. Vigna, and E. Shi (Eds.), pp. 811–824. External Links: Link, Document Cited by: §5, §6.1.
