Back to Explainers
AI SecurityExplainerApril 29, 2026

LLM-Assisted Malware Reverse Engineering: What Works and Where the Risks Are

Quick Answer

LLM-assisted malware reverse engineering uses a language model as a copilot inside Ghidra or IDA to translate decompiled code into function names, summaries, and step-by-step explanations. It speeds up triage by making the call graph readable in minutes instead of hours. The model proposes; the analyst verifies. It does not unpack binaries, it does not replace analyst judgment, and it introduces three risks: hallucinated annotations, data exfiltration through hosted inference, and adversarial content embedded in the sample itself.

LLM-Assisted Malware Reverse Engineering: What Works and Where the Risks Are

LLM-assisted malware reverse engineering puts a fine-tuned code model inside the analyst's decompiler — Ghidra, IDA, Binary Ninja — and uses it to propose function names, summaries, and step-by-step explanations as the analyst works. Done well, it turns a three-hour triage into a fifteen-minute one. Done carelessly, it puts confident wrong labels into the call graph and leaks samples to third parties. This page is the mental model and the workflow controls for teams considering it.

What is LLM-assisted malware reverse engineering?

It is the practice of using a language model as a copilot inside an existing reverse-engineering tool to translate decompiled code into human-readable annotations. A keyboard shortcut on an unknown function returns three things: a descriptive name, a short summary, and an optional step-by-step explanation. The plugin writes those back into the program database, so the analyst's call graph and cross-reference views become navigable.

The right framing is annotation, not analysis. The model proposes labels; the analyst accepts, edits, or rejects them. Anything the model "concludes" is a hypothesis until checked against the binary, the imports, the strings, and — where available — a dynamic trace. The model does not replace the analyst; it does the high-frequency translation work so analyst time goes to judgment calls.

How does it work?

A representative workflow, drawn from Project Lupine's SecTor 2023 talk, runs four loops:

  1. Triage loop (in-IDE, per function). The analyst opens an unknown function. A plugin shortcut sends the decompiled body — plus light context like cross-references, imports, and key strings — to an inference service. The model returns a proposed name and summary; the plugin renames the function and writes the comment. The analyst reviews and accepts, edits, or reverts.
  2. Dataset loop (offline, per corpus). A pipeline ingests samples, runs static identification, pattern matching, and capability detection (for example with capa), selects candidate functions, decompiles them, and pairs them with labels — synthetic for bootstrapping, real for coverage of anti-analysis tricks. Each row is (decompiled_function, name, summary, steps).
  3. Training loop (offline, per model version). A code-foundation model — Code Llama 34B in Lupine's case — is fine-tuned on that corpus using parameter-efficient methods such as QLoRA. The three task formats (name, summarize, explain) are kept separate so the model learns each as a distinct conditional, and evaluation is per task.
  4. Feedback loop (in-IDE, optional). Analyst corrections to model-proposed names and summaries can be queued back into training data, so the model gradually learns the team's conventions and the malware families they actually see.

Three properties of this shape determine what works and what does not. The model reads decompiler output, not the binary, so packers and virtualization obfuscators defeat it for the same reason they defeat the human. Context is a budget — full function body first, then xrefs, then key strings beats spending tokens on file headers. And model size matters more than prompt cleverness for some tasks: naming is doable on smaller models, but summaries and step-by-step explanations needed 34B-class models in Lupine's experience to be consistently usable.

Why does it matter?

Analyst time is the bottleneck of a threat-research pipeline. A team triaging a fresh sample can burn most of an analyst's day on functions that turn out to be uninteresting. If an LLM produces reliable enough first-pass names and summaries that the call graph becomes readable quickly, the team covers more samples, hunts more aggressively, and ships detections faster.

The risks split into three buckets, and each one is the cost of putting a probabilistic model in the analyst's loop:

  • Hallucinated annotations. The model returns a confident wrong answer. An analyst rushing through triage trusts a name like decrypt_c2_payload when the function does something else, and the mistake propagates into the report and the detection. Naming hallucinations are the highest-frequency failure and the easiest to miss, because the next analyst sees the rename, not the decompiled code.
  • Data exfiltration through the inference path. Decompiled code is data. A hosted endpoint sees every function body, string, and structural detail you send it, including proprietary code, customer-identifying strings, or hints about an active investigation. Local inference removes the network hop; remote inference needs the same data-flow review as any third-party service handling sensitive samples.
  • Adversarial influence on the model. Malware is hostile content. A sample can include strings, identifiers, or comments designed to mislead a model that summarizes it — claiming a legitimate purpose, or attempting to inject instructions through decompiled text. Treat model inputs as attacker-controlled and verify outputs against behavior, not the model's claim about behavior.

How do you defend against it?

The defenses are workflow controls. The goal is to keep the speed-up while keeping the analyst — not the model — accountable for the conclusion.

  1. Treat model output as a hypothesis, not a finding. Every proposed name and summary is reviewed against the decompiled code, the import table, the strings, and any dynamic trace. Adopt a visible UI convention — comment prefix, color, badge — that marks an annotation as "model-proposed, unreviewed" until an analyst signs off. Cost: review discipline and plugin UX. Does not cover: analysts who skip review under deadline pressure.
  2. Run inference locally for sensitive samples. A local model removes the third-party data flow entirely. For investigations involving customer data, embargoed disclosures, or active incidents, local-only should be the default and remote should be a deliberate exception. Cost: GPU hardware and model-ops work. Does not cover: the rest of the data flow (logs, dataset uploads, telemetry).
  3. Apply outbound network controls to remote inference. When a hosted endpoint is in scope, route it through a controlled egress: log every request and response, rate-limit, and maintain an approved-endpoint list. This makes "what did we send out about this sample?" a forensically answerable question. Cost: network and logging plumbing. Does not cover: provider-side handling after the data leaves.
  4. Keep model inputs separate from analyst instructions. The system prompt and the analyst's intent are operator content. The decompiled function, the strings, and the imports are sample-derived and potentially adversarial. Put sample-derived content inside an explicit "untrusted" envelope in the prompt template so the model has a reason to discount instructions inside it. Cost: template hygiene. Does not cover: a model that ignores the envelope.
  5. Verify outputs against ground truth before they leave the IDE. API-resolution checks (does this function actually call WriteProcessMemory?), xref consistency, and dynamic traces catch the highest-impact hallucinations cheaply. Bake the cheap checks into the plugin so they fire automatically before the analyst sees the suggestion. Cost: plugin engineering. Does not cover: hallucinations that survive the cheap checks.
  6. Track per-task evaluation, not a single score. Naming, summarization, and step-by-step explanation are separate tasks with separate failure modes. Evaluate them separately on a held-out set of real-world samples, and use that set as a regression test before promoting a new model version. Cost: eval harness and curated holdout. Does not cover: drift on samples unlike anything in the holdout.
  7. Govern the feedback loop. Analyst corrections are training data. If they feed a community model, they are now shared training data and may include proprietary names, internal conventions, or hints about active investigations. Decide what is submittable, scrub identifiers before submission, and keep an audit log. Cost: governance work. Does not cover: analysts who route around the policy.

The source paper uses Ghidra, but the workflow shape — and these defenses — are tool-agnostic.

Related concepts and tools

FAQ

Does LLM-assisted reverse engineering work on packed or obfuscated binaries?

Not directly. The model reads decompiled code, so anything that defeats the decompiler also defeats the model. Packers, virtualization obfuscators, and aggressive control-flow flattening all sit upstream of the LLM. The workflow is: unpack first, normalize the decompilation, then annotate.

What model size do you actually need for usable annotations?

In Project Lupine's experience, 34B-parameter code models were the floor for consistently usable summaries and step-by-step explanations. Smaller 7B and 13B models can handle naming alone but degrade on multi-step reasoning. Better training data and longer context windows can move that floor down over time, but the gap between tasks is real and worth measuring per task.

Does feeding malware to a hosted LLM leak the sample?

Yes, in the obvious sense. Decompiled code, strings, and structural metadata all reach the provider, and that may include proprietary code or victim-identifying content embedded in the sample. For sensitive samples, prefer local inference. For shared or community endpoints, treat the connection as a third-party data flow with the same review you would apply to any external SaaS.

Derived From

External References

FAQ

Does LLM-assisted reverse engineering work on packed or obfuscated binaries?

Not directly. The model reads decompiled code, so anything that defeats the decompiler also defeats the model. Packers, virtualization obfuscators, and aggressive control-flow flattening all sit upstream of the LLM. The workflow is: unpack first, normalize the decompilation, then annotate.

What model size do you actually need for usable annotations?

In Project Lupine's experience, 34B-parameter code models were the floor for consistently usable summaries and step-by-step explanations. Smaller 7B and 13B models can handle naming alone but degrade on multi-step reasoning. Better training data and longer context windows can move that floor down over time, but the gap between tasks is real and worth measuring per task.

Does feeding malware to a hosted LLM leak the sample?

Yes, in the obvious sense. Decompiled code, strings, and structural metadata all reach the provider, and that may include proprietary code or victim-identifying content embedded in the sample. For sensitive samples, prefer local inference. For shared or community endpoints, treat the connection as a third-party data flow with the same review you would apply to any external SaaS.