AI Security · Tool · May 1, 2026

Project Lupine: Fine-Tuned LLM Annotations for Ghidra

Quick Answer

Project Lupine is a research demonstrator from Jer's SecTor 2023 talk: a fine-tuned Code Llama 34B model, a small inference service, and three Ghidra plugins that write LLM-generated function names, summaries, and step-by-step walkthroughs into the decompiler view. It targets reverse engineers and ML/security researchers evaluating LLM-assisted triage workflows. Treat it as a reference architecture and dataset, not a maintained product.

Repository: Private
License: Unspecified
Language: Python
Status: Maintenance

Project Lupine is the named software deliverable from Reversing Malware with LLMs, Jer's SecTor 2023 talk. It bundles a dataset pipeline, a fine-tuned Code Llama model, a small local inference service, and three Ghidra plugins that put LLM-generated names, summaries, and step-by-step walkthroughs directly into the decompiler view. The artifact exists so the workflow has a stable, citable home distinct from the paper's narrative.

What it does

Lupine is a "copilot for the Ghidra decompiler." When an analyst presses a shortcut on a function, a fine-tuned model proposes a descriptive name, a short summary, and optionally a step-by-step walkthrough, and the plugin writes them back into the program database.
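To make that write-back loop concrete, the sketch below shows the rough shape of such a plugin as a Ghidra Python (Jython) script: decompile the function under the cursor, send the pseudocode to a local inference service, and apply the suggested name and summary. The /suggest path and the JSON fields are illustrative assumptions, not Lupine's documented interface.

```python
# Hypothetical Ghidra (Jython) script sketching the Lupine-style write-back loop.
# The /suggest path and JSON fields are assumptions, not Lupine's actual interface.
# @keybinding ctrl alt l
import json
import urllib2

from ghidra.app.decompiler import DecompInterface
from ghidra.program.model.symbol import SourceType

func = getFunctionContaining(currentAddress)
if func is None:
    print("Place the cursor inside a function first.")
else:
    # Decompile the function under the cursor to C pseudocode.
    decomp = DecompInterface()
    decomp.openProgram(currentProgram)
    result = decomp.decompileFunction(func, 60, monitor)
    code = result.getDecompiledFunction().getC()

    # Ask the local inference service for a name and summary.
    req = urllib2.Request(
        "http://localhost:8000/suggest",                      # assumed endpoint path
        json.dumps({"decompiled_code": code}),
        {"Content-Type": "application/json"},
    )
    suggestion = json.loads(urllib2.urlopen(req, timeout=120).read())

    # Write the suggestions back into the program database.
    func.setName(str(suggestion["function_name"]), SourceType.USER_DEFINED)
    setPlateComment(func.getEntryPoint(), str(suggestion["summary"]))
    print("Renamed to %s" % suggestion["function_name"])
```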

The reference workflow has four moving parts:

  • Dataset pipeline. Combines synthetic binaries (compiled from generated source that exercises specific OS APIs) with real malware processed through pefile/DIE, YARA, capa, and radare2 decompilation. Each row is a (decompiled_code, function_name, summary, steps) tuple. The curated dataset is published as dyngnosis/function_names_v2 on Hugging Face.
  • Fine-tuning recipe. QLoRA-style PEFT against codellama/CodeLlama-34b-hf, with LoRA adapters on the attention projections (r=16, alpha=16, dropout 0.05), 8-bit load, fp16, adamw_torch, lr 3e-4, 10k steps; a configuration sketch follows this list. The full Appendix A configuration lives in the paper.
  • Inference service. api_server.py exposed on localhost:8000 for local-only use. A separate community endpoint exists for shared inference and feedback collection.
  • Three Ghidra plugins. llm.py (CTRL-ALT-L) runs local inference and renames the current function with a summary comment. llm_remote.py (CTRL-ALT-O) sends a sample hash, function offset, and decompiled body to the community endpoint. llm_suggest (CTRL-SHIFT-K) submits analyst-corrected names and summaries back upstream to feed the retraining queue.
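The fine-tuning recipe translates fairly directly into a Hugging Face transformers/peft training script. The sketch below follows the hyperparameters listed above; the prompt layout, batch size, gradient accumulation, and context length are assumptions, and the paper's Appendix A remains the authoritative configuration.

```python
# QLoRA-style fine-tune of Code Llama 34B on the published dataset.
# Prompt layout, batch size, and max_length are assumptions; r/alpha/dropout,
# 8-bit load, fp16, adamw_torch, lr, and step count follow the recipe above.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "codellama/CodeLlama-34b-hf"
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto")
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM"))

def to_text(row):
    # Assumed prompt layout over the dataset's (code, name, summary, steps) fields.
    return {"text": "### Decompiled code:\n%s\n### Name: %s\n### Summary: %s\n### Steps:\n%s"
            % (row["decompiled_code"], row["function_name"], row["summary"], row["steps"])}

ds = load_dataset("dyngnosis/function_names_v2", split="train").map(to_text)
ds = ds.map(lambda r: tokenizer(r["text"], truncation=True, max_length=4096),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lupine-34b-lora", max_steps=10_000, learning_rate=3e-4,
        optim="adamw_torch", fp16=True,
        per_device_train_batch_size=1, gradient_accumulation_steps=8),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```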

The intended use cases are interactive triage, function-naming during static review, and as a reference implementation for researchers building successor RE assistants.

Who it's for

Lupine targets malware analysts and reverse engineers willing to stand up their own inference, and ML/security researchers who want a published dataset and a working LoRA recipe for decompiled-code understanding. Tool authors building competing or successor assistants will find the analyst-in-the-loop feedback design useful as prior art.

It is not a fit for SOC teams expecting a supported product, for analysts who cannot tolerate confidently wrong output, or for anyone who would need to send untrusted samples through a hosted endpoint they do not control. This is a research demonstration, not a product.

How to use it

The components are described at the level of the paper's Appendix A implementation snapshot rather than as a packaged release. To reproduce the workflow, fine-tune Code Llama 34B against the published function_names_v2 dataset using the recipe above, host api_server.py on localhost:8000, and load the three plugin scripts into Ghidra. Treat the paper as the authoritative reference for hyperparameters and prompt format.
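To make the hosting step concrete, the sketch below shows one way a minimal stand-in for api_server.py could look: it serves the fine-tuned adapter on localhost:8000 behind a single JSON endpoint. The /suggest route, field names, prompt layout, and output parsing are assumptions chosen to match the plugin sketch earlier on this page, not the project's documented interface.

```python
# Hypothetical stand-in for api_server.py: load the base model plus LoRA adapter
# and answer the Ghidra plugins on localhost:8000. Endpoint path, request and
# response fields, prompt layout, and generation settings are all assumptions.
import torch
import uvicorn
from fastapi import FastAPI
from peft import PeftModel
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "codellama/CodeLlama-34b-hf"
ADAPTER = "lupine-34b-lora"   # hypothetical path to the LoRA adapter from the recipe above

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

app = FastAPI()

class SuggestRequest(BaseModel):
    decompiled_code: str

@app.post("/suggest")
def suggest(req: SuggestRequest):
    # Assumed prompt layout mirroring the training sketch above.
    prompt = "### Decompiled code:\n%s\n### Name:" % req.decompiled_code
    inputs = tokenizer(prompt, return_tensors="pt",
                       truncation=True, max_length=4096).to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
    # Naive split of the completion; a real service would enforce a stricter format.
    name, _, summary = completion.partition("### Summary:")
    return {"function_name": name.strip(), "summary": summary.strip()}

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)
```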

Status and roadmap

Lupine is research-stage. It was demonstrated at SecTor 2023 and the dataset is publicly released; the plugins and training scripts are documented at the level of the paper's appendix rather than as an actively maintained codebase. Known limitations called out in the paper include confident hallucinations, repetition and stop-condition failures, prompt-injection susceptibility when consuming attacker-controlled decompiled code, truncation poisoning during training if the context budget is not enforced, and a quality drop below 34B parameters. Roadmap items named in the paper include YARA/Sigma/Snort rule generation from extracted behaviors, auto-analyst loops that follow up suggested leads with dynamic analysis, structured Markdown reporting for TI/IR handoff, and smaller quantized variants.
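One of those pitfalls, truncation poisoning, is straightforward to guard against before training: drop rows whose full prompt would overflow the context budget instead of letting the name, summary, or steps targets be silently cut off. A minimal pre-filter sketch, assuming the dataset's published column names and a hypothetical 4096-token budget:

```python
# Minimal guard against truncation poisoning: keep only rows whose code plus
# name/summary/steps targets fit inside the context budget. The 4096-token
# budget and the simple concatenation below are assumptions for illustration.
from datasets import load_dataset
from transformers import AutoTokenizer

MAX_TOKENS = 4096
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-hf")

def fits(row):
    text = "%s\n%s\n%s\n%s" % (row["decompiled_code"], row["function_name"],
                               row["summary"], row["steps"])
    return len(tokenizer(text)["input_ids"]) <= MAX_TOKENS

ds = load_dataset("dyngnosis/function_names_v2", split="train")
ds = ds.filter(fits)
print("kept %d rows within the context budget" % len(ds))
```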

Source and license

The published artifacts are the SecTor 2023 talk recording and the dyngnosis/function_names_v2 dataset on Hugging Face; both are linked from this page. No canonical source repository or SPDX license is asserted here — the dataset carries its own license terms on Hugging Face, and the plugin and training code should be treated as paper-companion reference material until a maintained release exists.

Responsible use. The remote plugins exfiltrate decompiled code by design. Decompiled malware should not be sent to community or third-party endpoints without governance, and model output — names, summaries, and step-by-step walkthroughs — requires analyst verification before any downstream action. Local-only deployment is the safer default for sensitive samples.

FAQ

Is Project Lupine production-ready?

No. Lupine is a research demonstrator presented at SecTor 2023, useful as a reference architecture and as a published dataset. Expect to adapt the plugin scripts, retrain on your own corpus, and bring your own inference governance rather than ship it as-is into a SOC pipeline.

Can I run it without sending samples to a remote server?

Yes. The local plugin (llm.py) talks to api_server.py on localhost:8000, so the entire workflow can stay on a single workstation or air-gapped network. The remote plugins (llm_remote.py and llm_suggest) are optional and only used when joining the shared community endpoint for inference and feedback collection.

Do I need a 34B model to use it?

The paper reports that 34B Code Llama was needed for consistently usable summaries and step-by-step output. Smaller 7B and 13B variants are listed as future work; in the original experiments, summarization quality degraded noticeably below 34B. You can fine-tune smaller bases against the published dataset, but expect to re-evaluate output quality before relying on it.
