Content Discovery with ffuf and feroxbuster: Wordlists, Recursion, Extensions, Backups, and Vhosts
Content discovery is the part of recon where you stop guessing at what an application exposes and start systematically proving it. Crawling and a sitemap only show you the links the developers chose to surface; fuzzing shows you everything else — the forgotten /admin-old/, the config.php.bak left after a hotfix, the staging vhost bound to the same IP, the API version that was never deprecated. On real engagements, the highest-impact findings frequently live behind a URL that was never meant to be reachable, and the only way to find them at scale is brute-force enumeration with a good wordlist and disciplined filtering.
This guide is a practical walkthrough of doing that well with ffuf and feroxbuster, the two tools most pentesters reach for. We will cover wordlist strategy, recursion, extension fuzzing, hunting backup and source files, virtual host discovery, and — most importantly — how to cut through false positives so the results are actually usable. Everything here assumes you have written authorization for the target; brute-forcing paths against a host you do not own is noisy and, in most jurisdictions, illegal.
ffuf vs feroxbuster: When to Reach for Which
Both tools do HTTP fuzzing, but they have different ergonomics. ffuf is a single-shot fuzzer: you give it a wordlist, a URL with a FUZZ keyword, and matching/filtering rules. It is fast, scriptable, and the keyword model means you can fuzz any part of a request — path, parameter, header, or Host. feroxbuster is purpose-built for recursive directory discovery: it crawls and recurses automatically, extracts links from responses, and has sane defaults for depth and filtering, which makes it excellent for a fast first pass.
A common workflow is feroxbuster for broad recursive mapping, then ffuf for surgical work: fuzzing a specific parameter, a Host header, or a tightly scoped extension sweep. Both let you cap the request rate, set the thread count, and send traffic through a proxy, so you can route everything through Burp for inspection.
# ffuf — explicit, keyword-driven
ffuf -u https://target.tld/FUZZ -w wordlist.txt
# feroxbuster — recursive by default
feroxbuster -u https://target.tld -w wordlist.txt
Wordlists: The Single Biggest Lever
Your results are only as good as your wordlist. A 200k-word generic list run against every endpoint wastes time and triggers WAFs; a curated list matched to the target's stack finds the interesting paths fast. The SecLists project is the de facto standard, and the Discovery/Web-Content directory is where you live.
- raft-* lists (raft-medium-directories.txt, raft-medium-files.txt): derived from real-world crawls, excellent general-purpose coverage.
- common.txt: small and fast for a quick triage pass before committing to a long run.
- Technology-specific lists: once you fingerprint the stack (Tomcat, WordPress, Spring Boot, IIS), switch to the matching list. There is no point fuzzing .aspx against a Node app.
- API lists: api/api-endpoints.txt and objects.txt for JSON APIs where directories are meaningless but resource names matter.
Tailor the list to what you observe. If you see X-Powered-By: PHP, prioritize PHP files and add .php to your extension set. The fastest way to assemble a target-shaped list is to mine the application itself — crawl it, extract every path and parameter name from JS bundles, and feed those words back into the fuzzer. The Recon Hub is useful here for pulling endpoints out of JavaScript and certificate transparency data so your wordlist reflects the real app rather than a generic guess.
# Quick triage, then go deep
ffuf -u https://target.tld/FUZZ -w /usr/share/seclists/Discovery/Web-Content/common.txt -c
# Stack-specific deep pass
ffuf -u https://target.tld/FUZZ \
-w /usr/share/seclists/Discovery/Web-Content/raft-medium-directories.txt
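Mining the application for its own vocabulary can be sketched in a few lines of shell. This is a minimal illustration, not a complete extractor: the function name mine_words is hypothetical, it only looks at quoted strings that begin with a slash, and it assumes you have already saved the JS bundles or HTML to local files.

```shell
# Hypothetical helper: pull path segments out of saved JS/HTML files
# and print a sorted, de-duplicated wordlist on stdout.
mine_words() {
    # Match quoted strings that look like absolute paths, e.g. "/api/v2/users"
    grep -ohE "[\"'](/[A-Za-z0-9_./-]+)[\"']" "$@" \
        | tr -d "\"'" \
        | tr '/' '\n' \
        | grep -E '^[A-Za-z0-9_.-]+$' \
        | LC_ALL=C sort -u
}
```

Running mine_words against the saved bundles and redirecting the output to a file gives you a list you can pass to ffuf with -w alongside, or instead of, a generic one.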
Extensions: Fuzzing Files, Not Just Directories
Directories tell you the structure; files are where the loot is. ffuf's -e flag appends extensions to every word, turning a directory list into a file list. Match extensions to the detected technology — fuzzing every extension against every word multiplies your request count and your noise.
# Append likely file extensions to each candidate
ffuf -u https://target.tld/FUZZ \
-w raft-medium-words.txt \
-e .php,.bak,.old,.txt,.zip,.config,.json -c
A subtle but important trick: many candidates are valid both as a directory and as a file. Run one pass with no extension (catches directories) and one with extensions (catches files). For SPAs and APIs, extensions are often irrelevant — there the resource names and HTTP methods matter more than file suffixes.
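To see how quickly extensions multiply the request count, a rough estimate helps before you commit to a long run. The sketch below assumes each word is requested once bare and once per extension; the function names are hypothetical, and the figures (50,000 words, 100 requests per second) are illustrative rather than measured.

```shell
# Requests sent when every word is tried bare and once per extension.
estimate_requests() {
    words=$1        # lines in the wordlist
    extensions=$2   # entries passed to -e
    echo $(( words * (1 + extensions) ))
}

# Rough wall-clock seconds at a fixed request rate (requests per second).
estimate_seconds() {
    requests=$1
    rate=$2
    echo $(( requests / rate ))
}

# Illustrative numbers only: 50,000 words, 7 extensions, 100 req/s
total=$(estimate_requests 50000 7)
echo "requests: $total, seconds: $(estimate_seconds "$total" 100)"
```

With those inputs the run comes to 400,000 requests, a little over an hour at that rate, which is the argument for trimming the extension set to what the stack actually uses.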
Recursion: Going Deeper Without Drowning
When a fuzzer finds /admin/, the interesting content is usually inside it. Recursion automatically queues discovered directories for further fuzzing. feroxbuster does this by default; ffuf needs -recursion with a depth cap.
# ffuf recursion, capped at depth 2 to avoid runaway scans
ffuf -u https://target.tld/FUZZ -w raft-medium-directories.txt \
-recursion -recursion-depth 2 -c
# feroxbuster with explicit depth and link extraction
feroxbuster -u https://target.tld -w raft-medium-directories.txt \
--depth 3 --extract-links
Recursion is powerful and dangerous. Uncapped, it can fan out into thousands of requests against deeply nested apps or loop on routes that recurse infinitely. Always set a depth limit, and consider --dont-scan in feroxbuster to exclude noisy paths like /static/ or /assets/ that contain hundreds of files but no security-relevant content. feroxbuster's link extraction (--extract-links) parses response bodies for links and queues paths the application actually references, which a wordlist alone would miss.
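The depth cap and the exclusion list are simple rules, and it can help to see them written out. The following is a sketch of the logic only, not how ffuf or feroxbuster implement it; the function name recursion_targets is hypothetical, and it is mainly useful if you drive recursion by hand from saved results.

```shell
# Hypothetical pre-filter: read discovered directory paths on stdin and
# print only those worth recursing into.
#   $1 = maximum depth (number of path segments)
#   $2 = extended regex of path prefixes to skip
recursion_targets() {
    max_depth=$1
    skip_regex=$2
    grep -vE "$skip_regex" | awk -F/ -v max="$max_depth" '
        {
            # Count non-empty segments: "/admin/users/" has depth 2
            depth = 0
            for (i = 1; i <= NF; i++) if ($i != "") depth++
            if (depth <= max) print
        }'
}
```

Fed the paths /admin/, /static/css/, /admin/users/, and /a/b/c/ with a depth of 2 and the skip pattern '^/(static|assets)/', it keeps only /admin/ and /admin/users/.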
Backup and Source-Disclosure Files
One of the most reliable wins in content discovery is finding a backup of a file whose live version executes server-side. If app.php runs as code but app.php.bak is served as plain text, you have just read the source — often including database credentials, secret keys, and internal logic. Editors and deploy scripts leave a predictable trail of these.
- Editor/temp suffixes: ~, .swp, .save, .orig
- Backup suffixes: .bak, .old, .backup, .copy, .1
- Archives: .zip, .tar.gz, .tar, often named after the app or domain (backup.zip, target.tld.zip)
- VCS metadata: .git/HEAD, .git/config, .svn/entries, .DS_Store
- Config/secrets: .env, config.php.bak, web.config.old, settings.py.swp
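The suffix patterns above can be expanded into a concrete candidate list. This is a minimal sketch under stated assumptions: the function name backup_candidates is hypothetical, the input is bare filenames (no directory component), and the suffix set is just the one listed here. One detail worth encoding is that Vim names its swap file with a leading dot, so the leftover for app.php is .app.php.swp rather than app.php.swp.

```shell
# Hypothetical generator: for each known filename on stdin, print the
# backup and editor-leftover variants worth requesting.
backup_candidates() {
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        # Plain suffixes appended to the full filename: app.php.bak, app.php~
        for suffix in '~' .bak .old .orig .save .backup .copy .1; do
            printf '%s%s\n' "$name" "$suffix"
        done
        # Vim swap files are hidden and dot-prefixed: .app.php.swp
        printf '.%s.swp\n' "$name"
    done
}
```

Piping a list of filenames you have already confirmed exist through this function and saving the result gives a single wordlist you can run with a plain FUZZ keyword.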
# Target known filenames with a backup-suffix list using two FUZZ points
ffuf -u https://target.tld/FUZZW1FUZZW2 \
-w filenames.txt:FUZZW1 \
-w backup-extensions.txt:FUZZW2 \
-mc 200 -c
# Hunt exposed VCS files: sweep a file list and keep the .git hits; any hit means .git/HEAD is worth requesting directly
ffuf -u https://target.tld/FUZZ \
-w /usr/share/seclists/Discovery/Web-Content/raft-medium-files.txt \
-mc 200 | grep -i '\.git'
If .git/ is exposed, tools like git-dumper reconstruct the entire repository from the loose objects — full source history, deleted secrets, and all. Treat any 200 on .git/config or .env as a critical finding and verify content before reporting.
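Verifying content matters because a catch-all route will happily return 200 for .git/HEAD with an HTML page in the body. A real HEAD file is a single short line: either a symbolic ref or a commit ID. The check below is a sketch with a hypothetical function name; it accepts a 40-character hex ID, or a 64-character one for SHA-256 repositories.

```shell
# Succeed only if the text on stdin looks like a real .git/HEAD file:
# a symbolic ref ("ref: refs/heads/main") or a detached commit ID.
looks_like_git_head() {
    head -n 1 \
        | grep -qE '^(ref: refs/[^[:space:]]+|[0-9a-f]{40}([0-9a-f]{24})?)[[:space:]]*$'
}
```

A typical use is to pipe the response into it, for example curl -s https://target.tld/.git/HEAD | looks_like_git_head && echo "exposed", and only then move on to dumping the repository.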
Virtual Host (Vhost) Discovery
A single IP commonly serves many sites via the Host header. Path fuzzing only sees the default vhost; vhost fuzzing reveals the others — internal apps, admin panels, and staging environments that are not in public DNS but answer when you ask for them by name. The technique is to fuzz the Host header while keeping the URL fixed on the target IP.
# Fuzz the Host header while the request itself still goes to the known host
ffuf -u https://target.tld/ -H "Host: FUZZ.target.tld" \
-w /usr/share/seclists/Discovery/DNS/subdomains-top1million-20000.txt \
-c
# Filter out the default-vhost response by size (see filtering below)
ffuf -u https://10.10.10.10/ -H "Host: FUZZ.target.tld" \
-w vhost-wordlist.txt -fs 4242
The key challenge is that an unknown vhost usually returns the same default page as everything else, so you filter on response size or word count and only the real vhosts stand out. Distinguish vhost discovery from DNS subdomain brute-forcing: vhost fuzzing asks the server "do you serve this name on this IP?" without DNS ever resolving it, which finds internal-only hosts. For the DNS side — names that actually resolve — build a strong list with the Subdomain Wordlist Generator, then validate live hosts before deep-fuzzing each one.
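Picking the size to filter on is easier if you measure it rather than eyeball it. One approach, sketched here with a hypothetical function name, is to request a handful of names that almost certainly do not exist, record each response size, and take the most frequent value as the default-vhost baseline.

```shell
# Read one response size per line on stdin and print the most frequent one.
# That value is the candidate for ffuf's -fs filter.
baseline_size() {
    sort -n | uniq -c | sort -k1,1nr -k2,2n | awk 'NR == 1 { print $2 }'
}
```

The sizes themselves can come from curl, for example curl -sk -o /dev/null -w '%{size_download}\n' -H "Host: some-random-name.target.tld" https://10.10.10.10/ repeated with a few different random names. If the sizes vary from probe to probe, the default page is dynamic and a word or line filter is likely a better fit than -fs.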
Filtering and Calibration: Killing False Positives
The difference between a useful scan and a wall of garbage is filtering. Many apps return 200 OK for everything (SPAs with a catch-all route) or 404 pages that are actually 200s with a "not found" body. You calibrate by learning what a "miss" looks like, then filtering it out.
- Match codes (-mc): only show specific statuses, e.g. -mc 200,301,302,401,403.
- Filter size (-fs): hide responses of a known boilerplate length.
- Filter words / lines (-fw / -fl): more stable than size when responses contain dynamic timestamps.
- Filter regex (-fr): drop anything whose body matches a known error string.
- Auto-calibration (-ac): ffuf sends random probes first, learns the baseline, and filters automatically; start here.
# Let ffuf auto-calibrate against soft-404s, then keep useful codes
ffuf -u https://target.tld/FUZZ -w raft-medium-directories.txt \
-ac -mc 200,204,301,302,307,401,403,405 -c
# Manual: a 200 soft-404 always returns 1456 bytes — filter it
ffuf -u https://target.tld/FUZZ -w wordlist.txt -fs 1456
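Why word and line counts hold up better than byte size is easy to demonstrate offline. The sketch below uses wc to fingerprint two soft-404 bodies that differ only in a request ID; the function name is hypothetical, and wc's counts will not necessarily equal the numbers ffuf reports, so treat it as an illustration of relative stability rather than a way to compute filter values.

```shell
# Print bytes, words, and lines for a response body read from stdin:
# the same three measures that -fs, -fw, and -fl filter on.
fingerprint() {
    wc | awk '{ print "bytes=" $3, "words=" $2, "lines=" $1 }'
}

printf 'Not found. Request ID: 9\n'     | fingerprint
printf 'Not found. Request ID: 12345\n' | fingerprint
```

The two bodies come out at 25 and 29 bytes but both have 5 words and 1 line, so a size filter tuned to the first would let the second through while a word filter catches both.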
Do not blindly discard 403 and 401 — a forbidden directory confirms the path exists and is worth probing with method changes, header tricks, or path normalization. A 403 on /admin/ is a signpost, not a dead end. For a full command reference and copy-paste recipes, the ffuf cheat sheet keeps the flags handy mid-engagement.
Defenses and Remediation
From the defender's side, content discovery is hard to stop entirely but easy to make far less rewarding:
- Return consistent 404s. Soft-404s that respond 200 with a "not found" body defeat naive scanners but are trivially calibrated around, and they make legitimate monitoring harder. Return a real 404 with a stable, minimal body.
- Never deploy backups or VCS metadata to web roots. Block .git, .svn, .env, ~, .bak, and .swp at the web server or WAF layer. Deploy from artifacts, not by copying a working directory.
- Remove orphaned content. Old admin panels, deprecated API versions, and staging vhosts should be decommissioned, not just unlinked. Unlinked is not unreachable.
- Rate-limit and monitor. A burst of hundreds of 404s from one source is a clear brute-force signature; alert on it and throttle.
- Default-deny vhosts. Configure the web server to reject unknown Host headers with a hard error rather than serving the default site. This keeps unrecognized names away from the default application, but a valid internal name still answers, so internal vhosts need their own access controls.
- Enforce authentication and authorization on every route. Discovery only matters if discovered endpoints are reachable without proper checks. Per-route authz means a found path is still a closed door.
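The 404-burst signature from the monitoring point above can be checked against an existing access log with a short awk script. This is a sketch under assumptions: the function name is hypothetical, the log is in the common combined format (client address in field 1, status in field 9, no spaces in the request path), and it counts over the whole input rather than a time window, which real alerting would need.

```shell
# Print "count ip" for every client with at least $1 responses of status 404
# in a combined-format access log read from stdin, busiest first.
noisy_404_sources() {
    threshold=$1
    awk -v min="$threshold" '
        $9 == 404 { hits[$1]++ }
        END { for (ip in hits) if (hits[ip] >= min) print hits[ip], ip }
    ' | sort -k1,1nr
}
```

Running noisy_404_sources 200 < access.log lists the sources worth throttling or investigating; the threshold of 200 is an arbitrary starting point to tune against normal traffic.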
Content discovery rewards patience and a sharp wordlist far more than raw thread count. Start narrow, calibrate against the target's baseline, follow the recursion only as deep as it stays interesting, and always close the loop by manually verifying the high-value hits — an exposed .git or a readable config.php.bak is only a finding once you have confirmed what it actually leaks. Keep your scans in scope, throttled, and documented, and the forgotten corners of an application will give up their secrets reliably.