The Web Crawler Tool
Go beyond a single result: crawl multiple pages within a domain and extract their text, titles, and URLs — with polite delays and page limits — so an agent can ingest a whole site’s content, on infrastructure you control.
One page rarely holds the whole answer
A single search result is a starting point, but the real content — a documentation site, a knowledge base, a competitor’s product pages — spans many pages. Reading them one by one doesn’t scale, and naive scraping gets you blocked.
Content spans pages
The full picture lives across a whole section of a site.
Manual, page-by-page reading
Visiting each page by hand is slow and incomplete.
Naive scraping gets blocked
Hammering a site without delays is rude and quickly gets you rate-limited.
Messy HTML
Raw pages are full of markup an agent has to wade through.
A site’s content, ingested cleanly
Breadth
Crawl across the site
Many pages, one call.
The tool follows links within a domain and extracts the text, titles, and URLs of multiple pages, so an agent can ingest a whole documentation set or product section rather than a single page.
- Multi-page within a domain
- Text, titles, and URLs
- Per-site page limits
- Up to five starting URLs
Whole sections
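To make the idea concrete, here is a minimal sketch of a same-domain, breadth-first crawl with a page cap, using only the Python standard library. It is an illustration under stated assumptions, not the tool's actual implementation: the names `PageParser`, `crawl`, `fetch`, and `SITE` are hypothetical, and an in-memory dictionary stands in for real HTTP requests so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class PageParser(HTMLParser):
    """Collects the title, visible text, and link targets of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title, self.text, self.links = "", [], []
        self._in_title = False
        self._skip = 0  # depth inside <script>/<style>, whose content is not text

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif not self._skip and data.strip():
            self.text.append(data.strip())


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl that never leaves the host of start_url."""
    host = urlparse(start_url).netloc
    queue, seen, results = deque([start_url]), {start_url}, []
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # hypothetical fetcher: returns HTML or None
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        results.append(
            {"url": url, "title": parser.title, "text": " ".join(parser.text)}
        )
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return results


# Stand-in for real HTTP fetching so the sketch runs offline.
SITE = {
    "https://docs.example.com/": (
        "<title>Home</title><p>Welcome</p>"
        '<a href="/guide">Guide</a>'
        '<a href="https://other.example.org/">Elsewhere</a>'
    ),
    "https://docs.example.com/guide": (
        '<title>Guide</title><p>Steps</p><a href="/">Home</a>'
    ),
}
pages = crawl("https://docs.example.com/", SITE.get, max_pages=10)
```

In this sketch the link to `other.example.org` is discovered but never queued, which is what keeps the crawl inside one domain, and `max_pages` stops the loop once enough pages have been collected.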
Politeness
Crawls responsibly
Delays and limits built in.
Configurable delays between requests and a cap on pages per site mean the crawler behaves politely and stays within bounds — getting the content without getting blocked.
Delays + caps
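One way such pacing could work is sketched below: pause for a random interval between a minimum and a maximum delay before every request after the first, and stop at the page cap. This is an assumption about the general technique rather than a description of the tool's internals; `polite_delay` and `fetch_politely` are hypothetical names, and the `sleep` and `fetch` arguments are injectable so the logic can be exercised without waiting or touching the network.

```python
import random
import time


def polite_delay(min_delay=1.0, max_delay=3.0, rng=random):
    """Pick a random pause, in seconds, between min_delay and max_delay."""
    if min_delay > max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    return rng.uniform(min_delay, max_delay)


def fetch_politely(urls, fetch, max_pages=10, sleep=time.sleep):
    """Fetch at most max_pages URLs, pausing before each request after the first."""
    results = []
    for index, url in enumerate(urls[:max_pages]):
        if index > 0:
            sleep(polite_delay())
        results.append(fetch(url))
    return results
```

Randomizing the pause within a range, rather than sleeping a fixed interval, spreads requests out less mechanically; whether the tool itself randomizes is not stated on this page.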
Governance
On-premise crawling
Under your control.
Crawling runs through a controlled tool with audit logging on infrastructure you operate, so external content ingestion is governed rather than ad-hoc.
Governed, logged
Parameters
The web_crawler tool accepts these inputs when an agent calls it. Required inputs are flagged.
- Optional (default: 10): maximum pages to crawl per website (1–20).
- Optional (default: 1): minimum delay between requests, in seconds (0.5–10).
- Optional (default: 3): maximum delay between requests, in seconds (1–20).
- Optional (default: 5000): maximum text length to extract per page (1000–10000).
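The defaults and ranges above can be captured in a small validation sketch. The parameter names are not shown on this page, so every field name below (`max_pages`, `min_delay`, `max_delay`, `max_text_length`) is a hypothetical placeholder, as are `CrawlOptions` and `clip_text`; only the defaults and bounds come from the list above. The check that the minimum delay does not exceed the maximum is also an assumption, and the text length is assumed to be counted in characters.

```python
from dataclasses import dataclass

# Documented (low, high) bounds; the keys are hypothetical parameter names.
LIMITS = {
    "max_pages": (1, 20),
    "min_delay": (0.5, 10),
    "max_delay": (1, 20),
    "max_text_length": (1000, 10000),
}


@dataclass(frozen=True)
class CrawlOptions:
    """Optional crawl settings with the documented defaults."""

    max_pages: int = 10
    min_delay: float = 1
    max_delay: float = 3
    max_text_length: int = 5000

    def __post_init__(self):
        for name, (low, high) in LIMITS.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}-{high}")
        # Assumption: a minimum delay above the maximum is rejected.
        if self.min_delay > self.max_delay:
            raise ValueError("min_delay must not exceed max_delay")


def clip_text(text, options):
    """Truncate extracted page text to the configured maximum length."""
    return text[: options.max_text_length]
```

For example, `CrawlOptions(max_pages=21)` raises a `ValueError` in this sketch, while `CrawlOptions()` accepts all four defaults.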
Where the web crawler pays back
Documentation ingest
Pull a whole docs site into text for retrieval.
Competitive research
Extract a competitor’s product pages for analysis.
Knowledge building
Feed crawled content into a private RAG index.
Content audits
Gather a site’s pages to review coverage.
Monitoring
Re-crawl pages to track changes over time.
Agent research
Let a research agent ingest a site, not just a page.
Assigned to agents, orchestrated as networks
On VDF AI, an industry’s use cases map to agents, and you assign tools like this one to those agents. Compose multiple agents into a governed, on-premise network.
Questions about the Web Crawler tool
What does the web crawler do?
It crawls multiple pages within a domain and extracts their text, titles, and URLs, with polite delays and a per-site page limit, so an agent can ingest a whole site section rather than a single page.
Does it crawl across domains?
It stays within the same domain for each starting URL and accepts up to five starting URLs, keeping crawls focused and responsible.
How does it avoid getting blocked?
Configurable minimum and maximum delays between requests plus a cap on pages per site keep it polite and within bounds.
Is crawling governed?
Yes. It runs through a controlled tool with audit logging on infrastructure you operate.
How is it used by agents?
Research and knowledge agents use it to ingest sites into a private RAG index, often paired with web search for discovery beforehand and federated vector search for retrieval afterward.
Ingest whole sites, not single pages
See the web crawler feed a research agent’s knowledge base — on infrastructure you control.