Index What Matters. Exclude What Doesn't.

Content Type to Crawl

Not everything on a site or server needs to be in your search index. Keyspider lets you specify exactly which file types and content formats to crawl — web pages, PDFs, Word documents, XML feeds, JSON APIs, and more. Precise control means a cleaner index and more relevant results.


Supported types: HTML, PDF, DOCX, XLSX, XML, JSON, plain text, and more

Include or exclude specific MIME types per data source

Crawl depth control — define how many levels deep to follow links

URL pattern matching to include or exclude specific paths

Authentication support for gated PDFs and documents

Metadata extraction from document properties (author, created date, title)
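
To make crawl depth control concrete, here is a minimal Python sketch of a breadth-first crawl that stops following links past a set depth. It is an illustration only and not Keyspider's implementation: the `LINKS` site map, the `crawl` function, and the `max_depth` parameter are all hypothetical names chosen for this example.

```python
from collections import deque

# Hypothetical in-memory site: each page maps to the links found on it.
LINKS = {
    "/": ["/support", "/blog"],
    "/support": ["/support/install", "/support/faq"],
    "/support/install": ["/support/install/linux"],
    "/support/faq": [],
    "/support/install/linux": [],
    "/blog": ["/"],
}


def crawl(start, max_depth):
    """Breadth-first crawl that does not follow links beyond max_depth."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # visit this page, but do not follow its links
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


print(crawl("/", 1))
print(crawl("/", 2))
```

With a depth of 1, only the start page and the pages it links to directly are visited. Raising the depth to 2 adds the pages one level further down, while `/support/install/linux` stays out until the depth reaches 3.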

How it works

Define which content types to include

In the crawler settings for each source, select which file types to index. For example, enable PDFs but exclude images and scripts to keep the index clean and focused.
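
The kind of rule described above can be sketched in Python as a MIME type filter. This is a hypothetical illustration, not Keyspider's settings format: `INCLUDE_TYPES`, `EXCLUDE_PREFIXES`, and `should_index` are assumed names, and the type is taken from a Content-Type header when one is given, or guessed from the file extension otherwise.

```python
import mimetypes

# Hypothetical per-source rules: index web pages and PDFs, never images.
INCLUDE_TYPES = {"text/html", "application/pdf"}
EXCLUDE_PREFIXES = ("image/",)


def should_index(url, content_type=None):
    """Return True if the resource's MIME type passes the rules above."""
    if content_type:
        # Drop parameters such as "; charset=utf-8".
        mime = content_type.split(";")[0].strip().lower()
    else:
        # No header available: guess from the file extension.
        mime, _ = mimetypes.guess_type(url)
    if mime is None:
        return False  # unknown type: skip rather than guess
    if mime.startswith(EXCLUDE_PREFIXES):
        return False  # exclusions win over inclusions
    return mime in INCLUDE_TYPES


print(should_index("https://example.com/docs/manual.pdf"))
print(should_index("https://example.com/logo.png", "image/png"))
```

Scripts are rejected here simply because their types are not on the include list. The explicit exclude list is kept to show that an exclusion can override a broader inclusion.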

Set URL and path rules

Use include/exclude patterns to crawl only specific sections of a site. Index /support/* but exclude /blog/* if your search is focused on support content.
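
As a rough sketch of how include and exclude patterns like these could be evaluated, the Python below uses shell-style wildcards from the standard library. The exact pattern syntax Keyspider accepts is not specified here, so `INCLUDE`, `EXCLUDE`, and `path_allowed` are assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

# Hypothetical rules matching the example above.
INCLUDE = ["/support/*"]
EXCLUDE = ["/blog/*"]


def path_allowed(url, include=INCLUDE, exclude=EXCLUDE):
    """Return True if the URL path matches an include and no exclude pattern."""
    path = urlparse(url).path or "/"
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False  # exclude rules are checked first
    return any(fnmatchcase(path, pattern) for pattern in include)


print(path_allowed("https://example.com/support/faq"))
print(path_allowed("https://example.com/blog/launch"))
```

In this sketch the `*` wildcard also matches slashes, so `/support/*` covers nested pages such as `/support/install/linux`. It does not match the bare `/support` path, which would need its own pattern.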

Keyspider extracts content from each supported format

Text is extracted from PDFs, DOCX files, and other document formats automatically. No pre-processing or format conversion needed on your side.
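
To show what per-format extraction involves, here is a small Python sketch that dispatches on MIME type and pulls searchable text out of HTML, JSON, and plain text using only the standard library. It is not Keyspider's pipeline: `extract_text` and its helpers are hypothetical, and PDF or DOCX extraction would need a dedicated parser that is not shown.

```python
import json
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def _json_strings(value):
    """Yield every string value found anywhere in a parsed JSON document."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from _json_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from _json_strings(item)


def extract_text(body, mime):
    """Return searchable text for a few formats; raise for anything else."""
    if mime == "text/html":
        parser = _TextOnly()
        parser.feed(body)
        parser.close()
        return " ".join(parser.parts)
    if mime == "application/json":
        return " ".join(_json_strings(json.loads(body)))
    if mime == "text/plain":
        return body.strip()
    raise ValueError(f"no extractor registered for {mime}")


page = "<h1>Install</h1><script>var x = 1;</script><p>Run the <b>setup</b> tool.</p>"
print(extract_text(page, "text/html"))
```

The design point is that each format gets its own extractor behind one function, so the rest of the indexing pipeline only ever sees plain text.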

Use cases

Document management systems

Index only PDFs and Word documents from a SharePoint library — excluding images, videos, and spreadsheets that don't contain searchable text content.

Technical documentation portals

Crawl HTML pages and PDF manuals simultaneously. Users search once and find answers from both web documentation and downloadable technical guides.

Government information portals

Index published policy PDFs, legislation documents, and web pages from the same site. Content type and URL rules keep internal drafts and working files out of the index.


Ready to give your users better answers?

AI Search, AI Assistant, and Workplace Search. Deployed in days, not months. See it live on your own content.


No credit card required · Live in 2 weeks · Cancel anytime
