Content Type to Crawl
Not everything on a site or server needs to be in your search index. Keyspider lets you specify exactly which file types and content formats to crawl — web pages, PDFs, Word documents, XML feeds, JSON APIs, and more. Precise control means a cleaner index and more relevant results.

How it works
Define which content types to include
In the crawler settings for each source, select which file types to index. For example, enable PDFs but exclude images and scripts to keep the index clean and focused.
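To make the idea concrete, here is a minimal sketch of content-type filtering in Python. It is not Keyspider's configuration format or API. The `ALLOWED_TYPES` set and the `should_index` function are hypothetical names used only for illustration, and the sketch assumes the crawler knows either the `Content-Type` header or the file extension.

```python
import mimetypes
from typing import Optional
from urllib.parse import urlparse

# Hypothetical allow-list for illustration; not Keyspider's settings format.
ALLOWED_TYPES = {"text/html", "application/pdf"}


def should_index(url: str, content_type: Optional[str] = None) -> bool:
    """Return True if the resource's media type is on the allow-list."""
    if content_type is None:
        # No header available: guess from the file extension in the URL path.
        content_type, _ = mimetypes.guess_type(urlparse(url).path)
    if content_type is None:
        # Unknown type: skip rather than index something unexpected.
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in ALLOWED_TYPES
```

With this allow-list, a PDF manual and an HTML page pass the check, while an image or script does not.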
Set URL and path rules
Use include/exclude patterns to crawl only specific sections of a site. Index /support/* but exclude /blog/* if your search is focused on support content.
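The include/exclude logic can be sketched in a few lines of Python. This is an illustration under assumptions, not Keyspider's actual rule engine: the `INCLUDE` and `EXCLUDE` lists and the `path_allowed` function are hypothetical, and the sketch assumes shell-style wildcards where exclude rules take priority over include rules.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

# Hypothetical rule lists mirroring the example in the text.
INCLUDE = ["/support/*"]
EXCLUDE = ["/blog/*"]


def path_allowed(url: str, include=INCLUDE, exclude=EXCLUDE) -> bool:
    """Return True if the URL path matches an include rule and no exclude rule."""
    path = urlparse(url).path
    # Assumption: exclusions win over inclusions.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(path, pattern) for pattern in include)
```

Under these rules a URL such as `https://example.com/support/faq/reset` is crawled, while anything under `/blog/` or outside `/support/` is skipped.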
Keyspider extracts content from every format
Text is extracted from PDFs, DOCX files, and other document formats automatically. No pre-processing or format conversion needed on your side.
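For a sense of what text extraction from a document format involves, the sketch below pulls paragraph text out of a DOCX file using only the Python standard library. It does not describe how Keyspider extracts content. The `docx_text` function is a hypothetical helper, and the sketch relies only on the fact that a DOCX file is a ZIP archive whose main body is stored in `word/document.xml`.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path_or_file) -> str:
    """Return the text of a DOCX file, one line per non-empty paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

This simplified version ignores headers, footers, footnotes, and nested structures such as text boxes, which a production extractor would need to handle.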
Use cases
Document management systems
Index only PDFs and Word documents from a SharePoint library, excluding images, videos, and spreadsheets that add little searchable text.
Technical documentation portals
Crawl HTML pages and PDF manuals simultaneously. Users search once and find answers from both web documentation and downloadable technical guides.
Government information portals
Index published policy PDFs, legislation documents, and web pages from the same site. Content type and path rules keep internal drafts and working files out of the index.
Ready to give your users better answers?
AI Search, AI Assistant, and Workplace Search. Deployed in days, not months. See it live on your own content.
No credit card required · Live in 2 weeks · Cancel anytime