Web Crawling¶

Crawl websites, Confluence spaces, SharePoint libraries, and other web sources to populate your knowledge repositories.

Overview¶

Web crawling allows you to automatically extract content from websites and external platforms into your knowledge repositories. The platform supports multiple crawl methods — from standard website crawling to specialized connectors for Confluence, SharePoint/OneDrive, and bulk URL imports.

To start a crawl, navigate to a Knowledge Repository, click Add Document(s), and select a crawl method from the Crawl section.

Crawl Methods¶

Method	Description
Website	Crawl a website and extract content from its pages (the most widely used crawl type)
CSV URL Import	Import and crawl up to 10,000 URLs from a CSV file (depth fixed at 1)
Confluence	Crawl Confluence spaces using the Atlassian API
SharePoint/OneDrive	Crawl SharePoint/OneDrive sites using the Microsoft Graph API
Human Assisted	Crawl with human assistance using a browser plugin (for authenticated sites)

Website Crawl¶

The most widely used crawl type. It starts from one or more URLs and extracts content from the pages it discovers.

Configuration¶

Field	Required	Default	Description
URLs to Crawl	Yes	—	Enter one or more starting URLs. Use "Add Another URL" for pages not discoverable through navigation or sitemaps
Document Locks	No	—	Keywords for restricting document access. Only users whose role has a matching document key can see results
Review Documents	No	Off	Review crawled pages before indexing. If off, pages are immediately indexed
Depth	No	`1`	Maximum crawl depth. Set to `0` for no limit
User Agent	Yes	Mozilla/5.0...	User agent string used while crawling
Primary Page Parser	Yes	Readability	Parser for raw HTML content
Secondary Page Parser	Yes	Artificial Intelligence (Slow)	AI-powered content processing. Requires Document Content Processing enabled in repository settings
Document Processor	Yes	Native (Fast)	Document processing method. Supports .pdf, .docx, .pptx
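
The interaction between Depth and Max Pages can be pictured as a bounded breadth-first traversal. The sketch below is illustrative only and is not the platform's crawler: the function name `bounded_crawl_order` and the in-memory `links` map are hypothetical, and it assumes that a starting URL sits at level 0, so that a depth of `1` also reaches pages linked directly from it. How the platform actually counts depth may differ.

```python
from collections import deque


def bounded_crawl_order(links, start_urls, depth=1, max_pages=50):
    """Return URLs in the order a breadth-first crawl would visit them.

    links:     dict mapping a URL to the URLs it links to (stands in for fetching).
    depth:     maximum link distance from a start URL; 0 means no limit.
    max_pages: stop after this many pages have been visited.
    """
    starts = list(dict.fromkeys(start_urls))  # drop duplicate start URLs, keep order
    seen = set(starts)
    queue = deque((url, 0) for url in starts)
    visited = []
    while queue and len(visited) < max_pages:
        url, level = queue.popleft()
        visited.append(url)
        if depth != 0 and level >= depth:
            continue  # depth limit reached: do not follow this page's links
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, level + 1))
    return visited


# Hypothetical site structure used only for illustration.
site = {"/": ["/a", "/b"], "/a": ["/a1"], "/a1": ["/a2"]}
shallow = bounded_crawl_order(site, ["/"], depth=1)
unlimited = bounded_crawl_order(site, ["/"], depth=0)
```

Under these assumptions, `shallow` contains only the start page and its direct links, while `unlimited` follows every reachable link until Max Pages is hit.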

Additional Parameters¶

Field	Required	Default	Description
Max Pages	Yes	`50`	Maximum pages to crawl per website
Min Content Length	Yes	`50`	Minimum content length (characters) for indexing
Include/Exclude Mode	Yes	Exclude	Choose whether to include or exclude (skip) URL patterns
Skip Path	No	—	Regex patterns or URLs to skip during crawling
Skip Path From Index	No	—	Regex patterns or URLs to skip during indexing
External domains	No	Off	Allow external domains to be included
Force Update	No	Off	Overwrite previously crawled pages even if the source hasn't changed
Use Sitemaps	No	Off	Use sitemaps to discover URLs
Use Browser	No	Off	Use browser rendering for crawling. Slower but better for dynamic sites
Headless Browser	No	Off	Faster crawling but may not work well with dynamic websites
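
Before committing Skip Path patterns to a large crawl, it can help to try them against a few sample URLs locally. The following is a minimal sketch, assuming patterns are applied as Python-style regular expressions searched anywhere in the URL and that an empty pattern list means no filtering. The function `should_crawl` is hypothetical, and the platform's own matching rules may differ.

```python
import re


def should_crawl(url, patterns, mode="exclude"):
    """Decide whether a URL passes the Include/Exclude filter.

    mode="exclude": skip URLs that match any pattern.
    mode="include": keep only URLs that match at least one pattern.
    Assumption: an empty pattern list disables filtering in both modes.
    """
    if not patterns:
        return True
    matched = any(re.search(pattern, url) for pattern in patterns)
    return matched if mode == "include" else not matched


# Example patterns: skip blog posts and direct PDF links.
patterns = [r"/blog/", r"\.pdf$"]
keep_docs = should_crawl("https://example.com/docs/intro", patterns)
keep_blog = should_crawl("https://example.com/blog/post-1", patterns)
```

Note that a literal URL used as a pattern contains regex metacharacters such as `.` and `?`; wrapping it in `re.escape` gives an exact-text match in this sketch.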

Testing

Before enabling/disabling "Use Browser" or changing Content Parser, test by indexing one page with no additional links, then review the results.

CSV URL Import¶

Import a list of URLs from a CSV file for bulk crawling. This saves time by targeting specific pages instead of traversing an entire website.

Configuration¶

Field	Required	Default	Description
CSV File	Yes	—	Upload a CSV file containing URLs (max 10,000 URLs, max 5MB). One URL per row
Document Locks	No	—	Keywords for restricting document access
Review Documents	No	Off	Review crawled pages before indexing
User Agent	Yes	Mozilla/5.0...	User agent string
Primary Page Parser	Yes	Readability	HTML content parser
Secondary Page Parser	Yes	None (Fast)	AI-powered content processing
Document Processor	Yes	Native (Fast)	Document processing method
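
The CSV limits above (one URL per row, at most 10,000 URLs, at most 5MB) can be checked locally before uploading. This is a sketch under stated assumptions: `check_url_csv` is a hypothetical helper, "5MB" is read as 5 MiB, only `http` and `https` URLs count as valid, and a header row would be reported as a problem because the source does not say whether headers are accepted.

```python
import csv
import io
from urllib.parse import urlparse

MAX_URLS = 10_000
MAX_BYTES = 5 * 1024 * 1024  # assumption: "5MB" interpreted as 5 MiB


def check_url_csv(data: bytes):
    """Return (urls, problems) for CSV content supplied as bytes."""
    problems = []
    if len(data) > MAX_BYTES:
        problems.append("file exceeds 5 MB")
    urls = []
    text = data.decode("utf-8-sig")  # tolerate a UTF-8 byte order mark
    reader = csv.reader(io.StringIO(text, newline=""))
    for row_no, row in enumerate(reader, start=1):
        cells = [cell.strip() for cell in row if cell.strip()]
        if not cells:
            continue  # ignore blank rows
        if len(cells) > 1:
            problems.append(f"row {row_no}: more than one value")
            continue
        parsed = urlparse(cells[0])
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"row {row_no}: not an http(s) URL")
            continue
        urls.append(cells[0])
    if len(urls) > MAX_URLS:
        problems.append(f"{len(urls)} URLs exceed the {MAX_URLS} limit")
    return urls, problems


sample = (
    b"https://example.com/a\n"
    b"https://example.com/b\n"
    b"not-a-url\n"
    b"\n"
    b"https://example.org/1,https://example.org/2\n"
)
urls, problems = check_url_csv(sample)
```

For a real file, pass the result of `open(path, "rb").read()` and fix any reported rows before uploading.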

Additional Parameters¶

Field	Required	Default	Description
Max Pages	Yes	`1000`	Maximum pages to crawl per URL. Depth is automatically set to 1 for CSV imports
Min Content Length	Yes	`50`	Minimum content length for indexing
Force Update	No	Off	Overwrite previously crawled pages even if the source hasn't changed
Use Browser	No	Off	Use browser rendering
Headless Browser	No	Off	Faster browser crawling

Confluence¶

Crawl Confluence spaces using the Atlassian API to import wiki pages and documentation.

Configuration¶

Field	Required	Default	Description
Confluence URL	Yes	—	Confluence base URL (e.g., `https://example.atlassian.net`), without a trailing slash
Username	Yes	—	Confluence username
API Token	Yes	—	API token associated with the Confluence account
Space Key	Yes	—	Confluence Space Key to crawl
Document Locks	No	—	Keywords for restricting document access
Review Documents	No	Off	Review crawled pages before indexing

Additional Parameters¶

Field	Required	Default	Description
Maximum Pages	No	`50`	Maximum pages to crawl in the specified space
Minimum Content Length	No	`50`	Minimum characters for a page to be indexed
Confluence Page(s)	No	—	Specify page title(s) to crawl. The specified page and its child pages will be included. If blank, all pages in the space are crawled
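
If a Confluence crawl fails immediately, it can be useful to confirm the base URL, username, API token, and Space Key outside the platform. The sketch below only builds the request and does not send it. It assumes Confluence Cloud, where the REST API is served under `/wiki/rest/api` and API tokens are used with HTTP Basic authentication; Server or Data Center installations typically use a different path. `build_space_request` is a hypothetical helper, and the requests the platform itself makes are not documented here.

```python
import base64
import urllib.request
from urllib.parse import quote


def build_space_request(base_url, username, api_token, space_key):
    """Build (but do not send) a request that looks up a Confluence space."""
    base = base_url.rstrip("/")  # the configuration expects no trailing slash
    # Assumption: Confluence Cloud REST path for fetching a single space.
    url = f"{base}/wiki/rest/api/space/{quote(space_key, safe='')}"
    raw = f"{username}:{api_token}".encode("utf-8")
    credentials = base64.b64encode(raw).decode("ascii")
    headers = {
        "Authorization": f"Basic {credentials}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


# Placeholder values for illustration only.
request = build_space_request(
    "https://example.atlassian.net/", "me@example.com", "token123", "DOCS"
)
```

Sending the request with `urllib.request.urlopen(request)` against a real site should fail with an HTTP error if the credentials or Space Key are wrong.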

SharePoint/OneDrive¶

Crawl SharePoint/OneDrive sites and document libraries using the Microsoft Graph API.

Configuration¶

Field	Required	Default	Description
SharePoint/OneDrive Site URL	Yes	—	The SharePoint/OneDrive site URL
Connect	Yes	—	Authenticate with Microsoft. Click "Save" after connecting
Document Locks	No	—	Keywords for restricting document access
Review Documents	No	Off	Review crawled pages before indexing
Depth	No	`1`	Maximum folder depth. Set to `0` for no limit
Document Processor	Yes	Native (Fast)	Supports .pdf, .docx, .pptx

Permissions Required

A valid Microsoft Graph API access token with Sites.Read.All and Files.Read.All permissions is required to crawl SharePoint/OneDrive content.

Additional Parameters¶

Field	Required	Default	Description
Max Items	Yes	`50`	Maximum files and folders to crawl
Min Content Length	Yes	`50`	Minimum content length for indexing
Include/Exclude Mode	Yes	Exclude	Choose whether to include or exclude filename patterns
Skip Path From Index	No	—	Regex patterns to skip during indexing
File Types	No	—	File extensions to include (e.g., `pdf`, `docx`). Leave empty for all supported types
Force Update	No	Off	Overwrite previously crawled items even if the source hasn't changed

Human Assisted¶

Crawl websites that require authentication or complex navigation using a Chrome browser plugin. The plugin uses the user's active browser session for crawling.

Configuration¶

Field	Required	Default	Description
URL	Yes	—	Starting URL to crawl
Document Locks	No	—	Keywords for restricting document access
Review Documents	No	Off	Review crawled pages before indexing
Depth	No	`1`	Maximum crawl depth
Browser Token from Plugin	Yes	—	Token from the Chrome plugin
Content Parser	Yes	Readability	HTML content parser
Secondary Page Parser	Yes	None (Fast)	AI-powered content processing
Document Processor	Yes	Native (Fast)	Document processing method

Browser Plugin Setup¶

1. Download the Assisted Crawling plugin and extract it to a folder.
2. Open Chrome, navigate to chrome://extensions/, and enable Developer mode.
3. Click Load unpacked, select the extracted plugin folder, and confirm.
4. Open the plugin's Details page and enable Pin to toolbar.
5. Click the plugin icon on the toolbar and verify that it is enabled and generates a token within 2-3 minutes.
6. Ensure the plugin status shows as connected before starting a crawl job.

Authentication

The browser token uses your active browsing session. Make sure you are logged in to the website and all subdomains you want to crawl.

Additional Parameters¶

Field	Required	Default	Description
Max Pages	Yes	`50`	Maximum pages to crawl
Min Content Length	Yes	`50`	Minimum content length for indexing
Include/Exclude Mode	Yes	Exclude	Include or exclude URL patterns
Skip Path	No	—	Patterns to skip during crawling
Skip Path From Index	No	—	Patterns to skip during indexing
External domains	No	Off	Allow external domains
Force Update	No	Off	Overwrite previously crawled pages even if the source hasn't changed

Job Monitoring¶

Knowledge Jobs Page

After starting a crawl, monitor progress via the View Jobs button on the repository documents page.

Job History¶

Column	Description
Job ID	Unique job identifier
Trigger	Manual or Scheduled
URL(s)	The crawled URLs
Request Details	Crawl parameters (depth, browser, parser, etc.)
Status	In Progress, Success, Failed
Created On	Job start time
Actions	Pages, Rerun, Schedule, Stop, Delete

Scheduling Crawl Jobs¶

For Website and CSV URL Import crawl types, completed jobs can be scheduled to run periodically. The scheduler supports an expanded set of preset cadences — including Quarterly patterns with specific months baked in (e.g., Jan/Apr/Jul/Oct) and Twice Monthly options — plus a Custom Cron Expression field for anything the presets don't cover.

Figure: Schedule modal. The Quick Schedule dropdown shows daily, weekly, and monthly presets plus Twice Monthly (1st & 15th) and Quarterly variants (Jan, Apr, Jul, Oct). The Custom Cron Expression field below it handles cadences the presets don't cover.

Field	Required	Description
Schedule Name	Yes	Descriptive name for the scheduled job
Description	No	Optional description
Notification Email(s)	No	Email addresses to notify (comma-separated)
Quick Schedule preset	No	Pick a preset cadence — Daily / Weekly / Monthly / Twice Monthly / Quarterly (each preset encodes its day(s), time, and target months)
Custom Cron Expression	No	Custom cron (format: `0 hour day month weekday`). Minute must be `0` for hour-only scheduling. Use for advanced cadences the presets don't cover (e.g., bi-weekly, specific month combinations not in the preset list)

Cadence Examples¶

Need	Configuration
Every quarter (Jan, Apr, Jul, Oct)	Pick Quarterly on the 1st at 9:00 AM UTC (Jan, Apr, Jul, Oct) (or one of the other quarterly presets)
Twice a month (1st & 15th)	Pick Twice monthly (1st and 15th) at 9:00 AM UTC
Custom months not in any preset	Provide a Custom Cron Expression, e.g., `0 9 1 3,6,9,12 *` for the 1st of Mar/Jun/Sep/Dec at 9:00 AM UTC
Bi-weekly / other advanced cadence	Provide a Custom Cron Expression
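
A custom expression can be sanity-checked before it is pasted into the Custom Cron Expression field. The sketch below assumes the five-field layout described above (`0 hour day month weekday`) and accepts only `*`, single numbers, comma lists, and ranges. The function `parse_schedule_cron`, the weekday range of 0-6, and the lack of step syntax such as `*/2` are assumptions of this sketch, not documented scheduler behavior.

```python
def _expand(field, low, high):
    """Expand one cron field into a sorted list of allowed values."""
    if field == "*":
        return list(range(low, high + 1))
    values = set()
    for part in field.split(","):
        if "-" in part:
            start, end = (int(piece) for piece in part.split("-", 1))
        else:
            start = end = int(part)
        if not (low <= start <= end <= high):
            raise ValueError(f"{part!r} is outside {low}-{high}")
        values.update(range(start, end + 1))
    return sorted(values)


def parse_schedule_cron(expr):
    """Validate a cron string of the form '0 hour day month weekday'."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields: minute hour day month weekday")
    minute, hour, day, month, weekday = fields
    if minute != "0":
        raise ValueError("minute must be 0")
    return {
        "hour": _expand(hour, 0, 23),
        "day": _expand(day, 1, 31),
        "month": _expand(month, 1, 12),
        "weekday": _expand(weekday, 0, 6),  # assumption: 0-6 only
    }


# The example expression from the table: 1st of Mar/Jun/Sep/Dec at 9:00 AM UTC.
schedule = parse_schedule_cron("0 9 1 3,6,9,12 *")
```

Here `schedule["month"]` lists the four target months and `schedule["hour"]` holds the single UTC hour, which makes a mistyped field easy to spot before the job is saved.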

UTC Timezone

All schedules use UTC. The scheduler previously rejected non-preset cadences with an error directing users to the presets — that limitation has been removed. The Custom Cron Expression field is now the escape hatch for any cadence the presets don't cover.

Related Topics¶

Document Ingestion: Upload files and manage documents
RAG Overview: How retrieval-augmented generation works
Vector Search: How documents are searched and retrieved