The Settings API allows you to configure every aspect of the crawler's behavior. Settings are persisted per user and automatically applied to new crawls. Different user tiers have access to different settings categories.
Tier-Based Access: Settings are grouped by user tier. Guest users cannot modify settings. The User tier gets basic settings, the Extra tier adds JavaScript rendering, filters, and UI customization, and the Admin tier gets all settings, including advanced configuration.
Endpoints
GET /api/get_settings
Retrieve the current user's saved settings. Returns default values for any settings not explicitly configured.
Authentication
Requires valid session cookie (user tier or above).
Example Request
curl http://localhost:5000/api/get_settings \
-b cookies.txt
Success Response (200 OK)
{
"success": true,
"settings": {
// Crawler Settings
"maxDepth": 5,
"maxUrls": 10000,
"crawlDelay": 0.5,
"followRedirects": true,
"crawlExternalLinks": false,
// Request Settings
"userAgent": "LibreCrawl/1.0",
"timeout": 30,
"retries": 3,
"acceptLanguage": "en-US,en;q=0.9",
"respectRobotsTxt": true,
"allowCookies": true,
"discoverSitemaps": true,
"enablePageSpeed": false,
"googleApiKey": "",
// Filter Settings (Extra tier)
"includeExtensions": "html,htm,php,asp,aspx",
"excludeExtensions": "jpg,jpeg,png,gif,pdf,zip",
"includePatterns": "",
"excludePatterns": "",
"maxFileSize": 10,
// JavaScript Settings (Extra tier)
"enableJavaScript": false,
"jsWaitTime": 2,
"jsTimeout": 30,
"jsBrowser": "chromium",
"jsHeadless": true,
"jsUserAgent": "",
"jsViewportWidth": 1920,
"jsViewportHeight": 1080,
"jsMaxConcurrentPages": 5,
// Export Settings
"exportFormat": "csv",
"exportFields": ["url", "status_code", "title"],
// Advanced Settings (Admin tier)
"concurrency": 10,
"memoryLimit": 2048,
"logLevel": "INFO",
"saveSession": true,
"enableProxy": false,
"proxyUrl": "",
"customHeaders": "{}",
// Custom UI (Extra tier)
"customCSS": "",
// Issue Management
"issueExclusionPatterns": ""
}
}
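For scripting, the response can be piped through jq to read a single value (a quick sketch; assumes jq is installed):
curl -s http://localhost:5000/api/get_settings -b cookies.txt | jq '.settings.maxDepth'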
POST /api/save_settings
Save user settings. Only settings accessible to the user's tier level are saved. Values are validated and applied immediately to the crawler instance if one exists.
Authentication
Requires valid session cookie (user tier or above).
Request Body
Send a JSON object with any or all settings fields. Only provided fields will be updated.
Example Request
curl -X POST http://localhost:5000/api/save_settings \
-H "Content-Type: application/json" \
-b cookies.txt \
-d '{
"maxDepth": 3,
"maxUrls": 5000,
"crawlDelay": 1.0,
"respectRobotsTxt": true,
"enableJavaScript": true,
"jsWaitTime": 3
}'
Success Response (200 OK)
{
"success": true,
"message": "Settings saved successfully"
}
Error Response (400 Bad Request)
{
"success": false,
"error": "Invalid setting value: maxDepth must be between 1 and 5000000"
}
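In scripts, it can be useful to check the success flag and surface validation errors. A minimal bash sketch (assumes jq is installed; the field values are illustrative):
response=$(curl -s -X POST http://localhost:5000/api/save_settings \
  -H "Content-Type: application/json" \
  -b cookies.txt \
  -d '{"maxDepth": 3, "crawlDelay": 1.0}')
if [ "$(echo "$response" | jq -r '.success')" != "true" ]; then
  echo "Save failed: $(echo "$response" | jq -r '.error')" >&2
fi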
POST /api/reset_settings
Reset all settings to their default values.
Authentication
Requires valid session cookie (user tier or above).
Example Request
curl -X POST http://localhost:5000/api/reset_settings \
-b cookies.txt
Success Response (200 OK)
{
"success": true,
"message": "Settings reset to defaults"
}
POST /api/update_crawler_settings
Apply saved settings to the active crawler instance. Useful for updating settings during a paused crawl.
Authentication
Requires valid session cookie (user tier or above).
Example Request
curl -X POST http://localhost:5000/api/update_crawler_settings \
-b cookies.txt
Success Response (200 OK)
{
"success": true,
"message": "Crawler settings updated"
}
Note: Some settings (like maxDepth, maxUrls) cannot be changed during an active crawl. Pause the crawl first, update settings, then resume.
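In practice the flow could look like the sketch below. The pause and resume calls belong to the Crawl Control API and are only indicated as comments here; the setting values are illustrative.
# 1. Pause the active crawl (Crawl Control API, not shown here).
# 2. Save the new settings.
curl -X POST http://localhost:5000/api/save_settings \
  -H "Content-Type: application/json" \
  -b cookies.txt \
  -d '{"crawlDelay": 2.0, "timeout": 60}'
# 3. Push the saved settings to the paused crawler instance.
curl -X POST http://localhost:5000/api/update_crawler_settings \
  -b cookies.txt
# 4. Resume the crawl (Crawl Control API).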
Settings Reference
Crawler Settings (All tiers)
| Setting | Type | Default | Description |
|---|---|---|---|
| maxDepth | number | 5 | Maximum crawl depth (1-5000000) |
| maxUrls | number | 10000 | Maximum URLs to discover (1-5000000) |
| crawlDelay | number | 0.5 | Delay between requests in seconds (0-60) |
| followRedirects | boolean | true | Follow HTTP redirects (301, 302, etc.) |
| crawlExternalLinks | boolean | false | Crawl links to external domains |
Request Settings (User tier+)
| Setting | Type | Default | Description |
|---|---|---|---|
| userAgent | string | "LibreCrawl/1.0" | User-Agent header for requests |
| timeout | number | 30 | Request timeout in seconds (1-300) |
| retries | number | 3 | Number of retries for failed requests (0-10) |
| acceptLanguage | string | "en-US,en;q=0.9" | Accept-Language header value |
| respectRobotsTxt | boolean | true | Respect robots.txt directives |
| allowCookies | boolean | true | Enable cookie handling |
| discoverSitemaps | boolean | true | Auto-discover and crawl sitemaps |
| enablePageSpeed | boolean | false | Enable Google PageSpeed analysis |
| googleApiKey | string | "" | Google API key for PageSpeed (required if enabled) |
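For example, enabling PageSpeed analysis only works when a Google API key is supplied in the same settings payload (the key below is a placeholder):
"enablePageSpeed": true,
"googleApiKey": "YOUR_GOOGLE_API_KEY"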
Filter Settings (Extra tier+)
| Setting | Type | Default | Description |
|---|---|---|---|
| includeExtensions | string | "html,htm,php,asp,aspx" | Comma-separated file extensions to include |
| excludeExtensions | string | "jpg,jpeg,png,gif,pdf,zip" | Comma-separated file extensions to exclude |
| includePatterns | string | "" | Regex patterns for URLs to include (one per line) |
| excludePatterns | string | "" | Regex patterns for URLs to exclude (one per line) |
| maxFileSize | number | 10 | Maximum file size in MB (1-1000) |
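Because includePatterns and excludePatterns take one regex per line, multi-line values are sent as newline-separated strings when posted as JSON. A sketch (the patterns are illustrative and the \n encoding is an assumption):
"excludePatterns": "^https://example\\.com/tag/.*\n^https://example\\.com/page/\\d+$",
"maxFileSize": 25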
JavaScript Rendering (Extra tier+)
| Setting | Type | Default | Description |
|---|---|---|---|
| enableJavaScript | boolean | false | Enable JavaScript rendering with headless browser |
| jsWaitTime | number | 2 | Seconds to wait after page load (0-60) |
| jsTimeout | number | 30 | Page load timeout in seconds (1-300) |
| jsBrowser | string | "chromium" | Browser engine (chromium, firefox, webkit) |
| jsHeadless | boolean | true | Run browser in headless mode |
| jsUserAgent | string | "" | Custom User-Agent for JS rendering (empty = default) |
| jsViewportWidth | number | 1920 | Browser viewport width (320-3840) |
| jsViewportHeight | number | 1080 | Browser viewport height (240-2160) |
| jsMaxConcurrentPages | number | 5 | Max concurrent browser pages (1-20) |
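For example, an Extra-tier user could switch rendering to Firefox with a smaller viewport and fewer concurrent pages (values are illustrative):
"enableJavaScript": true,
"jsBrowser": "firefox",
"jsWaitTime": 3,
"jsViewportWidth": 1366,
"jsViewportHeight": 768,
"jsMaxConcurrentPages": 3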
Export Settings (All tiers)
| Setting | Type | Default | Description |
|---|---|---|---|
| exportFormat | string | "csv" | Default export format (csv, json, xml) |
| exportFields | array | ["url", "status_code", "title"] | Default fields to include in exports |
Advanced Settings (Admin tier)
| Setting | Type | Default | Description |
|---|---|---|---|
| concurrency | number | 10 | Max concurrent requests (1-100) |
| memoryLimit | number | 2048 | Memory limit in MB (128-16384) |
| logLevel | string | "INFO" | Logging level (DEBUG, INFO, WARNING, ERROR) |
| saveSession | boolean | true | Save crawl state for resumption |
| enableProxy | boolean | false | Use proxy for requests |
| proxyUrl | string | "" | Proxy URL (http://host:port or socks5://host:port) |
| customHeaders | string | "{}" | JSON string of custom HTTP headers |
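For example, routing requests through a local SOCKS proxy while adding a custom header could look like this (the proxy address and header name are placeholders; note that customHeaders is itself a JSON string, so inner quotes are escaped):
"enableProxy": true,
"proxyUrl": "socks5://127.0.0.1:1080",
"customHeaders": "{\"X-Crawl-Source\": \"librecrawl\"}"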
UI Customization (Extra tier+)
| Setting | Type | Default | Description |
|---|---|---|---|
| customCSS | string | "" | Custom CSS for UI customization |
Issue Management (All tiers)
| Setting | Type | Default | Description |
|---|---|---|---|
| issueExclusionPatterns | string | "" | Wildcard patterns for URLs to exclude from issue detection (one per line) |
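For example, to keep admin and staging URLs out of issue reports (the patterns are illustrative; multi-line values are assumed to be sent as newline-separated strings):
"issueExclusionPatterns": "*/wp-admin/*\n*/staging/*"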
Best Practices
1. Set Reasonable Limits
Configure maxDepth and maxUrls based on your target site size (see the example request after this list):
- Small sites (<1K pages): maxDepth: 5, maxUrls: 5000
- Medium sites (1K-50K pages): maxDepth: 10, maxUrls: 50000
- Large sites (50K+ pages): maxDepth: 15, maxUrls: 500000+
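For example, the medium-site profile above could be applied with a single request:
curl -X POST http://localhost:5000/api/save_settings \
  -H "Content-Type: application/json" \
  -b cookies.txt \
  -d '{"maxDepth": 10, "maxUrls": 50000}'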
2. Be Respectful with Crawl Delay
Set appropriate crawlDelay to avoid overwhelming target servers:
- Own sites: 0.1-0.5 seconds
- Small sites: 1-2 seconds
- Large/production sites: 2-5 seconds
3. JavaScript Rendering Performance
JavaScript rendering is resource-intensive. Optimize with:
- jsMaxConcurrentPages: Start with 5 and increase only if your system has spare resources
- jsWaitTime: Use the minimum wait time needed (2-3 seconds is typical)
- jsHeadless: Keep this set to true in production for better performance
4. Filter Efficiently
Use excludeExtensions to skip binary files and media:
"excludeExtensions": "jpg,jpeg,png,gif,svg,webp,pdf,zip,exe,dmg,mp4,mp3,avi,mov,css,js"
5. Custom Headers for Authentication
For crawling authenticated sections, use customHeaders. The value is itself a JSON-encoded string, so inner quotes must be escaped:
"customHeaders": "{\"Authorization\": \"Bearer your-token-here\", \"X-Custom-Header\": \"value\"}"
Next Steps
- Crawl Control API - Start crawls with configured settings
- Export API - Export data with configured format
- Getting Started Guide - Complete integration tutorial