The Crawl Control API provides complete lifecycle management for website crawls. Start crawls with a single URL, pause and resume long-running operations, and stop crawls at any time while preserving collected data.

Crawl Workflow

1. Start Crawl → Running
    ↓
2. Pause (optional) → Paused → Resume → Running
    ↓
3. Complete or Stop → Completed
    ↓
4. Poll Status → Get real-time data

A typical crawl workflow involves:

  1. Starting a crawl with /api/start_crawl
  2. Polling /api/crawl_status once per second for progress updates
  3. Pausing/Resuming as needed with /api/pause_crawl and /api/resume_crawl
  4. Stopping with /api/stop_crawl or letting it complete naturally
  5. Exporting the collected data via the Export API

Session Isolation: Each user session has its own crawler instance. You can only control crawls started by your session.
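
The same workflow can be driven end to end from a script. The sketch below is a minimal JavaScript illustration (using fetch, as in the polling example under Best Practices); it relies only on the response shapes shown on this page and in the Export API example further down, so adapt the error handling to your client.

// Minimal sketch of the crawl lifecycle: start, poll, export.
async function runCrawl(startUrl) {
  // 1. Start the crawl
  const start = await fetch('/api/start_crawl', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: startUrl })
  }).then(r => r.json());
  if (!start.success) throw new Error(start.error);

  // 2. Poll once per second until the crawl reports "completed"
  let status;
  do {
    await new Promise(resolve => setTimeout(resolve, 1000));
    status = await fetch('/api/crawl_status').then(r => r.json());
  } while (status.status !== 'completed');

  // 3. Export the collected data
  return fetch('/api/export_data', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ format: 'json', fields: ['url', 'status_code', 'title'] })
  });
}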

Endpoints

POST /api/start_crawl

Start a new website crawl. The crawler will discover and crawl all linked pages from the starting URL, respecting your configured settings for depth, URL limits, and filters.

Authentication

Requires valid session cookie.

Rate Limiting

Guest users are limited to 3 crawls per 24-hour period (IP-based). Authenticated users have unlimited crawls.

Request Body

Parameter | Type   | Required | Description
url       | string | Yes      | Starting URL for the crawl (must be a valid HTTP/HTTPS URL)

Example Request

curl -X POST http://localhost:5000/api/start_crawl \
  -H "Content-Type: application/json" \
  -b cookies.txt \
  -d '{
    "url": "https://example.com"
  }'

Success Response (200 OK)

{
  "success": true,
  "message": "Crawl started successfully"
}

Error Responses

# Missing URL (400 Bad Request)
{
  "success": false,
  "error": "URL is required"
}

# Invalid URL format (400 Bad Request)
{
  "success": false,
  "error": "Invalid URL format"
}

# Crawl already running (400 Bad Request)
{
  "success": false,
  "error": "A crawl is already running"
}

# Guest rate limit exceeded (429 Too Many Requests)
{
  "success": false,
  "error": "Guest crawl limit reached (3 per 24 hours)"
}

Behavior

  • Creates a new crawler instance for your session
  • Applies your saved settings (or defaults if not configured)
  • Logs the crawl start time to the database
  • Returns immediately (crawl runs asynchronously)
  • Guest users: Increments crawl count for IP-based rate limiting

Important: Starting a new crawl while one is already running will return an error. Stop or wait for the current crawl to complete first.
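
Because the endpoint can reject a request for several documented reasons (missing or invalid URL, a crawl already running, or the guest limit), scripted clients should branch on the error responses above. A minimal sketch:

// Start a crawl and surface the documented error cases.
async function startCrawl(url) {
  const res = await fetch('/api/start_crawl', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url })
  });
  const data = await res.json();

  if (res.status === 429) {
    // Guest crawl limit reached (3 per 24 hours)
    throw new Error(data.error);
  }
  if (!data.success) {
    // e.g. "Invalid URL format" or "A crawl is already running"
    throw new Error(data.error);
  }
  return data; // { success: true, message: "Crawl started successfully" }
}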

POST /api/stop_crawl

Stop the currently running crawl. All data collected up to this point is preserved and can be exported.

Authentication

Requires valid session cookie.

Request Body

No request body required.

Example Request

curl -X POST http://localhost:5000/api/stop_crawl \
  -b cookies.txt

Success Response (200 OK)

{
  "success": true,
  "message": "Crawl stopped successfully"
}

Error Response (400 Bad Request)

# No crawl running
{
  "success": false,
  "error": "No active crawl to stop"
}

Behavior

  • Signals the crawler to stop gracefully
  • Pages already being processed finish their requests
  • Crawler state changes to "completed"
  • All collected data remains available
  • Updates crawl history in database with completion time
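
Because the stop is graceful and leaves the crawler in the "completed" state, a client can confirm the transition (and the preserved data) through /api/crawl_status. A small sketch:

// Stop the current crawl and confirm the state transition.
async function stopCrawl() {
  const stop = await fetch('/api/stop_crawl', { method: 'POST' }).then(r => r.json());
  if (!stop.success) {
    throw new Error(stop.error); // "No active crawl to stop"
  }
  // Collected data remains available after stopping.
  const status = await fetch('/api/crawl_status').then(r => r.json());
  console.log(status.status); // expected: "completed"
}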

POST /api/pause_crawl

Pause the currently running crawl. The crawler stops processing new URLs but preserves its state for later resumption.

Authentication

Requires valid session cookie.

Request Body

No request body required.

Example Request

curl -X POST http://localhost:5000/api/pause_crawl \
  -b cookies.txt

Success Response (200 OK)

{
  "success": true,
  "message": "Crawl paused successfully"
}

Error Responses

# No crawl running (400 Bad Request)
{
  "success": false,
  "error": "No active crawl to pause"
}

# Already paused (400 Bad Request)
{
  "success": false,
  "error": "Crawl is already paused"
}

Behavior

  • Crawler stops processing new URLs from the queue
  • In-flight requests complete normally
  • Crawler state changes to "paused"
  • All data and queue state preserved
  • Can be resumed with /api/resume_crawl

Use Case: Pause crawls to temporarily free up system resources, make configuration changes, or avoid hitting daily third-party API limits (e.g., the PageSpeed API).
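
For example, a client enforcing its own URL budget can pause once a threshold is reached. The sketch below assumes /api/crawl_status exposes a crawled-URL counter; the urls_crawled field name is an assumption, so check the Status API for the exact response shape.

// Pause the crawl once a self-imposed URL budget is reached.
// NOTE: urls_crawled is an assumed field name; consult the Status API docs.
async function pauseWhenBudgetReached(budget) {
  const status = await fetch('/api/crawl_status').then(r => r.json());
  if (status.urls_crawled >= budget) {
    await fetch('/api/pause_crawl', { method: 'POST' });
  }
}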

POST /api/resume_crawl

Resume a paused crawl from where it left off. The crawler continues processing URLs from the queue with all previous state intact.

Authentication

Requires valid session cookie.

Request Body

No request body required.

Example Request

curl -X POST http://localhost:5000/api/resume_crawl \
  -b cookies.txt

Success Response (200 OK)

{
  "success": true,
  "message": "Crawl resumed successfully"
}

Error Responses

# No paused crawl (400 Bad Request)
{
  "success": false,
  "error": "No paused crawl to resume"
}

# Crawl already running (400 Bad Request)
{
  "success": false,
  "error": "Crawl is already running"
}

Behavior

  • Crawler state changes from "paused" to "running"
  • Processing resumes from the next URL in the queue
  • All previous settings and state preserved
  • Statistics and counters continue from previous values

Crawl Lifecycle States

State     | Description                   | Allowed Actions
idle      | No crawl running or loaded    | start_crawl
running   | Actively crawling URLs        | stop_crawl, pause_crawl
paused    | Crawl paused, state preserved | resume_crawl, stop_crawl
completed | Crawl finished or stopped     | start_crawl (new crawl)
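
A UI or script can derive which controls to enable directly from this table, for example:

// Map each crawler state to the control endpoints valid in it,
// mirroring the table above.
const ALLOWED_ACTIONS = {
  idle:      ['start_crawl'],
  running:   ['stop_crawl', 'pause_crawl'],
  paused:    ['resume_crawl', 'stop_crawl'],
  completed: ['start_crawl']
};

function canPerform(state, action) {
  return (ALLOWED_ACTIONS[state] || []).includes(action);
}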

Multi-Tenancy & Isolation

LibreCrawl manages crawler instances per user session:

  • Session Isolation: Each browser session gets a unique crawler instance
  • Independent State: Settings, crawl data, and queue are session-specific
  • Auto-Cleanup: Inactive crawler instances are automatically removed after 1 hour
  • Concurrent Users: Multiple users can crawl simultaneously without interference

Session Expiry: If your session cookie expires, you lose access to the running crawler instance. Persist the session cookie (for example, curl's cookies.txt file) for the full duration of long-running crawls.
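
Every control endpoint resolves the crawler instance from the session cookie, so it must accompany each request. Browsers send it automatically for same-origin fetch calls; cross-origin or scripted clients have to opt in explicitly:

// Same-origin requests include the session cookie by default;
// cross-origin requests must ask for it.
await fetch('http://localhost:5000/api/crawl_status', {
  credentials: 'include'
});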

Complete Example Workflow

# 1. Login to create session
curl -X POST http://localhost:5000/api/login \
  -H "Content-Type: application/json" \
  -c cookies.txt \
  -d '{"username": "user", "password": "pass"}'

# 2. Start crawl
curl -X POST http://localhost:5000/api/start_crawl \
  -H "Content-Type: application/json" \
  -b cookies.txt \
  -d '{"url": "https://example.com"}'

# 3. Poll status (repeat every 1 second)
curl http://localhost:5000/api/crawl_status \
  -b cookies.txt

# 4. Pause if needed
curl -X POST http://localhost:5000/api/pause_crawl \
  -b cookies.txt

# 5. Resume when ready
curl -X POST http://localhost:5000/api/resume_crawl \
  -b cookies.txt

# 6. Stop crawl (or let it complete)
curl -X POST http://localhost:5000/api/stop_crawl \
  -b cookies.txt

# 7. Export data
curl -X POST http://localhost:5000/api/export_data \
  -H "Content-Type: application/json" \
  -b cookies.txt \
  -d '{"format": "json", "fields": ["url", "status_code", "title"]}'

Best Practices

1. Configure Settings Before Starting

Use the Settings API to configure crawler behavior before calling /api/start_crawl. Key settings include:

  • maxDepth: Limit how many link levels the crawler follows from the starting URL
  • maxUrls: Set the maximum number of URLs to discover
  • crawlDelay: Add a delay between requests to avoid overloading the target server
  • excludePatterns: Exclude URLs matching specific patterns
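
A sketch of applying settings and then starting the crawl is shown below. The /api/update_settings path and the payload shape are placeholders, not confirmed endpoints; use whatever the Settings API actually documents.

// Configure crawler behavior, then start the crawl.
// NOTE: the settings endpoint path and payload below are placeholders;
// see the Settings API for the real endpoint, field types, and units.
await fetch('/api/update_settings', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    maxDepth: 3,                   // how many link levels to follow
    maxUrls: 500,                  // cap on discovered URLs
    crawlDelay: 1,                 // delay between requests (units per Settings API)
    excludePatterns: ['/admin/']   // pattern syntax per Settings API
  })
});

await fetch('/api/start_crawl', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com' })
});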

2. Implement Robust Polling

Poll /api/crawl_status once per second and handle errors gracefully:

// Poll crawl status once per second until the crawl completes.
async function pollStatus() {
  try {
    const res = await fetch('/api/crawl_status');
    if (!res.ok) {
      throw new Error(`Status request failed: ${res.status}`);
    }
    const data = await res.json();

    if (data.status !== 'completed') {
      setTimeout(pollStatus, 1000); // keep polling while the crawl runs
    }
  } catch (error) {
    console.error('Polling error:', error);
    setTimeout(pollStatus, 5000); // retry after a longer pause on errors
  }
}

3. Handle Guest Rate Limits

For guest users, check remaining crawls before starting:

// Check the guest account's remaining crawl allowance before starting.
const info = await fetch('/api/user/info').then(r => r.json());
if (info.user.crawls_remaining === 0) {
  alert('Daily crawl limit reached. Please register for unlimited crawls.');
  return;
}

4. Preserve Partial Data on Errors

Even if a crawl fails or is stopped early, the data collected so far is available via /api/crawl_status and can be exported.
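
For example, after stopping a crawl early, the partial results can be exported right away:

// Stop early, then export whatever was collected so far.
await fetch('/api/stop_crawl', { method: 'POST' });

const exported = await fetch('/api/export_data', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    format: 'json',
    fields: ['url', 'status_code', 'title']
  })
});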

Next Steps