The Crawl Control API provides complete lifecycle management for website crawls. Start crawls with a single URL, pause and resume long-running operations, and stop crawls at any time while preserving collected data.

Crawl Workflow

1. Start Crawl → Running
    ↓
2. Pause (optional) → Paused → Resume → Running
    ↓
3. Complete or Stop → Completed
    ↓
4. Poll Status → Get real-time data

A typical crawl workflow involves:

  1. Starting a crawl with /api/start_crawl
  2. Polling /api/crawl_status once per second for progress updates
  3. Pausing/Resuming as needed with /api/pause_crawl and /api/resume_crawl
  4. Stopping with /api/stop_crawl or letting it complete naturally
  5. Exporting the collected data via the Export API

Session Isolation: Each user session has its own crawler instance. You can only control crawls started by your session.
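
The same workflow can be driven end to end from a script. The sketch below is a minimal JavaScript illustration (using fetch, as in the polling example under Best Practices); it relies only on the response shapes shown on this page and in the Export API example further down, so adapt the error handling to your client.

// Minimal sketch of the crawl lifecycle: start, poll, export.
async function runCrawl(startUrl) {
  // 1. Start the crawl
  const start = await fetch('/api/start_crawl', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: startUrl })
  }).then(r => r.json());
  if (!start.success) throw new Error(start.error);

  // 2. Poll once per second until the crawl reports "completed"
  let status;
  do {
    await new Promise(resolve => setTimeout(resolve, 1000));
    status = await fetch('/api/crawl_status').then(r => r.json());
  } while (status.status !== 'completed');

  // 3. Export the collected data
  return fetch('/api/export_data', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ format: 'json', fields: ['url', 'status_code', 'title'] })
  });
}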

Endpoints

POST /api/start_crawl

Start a new website crawl. The crawler will discover and crawl all linked pages from the starting URL, respecting your configured settings for depth, URL limits, and filters.

Authentication

Requires valid session cookie.

Rate Limiting

Guest users are limited to 3 crawls per 24-hour period (IP-based). Authenticated users have unlimited crawls.

Request Body

Parameter | Type   | Required | Description
url       | string | Yes      | Starting URL for the crawl (must be a valid HTTP/HTTPS URL)

Example Request

curl -X POST http://localhost:5000/api/start_crawl \
  -H "Content-Type: application/json" \
  -b cookies.txt \
  -d '{
    "url": "https://example.com"
  }'

Success Response (200 OK)

{
  "success": true,
  "message": "Crawl started successfully"
}

Error Responses

# Missing URL (400 Bad Request)
{
  "success": false,
  "error": "URL is required"
}

# Invalid URL format (400 Bad Request)
{
  "success": false,
  "error": "Invalid URL format"
}

# Crawl already running (400 Bad Request)
{
  "success": false,
  "error": "A crawl is already running"
}

# Guest rate limit exceeded (429 Too Many Requests)
{
  "success": false,
  "error": "Guest crawl limit reached (3 per 24 hours)"
}

Behavior

  • Creates a new crawler instance for your session
  • Applies your saved settings (or defaults if not configured)
  • Logs the crawl start time to the database
  • Returns immediately (crawl runs asynchronously)
  • Guest users: Increments crawl count for IP-based rate limiting

Important: Starting a new crawl while one is already running will return an error. Stop or wait for the current crawl to complete first.
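
Because the endpoint can reject a request for several documented reasons (missing or invalid URL, a crawl already running, or the guest limit), scripted clients should branch on the error responses above. A minimal sketch:

// Start a crawl and surface the documented error cases.
async function startCrawl(url) {
  const res = await fetch('/api/start_crawl', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url })
  });
  const data = await res.json();

  if (res.status === 429) {
    // Guest crawl limit reached (3 per 24 hours)
    throw new Error(data.error);
  }
  if (!data.success) {
    // e.g. "Invalid URL format" or "A crawl is already running"
    throw new Error(data.error);
  }
  return data; // { success: true, message: "Crawl started successfully" }
}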

POST /api/stop_crawl

Stop the currently running crawl. All data collected up to this point is preserved and can be exported.

Authentication

Requires valid session cookie.

Request Body

No request body required.

Example Request

curl -X POST http://localhost:5000/api/stop_crawl \
  -b cookies.txt

Success Response (200 OK)

{
  "success": true,
  "message": "Crawl stopped successfully"
}

Error Response (400 Bad Request)

# No crawl running
{
  "success": false,
  "error": "No active crawl to stop"
}

Behavior

  • Signals the crawler to stop gracefully
  • Pages already being processed finish their requests
  • Crawler state changes to "completed"
  • All collected data remains available
  • Updates crawl history in database with completion time
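
Because the stop is graceful and leaves the crawler in the "completed" state, a client can confirm the transition (and the preserved data) through /api/crawl_status. A small sketch:

// Stop the current crawl and confirm the state transition.
async function stopCrawl() {
  const stop = await fetch('/api/stop_crawl', { method: 'POST' }).then(r => r.json());
  if (!stop.success) {
    throw new Error(stop.error); // "No active crawl to stop"
  }
  // Collected data remains available after stopping.
  const status = await fetch('/api/crawl_status').then(r => r.json());
  console.log(status.status); // expected: "completed"
}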

POST /api/pause_crawl

Pause the currently running crawl. The crawler stops processing new URLs but preserves its state for later resumption.

Authentication

Requires valid session cookie.

Request Body

No request body required.

Example Request

curl -X POST http://localhost:5000/api/pause_crawl \
  -b cookies.txt

Success Response (200 OK)

{
  "success": true,
  "message": "Crawl paused successfully"
}

Error Responses

# No crawl running (400 Bad Request)
{
  "success": false,
  "error": "No active crawl to pause"
}

# Already paused (400 Bad Request)
{
  "success": false,
  "error": "Crawl is already paused"
}

Behavior

  • Crawler stops processing new URLs from the queue
  • In-flight requests complete normally
  • Crawler state changes to "paused"
  • All data and queue state preserved
  • Can be resumed with /api/resume_crawl

Use Case: Pause crawls to temporarily free up system resources, make configuration changes, or avoid hitting daily third-party API limits (e.g., the PageSpeed API).
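
For example, a client enforcing its own URL budget can pause once a threshold is reached. The sketch below assumes /api/crawl_status exposes a crawled-URL counter; the urls_crawled field name is an assumption, so check the Status API for the exact response shape.

// Pause the crawl once a self-imposed URL budget is reached.
// NOTE: urls_crawled is an assumed field name; consult the Status API docs.
async function pauseWhenBudgetReached(budget) {
  const status = await fetch('/api/crawl_status').then(r => r.json());
  if (status.urls_crawled >= budget) {
    await fetch('/api/pause_crawl', { method: 'POST' });
  }
}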

POST /api/resume_crawl

Resume a paused crawl from where it left off. The crawler continues processing URLs from the queue with all previous state intact.

Authentication

Requires valid session cookie.

Request Body

No request body required.

Example Request

curl -X POST http://localhost:5000/api/resume_crawl \
  -b cookies.txt

Success Response (200 OK)

{
  "success": true,
  "message": "Crawl resumed successfully"
}

Error Responses

# No paused crawl (400 Bad Request)
{
  "success": false,
  "error": "No paused crawl to resume"
}

# Crawl already running (400 Bad Request)
{
  "success": false,
  "error": "Crawl is already running"
}

Behavior

  • Crawler state changes from "paused" to "running"
  • Processing resumes from the next URL in the queue
  • All previous settings and state preserved
  • Statistics and counters continue from previous values

Crawl Lifecycle States

State     | Description                   | Allowed Actions
idle      | No crawl running or loaded    | start_crawl
running   | Actively crawling URLs        | stop_crawl, pause_crawl
paused    | Crawl paused, state preserved | resume_crawl, stop_crawl
completed | Crawl finished or stopped     | start_crawl (new crawl)
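
A UI or script can derive which controls to enable directly from this table, for example:

// Map each crawler state to the control endpoints valid in it,
// mirroring the table above.
const ALLOWED_ACTIONS = {
  idle:      ['start_crawl'],
  running:   ['stop_crawl', 'pause_crawl'],
  paused:    ['resume_crawl', 'stop_crawl'],
  completed: ['start_crawl']
};

function canPerform(state, action) {
  return (ALLOWED_ACTIONS[state] || []).includes(action);
}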

Multi-Tenancy & Isolation

LibreCrawl manages crawler instances per user session:

  • Session Isolation: Each browser session gets a unique crawler instance
  • Independent State: Settings, crawl data, and queue are session-specific
  • Auto-Cleanup: Inactive crawler instances are automatically removed after 1 hour
  • Concurrent Users: Multiple users can crawl simultaneously without interference

Session Expiry: If your session cookie expires, you lose access to the running crawler instance. Persist the session cookie (for example, curl's cookies.txt file) for the full duration of long-running crawls.
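
Every control endpoint resolves the crawler instance from the session cookie, so it must accompany each request. Browsers send it automatically for same-origin fetch calls; cross-origin or scripted clients have to opt in explicitly:

// Same-origin requests include the session cookie by default;
// cross-origin requests must ask for it.
await fetch('http://localhost:5000/api/crawl_status', {
  credentials: 'include'
});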

Complete Example Workflow

# 1. Login to create session
curl -X POST http://localhost:5000/api/login \
  -H "Content-Type: application/json" \
  -c cookies.txt \
  -d '{"username": "user", "password": "pass"}'

# 2. Start crawl
curl -X POST http://localhost:5000/api/start_crawl \
  -H "Content-Type: application/json" \
  -b cookies.txt \
  -d '{"url": "https://example.com"}'

# 3. Poll status (repeat every 1 second)
curl http://localhost:5000/api/crawl_status \
  -b cookies.txt

# 4. Pause if needed
curl -X POST http://localhost:5000/api/pause_crawl \
  -b cookies.txt

# 5. Resume when ready
curl -X POST http://localhost:5000/api/resume_crawl \
  -b cookies.txt

# 6. Stop crawl (or let it complete)
curl -X POST http://localhost:5000/api/stop_crawl \
  -b cookies.txt

# 7. Export data
curl -X POST http://localhost:5000/api/export_data \
  -H "Content-Type: application/json" \
  -b cookies.txt \
  -d '{"format": "json", "fields": ["url", "status_code", "title"]}'

Best Practices

1. Configure Settings Before Starting

Use the Settings API to configure crawler behavior before calling /api/start_crawl. Key settings include:

  • maxDepth: Limit how many link levels the crawler follows from the starting URL
  • maxUrls: Set the maximum number of URLs to discover
  • crawlDelay: Add a delay between requests to avoid overloading the target server
  • excludePatterns: Exclude URLs matching specific patterns
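
A sketch of applying settings and then starting the crawl is shown below. The /api/update_settings path and the payload shape are placeholders, not confirmed endpoints; use whatever the Settings API actually documents.

// Configure crawler behavior, then start the crawl.
// NOTE: the settings endpoint path and payload below are placeholders;
// see the Settings API for the real endpoint, field types, and units.
await fetch('/api/update_settings', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    maxDepth: 3,                   // how many link levels to follow
    maxUrls: 500,                  // cap on discovered URLs
    crawlDelay: 1,                 // delay between requests (units per Settings API)
    excludePatterns: ['/admin/']   // pattern syntax per Settings API
  })
});

await fetch('/api/start_crawl', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com' })
});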

2. Implement Robust Polling

Poll /api/crawl_status once per second and handle errors gracefully:

// Poll crawl status once per second until the crawl completes.
async function pollStatus() {
  try {
    const res = await fetch('/api/crawl_status');
    if (!res.ok) {
      throw new Error(`Status request failed: ${res.status}`);
    }
    const data = await res.json();

    if (data.status !== 'completed') {
      setTimeout(pollStatus, 1000); // keep polling while the crawl runs
    }
  } catch (error) {
    console.error('Polling error:', error);
    setTimeout(pollStatus, 5000); // retry after a longer pause on errors
  }
}

3. Handle Guest Rate Limits

For guest users, check remaining crawls before starting:

// Check the guest account's remaining crawl allowance before starting.
const info = await fetch('/api/user/info').then(r => r.json());
if (info.user.crawls_remaining === 0) {
  alert('Daily crawl limit reached. Please register for unlimited crawls.');
  return;
}

4. Preserve Partial Data on Errors

Even if a crawl fails or is stopped early, the data collected so far is available via /api/crawl_status and can be exported.
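
For example, after stopping a crawl early, the partial results can be exported right away:

// Stop early, then export whatever was collected so far.
await fetch('/api/stop_crawl', { method: 'POST' });

const exported = await fetch('/api/export_data', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    format: 'json',
    fields: ['url', 'status_code', 'title']
  })
});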

Next Steps