## Quick Start
The LibreCrawl API is a RESTful HTTP API that provides programmatic access to all crawling functionality. All endpoints return JSON responses and use session-based authentication.
Base URL:

```
http://localhost:5000/api
```

```bash
# Example: Start a crawl
curl -X POST http://localhost:5000/api/start_crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}' \
  --cookie-jar cookies.txt

# Check crawl status
curl http://localhost:5000/api/crawl_status \
  --cookie cookies.txt
```
**Authentication:** All API endpoints require session-based authentication. See the Authentication guide for details.
## API Overview
The LibreCrawl API is organized into seven main categories:
- **Authentication & Sessions** (5 endpoints): User registration, login, logout, and session management with tier-based access control.
- **Crawl Control** (4 endpoints): Start, stop, pause, and resume website crawls with real-time control over the crawling process.
- **Status & Data Retrieval** (2 endpoints): Real-time crawl status, statistics, URL data, link relationships, and visualization graphs.
- **Settings & Configuration** (4 endpoints): Manage crawler settings, JavaScript rendering, filters, proxies, and advanced configurations.
- **Export & Filtering** (2 endpoints): Export crawl data in CSV, JSON, or XML formats with customizable fields and issue filtering.
- **Debug & Monitoring** (2 endpoints): Memory monitoring, performance profiling, and system diagnostics for crawler instances.
- **Getting Started** (tutorial): Step-by-step guide to building your first application with the LibreCrawl API.
## API Endpoints at a Glance
### Authentication

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/register | Create a new user account |
| POST | /api/login | Authenticate and create session |
| POST | /api/guest-login | Create guest session (limited access) |
| POST | /api/logout | End current session |
| GET | /api/user/info | Get current user information |
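
For example, a browser-side login might look like the sketch below. The credential field names are assumptions here; see the Authentication documentation for the exact request schema.

```javascript
// Minimal login sketch. The "username" and "password" field names are
// assumptions; consult the Authentication guide for the exact schema.
async function login(username, password) {
  const response = await fetch('/api/login', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ username, password })
  });
  const data = await response.json();
  if (!data.success) {
    throw new Error(data.error);
  }
  // On success the server sets the session cookie automatically
  return data;
}
```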
### Crawl Control

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/start_crawl | Start a new website crawl |
| POST | /api/stop_crawl | Stop the active crawl |
| POST | /api/pause_crawl | Pause the current crawl |
| POST | /api/resume_crawl | Resume a paused crawl |
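
A typical control flow starts a crawl, then pauses and resumes it as needed. The sketch below assumes that only /api/start_crawl takes a request body and the other control endpoints are bare POSTs:

```javascript
// Start a crawl, then pause and resume it. Assumes only start_crawl
// needs a JSON body; the other control endpoints are bare POSTs.
async function runCrawl(url) {
  await fetch('/api/start_crawl', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url })
  });

  await fetch('/api/pause_crawl', { method: 'POST' });  // Pause the crawl
  await fetch('/api/resume_crawl', { method: 'POST' }); // Resume it
}
```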
### Status & Data

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/crawl_status | Get real-time crawl status and data |
| GET | /api/visualization_data | Get graph visualization data |
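
Both endpoints are simple GETs authenticated by the session cookie, for example:

```javascript
// Fetch the current crawl status and the link graph for visualization.
async function fetchCrawlData() {
  const status = await (await fetch('/api/crawl_status')).json();
  const graph = await (await fetch('/api/visualization_data')).json();
  return { status, graph };
}
```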
### Settings & Configuration

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/get_settings | Retrieve current user settings |
| POST | /api/save_settings | Save user settings |
| POST | /api/reset_settings | Reset settings to defaults |
| POST | /api/update_crawler_settings | Apply settings to active crawler |
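
A common pattern is read-modify-write: fetch the current settings, change a field, and save the result. The maxDepth field below is purely hypothetical, and the envelope unwrapping assumes the standard response format described under Common Patterns:

```javascript
// Read-modify-write settings sketch. The "maxDepth" field is hypothetical;
// inspect the response of /api/get_settings for the real schema. Assumes
// the standard { success, data } response envelope.
async function updateSettings() {
  const current = await (await fetch('/api/get_settings')).json();
  const settings = { ...current.data, maxDepth: 5 };
  const response = await fetch('/api/save_settings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(settings)
  });
  return response.json();
}
```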
### Export & Filtering

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/export_data | Export crawl data in multiple formats |
| POST | /api/filter_issues | Filter issues by exclusion patterns |
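
As a sketch, an export request might name the desired format in the body and return the file contents. The format field name is an assumption; check the Export documentation for the exact schema.

```javascript
// Export crawl data as CSV. The "format" field name is an assumption.
async function exportCrawl() {
  const response = await fetch('/api/export_data', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ format: 'csv' })
  });
  // Treat the export as a downloadable file
  return response.blob();
}
```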
### Debug & Monitoring

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/debug/memory | Get memory stats for all crawler instances |
| GET | /api/debug/memory/profile | Get detailed memory breakdown by component |
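
Both debug endpoints are plain GETs, e.g.:

```javascript
// Inspect memory usage across crawler instances.
async function fetchMemoryStats() {
  return (await fetch('/api/debug/memory')).json();
}
```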
## Common Patterns

### Request Format

All POST requests should include the `Content-Type: application/json` header and send data as JSON in the request body:

```http
POST /api/start_crawl HTTP/1.1
Host: localhost:5000
Content-Type: application/json
Cookie: session=...

{
  "url": "https://example.com"
}
```
### Response Format

All API responses return JSON with a consistent structure:

```json
{
  "success": true,
  "message": "Operation completed successfully",
  "data": {
    // Response data
  }
}
```
Error responses include an error message:

```json
{
  "success": false,
  "error": "Error message describing what went wrong"
}
```
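
Because every response carries the same envelope, a small wrapper can centralize error handling. A minimal sketch:

```javascript
// Generic helper that unwraps the { success, ... } response envelope.
async function api(path, options = {}) {
  const response = await fetch(path, {
    headers: { 'Content-Type': 'application/json' },
    ...options
  });
  const data = await response.json();
  if (!data.success) {
    throw new Error(data.error || 'Unknown API error');
  }
  return data;
}
```

Usage: `await api('/api/stop_crawl', { method: 'POST' })`.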
### HTTP Status Codes
- 200 OK - Request successful
- 400 Bad Request - Invalid request data or validation error
- 401 Unauthorized - Authentication required or session invalid
- 500 Internal Server Error - Server error occurred
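
In practice, check the HTTP status before parsing the body; a 401 usually means the session is missing or expired. A sketch, with an application-specific login redirect as the recovery step:

```javascript
// Status-aware fetch: re-authenticate on 401, surface other errors.
async function apiFetch(path, options = {}) {
  const response = await fetch(path, options);
  if (response.status === 401) {
    window.location.href = '/login'; // Application-specific login page
    return null;
  }
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```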
## Rate Limiting & Access Control

### Tier System

LibreCrawl uses a tier-based access control system:

- **Guest Tier** - Limited to 3 crawls per 24 hours (IP-based tracking), read-only access
- **User Tier** - Unlimited crawls, basic settings access, data export
- **Extra Tier** - All User features plus JavaScript rendering, custom filters, CSS customization
- **Admin Tier** - Full access to all features including advanced settings (concurrency, memory limits, proxy configuration)
### Guest Rate Limiting

Guest users are limited to 3 crawls per 24-hour period, tracked by IP address. To determine the client IP, the API checks the following headers in order:

1. `CF-Connecting-IP` (Cloudflare)
2. `X-Forwarded-For` (Proxy)
3. `X-Real-IP` (Nginx)
4. `REMOTE_ADDR` (Direct connection)
## Polling Pattern

LibreCrawl uses HTTP polling instead of WebSockets for real-time updates. Your application should poll the `/api/crawl_status` endpoint at regular intervals (recommended: 1 second) during an active crawl:
```javascript
async function pollCrawlStatus() {
  const response = await fetch('/api/crawl_status');
  const data = await response.json();

  // Update UI with crawl data
  updateCrawlUI(data);

  // Continue polling if crawl is still running
  if (data.status !== 'completed') {
    setTimeout(pollCrawlStatus, 1000);
  }
}
```
## Next Steps
Ready to start building with the LibreCrawl API? Check out these resources:
- Getting Started Guide - Build your first application
- Authentication Documentation - Learn about session management
- Crawling Control - Master the crawling workflow
- GitHub Repository - View source code and examples