## Quick Start
The LibreCrawl API is a RESTful HTTP API that provides programmatic access to all crawling functionality. All endpoints return JSON responses and use session-based authentication.
Base URL:

```
http://localhost:5000/api
```

```bash
# Example: Start a crawl
curl -X POST http://localhost:5000/api/start_crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}' \
  --cookie-jar cookies.txt

# Check crawl status
curl http://localhost:5000/api/crawl_status \
  --cookie cookies.txt
```
**Authentication:** All API endpoints require session-based authentication. See the Authentication guide for details.
## API Overview
The LibreCrawl API is organized into seven main categories:
- **Authentication & Sessions** (5 endpoints): User registration, login, logout, and session management with tier-based access control.
- **Crawl Control** (4 endpoints): Start, stop, pause, and resume website crawls with real-time control over the crawling process.
- **Status & Data Retrieval** (2 endpoints): Real-time crawl status, statistics, URL data, link relationships, and visualization graphs.
- **Settings & Configuration** (4 endpoints): Manage crawler settings, JavaScript rendering, filters, proxies, and advanced configurations.
- **Export & Filtering** (2 endpoints): Export crawl data in CSV, JSON, or XML formats with customizable fields and issue filtering.
- **Debug & Monitoring** (2 endpoints): Memory monitoring, performance profiling, and system diagnostics for crawler instances.
- **Getting Started** (tutorial): Step-by-step guide to building your first application with the LibreCrawl API.
## API Endpoints at a Glance
### Authentication

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/register | Create a new user account |
| POST | /api/login | Authenticate and create session |
| POST | /api/guest-login | Create guest session (limited access) |
| POST | /api/logout | End current session |
| GET | /api/user/info | Get current user information |
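
For example, a browser-side login might look like the sketch below. The credential field names are assumptions here; see the Authentication documentation for the exact request schema.

```javascript
// Minimal login sketch. The "username" and "password" field names are
// assumptions; consult the Authentication guide for the exact schema.
async function login(username, password) {
  const response = await fetch('/api/login', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ username, password })
  });
  const data = await response.json();
  if (!data.success) {
    throw new Error(data.error);
  }
  // On success the server sets the session cookie automatically
  return data;
}
```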
### Crawl Control

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/start_crawl | Start a new website crawl |
| POST | /api/stop_crawl | Stop the active crawl |
| POST | /api/pause_crawl | Pause the current crawl |
| POST | /api/resume_crawl | Resume a paused crawl |
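
A typical control flow starts a crawl, then pauses and resumes it as needed. The sketch below assumes that only /api/start_crawl takes a request body and the other control endpoints are bare POSTs:

```javascript
// Start a crawl, then pause and resume it. Assumes only start_crawl
// needs a JSON body; the other control endpoints are bare POSTs.
async function runCrawl(url) {
  await fetch('/api/start_crawl', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url })
  });

  await fetch('/api/pause_crawl', { method: 'POST' });  // Pause the crawl
  await fetch('/api/resume_crawl', { method: 'POST' }); // Resume it
}
```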
### Status & Data

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/crawl_status | Get real-time crawl status and data |
| GET | /api/visualization_data | Get graph visualization data |
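
Both endpoints are simple GETs authenticated by the session cookie, for example:

```javascript
// Fetch the current crawl status and the link graph for visualization.
async function fetchCrawlData() {
  const status = await (await fetch('/api/crawl_status')).json();
  const graph = await (await fetch('/api/visualization_data')).json();
  return { status, graph };
}
```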
### Settings & Configuration

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/get_settings | Retrieve current user settings |
| POST | /api/save_settings | Save user settings |
| POST | /api/reset_settings | Reset settings to defaults |
| POST | /api/update_crawler_settings | Apply settings to active crawler |
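
A common pattern is read-modify-write: fetch the current settings, change a field, and save the result. The maxDepth field below is purely hypothetical, and the envelope unwrapping assumes the standard response format described under Common Patterns:

```javascript
// Read-modify-write settings sketch. The "maxDepth" field is hypothetical;
// inspect the response of /api/get_settings for the real schema. Assumes
// the standard { success, data } response envelope.
async function updateSettings() {
  const current = await (await fetch('/api/get_settings')).json();
  const settings = { ...current.data, maxDepth: 5 };
  const response = await fetch('/api/save_settings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(settings)
  });
  return response.json();
}
```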
### Export & Filtering

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/export_data | Export crawl data in multiple formats |
| POST | /api/filter_issues | Filter issues by exclusion patterns |
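
As a sketch, an export request might name the desired format in the body and return the file contents. The format field name is an assumption; check the Export documentation for the exact schema.

```javascript
// Export crawl data as CSV. The "format" field name is an assumption.
async function exportCrawl() {
  const response = await fetch('/api/export_data', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ format: 'csv' })
  });
  // Treat the export as a downloadable file
  return response.blob();
}
```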
### Debug & Monitoring

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/debug/memory | Get memory stats for all crawler instances |
| GET | /api/debug/memory/profile | Get detailed memory breakdown by component |
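
Both debug endpoints are plain GETs, e.g.:

```javascript
// Inspect memory usage across crawler instances.
async function fetchMemoryStats() {
  return (await fetch('/api/debug/memory')).json();
}
```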
## Common Patterns

### Request Format

All POST requests should include the `Content-Type: application/json` header and send data as JSON in the request body:

```http
POST /api/start_crawl HTTP/1.1
Host: localhost:5000
Content-Type: application/json
Cookie: session=...

{
  "url": "https://example.com"
}
```
### Response Format

All API responses return JSON with a consistent structure:

```json
{
  "success": true,
  "message": "Operation completed successfully",
  "data": {
    // Response data
  }
}
```
Error responses include an error message:

```json
{
  "success": false,
  "error": "Error message describing what went wrong"
}
```
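
Because every response carries the same envelope, a small wrapper can centralize error handling. A minimal sketch:

```javascript
// Generic helper that unwraps the { success, ... } response envelope.
async function api(path, options = {}) {
  const response = await fetch(path, {
    headers: { 'Content-Type': 'application/json' },
    ...options
  });
  const data = await response.json();
  if (!data.success) {
    throw new Error(data.error || 'Unknown API error');
  }
  return data;
}
```

Usage: `await api('/api/stop_crawl', { method: 'POST' })`.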
### HTTP Status Codes
- 200 OK - Request successful
- 400 Bad Request - Invalid request data or validation error
- 401 Unauthorized - Authentication required or session invalid
- 500 Internal Server Error - Server error occurred
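
In practice, check the HTTP status before parsing the body; a 401 usually means the session is missing or expired. A sketch, with an application-specific login redirect as the recovery step:

```javascript
// Status-aware fetch: re-authenticate on 401, surface other errors.
async function apiFetch(path, options = {}) {
  const response = await fetch(path, options);
  if (response.status === 401) {
    window.location.href = '/login'; // Application-specific login page
    return null;
  }
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```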
## Rate Limiting & Access Control

### Tier System

LibreCrawl uses a tier-based access control system:

- **Guest Tier** - Limited to 3 crawls per 24 hours (IP-based tracking), read-only access
- **User Tier** - Unlimited crawls, basic settings access, data export
- **Extra Tier** - All User features plus JavaScript rendering, custom filters, CSS customization
- **Admin Tier** - Full access to all features including advanced settings (concurrency, memory limits, proxy configuration)
### Guest Rate Limiting

Guest users are limited to 3 crawls per 24-hour period, tracked by IP address. To determine the client IP, the API checks the following headers in order:

1. `CF-Connecting-IP` (Cloudflare)
2. `X-Forwarded-For` (Proxy)
3. `X-Real-IP` (Nginx)
4. `REMOTE_ADDR` (Direct connection)
## Polling Pattern

LibreCrawl uses HTTP polling instead of WebSockets for real-time updates. Your application should poll the `/api/crawl_status` endpoint at regular intervals (recommended: 1 second) during an active crawl:
```javascript
async function pollCrawlStatus() {
  const response = await fetch('/api/crawl_status');
  const data = await response.json();

  // Update UI with crawl data
  updateCrawlUI(data);

  // Continue polling if crawl is still running
  if (data.status !== 'completed') {
    setTimeout(pollCrawlStatus, 1000);
  }
}
```
## Next Steps
Ready to start building with the LibreCrawl API? Check out these resources:
- Getting Started Guide - Build your first application
- Authentication Documentation - Learn about session management
- Crawling Control - Master the crawling workflow
- GitHub Repository - View source code and examples