This tutorial will walk you through building a complete application that authenticates, starts a crawl, monitors progress in real-time, and exports the results. You'll learn the essential patterns for working with the LibreCrawl API.
Prerequisites
- LibreCrawl installed and running (default: http://localhost:5000)
- Basic knowledge of HTTP/REST APIs
- Familiarity with JSON
- A programming language with an HTTP client library (JavaScript, Python, etc.)
What We'll Build
We'll create a simple crawler application that:
- Authenticates with the API
- Configures crawler settings
- Starts a website crawl
- Polls for real-time status updates
- Displays progress and statistics
- Exports results when complete
Tutorial
Step 1: Set Up Your Environment
First, ensure LibreCrawl is running. For development, we'll use local mode for easier testing:
# Start LibreCrawl in local mode (all users get admin tier)
python main.py --local
LibreCrawl should now be accessible at http://localhost:5000
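If you want to confirm the server is reachable before writing any crawl code, a quick Python check like the one below is enough. It is a minimal sketch that only assumes the running server answers HTTP requests at that address:
import requests

BASE_URL = 'http://localhost:5000'  # adjust if LibreCrawl runs on a different host or port

try:
    # Any HTTP response (even a redirect to a login page) means the server is up
    response = requests.get(BASE_URL, timeout=5)
    print(f'LibreCrawl is reachable (HTTP {response.status_code})')
except requests.ConnectionError:
    print(f'LibreCrawl does not appear to be running at {BASE_URL}')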
Step 2: Authentication
Create a session by logging in. In local mode, use guest login for quick access:
JavaScript Example
async function login() {
    const response = await fetch('http://localhost:5000/api/guest-login', {
        method: 'POST',
        credentials: 'include' // Important: lets the browser store and send the session cookie
    });
    const data = await response.json();
    if (data.success) {
        console.log('✓ Authenticated successfully');
        return true;
    } else {
        console.error('✗ Authentication failed:', data.error);
        return false;
    }
}

await login();
Python Example
import requests

BASE_URL = 'http://localhost:5000'
session = requests.Session()

def login():
    response = session.post(f'{BASE_URL}/api/guest-login')
    data = response.json()
    if data['success']:
        print('✓ Authenticated successfully')
        return True
    else:
        print(f'✗ Authentication failed: {data["error"]}')
        return False

login()
Alternative: For production applications, use /api/register and /api/login with username/password instead of guest login.
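A rough Python sketch of that flow is shown below, reusing the session and BASE_URL from the example above. The JSON field names (username, password) are assumptions about the request body; check the Authentication API reference for the exact schema:
def register_and_login(username, password):
    # Field names are assumed; consult the Authentication API docs for the exact payload
    register = session.post(
        f'{BASE_URL}/api/register',
        json={'username': username, 'password': password}
    )
    if not register.json().get('success'):
        print('Registration failed (the account may already exist)')

    login_resp = session.post(
        f'{BASE_URL}/api/login',
        json={'username': username, 'password': password}
    )
    data = login_resp.json()
    if data.get('success'):
        print(f'✓ Logged in as {username}')
        return True
    print(f'✗ Login failed: {data.get("error")}')
    return False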
Step 3: Configure Crawler Settings (Optional)
Before starting a crawl, you can customize settings. For this tutorial, we'll set basic limits:
JavaScript Example
async function configureSettings() {
    const settings = {
        maxDepth: 3,
        maxUrls: 100,
        crawlDelay: 0.5,
        respectRobotsTxt: true
    };
    const response = await fetch('http://localhost:5000/api/save_settings', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        credentials: 'include',
        body: JSON.stringify(settings)
    });
    const data = await response.json();
    if (data.success) {
        console.log('✓ Settings configured');
    } else {
        console.error('✗ Settings error:', data.error);
    }
}

await configureSettings();
Python Example
def configure_settings():
    settings = {
        'maxDepth': 3,
        'maxUrls': 100,
        'crawlDelay': 0.5,
        'respectRobotsTxt': True
    }
    response = session.post(
        f'{BASE_URL}/api/save_settings',
        json=settings
    )
    data = response.json()
    if data['success']:
        print('✓ Settings configured')
    else:
        print(f'✗ Settings error: {data["error"]}')

configure_settings()
Step 4: Start the Crawl
Now we'll start crawling a website. For this example, we'll use a small test site:
JavaScript Example
async function startCrawl(url) {
    const response = await fetch('http://localhost:5000/api/start_crawl', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        credentials: 'include',
        body: JSON.stringify({ url })
    });
    const data = await response.json();
    if (data.success) {
        console.log(`✓ Crawl started for ${url}`);
        return true;
    } else {
        console.error('✗ Crawl start failed:', data.error);
        return false;
    }
}

await startCrawl('https://example.com');
Python Example
def start_crawl(url):
    response = session.post(
        f'{BASE_URL}/api/start_crawl',
        json={'url': url}
    )
    data = response.json()
    if data['success']:
        print(f'✓ Crawl started for {url}')
        return True
    else:
        print(f'✗ Crawl start failed: {data["error"]}')
        return False

start_crawl('https://example.com')
Step 5: Monitor Progress in Real-Time
The most important part: polling for crawl status. We'll poll every second and display progress:
JavaScript Example
async function monitorCrawl() {
    let isRunning = true;
    while (isRunning) {
        const response = await fetch('http://localhost:5000/api/crawl_status', {
            credentials: 'include'
        });
        const data = await response.json();

        // Display progress
        console.clear();
        console.log('=== LibreCrawl Status ===');
        console.log(`Status: ${data.status}`);
        console.log(`Progress: ${data.progress.toFixed(1)}%`);
        console.log(`Discovered: ${data.stats.discovered} URLs`);
        console.log(`Crawled: ${data.stats.crawled} URLs`);
        console.log(`Depth: ${data.stats.depth}`);
        console.log(`Speed: ${data.stats.speed.toFixed(1)} URLs/sec`);
        console.log(`Issues: ${data.issues.length}`);

        // Check if crawl is complete
        if (data.status === 'completed') {
            console.log('\n✓ Crawl completed!');
            isRunning = false;
            return data;
        }

        // Wait 1 second before next poll
        await new Promise(resolve => setTimeout(resolve, 1000));
    }
}

const results = await monitorCrawl();
Python Example
import time
import os

def monitor_crawl():
    is_running = True
    while is_running:
        response = session.get(f'{BASE_URL}/api/crawl_status')
        data = response.json()

        # Display progress
        os.system('clear' if os.name == 'posix' else 'cls')
        print('=== LibreCrawl Status ===')
        print(f'Status: {data["status"]}')
        print(f'Progress: {data["progress"]:.1f}%')
        print(f'Discovered: {data["stats"]["discovered"]} URLs')
        print(f'Crawled: {data["stats"]["crawled"]} URLs')
        print(f'Depth: {data["stats"]["depth"]}')
        print(f'Speed: {data["stats"]["speed"]:.1f} URLs/sec')
        print(f'Issues: {len(data["issues"])}')

        # Check if crawl is complete
        if data['status'] == 'completed':
            print('\n✓ Crawl completed!')
            is_running = False
            return data

        # Wait 1 second before next poll
        time.sleep(1)

results = monitor_crawl()
Best Practice: Implement error handling and exponential backoff in case of network errors during polling.
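One way to do that in Python, reusing the session from the earlier examples, is to wrap each status request in a retry loop whose delay grows exponentially after a failure. The retry count and delays below are arbitrary starting points, not values prescribed by LibreCrawl:
import time
import requests

def get_status_with_retry(max_retries=5, base_delay=1.0):
    # Retry transient network failures with exponentially growing delays
    for attempt in range(max_retries):
        try:
            response = session.get(f'{BASE_URL}/api/crawl_status', timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as error:
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s, ...
            print(f'Status request failed ({error}), retrying in {delay:.0f}s')
            time.sleep(delay)
    raise RuntimeError('Could not reach LibreCrawl after several retries')
Inside monitor_crawl(), get_status_with_retry() can then replace the bare session.get() call.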
Step 6: Export Results
Once the crawl is complete, export the data in your preferred format:
JavaScript Example
async function exportResults(format = 'json') {
    const exportConfig = {
        format: format,
        fields: [
            'url', 'status_code', 'title', 'meta_description',
            'h1', 'word_count', 'response_time'
        ]
    };
    const response = await fetch('http://localhost:5000/api/export_data', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        credentials: 'include',
        body: JSON.stringify(exportConfig)
    });
    const data = await response.json();
    if (data.success) {
        // Decode base64 content
        const content = atob(data.content);

        // Save to file (browser example)
        const blob = new Blob([content], { type: data.mimetype });
        const url = URL.createObjectURL(blob);
        const a = document.createElement('a');
        a.href = url;
        a.download = data.filename;
        a.click();

        console.log(`✓ Exported to ${data.filename}`);
    } else {
        console.error('✗ Export failed:', data.error);
    }
}

await exportResults('json');
Python Example
import base64

def export_results(format='json'):
    export_config = {
        'format': format,
        'fields': [
            'url', 'status_code', 'title', 'meta_description',
            'h1', 'word_count', 'response_time'
        ]
    }
    response = session.post(
        f'{BASE_URL}/api/export_data',
        json=export_config
    )
    data = response.json()
    if data['success']:
        # Decode base64 content
        content = base64.b64decode(data['content'])

        # Save to file
        with open(data['filename'], 'wb') as f:
            f.write(content)
        print(f'✓ Exported to {data["filename"]}')
    else:
        print(f'✗ Export failed: {data["error"]}')

export_results('json')
Complete Example Application
Here's a complete working example that ties everything together:
JavaScript (Node.js)
// node-fetch (v2, CommonJS) does not manage cookies, so this client
// captures the session cookie itself and resends it on every request.
const fetch = require('node-fetch');
const fs = require('fs');

const BASE_URL = 'http://localhost:5000';

class LibreCrawlClient {
    constructor(baseUrl = BASE_URL) {
        this.baseUrl = baseUrl;
        this.cookies = {};
    }

    async request(endpoint, options = {}) {
        const headers = { ...(options.headers || {}) };
        const cookieHeader = Object.entries(this.cookies)
            .map(([name, value]) => `${name}=${value}`)
            .join('; ');
        if (cookieHeader) headers.Cookie = cookieHeader;

        const response = await fetch(`${this.baseUrl}${endpoint}`, { ...options, headers });

        // Capture any session cookies returned by the server
        const setCookies = response.headers.raw()['set-cookie'] || [];
        for (const cookie of setCookies) {
            const pair = cookie.split(';')[0];
            const idx = pair.indexOf('=');
            this.cookies[pair.slice(0, idx)] = pair.slice(idx + 1);
        }

        return response.json();
    }

    async login() {
        const data = await this.request('/api/guest-login', { method: 'POST' });
        if (!data.success) throw new Error(data.error);
        console.log('✓ Authenticated');
    }

    async configure(settings) {
        const data = await this.request('/api/save_settings', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify(settings)
        });
        if (!data.success) throw new Error(data.error);
        console.log('✓ Settings configured');
    }

    async startCrawl(url) {
        const data = await this.request('/api/start_crawl', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ url })
        });
        if (!data.success) throw new Error(data.error);
        console.log(`✓ Crawl started for ${url}`);
    }

    async getStatus() {
        return await this.request('/api/crawl_status');
    }

    async waitForCompletion() {
        while (true) {
            const status = await this.getStatus();
            console.log(`Progress: ${status.progress.toFixed(1)}% | ` +
                        `Crawled: ${status.stats.crawled} | ` +
                        `Issues: ${status.issues.length}`);
            if (status.status === 'completed') {
                console.log('✓ Crawl completed!');
                return status;
            }
            await new Promise(r => setTimeout(r, 1000));
        }
    }

    async export(format, fields) {
        const data = await this.request('/api/export_data', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ format, fields })
        });
        if (!data.success) throw new Error(data.error);
        const content = Buffer.from(data.content, 'base64');
        fs.writeFileSync(data.filename, content);
        console.log(`✓ Exported to ${data.filename}`);
    }
}

// Usage
(async () => {
    const client = new LibreCrawlClient();
    await client.login();
    await client.configure({ maxDepth: 3, maxUrls: 100 });
    await client.startCrawl('https://example.com');
    await client.waitForCompletion();
    await client.export('json', ['url', 'status_code', 'title']);
})();
🎉 Congratulations!
You've successfully built a complete LibreCrawl API application. You now know how to:
- Authenticate with the API
- Configure crawler settings
- Start and monitor crawls
- Export results in multiple formats
Next Steps
Explore Advanced Features
- JavaScript Rendering: Enable enableJavaScript: true for React/Vue/Angular sites
- Custom Filters: Use includePatterns and excludePatterns for precise crawling
- Proxy Configuration: Set up proxyUrl for crawling from different IPs
- PageSpeed Integration: Enable enablePageSpeed with your Google API key (see the combined sketch below)
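As a sketch only, the payload below combines those options in a single /api/save_settings call, reusing the Python session from the earlier steps. The option names come from the list above, but the value formats (for example, whether include/exclude patterns are globs or regular expressions) are assumptions to verify against the Settings API reference:
advanced_settings = {
    'enableJavaScript': True,              # render JavaScript-heavy (React/Vue/Angular) pages
    'includePatterns': ['/blog/*'],        # pattern syntax assumed; confirm in the Settings API
    'excludePatterns': ['/admin/*'],
    'proxyUrl': 'http://proxy.example.com:8080',  # hypothetical proxy address
    'enablePageSpeed': True                # requires a Google API key; see the Settings API for the key field
}

response = session.post(f'{BASE_URL}/api/save_settings', json=advanced_settings)
print(response.json())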
Production Checklist
- Use username/password authentication instead of guest login
- Implement proper error handling and retry logic
- Set an appropriate crawlDelay to respect target servers
- Configure respectRobotsTxt: true for ethical crawling
- Monitor memory usage for large crawls
- Set up HTTPS and secure cookies for production deployment
Learn More
- Authentication API - Full authentication reference
- Crawl Control API - Advanced crawl management
- Settings API - Complete settings reference
- Status API - Real-time data access
- Export API - Data export options
- API Overview - Complete API reference
Example Projects
- SEO Audit Dashboard: Build a web dashboard that displays crawl results, issues, and visualizations
- Automated Monitor: Schedule daily crawls and email reports when issues are detected
- Content Inventory: Export all page titles, descriptions, and word counts for content audits
- Link Checker: Find all broken links across a website (a minimal sketch follows this list)
- Site Structure Analyzer: Visualize site architecture using the visualization API
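For instance, the Link Checker idea can reuse the export step from this tutorial and filter rows by status code. The sketch below assumes the decoded JSON export is an array of objects keyed by the requested field names; verify the exact structure against the Export API reference:
import base64
import json

def find_broken_links():
    # Export just the fields needed to spot broken pages
    response = session.post(
        f'{BASE_URL}/api/export_data',
        json={'format': 'json', 'fields': ['url', 'status_code']}
    )
    data = response.json()
    if not data.get('success'):
        raise RuntimeError(data.get('error'))

    # Assumes the export decodes to a list of {'url': ..., 'status_code': ...} objects
    rows = json.loads(base64.b64decode(data['content']))
    broken = [row for row in rows if int(row.get('status_code') or 0) >= 400]

    for row in broken:
        print(f"{row['status_code']}  {row['url']}")
    return broken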
Get Help
- GitHub Repository - Source code, issues, and discussions
- Report Issues - Bug reports and feature requests
- LibreCrawl Blog - Tutorials, comparisons, and guides