This tutorial walks you through building a complete application that authenticates, starts a crawl, monitors progress in real time, and exports the results. Along the way you'll learn the essential patterns for working with the LibreCrawl API.

Prerequisites

  • LibreCrawl installed and running (default: http://localhost:5000)
  • Basic knowledge of HTTP/REST APIs
  • Familiarity with JSON
  • A programming language with an HTTP client library (JavaScript, Python, etc.)

What We'll Build

We'll create a simple crawler application that:

  1. Authenticates with the API
  2. Configures crawler settings
  3. Starts a website crawl
  4. Polls for real-time status updates
  5. Displays progress and statistics
  6. Exports results when complete

Tutorial

Step 1: Set Up Your Environment

First, make sure LibreCrawl is running. For this tutorial we'll use local mode, which makes testing easier:

# Start LibreCrawl in local mode (all users get admin tier)
python main.py --local

LibreCrawl should now be accessible at http://localhost:5000.

Step 2: Authentication

Create a session by logging in. In local mode, use guest login for quick access:

JavaScript Example

// The browser keeps the session cookie for us, as long as every
// request is sent with credentials: 'include'

async function login() {
  const response = await fetch('http://localhost:5000/api/guest-login', {
    method: 'POST',
    credentials: 'include' // Important: include cookies
  });
  
  const data = await response.json();
  
  if (data.success) {
    console.log('✓ Authenticated successfully');
    return true;
  } else {
    console.error('✗ Authentication failed:', data.error);
    return false;
  }
}

await login();

Python Example

import requests

BASE_URL = 'http://localhost:5000'
session = requests.Session()

def login():
    response = session.post(f'{BASE_URL}/api/guest-login')
    data = response.json()
    
    if data['success']:
        print('✓ Authenticated successfully')
        return True
    else:
        print(f'✗ Authentication failed: {data["error"]}')
        return False

login()

Alternative: For production applications, use /api/register and /api/login with username/password instead of guest login.
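
If you want to try that flow now, here's a minimal Python sketch reusing the session from above. It assumes /api/register and /api/login accept a JSON body with username and password fields and return the same success/error shape as the other endpoints; adjust if your instance differs.

def register_and_login(username, password):
    # Assumed payload shape; change the field names if your instance expects something else
    credentials = {'username': username, 'password': password}

    # Registration only needs to succeed once; an "already exists" error on reruns is harmless
    session.post(f'{BASE_URL}/api/register', json=credentials)

    response = session.post(f'{BASE_URL}/api/login', json=credentials)
    data = response.json()

    if data.get('success'):
        print(f'✓ Logged in as {username}')
        return True
    print(f'✗ Login failed: {data.get("error")}')
    return False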

Step 3: Configure Crawler Settings (Optional)

Before starting a crawl, you can customize settings. For this tutorial, we'll set basic limits:

JavaScript Example

async function configureSettings() {
  const settings = {
    maxDepth: 3,
    maxUrls: 100,
    crawlDelay: 0.5,
    respectRobotsTxt: true
  };
  
  const response = await fetch('http://localhost:5000/api/save_settings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    credentials: 'include',
    body: JSON.stringify(settings)
  });
  
  const data = await response.json();
  
  if (data.success) {
    console.log('✓ Settings configured');
  } else {
    console.error('✗ Settings error:', data.error);
  }
}

await configureSettings();

Python Example

def configure_settings():
    settings = {
        'maxDepth': 3,
        'maxUrls': 100,
        'crawlDelay': 0.5,
        'respectRobotsTxt': True
    }
    
    response = session.post(
        f'{BASE_URL}/api/save_settings',
        json=settings
    )
    data = response.json()
    
    if data['success']:
        print('✓ Settings configured')
    else:
        print(f'✗ Settings error: {data["error"]}')

configure_settings()

Step 4: Start the Crawl

Now we'll start crawling a website. For this example, we'll use a small test site:

JavaScript Example

async function startCrawl(url) {
  const response = await fetch('http://localhost:5000/api/start_crawl', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    credentials: 'include',
    body: JSON.stringify({ url })
  });
  
  const data = await response.json();
  
  if (data.success) {
    console.log(`✓ Crawl started for ${url}`);
    return true;
  } else {
    console.error('✗ Crawl start failed:', data.error);
    return false;
  }
}

await startCrawl('https://example.com');

Python Example

def start_crawl(url):
    response = session.post(
        f'{BASE_URL}/api/start_crawl',
        json={'url': url}
    )
    data = response.json()
    
    if data['success']:
        print(f'✓ Crawl started for {url}')
        return True
    else:
        print(f'✗ Crawl start failed: {data["error"]}')
        return False

start_crawl('https://example.com')

Step 5: Monitor Progress in Real-Time

This is the core of the application: polling the crawl status endpoint. We'll poll once per second and display progress:

JavaScript Example

async function monitorCrawl() {
  let isRunning = true;
  
  while (isRunning) {
    const response = await fetch('http://localhost:5000/api/crawl_status', {
      credentials: 'include'
    });
    
    const data = await response.json();
    
    // Display progress
    console.clear();
    console.log('=== LibreCrawl Status ===');
    console.log(`Status: ${data.status}`);
    console.log(`Progress: ${data.progress.toFixed(1)}%`);
    console.log(`Discovered: ${data.stats.discovered} URLs`);
    console.log(`Crawled: ${data.stats.crawled} URLs`);
    console.log(`Depth: ${data.stats.depth}`);
    console.log(`Speed: ${data.stats.speed.toFixed(1)} URLs/sec`);
    console.log(`Issues: ${data.issues.length}`);
    
    // Check if crawl is complete
    if (data.status === 'completed') {
      console.log('\n✓ Crawl completed!');
      isRunning = false;
      return data;
    }
    
    // Wait 1 second before next poll
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
}

const results = await monitorCrawl();

Python Example

import time
import os

def monitor_crawl():
    is_running = True
    
    while is_running:
        response = session.get(f'{BASE_URL}/api/crawl_status')
        data = response.json()
        
        # Display progress
        os.system('clear' if os.name == 'posix' else 'cls')
        print('=== LibreCrawl Status ===')
        print(f'Status: {data["status"]}')
        print(f'Progress: {data["progress"]:.1f}%')
        print(f'Discovered: {data["stats"]["discovered"]} URLs')
        print(f'Crawled: {data["stats"]["crawled"]} URLs')
        print(f'Depth: {data["stats"]["depth"]}')
        print(f'Speed: {data["stats"]["speed"]:.1f} URLs/sec')
        print(f'Issues: {len(data["issues"])}')
        
        # Check if crawl is complete
        if data['status'] == 'completed':
            print('\n✓ Crawl completed!')
            is_running = False
            return data
        
        # Wait 1 second before next poll
        time.sleep(1)

results = monitor_crawl()

Best Practice: Implement error handling and exponential backoff in case of network errors during polling.
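
For example, here's a minimal Python version of the polling loop with retries and exponential backoff, reusing the session and BASE_URL from earlier (the timeout and backoff constants are arbitrary starting points):

import time
import requests

def get_status_with_retry(max_retries=5, base_delay=1.0):
    # Retry transient network failures, doubling the wait each time
    for attempt in range(max_retries):
        try:
            response = session.get(f'{BASE_URL}/api/crawl_status', timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as err:
            wait = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
            print(f'Network error ({err}), retrying in {wait:.0f}s...')
            time.sleep(wait)
    raise RuntimeError('Could not reach LibreCrawl after repeated retries')

def monitor_crawl_with_backoff(poll_interval=1.0):
    while True:
        data = get_status_with_retry()
        print(f'{data["status"]}: {data["progress"]:.1f}% '
              f'({data["stats"]["crawled"]} crawled, {len(data["issues"])} issues)')
        if data['status'] == 'completed':
            return data
        time.sleep(poll_interval)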

Step 6: Export Results

Once the crawl is complete, export the data in your preferred format:

JavaScript Example

async function exportResults(format = 'json') {
  const exportConfig = {
    format: format,
    fields: [
      'url', 'status_code', 'title', 'meta_description',
      'h1', 'word_count', 'response_time'
    ]
  };
  
  const response = await fetch('http://localhost:5000/api/export_data', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    credentials: 'include',
    body: JSON.stringify(exportConfig)
  });
  
  const data = await response.json();
  
  if (data.success) {
    // Decode base64 content
    const content = atob(data.content);
    
    // Save to file (browser example)
    const blob = new Blob([content], { type: data.mimetype });
    const url = URL.createObjectURL(blob);
    const a = document.createElement('a');
    a.href = url;
    a.download = data.filename;
    a.click();
    
    console.log(`✓ Exported to ${data.filename}`);
  } else {
    console.error('✗ Export failed:', data.error);
  }
}

await exportResults('json');

Python Example

import base64

def export_results(format='json'):
    export_config = {
        'format': format,
        'fields': [
            'url', 'status_code', 'title', 'meta_description',
            'h1', 'word_count', 'response_time'
        ]
    }
    
    response = session.post(
        f'{BASE_URL}/api/export_data',
        json=export_config
    )
    data = response.json()
    
    if data['success']:
        # Decode base64 content
        content = base64.b64decode(data['content'])
        
        # Save to file
        with open(data['filename'], 'wb') as f:
            f.write(content)
        
        print(f'✓ Exported to {data["filename"]}')
    else:
        print(f'✗ Export failed: {data["error"]}')

export_results('json')

Complete Example Application

Here's a complete working example that ties everything together:

JavaScript (Node.js)

// node-fetch v2 (v3 is ESM-only and can't be loaded with require())
const fetch = require('node-fetch');
const fs = require('fs');

const BASE_URL = 'http://localhost:5000';

class LibreCrawlClient {
  constructor(baseUrl = BASE_URL) {
    this.baseUrl = baseUrl;
    this.cookie = ''; // session cookie returned by the server
  }

  async request(endpoint, options = {}) {
    // node-fetch does not manage cookies like a browser, so attach the stored
    // session cookie ourselves and capture any new one from the response
    const response = await fetch(`${this.baseUrl}${endpoint}`, {
      ...options,
      headers: { ...(options.headers || {}), Cookie: this.cookie }
    });

    const setCookie = response.headers.get('set-cookie');
    if (setCookie) this.cookie = setCookie.split(';')[0];

    return response.json();
  }

  async login() {
    const data = await this.request('/api/guest-login', { method: 'POST' });
    if (!data.success) throw new Error(data.error);
    console.log('✓ Authenticated');
  }

  async configure(settings) {
    const data = await this.request('/api/save_settings', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(settings)
    });
    if (!data.success) throw new Error(data.error);
    console.log('✓ Settings configured');
  }

  async startCrawl(url) {
    const data = await this.request('/api/start_crawl', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ url })
    });
    if (!data.success) throw new Error(data.error);
    console.log(`✓ Crawl started for ${url}`);
  }

  async getStatus() {
    return await this.request('/api/crawl_status');
  }

  async waitForCompletion() {
    while (true) {
      const status = await this.getStatus();
      console.log(`Progress: ${status.progress.toFixed(1)}% | ` +
                  `Crawled: ${status.stats.crawled} | ` +
                  `Issues: ${status.issues.length}`);
      
      if (status.status === 'completed') {
        console.log('✓ Crawl completed!');
        return status;
      }
      
      await new Promise(r => setTimeout(r, 1000));
    }
  }

  async export(format, fields) {
    const data = await this.request('/api/export_data', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ format, fields })
    });
    
    if (!data.success) throw new Error(data.error);
    
    const content = Buffer.from(data.content, 'base64');
    fs.writeFileSync(data.filename, content);
    console.log(`✓ Exported to ${data.filename}`);
  }
}

// Usage
(async () => {
  const client = new LibreCrawlClient();
  
  await client.login();
  await client.configure({ maxDepth: 3, maxUrls: 100 });
  await client.startCrawl('https://example.com');
  await client.waitForCompletion();
  await client.export('json', ['url', 'status_code', 'title']);
})().catch(err => console.error('✗', err.message));
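
Python

If you've been following along in Python, the functions from Steps 2–6 compose into an equivalent end-to-end script (nothing new here; it simply reuses the functions defined above):

# Reuses login(), configure_settings(), start_crawl(),
# monitor_crawl() and export_results() from Steps 2-6
if __name__ == '__main__':
    if not login():
        raise SystemExit('Authentication failed')

    configure_settings()

    if start_crawl('https://example.com'):
        monitor_crawl()
        export_results('json')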

🎉 Congratulations!

You've successfully built a complete LibreCrawl API application. You now know how to:

  • Authenticate with the API
  • Configure crawler settings
  • Start and monitor crawls
  • Export results in multiple formats

Next Steps

Explore Advanced Features

  • JavaScript Rendering: Enable enableJavaScript: true for React/Vue/Angular sites
  • Custom Filters: Use includePatterns and excludePatterns for precise crawling
  • Proxy Configuration: Set up proxyUrl for crawling from different IPs
  • PageSpeed Integration: Enable enablePageSpeed with your Google API key
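
These all look like ordinary settings keys, so a reasonable way to enable them is through the same /api/save_settings call used in Step 3. Here's a Python sketch of a combined payload; the key names come from the list above, but the value formats (glob-style patterns, proxy URL string, API-key field name) are assumptions to verify against your LibreCrawl version.

advanced_settings = {
    'enableJavaScript': True,                 # render JS-heavy (React/Vue/Angular) pages
    'includePatterns': ['https://example.com/blog/*'],      # assumed glob-style patterns
    'excludePatterns': ['https://example.com/admin/*'],
    'proxyUrl': 'http://user:pass@proxy.example.com:8080',  # placeholder proxy
    'enablePageSpeed': True,
    'pageSpeedApiKey': 'YOUR_GOOGLE_API_KEY'  # assumed name for the Google API key field
}

response = session.post(f'{BASE_URL}/api/save_settings', json=advanced_settings)
print(response.json())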

Production Checklist

  • Use username/password authentication instead of guest login
  • Implement proper error handling and retry logic
  • Set appropriate crawlDelay to respect target servers
  • Configure respectRobotsTxt: true for ethical crawling
  • Monitor memory usage for large crawls
  • Set up HTTPS and secure cookies for production deployment

Learn More

Example Projects

  • SEO Audit Dashboard: Build a web dashboard that displays crawl results, issues, and visualizations
  • Automated Monitor: Schedule daily crawls and email reports when issues are detected
  • Content Inventory: Export all page titles, descriptions, and word counts for content audits
  • Link Checker: Find all broken links across a website
  • Site Structure Analyzer: Visualize site architecture using the visualization API
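
As a starting point for the Link Checker idea, here's a small Python sketch. It assumes the JSON export from Step 6 is a flat list of page records containing the url and status_code fields requested there; the filename is a hypothetical example.

import json

def find_broken_links(export_path):
    # Assumes the exported JSON is a list of page records with
    # 'url' and 'status_code' keys (the fields requested in Step 6)
    with open(export_path, encoding='utf-8') as f:
        pages = json.load(f)

    broken = [page for page in pages if page.get('status_code', 0) >= 400]
    for page in broken:
        print(f'{page["status_code"]}  {page["url"]}')
    return broken

find_broken_links('librecrawl_export.json')  # hypothetical filename from Step 6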

Get Help