The Status & Data API provides real-time access to crawl progress, collected URL data, link relationships, detected issues, and graph visualization data. These endpoints are designed for high-frequency polling during active crawls.
Polling Recommended: Poll /api/crawl_status once per second during active crawls for real-time UI updates. The endpoint is optimized for high-frequency requests.
Endpoints
GET /api/crawl_status
Get comprehensive real-time data about the current or most recent crawl, including status, statistics, discovered URLs, link relationships, detected issues, and performance metrics.
Authentication
Requires valid session cookie.
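For browser clients, a minimal sketch of an authenticated request (credentials: 'same-origin' is the browser default, but being explicit avoids surprises; the error handling assumes the server rejects unauthenticated requests with a non-2xx status):
const response = await fetch('/api/crawl_status', {
  credentials: 'same-origin' // send the session cookie obtained at login
});
if (!response.ok) {
  // A non-2xx status here typically means the session is missing or expired
  throw new Error(`Request failed: ${response.status}`);
}
const data = await response.json();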
Request Parameters
Optional query parameters for incremental updates (recommended for large crawls):
| Parameter | Type | Description |
|---|---|---|
| url_since | integer | Return only URLs starting at this index (e.g., 100 returns URLs from index 100 onward) |
| link_since | integer | Return only links starting at this index |
| issue_since | integer | Return only issues starting at this index |
Example Requests
Full Status (Initial Request)
curl http://localhost:5000/api/crawl_status \
-b cookies.txt
Incremental Update (Subsequent Requests)
curl "http://localhost:5000/api/crawl_status?url_since=100&link_since=500&issue_since=10" \
-b cookies.txt
Success Response (200 OK)
The response contains comprehensive crawl data organized into several sections:
{
  "status": "running", // "running" | "paused" | "completed" | "idle"
  "stats": {
    "discovered": 1247,
    "crawled": 856,
    "depth": 4,
    "speed": 12.5, // URLs per second
    "pagespeed_results": {
      "performance": 89,
      "accessibility": 95,
      "best_practices": 92,
      "seo": 100
    }
  },
  "urls": [
    {
      "url": "https://example.com/page",
      "status_code": 200,
      "title": "Example Page",
      "meta_description": "This is an example page",
      "h1": "Main Heading",
      "h2": ["Subheading 1", "Subheading 2"],
      "h3": ["Section 1", "Section 2"],
      "word_count": 1250,
      "response_time": 0.342, // seconds
      "analytics": {
        "google_analytics": "UA-12345-1",
        "google_tag_manager": "GTM-XXXX"
      },
      "og_tags": {
        "og:title": "Example Page",
        "og:description": "Description",
        "og:image": "https://example.com/image.jpg"
      },
      "twitter_tags": {
        "twitter:card": "summary_large_image",
        "twitter:title": "Example Page"
      },
      "json_ld": [
        {
          "@context": "https://schema.org",
          "@type": "Article",
          "headline": "Example Article"
        }
      ],
      "internal_links": 24,
      "external_links": 5,
      "images": [
        {
          "src": "https://example.com/image.jpg",
          "alt": "Image description",
          "width": 800,
          "height": 600
        }
      ]
    }
  ],
  "links": [
    {
      "source_url": "https://example.com",
      "target_url": "https://example.com/page",
      "anchor_text": "Click here",
      "is_internal": true,
      "target_domain": "example.com",
      "target_status": "200",
      "placement": "navigation" // "navigation" | "content" | "footer" | "sidebar"
    }
  ],
  "issues": [
    {
      "url": "https://example.com/broken",
      "type": "404",
      "category": "broken_link",
      "issue": "Page not found",
      "details": "This URL returns a 404 status code"
    },
    {
      "url": "https://example.com/no-title",
      "type": "missing_title",
      "category": "seo",
      "issue": "Missing title tag",
      "details": "Page does not have a title tag"
    }
  ],
  "progress": 68.6, // percentage (0-100)
  "memory": {
    "current_mb": 245.8,
    "peak_mb": 312.4,
    "limit_mb": 2048
  },
  "is_running_pagespeed": false
}
Response Fields
Top-Level Fields
| Field | Type | Description |
|---|---|---|
| status | string | Current crawl state: "running", "paused", "completed", or "idle" |
| stats | object | Crawl statistics and metrics |
| urls | array | Array of discovered and crawled URLs with full metadata |
| links | array | Array of link relationships between URLs |
| issues | array | Array of detected SEO and technical issues |
| progress | number | Crawl progress percentage (0-100) |
| memory | object | Memory usage statistics |
| is_running_pagespeed | boolean | Whether PageSpeed analysis is currently running |
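As a quick orientation, a sketch that reads the top-level fields from a full (non-incremental) response:
const response = await fetch('/api/crawl_status');
const { status, stats, progress, urls, links, issues, memory } = await response.json();

console.log(`Status: ${status}, progress: ${progress}%`);
console.log(`Crawled ${stats.crawled} of ${stats.discovered} discovered URLs`);
console.log(`${urls.length} URLs, ${links.length} links, ${issues.length} issues`);
console.log(`Memory: ${memory.current_mb} MB (peak ${memory.peak_mb} MB)`);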
URL Object Fields
| Field | Type | Description |
|---|---|---|
| url | string | Full URL of the page |
| status_code | number | HTTP status code (200, 404, 500, etc.) |
| title | string | Page title from <title> tag |
| meta_description | string | Meta description content |
| h1 | string | First H1 heading on page |
| h2 | array | Array of H2 headings |
| h3 | array | Array of H3 headings |
| word_count | number | Total word count of page content |
| response_time | number | Page load time in seconds |
| analytics | object | Detected analytics tracking codes |
| og_tags | object | Open Graph meta tags |
| twitter_tags | object | Twitter Card meta tags |
| json_ld | array | Structured data (JSON-LD schemas) |
| internal_links | number | Count of internal links on page |
| external_links | number | Count of external links on page |
| images | array | Array of image objects with src, alt, width, height |
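As an illustration (where data is the parsed /api/crawl_status response), a sketch that flags pages needing attention; the 0.5 s response-time threshold is an arbitrary example value, not something the API defines:
// Flag slow pages and pages missing basic SEO metadata.
// The 0.5 s threshold is an example, not an API value.
const flagged = data.urls.filter(page =>
  page.response_time > 0.5 || !page.title || !page.meta_description
);
flagged.forEach(page =>
  console.log(`${page.url} (${page.status_code}): ${page.response_time}s`)
);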
Link Object Fields
| Field | Type | Description |
|---|---|---|
| source_url | string | URL where the link was found |
| target_url | string | URL the link points to |
| anchor_text | string | Link anchor text |
| is_internal | boolean | True if link is internal to the site |
| target_domain | string | Domain of the target URL |
| target_status | string | HTTP status code of target (if crawled) |
| placement | string | Link location: "navigation", "content", "footer", "sidebar" |
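For example, again working from the parsed response data, a sketch that counts links per placement bucket and collects broken internal links:
// Count links per placement ("navigation", "content", "footer", "sidebar").
const byPlacement = {};
for (const link of data.links) {
  byPlacement[link.placement] = (byPlacement[link.placement] || 0) + 1;
}

// target_status is a string (and absent for uncrawled targets),
// so parse it before comparing; a NaN comparison is safely false.
const brokenInternal = data.links.filter(link =>
  link.is_internal && parseInt(link.target_status, 10) >= 400
);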
Issue Object Fields
| Field | Type | Description |
|---|---|---|
| url | string | URL where issue was detected |
| type | string | Issue type identifier |
| category | string | Issue category: "seo", "broken_link", "performance", etc. |
| issue | string | Short issue description |
| details | string | Detailed explanation of the issue |
Common Issue Types
- 404 - Page not found
- 500 - Server error
- missing_title - No <title> tag
- missing_meta_description - No meta description
- duplicate_title - Title duplicated across pages
- duplicate_meta_description - Meta description duplicated
- missing_h1 - No H1 heading
- multiple_h1 - Multiple H1 headings
- thin_content - Low word count
- broken_link - Link to non-existent page
- slow_response - High response time
- missing_alt - Image without alt text
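Given the parsed response data, a sketch that tallies issues by category, which is handy for rendering per-bucket counts in a dashboard:
// Tally issues by category, e.g. { seo: 12, broken_link: 3, ... }
const issueCounts = data.issues.reduce((counts, issue) => {
  counts[issue.category] = (counts[issue.category] || 0) + 1;
  return counts;
}, {});
console.table(issueCounts);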
GET /api/visualization_data
Get graph visualization data for rendering interactive site structure diagrams. Returns nodes (pages) and edges (links) formatted for graph libraries like Cytoscape.js.
Authentication
Requires valid session cookie.
Request Parameters
No request parameters required.
Example Request
curl http://localhost:5000/api/visualization_data \
-b cookies.txt
Success Response (200 OK)
{
  "success": true,
  "nodes": [
    {
      "data": {
        "id": "https://example.com",
        "label": "Homepage",
        "status": 200,
        "type": "page",
        "depth": 0
      }
    },
    {
      "data": {
        "id": "https://example.com/about",
        "label": "About Us",
        "status": 200,
        "type": "page",
        "depth": 1
      }
    }
  ],
  "edges": [
    {
      "data": {
        "source": "https://example.com",
        "target": "https://example.com/about",
        "label": "About Us" // anchor text
      }
    }
  ],
  "total_pages": 1247,
  "visualized_pages": 1000, // May be truncated for performance
  "truncated": true
}
Response Fields
| Field | Type | Description |
|---|---|---|
| success | boolean | Whether the request succeeded |
| nodes | array | Array of node objects representing pages |
| edges | array | Array of edge objects representing links |
| total_pages | number | Total number of crawled pages |
| visualized_pages | number | Number of pages included in visualization |
| truncated | boolean | Whether data was truncated (typically at 1000 nodes) |
Performance Note: For large sites (>1000 pages), the visualization data may be truncated to prevent browser performance issues. The most important pages (by link count and depth) are prioritized.
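A small sketch that checks the truncated flag before rendering; showNotice here is a hypothetical UI helper, not part of this API:
const response = await fetch('/api/visualization_data');
const data = await response.json();
if (data.truncated) {
  // showNotice is a placeholder for your own UI feedback mechanism.
  showNotice(`Showing ${data.visualized_pages} of ${data.total_pages} pages`);
}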
Using with Cytoscape.js
// Fetch visualization data (the session cookie is sent automatically on same-origin requests)
const response = await fetch('/api/visualization_data');
const data = await response.json();

// Initialize Cytoscape (assumes the cytoscape.js library is already loaded)
const cy = cytoscape({
  container: document.getElementById('cy'),
  elements: {
    nodes: data.nodes,
    edges: data.edges
  },
  layout: { name: 'cose' },
  style: [
    {
      selector: 'node',
      style: {
        'label': 'data(label)',
        'background-color': '#0066cc'
      }
    }
  ]
});
Polling Best Practices
Efficient Status Polling
let isPolling = false;
let retryDelay = 1000; // grows on consecutive errors

async function pollCrawlStatus() {
  if (!isPolling) return;
  try {
    const response = await fetch('/api/crawl_status');
    const data = await response.json();
    retryDelay = 1000; // reset backoff after a successful poll

    // Update UI
    updateCrawlUI(data);

    // Continue polling if crawl is active
    if (data.status === 'running' || data.status === 'paused') {
      setTimeout(pollCrawlStatus, 1000);
    } else {
      isPolling = false;
      onCrawlComplete(data);
    }
  } catch (error) {
    console.error('Polling error:', error);
    // Retry with exponential backoff, capped at 30 seconds
    retryDelay = Math.min(retryDelay * 2, 30000);
    setTimeout(pollCrawlStatus, retryDelay);
  }
}

// Start polling
isPolling = true;
pollCrawlStatus();
Incremental Updates
For large crawls (>1000 URLs), use incremental polling to reduce bandwidth and improve performance. Request only new data since the last poll:
let lastUrlCount = 0;
let lastLinkCount = 0;
let lastIssueCount = 0;

let allUrls = [];
let allLinks = [];
let allIssues = [];

async function pollCrawlStatus() {
  // Request only new data
  const params = new URLSearchParams({
    url_since: lastUrlCount,
    link_since: lastLinkCount,
    issue_since: lastIssueCount
  });
  const response = await fetch(`/api/crawl_status?${params}`);
  const data = await response.json();

  // Accumulate new data
  allUrls.push(...data.urls);
  allLinks.push(...data.links);
  allIssues.push(...data.issues);

  // Update tracking indices
  lastUrlCount = allUrls.length;
  lastLinkCount = allLinks.length;
  lastIssueCount = allIssues.length;

  // Update UI with only new items
  data.urls.forEach(url => addUrlToTable(url));
  data.links.forEach(link => addLinkToGraph(link));
  data.issues.forEach(issue => addIssueToList(issue));

  // Update statistics (always sent in full)
  document.getElementById('discovered').textContent = data.stats.discovered;
  document.getElementById('crawled').textContent = data.stats.crawled;
  document.getElementById('progress').value = data.progress;

  // Continue polling
  if (data.status === 'running') {
    setTimeout(pollCrawlStatus, 1000);
  }
}
Performance Benefit: For a 10,000 URL crawl, incremental polling reduces per-request bandwidth from ~20MB to ~200KB (with gzip compression), preventing browser memory issues and improving responsiveness.
Data Size Considerations
The /api/crawl_status endpoint can return large responses. Response sizes (with gzip compression enabled):
- 1,000 URLs: ~100-200 KB compressed (~1-2 MB uncompressed)
- 10,000 URLs: ~1-2 MB compressed (~10-20 MB uncompressed)
- 100,000+ URLs: Use incremental polling or export instead
Recommendations:
- For crawls <1,000 URLs: Standard polling is fine
- For crawls 1,000-50,000 URLs: Use incremental polling (query parameters)
- For crawls >50,000 URLs: Use the Export API for batch data access
All responses are automatically gzip-compressed by the server, reducing bandwidth by ~90% for typical crawl data.
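As a rough sketch, those thresholds can be encoded in a small helper that picks an access strategy from the discovered count (the cutoffs mirror the recommendations above; the function itself is hypothetical):
// Hypothetical helper mapping crawl size to the recommended access strategy.
function chooseAccessStrategy(discoveredUrls) {
  if (discoveredUrls < 1000) return 'full';          // standard polling
  if (discoveredUrls <= 50000) return 'incremental'; // *_since query parameters
  return 'export';                                   // Export API
}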
Next Steps
- Export API - Export crawl data in CSV, JSON, or XML
- Settings API - Configure what data to collect
- Getting Started Guide - Build a complete integration