The Status & Data API provides real-time access to crawl progress, collected URL data, link relationships, detected issues, and graph visualization data. These endpoints are designed for high-frequency polling during active crawls.

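Polling Recommended: Poll /api/crawl_status once per second during active crawls for real-time UI updates. The endpoint is optimized for high-frequency requests, but each response carries the full crawl data set, so see Data Size Considerations for large crawls.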

Endpoints

GET /api/crawl_status

Get comprehensive real-time data about the current or most recent crawl, including status, statistics, discovered URLs, link relationships, detected issues, and performance metrics.

Authentication

Requires valid session cookie.

Request Parameters

No request parameters required.

Example Request

curl http://localhost:5000/api/crawl_status \
  -b cookies.txt

Success Response (200 OK)

The response contains comprehensive crawl data organized into several sections:

{
  "status": "running", // "running" | "paused" | "completed" | "idle"
  "stats": {
    "discovered": 1247,
    "crawled": 856,
    "depth": 4,
    "speed": 12.5, // URLs per second
    "pagespeed_results": {
      "performance": 89,
      "accessibility": 95,
      "best_practices": 92,
      "seo": 100
    }
  },
  "urls": [
    {
      "url": "https://example.com/page",
      "status_code": 200,
      "title": "Example Page",
      "meta_description": "This is an example page",
      "h1": "Main Heading",
      "h2": ["Subheading 1", "Subheading 2"],
      "h3": ["Section 1", "Section 2"],
      "word_count": 1250,
      "response_time": 0.342, // seconds
      "analytics": {
        "google_analytics": "UA-12345-1",
        "google_tag_manager": "GTM-XXXX"
      },
      "og_tags": {
        "og:title": "Example Page",
        "og:description": "Description",
        "og:image": "https://example.com/image.jpg"
      },
      "twitter_tags": {
        "twitter:card": "summary_large_image",
        "twitter:title": "Example Page"
      },
      "json_ld": [
        {
          "@context": "https://schema.org",
          "@type": "Article",
          "headline": "Example Article"
        }
      ],
      "internal_links": 24,
      "external_links": 5,
      "images": [
        {
          "src": "https://example.com/image.jpg",
          "alt": "Image description",
          "width": 800,
          "height": 600
        }
      ]
    }
  ],
  "links": [
    {
      "source_url": "https://example.com",
      "target_url": "https://example.com/page",
      "anchor_text": "Click here",
      "is_internal": true,
      "target_domain": "example.com",
      "target_status": "200",
      "placement": "navigation" // "navigation" | "content" | "footer" | "sidebar"
    }
  ],
  "issues": [
    {
      "url": "https://example.com/broken",
      "type": "404",
      "category": "broken_link",
      "issue": "Page not found",
      "details": "This URL returns a 404 status code"
    },
    {
      "url": "https://example.com/no-title",
      "type": "missing_title",
      "category": "seo",
      "issue": "Missing title tag",
      "details": "Page does not have a title tag"
    }
  ],
  "progress": 68.6, // percentage (0-100)
  "memory": {
    "current_mb": 245.8,
    "peak_mb": 312.4,
    "limit_mb": 2048
  },
  "is_running_pagespeed": false
}

Response Fields

Top-Level Fields

Field                 Type     Description
status                string   Current crawl state: "running", "paused", "completed", or "idle"
stats                 object   Crawl statistics and metrics
urls                  array    Array of discovered and crawled URLs with full metadata
links                 array    Array of link relationships between URLs
issues                array    Array of detected SEO and technical issues
progress              number   Crawl progress percentage (0-100)
memory                object   Memory usage statistics
is_running_pagespeed  boolean  Whether PageSpeed analysis is currently running
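
As an illustration of how the top-level fields fit together, the sketch below condenses a status payload into a one-line summary and a memory-usage figure. The helper name summarizeCrawl and the 80% warning threshold are assumptions made for this example; they are not part of the API.

```javascript
// Hypothetical helper: condense a /api/crawl_status payload into a short summary.
function summarizeCrawl(data) {
  const { status, stats, progress, memory } = data;

  // Share of the configured memory limit currently in use, as a percentage.
  const memoryPct = (memory.current_mb / memory.limit_mb) * 100;

  return {
    // "running" and "paused" both mean the crawl has not finished yet.
    active: status === 'running' || status === 'paused',
    line: `${status}: ${stats.crawled}/${stats.discovered} URLs (${progress.toFixed(1)}%)`,
    memoryPct: Math.round(memoryPct),
    // 80% is an arbitrary example threshold, not a documented limit.
    memoryWarning: memoryPct >= 80
  };
}
```

With the sample response above, the summary line would read "running: 856/1247 URLs (68.6%)".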

URL Object Fields

Field             Type    Description
url               string  Full URL of the page
status_code       number  HTTP status code (200, 404, 500, etc.)
title             string  Page title from <title> tag
meta_description  string  Meta description content
h1                string  First H1 heading on page
h2                array   Array of H2 headings
h3                array   Array of H3 headings
word_count        number  Total word count of page content
response_time     number  Page load time in seconds
analytics         object  Detected analytics tracking codes
og_tags           object  Open Graph meta tags
twitter_tags      object  Twitter Card meta tags
json_ld           array   Structured data (JSON-LD schemas)
internal_links    number  Count of internal links on page
external_links    number  Count of external links on page
images            array   Array of image objects with src, alt, width, height
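
The URL objects can be processed client-side without further requests. A minimal sketch, assuming response_time is always reported in seconds as documented above; slowestPages and its limit parameter are hypothetical names chosen for this example.

```javascript
// Hypothetical helper: list the slowest pages from the "urls" array.
function slowestPages(urls, limit = 5) {
  return [...urls] // copy first so the original array is not reordered
    .filter(page => typeof page.response_time === 'number')
    .sort((a, b) => b.response_time - a.response_time) // slowest first
    .slice(0, limit)
    .map(page => ({
      url: page.url,
      ms: Math.round(page.response_time * 1000) // seconds to milliseconds
    }));
}
```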

Link Object Fields

Field          Type     Description
source_url     string   URL where the link was found
target_url     string   URL the link points to
anchor_text    string   Link anchor text
is_internal    boolean  True if link is internal to the site
target_domain  string   Domain of the target URL
target_status  string   HTTP status code of target (if crawled)
placement      string   Link location: "navigation", "content", "footer", "sidebar"
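
Because each link records both its source and the status of its target, the links array can be used to find which pages point at a failing URL. The sketch below is one way to do that. It assumes target_status is a numeric string such as "404" once the target has been crawled, and is missing or non-numeric otherwise; brokenLinksByTarget is a hypothetical helper.

```javascript
// Hypothetical helper: group links whose target returned a 4xx/5xx status.
function brokenLinksByTarget(links) {
  const byTarget = new Map();

  for (const link of links) {
    // target_status is documented as a string, so convert before comparing.
    const code = parseInt(link.target_status, 10);
    if (Number.isNaN(code) || code < 400) continue; // not crawled yet, or healthy

    if (!byTarget.has(link.target_url)) {
      byTarget.set(link.target_url, { status: code, sources: [] });
    }
    byTarget.get(link.target_url).sources.push(link.source_url);
  }

  return byTarget; // Map of target URL -> { status, sources[] }
}
```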

Issue Object Fields

Field     Type    Description
url       string  URL where issue was detected
type      string  Issue type identifier
category  string  Issue category: "seo", "broken_link", "performance", etc.
issue     string  Short issue description
details   string  Detailed explanation of the issue

Common Issue Types

  • 404 - Page not found
  • 500 - Server error
  • missing_title - No <title> tag
  • missing_meta_description - No meta description
  • duplicate_title - Title duplicated across pages
  • duplicate_meta_description - Meta description duplicated
  • missing_h1 - No H1 heading
  • multiple_h1 - Multiple H1 headings
  • thin_content - Low word count
  • broken_link - Link to non-existent page
  • slow_response - High response time
  • missing_alt - Image without alt text
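
For a dashboard, the issues array is often easier to read as counts per category and type. A minimal sketch of that aggregation follows; countIssues is a hypothetical helper, and it relies only on the category and type fields documented above.

```javascript
// Hypothetical helper: tally issues by category, then by type within each category.
function countIssues(issues) {
  const counts = {};

  for (const { category, type } of issues) {
    if (!counts[category]) {
      counts[category] = { total: 0, types: {} };
    }
    counts[category].total += 1;
    counts[category].types[type] = (counts[category].types[type] || 0) + 1;
  }

  return counts;
}
```

Applied to the two issues in the sample response, this yields one "broken_link" entry of type "404" and one "seo" entry of type "missing_title".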

GET /api/visualization_data

Get graph visualization data for rendering interactive site structure diagrams. Returns nodes (pages) and edges (links) formatted for graph libraries like Cytoscape.js.

Authentication

Requires valid session cookie.

Request Parameters

No request parameters required.

Example Request

curl http://localhost:5000/api/visualization_data \
  -b cookies.txt

Success Response (200 OK)

{
  "success": true,
  "nodes": [
    {
      "data": {
        "id": "https://example.com",
        "label": "Homepage",
        "status": 200,
        "type": "page",
        "depth": 0
      }
    },
    {
      "data": {
        "id": "https://example.com/about",
        "label": "About Us",
        "status": 200,
        "type": "page",
        "depth": 1
      }
    }
  ],
  "edges": [
    {
      "data": {
        "source": "https://example.com",
        "target": "https://example.com/about",
        "label": "About Us" // anchor text
      }
    }
  ],
  "total_pages": 1247,
  "visualized_pages": 1000, // May be truncated for performance
  "truncated": true
}

Response Fields

Field             Type     Description
success           boolean  Whether the request succeeded
nodes             array    Array of node objects representing pages
edges             array    Array of edge objects representing links
total_pages       number   Total number of crawled pages
visualized_pages  number   Number of pages included in visualization
truncated         boolean  Whether data was truncated (typically at 1000 nodes)

Performance Note: For large sites (>1000 pages), the visualization data may be truncated to prevent browser performance issues. The most important pages (by link count and depth) are prioritized.
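
When the data is truncated, it is worth telling the user how much of the site the graph actually covers. The sketch below builds such a message from the documented response fields; describeCoverage and the exact wording of the messages are assumptions for illustration.

```javascript
// Hypothetical helper: describe how much of the crawl the visualization covers.
function describeCoverage(viz) {
  if (!viz.success) return 'Visualization data unavailable';
  if (!viz.truncated) return `Showing all ${viz.total_pages} pages`;

  const pct = ((viz.visualized_pages / viz.total_pages) * 100).toFixed(1);
  return `Showing ${viz.visualized_pages} of ${viz.total_pages} pages (${pct}%)`;
}
```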

Using with Cytoscape.js

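// Fetch visualization data (top-level await requires an ES module or an async function)
const response = await fetch('/api/visualization_data');
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const data = await response.json();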

// Initialize Cytoscape
const cy = cytoscape({
  container: document.getElementById('cy'),
  elements: {
    nodes: data.nodes,
    edges: data.edges
  },
  layout: { name: 'cose' },
  style: [
    {
      selector: 'node',
      style: {
        'label': 'data(label)',
        'background-color': '#0066cc'
      }
    }
  ]
});
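
Even within the server-side limit, a graph can be too dense to read. One option is to trim it client-side before passing it to Cytoscape.js. The sketch below keeps only nodes up to a chosen depth, assuming every node carries the depth field shown in the sample response; limitDepth is a hypothetical helper. Edges are dropped when either endpoint is removed, since Cytoscape.js rejects edges that reference missing nodes.

```javascript
// Hypothetical helper: keep nodes up to maxDepth and the edges between them.
function limitDepth(viz, maxDepth) {
  const nodes = viz.nodes.filter(node => node.data.depth <= maxDepth);

  // IDs of the nodes that survived the filter, for fast edge lookups.
  const kept = new Set(nodes.map(node => node.data.id));

  // An edge is only valid if both of its endpoints are still present.
  const edges = viz.edges.filter(
    edge => kept.has(edge.data.source) && kept.has(edge.data.target)
  );

  return { nodes, edges };
}
```

The returned object has the same nodes/edges shape as the elements option used above, so it can be passed in place of data.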

Polling Best Practices

Efficient Status Polling

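let isPolling = false;
let retryDelay = 1000; // delay before the next retry after a failure, in milliseconds

async function pollCrawlStatus() {
  if (!isPolling) return;

  try {
    const response = await fetch('/api/crawl_status');
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    const data = await response.json();
    retryDelay = 1000; // reset the backoff after a successful request

    // Update UI
    updateCrawlUI(data);

    // Continue polling if crawl is active
    if (data.status === 'running' || data.status === 'paused') {
      setTimeout(pollCrawlStatus, 1000);
    } else {
      isPolling = false;
      onCrawlComplete(data);
    }
  } catch (error) {
    console.error('Polling error:', error);
    // Retry with exponential backoff: 2s, 4s, 8s, ... capped at 30s
    retryDelay = Math.min(retryDelay * 2, 30000);
    setTimeout(pollCrawlStatus, retryDelay);
  }
}

// Start polling
isPolling = true;
pollCrawlStatus();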

Incremental Updates

For large crawls, consider processing incremental updates instead of re-rendering all data each time:

let lastUrlCount = 0;

function updateCrawlUI(data) {
  // Only process new URLs (assumes the server appends to data.urls in a stable order)
  const newUrls = data.urls.slice(lastUrlCount);
  newUrls.forEach(url => addUrlToTable(url));
  lastUrlCount = data.urls.length;
  
  // Update statistics
  document.getElementById('discovered').textContent = data.stats.discovered;
  document.getElementById('crawled').textContent = data.stats.crawled;
  document.getElementById('progress').value = data.progress;
}
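
The same slice-based idea can be applied to the links and issues arrays. The sketch below wraps it in a small reusable tracker. It assumes each array only grows during a crawl and treats a shorter array as the start of a new crawl; createTracker is a hypothetical helper, not part of the API.

```javascript
// Hypothetical helper: returns a function that yields only items not seen before.
function createTracker() {
  let seen = 0; // number of items already handed out

  return function takeNew(items) {
    // A shorter array than last time suggests a new crawl, so start over.
    if (items.length < seen) seen = 0;

    const fresh = items.slice(seen);
    seen = items.length;
    return fresh;
  };
}
```

In use, each array gets its own tracker, for example const newIssues = createTracker(), which is then called with data.issues on every poll.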

Data Size Considerations

The /api/crawl_status endpoint returns all crawled data in a single response. For very large crawls, this can result in multi-megabyte responses:

  • 1,000 URLs: ~1-2 MB response
  • 10,000 URLs: ~10-20 MB response
  • 100,000+ URLs: Consider exporting instead

For large datasets, use the Export API to download data in chunks or filtered by specific fields.
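
The figures above work out to roughly 1-2 KB per crawled URL. The sketch below turns that into a quick estimate and also measures the size of a payload that has already been received. Both function names are hypothetical, and the measured value is the re-serialized JSON size, which only approximates what was sent over the network.

```javascript
// Rough estimate based on the figures above: about 1-2 KB per crawled URL.
function estimateResponseMB(urlCount) {
  return { low: urlCount / 1000, high: (urlCount * 2) / 1000 };
}

// Approximate size of a parsed payload by re-serializing it as UTF-8 JSON.
function payloadSizeMB(data) {
  const bytes = new TextEncoder().encode(JSON.stringify(data)).length;
  return bytes / (1024 * 1024);
}
```

A client could log payloadSizeMB(data) during polling and switch to the Export API once the value grows beyond what the UI handles comfortably.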

Next Steps