The Status & Data API provides real-time access to crawl progress, collected URL data, link relationships, detected issues, and graph visualization data. These endpoints are designed for high-frequency polling during active crawls.
Polling Recommended: Poll /api/crawl_status once per second during active crawls for real-time UI updates. The endpoint is optimized for high-frequency requests.
Endpoints
GET /api/crawl_status
Get comprehensive real-time data about the current or most recent crawl, including status, statistics, discovered URLs, link relationships, detected issues, and performance metrics.
Authentication
Requires valid session cookie.
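For browser clients, a minimal sketch of an authenticated request (credentials: 'same-origin' is the browser default, but being explicit avoids surprises; the error handling assumes the server rejects unauthenticated requests with a non-2xx status):
const response = await fetch('/api/crawl_status', {
  credentials: 'same-origin' // send the session cookie obtained at login
});
if (!response.ok) {
  // A non-2xx status here typically means the session is missing or expired
  throw new Error(`Request failed: ${response.status}`);
}
const data = await response.json();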
Request Parameters
Optional query parameters for incremental updates (recommended for large crawls):
| Parameter | Type | Description |
|---|---|---|
| url_since | integer | Return only URLs starting at this index (e.g., 100 returns URLs from index 100 onward) |
| link_since | integer | Return only links starting at this index |
| issue_since | integer | Return only issues starting at this index |
Example Requests
Full Status (Initial Request)
curl http://localhost:5000/api/crawl_status \
-b cookies.txt
Incremental Update (Subsequent Requests)
curl "http://localhost:5000/api/crawl_status?url_since=100&link_since=500&issue_since=10" \
-b cookies.txt
Success Response (200 OK)
The response contains comprehensive crawl data organized into several sections:
{
  "status": "running", // "running" | "paused" | "completed" | "idle"
  "stats": {
    "discovered": 1247,
    "crawled": 856,
    "depth": 4,
    "speed": 12.5, // URLs per second
    "pagespeed_results": {
      "performance": 89,
      "accessibility": 95,
      "best_practices": 92,
      "seo": 100
    }
  },
  "urls": [
    {
      "url": "https://example.com/page",
      "status_code": 200,
      "title": "Example Page",
      "meta_description": "This is an example page",
      "h1": "Main Heading",
      "h2": ["Subheading 1", "Subheading 2"],
      "h3": ["Section 1", "Section 2"],
      "word_count": 1250,
      "response_time": 0.342, // seconds
      "analytics": {
        "google_analytics": "UA-12345-1",
        "google_tag_manager": "GTM-XXXX"
      },
      "og_tags": {
        "og:title": "Example Page",
        "og:description": "Description",
        "og:image": "https://example.com/image.jpg"
      },
      "twitter_tags": {
        "twitter:card": "summary_large_image",
        "twitter:title": "Example Page"
      },
      "json_ld": [
        {
          "@context": "https://schema.org",
          "@type": "Article",
          "headline": "Example Article"
        }
      ],
      "internal_links": 24,
      "external_links": 5,
      "images": [
        {
          "src": "https://example.com/image.jpg",
          "alt": "Image description",
          "width": 800,
          "height": 600
        }
      ]
    }
  ],
  "links": [
    {
      "source_url": "https://example.com",
      "target_url": "https://example.com/page",
      "anchor_text": "Click here",
      "is_internal": true,
      "target_domain": "example.com",
      "target_status": "200",
      "placement": "navigation" // "navigation" | "content" | "footer" | "sidebar"
    }
  ],
  "issues": [
    {
      "url": "https://example.com/broken",
      "type": "404",
      "category": "broken_link",
      "issue": "Page not found",
      "details": "This URL returns a 404 status code"
    },
    {
      "url": "https://example.com/no-title",
      "type": "missing_title",
      "category": "seo",
      "issue": "Missing title tag",
      "details": "Page does not have a title tag"
    }
  ],
  "progress": 68.6, // percentage (0-100)
  "memory": {
    "current_mb": 245.8,
    "peak_mb": 312.4,
    "limit_mb": 2048
  },
  "is_running_pagespeed": false
}
Response Fields
Top-Level Fields
| Field | Type | Description |
|---|---|---|
| status | string | Current crawl state: "running", "paused", "completed", or "idle" |
| stats | object | Crawl statistics and metrics |
| urls | array | Array of discovered and crawled URLs with full metadata |
| links | array | Array of link relationships between URLs |
| issues | array | Array of detected SEO and technical issues |
| progress | number | Crawl progress percentage (0-100) |
| memory | object | Memory usage statistics |
| is_running_pagespeed | boolean | Whether PageSpeed analysis is currently running |
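As a quick orientation, a sketch that reads the top-level fields from a full (non-incremental) response:
const response = await fetch('/api/crawl_status');
const { status, stats, progress, urls, links, issues, memory } = await response.json();

console.log(`Status: ${status}, progress: ${progress}%`);
console.log(`Crawled ${stats.crawled} of ${stats.discovered} discovered URLs`);
console.log(`${urls.length} URLs, ${links.length} links, ${issues.length} issues`);
console.log(`Memory: ${memory.current_mb} MB (peak ${memory.peak_mb} MB)`);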
URL Object Fields
| Field | Type | Description |
|---|---|---|
| url | string | Full URL of the page |
| status_code | number | HTTP status code (200, 404, 500, etc.) |
| title | string | Page title from <title> tag |
| meta_description | string | Meta description content |
| h1 | string | First H1 heading on page |
| h2 | array | Array of H2 headings |
| h3 | array | Array of H3 headings |
| word_count | number | Total word count of page content |
| response_time | number | Page load time in seconds |
| analytics | object | Detected analytics tracking codes |
| og_tags | object | Open Graph meta tags |
| twitter_tags | object | Twitter Card meta tags |
| json_ld | array | Structured data (JSON-LD schemas) |
| internal_links | number | Count of internal links on page |
| external_links | number | Count of external links on page |
| images | array | Array of image objects with src, alt, width, height |
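As an illustration (where data is the parsed /api/crawl_status response), a sketch that flags pages needing attention; the 0.5 s response-time threshold is an arbitrary example value, not something the API defines:
// Flag slow pages and pages missing basic SEO metadata.
// The 0.5 s threshold is an example, not an API value.
const flagged = data.urls.filter(page =>
  page.response_time > 0.5 || !page.title || !page.meta_description
);
flagged.forEach(page =>
  console.log(`${page.url} (${page.status_code}): ${page.response_time}s`)
);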
Link Object Fields
| Field | Type | Description |
|---|---|---|
| source_url | string | URL where the link was found |
| target_url | string | URL the link points to |
| anchor_text | string | Link anchor text |
| is_internal | boolean | True if link is internal to the site |
| target_domain | string | Domain of the target URL |
| target_status | string | HTTP status code of target (if crawled) |
| placement | string | Link location: "navigation", "content", "footer", "sidebar" |
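For example, again working from the parsed response data, a sketch that counts links per placement bucket and collects broken internal links:
// Count links per placement ("navigation", "content", "footer", "sidebar").
const byPlacement = {};
for (const link of data.links) {
  byPlacement[link.placement] = (byPlacement[link.placement] || 0) + 1;
}

// target_status is a string (and absent for uncrawled targets),
// so parse it before comparing; a NaN comparison is safely false.
const brokenInternal = data.links.filter(link =>
  link.is_internal && parseInt(link.target_status, 10) >= 400
);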
Issue Object Fields
| Field | Type | Description |
|---|---|---|
| url | string | URL where issue was detected |
| type | string | Issue type identifier |
| category | string | Issue category: "seo", "broken_link", "performance", etc. |
| issue | string | Short issue description |
| details | string | Detailed explanation of the issue |
Common Issue Types
- 404 - Page not found
- 500 - Server error
- missing_title - No <title> tag
- missing_meta_description - No meta description
- duplicate_title - Title duplicated across pages
- duplicate_meta_description - Meta description duplicated
- missing_h1 - No H1 heading
- multiple_h1 - Multiple H1 headings
- thin_content - Low word count
- broken_link - Link to non-existent page
- slow_response - High response time
- missing_alt - Image without alt text
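Given the parsed response data, a sketch that tallies issues by category, which is handy for rendering per-bucket counts in a dashboard:
// Tally issues by category, e.g. { seo: 12, broken_link: 3, ... }
const issueCounts = data.issues.reduce((counts, issue) => {
  counts[issue.category] = (counts[issue.category] || 0) + 1;
  return counts;
}, {});
console.table(issueCounts);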
GET /api/visualization_data
Get graph visualization data for rendering interactive site structure diagrams. Returns nodes (pages) and edges (links) formatted for graph libraries like Cytoscape.js.
Authentication
Requires valid session cookie.
Request Parameters
No request parameters required.
Example Request
curl http://localhost:5000/api/visualization_data \
-b cookies.txt
Success Response (200 OK)
{
  "success": true,
  "nodes": [
    {
      "data": {
        "id": "https://example.com",
        "label": "Homepage",
        "status": 200,
        "type": "page",
        "depth": 0
      }
    },
    {
      "data": {
        "id": "https://example.com/about",
        "label": "About Us",
        "status": 200,
        "type": "page",
        "depth": 1
      }
    }
  ],
  "edges": [
    {
      "data": {
        "source": "https://example.com",
        "target": "https://example.com/about",
        "label": "About Us" // anchor text
      }
    }
  ],
  "total_pages": 1247,
  "visualized_pages": 1000, // May be truncated for performance
  "truncated": true
}
Response Fields
| Field | Type | Description |
|---|---|---|
| success | boolean | Whether the request succeeded |
| nodes | array | Array of node objects representing pages |
| edges | array | Array of edge objects representing links |
| total_pages | number | Total number of crawled pages |
| visualized_pages | number | Number of pages included in visualization |
| truncated | boolean | Whether data was truncated (typically at 1000 nodes) |
Performance Note: For large sites (>1000 pages), the visualization data may be truncated to prevent browser performance issues. The most important pages (by link count and depth) are prioritized.
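A small sketch that checks the truncated flag before rendering; showNotice here is a hypothetical UI helper, not part of this API:
const response = await fetch('/api/visualization_data');
const data = await response.json();
if (data.truncated) {
  // showNotice is a placeholder for your own UI feedback mechanism.
  showNotice(`Showing ${data.visualized_pages} of ${data.total_pages} pages`);
}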
Using with Cytoscape.js
// Fetch visualization data (the session cookie is sent automatically on same-origin requests)
const response = await fetch('/api/visualization_data');
const data = await response.json();

// Initialize Cytoscape (assumes the cytoscape.js library is already loaded)
const cy = cytoscape({
  container: document.getElementById('cy'),
  elements: {
    nodes: data.nodes,
    edges: data.edges
  },
  layout: { name: 'cose' },
  style: [
    {
      selector: 'node',
      style: {
        'label': 'data(label)',
        'background-color': '#0066cc'
      }
    }
  ]
});
Polling Best Practices
Efficient Status Polling
let isPolling = false;
let retryDelay = 1000; // grows on consecutive errors

async function pollCrawlStatus() {
  if (!isPolling) return;
  try {
    const response = await fetch('/api/crawl_status');
    const data = await response.json();
    retryDelay = 1000; // reset backoff after a successful poll

    // Update UI
    updateCrawlUI(data);

    // Continue polling if crawl is active
    if (data.status === 'running' || data.status === 'paused') {
      setTimeout(pollCrawlStatus, 1000);
    } else {
      isPolling = false;
      onCrawlComplete(data);
    }
  } catch (error) {
    console.error('Polling error:', error);
    // Retry with exponential backoff, capped at 30 seconds
    retryDelay = Math.min(retryDelay * 2, 30000);
    setTimeout(pollCrawlStatus, retryDelay);
  }
}

// Start polling
isPolling = true;
pollCrawlStatus();
Incremental Updates
For large crawls (>1000 URLs), use incremental polling to reduce bandwidth and improve performance. Request only new data since the last poll:
let lastUrlCount = 0;
let lastLinkCount = 0;
let lastIssueCount = 0;

let allUrls = [];
let allLinks = [];
let allIssues = [];

async function pollCrawlStatus() {
  // Request only new data
  const params = new URLSearchParams({
    url_since: lastUrlCount,
    link_since: lastLinkCount,
    issue_since: lastIssueCount
  });
  const response = await fetch(`/api/crawl_status?${params}`);
  const data = await response.json();

  // Accumulate new data
  allUrls.push(...data.urls);
  allLinks.push(...data.links);
  allIssues.push(...data.issues);

  // Update tracking indices
  lastUrlCount = allUrls.length;
  lastLinkCount = allLinks.length;
  lastIssueCount = allIssues.length;

  // Update UI with only new items
  data.urls.forEach(url => addUrlToTable(url));
  data.links.forEach(link => addLinkToGraph(link));
  data.issues.forEach(issue => addIssueToList(issue));

  // Update statistics (always sent in full)
  document.getElementById('discovered').textContent = data.stats.discovered;
  document.getElementById('crawled').textContent = data.stats.crawled;
  document.getElementById('progress').value = data.progress;

  // Continue polling
  if (data.status === 'running') {
    setTimeout(pollCrawlStatus, 1000);
  }
}
Performance Benefit: For a 10,000 URL crawl, incremental polling reduces per-request bandwidth from ~20MB to ~200KB (with gzip compression), preventing browser memory issues and improving responsiveness.
Data Size Considerations
The /api/crawl_status endpoint can return large responses. Response sizes (with gzip compression enabled):
- 1,000 URLs: ~100-200 KB compressed (~1-2 MB uncompressed)
- 10,000 URLs: ~1-2 MB compressed (~10-20 MB uncompressed)
- 100,000+ URLs: Use incremental polling or export instead
Recommendations:
- For crawls <1,000 URLs: Standard polling is fine
- For crawls 1,000-50,000 URLs: Use incremental polling (query parameters)
- For crawls >50,000 URLs: Use the Export API for batch data access
All responses are automatically gzip-compressed by the server, reducing bandwidth by ~90% for typical crawl data.
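As a rough sketch, those thresholds can be encoded in a small helper that picks an access strategy from the discovered count (the cutoffs mirror the recommendations above; the function itself is hypothetical):
// Hypothetical helper mapping crawl size to the recommended access strategy.
function chooseAccessStrategy(discoveredUrls) {
  if (discoveredUrls < 1000) return 'full';          // standard polling
  if (discoveredUrls <= 50000) return 'incremental'; // *_since query parameters
  return 'export';                                   // Export API
}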
Next Steps
- Export API - Export crawl data in CSV, JSON, or XML
- Settings API - Configure what data to collect
- Getting Started Guide - Build a complete integration