The Status & Data API provides real-time access to crawl progress, collected URL data, link relationships, detected issues, and graph visualization data. These endpoints are designed for high-frequency polling during active crawls.
Polling Recommended: Poll /api/crawl_status once per second during active crawls for real-time UI updates. The endpoint is optimized for high-frequency requests.
Endpoints
GET /api/crawl_status
Get comprehensive real-time data about the current or most recent crawl, including status, statistics, discovered URLs, link relationships, detected issues, and performance metrics.
Authentication
Requires valid session cookie.
Request Parameters
No request parameters required.
Example Request
curl http://localhost:5000/api/crawl_status \
-b cookies.txt
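From a browser page served on the same origin, the equivalent call is a plain fetch; the session cookie set at login is sent automatically. A minimal sketch:

```javascript
// Sketch: same-origin fetch; the session cookie from login is sent automatically.
const response = await fetch('/api/crawl_status');
if (!response.ok) throw new Error(`Request failed: ${response.status}`);
const crawl = await response.json();
console.log(crawl.status, `${crawl.progress}%`);
```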
Success Response (200 OK)
The response contains comprehensive crawl data organized into several sections:
{
  "status": "running", // "running" | "paused" | "completed" | "idle"
  "stats": {
    "discovered": 1247,
    "crawled": 856,
    "depth": 4,
    "speed": 12.5, // URLs per second
    "pagespeed_results": {
      "performance": 89,
      "accessibility": 95,
      "best_practices": 92,
      "seo": 100
    }
  },
  "urls": [
    {
      "url": "https://example.com/page",
      "status_code": 200,
      "title": "Example Page",
      "meta_description": "This is an example page",
      "h1": "Main Heading",
      "h2": ["Subheading 1", "Subheading 2"],
      "h3": ["Section 1", "Section 2"],
      "word_count": 1250,
      "response_time": 0.342, // seconds
      "analytics": {
        "google_analytics": "UA-12345-1",
        "google_tag_manager": "GTM-XXXX"
      },
      "og_tags": {
        "og:title": "Example Page",
        "og:description": "Description",
        "og:image": "https://example.com/image.jpg"
      },
      "twitter_tags": {
        "twitter:card": "summary_large_image",
        "twitter:title": "Example Page"
      },
      "json_ld": [
        {
          "@context": "https://schema.org",
          "@type": "Article",
          "headline": "Example Article"
        }
      ],
      "internal_links": 24,
      "external_links": 5,
      "images": [
        {
          "src": "https://example.com/image.jpg",
          "alt": "Image description",
          "width": 800,
          "height": 600
        }
      ]
    }
  ],
  "links": [
    {
      "source_url": "https://example.com",
      "target_url": "https://example.com/page",
      "anchor_text": "Click here",
      "is_internal": true,
      "target_domain": "example.com",
      "target_status": "200",
      "placement": "navigation" // "navigation" | "content" | "footer" | "sidebar"
    }
  ],
  "issues": [
    {
      "url": "https://example.com/broken",
      "type": "404",
      "category": "broken_link",
      "issue": "Page not found",
      "details": "This URL returns a 404 status code"
    },
    {
      "url": "https://example.com/no-title",
      "type": "missing_title",
      "category": "seo",
      "issue": "Missing title tag",
      "details": "Page does not have a title tag"
    }
  ],
  "progress": 68.6, // percentage (0-100)
  "memory": {
    "current_mb": 245.8,
    "peak_mb": 312.4,
    "limit_mb": 2048
  },
  "is_running_pagespeed": false
}
Response Fields
Top-Level Fields
| Field | Type | Description |
|---|---|---|
| status | string | Current crawl state: "running", "paused", "completed", or "idle" |
| stats | object | Crawl statistics and metrics |
| urls | array | Array of discovered and crawled URLs with full metadata |
| links | array | Array of link relationships between URLs |
| issues | array | Array of detected SEO and technical issues |
| progress | number | Crawl progress percentage (0-100) |
| memory | object | Memory usage statistics |
| is_running_pagespeed | boolean | Whether PageSpeed analysis is currently running |
URL Object Fields
| Field | Type | Description |
|---|---|---|
| url | string | Full URL of the page |
| status_code | number | HTTP status code (200, 404, 500, etc.) |
| title | string | Page title from <title> tag |
| meta_description | string | Meta description content |
| h1 | string | First H1 heading on page |
| h2 | array | Array of H2 headings |
| h3 | array | Array of H3 headings |
| word_count | number | Total word count of page content |
| response_time | number | Page load time in seconds |
| analytics | object | Detected analytics tracking codes |
| og_tags | object | Open Graph meta tags |
| twitter_tags | object | Twitter Card meta tags |
| json_ld | array | Structured data (JSON-LD schemas) |
| internal_links | number | Count of internal links on page |
| external_links | number | Count of external links on page |
| images | array | Array of image objects with src, alt, width, height |
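As a sketch of how the urls array can be consumed, the snippet below flags pages that may need attention. The field names come from the response documented above; the thresholds are arbitrary examples, not values used by the crawler.

```javascript
// Sketch: flag pages worth a closer look using fields from the urls array.
// Thresholds here are arbitrary examples.
const data = await (await fetch('/api/crawl_status')).json();

const slowPages = data.urls.filter(u => u.response_time > 1.0);   // slower than 1 second
const thinPages = data.urls.filter(u => u.word_count < 300);      // low word count
const noDescription = data.urls.filter(u => !u.meta_description); // missing meta description

console.log(`${slowPages.length} slow, ${thinPages.length} thin, ` +
            `${noDescription.length} missing a meta description`);
```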
Link Object Fields
| Field | Type | Description |
|---|---|---|
| source_url | string | URL where the link was found |
| target_url | string | URL the link points to |
| anchor_text | string | Link anchor text |
| is_internal | boolean | True if link is internal to the site |
| target_domain | string | Domain of the target URL |
| target_status | string | HTTP status code of target (if crawled) |
| placement | string | Link location: "navigation", "content", "footer", "sidebar" |
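For example, the links array can be grouped by placement to see where internal links live on the site. This is a sketch using only the fields documented above:

```javascript
// Sketch: count internal links by placement ("navigation", "content", "footer", "sidebar").
const data = await (await fetch('/api/crawl_status')).json();

const byPlacement = {};
for (const link of data.links) {
  if (!link.is_internal) continue;
  byPlacement[link.placement] = (byPlacement[link.placement] || 0) + 1;
}
console.log(byPlacement); // e.g. { navigation: 120, content: 340, footer: 45 }
```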
Issue Object Fields
| Field | Type | Description |
|---|---|---|
| url | string | URL where issue was detected |
| type | string | Issue type identifier |
| category | string | Issue category: "seo", "broken_link", "performance", etc. |
| issue | string | Short issue description |
| details | string | Detailed explanation of the issue |
Common Issue Types
- 404 - Page not found
- 500 - Server error
- missing_title - No <title> tag
- missing_meta_description - No meta description
- duplicate_title - Title duplicated across pages
- duplicate_meta_description - Meta description duplicated
- missing_h1 - No H1 heading
- multiple_h1 - Multiple H1 headings
- thin_content - Low word count
- broken_link - Link to non-existent page
- slow_response - High response time
- missing_alt - Image without alt text
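A common use of the issues array is a summary panel that tallies issues by category and type. The sketch below assumes only the fields documented above:

```javascript
// Sketch: tally issues by "category/type" for a summary view.
const data = await (await fetch('/api/crawl_status')).json();

const summary = {};
for (const issue of data.issues) {
  const key = `${issue.category}/${issue.type}`;
  summary[key] = (summary[key] || 0) + 1;
}
console.table(summary); // e.g. "seo/missing_title": 12, "broken_link/404": 3
```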
GET /api/visualization_data
Get graph visualization data for rendering interactive site structure diagrams. Returns nodes (pages) and edges (links) formatted for graph libraries like Cytoscape.js.
Authentication
Requires valid session cookie.
Request Parameters
No request parameters required.
Example Request
curl http://localhost:5000/api/visualization_data \
-b cookies.txt
Success Response (200 OK)
{
  "success": true,
  "nodes": [
    {
      "data": {
        "id": "https://example.com",
        "label": "Homepage",
        "status": 200,
        "type": "page",
        "depth": 0
      }
    },
    {
      "data": {
        "id": "https://example.com/about",
        "label": "About Us",
        "status": 200,
        "type": "page",
        "depth": 1
      }
    }
  ],
  "edges": [
    {
      "data": {
        "source": "https://example.com",
        "target": "https://example.com/about",
        "label": "About Us" // anchor text
      }
    }
  ],
  "total_pages": 1247,
  "visualized_pages": 1000, // May be truncated for performance
  "truncated": true
}
Response Fields
| Field | Type | Description |
|---|---|---|
| success | boolean | Whether the request succeeded |
| nodes | array | Array of node objects representing pages |
| edges | array | Array of edge objects representing links |
| total_pages | number | Total number of crawled pages |
| visualized_pages | number | Number of pages included in visualization |
| truncated | boolean | Whether data was truncated (typically at 1000 nodes) |
Performance Note: For large sites (>1000 pages), the visualization data may be truncated to prevent browser performance issues. The most important pages (by link count and depth) are prioritized.
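When rendering, it is worth checking the truncated flag so users know they are seeing a partial graph. A minimal sketch:

```javascript
// Sketch: let the user know when the graph shows only a subset of crawled pages.
const viz = await (await fetch('/api/visualization_data')).json();
if (viz.truncated) {
  console.warn(`Graph truncated: showing ${viz.visualized_pages} of ${viz.total_pages} pages`);
}
```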
Using with Cytoscape.js
// Fetch visualization data
const response = await fetch('/api/visualization_data');
const data = await response.json();

// Initialize Cytoscape
const cy = cytoscape({
  container: document.getElementById('cy'),
  elements: {
    nodes: data.nodes,
    edges: data.edges
  },
  layout: { name: 'cose' },
  style: [
    {
      selector: 'node',
      style: {
        'label': 'data(label)',
        'background-color': '#0066cc'
      }
    }
  ]
});
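Because each node carries its HTTP status in data.status, you can add a selector that highlights error pages. This is a sketch building on the cy instance created above:

```javascript
// Sketch: color pages that returned 4xx/5xx responses red, using the node's "status" field.
cy.style()
  .selector('node[status >= 400]')
  .style({ 'background-color': '#cc0000' })
  .update();
```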
Polling Best Practices
Efficient Status Polling
let isPolling = false;
let retryDelay = 1000;

async function pollCrawlStatus() {
  if (!isPolling) return;

  try {
    const response = await fetch('/api/crawl_status');
    const data = await response.json();
    retryDelay = 1000; // reset backoff after a successful request

    // Update UI
    updateCrawlUI(data);

    // Continue polling if crawl is active
    if (data.status === 'running' || data.status === 'paused') {
      setTimeout(pollCrawlStatus, 1000);
    } else {
      isPolling = false;
      onCrawlComplete(data);
    }
  } catch (error) {
    console.error('Polling error:', error);
    // Retry with exponential backoff, capped at 30 seconds
    retryDelay = Math.min(retryDelay * 2, 30000);
    setTimeout(pollCrawlStatus, retryDelay);
  }
}

// Start polling
isPolling = true;
pollCrawlStatus();
Incremental Updates
For large crawls, consider processing incremental updates instead of re-rendering all data each time:
let lastUrlCount = 0;

function updateCrawlUI(data) {
  // Only process new URLs
  const newUrls = data.urls.slice(lastUrlCount);
  newUrls.forEach(url => addUrlToTable(url));
  lastUrlCount = data.urls.length;

  // Update statistics
  document.getElementById('discovered').textContent = data.stats.discovered;
  document.getElementById('crawled').textContent = data.stats.crawled;
  document.getElementById('progress').value = data.progress;
}
Data Size Considerations
The /api/crawl_status endpoint returns all crawled data in a single response. For very large crawls, this can result in multi-megabyte responses:
- 1,000 URLs: ~1-2 MB response
- 10,000 URLs: ~10-20 MB response
- 100,000+ URLs: Consider exporting instead
For large datasets, use the Export API to download data in chunks or filtered by specific fields.
Next Steps
- Export API - Export crawl data in CSV, JSON, or XML
- Settings API - Configure what data to collect
- Getting Started Guide - Build a complete integration