Losing hours of crawl progress to a server crash or browser restart is frustrating. Identifying duplicate content across thousands of pages by manually comparing titles and descriptions is tedious. LibreCrawl's latest update solves both problems with SQLite-powered database persistence and intelligent content duplication detection. Your crawls are now saved automatically as they run, enabling crash recovery and historical analysis, while the new duplication detector compares page content across your entire site to flag near-duplicates that harm SEO performance.
The Problem: Crawls Are Ephemeral
Before database persistence, LibreCrawl stored crawl data entirely in memory. This architecture worked well for small to medium crawls, but it meant that if your browser crashed, your server restarted, or you accidentally closed the tab, all crawl progress vanished. A four-hour crawl of 50,000 pages lost to a system update represented wasted time, wasted bandwidth, and wasted opportunity to analyze that data.
It also meant you couldn't build a history of how your site evolved over time. You might run weekly audits of your site, but each crawl existed in isolation. There was no easy way to compare this week's crawl results with last week's, track how duplicate content issues changed after making fixes, or monitor trends in page count, status code distribution, or issue frequency over time. The data existed only for as long as your browser session remained open.
Duplicate content detection faced its own challenges. While LibreCrawl flagged obvious issues like missing titles or meta descriptions on individual pages, it couldn't compare content across your entire site to identify pages with substantially similar content. These near-duplicates confuse search engines, dilute ranking signals, and waste crawl budget, but detecting them required manually reviewing hundreds or thousands of pages to spot patterns.
SQLite-Powered Database Persistence
LibreCrawl now saves all crawl data to a SQLite database as the crawl progresses. Every discovered URL, every extracted piece of metadata, every detected issue, and every internal link gets written to disk in real time. If your server crashes mid-crawl, the data persists. When LibreCrawl restarts, it detects the incomplete crawl and offers to resume from where it left off. No progress lost, no bandwidth wasted re-crawling pages you already analyzed.
The database schema is comprehensive. The crawls table stores high-level information about each crawl: the base URL, start time, completion time, configuration snapshot, number of URLs discovered and crawled, maximum depth reached, memory usage statistics, and current status. Each crawl gets a unique ID that connects it to all related data.
The crawled_urls table holds everything LibreCrawl extracts from each page: status code, content type, size, depth, title, meta description, all heading levels, word count, canonical URL, language, charset, viewport, robots directives, OpenGraph tags, Twitter Card tags, JSON-LD structured data, analytics tracking codes, images with alt text, hreflang attributes, schema markup, redirect chains, and which pages linked to this URL. This comprehensive data capture means you can query your crawl history with SQL for deep analysis that goes beyond what the UI provides.
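For example, a short SQL query against the crawl database can surface thin, deeply buried pages in a single pass. The table and column names below (crawled_urls, crawls, word_count, depth, and so on) are assumptions based on the fields described above, not LibreCrawl's exact schema:

```python
import sqlite3

# A minimal sketch of querying crawl history directly with SQL.
# Table and column names are assumed from the fields described in the post.
conn = sqlite3.connect("users.db")
conn.row_factory = sqlite3.Row

rows = conn.execute(
    """
    SELECT u.url, u.depth, u.word_count, u.title
    FROM crawled_urls AS u
    JOIN crawls AS c ON c.id = u.crawl_id
    WHERE c.base_url = ?
      AND u.status_code = 200
      AND u.word_count < 300      -- thin content
      AND u.depth > 3             -- buried deep in the site
    ORDER BY u.word_count ASC
    """,
    ("https://example.com",),
).fetchall()

for row in rows:
    print(f"{row['url']} (depth {row['depth']}, {row['word_count']} words)")

conn.close()
```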
The crawl_links table records every internal and external link discovered during the crawl. For each link, it stores the source URL, target URL, anchor text, whether it's internal or external, the target domain, the target's status code, and where on the page the link appeared. This granular link data powers the visualization tab's interactive graphs and enables analysis of how link equity flows through your site.
The crawl_issues table catalogs every SEO problem detected across your crawl: missing titles, duplicate meta descriptions, thin content, redirect chains, missing canonical URLs, broken structured data, accessibility issues, and now content duplication. Each issue includes the URL, issue type (error, warning, or info), category, problem description, and specific details like "Title is 85 characters (recommended: ≤60)."
The crawl_queue table supports crash recovery by maintaining a checkpoint of URLs still waiting to be crawled. If a crawl fails mid-execution, LibreCrawl consults this table to reconstruct the queue and resume processing from where it stopped. This queue-based recovery ensures that even crawls interrupted after processing thousands of pages can continue without starting over.
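To make that layout concrete, here is a rough sketch of what a subset of the schema might look like. The actual table definitions in LibreCrawl are more extensive, and the names and types shown are assumptions drawn from the descriptions above:

```python
import sqlite3

# Illustrative subset of the schema described above; LibreCrawl's real
# column names, types, and indexes may differ.
schema = """
CREATE TABLE IF NOT EXISTS crawls (
    id          INTEGER PRIMARY KEY,
    base_url    TEXT NOT NULL,
    started_at  TEXT,
    finished_at TEXT,
    config_json TEXT,
    status      TEXT            -- running, completed, failed
);

CREATE TABLE IF NOT EXISTS crawled_urls (
    id               INTEGER PRIMARY KEY,
    crawl_id         INTEGER REFERENCES crawls(id),
    url              TEXT NOT NULL,
    status_code      INTEGER,
    depth            INTEGER,
    title            TEXT,
    meta_description TEXT,
    word_count       INTEGER
);

CREATE TABLE IF NOT EXISTS crawl_links (
    crawl_id    INTEGER REFERENCES crawls(id),
    source_url  TEXT,
    target_url  TEXT,
    anchor_text TEXT,
    is_internal INTEGER
);

CREATE TABLE IF NOT EXISTS crawl_issues (
    crawl_id    INTEGER REFERENCES crawls(id),
    url         TEXT,
    issue_type  TEXT,           -- error, warning, info
    category    TEXT,
    description TEXT,
    details     TEXT
);

CREATE TABLE IF NOT EXISTS crawl_queue (
    crawl_id    INTEGER REFERENCES crawls(id),
    url         TEXT,
    depth       INTEGER
);
"""

conn = sqlite3.connect("users.db")
conn.executescript(schema)
conn.close()
```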
Crash Recovery and Resume Capability
When LibreCrawl starts up, it queries the database for any crawls marked as "running." Under normal circumstances no crawl should still be in running status when the server starts, so any that are must have crashed mid-crawl and need recovery. The system presents these crawls in the UI with clear options: resume the crawl from where it left off, mark it as failed and move on, or delete it entirely.
Resuming a crashed crawl loads the crawl configuration, restores all URLs already processed, rebuilds the queue of URLs still waiting to be crawled, and continues as if the interruption never happened. The crawler picks up the next URL from the queue and proceeds normally. Users see the crawl progress continue from its previous state, with the crawled URL count incrementing from where it stopped rather than starting at zero.
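A minimal sketch of that startup check and queue reconstruction, reusing the assumed schema from the earlier examples, might look like this:

```python
import sqlite3

# Sketch of the recovery check described above; LibreCrawl's internals may differ.
def find_interrupted_crawls(db_path="users.db"):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    # Any crawl still marked "running" at startup must have been interrupted.
    crawls = conn.execute(
        "SELECT id, base_url FROM crawls WHERE status = 'running'"
    ).fetchall()
    conn.close()
    return crawls

def rebuild_resume_state(crawl_id, db_path="users.db"):
    conn = sqlite3.connect(db_path)
    # URLs already processed -- these are skipped on resume.
    done = {row[0] for row in conn.execute(
        "SELECT url FROM crawled_urls WHERE crawl_id = ?", (crawl_id,))}
    # URLs still waiting in the checkpointed queue.
    pending = [row[0] for row in conn.execute(
        "SELECT url FROM crawl_queue WHERE crawl_id = ?", (crawl_id,))]
    conn.close()
    # The crawler continues from the pending list, skipping anything already done.
    return done, [u for u in pending if u not in done]
```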
This crash recovery capability is particularly valuable for large crawls that might run for hours or days. If you're crawling a million-page e-commerce site and your server needs to restart for updates after processing 400,000 pages, you don't lose 10+ hours of work. Resume the crawl, and LibreCrawl continues processing the remaining 600,000 pages as if nothing happened.
The database persistence also means that completed crawls remain accessible even after you close your browser or restart the server. Each crawl is saved with its full dataset, letting you load historical crawls from the UI and analyze them as if they just completed. This historical access enables trend analysis, before-and-after comparisons when making site changes, and the ability to reference past crawls when planning future SEO work.
Intelligent Content Duplication Detection
Content duplication is one of the most insidious SEO issues. Two pages with nearly identical content compete against each other in search results, diluting ranking signals and confusing search engines about which version to index. Duplicate content often emerges from printer-friendly versions, regional variations of the same page, paginated content without proper canonical tags, or CMS mistakes that create multiple URLs for the same content.
LibreCrawl's new duplication detection analyzes every crawled page and compares its content against all other pages to identify near-duplicates. The system examines multiple content signals: title tags, meta descriptions, H1 headings, and overall page structure. It calculates a similarity ratio using sequence-matching algorithms that measure how much two pages have in common, accounting for minor variations in wording while still identifying substantial content overlap.
The duplication threshold is configurable, defaulting to 85% similarity: two pages must share at least 85% of their content signals to be flagged as duplicates. This threshold strikes a balance between catching real duplication problems and avoiding false positives from pages that share templates or common elements but contain unique core content. You can adjust this threshold in settings if your site needs stricter or more lenient duplicate detection.
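A simplified version of this comparison can be sketched with Python's difflib. The exact signals, weighting, and matching logic LibreCrawl uses may differ; this just illustrates the idea of a similarity ratio over combined content signals:

```python
from difflib import SequenceMatcher

# Sketch: compare content signals (title, meta description, H1) between pages
# and flag pairs above a similarity threshold.
def content_signature(page):
    # `page` is a dict of extracted fields, e.g. a row from crawled_urls.
    return " ".join(
        (page.get("title") or "",
         page.get("meta_description") or "",
         page.get("h1") or "")
    ).lower().strip()

def similarity(page_a, page_b):
    return SequenceMatcher(None,
                           content_signature(page_a),
                           content_signature(page_b)).ratio()

def find_duplicates(pages, threshold=0.85):
    """Pairwise comparison of all pages; returns (url_a, url_b, ratio) tuples."""
    duplicates = []
    for i, a in enumerate(pages):
        for b in pages[i + 1:]:
            ratio = similarity(a, b)
            if ratio >= threshold:
                duplicates.append((a["url"], b["url"], ratio))
    return duplicates
```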
When duplication is detected, LibreCrawl adds issues to both URLs involved in the duplicate pair. If Page A and Page B are 92% similar, both pages show a warning in the Issues tab: "Duplicate Content Detected: Content is 92.0% similar to [other URL]." This bilateral reporting ensures you see the duplication from both sides, making it easier to decide which URL should be canonical and which should redirect or be consolidated.
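Recording the pair bilaterally is straightforward; this sketch reuses the assumed crawl_issues columns from the schema example above:

```python
# Sketch of bilateral reporting: one issue row for each side of the pair.
def report_duplicate_pair(conn, crawl_id, url_a, url_b, ratio):
    message = "Content is {:.1f}% similar to {}"
    rows = [
        (crawl_id, url_a, "warning", "Duplication",
         "Duplicate Content Detected", message.format(ratio * 100, url_b)),
        (crawl_id, url_b, "warning", "Duplication",
         "Duplicate Content Detected", message.format(ratio * 100, url_a)),
    ]
    conn.executemany(
        "INSERT INTO crawl_issues "
        "(crawl_id, url, issue_type, category, description, details) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
```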
The duplication detector respects your issue exclusion patterns. If you've configured LibreCrawl to ignore certain URL patterns, those URLs won't participate in duplication analysis. This is useful for legitimate duplicate scenarios like printer-friendly versions or staging environments where you don't want false positives cluttering your issue reports.
Duplication detection runs after the main crawl completes, analyzing all crawled content in a single pass. For small to medium sites, this analysis completes in seconds. For larger sites with tens of thousands of pages, it might take a few minutes as the system performs pairwise comparisons across your content. The results integrate seamlessly with the existing Issues tab, where you can sort by category to see all Duplication issues together, export them to CSV for reporting, or click through to the URLs to review the content and decide how to resolve the duplication.
Performance and Scalability
Database persistence adds minimal overhead to crawl performance. LibreCrawl batches database writes, inserting URLs, links, and issues in groups rather than one at a time. This batching reduces database transactions and maintains crawl throughput even on sites with thousands of pages. The SQLite database uses write-ahead logging for concurrent access, allowing reads and writes to happen simultaneously without blocking.
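A rough sketch of that pattern, with assumed table and column names, looks like this:

```python
import sqlite3

# Sketch of batched writes with write-ahead logging (assumed structure).
conn = sqlite3.connect("users.db")
conn.execute("PRAGMA journal_mode=WAL")   # readers don't block the writer

batch = []  # accumulate (crawl_id, url, status_code, depth, title) tuples

def flush(conn, batch, size=100):
    """Write queued rows in one transaction once the batch is large enough."""
    if len(batch) < size:
        return
    with conn:  # single transaction for the whole batch
        conn.executemany(
            "INSERT INTO crawled_urls (crawl_id, url, status_code, depth, title) "
            "VALUES (?, ?, ?, ?, ?)",
            batch,
        )
    batch.clear()
```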
Indexes on frequently queried columns ensure that loading historical crawls, filtering by status code, or searching for specific URLs remains fast even as your database grows to store dozens or hundreds of crawls. The database file is stored in the same directory as your LibreCrawl installation, making backups simple: just copy the users.db file to preserve all your crawl history and user data.
For users concerned about database size, LibreCrawl includes optional maintenance functions. The cleanup_old_crawls() function deletes crawls older than a specified number of days, automatically removing completed or failed crawls that you no longer need. This automated cleanup keeps your database lean while retaining recent history for trend analysis.
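Internally, such a routine might look roughly like the following; the real cleanup_old_crawls() implementation and its retention rules belong to LibreCrawl and may differ:

```python
import sqlite3

# Hypothetical sketch of a retention cleanup; assumes ISO-formatted timestamps
# in crawls.started_at and the assumed table names from earlier examples.
def cleanup_old_crawls(db_path="users.db", max_age_days=90):
    conn = sqlite3.connect(db_path)
    with conn:
        # Only completed or failed crawls older than the cutoff are removed.
        old_ids = [row[0] for row in conn.execute(
            """
            SELECT id FROM crawls
            WHERE status IN ('completed', 'failed')
              AND started_at < datetime('now', ?)
            """,
            (f"-{max_age_days} days",),
        )]
        for table in ("crawled_urls", "crawl_links", "crawl_issues", "crawl_queue"):
            conn.executemany(
                f"DELETE FROM {table} WHERE crawl_id = ?",
                [(cid,) for cid in old_ids],
            )
        conn.executemany("DELETE FROM crawls WHERE id = ?",
                         [(cid,) for cid in old_ids])
    conn.close()
```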
Duplication detection scales surprisingly well even for large sites. The algorithm uses efficient data structures to avoid redundant comparisons, and it processes only HTML pages rather than assets like CSS, JavaScript, or images. A crawl of 10,000 pages generates roughly 50 million comparisons in the worst case, but with optimizations like skipping identical titles or early-exit when similarity is clearly below threshold, the actual comparison count is much lower in practice.
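One way to implement that early exit is with difflib's cheap upper-bound ratios; whether LibreCrawl uses exactly this shortcut is an assumption:

```python
from difflib import SequenceMatcher

# quick_ratio() and real_quick_ratio() are inexpensive upper bounds on ratio(),
# so clearly dissimilar pairs can be rejected before the full comparison runs.
def similar_enough(text_a, text_b, threshold=0.85):
    matcher = SequenceMatcher(None, text_a, text_b)
    if matcher.real_quick_ratio() < threshold:   # cheapest upper bound
        return False
    if matcher.quick_ratio() < threshold:        # still an upper bound
        return False
    return matcher.ratio() >= threshold          # full comparison
```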
What This Means for SEO Workflows
Database persistence fundamentally changes how you can use LibreCrawl for ongoing SEO monitoring. Instead of running one-off audits and losing the data after fixing issues, you now build a historical record of your site's technical health. Run a crawl every week, and you can track how your issue count trends over time. Did fixing those 200 missing meta descriptions actually reduce your total issues, or did new problems emerge elsewhere on the site?
The ability to compare crawls over time reveals patterns that single audits miss. Maybe your site gradually accumulates orphaned pages as content teams publish without updating navigation. Maybe duplicate content issues spike whenever you launch a new product category because templates aren't configured correctly. These trends only become visible when you have historical data to analyze.
Duplication detection addresses a gap that manual analysis can't efficiently fill. Even experienced SEO professionals struggle to spot content duplication across large sites without automated tools. You might notice that three product pages have similar descriptions, but you'd never catch the 47 blog posts that all use variations of the same introduction paragraph, or the regional landing pages that differ only in city names. The duplication detector finds these problems systematically, giving you actionable data on where to focus content differentiation efforts.
For agencies managing multiple client sites, database persistence means you can maintain a complete crawl history for each client. Reference past crawls when clients question whether an issue existed before or after you made changes. Show month-over-month improvement in issue counts to demonstrate the value of your technical SEO work. Export crawl data to CSV and integrate it into your reporting dashboards to track SEO health alongside traffic and ranking data.
Configuring Database Persistence and Duplication Detection
Database persistence activates automatically in LibreCrawl's latest release. When you start a new crawl, the system creates a crawl record in the database and begins saving data as URLs are processed. You don't need to configure anything for basic persistence and crash recovery to work.
Duplication detection is enabled by default with an 85% similarity threshold. You can adjust this threshold in Settings under the Issues section. A lower threshold like 70% catches more potential duplicates but may produce false positives. A higher threshold like 95% reduces false positives but might miss subtle duplication. Experiment with different thresholds on your site to find the balance that works for your content.
To disable duplication detection entirely, toggle the "Enable Duplication Check" setting to off. This can be useful for very large sites where the duplication analysis adds significant processing time, or for sites where legitimate content similarity (like product variants or legal pages) would generate too many false positives.
The database file users.db stores all persistence data alongside your user accounts and authentication records. Back up this file regularly if you want to preserve your crawl history long-term. The file grows proportionally to how many crawls you run and how large those crawls are. A typical crawl of 5,000 pages adds about 50-100 MB to the database, though this varies based on how much metadata each page contains.
Future Enhancements
Database persistence opens up new possibilities for LibreCrawl's roadmap. Scheduled crawls that run automatically on a recurring basis become feasible when results are stored persistently. You could configure LibreCrawl to crawl your site every night at 2 AM, and each morning you'd have fresh data waiting in your crawl history without manual intervention.
Historical trend analysis features could visualize how your site's metrics change over time. Charts showing total URLs crawled, issue count by category, average page speed, or duplication frequency over weeks or months would make it easy to spot improvements or regressions at a glance. This dashboard-style reporting would transform LibreCrawl from an audit tool into an ongoing monitoring platform.
Advanced duplication analysis could extend beyond current page content to include historical comparisons: alerting you when a page's content suddenly becomes similar to another page's, suggesting that content was copied between them; tracking when duplicate content is fixed and drops out of the duplication report; and generating recommendations about which duplicate to keep based on metrics like inbound links, social shares, or content quality signals.
Multi-site support could leverage the database to manage crawls across different projects or clients. Each site gets its own crawl history, issue tracking, and trend analysis while all data lives in the same database for centralized management. This would be particularly valuable for agencies running LibreCrawl for multiple clients from a single installation.
Getting Started
Database persistence and duplication detection are available now in LibreCrawl's latest release. If you're running a self-hosted instance, pull the latest code from the GitHub repository to access these features. The database will initialize automatically on first launch, creating all necessary tables and indexes.
If you have existing LibreCrawl crawls saved as JSON files, those files remain valid and can still be loaded through the UI. The new database persistence doesn't replace JSON exports; it complements them by providing automatic background persistence and crash recovery while JSON exports remain useful for sharing crawls between installations or archiving specific audits.
To test crash recovery, start a crawl of a moderately large site, let it run for a few minutes to process some URLs, then restart your LibreCrawl server. When it starts back up, you'll see the incomplete crawl listed with an option to resume. Click resume, and watch it continue from where it left off. To see duplication detection in action, crawl a site where you suspect content duplication, then check the Issues tab for the Duplication category to see what similar pages were identified.
Database persistence and duplication detection represent a major step forward for LibreCrawl's reliability and analytical capabilities. Never lose crawl progress to crashes again. Build a historical record of your site's SEO health over time. Systematically identify duplicate content that harms your search performance. These features transform LibreCrawl from a powerful audit tool into a robust, production-ready SEO monitoring platform that learns from your site's history and protects your work from data loss.