Simple script to convert web resources to a single WARC file
updated at May 8, 2024, 5:21 a.m.
A search interface and wayback machine for the UKWA Solr-based warc-indexer framework.
updated at May 7, 2024, 6:08 a.m.
A client for the Archive-It and Webrecorder WASAPI Data Transfer API
updated at May 7, 2024, 3:08 a.m.
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at Internet Archive.
updated at May 5, 2024, 4:14 a.m.
brozzler - distributed browser-based web crawler
updated at May 4, 2024, 4:59 a.m.
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
updated at May 1, 2024, 4:39 p.m.
Various Jupyter notebooks about Common Crawl data
updated at May 1, 2024, 4:06 p.m.
WarcDB: Web crawl data as SQLite databases.
updated at May 1, 2024, 4:03 p.m.
A list of things related to software, literature, and other content for 🕣 Memento
updated at April 27, 2024, 8:55 a.m.
Web application for distributed compute analysis of Archive-It web archive collections.
updated at April 24, 2024, 8:10 p.m.
A dockerized, queued, high-fidelity web archiver based on Squidwarc
updated at April 23, 2024, 1:39 a.m.
Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)
updated at April 11, 2024, 9:06 a.m.