Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
updated at March 26, 2024, 10:50 p.m.
A command-line tool and Python library for archiving data from Facebook using the Graph API.
updated at May 17, 2024, 4:57 a.m.
A collection of tools for archiving and analysing the internet.
updated at June 17, 2024, 9:09 p.m.
A client for the Archive-It and Webrecorder WASAPI Data Transfer API
updated at June 28, 2024, 7:33 p.m.
Simple script to convert web resources to a single WARC file
updated at June 29, 2024, 9:24 a.m.
A dockerized, queued high fidelity web archiver based on Squidwarc
updated at July 23, 2024, 9:51 p.m.
Command-line tools and libraries for handling and manipulating WARC files (and HTTP content)
updated at Aug. 29, 2024, 5:43 p.m.
Streaming WARC/ARC library for fast web archive IO
updated at Aug. 31, 2024, 6:12 a.m.
WarcDB: Web crawl data as SQLite databases.
updated at Sept. 10, 2024, 3:01 p.m.
brozzler - distributed browser-based web crawler
updated at Sept. 15, 2024, 12:07 p.m.
Convert HTTP Archive (HAR) -> Web Archive (WARC) format
updated at Sept. 18, 2024, 11:21 a.m.