Tools for bulk indexing of WARC/ARC files on Hadoop, EMR, or a local file system.
created at March 9, 2015, 8:32 p.m.
brozzler - distributed browser-based web crawler
created at July 13, 2015, 11:48 p.m.
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.
created at Aug. 6, 2015, 7:42 p.m.
Go readers for the ARC and WARC web archive formats
created at Sept. 21, 2015, 6:38 a.m.
Simple script to convert web resources to a single WARC file
created at Dec. 30, 2015, 2:29 p.m.
A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
created at Aug. 8, 2016, 1:36 p.m.
A list of things related to software, literature, and other content for 🕣 Memento
created at Sept. 16, 2016, 1:33 a.m.
A search interface and wayback machine for the UKWA Solr-based warc-indexer framework.
created at Feb. 8, 2017, 9:33 a.m.
A Tool To Push Web Resources Into Web Archives
created at Feb. 9, 2017, 12:29 p.m.
Partition (W)ARC Files by MIME Type and Year
created at Feb. 13, 2017, 3:45 p.m.
A command-line tool and Python library for archiving data from Facebook using the Graph API.
created at Feb. 14, 2017, 11:45 p.m.
Streaming WARC/ARC library for fast web archive IO
created at March 6, 2017, 6:17 p.m.