An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.
created at Aug. 6, 2015, 7:42 p.m.
An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)
created at Jan. 29, 2016, 10:43 a.m.
A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
created at Aug. 8, 2016, 1:36 p.m.
Partition (W)ARC Files by MIME Type and Year
created at Feb. 13, 2017, 3:45 p.m.