A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
updated at April 7, 2021, 12:20 a.m.
Partition (W)ARC Files by MIME Type and Year
updated at Jan. 29, 2022, 10:23 p.m.
An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.
updated at June 12, 2023, 7:59 a.m.
A Rails engine supporting the discovery of web archives.
updated at Aug. 7, 2023, 6:51 p.m.
Tika-based link (URL) extractor for httpreserve
updated at Sept. 8, 2023, 5:23 p.m.
A command-line tool and Python library for archiving data from Facebook using the Graph API.
updated at Oct. 22, 2023, 8:33 p.m.
CLI implementation of httpreserve that can test links and retrieve Internet Archive replacements
updated at Nov. 18, 2023, 5:02 p.m.
Example notebooks for working with web archives using the Archives Unleashed Toolkit and the derivatives it generates.
updated at Jan. 21, 2024, 10:04 a.m.
Go readers for the ARC and WARC web archive formats
updated at Feb. 6, 2024, 11:28 p.m.