ArchiveSpark by helgeho

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

created at Aug. 6, 2015, 7:42 p.m.

Scala

14 +0

141 +0

19 +0

GitHub
aut by archivesunleashed

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

created at July 6, 2017, 10:13 a.m.

Scala

15 +0

133 +0

33 +0

GitHub
Web2Warc by helgeho

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX).

created at Jan. 29, 2016, 10:43 a.m.

Scala

3 +0

24 +0

4 +0

GitHub
arch by internetarchive

Web application for distributed compute analysis of Archive-It web archive collections.

created at April 28, 2022, 3:18 p.m.

Scala

19 +0

13 +0

4 +0

GitHub
Sparkling by internetarchive

Internet Archive's Sparkling Data Processing Library

created at April 28, 2022, 2:28 p.m.

Scala

17 +0

10 +0

2 +0

GitHub
twut by archivesunleashed

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

created at Nov. 29, 2019, 2:52 p.m.

Scala

4 +0

9 +0

2 +0

GitHub