aut by archivesunleashed

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

updated at June 21, 2024, 9:23 p.m.

Scala

15 +0

135 +1

33 +0

arch by internetarchive

Web application for distributed compute analysis of Archive-It web archive collections.

updated at June 17, 2024, 9:16 p.m.

Scala

19 +0

13 +0

4 +0

ArchiveSpark by helgeho

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

updated at June 12, 2024, 4:19 p.m.

Scala

14 +0

143 +0

19 +0

Sparkling by internetarchive

Internet Archive's Sparkling Data Processing Library

updated at June 5, 2024, 8:32 p.m.

Scala

17 +0

10 +0

2 +0

Web2Warc by helgeho

An easy-to-use and highly customizable crawler that enables you to create your own small web archives (WARC/CDX).

updated at Oct. 22, 2023, 8:37 p.m.

Scala

3 +0

24 +0

4 +0

twut by archivesunleashed

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

updated at June 12, 2023, 7:59 a.m.

Scala

4 +0

9 +0

2 +0
