Web2Warc by helgeho

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

updated at Oct. 22, 2023, 8:37 p.m.

Scala

3 +0

24 +0

4 +0

GitHub

twut by archivesunleashed

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

updated at Aug. 10, 2024, 1:19 p.m.

Scala

4 +0

9 +0

2 +0

GitHub

arch by internetarchive

Web application for distributed compute analysis of Archive-It web archive collections.

updated at Aug. 28, 2024, 7:31 p.m.

Scala

19 +0

15 +0

4 +0

GitHub

aut by archivesunleashed

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

updated at Aug. 29, 2024, 4:20 p.m.

Scala

15 +0

137 +0

33 +0

GitHub

Sparkling by internetarchive

Internet Archive's Sparkling Data Processing Library

updated at Sept. 12, 2024, 3:06 p.m.

Scala

17 +0

11 +0

2 +0

GitHub

ArchiveSpark by helgeho

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

updated at Sept. 13, 2024, 6:53 a.m.

Scala

14 +0

143 +0

19 +0

GitHub