node-cdxj by N0taN3rd

Parse CDXJ (https://github.com/oduwsdl/ORS/wiki/CDXJ) files with Node.js

updated at May 21, 2017, 6:20 a.m.

JavaScript

3 +0

0 +0

1 +0

GitHub
HadoopConcatGz by helgeho

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

updated at April 7, 2021, 12:20 a.m.

Java

2 +0

9 +0

3 +0

GitHub
WarcPartitioner by helgeho

Partition (W)ARC Files by MIME Type and Year

updated at Jan. 29, 2022, 10:23 p.m.

Java

2 +0

1 +0

1 +0

GitHub
wget-lua by alard

Wget with Lua extension

updated at July 17, 2022, 10:25 a.m.

C

4 +0

22 +0

9 +0

GitHub
jwat-tools by netarchivesuite

JWAT Tools

updated at March 13, 2023, 10:12 a.m.

Java

7 +0

4 +0

2 +0

GitHub
jwat by netarchivesuite

Java Web Archive Toolkit

updated at April 17, 2023, 8:40 p.m.

Java

8 +0

3 +0

2 +0

GitHub
twut by archivesunleashed

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

updated at June 12, 2023, 7:59 a.m.

Scala

4 +0

9 +0

2 +0

GitHub
warclight by archivesunleashed

A Rails engine supporting the discovery of web archives.

updated at Aug. 7, 2023, 6:51 p.m.

Ruby

5 +0

48 +0

10 +0

GitHub
tikalinkextract by httpreserve

Tika-based link (URL) extractor for httpreserve

updated at Sept. 8, 2023, 5:23 p.m.

HTML

4 +0

8 +0

1 +0

GitHub
fbarc by justinlittman

A command-line tool and Python library for archiving data from Facebook using the Graph API.

updated at Oct. 22, 2023, 8:33 p.m.

Python

16 +0

78 +0

11 +0

GitHub
Web2Warc by helgeho

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

updated at Oct. 22, 2023, 8:37 p.m.

Scala

3 +0

24 +0

4 +0

GitHub
crocoite by PromyLOPh

Web archiving using Google Chrome

updated at Oct. 23, 2023, 11:32 a.m.

Python

8 +0

42 +0

7 +0

GitHub
linkstat by httpreserve

CLI implementation of httpreserve that can test links and retrieve Internet Archive replacements

updated at Nov. 18, 2023, 5:02 p.m.

Go

3 +0

7 +0

0 +0

GitHub
heritrix-walkthrough by web-archive-group

No description provided.

updated at Dec. 9, 2023, 12:31 a.m.

Shell

6 +0

9 +0

1 +0

GitHub
web-archiving-course by vphill

Web Archiving Course

updated at Dec. 19, 2023, 6:19 p.m.

Unknown languages

1 +0

19 +0

6 +0

GitHub
notebooks by archivesunleashed

Example notebooks for working with web archives using the Archives Unleashed Toolkit and the derivatives it generates.

updated at Jan. 21, 2024, 10:04 a.m.

Jupyter Notebook

6 +0

21 +0

4 +0

GitHub
warcrefs by arcalex

Web archive deduplication tools

updated at Jan. 26, 2024, 12:55 a.m.

Java

5 +0

6 +0

1 +0

GitHub
shine by ukwa

Prototype SOLR-powered web archive exploration UI.

updated at Jan. 29, 2024, 1:03 a.m.

JavaScript

17 +0

42 +0

7 +0

GitHub
httrack2warc by nla

Converts HTTrack crawls to WARC files

updated at Jan. 30, 2024, 12:40 p.m.

Java

20 +0

27 +0

6 +0

GitHub
webarchive by richardlehane

Go readers for the ARC and WARC web archive formats

updated at Feb. 6, 2024, 11:28 p.m.

Go

7 +0

17 +0

2 +0

GitHub