MementoMap by oduwsdl

A Tool to Summarize Web Archive Holdings

created at Jan. 20, 2019, 1:30 a.m.

Python

7 +0

9 +0

0 +0

GitHub
unwarcit by emmadickson

None

created at Dec. 11, 2021, 7:19 p.m.

Python

5 +0

6 +0

0 +0

GitHub
linkstat by httpreserve

CLI implementation of httpreserve that can test links and retrieve Internet Archive replacements

created at March 19, 2019, 9:23 p.m.

Go

3 +0

7 +0

0 +0

GitHub
warc-safe by natliblux

A tool for detecting viruses and NSFW material in WARC files

created at May 3, 2024, 6:24 a.m.

Python

NEW!

4 +0

5 +0

0 +0

GitHub
tikalinkextract by httpreserve

Tika-based link (URL) extractor for httpreserve

created at April 3, 2017, 2:35 a.m.

HTML

4 +0

8 +0

1 +0

GitHub
warcrefs by arcalex

Web archive deduplication tools

created at April 22, 2014, 8:02 a.m.

Java

5 +0

6 +0

1 +0

GitHub
heritrix-walkthrough by web-archive-group

None

created at June 1, 2016, 10:35 p.m.

Shell

6 +0

9 +0

1 +0

GitHub
node-cdxj by N0taN3rd

Parse CDXJ (https://github.com/oduwsdl/ORS/wiki/CDXJ) files with Node.js

created at May 18, 2017, 4:45 a.m.

JavaScript

3 +0

0 +0

1 +0

GitHub
playback by wabarc

Play back webpages from the Wayback Machine

created at April 8, 2021, 2:21 p.m.

Go

4 +0

6 +0

1 +0

GitHub
WarcPartitioner by helgeho

Partition (W)ARC Files by MIME Type and Year

created at Feb. 13, 2017, 3:45 p.m.

Java

2 +0

1 +0

1 +0

GitHub
gowarcserver by nlnwa

None

created at Jan. 15, 2021, 10:42 a.m.

Go

7 +0

12 +0

1 +0

GitHub
Sparkling by internetarchive

Internet Archive's Sparkling Data Processing Library

created at April 28, 2022, 2:28 p.m.

Scala

17 +0

10 +0

2 +0

GitHub
jwat by netarchivesuite

Java Web Archive Toolkit

created at Aug. 30, 2018, 5:28 p.m.

Java

NEW!

8 +0

3 +0

2 +0

GitHub
webarchive by richardlehane

Golang readers for the ARC and WARC web archive formats

created at Sept. 21, 2015, 6:38 a.m.

Go

7 +0

17 +0

2 +0

GitHub
html2warc by steffenfritz

Simple script to convert web resources to a single WARC file

created at Dec. 30, 2015, 2:29 p.m.

Python

4 +0

15 +0

2 +0

GitHub
jwat-tools by netarchivesuite

JWAT Tools

created at Aug. 30, 2018, 5:54 p.m.

Java

NEW!

7 +0

4 +0

2 +0

GitHub
cairn by wabarc

NPM package and CLI tool for saving a web page as a single HTML file

created at Oct. 8, 2020, 7:18 a.m.

TypeScript

4 +0

37 +0

2 +0

GitHub
Zotero-Robust-Links-Extension by lanl

Create Robust Links from within Zotero

created at June 28, 2021, 9:38 p.m.

JavaScript

3 +0

17 +0

2 +0

GitHub
twut by archivesunleashed

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

created at Nov. 29, 2019, 2:52 p.m.

Scala

4 +0

9 +0

2 +0

GitHub
HadoopConcatGz by helgeho

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

created at Aug. 8, 2016, 1:36 p.m.

Java

2 +0

9 +0

3 +0

GitHub