httrack2warc by nla

Converts HTTrack crawls to WARC files

created at Oct. 23, 2017, 5:52 a.m.

Java

20 +0

30 +0

6 +0

GitHub
wasp by webis-de

No description provided.

created at March 25, 2018, 6:58 p.m.

Java

13 +0

26 +0

4 +0

GitHub
Web2Warc by helgeho

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

created at Jan. 29, 2016, 10:43 a.m.

Scala

3 +0

24 +0

4 +0

GitHub
wget-lua by alard

Wget with Lua extension

created at Aug. 21, 2012, 8:39 p.m.

C

4 +0

23 +0

9 +0

GitHub
notebooks by archivesunleashed

Example notebooks for working with web archives using the Archives Unleashed Toolkit and the derivatives it generates.

created at Nov. 6, 2019, 3:09 a.m.

Jupyter Notebook

6 +0

22 +0

4 +0

GitHub
web-archiving-course by vphill

Web Archiving Course

created at Feb. 22, 2022, 2:33 a.m.

Unknown languages

1 +0

20 +0

6 +0

GitHub
webarchive by richardlehane

Go readers for the ARC and WARC web archive formats

created at Sept. 21, 2015, 6:38 a.m.

Go

7 +0

20 +0

2 +0

GitHub
html2warc by steffenfritz

Simple script to convert web resources to a single WARC file

created at Dec. 30, 2015, 2:29 p.m.

Python

4 +0

18 +0

2 +0

GitHub
Zotero-Robust-Links-Extension by lanl

Create Robust Links from within Zotero

created at June 28, 2021, 9:38 p.m.

JavaScript

3 +0

17 +0

2 +0

GitHub
arch by internetarchive

Web application for distributed compute analysis of Archive-It web archive collections.

created at April 28, 2022, 3:18 p.m.

Scala

21 +0

15 +0

4 +0

GitHub
py-wasapi-client by unt-libraries

A client for the Archive-It and Webrecorder WASAPI Data Transfer API

created at Aug. 10, 2017, 5:25 p.m.

Python

5 +0

14 +0

5 +0

GitHub
gowarcserver by nlnwa

No description provided.

created at Jan. 15, 2021, 10:42 a.m.

Go

7 +0

14 +0

2 +0

GitHub
Sparkling by internetarchive

Internet Archive's Sparkling Data Processing Library

created at April 28, 2022, 2:28 p.m.

Scala

20 +0

11 +0

2 +0

GitHub
warc-safe by natliblux

A tool for detecting viruses and NSFW material in WARC files

created at May 3, 2024, 6:24 a.m.

Python

4 +0

10 +0

0 +0

GitHub
MementoMap by oduwsdl

A Tool to Summarize Web Archive Holdings

created at Jan. 20, 2019, 1:30 a.m.

Python

7 +0

10 +0

1 +0

GitHub
HadoopConcatGz by helgeho

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

created at Aug. 8, 2016, 1:36 p.m.

Java

2 +0

9 +0

3 +0

GitHub
heritrix-walkthrough by web-archive-group

No description provided.

created at June 1, 2016, 10:35 p.m.

Shell

6 +0

9 +0

1 +0

GitHub
linkstat by httpreserve

CLI implementation of httpreserve that can test links and retrieve Internet Archive replacements

created at March 19, 2019, 9:23 p.m.

Go

3 +0

9 +0

0 +0

GitHub
tikalinkextract by httpreserve

Tika-based link (URL) extractor for httpreserve

created at April 3, 2017, 2:35 a.m.

HTML

4 +0

9 +0

1 +0

GitHub
twut by archivesunleashed

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

created at Nov. 29, 2019, 2:52 p.m.

Scala

4 +0

9 +0

2 +0

GitHub