notebooks by archivesunleashed

Example notebooks for working with web archives using the Archives Unleashed Toolkit and the derivatives it generates.

created at Nov. 6, 2019, 3:09 a.m.

Jupyter Notebook

6 +0

21 +0

4 +0

GitHub
wasp by webis-de

No description provided.

created at March 25, 2018, 6:58 p.m.

Java

13 +0

25 +0

4 +0

GitHub
arch by internetarchive

Web application for distributed compute analysis of Archive-It web archive collections.

created at April 28, 2022, 3:18 p.m.

Scala

19 +0

13 +0

4 +0

GitHub
wasapi-downloader by sul-dlss

Java application to download WARCs from WASAPI

created at April 28, 2017, 9:15 p.m.

Java

22 +0

6 +0

4 +0

GitHub
py-wasapi-client by unt-libraries

A client for the Archive-It and Webrecorder WASAPI Data Transfer API

created at Aug. 10, 2017, 5:25 p.m.

Python

5 +0

14 +0

4 +0

GitHub
HadoopConcatGz by helgeho

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

created at Aug. 8, 2016, 1:36 p.m.

Java

2 +0

9 +0

3 +0

GitHub
Mink by machawk1

Chrome extension that uses Memento to indicate when the page a user is viewing on the live web has an archived copy, and to give the user access to that copy

created at Jan. 17, 2014, 6:25 p.m.

JavaScript

6 +0

45 +0

3 +0

GitHub
warc2html by iipc

Converts WARC files to static HTML

created at Nov. 8, 2021, 4:09 a.m.

Java

10 +0

38 +0

3 +0

GitHub
har2warc by webrecorder

Convert HTTP Archive (HAR) -> Web Archive (WARC) format

created at March 16, 2017, 12:14 a.m.

Python

7 +0

42 +0

3 +0

GitHub
jwat by netarchivesuite

Java Web Archive Toolkit

created at Aug. 30, 2018, 5:28 p.m.

Java

8 +0

3 +0

2 +0

GitHub
html2warc by steffenfritz

Simple script to convert web resources to a single WARC file

created at Dec. 30, 2015, 2:29 p.m.

Python

4 +0

15 +0

2 +0

GitHub
twut by archivesunleashed

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

created at Nov. 29, 2019, 2:52 p.m.

Scala

4 +0

9 +0

2 +0

GitHub
webarchive by richardlehane

Go readers for the ARC and WARC web archive formats

created at Sept. 21, 2015, 6:38 a.m.

Go

7 +0

17 +0

2 +0

GitHub
cairn by wabarc

NPM package and CLI tool for saving a web page as a single HTML file

created at Oct. 8, 2020, 7:18 a.m.

TypeScript

4 +0

37 +0

2 +0

GitHub
Zotero-Robust-Links-Extension by lanl

Create Robust Links from within Zotero

created at June 28, 2021, 9:38 p.m.

JavaScript

3 +0

17 +0

2 +0

GitHub
Sparkling by internetarchive

Internet Archive's Sparkling Data Processing Library

created at April 28, 2022, 2:28 p.m.

Scala

17 +0

10 +0

2 +0

GitHub
jwat-tools by netarchivesuite

JWAT Tools

created at Aug. 30, 2018, 5:54 p.m.

Java

7 +0

5 +1

2 +0

GitHub
warcrefs by arcalex

Web archive deduplication tools

created at April 22, 2014, 8:02 a.m.

Java

5 +0

6 +0

1 +0

GitHub
WarcPartitioner by helgeho

Partition (W)ARC Files by MIME Type and Year

created at Feb. 13, 2017, 3:45 p.m.

Java

2 +0

1 +0

1 +0

GitHub
gowarcserver by nlnwa

No description provided.

created at Jan. 15, 2021, 10:42 a.m.

Go

7 +0

12 +0

1 +0

GitHub