igorbarinov/awesome-data-engineering

snappydata by TIBCOSoftware

SnappyData: a memory-optimized analytics database based on Apache Spark™ and Apache Geode™. Stream, transact, analyze, and predict in one cluster.

updated at April 13, 2024, 6:31 a.m.

Scala

84 +0

1,037 +0

203 +0

PyHive by dropbox

Python interface to Hive and Presto. 🐝

updated at April 17, 2024, 5:33 p.m.

Python

62 +0

1,665 +0

552 +0

eventsim by Interana

Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.

updated at April 20, 2024, 5:45 a.m.

Scala

111 +0

486 +0

126 +0

HyperDex by rescrv

HyperDex is a scalable, searchable key-value store

updated at April 20, 2024, 10:02 a.m.

C++

88 +0

1,394 +0

168 +0

secor by pinterest

Secor is a service implementing Kafka log persistence

updated at April 22, 2024, 8:31 a.m.

Java

70 +0

1,835 +0

541 +0

haproxy_exporter by prometheus

Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption

updated at April 22, 2024, 5:30 p.m.

Go

30 +0

609 +0

219 +0

elasticsearch-jdbc by jprante

JDBC importer for Elasticsearch

updated at April 23, 2024, 2:40 a.m.

Java

231 +0

2,838 +0

711 -1

mysql_utils by pinterest

Pinterest MySQL Management Tools

updated at April 25, 2024, 6:37 a.m.

Python

72 +0

879 +0

141 +0

zodiac by CenturyLinkLabs

A lightweight tool for easy deployment and rollback of dockerized applications.

updated at April 25, 2024, 7:03 p.m.

Go

22 +0

194 +0

20 +0

flockdb by twitter-archive

A distributed, fault-tolerant graph database

updated at April 27, 2024, 5:35 p.m.

Scala

279 +0

3,330 +0

273 +0

Akumuli by akumuli

Time-series database

updated at April 28, 2024, 8:05 a.m.

C++

44 +0

838 +1

86 +0

ccm by riptano

A script to easily create and destroy an Apache Cassandra cluster on localhost

updated at April 29, 2024, 12:45 p.m.

Python

76 +0

1,212 +0

302 +0

opentsdb by OpenTSDB

A scalable, distributed Time Series Database.

updated at April 29, 2024, 1:49 p.m.

Java

337 +0

4,951 +2

1,253 +0

kafka-node by SOHU-Co

Node.js client for Apache Kafka 0.8 and later.

updated at April 30, 2024, 9:57 a.m.

JavaScript

99 +0

2,659 +1

630 +0

delight by datamechanics

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

updated at April 30, 2024, 9:48 p.m.

Scala

16 +0

335 +1

50 +0

Gaffer by gchq

A large-scale entity and relation database supporting aggregation of properties

updated at May 1, 2024, 11:33 a.m.

Java

142 +0

1,734 +1

354 +0

FiloDB by filodb

Distributed Prometheus time series database

updated at May 1, 2024, 4:06 p.m.

Scala

89 +0

1,413 +0

223 +0

DataProfiler by capitalone

What's in your data? Extract schema, statistics and entities from datasets

updated at May 2, 2024, 2:22 a.m.

Python

21 +0

1,363 +1

154 +0

smart_open by piskvorky

Utils for streaming large files (S3, HDFS, gzip, bz2...)

updated at May 2, 2024, 12:46 p.m.

Python

49 +0

3,094 +1

378 +0

zilla by aklivity

🦎 A multi-protocol, event-native proxy. Securely interface web apps, IoT clients, & microservices to Apache Kafka® via declaratively defined, stateless APIs.

updated at May 2, 2024, 3:48 p.m.

Java

9 +0

486 +0

47 +0
