igorbarinov/awesome-data-engineering

secor by pinterest

Secor is a service implementing Kafka log persistence

created at April 15, 2014, 10:26 p.m.

Java

68 +0

1,845 -2

540 +0

Gaffer by gchq

A large-scale entity and relation database supporting aggregation of properties

created at Dec. 14, 2015, 12:12 p.m.

Java

138 +0

1,772 +2

354 +0

kairosdb by kairosdb

Fast scalable time series database

created at Feb. 5, 2013, 10:27 p.m.

Java

115 -1

1,738 -3

344 +0

PyHive by dropbox

Python interface to Hive and Presto. 🐝

created at Feb. 1, 2014, 9:05 a.m.

Python

62 +0

1,674 +3

551 +2

faust by faust-streaming

Python Stream Processing. A Faust fork

created at Oct. 22, 2020, 3:32 p.m.

Python

32 -1

1,668 +7

183 +0

multiwoven by Multiwoven

🔥🔥🔥 Open Source Alternative to Hightouch, Census, and RudderStack - Reverse ETL & Data Activation

created at Oct. 20, 2023, 3:21 p.m.

Ruby

17 +0

1,550 +2

66 +1

ekuiper by lf-edge

Lightweight data stream processing engine for IoT edge

created at July 3, 2019, 7:37 a.m.

Go

45 +0

1,493 +10

417 +1

DataProfiler by capitalone

What's in your data? Extract schema, statistics and entities from datasets

created at Nov. 9, 2020, 3:20 p.m.

Python

21 +0

1,437 +3

163 +1

FiloDB by filodb

Distributed Prometheus time series database

created at Jan. 14, 2015, 6:35 p.m.

Scala

89 +0

1,430 +2

227 +0

HyperDex by rescrv

HyperDex is a scalable, searchable key-value store

created at Feb. 20, 2012, 11:32 a.m.

C++

87 +0

1,394 -1

166 +0

ccm by riptano

A script to easily create and destroy an Apache Cassandra cluster on localhost

created at March 1, 2011, 9:42 a.m.

Python

74 +0

1,218 +2

303 +0

nessie by projectnessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics

created at April 9, 2020, 6:39 p.m.

Java

31 +0

1,049 +5

133 +3

pinball by pinterest

Pinball is a scalable workflow manager

created at March 4, 2015, 3:13 a.m.

JavaScript

53 +0

1,046 -1

130 +0

snappydata by TIBCOSoftware

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

created at Sept. 16, 2015, 10:36 a.m.

Scala

83 +0

1,041 +1

200 +0

mysql_utils by pinterest

Pinterest MySQL Management Tools

created at Oct. 24, 2015, 5:33 p.m.

Python

71 +0

883 +0

142 +0

snakebite by spotify

A pure Python HDFS client

created at May 7, 2013, 9:44 a.m.

Python

129 +1

854 -1

216 +0

heroic by spotify

The Heroic Time Series Database

created at May 29, 2015, 5:20 a.m.

Java

58 +0

848 +0

109 +0

Akumuli by akumuli

Time-series database

created at Jan. 28, 2014, 9:31 p.m.

C++

44 +0

835 -1

85 +0

hstream by hstreamdb

HStreamDB is an open-source, cloud-native streaming database for IoT and beyond. Modernize your data stack for real-time applications.

created at Aug. 31, 2020, 9:42 a.m.

Haskell

25 +0

710 +2

55 +0

dalmatinerdb by dalmatinerdb

See GitLab: https://gitlab.com/Project-FiFo/DalmatinerDB/dalmatinerdb

created at June 13, 2014, 7:08 p.m.

Erlang

37 +0

694 -1

43 +0
