igorbarinov/awesome-data-engineering

secor by pinterest

Secor is a service implementing Kafka log persistence

created at April 15, 2014, 10:26 p.m.

Java

68 +0

1,845 +0

540 +0


Gaffer by gchq

A large-scale entity and relation database supporting aggregation of properties

created at Dec. 14, 2015, 12:12 p.m.

Java

138 +0

1,772 +0

354 +0


kairosdb by kairosdb

Fast scalable time series database

created at Feb. 5, 2013, 10:27 p.m.

Java

116 +1

1,739 +1

344 +0


PyHive by dropbox

Python interface to Hive and Presto. 🐝

created at Feb. 1, 2014, 9:05 a.m.

Python

62 +0

1,676 +2

552 +1


faust by faust-streaming

Python Stream Processing. A Faust fork

created at Oct. 22, 2020, 3:32 p.m.

Python

32 +0

1,671 +3

183 +0


multiwoven by Multiwoven

🔥🔥🔥 Open Source Alternative to Hightouch, Census, and RudderStack - Reverse ETL & Data Activation

created at Oct. 20, 2023, 3:21 p.m.

Ruby

17 +0

1,552 +2

67 +1


ekuiper by lf-edge

Lightweight data stream processing engine for IoT edge

created at July 3, 2019, 7:37 a.m.

Go

45 +0

1,499 +6

416 -1


DataProfiler by capitalone

What's in your data? Extract schema, statistics and entities from datasets

created at Nov. 9, 2020, 3:20 p.m.

Python

21 +0

1,442 +5

163 +0


FiloDB by filodb

Distributed Prometheus time series database

created at Jan. 14, 2015, 6:35 p.m.

Scala

89 +0

1,430 +0

227 +0


HyperDex by rescrv

HyperDex is a scalable, searchable key-value store

created at Feb. 20, 2012, 11:32 a.m.

C++

87 +0

1,393 -1

166 +0


ccm by riptano

A script to easily create and destroy an Apache Cassandra cluster on localhost

created at March 1, 2011, 9:42 a.m.

Python

74 +0

1,220 +2

303 +0


nessie by projectnessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics

created at April 9, 2020, 6:39 p.m.

Java

31 +0

1,056 +7

134 +1


pinball by pinterest

Pinball is a scalable workflow manager

created at March 4, 2015, 3:13 a.m.

JavaScript

53 +0

1,046 +0

130 +0


snappydata by TIBCOSoftware

Project SnappyData - memory-optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

created at Sept. 16, 2015, 10:36 a.m.

Scala

83 +0

1,041 +0

200 +0


mysql_utils by pinterest

Pinterest MySQL Management Tools

created at Oct. 24, 2015, 5:33 p.m.

Python

71 +0

883 +0

142 +0


snakebite by spotify

A pure Python HDFS client

created at May 7, 2013, 9:44 a.m.

Python

129 +0

854 +0

216 +0


heroic by spotify

The Heroic Time Series Database

created at May 29, 2015, 5:20 a.m.

Java

58 +0

848 +0

109 +0


Akumuli by akumuli

Time-series database

created at Jan. 28, 2014, 9:31 p.m.

C++

44 +0

835 +0

85 +0


hstream by hstreamdb

HStreamDB is an open-source, cloud-native streaming database for IoT and beyond. Modernize your data stack for real-time applications.

created at Aug. 31, 2020, 9:42 a.m.

Haskell

25 +0

711 +1

55 +0


dalmatinerdb by dalmatinerdb

See GitLab: https://gitlab.com/Project-FiFo/DalmatinerDB/dalmatinerdb

created at June 13, 2014, 7:08 p.m.

Erlang

37 +0

694 +0

43 +0
