igorbarinov/awesome-data-engineering

kafka-docker by wurstmeister

Dockerfile for Apache Kafka

created at Dec. 23, 2013, 10:01 p.m.

Shell

162 +0

6,940 +1

2,729 +1

GitHub

weave by weaveworks

Simple, resilient multi-host containers networking and more.

created at Aug. 18, 2014, 5:19 a.m.

Go

228 +0

6,619 -1

671 +1

GitHub

kryo by EsotericSoftware

Java binary serialization and cloning: fast, efficient, automatic

created at Nov. 6, 2013, 1:24 p.m.

HTML

289 +0

6,214 +7

828 +1

GitHub

snappy by google

A fast compressor/decompressor

created at March 3, 2014, 9:58 p.m.

C++

194 +0

6,196 +11

985 +1

GitHub

kcat by edenhill

Generic command line non-JVM Apache Kafka producer and consumer

created at March 30, 2014, 4:25 a.m.

C

77 +0

5,461 +6

484 +0

GitHub

opentsdb by OpenTSDB

A scalable, distributed Time Series Database.

created at Aug. 27, 2010, 2:05 a.m.

Java

334 +0

5,006 +4

1,247 +0

GitHub

zombodb by zombodb

Making Postgres and Elasticsearch work together like it's 2023

created at July 17, 2015, 4:53 p.m.

PLpgSQL

92 +0

4,685 +1

212 +0

GitHub

lakeFS by treeverse

lakeFS - Data version control for your data lake | Git for data

created at Sept. 12, 2019, 11:46 a.m.

Go

44 +0

4,468 +6

359 +0

GitHub

rudder-server by rudderlabs

Privacy and Security focused Segment-alternative, in Golang and React

created at July 19, 2019, 9:24 a.m.

Go

62 -1

4,103 +5

318 +1

GitHub

aws-sdk-pandas by aws

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

created at Feb. 26, 2019, 1:39 a.m.

Python

60 +0

3,938 -1

702 +1

GitHub

flocker by ClusterHQ

Container data volume manager for your Dockerized application

created at April 28, 2014, 6:02 p.m.

Python

169 +0

3,389 -1

290 +0

GitHub

heka by mozilla-services

DEPRECATED: Data collection and processing made easy.

created at Oct. 16, 2012, 5:20 p.m.

Go

203 +0

3,389 +0

528 +0

GitHub

flockdb by twitter-archive

A distributed, fault-tolerant graph database

created at April 12, 2010, 3:53 a.m.

Scala

278 +0

3,338 +1

258 +0

GitHub

smart_open by piskvorky

Utils for streaming large files (S3, HDFS, gzip, bz2...)

created at Jan. 2, 2015, 1:05 p.m.

Python

47 +0

3,221 +3

382 +0

GitHub

elasticsearch-jdbc by jprante

JDBC importer for Elasticsearch

created at June 2, 2012, 11:17 p.m.

Java

230 +0

2,838 +1

709 +0

GitHub

kafka-node by SOHU-Co

Node.js client for Apache Kafka 0.8 and later.

created at Oct. 23, 2013, 3:34 a.m.

JavaScript

97 +0

2,664 -1

628 +0

GitHub

pipelinedb by pipelinedb

High-performance time-series aggregation for PostgreSQL

created at Nov. 26, 2013, 12:11 a.m.

C

104 +0

2,637 +3

241 +0

GitHub

pyxley by stitchfix

Python helpers for building dashboards using Flask and React

created at June 22, 2015, 10:23 p.m.

JavaScript

279 +0

2,271 +1

258 +0

GitHub

gobblin by apache

A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

created at Dec. 1, 2014, 6:10 p.m.

Java

165 +0

2,228 -4

750 -1

GitHub

hamilton by DAGWorks-Inc

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.

created at Feb. 23, 2023, 5:16 p.m.

Jupyter Notebook

17 +0

1,885 +7

126 +1

GitHub
