awesome-spark/awesome-spark

scikit-learn by scikit-learn

scikit-learn: machine learning in Python

created at Aug. 17, 2010, 9:43 a.m.

Python

2,140 +1

58,132 +86

24,976 +6

GitHub

spark-csv by databricks

CSV Data Source for Apache Spark 1.x

created at Dec. 3, 2014, 12:56 a.m.

Scala

418 +2

1,048 +0

446 +0

GitHub

koalas by databricks

Koalas: pandas API on Apache Spark

created at Jan. 3, 2019, 9:46 p.m.

Python

316 +1

3,320 +2

353 +0

GitHub

dplyr by tidyverse

dplyr: A grammar of data manipulation

created at Oct. 28, 2012, 1:39 p.m.

R

246 +0

4,656 +3

2,116 +0

GitHub

ipex-llm by intel-analytics

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). A PyTorch LLM library that seamlessly integrates with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, etc.

created at Aug. 29, 2016, 7:59 a.m.

Python

243 +0

5,954 +43

1,200 +5

GitHub

spark-jobserver by spark-jobserver

REST job server for Apache Spark

created at Aug. 21, 2014, 11:07 p.m.

Scala

221 +0

2,841 -1

1,005 -1

GitHub

delta by delta-io

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

created at April 22, 2019, 6:56 p.m.

Scala

215 +1

6,903 +27

1,573 +4

GitHub

oryx by OryxProject

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

created at July 25, 2014, 8:08 p.m.

Java

209 +0

1,787 +1

405 +0

GitHub

blaze by blaze

NumPy and Pandas interface to Big Data

created at Oct. 26, 2012, 2:25 p.m.

Python

195 +0

3,181 -1

393 +0

GitHub

spark-notebook by spark-notebook

Interactive and Reactive Data Science using Scala and Spark.

created at Sept. 5, 2014, 7:35 p.m.

JavaScript

190 +0

3,147 +1

654 +0

GitHub

sparkling-water by h2oai

Sparkling Water provides H2O functionality inside Spark cluster

created at Oct. 13, 2014, 11:06 p.m.

Scala

178 +0

952 -1

363 +1

GitHub

spark-cassandra-connector by datastax

DataStax Connector for Apache Spark to Apache Cassandra

created at June 27, 2014, 3:45 p.m.

Scala

162 +0

1,930 +0

914 +0

GitHub

aas by sryza

Code to accompany Advanced Analytics with Spark from O'Reilly Media

created at Nov. 8, 2014, 10:18 p.m.

Scala

148 +0

1,515 +1

1,032 +0

GitHub

SynapseML by Microsoft

Simple and Distributed Machine Learning

created at June 5, 2017, 8:23 a.m.

Scala

146 +0

4,968 +1

813 +1

GitHub

Mobius by Microsoft

C# and F# language binding and extensions to Apache Spark

created at Oct. 27, 2015, 7:21 p.m.

C#

145 +0

937 +0

212 +0

GitHub

spark-timeseries by sryza

A library for time series analysis on Apache Spark

created at March 11, 2015, 8:14 a.m.

Scala

134 +0

1,189 +1

427 +0

GitHub

cromwell by broadinstitute

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments

created at April 17, 2015, 7:39 p.m.

Scala

112 -1

958 +1

349 +1

GitHub

crossdata by Stratio

DISCONTINUED - Easy access to big things. Library for Apache Spark extending and improving its capabilities

created at Feb. 6, 2014, 9:41 a.m.

Scala

101 +0

169 +0

51 +0

GitHub

spark-nlp by JohnSnowLabs

State of the Art Natural Language Processing

created at Sept. 24, 2017, 7:36 p.m.

Scala

100 +0

3,693 +22

699 +1

GitHub

adam by bigdatagenomics

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

created at Nov. 19, 2013, 11:47 p.m.

Scala

100 +0

967 +0

304 +0

GitHub
