awesome-spark/awesome-spark

delta by delta-io

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

updated at May 19, 2024, 3:57 p.m.

Scala

215 +0

6,961 +26

1,584 +1

GitHub

ipex-llm by intel-analytics

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). A PyTorch LLM library that seamlessly integrates with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, etc.

updated at May 19, 2024, 3:24 p.m.

Python

242 +0

6,072 +23

1,208 +4

GitHub

dplyr by tidyverse

dplyr: A grammar of data manipulation

updated at May 19, 2024, 3:19 p.m.

R

245 -2

4,671 +6

2,117 -2

GitHub

incubator-toree by apache

Mirror of Apache Toree (Incubating)

updated at May 19, 2024, 3:10 p.m.

Scala

48 +0

732 +1

224 +0

GitHub

scikit-learn by scikit-learn

scikit-learn: machine learning in Python

updated at May 19, 2024, 11:28 a.m.

Python

2,141 +0

58,351 +86

25,018 +14

GitHub

flint by twosigma

A Time Series Library for Apache Spark

updated at May 19, 2024, 9:51 a.m.

Scala

77 +0

992 +0

184 +0

GitHub

SynapseML by Microsoft

Simple and Distributed Machine Learning

updated at May 19, 2024, 7:27 a.m.

Scala

146 +0

4,984 +9

818 +3

GitHub

sedona by apache

A cluster computing framework for processing large-scale geospatial data

updated at May 19, 2024, 7:09 a.m.

Java

95 -1

1,786 +2

646 +0

GitHub

hail by hail-is

Cloud-native genomic dataframes and batch computing

updated at May 19, 2024, 5:57 a.m.

Python

55 +0

941 +3

236 +1

GitHub

spark-nlp by JohnSnowLabs

State of the Art Natural Language Processing

updated at May 19, 2024, 4:15 a.m.

Scala

100 +0

3,716 +8

702 +0

GitHub

joblib by joblib

Computing with Python functions.

updated at May 18, 2024, 11:19 p.m.

Python

62 +1

3,685 +6

407 +2

GitHub

mongo-spark by mongodb

The MongoDB Spark Connector

updated at May 18, 2024, 8:19 a.m.

Java

79 +0

703 +1

307 +0

GitHub

dbscan-on-spark by irvingc

An implementation of DBSCAN running on top of Apache Spark

updated at May 18, 2024, 7:22 a.m.

Scala

19 +0

183 +1

58 +0

GitHub

spark-testing-base by holdenk

Base classes to use when writing tests with Spark

updated at May 17, 2024, 4:55 p.m.

Scala

78 +0

1,498 +1

358 +0

GitHub

spark-fast-tests by MrPowers

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

updated at May 17, 2024, 4:50 p.m.

Scala

15 +0

421 +3

73 +0

GitHub

kyuubi by apache

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

updated at May 17, 2024, 2:52 a.m.

Scala

62 +0

1,951 +4

861 +1

GitHub

flambo by sorenmacbeth

A Clojure DSL for Apache Spark

updated at May 16, 2024, 8:28 p.m.

Clojure

78 +0

609 +1

86 +0

GitHub

spark by dotnet

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

updated at May 16, 2024, 7:26 p.m.

C#

91 +0

2,001 +2

309 +1

GitHub

cromwell by broadinstitute

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.

updated at May 16, 2024, 2:33 p.m.

Scala

112 +0

961 +2

350 +0

GitHub

flintrock by nchammas

A command-line tool for launching Apache Spark clusters.

updated at May 16, 2024, 1:26 p.m.

Python

33 +0

632 +1

115 +1

GitHub
