delta by delta-io

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

updated at May 19, 2024, 3:57 p.m.

Scala

215 +0

6,961 +26

1,584 +1

ipex-llm by intel-analytics

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). A PyTorch LLM library that seamlessly integrates with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, etc.

updated at May 19, 2024, 3:24 p.m.

Python

242 +0

6,072 +23

1,208 +4

dplyr by tidyverse

dplyr: A grammar of data manipulation

updated at May 19, 2024, 3:19 p.m.

R

245 -2

4,671 +6

2,117 -2

incubator-toree by apache

Mirror of Apache Toree (Incubating)

updated at May 19, 2024, 3:10 p.m.

Scala

48 +0

732 +1

224 +0

scikit-learn by scikit-learn

scikit-learn: machine learning in Python

updated at May 19, 2024, 11:28 a.m.

Python

2,141 +0

58,351 +86

25,018 +14

flint by twosigma

A Time Series Library for Apache Spark

updated at May 19, 2024, 9:51 a.m.

Scala

77 +0

992 +0

184 +0

SynapseML by Microsoft

Simple and Distributed Machine Learning

updated at May 19, 2024, 7:27 a.m.

Scala

146 +0

4,984 +9

818 +3

sedona by apache

A cluster computing framework for processing large-scale geospatial data

updated at May 19, 2024, 7:09 a.m.

Java

95 -1

1,786 +2

646 +0

hail by hail-is

Cloud-native genomic dataframes and batch computing

updated at May 19, 2024, 5:57 a.m.

Python

55 +0

941 +3

236 +1

spark-nlp by JohnSnowLabs

State of the Art Natural Language Processing

updated at May 19, 2024, 4:15 a.m.

Scala

100 +0

3,716 +8

702 +0

joblib by joblib

Computing with Python functions.

updated at May 18, 2024, 11:19 p.m.

Python

62 +1

3,685 +6

407 +2

mongo-spark by mongodb

The MongoDB Spark Connector

updated at May 18, 2024, 8:19 a.m.

Java

79 +0

703 +1

307 +0

dbscan-on-spark by irvingc

An implementation of DBSCAN running on top of Apache Spark

updated at May 18, 2024, 7:22 a.m.

Scala

19 +0

183 +1

58 +0

spark-testing-base by holdenk

Base classes to use when writing tests with Spark

updated at May 17, 2024, 4:55 p.m.

Scala

78 +0

1,498 +1

358 +0

spark-fast-tests by MrPowers

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

updated at May 17, 2024, 4:50 p.m.

Scala

15 +0

421 +3

73 +0

kyuubi by apache

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

updated at May 17, 2024, 2:52 a.m.

Scala

62 +0

1,951 +4

861 +1

flambo by sorenmacbeth

A Clojure DSL for Apache Spark

updated at May 16, 2024, 8:28 p.m.

Clojure

78 +0

609 +1

86 +0

spark by dotnet

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

updated at May 16, 2024, 7:26 p.m.

C#

91 +0

2,001 +2

309 +1

cromwell by broadinstitute

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.

updated at May 16, 2024, 2:33 p.m.

Scala

112 +0

961 +2

350 +0

flintrock by nchammas

A command-line tool for launching Apache Spark clusters.

updated at May 16, 2024, 1:26 p.m.

Python

33 +0

632 +1

115 +1
