mongo-spark by mongodb

The MongoDB Spark Connector

updated at May 12, 2024, 6:15 a.m.

Java

79 +0

702 -1

307 +0

ipex-llm by intel-analytics

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). A PyTorch LLM library that seamlessly integrates with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, etc.

updated at May 12, 2024, 3:48 a.m.

Python

242 +0

6,049 +48

1,204 +2

scikit-learn by scikit-learn

scikit-learn: machine learning in Python

updated at May 12, 2024, 2:16 a.m.

Python

2,141 +0

58,265 +63

25,004 +18

mleap by combust

MLeap: Deploy ML Pipelines to Production

updated at May 12, 2024, 1 a.m.

Scala

69 +0

1,496 +2

313 +0

deequ by awslabs

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

updated at May 11, 2024, 11:29 p.m.

Scala

80 +0

3,140 +6

514 +0

spark-nlp by JohnSnowLabs

State of the Art Natural Language Processing

updated at May 11, 2024, 9:33 p.m.

Scala

100 +0

3,708 +9

702 +1

hail by hail-is

Cloud-native genomic dataframes and batch computing

updated at May 11, 2024, 1:20 p.m.

Python

55 +0

938 +0

235 +0

joblib by joblib

Computing with Python functions.

updated at May 11, 2024, 9 a.m.

Python

61 +0

3,679 +9

405 +3

kyuubi by apache

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

updated at May 11, 2024, 7:40 a.m.

Scala

62 -1

1,947 +6

860 +1

sedona by apache

A cluster computing framework for processing large-scale geospatial data

updated at May 11, 2024, 6:20 a.m.

Java

96 +0

1,784 +4

646 +1

spark-testing-base by holdenk

Base classes to use when writing tests with Spark

updated at May 11, 2024, 6 a.m.

Scala

78 +0

1,497 +4

358 +0

koalas by databricks

Koalas: pandas API on Apache Spark

updated at May 11, 2024, 3:34 a.m.

Python

316 +0

3,321 +0

355 +0

delta by delta-io

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

updated at May 10, 2024, 11 p.m.

Scala

215 +0

6,935 +13

1,583 +3

cromwell by broadinstitute

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.

updated at May 10, 2024, 10:06 p.m.

Scala

112 +0

959 -1

350 +0

Mobius by Microsoft

C# and F# language binding and extensions to Apache Spark

updated at May 10, 2024, 9:08 p.m.

C#

145 +0

939 +2

212 +0

adam by bigdatagenomics

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

updated at May 10, 2024, 3:25 p.m.

Scala

100 +0

968 +1

304 -1

dplyr by tidyverse

dplyr: A grammar of data manipulation

updated at May 10, 2024, 1:59 p.m.

R

247 +1

4,665 +6

2,119 +1

neo4j-spark-connector by neo4j-contrib

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

updated at May 10, 2024, 1:50 p.m.

Scala

35 +0

304 +1

114 +0

graphframes by graphframes

updated at May 10, 2024, 11:48 a.m.

Scala

58 +0

972 +1

232 +0

SynapseML by Microsoft

Simple and Distributed Machine Learning

updated at May 10, 2024, 10:34 a.m.

Scala

146 +0

4,975 +3

815 +0
