mleap by combust

MLeap: Deploy ML Pipelines to Production

updated at May 12, 2024, 1 a.m.

Scala

69 +0

1,496 +2

313 +0

deequ by awslabs

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

updated at May 11, 2024, 11:29 p.m.

Scala

80 +0

3,140 +6

514 +0

spark-nlp by JohnSnowLabs

State of the Art Natural Language Processing

updated at May 11, 2024, 9:33 p.m.

Scala

100 +0

3,708 +9

702 +1

kyuubi by apache

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

updated at May 11, 2024, 7:40 a.m.

Scala

62 -1

1,947 +6

860 +1

spark-testing-base by holdenk

Base classes to use when writing tests with Spark

updated at May 11, 2024, 6 a.m.

Scala

78 +0

1,497 +4

358 +0

delta by delta-io

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

updated at May 10, 2024, 11 p.m.

Scala

215 +0

6,935 +13

1,583 +3

cromwell by broadinstitute

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments

updated at May 10, 2024, 10:06 p.m.

Scala

112 +0

959 -1

350 +0

adam by bigdatagenomics

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

updated at May 10, 2024, 3:25 p.m.

Scala

100 +0

968 +1

304 -1

neo4j-spark-connector by neo4j-contrib

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

updated at May 10, 2024, 1:50 p.m.

Scala

35 +0

304 +1

114 +0

graphframes by graphframes

updated at May 10, 2024, 11:48 a.m.

Scala

58 +0

972 +1

232 +0

SynapseML by Microsoft

Simple and Distributed Machine Learning

updated at May 10, 2024, 10:34 a.m.

Scala

146 +0

4,975 +3

815 +0

spark-xml by databricks

XML data source for Spark SQL and DataFrames

updated at May 10, 2024, 3:38 a.m.

Scala

40 +0

489 +1

223 +0

incubator-livy by apache

Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

updated at May 10, 2024, 3:34 a.m.

Scala

57 +0

857 +1

594 +0

spark-daria by MrPowers

Essential Spark extensions and helper methods ✨😲

updated at May 9, 2024, 4:48 p.m.

Scala

33 +0

743 +1

148 +0

flint by twosigma

A Time Series Library for Apache Spark

updated at May 9, 2024, 3:30 a.m.

Scala

77 +0

992 +0

184 +0

spark-cassandra-connector by datastax

DataStax Connector for Apache Spark to Apache Cassandra

updated at May 9, 2024, 3:23 a.m.

Scala

162 +0

1,931 +1

913 -1

spark-jobserver by spark-jobserver

REST job server for Apache Spark

updated at May 9, 2024, 3:16 a.m.

Scala

221 +0

2,841 +1

1,004 +0

sparkling-water by h2oai

Sparkling Water provides H2O functionality inside a Spark cluster

updated at May 8, 2024, 4:42 p.m.

Scala

179 +1

951 -1

363 +0

magellan by harsha2010

Geo Spatial Data Analytics on Spark

updated at May 8, 2024, 1:18 p.m.

Scala

65 +0

534 +1

150 +0

spark-csv by databricks

CSV Data Source for Apache Spark 1.x

updated at May 7, 2024, 12:54 p.m.

Scala

418 +0

1,049 +1

446 +0
