Datasets

MNIST database

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.
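For readers who want to inspect the raw files, here is a minimal sketch of parsing an MNIST image file. It assumes the file is in the IDX layout used by the original distribution (a 16-byte big-endian header of magic number 2051, image count, rows, and columns, followed by one unsigned byte per pixel) and that it has already been decompressed. The function name `parse_idx_images` is hypothetical and not part of any library.

```python
import struct

IDX_IMAGE_MAGIC = 2051  # 0x00000803: unsigned bytes, 3 dimensions


def parse_idx_images(data):
    """Split raw IDX image bytes into per-image pixel buffers.

    Returns (images, rows, cols), where each image is a bytes object
    of length rows * cols in row-major order.
    """
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != IDX_IMAGE_MAGIC:
        raise ValueError(f"unexpected magic number: {magic}")

    size = rows * cols
    pixels = data[16:]
    if len(pixels) != count * size:
        raise ValueError("pixel data length does not match the header")

    images = [pixels[i * size:(i + 1) * size] for i in range(count)]
    return images, rows, cols
```

In practice, most frameworks ship their own MNIST loaders, so a hand-written parser like this is mainly useful for checking what the files contain.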

EMNIST Dataset - handwritten letters and digits

ARC Corpus - AI2 Reasoning Challenge (ARC)

The ARC Corpus contains 14M unordered, science-related sentences, including knowledge relevant to ARC, and is provided as a starting point for addressing the challenge. The corpus contains sentences from science-related documents downloaded from the Web, dictionary definitions from Wiktionary, and articles from Simple Wikipedia that were tagged as science.

LLM Datasets

WikiText-103 Dataset | Papers With Code

BBH - OpenCompass

BIG-Bench Hard (BBH) is a suite of 23 challenging BIG-Bench tasks. These are the tasks on which prior language model evaluations did not outperform the average human rater.

BIG-Bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities.

GitHub - google/BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models

Common Crawl - Blog - October 2024 Crawl Archive Now Available

LAION (Large-scale Artificial Intelligence Open Network)

LAION-5B: A NEW ERA OF OPEN LARGE-SCALE MULTI-MODAL DATASETS | LAION

LAION - Wikipedia

YCSB Workloads

YCSB includes a set of core workloads that define a basic benchmark for cloud systems.

The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.

https://en.wikipedia.org/wiki/YCSB

https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads
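As a rough illustration of what the core workloads define, the sketch below encodes the operation mix of workloads A through F as described on the Core Workloads wiki page, and draws a random operation sequence from a chosen mix. It is not YCSB code: the names `CORE_WORKLOADS` and `sample_operations` are hypothetical, and the real workloads also specify request distributions (zipfian, latest, uniform) that this sketch ignores.

```python
import random

# Operation proportions for the YCSB core workloads (A-F).
CORE_WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50},             # update heavy
    "B": {"read": 0.95, "update": 0.05},             # read mostly
    "C": {"read": 1.00},                             # read only
    "D": {"read": 0.95, "insert": 0.05},             # read latest
    "E": {"scan": 0.95, "insert": 0.05},             # short ranges
    "F": {"read": 0.50, "read_modify_write": 0.50},  # read-modify-write
}


def sample_operations(workload, n, seed=0):
    """Draw n operation names according to the workload's mix."""
    mix = CORE_WORKLOADS[workload]
    rng = random.Random(seed)  # seeded so runs are reproducible
    ops = list(mix)
    weights = [mix[op] for op in ops]
    return rng.choices(ops, weights=weights, k=n)


if __name__ == "__main__":
    print(sample_operations("A", 10))
```

A driver built this way would dispatch each sampled operation name to the corresponding database call and record its latency.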

TPC (Transaction Processing Performance Council)

TPC stands for Transaction Processing Performance Council. It is a non-profit organization that was founded in 1988. The TPC's goal is to define benchmarks for transaction processing and databases. They also distribute objective and verifiable performance data to the industry.

Here are some TPC benchmarks:

TPC-C: An online transaction processing (OLTP) benchmark modeled on order entry for a wholesale supplier
TPC-E: A newer OLTP benchmark modeled on a brokerage firm
TPC-H: A decision support benchmark built around ad hoc analytical queries
- Example: the TPC-H dataset at a scale factor (SF) of 50 consists of 8 tables of different sizes. The largest table (lineitem) has 300M rows, the second-largest (orders) has 75M rows, and so forth.
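The row counts quoted for SF=50 follow from simple scaling, which the sketch below works through. It uses the base cardinalities at SF=1 from the TPC-H specification. The lineitem figure is approximate (about 6M rows per unit of SF), and nation and region stay at a fixed size regardless of SF. The function name `tpch_row_counts` is hypothetical.

```python
# Approximate row counts at scale factor 1 for the tables that scale.
BASE_ROWS = {
    "lineitem": 6_000_000,  # approximate; the exact count varies slightly
    "orders": 1_500_000,
    "partsupp": 800_000,
    "part": 200_000,
    "customer": 150_000,
    "supplier": 10_000,
}

# These two tables do not grow with the scale factor.
FIXED_ROWS = {"nation": 25, "region": 5}


def tpch_row_counts(scale_factor):
    """Estimate the row count of each of the 8 TPC-H tables."""
    counts = {table: int(rows * scale_factor) for table, rows in BASE_ROWS.items()}
    counts.update(FIXED_ROWS)
    return counts


if __name__ == "__main__":
    by_size = sorted(tpch_row_counts(50).items(), key=lambda kv: -kv[1])
    for table, rows in by_size:
        print(f"{table:10s} {rows:>12,d}")
```

At SF=50 this gives 300M rows for lineitem and 75M for orders, matching the figures above.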

Other TPC benchmarks include: TPC-DS, TPCI.

TPC-DS has more difficult SQL than TPC-H, including queries that use many different types of joins.

DS - Decision Support

TPC-DS Homepage

H and DS use similar datasets, and DS is basically the next-gen version of H. While H generates relatively straightforward queries (22 queries) and is generally shard-friendly, DS (99 queries) gets its kicks from using advanced SQL features and functions, and it loves lopsided filters. Running DS is notoriously, intentionally difficult.

TPC Benchmarks Overview

What is the difference between TPC-H and TPC-DS benchmarks? | by Albert Wong | Oct, 2023 | Medium

GitHub - gregrahn/tpcds-kit: TPC-DS benchmark kit with some modifications/fixes

Others

TICKIT - Sample database - Amazon Redshift
- This small database consists of seven tables: two fact tables and five dimensions
miriad/miriad-4.4M · Datasets at Hugging Face
Stock Market Dataset | Kaggle
GitHub - lorint/AdventureWorks-for-Postgres: Set up the AdventureWorks sample database for use with Postgres
- AdventureWorks Sample Databases - SQL Server | Microsoft Learn
