Datasets

Home - Data Commons

https://www.kaggle.com/dalpozz/creditcardfraud

20+ Amazing (And Free) Data Sources Anyone Can Use To Build AIs

MNIST database

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.

EMNIST Dataset - handwritten character digits

ARC Corpus - AI2 Reasoning Challenge (ARC)

The ARC Corpus contains 14M unordered, science-related sentences including knowledge relevant to ARC, and is provided as a starting point for addressing the challenge. The Corpus contains sentences from science-related documents downloaded from the Web, dictionary definitions from Wiktionary, and articles from Simple Wikipedia that were tagged as science.

LLM Datasets

WikiText-103 Dataset | Papers With Code

BBH - OpenCompass

BIG-Bench Hard (BBH) is a suite of 23 challenging BIG-Bench tasks. These are the tasks for which prior language model evaluations did not outperform the average human rater.

BIG-Bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities.

GitHub - google/BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models

Common Crawl - Blog - October 2024 Crawl Archive Now Available

LAION (Large-scale Artificial Intelligence Open Network)

LAION-5B: A NEW ERA OF OPEN LARGE-SCALE MULTI-MODAL DATASETS | LAION

LAION - Wikipedia

YCSB Workloads

YCSB includes a set of core workloads that define a basic benchmark for cloud systems.

The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.

https://en.wikipedia.org/wiki/YCSB

https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads
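
The core workloads are essentially different mixes of operations. Below is a minimal Python sketch of those mixes, assuming the proportions listed on the Core Workloads wiki page (A: 50/50 read/update, B: 95/5 read/update, C: read only, D: 95/5 read/insert, E: 95/5 scan/insert, F: 50/50 read/read-modify-write). The names `CORE_WORKLOADS`, `sample_operations`, and `operation_counts` are hypothetical helpers for illustration and are not part of YCSB itself. The sketch also ignores key distributions (zipfian, latest), which the real workloads do specify.

```python
import random
from collections import Counter

# Operation mixes for the six YCSB core workloads (assumed from the wiki page).
# Key-selection distributions are deliberately left out of this sketch.
CORE_WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50},             # update heavy
    "B": {"read": 0.95, "update": 0.05},             # read mostly
    "C": {"read": 1.00},                             # read only
    "D": {"read": 0.95, "insert": 0.05},             # read latest
    "E": {"scan": 0.95, "insert": 0.05},             # short ranges
    "F": {"read": 0.50, "read_modify_write": 0.50},  # read-modify-write
}


def sample_operations(workload, count, seed=0):
    """Draw `count` operation names according to the workload's mix."""
    mix = CORE_WORKLOADS[workload]
    rng = random.Random(seed)  # seeded so runs are repeatable
    ops = list(mix)
    weights = [mix[op] for op in ops]
    return rng.choices(ops, weights=weights, k=count)


def operation_counts(workload, count, seed=0):
    """Count how often each operation appears in a sampled run."""
    return Counter(sample_operations(workload, count, seed))


if __name__ == "__main__":
    for name in CORE_WORKLOADS:
        print(name, dict(operation_counts(name, 1000)))
```

A real YCSB run is driven by its own workload property files rather than code like this; the sketch only shows how the workloads differ in their operation mix.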

TPC (Transaction Processing Performance Council)

TPC stands for Transaction Processing Performance Council. It is a non-profit organization that was founded in 1988. The TPC's goal is to define benchmarks for transaction processing and databases. They also distribute objective and verifiable performance data to the industry.

Here are some TPC benchmarks:

  • TPC-C: An online transaction processing (OLTP) benchmark that models the order-entry activity of a wholesale supplier
  • TPC-E: An OLTP benchmark that models the workload of a brokerage firm
  • TPC-H: A decision support benchmark made up of business-oriented ad hoc queries and concurrent data modifications

Other TPC benchmarks include: TPC-DS, TPCI.

Compared to TPC-H, TPC-DS uses more difficult SQL, such as queries with different types of JOINs.

DS - Decision Support

H and DS use similar datasets, and DS is basically the next-gen version of H. While H generates relatively straightforward queries (22 queries) and is generally shard-friendly, DS (99 queries) gets its kicks from using advanced SQL features and functions, and it loves lopsided filters. Running DS is notoriously, intentionally difficult.
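
To make the difference in query style concrete, here is a toy Python sketch using the standard `sqlite3` module. The `sales` table, its columns, and both queries are made up for illustration; they are not taken from either TPC specification. The first query is a plain grouped aggregate, roughly the flavor of H. The second uses a common table expression plus a correlated subquery, which is closer to the flavor of DS.

```python
import sqlite3

# Hypothetical toy table; not a real TPC-H or TPC-DS schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (store TEXT, customer TEXT, amount REAL);
INSERT INTO sales VALUES
  ('s1', 'c1', 10.0), ('s1', 'c2', 20.0), ('s1', 'c3', 90.0),
  ('s2', 'c4', 50.0), ('s2', 'c5', 55.0);
""")

# H-flavored: a straightforward grouped aggregate.
simple_query = """
SELECT store, SUM(amount) AS total, COUNT(*) AS n
FROM sales
GROUP BY store
ORDER BY store
"""

# DS-flavored: a CTE plus a correlated subquery that finds customers
# spending more than 1.2x the average customer total in their store.
complex_query = """
WITH customer_total AS (
  SELECT store, customer, SUM(amount) AS total
  FROM sales
  GROUP BY store, customer
)
SELECT ct.customer
FROM customer_total AS ct
WHERE ct.total > 1.2 * (
  SELECT AVG(ct2.total)
  FROM customer_total AS ct2
  WHERE ct2.store = ct.store
)
ORDER BY ct.customer
"""

simple_rows = conn.execute(simple_query).fetchall()
complex_rows = conn.execute(complex_query).fetchall()

print(simple_rows)   # per-store totals
print(complex_rows)  # customers well above their store's average
```

With this data only `c3` clears the threshold, since the filter is lopsided toward one store. Real DS queries are far larger than this.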

TPC Benchmarks Overview

What is the difference between TPC-H and TPC-DS benchmarks? | by Albert Wong | Oct, 2023 | Medium

Tools