Welcome to the UC Irvine Machine Learning Repository

We currently maintain 677 datasets as a service to the machine learning community. Here, you can donate and find datasets used by millions of people all around the world!

Popular Datasets

Iris

A small classic dataset from Fisher, 1936. One of the earliest known datasets used for evaluating classification methods.

Heart Disease

Four databases: Cleveland, Hungary, Switzerland, and VA Long Beach.

Wine Quality

Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009], http://www3.dsi.uminho.pt/pcortez/wine/).

Breast Cancer Wisconsin (Diagnostic)

Diagnostic Wisconsin Breast Cancer Database.

Adult

Predict whether an individual's annual income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset.

Bank Marketing

The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).


New Datasets


This dataset comprises molecular descriptors generated using RDKit, specifically curated for the study of drug-induced autoimmunity through ensemble machine learning approaches. It is divided into a training set and a testing set, containing numerical features that represent molecular properties and structural characteristics of drugs. The dataset supports predictive modeling tasks aimed at identifying potential autoimmune risks associated with drug candidates. These molecular descriptors include physicochemical properties, providing a comprehensive foundation for machine learning analysis. The dataset facilitates the development of interpretable models for drug toxicity prediction, contributing to advancements in computational toxicology and drug safety assessment.


The PIRvision dataset contains occupancy detection data collected from a Synchronized Low-Energy Electronically-chopped Passive Infra-Red sensing node in residential and office environments. Each observation represents 4 seconds of recorded human activity within the sensor Field-of-View (FoV).

Lattice-physics (PWR fuel assembly neutronics simulation results)

This dataset encompasses lattice-physics parameters—the infinite multiplication factor (k-inf) and the pin power peaking factor (PPPF)—modeled as functions of variations in fuel pin enrichments for the NuScale US600 fuel assembly type C-01 (NFAC-01) [NuScale FSAR]. These critical parameters were computed using the MCNP6 code, a Monte Carlo-based tool for nuclear reactor criticality simulations. Fuel pin enrichments were uniformly sampled within the range of 0.7–5.0 weight percent (w/o) U-235 to generate the dataset. The dataset contains 39 features, each representing the enrichment of a specific fuel rod in a one-eighth symmetry of the NFAC assembly. The outputs of interest are the k-inf and PPPF values associated with these enrichments.
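The input side of the description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes each of the 39 pin enrichments is drawn independently from a uniform distribution over 0.7 to 5.0 w/o U-235, and the function name, sample count, and seed are hypothetical. It does not reproduce the k-inf or PPPF outputs, which the description attributes to MCNP6 simulations.

```python
import random

N_PINS = 39            # number of features stated in the dataset description
LOW, HIGH = 0.7, 5.0   # enrichment range in w/o U-235, from the description


def sample_enrichments(n_samples, seed=0):
    """Draw n_samples enrichment vectors of length N_PINS.

    Assumption: every pin is sampled independently and uniformly
    in [LOW, HIGH]; the dataset authors' exact procedure may differ.
    """
    rng = random.Random(seed)
    return [
        [rng.uniform(LOW, HIGH) for _ in range(N_PINS)]
        for _ in range(n_samples)
    ]


# Hypothetical usage: five input vectors, one per simulated assembly.
samples = sample_enrichments(5)
print(len(samples), len(samples[0]))  # prints "5 39"
```

Each inner list would correspond to one row of input features; the matching k-inf and PPPF targets would have to come from the dataset itself.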

Gas sensor array low-concentration

This dataset contains 6 gas responses collected by a sensor array consisting of 10 metal oxide semiconductor sensors, with gas concentrations at the ppb level (below the minimum detection limit of the sensors).

Twitter Geospatial Data

Seven days of geo-tagged Tweet data from the United States with exact GPS location and timestamp.


A Comprehensive CAN Bus Attack Dataset from Moving Vehicles for Intrusion Detection System Evaluation. This dataset includes CAN bus attacks collected from a modern automobile equipped with autonomous driving capabilities, operating in real-world driving scenarios. The dataset encompasses physically verified attacks to enhance the comparison and validation of in-vehicle network Intrusion Detection Systems.

