LightGBM
Release 4.0.0
Microsoft Corporation
1 Installation Guide
2 Quick Start
3 Python-package Introduction
4 Features
5 Experiments
6 Parameters
7 Parameters Tuning
8 C API
19 Documentation
Index
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed
and efficient, with the following advantages:
• Faster training speed and higher efficiency.
• Lower memory usage.
• Better accuracy.
• Support of parallel, distributed, and GPU learning.
• Capable of handling large-scale data.
For more details, please refer to Features.
CHAPTER ONE: INSTALLATION GUIDE
This is a guide for building the LightGBM Command Line Interface (CLI). If you want to build the Python-package or
R-package please refer to Python-package and R-package folders respectively.
All instructions below are aimed at compiling the 64-bit version of LightGBM. It is worth compiling the 32-bit version
only in very rare special cases involving environmental limitations. The 32-bit version is slow and untested, so use it
at your own risk and don’t forget to adjust some of the commands below when installing.
If you need to build a static library instead of a shared one, you can add -DBUILD_STATIC_LIB=ON to CMake flags.
Users who want to perform benchmarking can make LightGBM output time costs for different internal routines by
adding -DUSE_TIMETAG=ON to CMake flags.
It is possible to build LightGBM in debug mode. In this mode all compiler optimizations are disabled and LightGBM
performs more checks internally. To enable debug mode you can add -DUSE_DEBUG=ON to CMake flags or choose
Debug_* configuration (e.g. Debug_DLL, Debug_mpi) in Visual Studio depending on how you are building LightGBM.
In addition to the debug mode, LightGBM can be built with compiler sanitizers. To enable them add
-DUSE_SANITIZER=ON -DENABLED_SANITIZERS="address;leak;undefined" to CMake flags. These values re-
fer to the following supported sanitizers:
• address - AddressSanitizer (ASan);
• leak - LeakSanitizer (LSan);
• undefined - UndefinedBehaviorSanitizer (UBSan);
• thread - ThreadSanitizer (TSan).
Please note that ThreadSanitizer cannot be used together with other sanitizers. For more information and additional sanitizer
parameters, please refer to the following docs. It is very useful to build C++ unit tests with sanitizers.
You can also download the artifacts of the latest successful build on the master branch (nightly builds) here.
Contents
• Windows
• Linux
• macOS
• Docker
• Build Threadless Version (not Recommended)
• Build MPI Version
• Build GPU Version
1.1 Windows
With GUI
1. Install Git for Windows, CMake (3.8 or higher) and VS Build Tools (VS Build Tools is not needed if Visual
Studio (2015 or newer) is already installed).
2. Run the following commands:
1.1.2 MinGW-w64
1.2 Linux
1.3 macOS
On macOS LightGBM can be installed using Homebrew, or can be built using CMake and Apple Clang or gcc.
2. Install OpenMP:
1.3.2 gcc
2. Install gcc:
mkdir build
cd build
1.4 Docker
1.5 Build Threadless Version (not Recommended)
The default build version of LightGBM is based on OpenMP. You can build LightGBM without OpenMP support, but
it is strongly not recommended.
1.5.1 Windows
With GUI
1. Install Git for Windows, CMake (3.8 or higher) and VS Build Tools (VS Build Tools is not needed if Visual
Studio (2015 or newer) is already installed).
2. Run the following commands:
MinGW-w64
1.5.2 Linux
On Linux a version of LightGBM without OpenMP support can be built using CMake and gcc or Clang.
1. Install CMake.
2. Run the following commands:
1.5.3 macOS
On macOS a version of LightGBM without OpenMP support can be built using CMake and Apple Clang or gcc.
Apple Clang
gcc
2. Install gcc:
mkdir build
cd build
cmake -DUSE_OPENMP=OFF ..
make -j4
1.6 Build MPI Version
The default build version of LightGBM is based on sockets. LightGBM also supports MPI. MPI is a high-performance
communication approach with RDMA support.
If you need to run a distributed learning application with high-performance communication, you can build LightGBM
with MPI support.
1.6.1 Windows
With GUI
1. You need to install MS MPI first. Both msmpisdk.msi and msmpisetup.exe are needed.
2. Install Visual Studio (2015 or newer).
3. Navigate to one of the releases at https://github.com/microsoft/LightGBM/releases, download
LightGBM-complete_source_code_zip.zip, and unzip it.
4. Go to LightGBM-master/windows folder.
5. Open LightGBM.sln file with Visual Studio, choose Release_mpi configuration and click BUILD -> Build
Solution (Ctrl+Shift+B).
If you have errors about Platform Toolset, go to PROJECT -> Properties -> Configuration Properties
-> General and select the toolset installed on your machine.
The .exe file will be in LightGBM-master/windows/x64/Release_mpi folder.
1. You need to install MS MPI first. Both msmpisdk.msi and msmpisetup.exe are needed.
2. Install Git for Windows, CMake (3.8 or higher) and VS Build Tools (VS Build Tools is not needed if Visual
Studio (2015 or newer) is already installed).
3. Run the following commands:
1.6.2 Linux
On Linux an MPI version of LightGBM can be built using Open MPI, CMake and gcc or Clang.
1. Install Open MPI.
2. Install CMake.
3. Run the following commands:
1.6.3 macOS
On macOS an MPI version of LightGBM can be built using Open MPI, CMake and Apple Clang or gcc.
Apple Clang
2. Install OpenMP:
gcc
2. Install gcc:
mkdir build
cd build
cmake -DUSE_MPI=ON ..
make -j4
1.7 Build GPU Version
1.7.1 Linux
On Linux a GPU version of LightGBM (device_type=gpu) can be built using OpenCL, Boost, CMake and gcc or
Clang.
The following dependencies should be installed before compilation:
• OpenCL 1.2 headers and libraries, which are usually provided by the GPU manufacturer.
The generic OpenCL ICD packages (for example, the Debian packages ocl-icd-libopencl1 and
ocl-icd-opencl-dev) can also be used.
• libboost 1.56 or later (1.61 or later is recommended).
We use Boost.Compute as the interface to the GPU, which is part of the Boost library since version 1.61. However,
since we include the source code of Boost.Compute as a submodule, we only require that the host has Boost 1.56
or later installed. We also use Boost.Align for memory allocation. Boost.Compute requires Boost.System and
Boost.Filesystem to store the offline kernel cache.
The following Debian packages should provide necessary Boost libraries: libboost-dev,
libboost-system-dev, libboost-filesystem-dev.
• CMake 3.2 or later.
To build LightGBM GPU version, run the following commands:
make -j4
1.7.2 Windows
On Windows a GPU version of LightGBM (device_type=gpu) can be built using OpenCL, Boost, CMake and VS
Build Tools or MinGW.
If you use MinGW, the build procedure is similar to the build on Linux. Refer to GPU Windows Compilation to get
more details.
The following procedure is for the MSVC (Microsoft Visual C++) build.
1. Install Git for Windows, CMake (3.8 or higher) and VS Build Tools (VS Build Tools is not needed if Visual
Studio (2015 or newer) is installed).
2. Install OpenCL for Windows. The installation depends on the brand (NVIDIA, AMD, Intel) of your GPU card.
• For running on Intel, get Intel SDK for OpenCL.
• For running on AMD, get AMD APP SDK.
• For running on NVIDIA, get CUDA Toolkit.
Further reading and correspondence table: GPU SDK Correspondence and Device Targeting Table.
3. Install Boost Binaries.
Note: Match your Visual C++ version:
Visual Studio 2015 -> msvc-14.0-64.exe,
Visual Studio 2017 -> msvc-14.1-64.exe,
Visual Studio 2019 -> msvc-14.2-64.exe,
Visual Studio 2022 -> msvc-14.3-64.exe.
4. Run the following commands:
# if you have installed NVIDIA CUDA to a customized location, you should specify paths to OpenCL headers and library like the following:
1.7.3 Docker
1.8 Build CUDA Version
1.8.1 Linux
On Linux a CUDA version of LightGBM can be built using CUDA, CMake and gcc or Clang.
The following dependencies should be installed before compilation:
• CUDA 10.0 or later libraries. Please refer to this detailed guide. Pay great attention to the minimum required
versions of host compilers listed in the table from that guide and use only recommended versions of compilers.
• CMake 3.16 or later.
To build LightGBM CUDA version, run the following commands:
1.9 Build HDFS Version
1.9.1 Linux
On Linux an HDFS version of LightGBM can be built using CMake and gcc.
1. Install CMake.
2. Run the following commands:
# cmake \
#   -DUSE_HDFS=ON \
#   -DHDFS_LIB="/opt/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/lib64/libhdfs.so" \
#   -DHDFS_INCLUDE_DIR="/opt/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/include/" \
#   ..
make -j4
1.10 Build Java Wrapper
Using the following instructions you can generate a JAR file containing the LightGBM C API wrapped by SWIG.
1.10.1 Windows
On Windows a Java wrapper of LightGBM can be built using Java, SWIG, CMake and VS Build Tools or MinGW.
VS Build Tools
1. Install Git for Windows, CMake (3.8 or higher) and VS Build Tools (VS Build Tools is not needed if Visual
Studio (2015 or newer) is already installed).
2. Install SWIG and Java (also make sure that JAVA_HOME is set properly).
3. Run the following commands:
The .jar file will be in LightGBM/build folder and the .dll files will be in LightGBM/Release folder.
MinGW-w64
The .jar file will be in LightGBM/build folder and the .dll files will be in LightGBM/ folder.
Note: You may need to run the cmake -G "MinGW Makefiles" -DUSE_SWIG=ON .. command one more time if you
encounter the sh.exe was found in your PATH error.
It is recommended to use VS Build Tools (Visual Studio) since it has better multithreading efficiency in Windows for
many-core systems (see Question 4 and Question 8).
Also, you may want to read gcc Tips.
1.10.2 Linux
On Linux a Java wrapper of LightGBM can be built using Java, SWIG, CMake and gcc or Clang.
1. Install CMake, SWIG and Java (also make sure that JAVA_HOME is set properly).
2. Run the following commands:
1.10.3 macOS
On macOS a Java wrapper of LightGBM can be built using Java, SWIG, CMake and Apple Clang or gcc.
First, install SWIG and Java (also make sure that JAVA_HOME is set properly). Then, either follow the Apple Clang or
gcc installation instructions below.
Apple Clang
2. Install OpenMP:
gcc
2. Install gcc:
mkdir build
cd build
cmake -DUSE_SWIG=ON -DAPPLE_OUTPUT_DYLIB=ON ..
make -j4
1.11 Build C++ Unit Tests
1.11.1 Windows
On Windows, C++ unit tests of LightGBM can be built using CMake and VS Build Tools.
1. Install Git for Windows, CMake (3.8 or higher) and VS Build Tools (VS Build Tools is not needed if Visual
Studio (2015 or newer) is already installed).
1.11.2 Linux
On Linux, C++ unit tests of LightGBM can be built using CMake and gcc or Clang.
1. Install CMake.
2. Run the following commands:
1.11.3 macOS
On macOS, C++ unit tests of LightGBM can be built using CMake and Apple Clang or gcc.
Apple Clang
gcc
2. Install gcc:
mkdir build
cd build
cmake -DBUILD_CPP_TEST=ON -DUSE_OPENMP=OFF ..
make testlightgbm -j4
CHAPTER TWO: QUICK START
LightGBM supports input data files in CSV, TSV and LibSVM (zero-based) formats.
Files can be used with or without headers.
The label column can be specified either by index or by name.
Some columns can be ignored.
LightGBM can use categorical features directly (without one-hot encoding). The experiment on Expo data shows about
8x speed-up compared with one-hot encoding.
For the setting details, please refer to the categorical_feature parameter.
LightGBM also supports weighted training; it requires additional weight data. A ranking task additionally requires query data.
Weight and query data can also be specified as columns in the training data, in the same manner as the label.
Parameters can be set both in the config file and on the command line; parameters given on the command line have higher
priority than those in the config file. For example, passing num_trees=10 on the command line will keep that value and
ignore the same parameter in the config file.
2.4 Examples
• Binary Classification
• Regression
• Lambdarank
• Distributed Learning
CHAPTER THREE: PYTHON-PACKAGE INTRODUCTION
3.1 Install
3.2 Data Interface
To load a NumPy array into Dataset:
import numpy as np
import lightgbm as lgb

data = np.random.rand(500, 10)  # 500 entities, each containing 10 features
label = np.random.randint(2, size=500)  # binary target
train_data = lgb.Dataset(data, label=label)
To load a LibSVM (zero-based) text file or a LightGBM binary file into Dataset:
train_data = lgb.Dataset('train.svm.bin')
To load a scipy.sparse.csr_matrix array into Dataset:
import scipy.sparse

csr = scipy.sparse.csr_matrix((dat, (row, col)))  # dat, row and col are assumed to be existing array-likes
train_data = lgb.Dataset(csr)
To load data through a Sequence object (for example, batched reads from an HDF5 file):
import h5py

class HDFSequence(lgb.Sequence):
    def __init__(self, hdf_dataset, batch_size):
        self.data = hdf_dataset
        self.batch_size = batch_size

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)

f = h5py.File('train.hdf5', 'r')
train_data = lgb.Dataset(HDFSequence(f['X'], 8192), label=f['Y'][:])
Saving the Dataset into a LightGBM binary file will make loading faster:
train_data = lgb.Dataset('train.svm.txt')
train_data.save_binary('train.bin')
Create validation data:
validation_data = train_data.create_valid('validation.svm')
or
validation_data = lgb.Dataset('validation.svm', reference=train_data)
In LightGBM, the validation data should be aligned with the training data.
LightGBM can use categorical features as input directly. It doesn’t need to convert to one-hot encoding, and is much
faster than one-hot encoding (about 8x speed-up).
Note: You should convert your categorical features to int type before you construct Dataset.
Weights can be set when needed:
w = np.random.rand(500, )
train_data = lgb.Dataset(data, label=label, weight=w)
or
train_data = lgb.Dataset(data, label=label)
w = np.random.rand(500, )
train_data.set_weight(w)
You can use Dataset.set_init_score() to set initial scores, and Dataset.set_group() to set group/query
data for ranking tasks.
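For example, a minimal sketch of setting group and initial score on a ranking Dataset (the names and sizes below are made up for illustration; it assumes the imports shown earlier in this chapter):
rank_data = np.random.rand(70, 10)
rank_label = np.random.randint(4, size=70)  # relevance labels for 70 documents
rank_train = lgb.Dataset(rank_data, label=rank_label)
rank_train.set_group(np.array([10, 20, 40]))  # 3 queries containing 10, 20 and 40 documents
rank_train.set_init_score(np.zeros(70))  # optional: one initial score per document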
Memory efficient usage:
The Dataset object in LightGBM is very memory-efficient: it only needs to save the discrete bins. However,
NumPy/Array/pandas objects are memory-expensive. If you are concerned about your memory consumption, you can
save memory by the following steps (a short sketch follows the list):
1. Set free_raw_data=True (default is True) when constructing the Dataset
2. Explicitly set raw_data=None after the Dataset has been constructed
3. Call gc
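A minimal sketch of these three steps together (variable names are illustrative and the model is a throwaway binary classifier):
import gc

raw = np.random.rand(500, 10)
raw_label = np.random.randint(2, size=500)
ds = lgb.Dataset(raw, label=raw_label, free_raw_data=True)  # step 1: keep free_raw_data=True
booster = lgb.train({'objective': 'binary'}, ds, num_boost_round=5)  # the Dataset is actually constructed here
raw = None  # step 2: drop the reference to the raw data
gc.collect()  # step 3: let the garbage collector reclaim the memory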
3.4 Training
Training a model requires a parameter dict and a training Dataset:
param = {'num_leaves': 31, 'objective': 'binary', 'metric': 'auc'}
num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[validation_data])
After training, the model can be saved to a text file or dumped to JSON:
bst.save_model('model.txt')
json_model = bst.dump_model()
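A saved model can later be loaded back into a Booster; a minimal sketch using the standard constructor:
bst = lgb.Booster(model_file='model.txt')  # init model from the saved file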
3.5 CV
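As a minimal sketch, cross-validation can be run with the built-in helper, reusing the parameter dict from the training example above (5-fold CV here is illustrative):
cv_results = lgb.cv(param, train_data, num_round, nfold=5)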
3.6 Early Stopping
If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping
requires at least one set in valid_sets. If there is more than one, it will use all of them except the training data:
bst = lgb.train(param, train_data, num_round, valid_sets=[validation_data], callbacks=[lgb.early_stopping(stopping_rounds=5)])
bst.save_model('model.txt', num_iteration=bst.best_iteration)
The model will train until the validation score stops improving. Validation score needs to improve at least every
stopping_rounds to continue training.
The index of iteration that has the best performance will be saved in the best_iteration field if early stopping logic
is enabled by setting early_stopping callback. Note that train() will return a model from the best iteration.
This works with both metrics to minimize (L2, log loss, etc.) and to maximize (NDCG, AUC, etc.). Note that if you
specify more than one evaluation metric, all of them will be used for early stopping. However, you can change this
behavior and make LightGBM check only the first metric for early stopping by passing first_metric_only=True
in early_stopping callback constructor.
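As a concrete sketch of that option, reusing the objects defined earlier in this chapter (the stopping_rounds value is illustrative):
bst = lgb.train(
    param,
    train_data,
    num_round,
    valid_sets=[validation_data],
    callbacks=[lgb.early_stopping(stopping_rounds=10, first_metric_only=True)],
)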
3.7 Prediction
A model that has been trained or loaded can perform predictions on datasets:
If early stopping is enabled during training, you can get predictions from the best iteration with bst.best_iteration:
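A short sketch of both calls, using made-up test data:
data_test = np.random.rand(7, 10)  # 7 rows, 10 features each
ypred = bst.predict(data_test)  # predict with the full model
ypred_best = bst.predict(data_test, num_iteration=bst.best_iteration)  # predict with the best iteration only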
CHAPTER FOUR: FEATURES
This is a conceptual overview of how LightGBM works[1]. We assume familiarity with decision tree boosting algo-
rithms to focus instead on aspects of LightGBM that may differ from other boosting packages. For detailed algorithms,
please refer to the citations or source code.
Many boosting tools use pre-sort-based algorithms[2, 3] (e.g. default algorithm in xgboost) for decision tree learning.
It is a simple solution, but not easy to optimize.
LightGBM uses histogram-based algorithms[4, 5, 6], which bucket continuous feature (attribute) values into discrete
bins. This speeds up training and reduces memory usage. Advantages of histogram-based algorithms include the
following:
• Reduced cost of calculating the gain for each split
– Pre-sort-based algorithms have time complexity O(#data)
– Computing the histogram has time complexity O(#data), but this involves only a fast sum-up operation.
Once the histogram is constructed, a histogram-based algorithm has time complexity O(#bins), and #bins
is far smaller than #data.
• Use histogram subtraction for further speedup
– To get one leaf’s histograms in a binary tree, use the histogram subtraction of its parent and its neighbor
– So it needs to construct histograms for only one leaf (with smaller #data than its neighbor). It then can get
histograms of its neighbor by histogram subtraction with small cost (O(#bins))
• Reduce memory usage
– Replaces continuous values with discrete bins. If #bins is small, can use small data type, e.g. uint8_t, to
store training data
– No need to store additional information for pre-sorting feature values
• Reduce communication cost for distributed learning
Most decision tree learning algorithms grow trees level-wise (depth-wise).
LightGBM grows trees leaf-wise (best-first)[7]. It will choose the leaf with max delta loss to grow. Holding #leaf
fixed, leaf-wise algorithms tend to achieve lower loss than level-wise algorithms.
Leaf-wise may cause over-fitting when #data is small, so LightGBM includes the max_depth parameter to limit tree
depth. However, trees still grow leaf-wise even when max_depth is specified.
It is common to represent categorical features with one-hot encoding, but this approach is suboptimal for tree learners.
Particularly for high-cardinality categorical features, a tree built on one-hot features tends to be unbalanced and needs
to grow very deep to achieve good accuracy.
Instead of one-hot encoding, the optimal solution is to split on a categorical feature by partitioning its categories into
2 subsets. If the feature has k categories, there are 2^(k-1) - 1 possible partitions. But there is an efficient solution
for regression trees[8]. It needs about O(k * log(k)) to find the optimal partition.
The basic idea is to sort the categories according to the training objective at each split. More specifically, LightGBM
sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian)
and then finds the best split on the sorted histogram.
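The idea can be illustrated with a small, self-contained sketch. This is not LightGBM's implementation; it is a toy version of the same principle: order the categories by their accumulated gradient statistics, then scan prefix partitions of that order instead of enumerating all 2^(k-1) - 1 subsets.
import numpy as np

def best_categorical_split(categories, gradients, hessians, lam=1.0):
    def leaf_score(g_sum, h_sum):
        return g_sum * g_sum / (h_sum + lam)  # standard second-order gain term

    cats = np.unique(categories)
    g = np.array([gradients[categories == c].sum() for c in cats])
    h = np.array([hessians[categories == c].sum() for c in cats])
    order = np.argsort(g / (h + lam))  # sort categories by accumulated values
    g, h, cats = g[order], h[order], cats[order]
    g_total, h_total = g.sum(), h.sum()

    best_gain, best_k, g_left, h_left = -np.inf, 0, 0.0, 0.0
    for k in range(len(cats) - 1):  # prefix -> left child, suffix -> right child
        g_left, h_left = g_left + g[k], h_left + h[k]
        gain = (leaf_score(g_left, h_left)
                + leaf_score(g_total - g_left, h_total - h_left)
                - leaf_score(g_total, h_total))
        if gain > best_gain:
            best_gain, best_k = gain, k
    return set(cats[:best_k + 1]), best_gain  # categories routed to the left child

# usage on made-up data
rng = np.random.default_rng(0)
cat_feature = rng.integers(0, 5, size=100)
grad = rng.normal(size=100)
hess = np.ones(100)
left_categories, gain = best_categorical_split(cat_feature, grad, hess)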
Distributed learning in LightGBM only needs a few collective communication algorithms, such as "All reduce", "All gather"
and "Reduce scatter". LightGBM implements state-of-the-art algorithms[9]. These collective communication algorithms
can provide much better performance than point-to-point communication.
Feature Parallel
Traditional Algorithm
Feature parallel aims to parallelize the "Find Best Split" step in decision tree learning. The procedure of traditional feature
parallel is:
1. Partition data vertically (different machines have different feature set).
2. Workers find local best split point {feature, threshold} on local feature set.
3. Communicate local best splits with each other and get the best one.
4. The worker with the best split performs the split, then sends the split result to the other workers.
5. The other workers split the data according to the received result.
The shortcomings of traditional feature parallel:
• Has computation overhead, since it cannot speed up "split", whose time complexity is O(#data). Thus, feature
parallel cannot speed up well when #data is large.
• Needs communication of the split result, which costs about O(#data / 8) (one bit per data point).
Feature Parallel in LightGBM
Since feature parallel cannot speed up well when #data is large, we make a little change: instead of partitioning the data
vertically, every worker holds the full data. Thus, LightGBM doesn't need to communicate the split result, since every
worker knows how to split the data. And #data won't be larger, so it is reasonable to hold the full data on every machine.
The procedure of feature parallel in LightGBM:
1. Workers find local best split point {feature, threshold} on local feature set.
2. Communicate local best splits with each other and get the best one.
3. Perform best split.
However, this feature parallel algorithm still suffers from computation overhead for “split” when #data is large. So it
will be better to use data parallel when #data is large.
Data Parallel
Traditional Algorithm
Data parallel aims to parallelize the whole decision tree learning. The procedure of data parallel is:
1. Partition data horizontally.
2. Workers use local data to construct local histograms.
3. Merge global histograms from all local histograms.
4. Find best split from merged global histograms, then perform splits.
The shortcomings of traditional data parallel:
• High communication cost. If using point-to-point communication algorithm, communication cost for one ma-
chine is about O(#machine * #feature * #bin). If using collective communication algorithm (e.g. “All
Reduce”), communication cost is about O(2 * #feature * #bin) (check cost of “All Reduce” in chapter 4.5
at [9]).
Voting Parallel
Voting parallel further reduces the communication cost of data parallel to a constant cost. It uses two-stage voting to
reduce the communication cost of feature histograms[10].
GPU Support
Thanks to @huanzhang12 for contributing this feature. Please read [11] to get more details.
• GPU Installation
• GPU Tutorial
• Tweedie
For more details, please refer to Parameters.
4.9 References
[1] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu. “LightGBM:
A Highly Efficient Gradient Boosting Decision Tree.” Advances in Neural Information Processing Systems 30 (NIPS
2017), pp. 3149-3157.
[2] Mehta, Manish, Rakesh Agrawal, and Jorma Rissanen. “SLIQ: A fast scalable classifier for data mining.” Interna-
tional Conference on Extending Database Technology. Springer Berlin Heidelberg, 1996.
[3] Shafer, John, Rakesh Agrawal, and Manish Mehta. “SPRINT: A scalable parallel classifier for data mining.” Proc.
1996 Int. Conf. Very Large Data Bases. 1996.
[4] Ranka, Sanjay, and V. Singh. “CLOUDS: A decision tree classifier for large datasets.” Proceedings of the 4th
Knowledge Discovery and Data Mining Conference. 1998.
[5] Machado, F. P. “Communication and memory efficient parallel decision tree construction.” (2003).
[6] Li, Ping, Qiang Wu, and Christopher J. Burges. “Mcrank: Learning to rank using multiple classification and
gradient boosting.” Advances in Neural Information Processing Systems 20 (NIPS 2007).
[7] Shi, Haijian. “Best-first decision tree learning.” Diss. The University of Waikato, 2007.
[8] Walter D. Fisher. “On Grouping for Maximum Homogeneity.” Journal of the American Statistical Association.
Vol. 53, No. 284 (Dec., 1958), pp. 789-798.
[9] Thakur, Rajeev, Rolf Rabenseifner, and William Gropp. “Optimization of collective communication operations in
MPICH.” International Journal of High Performance Computing Applications 19.1 (2005), pp. 49-66.
[10] Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Tie-Yan Liu. “A Communication-
Efficient Parallel Algorithm for Decision Tree.” Advances in Neural Information Processing Systems 29 (NIPS 2016),
pp. 1279-1287.
[11] Huan Zhang, Si Si and Cho-Jui Hsieh. “GPU Acceleration for Large-scale Tree Boosting.” SysML Conference,
2018.
CHAPTER FIVE: EXPERIMENTS
For the detailed experiment scripts and output logs, please refer to this repo.
5.1 Comparison Experiment
5.1.1 History
08 Mar, 2020: updated according to the latest master branch (1b97eaf for XGBoost, bcad692 for LightGBM).
(xgboost_exact is not updated because it is too slow.)
27 Feb, 2017: first version.
5.1.2 Data
We used 5 datasets to conduct our comparison experiments. Details of data are listed in the following table:
5.1.3 Environment
We ran all experiments on a single Linux server (Azure ND24s) with the following specifications:
OS: Ubuntu 16.04 LTS
CPU: 2 * E5-2690 v4
Memory: 448GB
5.1.4 Baseline
5.1.5 Settings
We set up a total of 3 settings for the experiments. The parameters of these settings are:
1. xgboost:
eta = 0.1
max_depth = 8
num_round = 500
nthread = 16
tree_method = exact
min_child_weight = 100
2. xgboost_hist (using the histogram-based algorithm):
eta = 0.1
num_round = 500
nthread = 16
min_child_weight = 100
tree_method = hist
grow_policy = lossguide
max_depth = 0
max_leaves = 255
3. LightGBM:
learning_rate = 0.1
num_leaves = 255
num_trees = 500
num_threads = 16
min_data_in_leaf = 0
min_sum_hessian_in_leaf = 100
xgboost grows trees depth-wise and controls model complexity by max_depth. LightGBM uses a leaf-wise algorithm
instead and controls model complexity by num_leaves. So we cannot compare them in the exact same model setting.
As a tradeoff, we use xgboost with max_depth=8, which limits the maximum number of leaves to 255, to compare with
LightGBM with num_leaves=255.
Other parameters are default values.
5.1.6 Result
Speed
We compared speed using only the training task, without any test or metric output, and we did not count the time for
IO. For the ranking tasks, since XGBoost and LightGBM implement different ranking objective functions, we used
the regression objective for the speed benchmark to keep the comparison fair.
The following table is the comparison of time cost:
Accuracy
Memory Consumption
We monitored RES while running the training task. We set two_round=true in LightGBM (this increases data-loading
time and reduces peak memory usage, but does not affect training speed or accuracy) to reduce peak memory usage.
5.2 Parallel Experiment
5.2.1 History
5.2.2 Data
We used a terabyte click log dataset to conduct parallel experiments. Details are listed in the following table:
This data contains 13 integer features and 26 categorical features for 24 days of click logs. We computed click-through
rate (CTR) and count statistics for these 26 categorical features from the first ten days. We then used the next ten days'
data, after replacing the categorical features with the corresponding CTR and count, as training data. The processed
training data contain a total of 1.7 billion records and 67 features.
5.2.3 Environment
5.2.4 Settings
learning_rate = 0.1
num_leaves = 255
num_trees = 100
num_thread = 16
tree_learner = data
We used data parallel here because this data is large in #data but small in #feature. Other parameters were default
values.
5.2.5 Results
The results show that LightGBM achieves a linear speedup with distributed learning.
CHAPTER SIX: PARAMETERS
The parameters format is key1=value1 key2=value2 .... Parameters can be set both in the config file and on the command
line. On the command line, parameters should not have spaces before and after =. In config files, one line can contain
only one parameter, and you can use # for comments.
If one parameter appears in both command line and config file, LightGBM will use the parameter from the command
line.
For the Python and R packages, any parameters that accept a list of values (usually they have multi-xxx type,
e.g. multi-int or multi-double) can be specified in those languages’ default array types. For example,
monotone_constraints can be specified as follows.
Python
params = {
"monotone_constraints": [-1, 0, 1]
}
∗ cross_entropy, objective function for cross-entropy (with optional linear weights), aliases:
xentropy
∗ cross_entropy_lambda, alternative parameterization of cross-entropy, aliases: xentlambda
∗ label is anything in interval [0, 1]
– ranking application
∗ lambdarank, lambdarank objective. label_gain can be used to set the gain (weight) of an int label, and
all values in label must be smaller than the number of elements in label_gain
∗ rank_xendcg, XE_NDCG_MART ranking objective function, aliases: xendcg, xe_ndcg,
xe_ndcg_mart, xendcg_mart
∗ rank_xendcg is faster than lambdarank and achieves similar performance
∗ label should be int type, and a larger number represents higher relevance (e.g. 0:bad, 1:fair, 2:good,
3:perfect)
• boosting , default = gbdt, type = enum, options: gbdt, rf, dart, aliases: boosting_type, boost
– gbdt, traditional Gradient Boosting Decision Tree, aliases: gbrt
– rf, Random Forest, aliases: random_forest
– dart, Dropouts meet Multiple Additive Regression Trees
∗ Note: internally, LightGBM uses gbdt mode for the first 1 / learning_rate iterations
• data_sample_strategy , default = bagging, type = enum, options: bagging, goss
– bagging, Randomly Bagging Sampling
∗ Note: bagging is only effective when bagging_freq > 0 and bagging_fraction < 1.0
– goss, Gradient-based One-Side Sampling
– New in 4.0.0
• data , default = "", type = string, aliases: train, train_data, train_data_file, data_filename
– path of training data, LightGBM will train from this data
– Note: can be used only in CLI version
• valid , default = "", type = string, aliases: test, valid_data, valid_data_file, test_data,
test_data_file, valid_filenames
– path(s) of validation/test data, LightGBM will output metrics for these data
– support multiple validation data, separated by ,
– Note: can be used only in CLI version
• num_iterations , default = 100, type = int, aliases: num_iteration, n_iter, num_tree, num_trees,
num_round, num_rounds, nrounds, num_boost_round, n_estimators, max_iter, constraints:
num_iterations >= 0
– number of boosting iterations
– Note: internally, LightGBM constructs num_class * num_iterations trees for multi-class classifica-
tion problems
• learning_rate , default = 0.1, type = double, aliases: shrinkage_rate, eta, constraints: learning_rate
> 0.0
– shrinkage rate
– this seed has lower priority in comparison with other seeds, which means that it will be overridden if you
set other seeds explicitly
• deterministic , default = false, type = bool
– used only with cpu device type
– setting this to true should ensure stable results when using the same data and the same parameters (and
different num_threads)
– when you use different seeds, different LightGBM versions, binaries compiled by different compilers, or
different systems, the results are expected to be different
– you can raise issues in the LightGBM GitHub repo when you encounter unstable results
– Note: setting this to true may slow down training
– Note: to avoid potential instability due to numerical issues, please set force_col_wise=true or
force_row_wise=true when setting deterministic=true
∗ intermediate, a more advanced method, which may slow the library very slightly. However, this
method is much less constraining than the basic method and should significantly improve the results
∗ advanced, an even more advanced method, which may slow the library. However, this method is even
less constraining than the intermediate method and should again significantly improve the results
• monotone_penalty , default = 0.0, type = double, aliases: monotone_splits_penalty, ms_penalty,
mc_penalty, constraints: monotone_penalty >= 0.0
– used only if monotone_constraints is set
– monotone penalty: a penalization parameter X forbids any monotone splits on the first X (rounded down)
level(s) of the tree. The penalty applied to monotone splits on a given depth is a continuous, increasing
function of the penalization parameter
– if 0.0 (the default), no penalization is applied
• feature_contri , default = None, type = multi-double, aliases: feature_contrib, fc, fp,
feature_penalty
– used to control feature’s split gain, will use gain[i] = max(0, feature_contri[i]) * gain[i] to
replace the split gain of i-th feature
– you need to specify all features in order
• forcedsplits_filename , default = "", type = string, aliases: fs, forced_splits_filename,
forced_splits_file, forced_splits
– path to a .json file that specifies splits to force at the top of every decision tree before best-first learning
commences
– .json file can be arbitrarily nested, and each split contains feature, threshold fields, as well as left
and right fields representing subsplits
– categorical splits are forced in a one-hot fashion, with left representing the split containing the feature
value and right representing other values
– Note: the forced split logic will be ignored, if the split makes gain worse
– see this file as an example
• refit_decay_rate , default = 0.9, type = double, constraints: 0.0 <= refit_decay_rate <= 1.0
– decay rate of refit task, will use leaf_output = refit_decay_rate * old_leaf_output + (1.0
- refit_decay_rate) * new_leaf_output to refit trees
– used only in refit task in CLI version or as argument in refit function in language-specific package
• cegb_tradeoff , default = 1.0, type = double, constraints: cegb_tradeoff >= 0.0
– cost-effective gradient boosting multiplier for all penalties
• cegb_penalty_split , default = 0.0, type = double, constraints: cegb_penalty_split >= 0.0
– cost-effective gradient-boosting penalty for splitting a node
• cegb_penalty_feature_lazy , default = 0,0,...,0, type = multi-double
– cost-effective gradient boosting penalty for using a feature
– applied per data point
• cegb_penalty_feature_coupled , default = 0,0,...,0, type = multi-double
– cost-effective gradient boosting penalty for using a feature
– applied once per forest
6.4 IO Parameters
∗ it is recommended to rescale data before training so that features have similar mean and standard
deviation
∗ Note: only works with CPU and serial tree learner
∗ Note: regression_l1 objective is not supported with linear tree boosting
∗ Note: setting linear_tree=true significantly increases the memory use of LightGBM
∗ Note: if you specify monotone_constraints, constraints will be enforced when choosing the split
points, but not when fitting the linear models on leaves
• max_bin , default = 255, type = int, aliases: max_bins, constraints: max_bin > 1
– max number of bins that feature values will be bucketed in
– small number of bins may reduce training accuracy but may increase general power (deal with over-fitting)
– LightGBM will auto compress memory according to max_bin. For example, LightGBM will use uint8_t
for feature value if max_bin=255
• max_bin_by_feature , default = None, type = multi-int
– max number of bins for each feature
– if not specified, will use max_bin for all features
• min_data_in_bin , default = 3, type = int, constraints: min_data_in_bin > 0
– minimal number of data inside one bin
– use this to avoid one-data-one-bin (potential over-fitting)
• bin_construct_sample_cnt , default = 200000, type = int, aliases: subsample_for_bin, constraints:
bin_construct_sample_cnt > 0
– number of sampled data points used to construct feature discrete bins
– setting this to a larger value will give a better training result, but may increase data loading time
– set this to a larger value if the data is very sparse
– Note: don't set this to small values; otherwise, you may encounter unexpected errors and poor accuracy
• data_random_seed , default = 1, type = int, aliases: data_seed
– random seed for sampling data to construct histogram bins
• is_enable_sparse , default = true, type = bool, aliases: is_sparse, enable_sparse, sparse
– used to enable/disable sparse optimization
• enable_bundle , default = true, type = bool, aliases: is_enable_bundle, bundle
– set this to false to disable Exclusive Feature Bundling (EFB), which is described in LightGBM: A Highly
Efficient Gradient Boosting Decision Tree
– Note: disabling this may cause slow training speed for sparse datasets
• use_missing , default = true, type = bool
– set this to false to disable the special handling of missing values
• zero_as_missing , default = false, type = bool
– set this to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse
matrices)
– set this to false to use na for representing missing values
– Note: data should be grouped by query_id, for more information, see Query Data
– Note: index starts from 0 and it doesn’t count the label column when passing type is int, e.g. when label
is column_0 and query_id is column_1, the correct parameter is query=0
• ignore_column , default = "", type = multi-int or string, aliases: ignore_feature, blacklist
– used to specify some ignoring columns in training
– use number for index, e.g. ignore_column=0,1,2 means column_0, column_1 and column_2 will be
ignored
– add a prefix name: for column name, e.g. ignore_column=name:c1,c2,c3 means c1, c2 and c3 will be
ignored
– Note: works only in case of loading data directly from text file
– Note: index starts from 0 and it doesn’t count the label column when passing type is int
– Note: despite the fact that specified columns will be completely ignored during training, they should still
have a valid format so that LightGBM can load the file successfully
• categorical_feature , default = "", type = multi-int or string, aliases: cat_feature,
categorical_column, cat_column, categorical_features
– used to specify categorical features
– use number for index, e.g. categorical_feature=0,1,2 means column_0, column_1 and column_2
are categorical features
– add a prefix name: for column name, e.g. categorical_feature=name:c1,c2,c3 means c1, c2 and c3
are categorical features
– Note: all values will be cast to int32 (integer codes will be extracted from pandas categoricals in the
Python-package)
– Note: index starts from 0 and it doesn’t count the label column when passing type is int
– Note: all values should be less than Int32.MaxValue (2147483647)
– Note: using large values could be memory consuming. The tree decision rule works best when categorical
features are represented by consecutive integers starting from zero
– Note: all negative values will be treated as missing values
– Note: the output cannot be monotonically constrained with respect to a categorical feature
– Note: floating point numbers in categorical features will be rounded towards 0
• forcedbins_filename , default = "", type = string
– path to a .json file that specifies bin upper bounds for some or all features
– .json file should contain an array of objects, each containing the word feature (integer feature index)
and bin_upper_bound (array of thresholds for binning)
– see this file as an example
• save_binary , default = false, type = bool, aliases: is_save_binary, is_save_binary_file
– if true, LightGBM will save the dataset (including validation data) to a binary file. This speeds up data
loading the next time
– Note: init_score is not saved in binary file
– Note: can be used only in CLI version; for language-specific packages you can use the correspondent
function
∗ more precisely, the error on a sample is 0 if there are at least num_classes - multi_error_top_k
predictions strictly less than the prediction on the true class
– when multi_error_top_k=1 this is equivalent to the usual multi-error metric
• auc_mu_weights , default = None, type = multi-double
– used only with auc_mu metric
– list representing flattened matrix (in row-major order) giving loss weights for classification errors
– list should have n * n elements, where n is the number of classes
– the matrix co-ordinate [i, j] should correspond to the i * n + j-th element of the list
– if not specified, will use equal weights for all classes
6.9 Others
LightGBM supports continued training with initial scores. It uses an additional file to store these initial scores, like the
following:
0.5
-0.1
0.9
...
It means the initial score of the first data row is 0.5, the second is -0.1, and so on. The initial score file corresponds
with the data file line by line, and has one score per line.
If the name of the data file is train.txt, the initial score file should be named train.txt.init and placed in
the same folder as the data file. In this case, LightGBM will automatically load the initial score file if it exists.
If binary data files exist for raw data file train.txt, for example in the name train.txt.bin, then the initial score
file should be named as train.txt.bin.init.
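For the Python package, the same information can instead be passed directly when constructing the Dataset; a minimal sketch with placeholder arrays:
import numpy as np
import lightgbm as lgb

data = np.random.rand(500, 10)  # placeholder training features
label = np.random.randint(2, size=500)  # placeholder labels
init_score = np.full(500, 0.5)  # one initial score per training row
train_data = lgb.Dataset(data, label=label, init_score=init_score)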
LightGBM supports weighted training. It uses an additional file to store weight data, like the following:
1.0
0.5
0.8
...
It means the weight of the first data row is 1.0, the second is 0.5, and so on. Weights should be non-negative.
The weight file corresponds with the data file line by line, and has one weight per line.
If the name of the data file is train.txt, the weight file should be named train.txt.weight and placed in the
same folder as the data file. In this case, LightGBM will load the weight file automatically if it exists.
Also, you can include a weight column in your data file. Please refer to the weight_column parameter above.
For learning-to-rank tasks, LightGBM needs query information for the training data, stored in an additional file like the following:
27
18
67
...
For wrapper libraries like in Python and R, this information can also be provided as an array-like via the Dataset
parameter group.
For example, if you have a 112-document dataset with group = [27, 18, 67], that means that you have 3 groups,
where the first 27 records are in the first group, records 28-45 are in the second group, and records 46-112 are in the
third group.
Note: data should be ordered by the query.
If the name of the data file is train.txt, the query file should be named train.txt.query and placed in the same
folder as the data file. In this case, LightGBM will load the query file automatically if it exists.
Also, you can include a query/group id column in your data file. Please refer to the group_column parameter above.
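For the Python package, the same 112-document example can be expressed through the group parameter of Dataset; a minimal sketch with a random placeholder feature matrix:
import numpy as np
import lightgbm as lgb

X = np.random.rand(112, 10)
y = np.random.randint(4, size=112)  # relevance labels
train_data = lgb.Dataset(X, label=y, group=[27, 18, 67])  # rows 1-27, 28-45 and 46-112 form the 3 queries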
CHAPTER SEVEN: PARAMETERS TUNING
LightGBM uses the leaf-wise tree growth algorithm, while many other popular tools use depth-wise tree growth. Compared
with depth-wise growth, the leaf-wise algorithm can converge much faster. However, leaf-wise growth may over-fit if not
used with the appropriate parameters.
To get good results using a leaf-wise tree, these are some important parameters (a short illustrative sketch follows this list):
1. num_leaves. This is the main parameter to control the complexity of the tree model. Theoretically, we can
set num_leaves = 2^(max_depth) to obtain the same number of leaves as depth-wise tree. However, this
simple conversion is not good in practice. The reason is that a leaf-wise tree is typically much deeper than a
depth-wise tree for a fixed number of leaves. Unconstrained depth can induce over-fitting. Thus, when trying to
tune the num_leaves, we should let it be smaller than 2^(max_depth). For example, when the max_depth=7
the depth-wise tree can get good accuracy, but setting num_leaves to 127 may cause over-fitting, and setting it
to 70 or 80 may get better accuracy than depth-wise.
2. min_data_in_leaf. This is a very important parameter to prevent over-fitting in a leaf-wise tree. Its optimal
value depends on the number of training samples and num_leaves. Setting it to a large value can avoid growing
too deep a tree, but may cause under-fitting. In practice, setting it to hundreds or thousands is enough for a large
dataset.
3. max_depth. You also can use max_depth to limit the tree depth explicitly.
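A hedged starting-point sketch that combines these three parameters (the concrete values are only illustrative, not tuned recommendations):
params = {
    "objective": "binary",
    "max_depth": 7,  # explicit depth limit
    "num_leaves": 80,  # kept well below 2^(max_depth) = 128, per the guidance above
    "min_data_in_leaf": 1000,  # scale this with the size of the training set
}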
On systems where it is available, LightGBM uses OpenMP to parallelize many operations. The maximum number
of threads used by LightGBM is controlled by the parameter num_threads. By default, this will defer to the default
behavior of OpenMP (one thread per real CPU core or the value in environment variable OMP_NUM_THREADS, if it is
set). For best performance, set this to the number of real CPU cores available.
You might be able to achieve faster training by moving to a machine with more available CPU cores.
Using distributed (multi-machine) training might also reduce training time. See the Distributed Learning Guide for
details.
You might find that training is faster using a GPU-enabled build of LightGBM. See the GPU Tutorial for details.
The total training time for LightGBM increases with the total number of tree nodes added. LightGBM comes with
several parameters that can be used to control the number of nodes per tree.
The suggestions below will speed up training, but might hurt training accuracy.
Decrease max_depth
This parameter is an integer that controls the maximum distance between the root node of each tree and a leaf node.
Decrease max_depth to reduce training time.
Decrease num_leaves
LightGBM adds nodes to trees based on the gain from adding that node, regardless of depth. This figure from the
feature documentation illustrates the process.
Because of this growth strategy, it isn’t straightforward to use max_depth alone to limit the complexity of trees. The
num_leaves parameter sets the maximum number of nodes per tree. Decrease num_leaves to reduce training time.
Increase min_gain_to_split
When adding a new tree node, LightGBM chooses the split point that has the largest gain. Gain is basically the re-
duction in training loss that results from adding a split point. By default, LightGBM sets min_gain_to_split to
0.0, which means “there is no improvement that is too small”. However, in practice you might find that very small
improvements in the training loss don’t have a meaningful impact on the generalization error of the model. Increase
min_gain_to_split to reduce training time.
Depending on the size of the training data and the distribution of features, it's possible for LightGBM to add tree nodes
that only describe a small number of observations. In the most extreme case, consider a tree node into which only a
single observation from the training data falls. This is very unlikely to generalize well and is probably a sign of
overfitting.
This can be prevented indirectly with parameters like max_depth and num_leaves, but LightGBM also offers parameters
to help you directly avoid adding these overly-specific tree nodes; a short sketch follows the list below.
• min_data_in_leaf: Minimum number of observations that must fall into a tree node for it to be added.
• min_sum_hessian_in_leaf: Minimum sum of the Hessian (second derivative of the objective function eval-
uated for each observation) for observations in a leaf. For some regression objectives, this is just the minimum
number of records that have to fall into each node. For classification objectives, it represents a sum over a distri-
bution of probabilities. See this Stack Overflow answer for a good description of how to reason about values of
this parameter.
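A short sketch combining these node-level controls (all values are placeholders for illustration):
params = {
    "objective": "regression",
    "num_leaves": 31,
    "min_data_in_leaf": 100,  # each leaf must cover at least 100 observations
    "min_sum_hessian_in_leaf": 10.0,  # and carry at least this much total Hessian
}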
Decrease num_iterations
The num_iterations parameter controls the number of boosting rounds that will be performed. Since LightGBM
uses decision trees as the learners, this can also be thought of as “number of trees”.
If you try changing num_iterations, change the learning_rate as well. learning_rate will not have any impact
on training time, but it will impact the training accuracy. As a general rule, if you reduce num_iterations, you should
increase learning_rate.
Choosing the right value of num_iterations and learning_rate is highly dependent on the data and objective, so
these parameters are often chosen from a set of possible values through hyperparameter tuning.
Decrease num_iterations to reduce training time.
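For instance, a hedged sketch of the trade-off (neither configuration is a tuned recommendation):
many_small_steps = {"num_iterations": 1000, "learning_rate": 0.05}
few_large_steps = {"num_iterations": 250, "learning_rate": 0.2}  # fewer rounds, larger learning rate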
If early stopping is enabled, after each boosting round the model’s training accuracy is evaluated against a validation
set that contains data not available to the training process. That accuracy is then compared to the accuracy as of the
previous boosting round. If the model’s accuracy fails to improve for some number of consecutive rounds, LightGBM
stops the training process.
That “number of consecutive rounds” is controlled by the parameter early_stopping_round. For example,
early_stopping_round=1 says “the first time accuracy on the validation set does not improve, stop training”.
Set early_stopping_round and provide a validation set to possibly reduce training time.
The parameters described in previous sections control how many trees are constructed and how many nodes are con-
structed per tree. Training time can be further reduced by reducing the amount of time needed to add a tree node to the
model.
The suggestions below will speed up training, but might hurt training accuracy.
By default, when a LightGBM Dataset object is constructed, some features will be filtered out based on the value of
min_data_in_leaf.
For a simple example, consider a 1000-observation dataset with a feature called feature_1. feature_1 takes on only
two values: 25.0 (995 observations) and 50.0 (5 observations). If min_data_in_leaf = 10, there is no split for this
feature that results in a valid split, since at least one of the leaf nodes would contain only 5 observations.
Instead of reconsidering this feature and then ignoring it every iteration, LightGBM filters this feature out before
training, when the Dataset is constructed.
If this default behavior has been overridden by setting feature_pre_filter=False, set
feature_pre_filter=True to reduce training time.
LightGBM training buckets continuous features into discrete bins to improve training speed and reduce memory re-
quirements for training. This binning is done one time during Dataset construction. The number of splits considered
when adding a node is O(#feature * #bin), so reducing the number of bins per feature can reduce the number of
splits that need to be evaluated.
max_bin controls the maximum number of bins that features will be bucketed into. It is also possible to set this maximum
feature-by-feature, by passing max_bin_by_feature.
Reduce max_bin or max_bin_by_feature to reduce training time.
Some bins might contain a small number of observations, which might mean that the effort of evaluating that bin’s
boundaries as possible split points isn’t likely to change the final model very much. You can control the granularity of
the bins by setting min_data_in_bin.
Increase min_data_in_bin to reduce training time.
Decrease feature_fraction
By default, LightGBM considers all features in a Dataset during the training process. This behavior can be changed
by setting feature_fraction to a value > 0 and <= 1.0. Setting feature_fraction to 0.5, for example, tells
LightGBM to randomly select 50% of features at the beginning of constructing each tree. This reduces the total number
of splits that have to be evaluated to add each tree node.
Decrease feature_fraction to reduce training time.
Decrease max_cat_threshold
LightGBM uses a custom approach for finding optimal splits for categorical features. In this process, LightGBM
explores splits that break a categorical feature into two groups. These are sometimes called “k-vs.-rest” splits. Higher
max_cat_threshold values correspond to more split points and larger possible group sizes to search.
Decrease max_cat_threshold to reduce training time.
Use Bagging
By default, LightGBM uses all observations in the training data for each iteration. It is possible to instead tell LightGBM
to randomly sample the training data. This process of training over multiple random samples without replacement is
called “bagging”.
Set bagging_freq to an integer greater than 0 to control how often a new sample is drawn. Set bagging_fraction
to a value > 0.0 and < 1.0 to control the size of the sample. For example, {"bagging_freq": 5,
"bagging_fraction": 0.75} tells LightGBM “re-sample without replacement every 5 iterations, and draw sam-
ples of 75% of the training data”.
Decrease bagging_fraction to reduce training time.
This only applies to the LightGBM CLI. If you pass the parameter save_binary, the training dataset and all validation
sets will be saved in a binary format understood by LightGBM. This can speed up training next time, because binning
and other work done when constructing a Dataset does not have to be re-done.
CHAPTER EIGHT: C API
Copyright
Copyright (c) 2016 Microsoft Corporation. All rights reserved. Licensed under the MIT License. See LICENSE
file in the project root for license information.
Note: To avoid type conversion on large data, most of our exposed interfaces support both float32 and float64,
except the following:
1. gradient and Hessian;
2. current score for training and validation data.
The reason is that they are called frequently, and type conversion on them may be time-consuming.
Defines
C_API_DTYPE_FLOAT32 (0)
float32 (single precision float).
C_API_DTYPE_FLOAT64 (1)
float64 (double precision float).
C_API_DTYPE_INT32 (2)
int32.
C_API_DTYPE_INT64 (3)
int64.
C_API_FEATURE_IMPORTANCE_GAIN (1)
Gain type of feature importance.
C_API_FEATURE_IMPORTANCE_SPLIT (0)
Split type of feature importance.
C_API_MATRIX_TYPE_CSC (1)
CSC sparse matrix type.
C_API_MATRIX_TYPE_CSR (0)
CSR sparse matrix type.
C_API_PREDICT_CONTRIB (3)
Predict feature contributions (SHAP values).
C_API_PREDICT_LEAF_INDEX (2)
Predict leaf index.
C_API_PREDICT_NORMAL (0)
Normal prediction, with transform (if needed).
C_API_PREDICT_RAW_SCORE (1)
Predict raw score.
INLINE_FUNCTION inline
Inline specifier.
THREAD_LOCAL thread_local
Thread local specifier.
Typedefs
Functions
Note:
a. You should call LGBM_BoosterGetEvalNames first to get the names of evaluation metrics.
b. You should pre-allocate memory for out_results, you can get its length by
LGBM_BoosterGetEvalCounts.
Parameters
• handle – Handle of booster
• data_idx – Index of data, 0: training data, 1: 1st validation data, 2: 2nd validation data and
so on
• out_len – [out] Length of output result
• out_results – [out] Array with evaluation results
Returns
0 when succeed, -1 when failure happens
LIGHTGBM_C_EXPORT int LGBM_BoosterGetNumFeature(BoosterHandle handle, int *out_len)
Get number of features.
Parameters
• handle – Handle of booster
• out_len – [out] Total number of features
Returns
0 when succeed, -1 when failure happens
LIGHTGBM_C_EXPORT int LGBM_BoosterGetNumPredict(BoosterHandle handle, int data_idx, int64_t
*out_len)
Get number of predictions for training data and validation data (this can be used to support customized evaluation
functions).
Parameters
• handle – Handle of booster
• data_idx – Index of data, 0: training data, 1: 1st validation data, 2: 2nd validation data and
so on
• out_len – [out] Number of predictions
Returns
0 when succeed, -1 when failure happens
LIGHTGBM_C_EXPORT int LGBM_BoosterGetPredict(BoosterHandle handle, int data_idx, int64_t *out_len,
double *out_result)
Get prediction for training data and validation data.
Note: You should pre-allocate memory for out_result, its length is equal to num_class * num_data.
Parameters
• handle – Handle of booster
• data_idx – Index of data, 0: training data, 1: 1st validation data, 2: 2nd validation data and
so on
• out_len – [out] Length of output result
• out_result – [out] Pointer to array with predictions
Returns
0 when succeed, -1 when failure happens
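As an illustration only, the pre-allocation pattern required by the note above can be sketched from Python via ctypes. This is a hedged sketch, not part of the official API documentation: it assumes the shared library name passed in matches your platform and that a valid BoosterHandle was obtained earlier through the C API.
import ctypes

def get_train_predictions(lib_path, booster_handle, num_class, num_data):
    # Fetch in-sample predictions via LGBM_BoosterGetPredict, pre-allocating the
    # output buffer of length num_class * num_data as the note above requires.
    lib = ctypes.CDLL(lib_path)
    out_len = ctypes.c_int64(0)
    out_result = (ctypes.c_double * (num_class * num_data))()
    ret = lib.LGBM_BoosterGetPredict(
        booster_handle,         # BoosterHandle (assumed to exist already)
        ctypes.c_int(0),        # data_idx = 0 -> training data
        ctypes.byref(out_len),  # [out] number of predictions written
        out_result,             # [out] pre-allocated result buffer
    )
    if ret != 0:
        raise RuntimeError('LGBM_BoosterGetPredict failed')
    return list(out_result[:out_len.value])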
LIGHTGBM_C_EXPORT int LGBM_BoosterLoadModelFromString(const char *model_str, int
*out_num_iterations, BoosterHandle *out)
Load an existing booster from string.
Parameters
• model_str – Model string
• out_num_iterations – [out] Number of iterations of this booster
• out – [out] Handle of created booster
Returns
0 when succeed, -1 when failure happens
LIGHTGBM_C_EXPORT int LGBM_BoosterMerge(BoosterHandle handle, BoosterHandle other_handle)
Merge model from other_handle into handle.
Parameters
• handle – Handle of booster, will merge another booster into this one
• other_handle – Other handle of booster
Returns
0 when succeed, -1 when failure happens
LIGHTGBM_C_EXPORT int LGBM_BoosterNumberOfTotalModel(BoosterHandle handle, int *out_models)
Get number of weak sub-models.
Parameters
• handle – Handle of booster
• out_models – [out] Number of weak sub-models
Returns
0 when succeed, -1 when failure happens
LIGHTGBM_C_EXPORT int LGBM_BoosterNumModelPerIteration(BoosterHandle handle, int
*out_tree_per_iteration)
Get number of trees per iteration.
Parameters
• handle – Handle of booster
• out_tree_per_iteration – [out] Number of trees per iteration
Returns
0 when succeed, -1 when failure happens
LIGHTGBM_C_EXPORT int LGBM_BoosterPredictForCSC(BoosterHandle handle, const void *col_ptr, int col_ptr_type, const int32_t *indices, const void *data, int data_type, int64_t ncol_ptr, int64_t nelem, int64_t num_row, int predict_type, int start_iteration, int num_iteration, const char *parameter, int64_t *out_len, double *out_result)
Make prediction for a new dataset in CSC format.
Parameters
• handle – Handle of booster
• col_ptr – Pointer to column headers
• col_ptr_type – Type of col_ptr, can be C_API_DTYPE_INT32 or C_API_DTYPE_INT64
• indices – Pointer to row indices
• data – Pointer to the data space
• data_type – Type of data pointer, can be C_API_DTYPE_FLOAT32 or
C_API_DTYPE_FLOAT64
• ncol_ptr – Number of columns in the matrix + 1
• nelem – Number of nonzero elements in the matrix
• num_row – Number of rows
• predict_type – What should be predicted
– C_API_PREDICT_NORMAL: normal prediction, with transform (if needed);
– C_API_PREDICT_RAW_SCORE: raw score;
– C_API_PREDICT_LEAF_INDEX: leaf index;
– C_API_PREDICT_CONTRIB: feature contributions (SHAP values)
• start_iteration – Start index of the iteration to predict
• num_iteration – Number of iterations for prediction, <= 0 means no limit
• parameter – Other parameters for prediction, e.g. early stopping for prediction
• out_len – [out] Length of output result
• out_result – [out] Pointer to array with predictions
Returns
0 when it succeeds, -1 when failure happens
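For orientation, the sketch below (an illustration, not part of the reference) shows how the attributes of a scipy.sparse.csc_matrix line up with the arguments described above; no C call is made here:
import numpy as np
from scipy import sparse

X = sparse.csc_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]))

col_ptr = X.indptr        # column headers              -> col_ptr (int32 -> C_API_DTYPE_INT32)
indices = X.indices       # row indices                 -> indices
values = X.data           # nonzero values              -> data (float64 -> C_API_DTYPE_FLOAT64)
ncol_ptr = len(X.indptr)  # number of columns + 1       -> ncol_ptr
nelem = X.nnz             # number of nonzero elements  -> nelem
num_row = X.shape[0]      # number of rows              -> num_row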
Parameters
• handle – Handle of booster
• indptr – Pointer to row headers
• indptr_type – Type of indptr, can be C_API_DTYPE_INT32 or C_API_DTYPE_INT64
• indices – Pointer to column indices
• data – Pointer to the data space
• data_type – Type of data pointer, can be C_API_DTYPE_FLOAT32 or
C_API_DTYPE_FLOAT64
• nindptr – Number of rows in the matrix + 1
• nelem – Number of nonzero elements in the matrix
• num_col – Number of columns
• predict_type – What should be predicted
– C_API_PREDICT_NORMAL: normal prediction, with transform (if needed);
– C_API_PREDICT_RAW_SCORE: raw score;
– C_API_PREDICT_LEAF_INDEX: leaf index;
– C_API_PREDICT_CONTRIB: feature contributions (SHAP values)
• start_iteration – Start index of the iteration to predict
• num_iteration – Number of iterations for prediction, <= 0 means no limit
• parameter – Other parameters for prediction, e.g. early stopping for prediction
• out_len – [out] Length of output result
• out_result – [out] Pointer to array with predictions
Returns
0 when it succeeds, -1 when failure happens
• for feature contributions, its length is equal to num_class * num_data * (num_feature + 1).
Parameters
• handle – Handle of booster
• indptr – Pointer to row headers
• indptr_type – Type of indptr, can be C_API_DTYPE_INT32 or C_API_DTYPE_INT64
• indices – Pointer to column indices
• data – Pointer to the data space
• data_type – Type of data pointer, can be C_API_DTYPE_FLOAT32 or
C_API_DTYPE_FLOAT64
• nindptr – Number of rows in the matrix + 1
• nelem – Number of nonzero elements in the matrix
• num_col – Number of columns
• predict_type – What should be predicted
– C_API_PREDICT_NORMAL: normal prediction, with transform (if needed);
– C_API_PREDICT_RAW_SCORE: raw score;
– C_API_PREDICT_LEAF_INDEX: leaf index;
– C_API_PREDICT_CONTRIB: feature contributions (SHAP values)
• start_iteration – Start index of the iteration to predict
• num_iteration – Number of iterations for prediction, <= 0 means no limit
• parameter – Other parameters for prediction, e.g. early stopping for prediction
• out_len – [out] Length of output result
• out_result – [out] Pointer to array with predictions
Returns
0 when it succeeds, -1 when failure happens
If you want to use a different number of threads in other calls, you need to start the setup process over, or that number of threads will be used for these calls as well.
Parameters
• fastConfig_handle – FastConfig object handle returned by
LGBM_BoosterPredictForCSRSingleRowFastInit
• indptr – Pointer to row headers
• indptr_type – Type of indptr, can be C_API_DTYPE_INT32 or C_API_DTYPE_INT64
• indices – Pointer to column indices
• data – Pointer to the data space
• nindptr – Number of rows in the matrix + 1
• nelem – Number of nonzero elements in the matrix
• out_len – [out] Length of output result
• out_result – [out] Pointer to array with predictions
Returns
0 when it succeeds, -1 when failure happens
• for feature contributions, its length is equal to num_class * num_data * (num_feature + 1).
Parameters
• handle – Handle of booster
• data – Pointer to the data space
• data_type – Type of data pointer, can be C_API_DTYPE_FLOAT32 or
C_API_DTYPE_FLOAT64
• nrow – Number of rows
• ncol – Number of columns
• is_row_major – 1 for row-major, 0 for column-major
• predict_type – What should be predicted
– C_API_PREDICT_NORMAL: normal prediction, with transform (if needed);
– C_API_PREDICT_RAW_SCORE: raw score;
– C_API_PREDICT_LEAF_INDEX: leaf index;
– C_API_PREDICT_CONTRIB: feature contributions (SHAP values)
• start_iteration – Start index of the iteration to predict
• num_iteration – Number of iterations for prediction, <= 0 means no limit
• parameter – Other parameters for prediction, e.g. early stopping for prediction
• out_len – [out] Length of output result
• out_result – [out] Pointer to array with predictions
Returns
0 when it succeeds, -1 when failure happens
Parameters
• handle – Handle of booster
• data – Pointer to the data space
• data_type – Type of data pointer, can be C_API_DTYPE_FLOAT32 or
C_API_DTYPE_FLOAT64
Parameters
• handle – Handle of booster
• data – Pointer to the data space
• data_type – Type of data pointer, can be C_API_DTYPE_FLOAT32 or
C_API_DTYPE_FLOAT64
• ncol – Number of columns
• is_row_major – 1 for row-major, 0 for column-major
• predict_type – What should be predicted
– C_API_PREDICT_NORMAL: normal prediction, with transform (if needed);
– C_API_PREDICT_RAW_SCORE: raw score;
– C_API_PREDICT_LEAF_INDEX: leaf index;
– C_API_PREDICT_CONTRIB: feature contributions (SHAP values)
Parameters
• fastConfig_handle – FastConfig object handle returned by
LGBM_BoosterPredictForMatSingleRowFastInit
• data – Single-row array data (must be supplied in row-major form).
• out_len – [out] Length of output result
• out_result – [out] Pointer to array with predictions
Returns
0 when it succeeds, -1 when failure happens
Note: The outputs are pre-allocated, as they can vary for each invocation, but the shape should be the same:
• for feature contributions, the shape of sparse matrix will be num_class * num_data * (num_feature
+ 1). The output indptr_type for the sparse matrix will be the same as the given input indptr_type. Call
LGBM_BoosterFreePredictSparse to deallocate resources.
Parameters
• handle – Handle of booster
• indptr – Pointer to row headers for CSR or column headers for CSC
• indptr_type – Type of indptr, can be C_API_DTYPE_INT32 or C_API_DTYPE_INT64
• indices – Pointer to column indices for CSR or row indices for CSC
• data – Pointer to the data space
• data_type – Type of data pointer, can be C_API_DTYPE_FLOAT32 or
C_API_DTYPE_FLOAT64
• nindptr – Number of entries in indptr
• nelem – Number of nonzero elements in the matrix
• num_col_or_row – Number of columns for CSR or number of rows for CSC
Note: The length of the arrays referenced by grad and hess must be equal to num_class * num_train_data; this is not verified by the library, so the caller must ensure it.
Parameters
• handle – Handle of booster
• grad – The first order derivative (gradient) statistics
• hess – The second order derivative (Hessian) statistics
• is_finished – [out] 1 means the update was successfully finished (cannot split any more),
0 indicates failure
Returns
0 when it succeeds, -1 when failure happens
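As a small illustration of the sizing rule in the note above (an assumption-laden sketch, not part of the reference), the gradient and Hessian buffers handed to the custom-gradient boosting update must contain num_class * num_train_data values; the numbers below are placeholders:
import numpy as np

num_class, num_train_data = 3, 1000                              # placeholders
grad = np.zeros(num_class * num_train_data, dtype=np.float32)    # float32 assumed to match the C float type
hess = np.zeros(num_class * num_train_data, dtype=np.float32)
# ... fill grad and hess from your objective before passing the buffers to the C API ...
assert grad.size == hess.size == num_class * num_train_data      # the library does not verify this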
Parameters
• handle – Handle of booster
• data_names – Array with the feature names in the data
• data_num_features – Number of features in the data
Returns
0 when it succeeds, -1 when failure happens
LIGHTGBM_C_EXPORT int LGBM_ByteBufferFree(ByteBufferHandle handle)
Free space for byte buffer.
Parameters
• handle – Handle of byte buffer to be freed
Returns
0 when it succeeds, -1 when failure happens
LIGHTGBM_C_EXPORT int LGBM_ByteBufferGetAt(ByteBufferHandle handle, int32_t index, uint8_t *out_val)
Get a ByteBuffer value at an index.
Parameters
• handle – Handle of byte buffer to be read
• index – Index of value to return
• out_val – [out] Byte value at index to return
Returns
0 when it succeeds, -1 when failure happens
LIGHTGBM_C_EXPORT int LGBM_DatasetAddFeaturesFrom(DatasetHandle target, DatasetHandle source)
Add features from source to target.
Parameters
• target – The handle of the dataset to add features to
• source – The handle of the dataset to take features from
Returns
0 when it succeeds, -1 when failure happens
LIGHTGBM_C_EXPORT int LGBM_DatasetCreateByReference(const DatasetHandle reference, int64_t
num_total_row, DatasetHandle *out)
Allocate the space for dataset and bucket feature bins according to reference dataset.
Parameters
• reference – Used to align bin mapper with other dataset
• num_total_row – Number of total rows
• out – [out] Created dataset
Returns
0 when it succeeds, -1 when failure happens
LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromCSC(const void *col_ptr, int col_ptr_type, const int32_t *indices, const void *data, int data_type, int64_t ncol_ptr, int64_t nelem, int64_t num_row, const char *parameters, const DatasetHandle reference, DatasetHandle *out)
Parameters
• dataset – Handle of dataset
• data – Pointer to the data space
• data_type – Type of data pointer, can be C_API_DTYPE_FLOAT32 or
C_API_DTYPE_FLOAT64
• nrow – Number of rows
• ncol – Number of feature columns
• start_row – Row start index, i.e., the index at which to start inserting data
• label – Pointer to array with nrow labels
• weight – Optional pointer to array with nrow weights
• init_score – Optional pointer to array with nrow*nclasses initial scores, in column format
• query – Optional pointer to array with nrow query values
• tid – The id of the calling thread, from 0 to N-1
Returns
0 when it succeeds, -1 when failure happens
Note:
• group only works for C_API_DTYPE_INT32;
• label and weight only work for C_API_DTYPE_FLOAT32;
• init_score only works for C_API_DTYPE_FLOAT64.
Parameters
• handle – Handle of dataset
• field_name – Field name, can be label, weight, init_score, group
• field_data – Pointer to data vector
Returns
0 when it succeeds, -1 when failure happens
LIGHTGBM_C_EXPORT int LGBM_SampleIndices(int32_t num_total_row, const char *parameters, void *out,
int32_t *out_len)
Create sample indices for total number of rows.
Note: You should pre-allocate memory for out, you can get its length by LGBM_GetSampleCount.
Parameters
• num_total_row – Number of total rows
• parameters – Additional parameters; namely, bin_construct_sample_cnt and data_random_seed are used to produce the output
• out – [out] Created indices, type is int32_t
• out_len – [out] Number of indices
Returns
0 when it succeeds, -1 when failure happens
Note: This will call unsafe sprintf when compiled using C standards before C99.
Parameters
• msg – Error message
CHAPTER
NINE
PYTHON API
9.1.1 lightgbm.Dataset
• group (list, numpy 1-D array, pandas Series or None, optional (default=None)) – Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
• init_score (list, list of lists (for multi-class task), numpy array,
pandas Series, pandas DataFrame (for multi-class task), or None,
optional (default=None)) – Init score for Dataset.
• feature_name (list of str, or 'auto', optional (default="auto")) – Feature names. If 'auto' and data is pandas DataFrame, data column names are used.
• categorical_feature (list of str or int, or 'auto', optional
(default="auto")) – Categorical features. If list of int, interpreted as indices. If
list of str, interpreted as feature names (need to specify feature_name as well). If ‘auto’
and data is pandas DataFrame, pandas unordered categorical columns are used. All values
in categorical features will be cast to int32 and thus should be less than int32 max value
(2147483647). Large values could be memory consuming. Consider using consecutive
integers starting from zero. All negative values in categorical features will be treated
as missing values. The output cannot be monotonically constrained with respect to a
categorical feature. Floating point numbers in categorical features will be rounded towards
0.
• params (dict or None, optional (default=None)) – Other parameters for
Dataset.
• free_raw_data (bool, optional (default=True)) – If True, raw data is freed after
constructing inner Dataset.
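A minimal usage sketch for the parameters above (the feature names and the categorical column are illustrative, and lightgbm is assumed to be importable as lgb):
import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 3)
X[:, 2] = np.random.randint(0, 4, size=100)   # small consecutive integer codes for the categorical column
y = np.random.rand(100)

train_data = lgb.Dataset(
    X,
    label=y,
    feature_name=["f0", "f1", "f2"],
    categorical_feature=["f2"],   # values are cast to int32, so keep them well below 2147483647
    free_raw_data=False,          # keep the raw data so it can be inspected or reused later
)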
Methods
add_features_from(other)
Add features from other Dataset to the current Dataset.
Both Datasets must be constructed before calling this method.
Parameters
other (Dataset) – The Dataset to take features from.
Returns
self – Dataset with the new features added.
Return type
Dataset
construct()
Lazy init.
Returns
self – Constructed Dataset object.
Return type
Dataset
create_valid(data, label=None, weight=None, group=None, init_score=None, params=None)
Create validation data aligned with the current Dataset.
Parameters
• data (str, pathlib.Path , numpy array, pandas DataFrame, H2O
DataTable's Frame, scipy.sparse, Sequence, list of Sequence or list
of numpy array) – Data source of Dataset. If str or pathlib.Path, it represents the path
to a text file (CSV, TSV, or LibSVM) or a LightGBM Dataset binary file.
• label (list, numpy 1-D array, pandas Series / one-column DataFrame or
None, optional (default=None)) – Label of the data.
• weight (list, numpy 1-D array, pandas Series or None, optional
(default=None)) – Weight for each instance. Weights should be non-negative.
• group (list, numpy 1-D array, pandas Series or None, optional
(default=None)) – Group/query data. Only used in the learning-to-rank task.
sum(group) = n_samples. For example, if you have a 100-document dataset with group
= [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10
records are in the first group, records 11-30 are in the second group, records 31-70 are in
the third group, etc.
• init_score (list, list of lists (for multi-class task), numpy array,
pandas Series, pandas DataFrame (for multi-class task), or None,
optional (default=None)) – Init score for Dataset.
• params (dict or None, optional (default=None)) – Other parameters for valida-
tion Dataset.
Returns
valid – Validation Dataset with reference to self.
Return type
Dataset
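A short sketch of create_valid (the arrays are synthetic placeholders):
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 5)), rng.random(200)
X_valid, y_valid = rng.random((50, 5)), rng.random(50)

train_data = lgb.Dataset(X_train, label=y_train)
valid_data = train_data.create_valid(X_valid, label=y_valid)   # shares bin mappers with train_data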
feature_num_bin(feature)
Get the number of bins for a feature.
New in version 4.0.0.
Parameters
feature (int or str) – Index or name of the feature.
Returns
number_of_bins – The number of constructed bins for the feature in the Dataset.
Return type
int
get_data()
Get the raw data of the Dataset.
Returns
data – Raw data used in the Dataset construction.
Return type
str, pathlib.Path, numpy array, pandas DataFrame, H2O DataTable’s Frame, scipy.sparse, Se-
quence, list of Sequence or list of numpy array or None
get_feature_name()
Get the names of columns (features) in the Dataset.
Returns
feature_names – The names of columns (features) in the Dataset.
Return type
list of str
get_field(field_name)
Get property from the Dataset.
Parameters
field_name (str) – The field name of the information.
Returns
info – A numpy array with information from the Dataset.
Return type
numpy array or None
get_group()
Get the group of the Dataset.
Returns
group – Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples.
For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10,
10], that means that you have 6 groups, where the first 10 records are in the first group, records
11-30 are in the second group, records 31-70 are in the third group, etc.
Return type
numpy array or None
get_init_score()
Get the initial score of the Dataset.
Returns
init_score – Init score of Booster.
Return type
numpy array or None
get_label()
Get the label of the Dataset.
Returns
label – The label information from the Dataset.
Return type
numpy array or None
get_params()
Get the used parameters in the Dataset.
Returns
params – The used parameters in this Dataset object.
Return type
dict
get_ref_chain(ref_limit=100)
Get a chain of Dataset objects.
Starts with r, then goes to r.reference (if it exists), then to r.reference.reference, etc. until we hit ref_limit
or a reference loop.
Parameters
ref_limit (int, optional (default=100)) – The limit number of references.
Returns
ref_chain – Chain of references of the Datasets.
Return type
set of Dataset
get_weight()
Get the weight of the Dataset.
Returns
weight – Weight for each data point from the Dataset. Weights should be non-negative.
Return type
numpy array or None
num_data()
Get the number of rows in the Dataset.
Returns
number_of_rows – The number of rows in the Dataset.
Return type
int
num_feature()
Get the number of columns (features) in the Dataset.
Returns
number_of_columns – The number of columns (features) in the Dataset.
Return type
int
save_binary(filename)
Save Dataset to a binary file.
Note: init_score is not saved in the binary file. If you need it, please set it again after loading the Dataset.
Parameters
filename (str or pathlib.Path ) – Name of the output file.
Returns
self – Returns self.
Return type
Dataset
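A short sketch of saving and reloading a Dataset binary file (the file name and arrays are placeholders); as the note above says, init_score must be set again after loading:
import numpy as np
import lightgbm as lgb

X, y = np.random.rand(100, 4), np.random.rand(100)
dtrain = lgb.Dataset(X, label=y)
dtrain.save_binary("train.bin")        # writes the constructed (binned) Dataset

reloaded = lgb.Dataset("train.bin")    # loading the binary file skips re-binning the raw data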
set_categorical_feature(categorical_feature)
Set categorical features.
Parameters
categorical_feature (list of str or int, or 'auto') – Names or indices of cate-
gorical features.
Returns
self – Dataset with set categorical features.
Return type
Dataset
set_feature_name(feature_name)
Set feature name.
Parameters
feature_name (list of str) – Feature names.
Returns
self – Dataset with set feature name.
Return type
Dataset
set_field(field_name, data)
Set property into the Dataset.
Parameters
• field_name (str) – The field name of the information.
• data (list, list of lists (for multi-class task), numpy array,
pandas Series, pandas DataFrame (for multi-class task), or None)
– The data to be set.
Returns
self – Dataset with set property.
Return type
Dataset
set_group(group)
Set group size of Dataset (used for ranking).
Parameters
group (list, numpy 1-D array, pandas Series or None) – Group/query data.
Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have
a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you
have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second
group, records 31-70 are in the third group, etc.
Returns
self – Dataset with set group.
Return type
Dataset
set_init_score(init_score)
Set init score of Booster to start from.
Parameters
init_score (list, list of lists (for multi-class task), numpy array,
pandas Series, pandas DataFrame (for multi-class task), or None) – Init
score for Booster.
Returns
self – Dataset with set init score.
Return type
Dataset
set_label(label)
Set label of Dataset.
Parameters
label (list, numpy 1-D array, pandas Series / one-column DataFrame or
None) – The label information to be set into Dataset.
Returns
self – Dataset with set label.
Return type
Dataset
set_reference(reference)
Set reference Dataset.
Parameters
reference (Dataset) – Reference that is used as a template to construct the current Dataset.
Returns
self – Dataset with set reference.
Return type
Dataset
set_weight(weight)
Set weight of each instance.
Parameters
weight (list, numpy 1-D array, pandas Series or None) – Weight to be set for
each data point. Weights should be non-negative.
Returns
self – Dataset with set weight.
Return type
Dataset
subset(used_indices, params=None)
Get subset of current Dataset.
Parameters
• used_indices (list of int) – Indices used to create the subset.
• params (dict or None, optional (default=None)) – These parameters will be
passed to Dataset constructor.
Returns
subset – Subset of the current Dataset.
Return type
Dataset
9.1.2 lightgbm.Booster
Methods
add_valid(data, name)
Add validation data.
Parameters
• data (Dataset) – Validation data.
• name (str) – Name of validation data.
Returns
self – Booster with set validation data.
Return type
Booster
current_iteration()
Get the index of the current iteration.
Returns
cur_iter – The index of the current iteration.
Return type
int
dump_model(num_iteration=None, start_iteration=0, importance_type='split', object_hook=None)
Dump Booster to JSON format.
Parameters
• num_iteration (int or None, optional (default=None)) – Index of the iteration
that should be dumped. If None, if the best iteration exists, it is dumped; otherwise, all
iterations are dumped. If <= 0, all iterations are dumped.
• start_iteration (int, optional (default=0)) – Start index of the iteration that
should be dumped.
• importance_type (str, optional (default="split")) – What type of feature im-
portance should be dumped. If “split”, result contains numbers of times the feature is used
in a model. If “gain”, result contains total gains of splits which use the feature.
• object_hook (callable or None, optional (default=None)) – If not None,
object_hook is a function called while parsing the json string returned by the C API.
It may be used to alter the json, to store specific values while building the json structure. It
avoids walking through the structure again. It saves a significant amount of time if the num-
ber of trees is huge. Signature is def object_hook(node: dict) -> dict. None is
equivalent to lambda node: node. See documentation of json.loads() for further
details.
Returns
json_repr – JSON format of Booster.
Return type
dict
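An illustrative sketch of dump_model with an object_hook; the hook below drops one per-leaf key while the JSON is parsed (the key name and the training setup are assumptions, not part of the reference):
import numpy as np
import lightgbm as lgb

X, y = np.random.rand(200, 5), np.random.rand(200)
bst = lgb.train({"objective": "regression", "verbose": -1}, lgb.Dataset(X, label=y), num_boost_round=5)

def drop_leaf_weight(node: dict) -> dict:
    # called for every parsed JSON object; must return a dict
    node.pop("leaf_weight", None)      # assumed key; adjust to whatever you want to strip or record
    return node

model_json = bst.dump_model(object_hook=drop_leaf_weight)
print(len(model_json["tree_info"]))    # number of dumped trees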
eval_train(feval=None)
Evaluate for training data.
Parameters
feval (callable, list of callable, or None, optional (default=None)) – Customized evaluation function. Each evaluation function should accept two parameters: preds, eval_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples.
preds
[numpy 1-D array or numpy 2-D array (for multi-class task)] The predicted values. For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes]. If custom objective function is used, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.
eval_data
[Dataset] The training dataset.
eval_name
[str] The name of evaluation function (without whitespace).
eval_result
[float] The eval result.
is_higher_better
[bool] Is eval result higher better, e.g. AUC is is_higher_better.
Returns
result – List with (train_dataset_name, eval_name, eval_result, is_higher_better) tuples.
Return type
list
eval_valid(feval=None)
Evaluate for validation data.
Parameters
feval (callable, list of callable, or None, optional (default=None))
– Customized evaluation function. Each evaluation function should accept two parame-
ters: preds, eval_data, and return (eval_name, eval_result, is_higher_better) or list of such
tuples.
preds
[numpy 1-D array or numpy 2-D array (for multi-class task)] The predicted
values. For multi-class task, preds are numpy 2-D array of shape = [n_samples,
n_classes]. If custom objective function is used, predicted values are returned
before any transformation, e.g. they are raw margin instead of probability of
positive class for binary task in this case.
eval_data
[Dataset] The validation dataset.
eval_name
[str] The name of evaluation function (without whitespace).
eval_result
[float] The eval result.
is_higher_better
[bool] Is eval result higher better, e.g. AUC is is_higher_better.
Returns
result – List with (validation_dataset_name, eval_name, eval_result, is_higher_better) tu-
ples.
Return type
list
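A sketch of a customized evaluation function passed to eval_valid; it receives the predictions and the validation Dataset and returns an (eval_name, eval_result, is_higher_better) tuple (the data and validation-set name are synthetic placeholders):
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X, y = rng.random((300, 5)), rng.random(300)
dtrain = lgb.Dataset(X[:200], label=y[:200])
dvalid = dtrain.create_valid(X[200:], label=y[200:])

bst = lgb.Booster(params={"objective": "regression", "verbose": -1}, train_set=dtrain)
bst.add_valid(dvalid, "valid_0")
for _ in range(10):
    bst.update()

def median_abs_error(preds, eval_data):
    return "median_ae", float(np.median(np.abs(preds - eval_data.get_label()))), False

print(bst.eval_valid(feval=median_abs_error))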
feature_importance(importance_type='split', iteration=None)
Get feature importances.
Parameters
• importance_type (str, optional (default="split")) – How the importance
is calculated. If “split”, result contains numbers of times the feature is used in a model.
If “gain”, result contains total gains of splits which use the feature.
• iteration (int or None, optional (default=None)) – Limit number of iter-
ations in the feature importance calculation. If None, if the best iteration exists, it is
used; otherwise, all trees are used. If <= 0, all trees are used (no limits).
Returns
result – Array with feature importances.
Return type
numpy array
feature_name()
Get names of features.
Returns
result – List with names of features.
Return type
list of str
free_dataset()
Free Booster’s Datasets.
Returns
self – Booster without Datasets.
Return type
Booster
free_network()
Free Booster’s network.
Returns
self – Booster with freed network.
Return type
Booster
get_leaf_output(tree_id, leaf_id)
Get the output of a leaf.
Parameters
• tree_id (int) – The index of the tree.
• leaf_id (int) – The index of the leaf in the tree.
Returns
result – The output of the leaf.
Return type
float
get_split_value_histogram(feature, bins=None, xgboost_style=False)
Get split value histogram for the specified feature.
Parameters
• feature (int or str) – The feature name or index the histogram is calculated for.
If int, interpreted as index. If str, interpreted as name.
Return type
int
num_model_per_iteration()
Get number of models per iteration.
Returns
model_per_iter – The number of models per iteration.
Return type
int
num_trees()
Get number of weak sub-models.
Returns
num_trees – The number of weak sub-models.
Return type
int
predict(data, start_iteration=0, num_iteration=None, raw_score=False, pred_leaf=False,
pred_contrib=False, data_has_header=False, validate_features=False, **kwargs)
Make a prediction.
Parameters
• data (str, pathlib.Path , numpy array, pandas DataFrame, H2O
DataTable's Frame or scipy.sparse) – Data source for prediction. If str
or pathlib.Path, it represents the path to a text file (CSV, TSV, or LibSVM).
• start_iteration (int, optional (default=0)) – Start index of the iteration
to predict. If <= 0, starts from the first iteration.
• num_iteration (int or None, optional (default=None)) – Total number of
iterations used in the prediction. If None, if the best iteration exists and start_iteration
<= 0, the best iteration is used; otherwise, all iterations from start_iteration are
used (no limits). If <= 0, all iterations from start_iteration are used (no limits).
• raw_score (bool, optional (default=False)) – Whether to predict raw
scores.
• pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.
• pred_contrib (bool, optional (default=False)) – Whether to predict fea-
ture contributions.
Note: If you want to get more explanations for your model’s predictions using SHAP
values, like SHAP interaction values, you can install the shap package (https://github.
com/slundberg/shap). Note that unlike the shap package, with pred_contrib we
return a matrix with an extra column, where the last column is the expected value.
Returns
result – Prediction result. Can be sparse or a list of sparse objects (each element represents
predictions for one class) for feature contributions (when pred_contrib=True).
Return type
numpy array, scipy.sparse or list of scipy.sparse
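A sketch of the pred_contrib output described above: for a dense regression model the result has n_features + 1 columns, the last one being the expected value, and each row sums to the ordinary prediction (data is synthetic):
import numpy as np
import lightgbm as lgb

X, y = np.random.rand(200, 5), np.random.rand(200)
bst = lgb.train({"objective": "regression", "verbose": -1}, lgb.Dataset(X, label=y), num_boost_round=10)

contrib = bst.predict(X, pred_contrib=True)
print(contrib.shape)                                         # (200, 6): n_features + 1 columns
print(np.isclose(contrib[0].sum(), bst.predict(X[:1])[0]))   # contributions plus expected value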
refit(data, label, decay_rate=0.9, reference=None, weight=None, group=None, init_score=None,
feature_name='auto', categorical_feature='auto', dataset_params=None, free_raw_data=True,
validate_features=False, **kwargs)
Refit the existing Booster by new data.
Parameters
• data (str, pathlib.Path , numpy array, pandas DataFrame, H2O
DataTable's Frame, scipy.sparse, Sequence, list of Sequence or
list of numpy array) – Data source for refit. If str or pathlib.Path, it represents
the path to a text file (CSV, TSV, or LibSVM).
• label (list, numpy 1-D array or pandas Series / one-column
DataFrame) – Label for refit.
• decay_rate (float, optional (default=0.9)) – Decay rate of refit, will use
leaf_output = decay_rate * old_leaf_output + (1.0 - decay_rate)
* new_leaf_output to refit trees.
• reference (Dataset or None, optional (default=None)) – Reference for
data.
New in version 4.0.0.
• weight (list, numpy 1-D array, pandas Series or None, optional
(default=None)) – Weight for each data instance. Weights should be non-
negative.
New in version 4.0.0.
• group (list, numpy 1-D array, pandas Series or None, optional
(default=None)) – Group/query size for data. Only used in the learning-to-rank
task. sum(group) = n_samples. For example, if you have a 100-document dataset
with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups,
where the first 10 records are in the first group, records 11-30 are in the second group,
records 31-70 are in the third group, etc.
New in version 4.0.0.
• init_score (list, list of lists (for multi-class task), numpy
array, pandas Series, pandas DataFrame (for multi-class task),
or None, optional (default=None)) – Init score for data.
New in version 4.0.0.
• feature_name (list of str, or 'auto', optional (default="auto")) –
Feature names for data. If ‘auto’ and data is pandas DataFrame, data columns names
are used.
New in version 4.0.0.
• categorical_feature (list of str or int, or 'auto', optional
(default="auto")) – Categorical features for data. If list of int, interpreted
as indices. If list of str, interpreted as feature names (need to specify feature_name
as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical
columns are used. All values in categorical features will be cast to int32 and thus
should be less than int32 max value (2147483647). Large values could be memory
consuming. Consider using consecutive integers starting from zero. All negative
values in categorical features will be treated as missing values. The output cannot
be monotonically constrained with respect to a categorical feature. Floating point
numbers in categorical features will be rounded towards 0.
New in version 4.0.0.
• dataset_params (dict or None, optional (default=None)) – Other param-
eters for Dataset data.
New in version 4.0.0.
• free_raw_data (bool, optional (default=True)) – If True, raw data is freed
after constructing inner Dataset for data.
New in version 4.0.0.
• validate_features (bool, optional (default=False)) – If True, ensure that
the features used to refit the model match the original ones. Used only if data is pandas
DataFrame.
New in version 4.0.0.
• **kwargs – Other parameters for refit. These parameters will be passed to predict
method.
Returns
result – Refitted Booster.
Return type
Booster
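A sketch of refit on new data using the decay formula above (the new data is synthetic; the tree structure is kept and only leaf outputs are blended):
import numpy as np
import lightgbm as lgb

X, y = np.random.rand(300, 5), np.random.rand(300)
bst = lgb.train({"objective": "regression", "verbose": -1}, lgb.Dataset(X, label=y), num_boost_round=20)

X_new, y_new = np.random.rand(100, 5), np.random.rand(100)
refitted = bst.refit(X_new, y_new, decay_rate=0.8)     # leaf_output = 0.8 * old + 0.2 * new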
reset_parameter(params)
Reset parameters of Booster.
Parameters
params (dict) – New parameters for Booster.
Returns
self – Booster with new parameters.
Return type
Booster
rollback_one_iter()
Rollback one iteration.
Returns
self – Booster with rolled back one iteration.
Return type
Booster
save_model(filename, num_iteration=None, start_iteration=0, importance_type='split')
Save Booster to file.
Parameters
• filename (str or pathlib.Path ) – Filename to save Booster.
Returns
self – Booster with set training Dataset name.
Return type
Booster
shuffle_models(start_iteration=0, end_iteration=-1)
Shuffle models.
Parameters
• start_iteration (int, optional (default=0)) – The first iteration that will
be shuffled.
• end_iteration (int, optional (default=-1)) – The last iteration that will be
shuffled. If <= 0, means the last available iteration.
Returns
self – Booster with shuffled models.
Return type
Booster
trees_to_dataframe()
Parse the fitted model and return it in an easy-to-read pandas DataFrame.
The returned DataFrame has the following columns.
• tree_index : int64, which tree a node belongs to. 0-based, so a value of 6, for example, means
“this node is in the 7th tree”.
• node_depth : int64, how far a node is from the root of the tree. The root node has a value of 1, its
direct children are 2, etc.
• node_index : str, unique identifier for a node.
• left_child : str, node_index of the child node to the left of a split. None for leaf nodes.
• right_child : str, node_index of the child node to the right of a split. None for leaf nodes.
• parent_index : str, node_index of this node’s parent. None for the root node.
• split_feature : str, name of the feature used for splitting. None for leaf nodes.
• split_gain : float64, gain from adding this split to the tree. NaN for leaf nodes.
• threshold : float64, value of the feature used to decide which side of the split a record will go
down. NaN for leaf nodes.
• decision_type : str, logical operator describing how to compare a value to threshold. For ex-
ample, split_feature = "Column_10", threshold = 15, decision_type = "<=" means
that records where Column_10 <= 15 follow the left side of the split, otherwise they follow the right
side of the split. None for leaf nodes.
• missing_direction : str, split direction that missing values should go to. None for leaf nodes.
• missing_type : str, describes what types of values are treated as missing.
• value : float64, predicted value for this leaf node, multiplied by the learning rate.
• weight : float64 or int64, sum of Hessian (second-order derivative of objective), summed over
observations that fall in this node.
• count : int64, number of records in the training data that fall into this node.
Returns
result – Returns a pandas DataFrame of the parsed model.
Return type
pandas DataFrame
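A short sketch of inspecting the fitted trees with the columns listed above (the model and data are illustrative):
import numpy as np
import lightgbm as lgb

X, y = np.random.rand(200, 5), np.random.rand(200)
bst = lgb.train({"objective": "regression", "verbose": -1}, lgb.Dataset(X, label=y), num_boost_round=3)

df = bst.trees_to_dataframe()
print(df.loc[df["tree_index"] == 0,
             ["node_depth", "split_feature", "threshold", "value", "count"]].head())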
update(train_set=None, fobj=None)
Update Booster for one iteration.
Parameters
• train_set (Dataset or None, optional (default=None)) – Training data.
If None, last training data is used.
• fobj (callable or None, optional (default=None)) – Customized objec-
tive function. Should accept two parameters: preds, train_data, and return (grad, hess).
preds
[numpy 1-D array or numpy 2-D array (for multi-class task)] The predicted
values. Predicted values are returned before any transformation, e.g. they are
raw margin instead of probability of positive class for binary task.
train_data
[Dataset] The training dataset.
grad
[numpy 1-D array or numpy 2-D array (for multi-class task)] The value of the
first order derivative (gradient) of the loss with respect to the elements of preds
for each sample point.
hess
[numpy 1-D array or numpy 2-D array (for multi-class task)] The value of the
second order derivative (Hessian) of the loss with respect to the elements of
preds for each sample point.
For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes],
and grad and hess should be returned in the same format.
Returns
is_finished – Whether the update was successfully finished.
Return type
bool
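A sketch of driving boosting manually with update and a customized objective returning (grad, hess); a plain squared-error objective is used here and the data is synthetic:
import numpy as np
import lightgbm as lgb

X, y = np.random.rand(300, 5), np.random.rand(300)
dtrain = lgb.Dataset(X, label=y)
bst = lgb.Booster(params={"verbose": -1}, train_set=dtrain)

def l2_objective(preds, train_data):
    residual = preds - train_data.get_label()
    return residual, np.ones_like(residual)            # grad, hess

for _ in range(10):
    if bst.update(fobj=l2_objective):                  # True means no further splits were possible
        break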
upper_bound()
Get upper bound value of a model.
Returns
upper_bound – Upper bound value of the model.
Return type
float
9.1.3 lightgbm.CVBooster
class lightgbm.CVBooster(model_file=None)
Bases: object
CVBooster in LightGBM.
Auxiliary data structure to hold and redirect all boosters of the cv() function. This class has the same methods as the Booster class. All method calls, except for the following methods, are actually performed for the underlying Boosters, and all returned results are returned in a list.
• model_from_string()
• model_to_string()
• save_model()
boosters
The list of underlying fitted models.
Type
list of Booster
best_iteration
The best iteration of fitted model.
Type
int
__init__(model_file=None)
Initialize the CVBooster.
Parameters
model_file (str, pathlib.Path or None, optional (default=None)) – Path
to the CVBooster model file.
Methods
model_from_string(model_str)
Load CVBooster from a string.
Parameters
model_str (str) – Model will be loaded from this string.
Returns
self – Loaded CVBooster object.
Return type
CVBooster
9.1.4 lightgbm.Sequence
class lightgbm.Sequence
Bases: ABC
Generic data access interface.
Object should support the following operations:
• With random access, data sampling does not need to go through all data.
• With range data access, there’s no need to read all data into memory, thus reducing memory usage.
Methods
__init__()
Attributes
batch_size
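A minimal sketch of a Sequence implementation, assuming (as the interface above implies) that a subclass provides row access via __getitem__ (single index or slice), a row count via __len__, and a batch_size attribute controlling range reads:
import numpy as np
import lightgbm as lgb

class NumpySequence(lgb.Sequence):
    def __init__(self, data, batch_size=4096):
        self.data = data
        self.batch_size = batch_size      # used when reading ranges of rows

    def __getitem__(self, idx):
        return self.data[idx]             # idx may be an int or a slice

    def __len__(self):
        return len(self.data)

X, y = np.random.rand(1000, 5), np.random.rand(1000)
dataset = lgb.Dataset(NumpySequence(X), label=y)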
train(params, train_set[, num_boost_round, ...]) Perform the training with given parameters.
cv(params, train_set[, num_boost_round, ...]) Perform the cross-validation with given parameters.
9.2.1 lightgbm.train
model conversion performed during the internal call of model_to_string. You can still use _InnerPredictor as init_model to continue training later.
• callbacks (list of callable, or None, optional (default=None)) – List
of callback functions that are applied at each iteration. See Callbacks in Python API for
more information.
Note: A custom objective function can be provided for the objective parameter. It should accept two param-
eters: preds, train_data and return (grad, hess).
preds
[numpy 1-D array or numpy 2-D array (for multi-class task)] The predicted values. Predicted
values are returned before any transformation, e.g. they are raw margin instead of probability
of positive class for binary task.
train_data
[Dataset] The training dataset.
grad
[numpy 1-D array or numpy 2-D array (for multi-class task)] The value of the first order deriva-
tive (gradient) of the loss with respect to the elements of preds for each sample point.
hess
[numpy 1-D array or numpy 2-D array (for multi-class task)] The value of the second order
derivative (Hessian) of the loss with respect to the elements of preds for each sample point.
For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes], and grad and hess should be
returned in the same format.
Returns
booster – The trained Booster model.
Return type
Booster
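A sketch of training with a customized objective passed through the objective entry of params, using the (preds, train_data) -> (grad, hess) signature described in the note above (the data is synthetic):
import numpy as np
import lightgbm as lgb

def l2_objective(preds, train_data):
    residual = preds - train_data.get_label()
    return residual, np.ones_like(residual)

X, y = np.random.rand(300, 5), np.random.rand(300)
booster = lgb.train({"objective": l2_objective, "verbose": -1},
                    lgb.Dataset(X, label=y), num_boost_round=10)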
9.2.2 lightgbm.cv
Note: A custom objective function can be provided for the objective parameter. It should accept two param-
eters: preds, train_data and return (grad, hess).
preds
[numpy 1-D array or numpy 2-D array (for multi-class task)] The predicted values. Predicted
values are returned before any transformation, e.g. they are raw margin instead of probability
of positive class for binary task.
train_data
[Dataset] The training dataset.
grad
[numpy 1-D array or numpy 2-D array (for multi-class task)] The value of the first order deriva-
tive (gradient) of the loss with respect to the elements of preds for each sample point.
hess
[numpy 1-D array or numpy 2-D array (for multi-class task)] The value of the second order
derivative (Hessian) of the loss with respect to the elements of preds for each sample point.
For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes], and grad and hess should be
returned in the same format.
Returns
eval_hist – Evaluation history. The dictionary has the following format: {‘metric1-mean’:
[values], ‘metric1-stdv’: [values], ‘metric2-mean’: [values], ‘metric2-stdv’: [values], . . . }. If
return_cvbooster=True, also returns trained boosters wrapped in a CVBooster object via
cvbooster key.
Return type
dict
9.3.1 lightgbm.LGBMModel
parameter only for multi-class classification task; for binary classification task you
may use the is_unbalance or scale_pos_weight parameters. Note that the usage of
all these parameters will result in poor estimates of the individual class probabilities.
You may want to consider performing probability calibration (https://scikit-learn.org/
stable/modules/calibration.html) of your model. The ‘balanced’ mode uses the val-
ues of y to automatically adjust weights inversely proportional to class frequencies
in the input data as n_samples / (n_classes * np.bincount(y)). If None, all
classes are supposed to have weight one. Note that these weights will be multiplied
with sample_weight (passed through the fit method) if sample_weight is speci-
fied.
• min_split_gain (float, optional (default=0.)) – Minimum loss reduction
required to make a further partition on a leaf node of the tree.
• min_child_weight (float, optional (default=1e-3)) – Minimum sum of
instance weight (Hessian) needed in a child (leaf).
• min_child_samples (int, optional (default=20)) – Minimum number of
data needed in a child (leaf).
• subsample (float, optional (default=1.)) – Subsample ratio of the training
instance.
• subsample_freq (int, optional (default=0)) – Frequency of subsample; <= 0 means it is disabled.
• colsample_bytree (float, optional (default=1.)) – Subsample ratio of
columns when constructing each tree.
• reg_alpha (float, optional (default=0.)) – L1 regularization term on
weights.
• reg_lambda (float, optional (default=0.)) – L2 regularization term on
weights.
• random_state (int, RandomState object or None, optional
(default=None)) – Random number seed. If int, this number is used to seed
the C++ code. If RandomState object (numpy), a random integer is picked based on
its state to seed the C++ code. If None, default seeds in C++ code are used.
• n_jobs (int or None, optional (default=None)) – Number of parallel
threads to use for training (can be changed at prediction time by passing it as an extra
keyword argument).
For better performance, it is recommended to set this to the number of physical cores
in the CPU.
Negative integers are interpreted as following joblib’s formula (n_cpus + 1 + n_jobs),
just like scikit-learn (so e.g. -1 means using all threads). A value of zero corresponds to
the default number of threads configured for OpenMP in the system. A value of None
(the default) corresponds to using the number of physical cores in the system (its cor-
rect detection requires either the joblib or the psutil util libraries to be installed).
Changed in version 4.0.0.
• importance_type (str, optional (default='split')) – The type of feature
importance to be filled into feature_importances_. If ‘split’, result contains num-
bers of times the feature is used in a model. If ‘gain’, result contains total gains of
splits which use the feature.
Note: A custom objective function can be provided for the objective parameter. In this case, it should
have the signature objective(y_true, y_pred) -> grad, hess, objective(y_true, y_pred,
weight) -> grad, hess or objective(y_true, y_pred, weight, group) -> grad, hess:
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. Predicted values are returned before
any transformation, e.g. they are raw margin instead of probability of positive class for
binary task.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
grad
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the first order derivative (gradient) of the
loss with respect to the elements of y_pred for each sample point.
hess
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the second order derivative (Hessian) of the
loss with respect to the elements of y_pred for each sample point.
For multi-class task, y_pred is a numpy 2-D array of shape = [n_samples, n_classes], and grad and hess
should be returned in the same format.
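A sketch of the scikit-learn-style objective(y_true, y_pred) -> grad, hess signature described above, passed to LGBMRegressor (the data is synthetic; with a custom objective, predictions are raw scores):
import numpy as np
from lightgbm import LGBMRegressor

def l2_objective(y_true, y_pred):
    residual = y_pred - y_true
    return residual, np.ones_like(residual)            # grad, hess

X, y = np.random.rand(300, 5), np.random.rand(300)
model = LGBMRegressor(objective=l2_objective, n_estimators=20).fit(X, y)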
Methods
Attributes
property best_iteration_
The best iteration of fitted model if early_stopping() callback has been specified.
Type
int
property best_score_
The best score of fitted model.
Type
dict
property booster_
The underlying Booster of this model.
Type
Booster
property evals_result_
The evaluation results if validation sets have been specified.
Type
dict
property feature_importances_
The feature importances (the higher, the more important).
Note: importance_type attribute is passed to the function to configure the type of importance values
to be extracted.
Type
array of shape = [n_features]
property feature_name_
The names of features.
Type
list of shape = [n_features]
fit(X, y, sample_weight=None, init_score=None, group=None, eval_set=None, eval_names=None,
eval_sample_weight=None, eval_class_weight=None, eval_init_score=None, eval_group=None,
eval_metric=None, feature_name='auto', categorical_feature='auto', callbacks=None,
init_model=None)
Build a gradient boosting model from the training set (X, y).
Parameters
• X (numpy array, pandas DataFrame, H2O DataTable's Frame , scipy.
sparse, list of lists of int or float of shape = [n_samples,
n_features]) – Input feature matrix.
• y (numpy array, pandas DataFrame, pandas Series, list of int or
float of shape = [n_samples]) – The target values (class labels in classifica-
tion, real numbers in regression).
• sample_weight (numpy array, pandas Series, list of int or float
of shape = [n_samples] or None, optional (default=None)) – Weights
of training data. Weights should be non-negative.
• init_score (numpy array, pandas DataFrame, pandas Series,
list of int or float of shape = [n_samples] or shape =
[n_samples * n_classes] (for multi-class task) or shape =
[n_samples, n_classes] (for multi-class task) or None, optional
(default=None)) – Init score of training data.
• group (numpy array, pandas Series, list of int or float, or None,
optional (default=None)) – Group/query data. Only used in the learning-to-rank
task. sum(group) = n_samples. For example, if you have a 100-document dataset with
group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where
the first 10 records are in the first group, records 11-30 are in the second group, records
31-70 are in the third group, etc.
• eval_set (list or None, optional (default=None)) – A list of (X, y) tuple
pairs to use as validation sets.
• eval_names (list of str, or None, optional (default=None)) – Names
of eval_set.
• eval_sample_weight (list of array (same types as sample_weight supports), or
None, optional (default=None)) – Weights of eval data. Weights should be non-
negative.
Note: Custom eval function expects a callable with following signatures: func(y_true,
y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group) and re-
turns (eval_name, eval_result, is_higher_better) or list of (eval_name, eval_result, is_higher_better):
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. In case of custom objective,
predicted values are returned before any transformation, e.g. they are raw margin instead
of probability of positive class for binary task in this case.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
eval_name
[str] The name of evaluation function (without whitespace).
eval_result
[float] The eval result.
is_higher_better
[bool] Is eval result higher better, e.g. AUC is is_higher_better.
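A sketch of a custom eval function with the func(y_true, y_pred) signature described above, returning (eval_name, eval_result, is_higher_better) and passed to fit via eval_metric (the data is synthetic; LGBMRegressor stands in for any estimator of this family):
import numpy as np
from lightgbm import LGBMRegressor

def median_abs_error(y_true, y_pred):
    return "median_ae", float(np.median(np.abs(y_true - y_pred))), False

X, y = np.random.rand(300, 5), np.random.rand(300)
model = LGBMRegressor(n_estimators=20).fit(
    X[:200], y[:200],
    eval_set=[(X[200:], y[200:])],
    eval_metric=median_abs_error,
)
print(model.evals_result_)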
get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
Returns
routing – A MetadataRequest encapsulating routing information.
Return type
MetadataRequest
get_params(deep=True)
Get parameters for this estimator.
Parameters
deep (bool, optional (default=True)) – If True, will return the parameters for this
estimator and contained subobjects that are estimators.
Returns
params – Parameter names mapped to their values.
Return type
dict
property n_estimators_
True number of boosting iterations performed.
This might be less than parameter n_estimators if early stopping was enabled or if boosting stopped
early due to limits on complexity like min_gain_to_split.
New in version 4.0.0.
Type
int
property n_features_
The number of features of fitted model.
Type
int
property n_features_in_
The number of features of fitted model.
Type
int
property n_iter_
True number of boosting iterations performed.
This might be less than parameter n_estimators if early stopping was enabled or if boosting stopped
early due to limits on complexity like min_gain_to_split.
New in version 4.0.0.
Type
int
property objective_
The concrete objective used while fitting this model.
Type
str or callable
predict(X, raw_score=False, start_iteration=0, num_iteration=None, pred_leaf=False, pred_contrib=False,
validate_features=False, **kwargs)
Return the predicted value for each sample.
Parameters
• X (numpy array, pandas DataFrame, H2O DataTable's Frame , scipy.
sparse, list of lists of int or float of shape = [n_samples,
n_features]) – Input features matrix.
• raw_score (bool, optional (default=False)) – Whether to predict raw
scores.
• start_iteration (int, optional (default=0)) – Start index of the iteration
to predict. If <= 0, starts from the first iteration.
• num_iteration (int or None, optional (default=None)) – Total number of
iterations used in the prediction. If None, if the best iteration exists and start_iteration
<= 0, the best iteration is used; otherwise, all iterations from start_iteration are
used (no limits). If <= 0, all iterations from start_iteration are used (no limits).
• pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.
• pred_contrib (bool, optional (default=False)) – Whether to predict fea-
ture contributions.
Note: If you want to get more explanations for your model’s predictions using SHAP
values, like SHAP interaction values, you can install the shap package (https://github.
com/slundberg/shap). Note that unlike the shap package, with pred_contrib we
return a matrix with an extra column, where the last column is the expected value.
Returns
• predicted_result (array-like of shape = [n_samples] or shape = [n_samples,
n_classes]) – The predicted values.
• X_leaves (array-like of shape = [n_samples, n_trees] or shape = [n_samples, n_trees
* n_classes]) – If pred_leaf=True, the predicted leaf of every tree for each sample.
• X_SHAP_values (array-like of shape = [n_samples, n_features + 1] or shape =
[n_samples, (n_features + 1) * n_classes] or list with n_classes length of such ob-
jects) – If pred_contrib=True, the feature contributions for each sample.
set_fit_request(*, callbacks='$UNCHANGED$', categorical_feature='$UNCHANGED$',
eval_class_weight='$UNCHANGED$', eval_group='$UNCHANGED$',
eval_init_score='$UNCHANGED$', eval_metric='$UNCHANGED$',
eval_names='$UNCHANGED$', eval_sample_weight='$UNCHANGED$',
eval_set='$UNCHANGED$', feature_name='$UNCHANGED$',
group='$UNCHANGED$', init_model='$UNCHANGED$',
init_score='$UNCHANGED$', sample_weight='$UNCHANGED$')
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.
set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
• True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not
provided.
• False: metadata is not requested and the meta-estimator will not pass it to fit.
• None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
• str: metadata should be passed to the meta-estimator with this given alias instead of the original
name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows
you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• callbacks (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in
fit.
• categorical_feature (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
categorical_feature parameter in fit.
• eval_class_weight (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
eval_class_weight parameter in fit.
set_params(**params)
Set the parameters of this estimator.
Parameters
**params – Parameter names with their new values.
Returns
self – Returns self.
Return type
object
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• num_iteration (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for num_iteration
parameter in predict.
• pred_contrib (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_contrib parameter
in predict.
• pred_leaf (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_leaf parameter in
predict.
• raw_score (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for raw_score parameter in
predict.
• start_iteration (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
start_iteration parameter in predict.
• validate_features (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
validate_features parameter in predict.
Returns
self – The updated object.
Return type
object
9.3.2 lightgbm.LGBMClassifier
Note: A custom objective function can be provided for the objective parameter. In this case, it should
have the signature objective(y_true, y_pred) -> grad, hess, objective(y_true, y_pred,
weight) -> grad, hess or objective(y_true, y_pred, weight, group) -> grad, hess:
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. Predicted values are returned before
any transformation, e.g. they are raw margin instead of probability of positive class for
binary task.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
grad
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the first order derivative (gradient) of the
loss with respect to the elements of y_pred for each sample point.
hess
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the second order derivative (Hessian) of the
loss with respect to the elements of y_pred for each sample point.
For multi-class task, y_pred is a numpy 2-D array of shape = [n_samples, n_classes], and grad and hess
should be returned in the same format.
Methods
Attributes
property best_iteration_
The best iteration of fitted model if early_stopping() callback has been specified.
Type
int
property best_score_
The best score of fitted model.
Type
dict
property booster_
The underlying Booster of this model.
Type
Booster
property classes_
The class label array.
Type
array of shape = [n_classes]
property evals_result_
The evaluation results if validation sets have been specified.
Type
dict
property feature_importances_
The feature importances (the higher, the more important).
Note: importance_type attribute is passed to the function to configure the type of importance values
to be extracted.
Type
array of shape = [n_features]
property feature_name_
The names of features.
Type
list of shape = [n_features]
fit(X, y, sample_weight=None, init_score=None, eval_set=None, eval_names=None,
eval_sample_weight=None, eval_class_weight=None, eval_init_score=None, eval_metric=None,
feature_name='auto', categorical_feature='auto', callbacks=None, init_model=None)
Build a gradient boosting model from the training set (X, y).
Parameters
• X (numpy array, pandas DataFrame, H2O DataTable's Frame , scipy.
sparse, list of lists of int or float of shape = [n_samples,
n_features]) – Input feature matrix.
• y (numpy array, pandas DataFrame, pandas Series, list of int or
float of shape = [n_samples]) – The target values (class labels in classifica-
tion, real numbers in regression).
• sample_weight (numpy array, pandas Series, list of int or float
of shape = [n_samples] or None, optional (default=None)) – Weights
of training data. Weights should be non-negative.
• init_score (numpy array, pandas DataFrame, pandas Series,
list of int or float of shape = [n_samples] or shape =
[n_samples * n_classes] (for multi-class task) or shape =
[n_samples, n_classes] (for multi-class task) or None, optional
(default=None)) – Init score of training data.
• eval_set (list or None, optional (default=None)) – A list of (X, y) tuple
pairs to use as validation sets.
• eval_names (list of str, or None, optional (default=None)) – Names
of eval_set.
• eval_sample_weight (list of array (same types as sample_weight supports), or
None, optional (default=None)) – Weights of eval data. Weights should be non-
negative.
• eval_class_weight (list or None, optional (default=None)) – Class
weights of eval data.
• eval_init_score (list of array (same types as init_score supports), or None, op-
tional (default=None)) – Init score of eval data.
• eval_metric (str, callable, list or None, optional
(default=None)) – If str, it should be a built-in evaluation metric to use. If
callable, it should be a custom evaluation metric, see note below for more details. If
list, it can be a list of built-in metrics, a list of custom evaluation metrics, or a mix
of both. In either case, the metric from the model parameters will be evaluated and
used as well. Default: ‘l2’ for LGBMRegressor, ‘logloss’ for LGBMClassifier, ‘ndcg’
for LGBMRanker.
• feature_name (list of str, or 'auto', optional (default='auto')) –
Feature names. If ‘auto’ and data is pandas DataFrame, data columns names are used.
Note: A custom eval function should be a callable with one of the following signatures: func(y_true, y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group), and should return (eval_name, eval_result, is_higher_better) or a list of (eval_name, eval_result, is_higher_better):
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. In case of custom objective,
predicted values are returned before any transformation, e.g. they are raw margin instead
of probability of positive class for binary task in this case.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
eval_name
[str] The name of evaluation function (without whitespace).
eval_result
[float] The eval result.
is_higher_better
[bool] Is eval result higher better, e.g. AUC is is_higher_better.
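A small sketch of a custom eval function matching the signature above (illustrative; it assumes the built-in binary objective, in which case y_pred holds probabilities of the positive class):

import numpy as np

def error_rate(y_true, y_pred):
    # With the built-in binary objective, y_pred are probabilities; threshold at 0.5.
    labels = (y_pred >= 0.5).astype(int)
    return "error_rate", float(np.mean(labels != y_true)), False   # lower is better

# Hypothetical usage together with a validation set:
# clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], eval_metric=error_rate)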
get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
Returns
routing – A MetadataRequest encapsulating routing information.
Return type
MetadataRequest
get_params(deep=True)
Get parameters for this estimator.
Parameters
deep (bool, optional (default=True)) – If True, will return the parameters for this
estimator and contained subobjects that are estimators.
Returns
params – Parameter names mapped to their values.
Return type
dict
property n_classes_
The number of classes.
Type
int
property n_estimators_
True number of boosting iterations performed.
This might be less than parameter n_estimators if early stopping was enabled or if boosting stopped
early due to limits on complexity like min_gain_to_split.
New in version 4.0.0.
Type
int
property n_features_
The number of features of fitted model.
Type
int
property n_features_in_
The number of features of fitted model.
Type
int
property n_iter_
True number of boosting iterations performed.
This might be less than parameter n_estimators if early stopping was enabled or if boosting stopped
early due to limits on complexity like min_gain_to_split.
New in version 4.0.0.
Type
int
property objective_
The concrete objective used while fitting this model.
Type
str or callable
predict(X, raw_score=False, start_iteration=0, num_iteration=None, pred_leaf=False, pred_contrib=False,
validate_features=False, **kwargs)
Return the predicted value for each sample.
Parameters
• X (numpy array, pandas DataFrame, H2O DataTable's Frame , scipy.
sparse, list of lists of int or float of shape = [n_samples,
n_features]) – Input features matrix.
• raw_score (bool, optional (default=False)) – Whether to predict raw
scores.
• start_iteration (int, optional (default=0)) – Start index of the iteration
to predict. If <= 0, starts from the first iteration.
• num_iteration (int or None, optional (default=None)) – Total number of
iterations used in the prediction. If None, if the best iteration exists and start_iteration
<= 0, the best iteration is used; otherwise, all iterations from start_iteration are
used (no limits). If <= 0, all iterations from start_iteration are used (no limits).
• pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.
• pred_contrib (bool, optional (default=False)) – Whether to predict fea-
ture contributions.
Note: If you want to get more explanations for your model’s predictions using SHAP
values, like SHAP interaction values, you can install the shap package (https://github.
com/slundberg/shap). Note that unlike the shap package, with pred_contrib we
return a matrix with an extra column, where the last column is the expected value.
Return type
float
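An illustrative sketch of pred_contrib on a binary classifier (synthetic data; shapes follow the description above): the result has one column per feature plus a final expected-value column, and each row sums to that sample's raw margin.

import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

clf = LGBMClassifier(n_estimators=20).fit(X, y)

contrib = clf.predict(X, pred_contrib=True)
print(contrib.shape)                  # (200, 5): 4 feature columns + 1 expected-value column
raw = clf.predict(X, raw_score=True)
print(np.allclose(contrib.sum(axis=1), raw))   # per-row contributions sum to the raw margin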
set_fit_request(*, callbacks='$UNCHANGED$', categorical_feature='$UNCHANGED$',
eval_class_weight='$UNCHANGED$', eval_init_score='$UNCHANGED$',
eval_metric='$UNCHANGED$', eval_names='$UNCHANGED$',
eval_sample_weight='$UNCHANGED$', eval_set='$UNCHANGED$',
feature_name='$UNCHANGED$', init_model='$UNCHANGED$',
init_score='$UNCHANGED$', sample_weight='$UNCHANGED$')
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.
set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
• True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not
provided.
• False: metadata is not requested and the meta-estimator will not pass it to fit.
• None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
• str: metadata should be passed to the meta-estimator with this given alias instead of the original
name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows
you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• callbacks (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in
fit.
• categorical_feature (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
categorical_feature parameter in fit.
• eval_class_weight (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
eval_class_weight parameter in fit.
• eval_init_score (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
eval_init_score parameter in fit.
• eval_metric (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter
in fit.
• eval_names (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_names parameter in
fit.
set_params(**params)
Set the parameters of this estimator.
Parameters
**params – Parameter names with their new values.
Returns
self – Returns self.
Return type
object
set_predict_proba_request(*, num_iteration='$UNCHANGED$', pred_contrib='$UNCHANGED$',
pred_leaf='$UNCHANGED$', raw_score='$UNCHANGED$',
start_iteration='$UNCHANGED$', validate_features='$UNCHANGED$')
Request metadata passed to the predict_proba method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.
set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
• True: metadata is requested, and passed to predict_proba if provided. The request is ignored if
metadata is not provided.
• False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
• None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
• str: metadata should be passed to the meta-estimator with this given alias instead of the original
name.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• num_iteration (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for num_iteration
parameter in predict_proba.
• pred_contrib (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_contrib parameter
in predict_proba.
• pred_leaf (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_leaf parameter in
predict_proba.
• raw_score (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for raw_score parameter in
predict_proba.
• start_iteration (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
start_iteration parameter in predict_proba.
• validate_features (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
validate_features parameter in predict_proba.
Returns
self – The updated object.
Return type
object
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• num_iteration (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for num_iteration
parameter in predict.
• pred_contrib (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_contrib parameter
in predict.
• pred_leaf (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_leaf parameter in
predict.
• raw_score (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for raw_score parameter in
predict.
• start_iteration (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
start_iteration parameter in predict.
• validate_features (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
validate_features parameter in predict.
Returns
self – The updated object.
Return type
object
set_score_request(*, sample_weight='$UNCHANGED$')
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.
set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
• True: metadata is requested, and passed to score if provided. The request is ignored if metadata is
not provided.
• False: metadata is not requested and the meta-estimator will not pass it to score.
• None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
• str: metadata should be passed to the meta-estimator with this given alias instead of the original
name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows
you to change the request for some parameters and not others.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
sample_weight (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in
score.
Returns
self – The updated object.
Return type
object
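The set_fit_request, set_predict_proba_request and set_score_request methods above only take effect under scikit-learn's metadata routing. A hedged sketch of how they might be used (assuming scikit-learn 1.4 or later, where pipelines participate in routing; the data and names below are illustrative):

import numpy as np
import sklearn
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from lightgbm import LGBMClassifier

sklearn.set_config(enable_metadata_routing=True)

rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = rng.integers(0, 2, size=200)
w = rng.random(200)

# Route sample_weight to LGBMClassifier.fit, and explicitly not to the scaler.
scaler = StandardScaler().set_fit_request(sample_weight=False)
clf = LGBMClassifier(n_estimators=20).set_fit_request(sample_weight=True)

pipe = make_pipeline(scaler, clf)
pipe.fit(X, y, sample_weight=w)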
9.3.3 lightgbm.LGBMRegressor
(the default) corresponds to using the number of physical cores in the system (its cor-
rect detection requires either the joblib or the psutil util libraries to be installed).
Changed in version 4.0.0.
• importance_type (str, optional (default='split')) – The type of feature
importance to be filled into feature_importances_. If ‘split’, result contains num-
bers of times the feature is used in a model. If ‘gain’, result contains total gains of
splits which use the feature.
• **kwargs – Other parameters for the model. Check http://lightgbm.readthedocs.io/
en/latest/Parameters.html for more parameters.
Note: A custom objective function can be provided for the objective parameter. In this case, it should
have the signature objective(y_true, y_pred) -> grad, hess, objective(y_true, y_pred,
weight) -> grad, hess or objective(y_true, y_pred, weight, group) -> grad, hess:
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. Predicted values are returned before
any transformation, e.g. they are raw margin instead of probability of positive class for
binary task.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
grad
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the first order derivative (gradient) of the
loss with respect to the elements of y_pred for each sample point.
hess
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the second order derivative (Hessian) of the
loss with respect to the elements of y_pred for each sample point.
For multi-class task, y_pred is a numpy 2-D array of shape = [n_samples, n_classes], and grad and hess
should be returned in the same format.
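As a hedged sketch for the regression case (not from the original text), the built-in L2 loss expressed as a custom objective returns the residual as the gradient and a constant Hessian:

import numpy as np
from lightgbm import LGBMRegressor

def l2_objective(y_true, y_pred):
    # Gradient and Hessian of 0.5 * (y_pred - y_true) ** 2 with respect to y_pred.
    grad = y_pred - y_true
    hess = np.ones_like(y_pred)
    return grad, hess

reg = LGBMRegressor(objective=l2_objective, n_estimators=50)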
Methods
Attributes
property best_iteration_
The best iteration of fitted model if early_stopping() callback has been specified.
Type
int
property best_score_
The best score of fitted model.
Type
dict
property booster_
The underlying Booster of this model.
Type
Booster
property evals_result_
The evaluation results if validation sets have been specified.
Type
dict
property feature_importances_
The feature importances (the higher, the more important).
Note: importance_type attribute is passed to the function to configure the type of importance values
to be extracted.
Type
array of shape = [n_features]
property feature_name_
The names of features.
Type
list of shape = [n_features]
fit(X, y, sample_weight=None, init_score=None, eval_set=None, eval_names=None,
eval_sample_weight=None, eval_init_score=None, eval_metric=None, feature_name='auto',
categorical_feature='auto', callbacks=None, init_model=None)
Build a gradient boosting model from the training set (X, y).
Parameters
• X (numpy array, pandas DataFrame, H2O DataTable's Frame , scipy.
sparse, list of lists of int or float of shape = [n_samples,
n_features]) – Input feature matrix.
• y (numpy array, pandas DataFrame, pandas Series, list of int or
float of shape = [n_samples]) – The target values (class labels in classifica-
tion, real numbers in regression).
• sample_weight (numpy array, pandas Series, list of int or float
of shape = [n_samples] or None, optional (default=None)) – Weights
of training data. Weights should be non-negative.
• init_score (numpy array, pandas DataFrame, pandas Series,
list of int or float of shape = [n_samples] or shape =
[n_samples * n_classes] (for multi-class task) or shape =
[n_samples, n_classes] (for multi-class task) or None, optional
(default=None)) – Init score of training data.
• eval_set (list or None, optional (default=None)) – A list of (X, y) tuple
pairs to use as validation sets.
• eval_names (list of str, or None, optional (default=None)) – Names
of eval_set.
• eval_sample_weight (list of array (same types as sample_weight supports), or
None, optional (default=None)) – Weights of eval data. Weights should be non-
negative.
• eval_init_score (list of array (same types as init_score supports), or None, op-
tional (default=None)) – Init score of eval data.
• eval_metric (str, callable, list or None, optional
(default=None)) – If str, it should be a built-in evaluation metric to use. If
callable, it should be a custom evaluation metric, see note below for more details. If
list, it can be a list of built-in metrics, a list of custom evaluation metrics, or a mix
of both. In either case, the metric from the model parameters will be evaluated and
used as well. Default: ‘l2’ for LGBMRegressor, ‘logloss’ for LGBMClassifier, ‘ndcg’
for LGBMRanker.
• feature_name (list of str, or 'auto', optional (default='auto')) –
Feature names. If ‘auto’ and data is pandas DataFrame, data columns names are used.
• categorical_feature (list of str or int, or 'auto', optional
(default='auto')) – Categorical features. If list of int, interpreted as indices.
If list of str, interpreted as feature names (need to specify feature_name as well). If
‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used.
All values in categorical features will be cast to int32 and thus should be less than int32
max value (2147483647). Large values could be memory consuming. Consider using
consecutive integers starting from zero. All negative values in categorical features
will be treated as missing values. The output cannot be monotonically constrained
with respect to a categorical feature. Floating point numbers in categorical features
will be rounded towards 0.
• callbacks (list of callable, or None, optional (default=None)) –
List of callback functions that are applied at each iteration. See Callbacks in Python
API for more information.
• init_model (str, pathlib.Path , Booster, LGBMModel or None,
optional (default=None)) – Filename of LightGBM model, Booster instance or
LGBMModel instance used for continue training.
Returns
self – Returns self.
Return type
LGBMRegressor
Note: A custom eval function should be a callable with one of the following signatures: func(y_true, y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group), and should return (eval_name, eval_result, is_higher_better) or a list of (eval_name, eval_result, is_higher_better):
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. In case of custom objective,
predicted values are returned before any transformation, e.g. they are raw margin instead
of probability of positive class for binary task in this case.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
eval_name
[str] The name of evaluation function (without whitespace).
eval_result
[float] The eval result.
is_higher_better
[bool] Is eval result higher better, e.g. AUC is is_higher_better.
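As noted above, a single callable may also return a list of (eval_name, eval_result, is_higher_better) tuples. A hedged sketch for regression (names and usage are illustrative):

import numpy as np

def regression_metrics(y_true, y_pred):
    mae = float(np.mean(np.abs(y_true - y_pred)))
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    # Both metrics are "lower is better".
    return [("mae", mae, False), ("rmse", rmse, False)]

# Hypothetical usage with a validation set:
# reg.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], eval_metric=regression_metrics)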
get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
Returns
routing – A MetadataRequest encapsulating routing information.
Return type
MetadataRequest
get_params(deep=True)
Get parameters for this estimator.
Parameters
deep (bool, optional (default=True)) – If True, will return the parameters for this
estimator and contained subobjects that are estimators.
Returns
params – Parameter names mapped to their values.
Return type
dict
property n_estimators_
True number of boosting iterations performed.
This might be less than parameter n_estimators if early stopping was enabled or if boosting stopped
early due to limits on complexity like min_gain_to_split.
New in version 4.0.0.
Type
int
property n_features_
The number of features of fitted model.
Type
int
property n_features_in_
The number of features of fitted model.
Type
int
property n_iter_
True number of boosting iterations performed.
This might be less than parameter n_estimators if early stopping was enabled or if boosting stopped
early due to limits on complexity like min_gain_to_split.
Note: If you want to get more explanations for your model’s predictions using SHAP
values, like SHAP interaction values, you can install the shap package (https://github.
com/slundberg/shap). Note that unlike the shap package, with pred_contrib we
return a matrix with an extra column, where the last column is the expected value.
score(X, y, sample_weight=None)
Return the coefficient of determination of the prediction.
The coefficient of determination R² is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R² score of 0.0.
Parameters
• X (array-like of shape (n_samples, n_features)) – Test samples. For
some estimators this may be a precomputed kernel matrix or a list of generic objects in-
stead with shape (n_samples, n_samples_fitted), where n_samples_fitted
is the number of samples used in the fitting for the estimator.
• y (array-like of shape (n_samples,) or (n_samples, n_outputs)) –
True values for X.
• sample_weight (array-like of shape (n_samples,), default=None) –
Sample weights.
Returns
score – R² of self.predict(X) w.r.t. y.
Return type
float
Notes
The R² score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
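A brief sketch of score() on synthetic data (illustrative values only):

import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

reg = LGBMRegressor(n_estimators=100).fit(X, y)
# Equivalent to sklearn.metrics.r2_score(y, reg.predict(X)); close to 1.0 for this easy target.
print(reg.score(X, y))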
set_fit_request(*, callbacks='$UNCHANGED$', categorical_feature='$UNCHANGED$',
eval_init_score='$UNCHANGED$', eval_metric='$UNCHANGED$',
eval_names='$UNCHANGED$', eval_sample_weight='$UNCHANGED$',
eval_set='$UNCHANGED$', feature_name='$UNCHANGED$',
init_model='$UNCHANGED$', init_score='$UNCHANGED$',
sample_weight='$UNCHANGED$')
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.
set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
• True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not
provided.
• False: metadata is not requested and the meta-estimator will not pass it to fit.
• None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
• str: metadata should be passed to the meta-estimator with this given alias instead of the original
name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows
you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• callbacks (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in
fit.
• categorical_feature (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
categorical_feature parameter in fit.
• eval_init_score (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
eval_init_score parameter in fit.
• eval_metric (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter
in fit.
• eval_names (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_names parameter in
fit.
• eval_sample_weight (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
eval_sample_weight parameter in fit.
• eval_set (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter
in fit.
• feature_name (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for feature_name parameter
in fit.
• init_model (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for init_model parameter in
fit.
• init_score (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for init_score parameter in
fit.
• sample_weight (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for sample_weight
parameter in fit.
Returns
self – The updated object.
Return type
object
set_params(**params)
Set the parameters of this estimator.
Parameters
**params – Parameter names with their new values.
Returns
self – Returns self.
Return type
object
set_predict_request(*, num_iteration='$UNCHANGED$', pred_contrib='$UNCHANGED$',
pred_leaf='$UNCHANGED$', raw_score='$UNCHANGED$',
start_iteration='$UNCHANGED$', validate_features='$UNCHANGED$')
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.
set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
• True: metadata is requested, and passed to predict if provided. The request is ignored if metadata
is not provided.
• False: metadata is not requested and the meta-estimator will not pass it to predict.
• None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
• str: metadata should be passed to the meta-estimator with this given alias instead of the original
name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows
you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• num_iteration (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for num_iteration
parameter in predict.
• pred_contrib (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_contrib parameter
in predict.
• pred_leaf (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_leaf parameter in
predict.
• raw_score (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for raw_score parameter in
predict.
• start_iteration (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
start_iteration parameter in predict.
set_score_request(*, sample_weight='$UNCHANGED$')
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.
set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
• True: metadata is requested, and passed to score if provided. The request is ignored if metadata is
not provided.
• False: metadata is not requested and the meta-estimator will not pass it to score.
• None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
• str: metadata should be passed to the meta-estimator with this given alias instead of the original
name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows
you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
sample_weight (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in
score.
Returns
self – The updated object.
Return type
object
9.3.4 lightgbm.LGBMRanker
LightGBM ranker.
Warning: scikit-learn doesn’t support ranking applications yet, so this class is not fully compatible with the sklearn ecosystem. Please use it mainly for training and applying ranking models in the usual sklearn-like way.
Note: A custom objective function can be provided for the objective parameter. In this case, it should
have the signature objective(y_true, y_pred) -> grad, hess, objective(y_true, y_pred,
weight) -> grad, hess or objective(y_true, y_pred, weight, group) -> grad, hess:
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. Predicted values are returned before
any transformation, e.g. they are raw margin instead of probability of positive class for
binary task.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
grad
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the first order derivative (gradient) of the
loss with respect to the elements of y_pred for each sample point.
hess
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the second order derivative (Hessian) of the
loss with respect to the elements of y_pred for each sample point.
For multi-class task, y_pred is a numpy 2-D array of shape = [n_samples, n_classes], and grad and hess
should be returned in the same format.
Methods
Attributes
property best_iteration_
The best iteration of fitted model if early_stopping() callback has been specified.
Type
int
property best_score_
The best score of fitted model.
Type
dict
property booster_
The underlying Booster of this model.
Type
Booster
property evals_result_
The evaluation results if validation sets have been specified.
Type
dict
property feature_importances_
The feature importances (the higher, the more important).
Note: importance_type attribute is passed to the function to configure the type of importance values
to be extracted.
Type
array of shape = [n_features]
property feature_name_
The names of features.
Type
list of shape = [n_features]
fit(X, y, sample_weight=None, init_score=None, group=None, eval_set=None, eval_names=None,
eval_sample_weight=None, eval_init_score=None, eval_group=None, eval_metric=None, eval_at=(1, 2,
3, 4, 5), feature_name='auto', categorical_feature='auto', callbacks=None, init_model=None)
Build a gradient boosting model from the training set (X, y).
Parameters
• X (numpy array, pandas DataFrame, H2O DataTable's Frame , scipy.
sparse, list of lists of int or float of shape = [n_samples,
n_features]) – Input feature matrix.
• y (numpy array, pandas DataFrame, pandas Series, list of int or
float of shape = [n_samples]) – The target values (class labels in classifica-
tion, real numbers in regression).
• sample_weight (numpy array, pandas Series, list of int or float
of shape = [n_samples] or None, optional (default=None)) – Weights
of training data. Weights should be non-negative.
• init_score (numpy array, pandas DataFrame, pandas Series,
list of int or float of shape = [n_samples] or shape =
[n_samples * n_classes] (for multi-class task) or shape =
[n_samples, n_classes] (for multi-class task) or None, optional
(default=None)) – Init score of training data.
• group (numpy array, pandas Series, list of int or float, or None,
optional (default=None)) – Group/query data. Only used in the learning-to-rank
task. sum(group) = n_samples. For example, if you have a 100-document dataset with
group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where
the first 10 records are in the first group, records 11-30 are in the second group, records
31-70 are in the third group, etc.
• eval_set (list or None, optional (default=None)) – A list of (X, y) tuple
pairs to use as validation sets.
• eval_names (list of str, or None, optional (default=None)) – Names
of eval_set.
• eval_sample_weight (list of array (same types as sample_weight supports), or
None, optional (default=None)) – Weights of eval data. Weights should be non-
negative.
• eval_init_score (list of array (same types as init_score supports), or None, op-
tional (default=None)) – Init score of eval data.
• eval_group (list of array (same types as group supports), or None, optional (de-
fault=None)) – Group data of eval data.
• eval_metric (str, callable, list or None, optional
(default=None)) – If str, it should be a built-in evaluation metric to use. If
callable, it should be a custom evaluation metric, see note below for more details. If
list, it can be a list of built-in metrics, a list of custom evaluation metrics, or a mix
of both. In either case, the metric from the model parameters will be evaluated and
used as well. Default: ‘l2’ for LGBMRegressor, ‘logloss’ for LGBMClassifier, ‘ndcg’
for LGBMRanker.
• eval_at (list or tuple of int, optional (default=(1, 2, 3, 4,
5))) – The evaluation positions of the specified metric.
Note: A custom eval function should be a callable with one of the following signatures: func(y_true, y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group), and should return (eval_name, eval_result, is_higher_better) or a list of (eval_name, eval_result, is_higher_better):
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. In case of custom objective,
predicted values are returned before any transformation, e.g. they are raw margin instead
of probability of positive class for binary task in this case.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
eval_name
[str] The name of evaluation function (without whitespace).
eval_result
[float] The eval result.
is_higher_better
[bool] Is eval result higher better, e.g. AUC is is_higher_better.
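For illustration only (synthetic data, assumed parameter values), a ranker fit with query groups and NDCG evaluation positions might look like this:

import numpy as np
from lightgbm import LGBMRanker

rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.integers(0, 4, size=100)              # graded relevance labels per document
group = [10, 20, 40, 10, 10, 10]              # 6 queries, sum(group) == 100 == n_samples

ranker = LGBMRanker(n_estimators=30, min_child_samples=5)
ranker.fit(
    X, y,
    group=group,
    eval_set=[(X, y)],                        # reusing the training data only for illustration
    eval_group=[group],
    eval_at=(1, 3, 5),                        # report NDCG@1, NDCG@3, NDCG@5
)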
get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
Returns
routing – A MetadataRequest encapsulating routing information.
Return type
MetadataRequest
get_params(deep=True)
Get parameters for this estimator.
Parameters
deep (bool, optional (default=True)) – If True, will return the parameters for this
estimator and contained subobjects that are estimators.
Returns
params – Parameter names mapped to their values.
Return type
dict
property n_estimators_
True number of boosting iterations performed.
This might be less than parameter n_estimators if early stopping was enabled or if boosting stopped
early due to limits on complexity like min_gain_to_split.
New in version 4.0.0.
Type
int
property n_features_
The number of features of fitted model.
Type
int
property n_features_in_
The number of features of fitted model.
Type
int
property n_iter_
True number of boosting iterations performed.
This might be less than parameter n_estimators if early stopping was enabled or if boosting stopped
early due to limits on complexity like min_gain_to_split.
New in version 4.0.0.
Type
int
property objective_
The concrete objective used while fitting this model.
Type
str or callable
predict(X, raw_score=False, start_iteration=0, num_iteration=None, pred_leaf=False, pred_contrib=False,
validate_features=False, **kwargs)
Return the predicted value for each sample.
Parameters
• X (numpy array, pandas DataFrame, H2O DataTable's Frame , scipy.
sparse, list of lists of int or float of shape = [n_samples,
n_features]) – Input features matrix.
• raw_score (bool, optional (default=False)) – Whether to predict raw
scores.
• start_iteration (int, optional (default=0)) – Start index of the iteration
to predict. If <= 0, starts from the first iteration.
• num_iteration (int or None, optional (default=None)) – Total number of
iterations used in the prediction. If None, if the best iteration exists and start_iteration
<= 0, the best iteration is used; otherwise, all iterations from start_iteration are
used (no limits). If <= 0, all iterations from start_iteration are used (no limits).
• pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.
• pred_contrib (bool, optional (default=False)) – Whether to predict fea-
ture contributions.
Note: If you want to get more explanations for your model’s predictions using SHAP
values, like SHAP interaction values, you can install the shap package (https://github.
com/slundberg/shap). Note that unlike the shap package, with pred_contrib we
return a matrix with an extra column, where the last column is the expected value.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• callbacks (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in
fit.
• categorical_feature (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
categorical_feature parameter in fit.
• eval_at (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_at parameter
in fit.
• eval_group (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_group parameter in
fit.
• eval_init_score (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
eval_init_score parameter in fit.
• eval_metric (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter
in fit.
set_params(**params)
Set the parameters of this estimator.
Parameters
**params – Parameter names with their new values.
Returns
self – Returns self.
Return type
object
set_predict_request(*, num_iteration='$UNCHANGED$', pred_contrib='$UNCHANGED$',
pred_leaf='$UNCHANGED$', raw_score='$UNCHANGED$',
start_iteration='$UNCHANGED$', validate_features='$UNCHANGED$')
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.
set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
• True: metadata is requested, and passed to predict if provided. The request is ignored if metadata
is not provided.
• False: metadata is not requested and the meta-estimator will not pass it to predict.
• None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
• str: metadata should be passed to the meta-estimator with this given alias instead of the original
name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows
you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• num_iteration (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for num_iteration
parameter in predict.
• pred_contrib (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_contrib parameter
in predict.
• pred_leaf (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_leaf parameter in
predict.
• raw_score (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for raw_score parameter in
predict.
• start_iteration (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
start_iteration parameter in predict.
• validate_features (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
validate_features parameter in predict.
Returns
self – The updated object.
Return type
object
9.4.1 lightgbm.DaskLGBMClassifier
Note: A custom objective function can be provided for the objective parameter. In this case, it should
have the signature objective(y_true, y_pred) -> grad, hess, objective(y_true, y_pred,
weight) -> grad, hess or objective(y_true, y_pred, weight, group) -> grad, hess:
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. Predicted values are returned before
any transformation, e.g. they are raw margin instead of probability of positive class for
binary task.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
grad
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the first order derivative (gradient) of the
loss with respect to the elements of y_pred for each sample point.
hess
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the second order derivative (Hessian) of the
loss with respect to the elements of y_pred for each sample point.
For multi-class task, y_pred is a numpy 2-D array of shape = [n_samples, n_classes], and grad and hess
should be returned in the same format.
Methods
Attributes
property best_iteration_
The best iteration of fitted model if early_stopping() callback has been specified.
Type
int
property best_score_
The best score of fitted model.
Type
dict
property booster_
The underlying Booster of this model.
Type
Booster
property classes_
The class label array.
Type
array of shape = [n_classes]
property client_
Dask client.
This property can be passed in the constructor or updated with model.set_params(client=client).
Type
dask.distributed.Client
property evals_result_
The evaluation results if validation sets have been specified.
Type
dict
property feature_importances_
The feature importances (the higher, the more important).
Note: importance_type attribute is passed to the function to configure the type of importance values
to be extracted.
Type
array of shape = [n_features]
property feature_name_
The names of features.
Type
list of shape = [n_features]
fit(X, y, sample_weight=None, init_score=None, eval_set=None, eval_names=None,
eval_sample_weight=None, eval_class_weight=None, eval_init_score=None, eval_metric=None,
**kwargs)
Build a gradient boosting model from the training set (X, y).
Parameters
• X (Dask Array or Dask DataFrame of shape = [n_samples,
n_features]) – Input feature matrix.
• y (Dask Array, Dask DataFrame or Dask Series of shape =
[n_samples]) – The target values (class labels in classification, real numbers
in regression).
• sample_weight (Dask Array or Dask Series of shape = [n_samples]
or None, optional (default=None)) – Weights of training data. Weights
should be non-negative.
Note: A custom eval function should be a callable with one of the following signatures: func(y_true, y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group), and should return (eval_name, eval_result, is_higher_better) or a list of (eval_name, eval_result, is_higher_better):
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. In case of custom objective,
predicted values are returned before any transformation, e.g. they are raw margin instead
of probability of positive class for binary task in this case.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
eval_name
[str] The name of evaluation function (without whitespace).
eval_result
[float] The eval result.
is_higher_better
[bool] Is eval result higher better, e.g. AUC is is_higher_better.
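As an illustrative sketch (not from the original documentation), training DaskLGBMClassifier on Dask collections might look as follows; it assumes dask and distributed are installed, and the LocalCluster only stands in for a real deployment:

import dask.array as da
from dask.distributed import Client, LocalCluster
from lightgbm import DaskLGBMClassifier

# A two-worker local cluster stands in for a real Dask deployment.
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

# Synthetic partitioned data: four chunks of 250 rows each.
X = da.random.random((1000, 10), chunks=(250, 10))
y = da.random.randint(0, 2, size=(1000,), chunks=(250,))

dask_clf = DaskLGBMClassifier(n_estimators=20, client=client)
dask_clf.fit(X, y)

# predict returns a Dask Array; compute() materializes it on the client.
preds = dask_clf.predict(X).compute()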
get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
Returns
routing – A MetadataRequest encapsulating routing information.
Return type
MetadataRequest
get_params(deep=True)
Get parameters for this estimator.
Parameters
deep (bool, optional (default=True)) – If True, will return the parameters for this
estimator and contained subobjects that are estimators.
Returns
params – Parameter names mapped to their values.
Return type
dict
property n_classes_
The number of classes.
Type
int
property n_estimators_
True number of boosting iterations performed.
This might be less than parameter n_estimators if early stopping was enabled or if boosting stopped
early due to limits on complexity like min_gain_to_split.
Note: If you want to get more explanations for your model’s predictions using SHAP
values, like SHAP interaction values, you can install the shap package (https://github.
com/slundberg/shap). Note that unlike the shap package, with pred_contrib we
return a matrix with an extra column, where the last column is the expected value.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• eval_class_weight (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
eval_class_weight parameter in fit.
• eval_init_score (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
eval_init_score parameter in fit.
• eval_metric (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter
in fit.
• eval_names (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_names parameter in
fit.
• eval_sample_weight (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
eval_sample_weight parameter in fit.
• eval_set (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter
in fit.
• init_score (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for init_score parameter in
fit.
• sample_weight (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for sample_weight
parameter in fit.
Returns
self – The updated object.
Return type
object
set_params(**params)
Set the parameters of this estimator.
Parameters
**params – Parameter names with their new values.
Returns
self – Returns self.
Return type
object
set_predict_proba_request(*, num_iteration='$UNCHANGED$', pred_contrib='$UNCHANGED$',
pred_leaf='$UNCHANGED$', raw_score='$UNCHANGED$',
start_iteration='$UNCHANGED$', validate_features='$UNCHANGED$')
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• num_iteration (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for num_iteration
parameter in predict_proba.
• pred_contrib (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_contrib parameter
in predict_proba.
• pred_leaf (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_leaf parameter in
predict_proba.
• raw_score (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for raw_score parameter in
predict_proba.
• start_iteration (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
start_iteration parameter in predict_proba.
• validate_features (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
validate_features parameter in predict_proba.
Returns
self – The updated object.
Return type
object
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• num_iteration (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for num_iteration
parameter in predict.
• pred_contrib (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_contrib parameter
in predict.
• pred_leaf (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_leaf parameter in
predict.
• raw_score (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for raw_score parameter in
predict.
• start_iteration (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
start_iteration parameter in predict.
• validate_features (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
validate_features parameter in predict.
Returns
self – The updated object.
Return type
object
set_score_request(*, sample_weight='$UNCHANGED$')
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.
set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
• True: metadata is requested, and passed to score if provided. The request is ignored if metadata is
not provided.
• False: metadata is not requested and the meta-estimator will not pass it to score.
• None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
• str: metadata should be passed to the meta-estimator with this given alias instead of the original
name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows
you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
sample_weight (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in
score.
Returns
self – The updated object.
Return type
object
to_local()
Create regular version of lightgbm.LGBMClassifier from the distributed version.
Returns
model – Local underlying model.
Return type
lightgbm.LGBMClassifier
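A minimal sketch of how to_local() might be used, assuming dask_clf is a fitted DaskLGBMClassifier (for example, from the earlier sketch); joblib and the file name are illustrative choices, not requirements:

import joblib

# Convert the distributed estimator to a plain LGBMClassifier so the model can
# be pickled and later used for prediction without any Dask cluster running.
local_clf = dask_clf.to_local()
joblib.dump(local_clf, "lgbm_classifier.joblib")

# Later / elsewhere: load and predict on in-memory data.
restored = joblib.load("lgbm_classifier.joblib")
# restored.predict(X_numpy)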
9.4.2 lightgbm.DaskLGBMRegressor
Note: A custom objective function can be provided for the objective parameter. In this case, it should
have the signature objective(y_true, y_pred) -> grad, hess, objective(y_true, y_pred,
weight) -> grad, hess or objective(y_true, y_pred, weight, group) -> grad, hess:
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. Predicted values are returned before
any transformation, e.g. they are raw margin instead of probability of positive class for
binary task.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
grad
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the first order derivative (gradient) of the
loss with respect to the elements of y_pred for each sample point.
hess
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the second order derivative (Hessian) of the
loss with respect to the elements of y_pred for each sample point.
For multi-class task, y_pred is a numpy 2-D array of shape = [n_samples, n_classes], and grad and hess
should be returned in the same format.
Methods
Attributes
property best_iteration_
The best iteration of fitted model if early_stopping() callback has been specified.
Type
int
property best_score_
The best score of fitted model.
Type
dict
property booster_
The underlying Booster of this model.
Type
Booster
property client_
Dask client.
This property can be passed in the constructor or updated with model.set_params(client=client).
Type
dask.distributed.Client
property evals_result_
The evaluation results if validation sets have been specified.
Type
dict
property feature_importances_
The feature importances (the higher, the more important).
Note: importance_type attribute is passed to the function to configure the type of importance values
to be extracted.
Type
array of shape = [n_features]
property feature_name_
The names of features.
Type
list of shape = [n_features]
fit(X, y, sample_weight=None, init_score=None, eval_set=None, eval_names=None,
eval_sample_weight=None, eval_init_score=None, eval_metric=None, **kwargs)
Build a gradient boosting model from the training set (X, y).
Parameters
• X (Dask Array or Dask DataFrame of shape = [n_samples,
n_features]) – Input feature matrix.
• y (Dask Array, Dask DataFrame or Dask Series of shape =
[n_samples]) – The target values (class labels in classification, real numbers
in regression).
• sample_weight (Dask Array or Dask Series of shape = [n_samples]
or None, optional (default=None)) – Weights of training data. Weights
should be non-negative.
• init_score (Dask Array or Dask Series of shape = [n_samples] or
None, optional (default=None)) – Init score of training data.
• eval_set (list or None, optional (default=None)) – A list of (X, y) tuple
pairs to use as validation sets.
• eval_names (list of str, or None, optional (default=None)) – Names
of eval_set.
• eval_sample_weight (list of Dask Array or Dask Series, or None,
optional (default=None)) – Weights of eval data. Weights should be non-
negative.
• eval_init_score (list of Dask Array or Dask Series, or None,
optional (default=None)) – Init score of eval data.
• eval_metric (str, callable, list or None, optional
(default=None)) – If str, it should be a built-in evaluation metric to use. If
callable, it should be a custom evaluation metric, see note below for more details. If
list, it can be a list of built-in metrics, a list of custom evaluation metrics, or a mix
of both. In either case, the metric from the model parameters will be evaluated and
used as well. Default: ‘l2’ for LGBMRegressor, ‘logloss’ for LGBMClassifier, ‘ndcg’
for LGBMRanker.
• feature_name (list of str, or 'auto', optional (default='auto')) –
Feature names. If ‘auto’ and data is pandas DataFrame, data columns names are used.
• categorical_feature (list of str or int, or 'auto', optional
(default='auto')) – Categorical features. If list of int, interpreted as indices.
If list of str, interpreted as feature names (need to specify feature_name as well). If
‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used.
All values in categorical features will be cast to int32 and thus should be less than int32
max value (2147483647). Large values could be memory consuming. Consider using
consecutive integers starting from zero. All negative values in categorical features will be treated as missing values.
Note: Custom eval function expects a callable with the following signatures: func(y_true,
y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group), and returns
(eval_name, eval_result, is_higher_better) or a list of (eval_name, eval_result, is_higher_better):
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. In case of custom objective,
predicted values are returned before any transformation, e.g. they are raw margin instead
of probability of positive class for binary task in this case.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
eval_name
[str] The name of evaluation function (without whitespace).
eval_result
[float] The eval result.
is_higher_better
[bool] Is eval result higher better, e.g. AUC is is_higher_better.
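As a minimal sketch of this contract (the function name is illustrative only), a custom RMSE metric with the func(y_true, y_pred) signature could look like this:
import numpy as np
def rmse_metric(y_true, y_pred):
    # returns (eval_name, eval_result, is_higher_better)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    return "rmse", rmse, False  # lower RMSE is better
Such a callable can be passed through the eval_metric argument of fit().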
get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
Returns
routing – A MetadataRequest encapsulating routing information.
Return type
MetadataRequest
get_params(deep=True)
Get parameters for this estimator.
Parameters
deep (bool, optional (default=True)) – If True, will return the parameters for this
estimator and contained subobjects that are estimators.
Returns
params – Parameter names mapped to their values.
Return type
dict
property n_estimators_
True number of boosting iterations performed.
This might be less than parameter n_estimators if early stopping was enabled or if boosting stopped
early due to limits on complexity like min_gain_to_split.
New in version 4.0.0.
Type
int
property n_features_
The number of features of fitted model.
Type
int
property n_features_in_
The number of features of fitted model.
Type
int
property n_iter_
True number of boosting iterations performed.
This might be less than parameter n_estimators if early stopping was enabled or if boosting stopped
early due to limits on complexity like min_gain_to_split.
New in version 4.0.0.
Type
int
property objective_
The concrete objective used while fitting this model.
Type
str or callable
predict(X, raw_score=False, start_iteration=0, num_iteration=None, pred_leaf=False, pred_contrib=False,
validate_features=False, **kwargs)
Return the predicted value for each sample.
Parameters
• X (Dask Array or Dask DataFrame of shape = [n_samples,
n_features]) – Input features matrix.
• raw_score (bool, optional (default=False)) – Whether to predict raw
scores.
Note: If you want to get more explanations for your model’s predictions using SHAP
values, like SHAP interaction values, you can install the shap package (https://github.
com/slundberg/shap). Note that unlike the shap package, with pred_contrib we
return a matrix with an extra column, where the last column is the expected value.
Notes
The R² score used when calling score on a regressor uses multioutput='uniform_average' from
version 0.23 to keep consistent with default value of r2_score(). This influences the score method of
all the multioutput regressors (except for MultiOutputRegressor).
set_fit_request(*, eval_init_score='$UNCHANGED$', eval_metric='$UNCHANGED$',
eval_names='$UNCHANGED$', eval_sample_weight='$UNCHANGED$',
eval_set='$UNCHANGED$', init_score='$UNCHANGED$',
sample_weight='$UNCHANGED$')
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.
set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
• True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not
provided.
• False: metadata is not requested and the meta-estimator will not pass it to fit.
• None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
• str: metadata should be passed to the meta-estimator with this given alias instead of the original
name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows
you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• eval_init_score (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
eval_init_score parameter in fit.
• eval_metric (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter
in fit.
• eval_names (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_names parameter in
fit.
• eval_sample_weight (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
eval_sample_weight parameter in fit.
• eval_set (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter
in fit.
• init_score (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for init_score parameter in
fit.
set_params(**params)
Set the parameters of this estimator.
Parameters
**params – Parameter names with their new values.
Returns
self – Returns self.
Return type
object
set_predict_request(*, num_iteration='$UNCHANGED$', pred_contrib='$UNCHANGED$',
pred_leaf='$UNCHANGED$', raw_score='$UNCHANGED$',
start_iteration='$UNCHANGED$', validate_features='$UNCHANGED$')
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.
set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
• True: metadata is requested, and passed to predict if provided. The request is ignored if metadata
is not provided.
• False: metadata is not requested and the meta-estimator will not pass it to predict.
• None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
• str: metadata should be passed to the meta-estimator with this given alias instead of the original
name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows
you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• num_iteration (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for num_iteration
parameter in predict.
• pred_contrib (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_contrib parameter
in predict.
set_score_request(*, sample_weight='$UNCHANGED$')
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.
set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
• True: metadata is requested, and passed to score if provided. The request is ignored if metadata is
not provided.
• False: metadata is not requested and the meta-estimator will not pass it to score.
• None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
• str: metadata should be passed to the meta-estimator with this given alias instead of the original
name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows
you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
sample_weight (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in
score.
Returns
self – The updated object.
Return type
object
to_local()
Create regular version of lightgbm.LGBMRegressor from the distributed version.
Returns
model – Local underlying model.
Return type
lightgbm.LGBMRegressor
9.4.3 lightgbm.DaskLGBMRanker
• class_weight (dict, 'balanced' or None, optional (default=None)) – Weights associated with classes in the form {class_label: weight}. Use this parameter only for multi-class classification task; for binary classification task you may use is_unbalance or scale_pos_weight parameters. Note, that the usage of all these parameters will result in poor estimates of the individual class probabilities. You may want to consider performing probability calibration (https://scikit-learn.org/stable/modules/calibration.html) of your model. The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). If None, all classes are supposed to have weight one. Note, that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
• min_split_gain (float, optional (default=0.)) – Minimum loss reduction
required to make a further partition on a leaf node of the tree.
• min_child_weight (float, optional (default=1e-3)) – Minimum sum of
instance weight (Hessian) needed in a child (leaf).
• min_child_samples (int, optional (default=20)) – Minimum number of
data needed in a child (leaf).
• subsample (float, optional (default=1.)) – Subsample ratio of the training
instance.
• subsample_freq (int, optional (default=0)) – Frequency of subsampling; <=0 means subsampling is disabled.
• colsample_bytree (float, optional (default=1.)) – Subsample ratio of
columns when constructing each tree.
• reg_alpha (float, optional (default=0.)) – L1 regularization term on
weights.
• reg_lambda (float, optional (default=0.)) – L2 regularization term on
weights.
• random_state (int, RandomState object or None, optional
(default=None)) – Random number seed. If int, this number is used to seed
the C++ code. If RandomState object (numpy), a random integer is picked based on
its state to seed the C++ code. If None, default seeds in C++ code are used.
• n_jobs (int or None, optional (default=None)) – Number of parallel
threads to use for training (can be changed at prediction time by passing it as an extra
keyword argument).
For better performance, it is recommended to set this to the number of physical cores
in the CPU.
Negative integers are interpreted as following joblib's formula (n_cpus + 1 + n_jobs),
just like scikit-learn (so e.g. -1 means using all threads). A value of zero corresponds
to the default number of threads configured for OpenMP in the system. A value of None
(the default) corresponds to using the number of physical cores in the system (its correct
detection requires either the joblib or the psutil library to be installed).
Changed in version 4.0.0.
• importance_type (str, optional (default='split')) – The type of feature
importance to be filled into feature_importances_. If ‘split’, result contains num-
bers of times the feature is used in a model. If ‘gain’, result contains total gains of
splits which use the feature.
Note: A custom objective function can be provided for the objective parameter. In this case, it should
have the signature objective(y_true, y_pred) -> grad, hess, objective(y_true, y_pred,
weight) -> grad, hess or objective(y_true, y_pred, weight, group) -> grad, hess:
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. Predicted values are returned before
any transformation, e.g. they are raw margin instead of probability of positive class for
binary task.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
grad
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the first order derivative (gradient) of the
loss with respect to the elements of y_pred for each sample point.
hess
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The value of the second order derivative (Hessian) of the
loss with respect to the elements of y_pred for each sample point.
For multi-class task, y_pred is a numpy 2-D array of shape = [n_samples, n_classes], and grad and hess
should be returned in the same format.
Methods
Attributes
property best_iteration_
The best iteration of fitted model if early_stopping() callback has been specified.
Type
int
property best_score_
The best score of fitted model.
Type
dict
property booster_
The underlying Booster of this model.
Type
Booster
property client_
Dask client.
Note: importance_type attribute is passed to the function to configure the type of importance values
to be extracted.
Type
array of shape = [n_features]
property feature_name_
The names of features.
Type
list of shape = [n_features]
fit(X, y, sample_weight=None, init_score=None, group=None, eval_set=None, eval_names=None,
eval_sample_weight=None, eval_init_score=None, eval_group=None, eval_metric=None, eval_at=(1, 2,
3, 4, 5), **kwargs)
Build a gradient boosting model from the training set (X, y).
Parameters
• X (Dask Array or Dask DataFrame of shape = [n_samples,
n_features]) – Input feature matrix.
• y (Dask Array, Dask DataFrame or Dask Series of shape =
[n_samples]) – The target values (class labels in classification, real numbers
in regression).
• sample_weight (Dask Array or Dask Series of shape = [n_samples]
or None, optional (default=None)) – Weights of training data. Weights
should be non-negative.
• init_score (Dask Array or Dask Series of shape = [n_samples] or
None, optional (default=None)) – Init score of training data.
• group (Dask Array or Dask Series or None, optional
(default=None)) – Group/query data. Only used in the learning-to-rank task.
sum(group) = n_samples. For example, if you have a 100-document dataset with
group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where
the first 10 records are in the first group, records 11-30 are in the second group,
records 31-70 are in the third group, etc.
• eval_set (list or None, optional (default=None)) – A list of (X, y) tuple
pairs to use as validation sets.
• eval_names (list of str, or None, optional (default=None)) – Names
of eval_set.
Note: Custom eval function expects a callable with following signatures: func(y_true,
y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group) and re-
turns (eval_name, eval_result, is_higher_better) or list of (eval_name, eval_result, is_higher_better):
y_true
[numpy 1-D array of shape = [n_samples]] The target values.
y_pred
[numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples,
n_classes] (for multi-class task)] The predicted values. In case of custom objective,
predicted values are returned before any transformation, e.g. they are raw margin instead
of probability of positive class for binary task in this case.
weight
[numpy 1-D array of shape = [n_samples]] The weight of samples. Weights should be
non-negative.
group
[numpy 1-D array] Group/query data. Only used in the learning-to-rank task. sum(group)
= n_samples. For example, if you have a 100-document dataset with group = [10, 20,
40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in
the first group, records 11-30 are in the second group, records 31-70 are in the third group,
etc.
eval_name
[str] The name of evaluation function (without whitespace).
eval_result
[float] The eval result.
is_higher_better
[bool] Is eval result higher better, e.g. AUC is is_higher_better.
get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
Returns
routing – A MetadataRequest encapsulating routing information.
Return type
MetadataRequest
get_params(deep=True)
Get parameters for this estimator.
Parameters
deep (bool, optional (default=True)) – If True, will return the parameters for this
estimator and contained subobjects that are estimators.
Returns
params – Parameter names mapped to their values.
Return type
dict
property n_estimators_
True number of boosting iterations performed.
This might be less than parameter n_estimators if early stopping was enabled or if boosting stopped
early due to limits on complexity like min_gain_to_split.
New in version 4.0.0.
Type
int
property n_features_
The number of features of fitted model.
Type
int
property n_features_in_
The number of features of fitted model.
Type
int
property n_iter_
True number of boosting iterations performed.
This might be less than parameter n_estimators if early stopping was enabled or if boosting stopped
early due to limits on complexity like min_gain_to_split.
New in version 4.0.0.
Type
int
property objective_
The concrete objective used while fitting this model.
Type
str or callable
predict(X, raw_score=False, start_iteration=0, num_iteration=None, pred_leaf=False, pred_contrib=False,
validate_features=False, **kwargs)
Return the predicted value for each sample.
Parameters
• X (Dask Array or Dask DataFrame of shape = [n_samples,
n_features]) – Input features matrix.
• raw_score (bool, optional (default=False)) – Whether to predict raw
scores.
• start_iteration (int, optional (default=0)) – Start index of the iteration
to predict. If <= 0, starts from the first iteration.
• num_iteration (int or None, optional (default=None)) – Total number of
iterations used in the prediction. If None, if the best iteration exists and start_iteration
<= 0, the best iteration is used; otherwise, all iterations from start_iteration are
used (no limits). If <= 0, all iterations from start_iteration are used (no limits).
• pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.
• pred_contrib (bool, optional (default=False)) – Whether to predict fea-
ture contributions.
Note: If you want to get more explanations for your model’s predictions using SHAP
values, like SHAP interaction values, you can install the shap package (https://github.
com/slundberg/shap). Note that unlike the shap package, with pred_contrib we
return a matrix with an extra column, where the last column is the expected value.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• eval_at (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_at parameter
in fit.
• eval_group (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_group parameter in
fit.
• eval_init_score (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
eval_init_score parameter in fit.
• eval_metric (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_metric parameter
in fit.
• eval_names (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for eval_names parameter in
fit.
set_params(**params)
Set the parameters of this estimator.
Parameters
**params – Parameter names with their new values.
Returns
self – Returns self.
Return type
object
set_predict_request(*, num_iteration='$UNCHANGED$', pred_contrib='$UNCHANGED$',
pred_leaf='$UNCHANGED$', raw_score='$UNCHANGED$',
start_iteration='$UNCHANGED$', validate_features='$UNCHANGED$')
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.
set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
• True: metadata is requested, and passed to predict if provided. The request is ignored if metadata
is not provided.
• False: metadata is not requested and the meta-estimator will not pass it to predict.
• None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
• str: metadata should be passed to the meta-estimator with this given alias instead of the original
name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows
you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g.
used inside a pipeline.Pipeline. Otherwise it has no effect.
Parameters
• num_iteration (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for num_iteration
parameter in predict.
• pred_contrib (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_contrib parameter
in predict.
• pred_leaf (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for pred_leaf parameter in
predict.
• raw_score (str, True, False, or None, default=sklearn.utils.
metadata_routing.UNCHANGED) – Metadata routing for raw_score parameter in
predict.
• start_iteration (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
start_iteration parameter in predict.
• validate_features (str, True, False, or None, default=sklearn.
utils.metadata_routing.UNCHANGED) – Metadata routing for
validate_features parameter in predict.
Returns
self – The updated object.
Return type
object
to_local()
Create regular version of lightgbm.LGBMRanker from the distributed version.
Returns
model – Local underlying model.
Return type
lightgbm.LGBMRanker
9.5 Callbacks
9.5.1 lightgbm.early_stopping
9.5.2 lightgbm.log_evaluation
lightgbm.log_evaluation(period=1, show_stdv=True)
Create a callback that logs the evaluation results.
By default, standard output resource is used. Use register_logger() function to register a custom logger.
Parameters
• period (int, optional (default=1)) – The period to log the evaluation results. The
last boosting stage or the boosting stage found by using early_stopping callback is also
logged.
• show_stdv (bool, optional (default=True)) – Whether to log stdv (if provided).
Returns
callback – The callback that logs the evaluation results every period boosting iteration(s).
Return type
_LogEvaluationCallback
9.5.3 lightgbm.record_evaluation
lightgbm.record_evaluation(eval_result)
Create a callback that records the evaluation history into eval_result.
Parameters
eval_result (dict) – Dictionary used to store all evaluation results of all validation sets.
This should be initialized outside of your call to record_evaluation() and should be empty.
Any initial contents of the dictionary will be deleted.
Example
With two validation sets named ‘eval’ and ‘train’, and one evaluation metric named ‘logloss’,
this dictionary will have the following structure after the model training process finishes:
{
'train':
{
'logloss': [0.48253, 0.35953, ...]
},
'eval':
{
'logloss': [0.480385, 0.357756, ...]
}
}
Returns
callback – The callback that records the evaluation history into the passed dictionary.
Return type
_RecordEvaluationCallback
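A minimal usage sketch (synthetic data and parameter values are illustrative only):
import numpy as np
import lightgbm as lgb
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = (X[:, 0] > 0.5).astype(int)
train_data = lgb.Dataset(X[:400], label=y[:400])
valid_data = lgb.Dataset(X[400:], label=y[400:], reference=train_data)
evals_result = {}  # filled in place by the callback during training
booster = lgb.train(
    params={"objective": "binary", "metric": "binary_logloss", "verbosity": -1},
    train_set=train_data,
    num_boost_round=10,
    valid_sets=[train_data, valid_data],
    valid_names=["train", "eval"],
    callbacks=[lgb.record_evaluation(evals_result)],
)
print(evals_result["eval"]["binary_logloss"][:3])  # structure as shown above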
9.5.4 lightgbm.reset_parameter
lightgbm.reset_parameter(**kwargs)
Create a callback that resets the parameter after the first iteration.
Note: The initial parameter will still take effect on the first iteration.
Parameters
**kwargs (value should be list or callable) – List of parameters for each boosting
round, or a callable that calculates the parameter in terms of the current round number (e.g.
yields learning rate decay). If list lst, parameter = lst[current_round]. If callable func,
parameter = func(current_round).
Returns
callback – The callback that resets the parameter after the first iteration.
Return type
_ResetParameterCallback
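For example, a learning-rate decay schedule can be expressed either as a callable or as an explicit list (values below are illustrative); the resulting callback is then passed through callbacks= to lgb.train() or the scikit-learn style fit() methods:
import lightgbm as lgb
# callable form: parameter = func(current_round)
lr_decay = lgb.reset_parameter(learning_rate=lambda current_round: 0.1 * (0.99 ** current_round))
# list form: parameter = lst[current_round], one value per boosting round
lr_schedule = lgb.reset_parameter(learning_rate=[0.1 * (0.99 ** i) for i in range(100)])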
9.6 Plotting
9.6.1 lightgbm.plot_importance
9.6.2 lightgbm.plot_split_value_histogram
9.6.3 lightgbm.plot_metric
Return type
matplotlib.axes.Axes
9.6.4 lightgbm.plot_tree
Note: It is preferable to use create_tree_digraph() because of its lossless quality, and the returned objects can
also be rendered and displayed directly inside a Jupyter notebook.
Parameters
• booster (Booster or LGBMModel) – Booster or LGBMModel instance to be plotted.
• ax (matplotlib.axes.Axes or None, optional (default=None)) – Target axes
instance. If None, new figure and axes will be created.
• tree_index (int, optional (default=0)) – The index of a target tree to plot.
• figsize (tuple of 2 elements or None, optional (default=None)) – Fig-
ure size.
• dpi (int or None, optional (default=None)) – Resolution of the figure.
• show_info (list of str, or None, optional (default=None)) – What infor-
mation should be shown in nodes.
– 'split_gain' : gain from adding this split to the model
– 'internal_value' : raw predicted value that would be produced by this node if it
was a leaf node
– 'internal_count' : number of records from the training data that fall into this
non-leaf node
– 'internal_weight' : total weight of all nodes that fall into this non-leaf node
– 'leaf_count' : number of records from the training data that fall into this leaf node
– 'leaf_weight' : total weight (sum of Hessian) of all observations that fall into this
leaf node
– 'data_percentage' : percentage of training data that fall into this node
• precision (int or None, optional (default=3)) – Used to restrict the display
of floating point values to a certain precision.
• orientation (str, optional (default='horizontal')) – Orientation of the tree.
Can be ‘horizontal’ or ‘vertical’.
9.6.5 lightgbm.create_tree_digraph
Parameters
• booster (Booster or LGBMModel) – Booster or LGBMModel instance to be con-
verted.
• tree_index (int, optional (default=0)) – The index of a target tree to convert.
• show_info (list of str, or None, optional (default=None)) – What infor-
mation should be shown in nodes.
– 'split_gain' : gain from adding this split to the model
– 'internal_value' : raw predicted value that would be produced by this node if it
was a leaf node
– 'internal_count' : number of records from the training data that fall into this
non-leaf node
– 'internal_weight' : total weight of all nodes that fall into this non-leaf node
– 'leaf_count' : number of records from the training data that fall into this leaf node
– 'leaf_weight' : total weight (sum of Hessian) of all observations that fall into this
leaf node
– 'data_percentage' : percentage of training data that fall into this node
Warning: Consider wrapping the SVG string of the tree graph with IPython.display.HTML when running on JupyterLab to get the tooltips working right.
Example:
from IPython.display import HTML
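A minimal sketch of this pattern, assuming bst is an already-trained Booster (the variable name is illustrative):
from IPython.display import HTML
import lightgbm as lgb
graph = lgb.create_tree_digraph(bst, tree_index=0)   # bst: an already-trained Booster (assumed)
HTML(graph.pipe(format="svg").decode("utf-8"))       # wrap the SVG string so tooltips work in JupyterLab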
9.7 Utilities
9.7.1 lightgbm.register_logger
TEN
DISTRIBUTED LEARNING GUIDE
This guide describes distributed learning in LightGBM. Distributed learning allows the use of multiple machines to
produce a single model.
Follow the Quick Start to know how to use LightGBM first.
This section describes how distributed learning in LightGBM works. To learn how to do this in various programming
languages and frameworks, please see Integrations.
These algorithms are suited for different scenarios, which are listed in the following table:
                        #data is small        #data is large
#feature is small       feature parallel      data parallel
#feature is large       feature parallel      voting parallel
More details about these parallel algorithms can be found in optimization in distributed learning.
10.2 Integrations
This section describes how to run distributed LightGBM training in various programming languages and frameworks.
To learn how distributed learning in LightGBM works generally, please see How Distributed LightGBM Works.
Apache Spark users can use SynapseML for machine learning workflows with LightGBM. This project is not main-
tained by LightGBM’s maintainers.
See this SynapseML example for additional information on using LightGBM on Spark.
Note: SynapseML is not maintained by LightGBM’s maintainers. Bug reports or feature requests should be directed
to https://github.com/microsoft/SynapseML/issues.
10.2.2 Dask
Dask Examples
This section contains detailed information on performing LightGBM distributed training using Dask.
Allocating Threads
When setting up a Dask cluster for training, give each Dask worker process at least two threads. If you do not do this,
training might be substantially slower because communication work and training work will block each other.
If you do not have other significant processes competing with Dask for resources, just accept the default nthreads
from your chosen dask.distributed cluster.
from distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=3)
client = Client(cluster)
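If you do want to set thread counts explicitly, a minimal sketch (parameter values are illustrative) that guarantees at least two threads per worker:
from distributed import Client, LocalCluster
# at least two threads per worker, so LightGBM's communication task
# and the training work do not block each other
cluster = LocalCluster(n_workers=3, threads_per_worker=2)
client = Client(cluster)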
Managing Memory
Use the Dask diagnostic dashboard or your preferred monitoring tool to monitor Dask workers’ memory consumption
during training. As described in the Dask worker documentation, Dask workers will automatically start spilling data to
disk if memory consumption gets too high. This can substantially slow down computations, since disk I/O is usually
much slower than reading the same data from memory.
At 60% of memory load, [Dask will] spill least recently used data to disk
To reduce the risk of hitting memory limits, consider restarting each worker process before running any data loading
or training code.
client.restart()
The estimators in lightgbm.dask expect that matrix-like or array-like data are provided in Dask DataFrame, Dask Ar-
ray, or (in some cases) Dask Series format. See the Dask DataFrame documentation and the Dask Array documentation
for more information on how to create such data structures.
While setting up for training, lightgbm will concatenate all of the partitions on a worker into a single dataset. Dis-
tributed training then proceeds with one LightGBM worker process per Dask worker.
When setting up data partitioning for LightGBM training with Dask, try to follow these suggestions:
• ensure that each worker in the cluster has some of the training data
• try to give each worker roughly the same amount of data, especially if your dataset is small
• if you plan to train multiple models (for example, to tune hyperparameters) on the same data, use client.persist() before training to materialize the data one time (see the sketch below)
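For the last point, a minimal sketch (random data used as a stand-in for a real dataset):
import dask.array as da
from distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=3)
client = Client(cluster)
X = da.random.random(size=(10_000, 20), chunks=(1_000, 20))
y = da.random.random(size=(10_000,), chunks=(1_000,))
# materialize the partitions on the workers once, so repeated training runs
# (e.g. during hyperparameter tuning) reuse the same in-memory data
X, y = client.persist([X, y])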
In most situations, you should not need to tell lightgbm.dask to use a specific Dask client. By default, the client
returned by distributed.default_client() will be used.
However, you might want to explicitly control the Dask client used by LightGBM if you have multiple active clients
in the same session. This is useful in more complex workflows like running multiple training jobs on different Dask
clusters.
LightGBM’s Dask estimators support setting an attribute client to control the client that is used.
import lightgbm as lgb
from distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
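Continuing from the client created above, a minimal sketch of both ways to attach it (DaskLGBMRegressor is used here purely for illustration):
# option 1: pass the client when constructing the estimator
dask_model = lgb.DaskLGBMRegressor(client=client)
# option 2: attach or swap the client after construction
dask_model = lgb.DaskLGBMRegressor()
dask_model.set_params(client=client)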
At the beginning of training, lightgbm.dask sets up a LightGBM network where each Dask worker runs one long-
running task that acts as a LightGBM worker. During training, LightGBM workers communicate with each other over
TCP sockets. By default, random open ports are used when creating these sockets.
If the communication between Dask workers in the cluster used for training is restricted by firewall rules, you must tell
LightGBM exactly what ports to use.
Option 1: provide a specific list of addresses and ports
LightGBM supports a parameter machines, a comma-delimited string where each entry refers to one worker (host
name or IP) and a port that that worker will accept connections on. If you provide this parameter to the estimators in
lightgbm.dask, LightGBM will not search randomly for ports.
For example, consider the case where you are running one Dask worker process on each of the following IP addresses:
10.0.1.0
10.0.2.0
10.0.3.0
You could edit your firewall rules to allow traffic on one additional port on each of these hosts, then provide machines
directly.
machines = "10.0.1.0:12401,10.0.2.0:12402,10.0.3.0:15000"
dask_model = lgb.DaskLGBMRegressor(machines=machines)
If you are running multiple Dask worker processes on a physical host in the cluster, be sure that there are multiple entries
for that host's IP address, each with a different port. For example, if you were running a cluster with nprocs=2 (2 Dask worker
processes per machine), you might open two additional ports on each of these hosts, then provide machines as follows.
machines = ",".join([
"10.0.1.0:16000",
"10.0.1.0:16001",
"10.0.2.0:16000",
"10.0.2.0:16001",
])
dask_model = lgb.DaskLGBMRegressor(machines=machines)
Warning: Providing machines gives you complete control over the networking details of training, but it also
makes the training process fragile. Training will fail if you use machines and any of the following are true:
• any of the ports mentioned in machines are not open when training begins
• some partitions of the training data are held by machines that are not present in machines
• some machines mentioned in machines do not hold any of the training data
Option 2: specify one port to use on every worker
As an alternative to machines, you can have every worker use the same port. Consider again the case where you are running one Dask worker process on each of the following IP addresses:
10.0.1.0
10.0.2.0
10.0.3.0
You could edit your firewall rules to allow communication between any of the workers over one port, then provide that
port via parameter local_listen_port.
dask_model = lgb.DaskLGBMRegressor(local_listen_port=12400)
Warning: Providing local_listen_port is slightly less fragile than machines because LightGBM will auto-
matically figure out which workers have pieces of the training data. However, using this method, training can fail
if any of the following are true:
• the port local_listen_port is not open on any of the worker hosts
• any machine has multiple Dask worker processes running on it
Warning: Custom objective functions used with lightgbm.dask will be called by each worker process on only
that worker’s local data.
Follow the example below to use a custom implementation of the regression_l2 objective.
import dask.array as da
import lightgbm as lgb
import numpy as np
from distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=2)
client = Client(cluster)
# random Dask collections standing in for real training data
X = da.random.random(size=(1000, 10), chunks=(100, 10))
y = da.random.random(size=(1000,), chunks=(100,))
def custom_l2_obj(y_true, y_pred):
    # gradient and Hessian of the squared-error loss with respect to y_pred
    grad = y_pred - y_true
    hess = np.ones(len(y_true))
    return grad, hess
dask_model = lgb.DaskLGBMRegressor(
    objective=custom_l2_obj
)
dask_model.fit(X, y)
The estimators from lightgbm.dask can be used to create predictions based on data stored in Dask collections. In
that interface, .predict() expects a Dask Array or Dask DataFrame, and returns a Dask Array of predictions.
See the Dask prediction example for some sample code that shows how to perform Dask-based prediction.
For model evaluation, consider using the metrics functions from dask-ml. Those functions are intended to provide the
same API as equivalent functions in sklearn.metrics, but they use distributed computation powered by Dask to
compute metrics without all of the input data ever needing to be on a single machine.
After training with Dask, you have several options for saving a fitted model.
Option 1: pickle the Dask estimator
LightGBM’s Dask estimators can be pickled directly with cloudpickle, joblib, or pickle.
import dask.array as da
import pickle
import lightgbm as lgb
from distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=2)
client = Client(cluster)
# random Dask collections standing in for real training data
X = da.random.random(size=(1000, 10), chunks=(100, 10))
y = da.random.random(size=(1000,), chunks=(100,))
dask_model = lgb.DaskLGBMRegressor()
dask_model.fit(X, y)
with open("dask-model.pkl", "wb") as f:
    pickle.dump(dask_model, f)
A model saved this way can then later be loaded with whichever serialization library you used to save it.
import pickle
with open("dask-model.pkl", "rb") as f:
dask_model = pickle.load(f)
Note: If you explicitly set a Dask client (see Using a Specific Dask Client), it will not be saved when pickling the
estimator. When loading a Dask estimator from disk, if you need to use a specific client you can add it after loading
with dask_model.set_params(client=client).
Option 2: convert to sklearn equivalent
The estimators from lightgbm.dask can be converted to instances of the equivalent classes from lightgbm.sklearn with to_local(); that local model can then be saved like any other scikit-learn-style LightGBM model.
import dask.array as da
import joblib
import lightgbm as lgb
from distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=2)
client = Client(cluster)
# random Dask collections standing in for real training data
X = da.random.random(size=(1000, 10), chunks=(100, 10))
y = da.random.random(size=(1000,), chunks=(100,))
dask_model = lgb.DaskLGBMRegressor()
dask_model.fit(X, y)
# convert the distributed model to a local, non-Dask one
sklearn_model = dask_model.to_local()
print(type(sklearn_model))
#> <class 'lightgbm.sklearn.LGBMRegressor'>
joblib.dump(sklearn_model, "sklearn-model.joblib")
A model saved this way can then later be loaded with whichever serialization library you used to save it.
import joblib
sklearn_model = joblib.load("sklearn-model.joblib")
Option 3: save the LightGBM Booster
The lowest-level model object in lightgbm is the lightgbm.Booster. After training, you can extract a Booster from the Dask estimator through its booster_ attribute.
import dask.array as da
import lightgbm as lgb
from distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=2)
client = Client(cluster)
# X, y: Dask collections holding the training data, as in the examples above
dask_model = lgb.DaskLGBMRegressor()
dask_model.fit(X, y)
bst = dask_model.booster_  # extract the underlying Booster
From this point forward, you can use any of the following methods to save the Booster (see the sketch after this list):
• serialize with cloudpickle, joblib, or pickle
• bst.dump_model(): dump the model to a dictionary which could be written out as JSON
• bst.model_to_string(): dump the model to a string in memory
• bst.save_model(): write the output of bst.model_to_string() to a text file
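A minimal sketch of those options, assuming bst is the Booster extracted above (file names are illustrative):
import lightgbm as lgb
bst.save_model("lgb-model.txt")          # write model_to_string() output to a text file
model_str = bst.model_to_string()        # in-memory string representation
model_dict = bst.dump_model()            # dict that could be written out as JSON
# round-trip back into Booster objects
bst_from_file = lgb.Booster(model_file="lgb-model.txt")
bst_from_str = lgb.Booster(model_str=model_str)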
10.2.3 Kubeflow
Kubeflow Fairing supports LightGBM distributed training. These examples show how to get started with LightGBM
and Kubeflow Fairing in a hybrid cloud environment.
Kubeflow users can also use the Kubeflow XGBoost Operator for machine learning workflows with LightGBM. You
can see this example for more details.
Kubeflow integrations for LightGBM are not maintained by LightGBM’s maintainers.
Note: The Kubeflow integrations for LightGBM are not maintained by LightGBM’s maintainers. Bug reports
or feature requests should be directed to https://github.com/kubeflow/fairing/issues or https://github.com/kubeflow/
xgboost-operator/issues.
Preparation
Socket Version
Collect the IP addresses of the machines that will run distributed learning and allocate one TCP port (assume 12345
here) on all of them, then change firewall rules to allow incoming traffic on this port (12345). Write those IPs and ports
into one file (assume mlist.txt), like the following:
machine1_ip 12345
machine2_ip 12345
MPI Version
Collect the IP addresses (or hostnames) of the machines that will run distributed learning, then write them into one
file (assume mlist.txt), like the following:
machine1_ip
machine2_ip
Note: Windows users need to start “smpd” to start the MPI service. More details can be found here.
Socket Version
MPI Version
For Linux:
Example
10.2.5 Ray
Ray is a Python-based framework for distributed computing. The lightgbm_ray project, maintained within the official
Ray GitHub organization, can be used to perform distributed LightGBM training using ray.
See the lightgbm_ray documentation for usage examples.
Note: lightgbm_ray is not maintained by LightGBM’s maintainers. Bug reports or feature requests should be
directed to https://github.com/ray-project/lightgbm_ray/issues.
10.2.6 Mars
Mars is a tensor-based framework for large-scale data computation. LightGBM integration, maintained within the Mars
GitHub repository, can be used to perform distributed LightGBM training using pymars.
See the mars documentation for usage examples.
Note: Mars is not maintained by LightGBM’s maintainers. Bug reports or feature requests should be directed to
https://github.com/mars-project/mars/issues.
ELEVEN
GPU TUTORIAL
The purpose of this document is to give you a quick step-by-step tutorial on GPU training.
For Windows, please see GPU Windows Tutorial.
We will use the GPU instance on Microsoft Azure cloud computing platform for demonstration, but you can use any
machine with modern AMD or NVIDIA GPUs.
You need to launch a NV type instance on Azure (available in East US, North Central US, South Central US, West
Europe and Southeast Asia zones) and select Ubuntu 16.04 LTS as the operating system.
For testing, the smallest NV6 type virtual machine is sufficient, which includes 1/2 M60 GPU, with 8 GB memory, 180
GB/s memory bandwidth and 4,825 GFLOPS peak computation power. Don’t use the NC type instance as the GPUs
(K80) are based on an older architecture (Kepler).
First we need to install minimal NVIDIA drivers and OpenCL development environment:
sudo init 6
The NV6 GPU instance has a 320 GB ultra-fast SSD mounted at /mnt. Let’s use it as our workspace (skip this if you
are using your own machine):
Now we are ready to checkout LightGBM and compile it with GPU support:
cmake -DUSE_GPU=1 ..   # from the build/ directory of the LightGBM repository
make -j$(nproc)
cd ..
You will see that two binaries are generated: lightgbm and lib_lightgbm.so.
If you are building on macOS, you probably need to remove macro BOOST_COMPUTE_USE_OFFLINE_CACHE in src/
treelearner/gpu_tree_learner.h to avoid a known crash bug in Boost.Compute.
If you want to use the Python interface of LightGBM, you can install it now (along with some necessary Python-package
dependencies):
You need to set an additional parameter "device" : "gpu" (along with your other options like learning_rate,
num_leaves, etc) to use GPU in Python.
You can read our Python-package Examples for more information on how to use the Python interface.
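A minimal sketch of GPU training through the Python interface, assuming a GPU-enabled build (synthetic data and parameter values are illustrative only):
import numpy as np
import lightgbm as lgb
rng = np.random.default_rng(0)
X = rng.random((10_000, 20))
y = rng.integers(0, 2, size=10_000)
train_data = lgb.Dataset(X, label=y)
params = {
    "objective": "binary",
    "device": "gpu",           # enable GPU training
    # "gpu_platform_id": 0,    # optionally pin a specific OpenCL platform
    # "gpu_device_id": 0,      # and device on that platform
    "learning_rate": 0.1,
    "num_leaves": 63,
}
booster = lgb.train(params, train_data, num_boost_round=50)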
Now we create a configuration file for LightGBM by running the following commands (please copy the entire block
and run it as a whole):
GPU is enabled in the configuration file we just created by setting device=gpu. In this configuration we use
the first GPU installed on the system (gpu_platform_id=0 and gpu_device_id=0). If gpu_platform_id or
gpu_device_id is not set, the default platform and GPU will be selected. You might have multiple platforms
(AMD/Intel/NVIDIA) or GPUs. You can use the clinfo utility to identify the GPUs on each platform. On Ubuntu, you
can install clinfo by executing sudo apt-get install clinfo. If you have a discrete AMD/NVIDIA GPU
and an integrated Intel GPU, make sure to select the correct gpu_platform_id so that the discrete GPU is used.
Now train the same dataset on CPU using the following command. You should observe a similar AUC:
Now we can make a speed test on GPU without calculating AUC after each iteration.
11.7 Reference
Please kindly cite the following article in your publications if you find the GPU acceleration useful:
Huan Zhang, Si Si and Cho-Jui Hsieh. “GPU Acceleration for Large-scale Tree Boosting.” SysML Conference, 2018.
TWELVE
ADVANCED TOPICS
• LightGBM enables the missing value handle by default. Disable it by setting use_missing=false.
• LightGBM uses NA (NaN) to represent missing values by default. Change it to use zero by setting
zero_as_missing=true.
• When zero_as_missing=false (default), the unrecorded values in sparse matrices (and LightSVM) are treated
as zeros.
• When zero_as_missing=true, NA and zeros (including unrecorded values in sparse matrices (and
LightSVM)) are treated as missing.
• LightGBM offers good accuracy with integer-encoded categorical features. LightGBM applies Fisher (1958) to
find the optimal split over categories as described here. This often performs better than one-hot encoding.
• Use categorical_feature to specify the categorical features. Refer to the parameter categorical_feature
in Parameters (see also the sketch after this list).
• Categorical features will be cast to int32 (integer codes will be extracted from pandas categoricals in the Python-
package), so they must be encoded as non-negative integers (negative values will be treated as missing) less than
Int32.MaxValue (2147483647). It is best to use a contiguous range of integers starting from zero. Floating
point numbers in categorical features will be rounded towards 0.
• Use min_data_per_group, cat_smooth to deal with over-fitting (when #data is small or #category is large).
• For a categorical feature with high cardinality (#category is large), it often works best to treat the feature as
numeric, either by simply ignoring the categorical interpretation of the integers or by embedding the categories
in a low-dimensional numeric space.
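A minimal sketch of both ways to mark categorical features in the Python-package (synthetic data, column names and parameter values are illustrative):
import lightgbm as lgb
import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "color": pd.Categorical(rng.choice(["red", "green", "blue"], size=1_000)),
    "size_cm": rng.random(1_000) * 10.0,
})
y = rng.integers(0, 2, size=1_000)
# pandas unordered categorical columns are picked up automatically with 'auto';
# here the categorical column is also named explicitly for clarity
train_data = lgb.Dataset(df, label=y, categorical_feature=["color"])
booster = lgb.train({"objective": "binary", "verbosity": -1}, train_data, num_boost_round=10)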
12.3 LambdaRank
• The label should be of type int, such that larger numbers correspond to higher relevance (e.g. 0:bad, 1:fair,
2:good, 3:perfect).
• Use label_gain to set the gain (weight) of each integer label.
• Use lambdarank_truncation_level to truncate the max DCG.
Cost Efficient Gradient Boosting (CEGB) makes it possible to penalise boosting based on the cost of obtaining feature
values. CEGB penalises learning in the following ways:
• Each time a tree is split, a penalty of cegb_penalty_split is applied.
• When a feature is used for the first time, cegb_penalty_feature_coupled is applied. This penalty can be
different for each feature and should be specified as one double per feature.
• When a feature is used for the first time for a data row, cegb_penalty_feature_lazy is applied. Like
cegb_penalty_feature_coupled, this penalty is specified as one double per feature.
Each of the penalties above is scaled by cegb_tradeoff. Using this parameter, it is possible to change the overall
strength of the CEGB penalties by changing only one parameter.
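A minimal sketch of how these parameters might be combined for a hypothetical 3-feature dataset (all cost values are made up for illustration); the dictionary would then be passed as the params argument of lgb.train():
params = {
    "objective": "binary",
    "cegb_tradeoff": 0.5,                              # scales all CEGB penalties below
    "cegb_penalty_split": 0.1,                         # charged every time a tree is split
    "cegb_penalty_feature_coupled": [5.0, 1.0, 0.5],   # one value per feature, charged on first use of the feature
    "cegb_penalty_feature_lazy": [2.0, 0.2, 0.1],      # one value per feature, charged per data row on first use
}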
THIRTEEN
LIGHTGBM FAQ
• Critical Issues
• General LightGBM Questions
• R-package
• Python-package
A critical issue could be a crash, prediction error, nonsense output, or something else requiring immediate attention.
Please post such an issue in the Microsoft/LightGBM repository.
You may also ping a member of the core team according to the relevant area of expertise by mentioning them with the
@ symbol:
• @guolinke Guolin Ke (C++ code / R-package / Python-package)
• @chivee Qiwei Ye (C++ code / Python-package)
• @shiyu1994 Yu Shi (C++ code / Python-package)
• @tongwu-msft Tong Wu (C++ code / Python-package)
• @hzy46 Zhiyuan He (C++ code / Python-package)
• @btrotta Belinda Trotta (C++ code)
• @Laurae2 Damien Soukhavong (R-package)
• @jameslamb James Lamb (R-package / Dask-package)
• @jmoralez José Morales (Dask-package)
• @wxchan Wenxuan Chen (Python-package)
• @henry0312 Tsukasa Omoto (Python-package)
• @StrikerRUS Nikita Titov (Python-package)
• @huanzhang12 Huan Zhang (GPU support)
Please include as much of the following information as possible when submitting a critical issue:
• Is it reproducible on CLI (command line interface), R, and/or Python?
• Is it specific to a wrapper? (R or Python?)
• Is it specific to the compiler? (gcc or Clang version? MinGW or Visual Studio version?)
• Is it specific to your Operating System? (Windows? Linux? macOS?)
• Are you able to reproduce this issue with a simple case?
• Does the issue persist after removing all optimization flags and compiling LightGBM in debug mode?
When submitting issues, please keep in mind that this is largely a volunteer effort, and we may not be available 24/7 to
provide support.
13.2.2 2. On datasets with millions of features, training does not start (or starts
after a very long time).
Use a smaller value for bin_construct_sample_cnt and a larger value for min_data.
13.2.3 3. When running LightGBM on a large dataset, my computer runs out of RAM.
Multiple solutions: set the histogram_pool_size parameter to the MB you want to use for LightGBM
(histogram_pool_size + dataset size = approximately RAM used), lower num_leaves or lower max_bin (see
Microsoft/LightGBM#562).
13.2.4 4. I am using Windows. Should I use Visual Studio or MinGW for compiling
LightGBM?
Visual Studio performs best for LightGBM.
13.2.5 5. When using LightGBM GPU, I cannot reproduce results over several runs.
This is normal and expected behaviour, but you may try to use gpu_use_dp = true for reproducibility (see Mi-
crosoft/LightGBM#560). You may also use the CPU version.
LightGBM bagging is multithreaded, so its output depends on the number of threads used. There is no workaround
currently.
Starting from #2804, the bagging result no longer depends on the number of threads, so this issue should be solved in the
latest version.
This is expected behaviour for arbitrary parameters. To enable Random Forest mode, you must use a bagging_fraction and
a feature_fraction different from 1, along with a bagging_freq. This thread includes an example.
13.2.8 8. CPU usage is low (like 10%) in Windows when using LightGBM on very
large datasets with many-core systems.
Please use Visual Studio as it may be 10x faster than MinGW especially for very large trees.
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
The column you’re trying to pass via categorical_feature likely contains very large values. Categorical features in
LightGBM are limited by int32 range, so you cannot pass values that are greater than Int32.MaxValue (2147483647)
as categorical features (see Microsoft/LightGBM#1359). You should convert them to integers ranging from zero to the
number of categories first.
13.2.10 10. LightGBM crashes randomly with the error like: Initializing
libiomp5.dylib, but found libomp.dylib already initialized.
OMP: Error #15: Initializing libiomp5.dylib, but found libomp.dylib already initialized.
OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. ... that may cause crashes or silently produce incorrect results. For more information, ...
Possible Cause: This error means that you have multiple OpenMP libraries installed on your machine and they conflict
with each other. (File extensions in the error message may differ depending on the operating system).
If you are using Python distributed by Conda, then it is highly likely that the error is caused by Conda's numpy package,
which bundles the mkl package, which in turn conflicts with the system-wide OpenMP library. In this case you can
update the numpy package in Conda, or replace Conda's OpenMP library instance with the system-wide one by creating
a symlink to it in the Conda environment folder $CONDA_PREFIX/lib.
Solution: Assuming you are using macOS with Homebrew, the following command overwrites the OpenMP library files in
the currently active Conda environment with symlinks to the system-wide ones installed by Homebrew:
The fix described above worked fine before the release of OpenMP version 8.0.0. Starting from version 8.0.0, the Homebrew
formula for OpenMP includes the -DLIBOMP_INSTALL_ALIASES=OFF option, which means the fix no longer works.
However, you can create symlinks to the library aliases manually:
Another workaround would be removing MKL optimizations from Conda’s packages completely:
If this is not your case, then you should find conflicting OpenMP library installations on your own and leave only one
of them.
13.2.11 11. LightGBM hangs when multithreading (OpenMP) and using forking in
Linux at the same time.
Use nthreads=1 to disable multithreading of LightGBM. There is a bug with OpenMP which hangs forked sessions
with multithreading activated. A more expensive solution is to use new processes instead of fork; however, keep in mind
that creating new processes requires copying memory and loading libraries (for example, if you fork your current process
16 times, you will need 16 copies of your dataset in memory) (see Microsoft/LightGBM#1789).
An alternative, if multithreading is really necessary inside the forked sessions, would be to compile LightGBM with
Intel toolchain. Intel compilers are unaffected by this bug.
For C/C++ users, no OpenMP feature can be used before the fork happens. If an OpenMP feature is used before the
fork (example: using OpenMP for forking), OpenMP will hang inside the forked sessions. Use new processes instead
and copy memory as required by creating new processes instead of forking (or use Intel compilers).
Cloud platform container services may cause LightGBM to hang, if they use Linux fork to run multiple containers on a
single instance. For example, LightGBM hangs in AWS Batch array jobs, which use the ECS agent to manage multiple
running jobs. Setting nthreads=1 mitigates the issue.
Early stopping involves choosing a validation set, a special type of holdout which is used to evaluate the current state
of the model after each iteration to see if training can stop.
In LightGBM, we have decided to require that users specify this set directly. Many options exist for splitting training
data into training, test, and validation sets.
The appropriate splitting strategy depends on the task and domain of the data, information that a modeler has but which
LightGBM as a general-purpose tool does not.
13.2.13 13. Does LightGBM support direct loading data from zero-based or one-
based LibSVM format file?
LightGBM supports loading data from zero-based LibSVM format file directly.
13.2.14 14. Why CMake cannot find the compiler when compiling LightGBM with
MinGW?
This is a known issue of CMake when using MinGW. The easiest solution is to run your CMake command again to
get past this one-time stopper from CMake, or to upgrade your version of CMake to at least version 3.17.0.
See Microsoft/LightGBM#3060 for more details.
You can find LightGBM’s logo in different file formats and resolutions here.
13.2.16 16. LightGBM crashes randomly or operating system hangs during or after
running LightGBM.
Possible Cause: This behavior may indicate that you have multiple OpenMP libraries installed on your machine and
they conflict with each other, similarly to the FAQ #10.
If you are using any Python package that depends on threadpoolctl, you also may see the following warning in your
logs in this case:
/root/miniconda/envs/test-env/lib/python3.8/site-packages/threadpoolctl.py:546: RuntimeWarning:
Detailed description of conflicts between multiple OpenMP instances is provided in the following document.
Solution: Assuming you are using LightGBM Python-package and conda as a package manager, we strongly rec-
ommend using conda-forge channel as the only source of all your Python package installations because it contains
built-in patches to workaround OpenMP conflicts. Some other workarounds are listed here.
If this is not your case, then you should find conflicting OpenMP library installations on your own and leave only one
of them.
13.3 R-package
• 1. Any training command using LightGBM does not work after an error occurred during the training of a
previous LightGBM model.
• 2. I used setinfo(), tried to print my lgb.Dataset, and now the R console froze!
• 3. error in data.table::data.table()...argument 2 is NULL
13.3.1 1. Any training command using LightGBM does not work after an error oc-
curred during the training of a previous LightGBM model.
In older versions of the R package (prior to v3.3.0), this could happen occasionally, and the solution was to run lgb.unloader(wipe = TRUE) to remove all LightGBM-related objects. Some discussion of this can be found in Microsoft/LightGBM#698.
That is no longer necessary as of v3.3.0, and the function lgb.unloader() has since been removed from the R package.
13.3.2 2. I used setinfo(), tried to print my lgb.Dataset, and now the R console
froze!
As of at least LightGBM v3.3.0, this issue has been resolved and printing a Dataset object does not cause the console
to freeze.
In older versions, avoid printing the Dataset after calling setinfo().
As of LightGBM v4.0.0, setinfo() has been replaced by a new method, set_field().
13.3.3 3. error in data.table::data.table()...argument 2 is NULL
If you are experiencing this error when running lightgbm, you may be facing the same issue reported in #2715 and later in #2989. We have seen that in some situations, using data.table 1.11.x results in this error. To get around this, you can upgrade your version of data.table to at least version 1.12.0.
13.4 Python-package
• 1. Error: setup script specifies an absolute path when installing from GitHub using python
setup.py install.
• 2. Error messages: Cannot ... before construct dataset.
• 3. I encounter segmentation faults (segfaults) randomly after installing LightGBM from PyPI using pip
install lightgbm.
• 4. I would like to install LightGBM from conda. What channel should I choose?
13.4.1 1. Error: setup script specifies an absolute path when installing from
GitHub using python setup.py install.
Note: As of v4.0.0, lightgbm does not support directly invoking setup.py. This answer refers only to versions of
lightgbm prior to v4.0.0.
This error should be resolved in the latest version. If you still encounter it, try removing the lightgbm.egg-info folder from your Python-package and reinstalling, or check this thread on Stack Overflow.
13.4.2 2. Error messages: Cannot ... before construct dataset.
Cannot set predictor/reference/categorical feature after freed raw data, set free_raw_data=False when construct Dataset to avoid this.
Solution: Because LightGBM constructs bin mappers to build trees, and train and valid Datasets within one Booster share the same bin mappers, categorical features, feature names, etc., the Dataset objects are constructed when constructing a Booster. If you set free_raw_data=True (the default), the raw data (held in Python data structures) will be freed. So, if you want to:
• get label (or weight/init_score/group/data) before constructing a dataset, it is the same as getting self.label;
• set label (or weight/init_score/group) before constructing a dataset, it is the same as self.label=some_label_array;
• get num_data (or num_feature) before constructing a dataset, you can get the data with self.data. Then, if your data is a numpy.ndarray, use code like self.data.shape. But do not do this after subsetting the Dataset, because you will always get None;
• set predictor (or reference/categorical feature) after constructing a dataset, you should set free_raw_data=False or init a Dataset object with the same raw data (see the sketch below).
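A minimal sketch of that last point (synthetic placeholder data, not taken from the docs above):
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:, 0] = rng.integers(0, 5, size=500)  # integer-coded column used as categorical below
y = rng.integers(0, 2, size=500)

# Keep the raw numpy array inside the Dataset object.
train_set = lgb.Dataset(X, label=y, free_raw_data=False)
train_set.construct()  # bin mappers are built here

# With the default free_raw_data=True this would raise
# "Cannot set predictor/reference/categorical feature after freed raw data";
# with free_raw_data=False it is allowed.
train_set.set_categorical_feature([0])

booster = lgb.train({"objective": "binary", "verbose": -1}, train_set, num_boost_round=5)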
13.4.3 3. I encounter segmentation faults (segfaults) randomly after installing LightGBM from PyPI using pip install lightgbm.
We are doing our best to provide universal wheels that have high running speed and are compatible with any hardware, OS, compiler, etc. at the same time. However, sometimes it is simply impossible to guarantee that LightGBM can be used in every specific environment (see Microsoft/LightGBM#1743).
Therefore, the first thing you should try in case of segfaults is compiling from source using pip install --no-binary lightgbm lightgbm. For the OS-specific prerequisites see this guide.
Also, feel free to post a new issue in our GitHub repository. We always look at each case individually and try to find a
root cause.
13.4.4 4. I would like to install LightGBM from conda. What channel should I
choose?
We strongly recommend installation from the conda-forge channel and not from the default one for many reasons. The main ones are shorter delays for new releases, a greater number of supported architectures, and better handling of dependency conflicts; in particular, the OpenMP workaround is crucial for LightGBM. More details can be found in this comment.
FOURTEEN
DEVELOPMENT GUIDE
14.1 Algorithms
Refer to Features to understand the important algorithms used in LightGBM.
14.2 Classes and Code Structure
14.2.1 Important Classes
Class Description
Application The entrance of application, including training and prediction logic
Bin Data structure used for storing feature discrete values (converted from float values)
Boosting Boosting interface (GBDT, DART, etc.)
Config Stores parameters and configurations
Dataset Stores information of dataset
DatasetLoader Used to construct dataset
FeatureGroup Stores the data of a feature group, which may contain multiple features
Metric Evaluation metrics
Network Network interfaces and communication algorithms
ObjectiveFunction Objective functions used to train
Tree Stores information of tree model
TreeLearner Used to learn trees
14.2.2 Code Structure
Path Description
./include Header files
./include/utils Some common functions
./src/application Implementations of training and prediction logic
./src/boosting Implementations of Boosting
./src/io Implementations of IO related classes, including Bin, Config, Dataset, DatasetLoader,
Feature and Tree
./src/metric Implementations of metrics
./src/network Implementations of network functions
./src/objective Implementations of objective functions
./src/treelearner Implementations of tree learners
14.4 C API
Refer to C API or the comments in c_api.h file, from which the documentation is generated.
14.5 Tests
C++ unit tests are located in the ./tests/cpp_tests folder and are written with the help of the Google Test framework. To run the tests locally, first refer to the Installation Guide for how to build them, then simply run the compiled executable file. It is highly recommended to build the tests with sanitizers.
14.7 Questions
Refer to FAQ.
Also, feel free to open an issue if you run into problems.
FIFTEEN
GPU TUNING GUIDE AND PERFORMANCE COMPARISON
In LightGBM, the main computation cost during training is building the feature histograms. We use an efficient algorithm on GPU to accelerate this process. The implementation is highly modular and works for all learning tasks (classification, ranking, regression, etc.). GPU acceleration also works in distributed learning settings. The GPU algorithm implementation is based on OpenCL and can work with a wide range of GPUs.
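As a toy illustration of the histogram step only (this is not LightGBM's actual implementation; the data is synthetic), the per-feature work amounts to accumulating gradient and hessian sums per discretized bin:
import numpy as np

rng = np.random.default_rng(0)
n, max_bin = 100_000, 63
binned_feature = rng.integers(0, max_bin, size=n)  # feature values already mapped to bins
gradients = rng.normal(size=n)
hessians = np.abs(rng.normal(size=n))

# Histogram construction: one gradient/hessian accumulator per bin.
grad_hist = np.bincount(binned_feature, weights=gradients, minlength=max_bin)
hess_hist = np.bincount(binned_feature, weights=hessians, minlength=max_bin)

# Split gains for this feature are then evaluated from cumulative sums over these bins.
print(grad_hist[:5], hess_hist[:5])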
We target AMD Graphics Core Next (GCN) architecture and NVIDIA Maxwell and Pascal architectures. Most AMD
GPUs released after 2012 and NVIDIA GPUs released after 2014 should be supported. We have tested the GPU
implementation on the following GPUs:
• AMD RX 480 with AMDGPU-pro driver 16.60 on Ubuntu 16.10
• AMD R9 280X (aka Radeon HD 7970) with fglrx driver 15.302.2301 on Ubuntu 16.10
• NVIDIA GTX 1080 with driver 375.39 and CUDA 8.0 on Ubuntu 16.10
• NVIDIA Titan X (Pascal) with driver 367.48 and CUDA 8.0 on Ubuntu 16.04
• NVIDIA Tesla M40 with driver 375.39 and CUDA 7.5 on Ubuntu 16.04
Using the following hardware is discouraged:
• NVIDIA Kepler (K80, K40, K20, most GeForce GTX 700 series GPUs) or earlier NVIDIA GPUs. They don’t
support hardware atomic operations in local memory space and thus histogram construction will be slow.
• AMD VLIW4-based GPUs, including Radeon HD 6xxx series and earlier GPUs. These GPUs have been discontinued for years and are rarely seen nowadays.
1. You want to run a few datasets that we have verified to show good speedup (including Higgs, epsilon, Bosch, etc.) to ensure your setup is correct. If you have multiple GPUs, make sure to set gpu_platform_id and gpu_device_id to use the desired GPU. Also make sure your system is idle (especially when using a shared computer) to get accurate performance measurements.
2. GPU training works best on large-scale, dense datasets. If the dataset is too small, training it on the GPU is inefficient because the data transfer overhead can be significant. If you have categorical features, use the categorical_column option and input them into LightGBM directly; do not convert them into one-hot variables.
3. To get a good speedup with the GPU, it is suggested to use a smaller number of bins. Setting max_bin=63 is recommended, as it usually does not noticeably affect training accuracy on large datasets, but GPU training can be significantly faster than with the default bin size of 255. For some datasets, even 15 bins are enough (max_bin=15); using 15 bins will maximize GPU performance. Make sure to check the run log and verify that the desired number of bins is used.
4. Try to use single-precision training (gpu_use_dp=false) when possible, because most GPUs (especially NVIDIA consumer GPUs) have poor double-precision performance. A parameter sketch combining these recommendations is shown after this list.
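The following is a hedged sketch, not taken from the docs above: it assumes a GPU-enabled build of LightGBM, and the synthetic data and parameter values are only illustrative.
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))
X[:, 0] = rng.integers(0, 5, size=10_000)  # integer-coded categorical column
y = rng.integers(0, 2, size=10_000)

params = {
    "objective": "binary",
    "device": "gpu",
    "gpu_platform_id": 0,   # OpenCL platform of the desired GPU
    "gpu_device_id": 0,     # device under that platform
    "max_bin": 63,          # smaller bin count, as recommended for GPU training
    "gpu_use_dp": False,    # single precision; most consumer GPUs are slow at double
    "verbose": -1,
}

# Pass categorical columns directly instead of one-hot encoding them.
train_set = lgb.Dataset(X, label=y, categorical_feature=[0])
booster = lgb.train(params, train_set, num_boost_round=100)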
We used the following hardware to evaluate the performance of LightGBM GPU training. Our CPU reference is a high-end dual-socket Haswell-EP Xeon server with 28 cores; the GPUs are a budget GPU (RX 480) and a mainstream GPU (GTX 1080) installed on the same server. It is worth mentioning that the GPUs used are not the best GPUs on the market; if you are using a better GPU (like AMD RX 580, NVIDIA GTX 1080 Ti, Titan X Pascal, Titan Xp, Tesla P100, etc.), you are likely to get a better speedup.
During benchmarking on CPU we used only the 28 physical cores of the CPU and did not use hyper-threading cores, because we found that using too many threads actually makes performance worse. The following shows the training configuration we used:
max_bin = 63
num_leaves = 255
num_iterations = 500
learning_rate = 0.1
tree_learner = serial
task = train
is_training_metric = false
min_data_in_leaf = 1
min_sum_hessian_in_leaf = 100
We use the configuration shown above, except for the Bosch dataset, where we use a smaller learning_rate=0.015 and set min_sum_hessian_in_leaf=5. For all GPU training we vary the maximum number of bins (255, 63 and 15). The GPU
implementation is from commit 0bb4a82 of LightGBM, when the GPU support was just merged in.
The following table lists the accuracy on the test set that the CPU and GPU learners can achieve after 500 iterations. The GPU with the same number of bins can achieve a similar level of accuracy as the CPU, despite using single-precision arithmetic. For most datasets, using 63 bins is sufficient.
We record the wall clock time after 500 iterations, as shown in the figure below:
When using a GPU, it is advisable to use a bin size of 63 rather than 255, because it can speed up training significantly without noticeably affecting accuracy. On CPU, using a smaller bin size only marginally improves performance, and sometimes even slows down training, as in Higgs (we can reproduce the same slowdown on two different machines, with different GCC versions). We found that the GPU can achieve impressive acceleration on large and dense datasets like Higgs and Epsilon. Even on smaller and sparser datasets, a budget GPU can still compete with, and be faster than, a 28-core Haswell server.
The next table shows GPU memory usage reported by nvidia-smi during training with 63 bins. We can see that even the largest dataset uses only about 1 GB of GPU memory, indicating that our GPU implementation can scale to huge datasets over 10x larger than Bosch or Epsilon. We can also observe that, generally, a larger dataset (using more GPU memory, like Epsilon or Bosch) has a better speedup, because the overhead of invoking GPU functions becomes significant when the dataset is small.
You can find more details about the GPU algorithm and benchmarks in the following article:
Huan Zhang, Si Si and Cho-Jui Hsieh. GPU Acceleration for Large-scale Tree Boosting. SysML Conference, 2018.
SIXTEEN
GPU SDK CORRESPONDENCE AND DEVICE TARGETING TABLE
OpenCL is a universal massively parallel programming framework that targets multiple backends (GPU, CPU, FPGA, etc.). Basically, to use a device from a vendor, you have to install drivers from that specific vendor. Intel's and AMD's OpenCL runtimes also include x86 CPU target support. NVIDIA's OpenCL runtime only supports NVIDIA GPUs (no CPU support). In general, OpenCL CPU backends are quite slow and should be used for testing and debugging only.
You can find below a table of correspondence:
Legend:
* AMD APP SDK is deprecated. On Windows, OpenCL is included in AMD graphics driver. On Linux, newer
generation AMD cards are supported by the ROCm driver. You can download an archived copy of AMD APP SDK
from our GitHub repo (for Linux and for Windows).
Your system might have multiple GPUs from different vendors (“platforms”) installed. Setting up a LightGBM GPU device requires two parameters: OpenCL Platform ID (gpu_platform_id) and OpenCL Device ID (gpu_device_id). Generally speaking, each vendor provides an OpenCL platform, and devices from the same vendor have different device IDs under that platform. For example, if your system has an Intel integrated GPU and two discrete GPUs from AMD, you will have two OpenCL platforms (with gpu_platform_id=0 and gpu_platform_id=1). If platform 0 is Intel, it has one device (gpu_device_id=0) representing the Intel GPU; if platform 1 is AMD, it has two devices (gpu_device_id=0, gpu_device_id=1) representing the two AMD GPUs. If you have a discrete GPU by AMD/NVIDIA and an integrated GPU by Intel, make sure to select the correct gpu_platform_id to use the discrete GPU, as it usually provides better performance.
On Windows, OpenCL devices can be queried using GPUCapsViewer, under the OpenCL tab. Note that the platform and device IDs reported by this utility start from 1, so you should subtract 1 from the reported IDs.
On Linux, OpenCL devices can be listed using the clinfo command. On Ubuntu, you can install clinfo by executing
sudo apt-get install clinfo.
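If you prefer to do this from Python, a small sketch using the third-party pyopencl package (an assumption on our part; LightGBM itself does not require it) lists platforms and devices in the same zero-based order that gpu_platform_id and gpu_device_id generally follow:
import pyopencl as cl

# Enumerate OpenCL platforms and devices; the zero-based indices printed here
# should correspond to LightGBM's gpu_platform_id and gpu_device_id parameters.
for platform_id, platform in enumerate(cl.get_platforms()):
    print(f"gpu_platform_id={platform_id}: {platform.name}")
    for device_id, device in enumerate(platform.get_devices()):
        dev_type = cl.device_type.to_string(device.type)
        print(f"  gpu_device_id={device_id}: {device.name} ({dev_type})")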
16.3 Examples
We provide test R code below, but you can use the language of your choice with the examples of your choice:
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
train$data[, 1] <- 1:6513
dtrain <- lgb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)
valids <- list(test = dtest)
Make sure you list the OpenCL devices in your system and set gpu_platform_id and gpu_device_id correctly. In
the following examples, our system has 1 GPU platform (gpu_platform_id = 0) from AMD APP SDK. The first
device gpu_device_id = 0 is a GPU device (AMD Oland), and the second device gpu_device_id = 1 is the x86
CPU backend.
Example of using GPU (gpu_platform_id = 0 and gpu_device_id = 0 in our system):
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=16 and depth=8
[1]: test's rmse:1.10643e-17
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=7 and depth=5
[2]: test's rmse:0
Running on OpenCL CPU backend devices is generally slow, and we observe crashes on some Windows and macOS systems. Make sure to check the Using GPU Device line in the log and verify that it is not using a CPU. The above log shows that we are using the Oland GPU from AMD and not the CPU.
Example of using CPU (gpu_platform_id = 0, gpu_device_id = 1). The GPU device reported is Intel(R)
Core(TM) i7-4600U CPU, so it is using the CPU backend rather than a real GPU.
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=16 and depth=8
Known issues:
• Using a bad combination of gpu_platform_id and gpu_device_id can potentially lead to a crash due to OpenCL driver issues on some machines (you will lose your entire session content). Be careful.
• On some systems, if you have an integrated graphics card (Intel HD Graphics) and a dedicated graphics card (AMD, NVIDIA), the dedicated graphics card will automatically override the integrated graphics card. The workaround is to disable your dedicated graphics card to be able to use your integrated graphics card.
CHAPTER
SEVENTEEN
GPU WINDOWS COMPILATION
17.1 Install LightGBM GPU version in Windows (CLI / R / Python), using MinGW/gcc
This is for a vanilla installation of Boost, including full compilation steps from source without precompiled libraries.
Installation steps (depends on what you are going to do):
• Install the appropriate OpenCL SDK
• Install MinGW
• Install Boost
• Install Git
• Install CMake
• Create LightGBM binaries
• Debugging LightGBM in CLI (if GPU is crashing or any other crash reason)
If you wish to use another compiler like Visual Studio C++ compiler, you need to adapt the steps to your needs.
For this compilation tutorial, we are using the AMD SDK for our OpenCL steps. However, you are free to use any OpenCL SDK you want; you just need to adjust the PATH correctly.
You will also need administrator rights. This will not work without them.
At the end, you can restore your original PATH.
To modify PATH, just follow the pictures after going to the Control Panel:
This does not apply to you if you use neither a third-party antivirus nor the default preinstalled antivirus on Windows.
Windows Defender or any other antivirus will have a significant impact on the speed at which you will be able to perform these steps. It is recommended to turn them off temporarily until you have finished building and setting up everything, then turn them back on, if you are using them.
Installing the appropriate OpenCL SDK requires you to download the correct vendor source SDK. You need to know which devices you are going to use to run LightGBM!
• For running on Intel, get Intel SDK for OpenCL (NOT RECOMMENDED).
• For running on AMD, get AMD APP SDK (downloads for Linux and for Windows). You may want to replace
the OpenCL.dll from the GPU driver package with the one from the SDK, if the one shipped with the driver
lacks some functions.
• For running on NVIDIA, get CUDA Toolkit.
• Or you can try to use the official Khronos OpenCL headers; the CMake module will automatically find the OpenCL library used in your system, though the result may not be portable.
Further reading and correspondence table (especially if you intend to use cross-platform devices, like Intel CPU with
AMD APP SDK): GPU SDK Correspondence and Device Targeting Table.
Warning: using Intel OpenCL is not recommended and may crash your machine because it is not compliant with OpenCL standards. If your objective is to use LightGBM + OpenCL on CPU, please use the AMD APP SDK instead (it can also run on Intel CPUs without any issues).
If you are expecting to use LightGBM without R, you need to install MinGW. Installing MinGW is straightforward: just download this.
Make sure you are using the x86_64 architecture, and do not modify anything else. You may choose a version other
than the most recent one if you need a previous MinGW version.
Then, add to your PATH the following (to adjust to your MinGW version):
C:\Program Files\mingw-w64\x86_64-5.3.0-posix-seh-rt_v4-rev0\mingw64\bin
To check whether you need 32-bit or 64-bit MinGW for R, install LightGBM as usual and check for the following:
If it says mingw_64 then you need the 64-bit version (PATH with c:\Rtools\bin;c:\Rtools\mingw_64\bin),
otherwise you need the 32-bit version (c:\Rtools\bin;c:\Rtools\mingw_32\bin), the latter being a very rare
and untested case.
NOTE: If you are using Rtools 4.0 or later, the path will have mingw64 instead of mingw_64 (PATH with C:\rtools40\mingw64\bin), and mingw32 instead of mingw_32 (C:\rtools40\mingw32\bin). The 32-bit version remains an unsupported solution under Rtools 4.0.
Download Prebuilt Boost x86_64 or Prebuilt Boost i686 and unpack it with 7zip; alternatively, you can build Boost from source.
Installing Boost requires downloading Boost and installing it. It takes from about 10 minutes to several hours, depending on your CPU and network speed.
We will assume an installation in C:\boost and a general installation (like in Unix variants: without versioning and
without type tags).
There is one mandatory step to check the compiler:
• Warning: if you want the R installation: If you already have MinGW in your PATH variable, get rid of it (you will link to the wrong compiler otherwise).
• Warning: if you want the CLI installation: If you already have Rtools in your PATH variable, get rid of it (you will link to the wrong compiler otherwise).
• R installation must have Rtools in PATH
• CLI / Python installation must have MinGW (not Rtools) in PATH
In addition, assuming you are going to use C:\boost for the folder path, you should now add the following to your PATH: C:\boost\boost-build\bin, C:\boost\boost-build\include\boost. Adjust C:\boost if you install it elsewhere.
We can now start downloading and compiling the required Boost libraries:
• Download Boost (for example, the filename for 1.63.0 version is boost_1_63_0.zip)
• Extract the archive to C:\boost
• Open a command prompt, and run
cd C:\boost\boost_1_63_0\tools\build
bootstrap.bat gcc
b2 install --prefix="C:\boost\boost-build" toolset=gcc
cd C:\boost\boost_1_63_0
To build the Boost libraries, you have two choices in the command prompt:
• If you have only a single core, you can use the default b2 install command shown above.
• If you want to do a multithreaded library build (faster), add -j N to that command, replacing N with the number of cores/threads you have. For instance, for 2 cores, you would add -j 2.
Ignore any errors that pop up (about Python, etc.); they do not matter here.
Your folder should look like this at the end (not fully detailed):
- C
|--- boost
|------ boost_1_63_0
|--------- some folders and files
|------ boost-build
|--------- bin
|--------- include
|------------ boost
|--------- lib
|--------- share
This is what you should (approximately) get at the end of Boost compilation:
Now, we can fetch the LightGBM repository from GitHub. Run Git Bash and the following commands:
cd C:/
mkdir github_repos
cd github_repos
git clone --recursive https://github.com/microsoft/LightGBM
Your LightGBM repository copy should now be under C:\github_repos\LightGBM. You are free to use any folder you want, but you will have to adapt the paths accordingly.
Keep Git Bash open.
Next, open the CMake GUI, set the source folder to C:/github_repos/LightGBM and the build folder to C:/github_repos/LightGBM/build, then:
• Click Configure
You should get (approximately) the following after clicking Configure:
Generating done
This is straightforward, as CMake provides a lot of help in locating the correct elements.
Installation in CLI
• If you closed the Git Bash console previously, run this to get back to the build folder:
cd C:/github_repos/LightGBM/build
• If you did not close the Git Bash console previously, run this to get to the build folder:
cd LightGBM/build
• Set up MinGW as make using:
alias make='mingw32-make'
If everything was done correctly, you have now compiled the LightGBM CLI with GPU support!
Testing in CLI
You can now test LightGBM directly in CLI in a command prompt (not Git Bash):
cd C:/github_repos/LightGBM/examples/binary_classification
"../../lightgbm.exe" config=train.conf data=binary.train valid=binary.test␣
˓→objective=binary device=gpu
Now that you have compiled LightGBM, you try it... and you always see a segmentation fault or an undocumented crash with GPU support:
Please check if you are using the right device (Using GPU device: ...). You can find a list of your OpenCL devices
using GPUCapsViewer, and make sure you are using a discrete (AMD/NVIDIA) GPU if you have both integrated (Intel)
and discrete GPUs installed. Also, try to set gpu_device_id = 0 and gpu_platform_id = 0 or gpu_device_id
= -1 and gpu_platform_id = -1 to use the first platform and device or the default platform and device. If it still
does not work, then you should follow all the steps below.
You will have to redo the compilation steps for LightGBM to add debugging mode. This involves:
• Deleting C:/github_repos/LightGBM/build folder
• Deleting lightgbm.exe, lib_lightgbm.dll, and lib_lightgbm.dll.a files
Once you have removed these files, go into CMake and follow the usual steps. Before clicking "Generate", click on "Add Entry" and add a boolean USE_DEBUG entry set to ON (equivalent to passing -DUSE_DEBUG=ON to CMake):
And then, follow the regular LightGBM CLI installation from there.
Once you have installed LightGBM CLI, assuming your LightGBM is in C:\github_repos\LightGBM, open a com-
mand prompt and run the following:
gdb --args "../../lightgbm.exe" config=train.conf data=binary.train valid=binary.test␣
˓→objective=binary device=gpu
There, type backtrace and press the Enter key as many times as gdb requests it:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffbb37c11f1 in strlen () from C:\Windows\system32\msvcrt.dll
(gdb) backtrace
#0 0x00007ffbb37c11f1 in strlen () from C:\Windows\system32\msvcrt.dll
#1 0x000000000048bbe5 in std::char_traits<char>::length (__s=0x0)
at C:/PROGRA~1/MINGW-~1/X86_64~1.0-P/mingw64/x86_64-w64-mingw32/include/c++/bits/
˓→char_traits.h:267
at C:/PROGRA~1/MINGW-~1/X86_64~1.0-P/mingw64/x86_64-w64-mingw32/include/c++/bits/
˓→basic_string.tcc:1157
#3 boost::compute::detail::appdata_path[abi:cxx11]() () at C:/boost/boost-build/include/
˓→boost/compute/detail/path.hpp:38
at C:/boost/boost-build/include/boost/compute/detail/path.hpp:46
#5 0x00000000004913de in boost::compute::program::load_program_binary (hash=
˓→"d27987d5bd61e2d28cd32b8d7a7916126354dc81", ctx=...)
at C:/boost/boost-build/include/boost/compute/program.hpp:605
#6 0x0000000000490ece in boost::compute::program::build_with_source (
source="\n#ifndef _HISTOGRAM_256_KERNEL_\n#define _HISTOGRAM_256_KERNEL_\n\n#pragma␣
l-fast-relaxed-math") at C:/boost/boost-build/include/boost/compute/program.hpp:549
#7 0x0000000000454339 in LightGBM::GPUTreeLearner::BuildGPUKernels () at C:\LightGBM\
˓→src\treelearner\gpu_tree_learner.cpp:583
#9 0x0000000000455e7e in LightGBM::GPUTreeLearner::BuildGPUKernels␣
˓→(this=this@entry=0x3b9cac0)
at C:\LightGBM\src\treelearner\gpu_tree_learner.cpp:569
#10 0x0000000000457b49 in LightGBM::GPUTreeLearner::InitGPU (this=0x3b9cac0, platform_id=
˓→<optimized out>, device_id=<optimized out>)
at C:\LightGBM\src\treelearner\gpu_tree_learner.cpp:720
#11 0x0000000000410395 in LightGBM::GBDT::ResetTrainingData (this=0x1f26c90, config=
˓→<optimized out>, train_data=0x1f28180, objective_function=0x1f280e0,
Right-click the command prompt, click “Mark”, and select all the text from the first line (with the command prompt
containing gdb) to the last line printed, containing all the log, such as:
at C:/PROGRA~1/MINGW-~1/X86_64~1.0-P/mingw64/x86_64-w64-mingw32/include/c++/bits/
˓→basic_string.tcc:1157
#3 boost::compute::detail::appdata_path[abi:cxx11]() () at C:/boost/boost-build/include/
˓→boost/compute/detail/path.hpp:38
at C:/boost/boost-build/include/boost/compute/detail/path.hpp:46
#5 0x00000000004913de in boost::compute::program::load_program_binary (hash=
˓→"d27987d5bd61e2d28cd32b8d7a7916126354dc81", ctx=...)
at C:/boost/boost-build/include/boost/compute/program.hpp:605
#6 0x0000000000490ece in boost::compute::program::build_with_source (
source="\n#ifndef _HISTOGRAM_256_KERNEL_\n#define _HISTOGRAM_256_KERNEL_\n\n#pragma␣
˓→OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable\n#pragma OPENCL EXTENSION cl_
˓→relaxed-math") at C:/boost/boost-build/include/boost/compute/program.hpp:549
at C:\LightGBM\src\treelearner\gpu_tree_learner.cpp:569
#10 0x0000000000457b49 in LightGBM::GPUTreeLearner::InitGPU (this=0x3b9cac0, platform_id=
˓→<optimized out>, device_id=<optimized out>)
at C:\LightGBM\src\treelearner\gpu_tree_learner.cpp:720
#11 0x0000000000410395 in LightGBM::GBDT::ResetTrainingData (this=0x1f26c90, config=
˓→<optimized out>, train_data=0x1f28180, objective_function=0x1f280e0,
EIGHTEEN
RECOMMENDATIONS WHEN USING GCC
It is recommended to use -O3 -mtune=native to achieve maximum speed during LightGBM training.
Using an Intel Ivy Bridge CPU on the 1M x 1K Bosch dataset, the performance increases as follows:
NINETEEN
DOCUMENTATION
Documentation for LightGBM is generated using Sphinx and Breathe, which works on top of Doxygen output.
The list of parameters and their descriptions in Parameters.rst is generated automatically from comments in the config file by this script.
After each commit on master, documentation is updated and published to Read the Docs.
19.1 Build
It is not necessary to re-build this documentation while modifying LightGBM’s source code. The HTML files generated
using Sphinx are not checked into source control. However, you may want to build them locally during development
to test changes.
19.1.1 Docker
The most reliable way to build the documentation locally is with Docker, using the same images Read the Docs uses.
Run the following from the root of this repository to pull the relevant image and run a container locally.
docker run \
--rm \
--user=0 \
-v $(pwd):/opt/LightGBM \
--env C_API=true \
--env CONDA=/opt/miniforge \
--env READTHEDOCS=true \
--workdir=/opt/LightGBM/docs \
--entrypoint="" \
readthedocs/build:ubuntu-20.04-2021.09.23 \
/bin/bash build-docs.sh
Note: The navigation in these locally-built docs does not link to the local copy of the R documentation. To view the
local version of the R docs, open docs/_build/html/R/index.html in your browser.
You can also build the documentation locally without Docker. Just install Doxygen and run the Sphinx build in the docs folder.
Note that this will not build the R documentation. Consider using common R utilities for documentation generation, if
you need it. Or use the Docker-based approach described above to build the R documentation locally.
Optionally, you may also install scikit-learn and get richer documentation for the classes in Scikit-learn API.
If you face any problems with the Doxygen installation, or you simply do not need documentation for the C code, it is possible to build the documentation without it.