
[WIP] first cut at LambdaMART #2580


Closed · jwkvam wants to merge 21 commits into scikit-learn:master from jwkvam:lambdamart

Conversation

@jwkvam (Contributor) commented Nov 9, 2013

This PR is an attempt to implement LambdaMART [1]. I imagine the biggest hurdle will be reaching a conclusion on the correct API, since we need to pass in the queries somehow. In my implementation I use an extra keyword argument; I don't know whether this causes problems elsewhere. My hope is that this PR can serve as a catalyst to resolve that, and then I can finish up the PR.

TODO

  • tests
  • docs
  • gbm comparison
  • Ranklib comparison
  • support an optional cutoff parameter for NDCG, usually denoted as NDCG@k
  • rewrite the example to use the yandex 2009 dataset instead of MQ2007/8, which would require the unrar command to extract
  • add a max_rank=10 cutoff parameter for NDCG and LambdaMART and use it to compute the lambdas too
  • factorize a public function ndcg_score in sklearn.metrics in pure Python first (see this comment; a rough sketch appears after these lists)
  • benchmark python and cython DCG
  • implement pessimistic tie break for ndcg_score and bench it against no-tie break (default implementation sort order) and random tie break
  • clarify if the pessimistic tie break is only needed for the stability of the ndcg_score itself or for its derivatives in the LambdaMART optimization loop as well (see this comment)
  • investigate if pre-computing (caching) the ideal DCG for each query on the training set as done by Ranklib can speed up learning

Possibly for another PR:

  • abstract code to support measures other than NDCG
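
Since several of the TODO items above revolve around a pure-Python ndcg_score with a rank cutoff, here is a minimal sketch of what such a function could look like. The function names, the k=10 default, and the 2**rel - 1 gain convention are illustrative assumptions, not a settled API:

    import numpy as np

    def dcg_score(y_true, y_score, k=10):
        # Rank samples by predicted score (descending) and keep the top k.
        order = np.argsort(y_score)[::-1]
        gains = 2 ** y_true[order][:k] - 1
        # Position i (1-based) is discounted by log2(i + 1).
        discounts = np.log2(np.arange(2, gains.size + 2))
        return np.sum(gains / discounts)

    def ndcg_score(y_true, y_score, k=10):
        # Normalize by the ideal DCG, i.e. the DCG of ranking by the true labels.
        # (No guard here for all-zero relevance, which would divide by zero.)
        return dcg_score(y_true, y_score, k) / dcg_score(y_true, y_true, k)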

There was also a brief discussion on the mailing list [2]. Pinging @mblondel @ogrisel @pprett from that discussion.

[1] https://research.microsoft.com/en-us/um/people/cburges/tech_reports/msr-tr-2010-82.pdf
[2] http://sourceforge.net/mailarchive/forum.php?thread_name=CAP%2B3rpGVbSux5u4dZiatV3p1f1zUvndHXoi-oh4CjMRtSpjFsw%40mail.gmail.com&forum_name=scikit-learn-general

@coveralls

Coverage Status: Coverage remained the same when pulling 1afaf09 on jwkvam:lambdamart into c0e686b on scikit-learn:master.

@agramfort (Member)

One option we considered with @fabianp to support queries without changing the API too much is to have a 2-column y where the second column is the query index. It just means that y.ndim can now be 2.
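
To make that concrete, a minimal sketch of what the encoding could look like (purely illustrative, not an agreed-upon API):

    import numpy as np

    # Column 0: relevance label, column 1: query index.
    y = np.array([[2, 0],   # relevance 2, query 0
                  [0, 0],   # relevance 0, query 0
                  [3, 1],   # relevance 3, query 1
                  [1, 1]])  # relevance 1, query 1

    if y.ndim == 2:
        relevance, query = y[:, 0], y[:, 1]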

@pprett (Member) commented Nov 9, 2013

@agramfort I think that might indeed be a better solution than having an additional fit argument.

@mblondel (Member)

> One option we considered with @fabianp to support queries without changing the API too much is to have a 2-column y where the second column is the query index. It just means that y.ndim can now be 2.

Sounds good. Plus, such an API would be compatible with GridSearchCV.

In the future, we can create query-aware cross-validation generators.

@jwkvam I think I would use a dedicated class for LambdaMART.

@coveralls

Coverage Status: Coverage remained the same when pulling b217597 on jwkvam:lambdamart into c0e686b on scikit-learn:master.

@jwkvam (Contributor, Author) commented Nov 11, 2013

Thanks for the feedback, adding the queries into the labels seems reasonable to me.

If this has its own class, I'm trying to think of the ways it would differ from GradientBoostingRegressor. I don't think the RegressorMixin makes much sense; I could write a score function that depends on the loss, and maybe we could abstract that out into a mixin later? Would it be worthwhile to let fit() in BaseGradientBoosting take an additional keyword argument, so that I could split up y in LambdaMART's fit()? Or should I just leave all the fits alone?

@mblondel (Member)

In the case of ranking with a single query, LambdaMART could accept a 1d y too.

@ogrisel (Member) commented Nov 13, 2013

> One option we considered with @fabianp to support queries without changing the API too much is to have a 2-column y where the second column is the query index. It just means that y.ndim can now be 2.

> Sounds good. Plus, such an API would be compatible with GridSearchCV.

I am not strongly opposed to it, but I still find it a bit weird to mix integer coding for the relevance scores with floating point coding for the query ids.

> In the future, we can create query-aware cross-validation generators.

Indeed we will need it at some point.

@ogrisel (Member) commented Nov 13, 2013

@jwkvam I think it's ok not to use the regressor mixin and not to introduce a new ranking mixin yet. If we ever implement more ranking models in the future, we can devise a ranking mixin to factor out the common code at that point.

@mblondel (Member)

> I am not strongly opposed to it, but I still find it a bit weird to mix integer coding for the relevance scores with floating point coding for the query ids.

Also 2d y could be used for multi-output ranking.

@ogrisel (Member) commented Nov 13, 2013

I think we need a general concept of "sample metadata": data structures whose first axis is n_samples and that should be sliced or resampled consistently with the first axis of the input data when doing cross-validation or resampling. Potentially impacted data structures:

  • sample groups: integers to encode sample groupings per subject, per experiment, per session, or per query; useful for scoring learning-to-rank models, but also for informed cross-validation that takes the non-IIDness of the data into account (e.g. with LeavePLabelOut)
  • sample weights: floating point values to reweight the cost function, e.g. for boosting or for dealing with imbalanced datasets
  • sample unique ids: integer or string ids to trace the provenance of individual samples, especially useful when re-sampling or nudging the dataset or when introspecting prediction outcomes in cross-validation folds
  • the target variable y for supervised learning (integers for binary, multiclass or multilabel classification, floating point for regression with potentially several outputs, or even a mix of the two for multitask learning)

There might be more cases that I forgot.
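
The common thread is that all of these can be indexed with the same sample indices as X. A toy sketch of that invariant (hypothetical variable names):

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.rand(6, 3)
    y = rng.randint(0, 3, size=6)
    sample_weight = np.ones(6)
    sample_group = np.array([0, 0, 1, 1, 2, 2])

    # Every sample-wise array is sliced with the same indices as X
    # when building a cross-validation fold.
    train_idx = np.array([0, 1, 2, 3])
    X_tr, y_tr = X[train_idx], y[train_idx]
    sw_tr, sg_tr = sample_weight[train_idx], sample_group[train_idx]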

The policy until recently was to pass them as separate variables to the fit method, either as positional args (for the target variable(s) y) or as kwargs (e.g. for sample_weight).

However, some parts of our API are currently broken or too limited. For instance, transformers currently do not transform (or preserve) y or sample_weight, making it impossible to pipeline transformers that change the number of samples (e.g. nudging transformers for data expansion, or resampling transformers).

For now I would thus keep the query ids as a separate argument to fit and to the NDCG scoring function, and open a separate issue to devise how the API should evolve to better deal with sample-wise auxiliary data.

There is also the problem of feature-wise metadata, such as feature names, that could be interesting to better preserve or trace when working with FeatureUnion or feature selection transformers, but this is completely unrelated to this PR.

@ogrisel (Member) commented Nov 13, 2013

> I am not strongly opposed to it, but I still find it a bit weird to mix integer coding for the relevance scores with floating point coding for the query ids.

> Also 2d y could be used for multi-output ranking.

This is an even better reason not to implicitly stick the query membership info directly as a new column in y.

@mblondel (Member)

> I think it's ok not to use the regressor mixin and not to introduce a new ranking mixin yet

Alright, then LambdaMART will need to implement its own score method.

@mblondel (Member)

> For now I would thus keep the query ids as a separate argument to fit

In any case, query should be optional (if not provided, a single query is assumed). Another remark is that query is an information retrieval term. We could use the more neutral groups instead.
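
A minimal sketch of that default behaviour, assuming a fit(X, y, group=None) style signature (the helper name is hypothetical):

    import numpy as np

    def _check_group(group, n_samples):
        # With no group info, treat the whole dataset as a single query/group.
        if group is None:
            return np.zeros(n_samples, dtype=np.int32)
        return np.asarray(group, dtype=np.int32)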

@pprett (Member) commented Nov 13, 2013

@mblondel That makes sense for some ranking models (e.g. RankingSVM), but does it make sense for LambdaMART? Does it optimize AUC if you feed it binary relevance scores?

@ogrisel (Member) commented Nov 13, 2013

> In any case, query should be optional (if not provided, a single query is assumed). Another remark is that query is an information retrieval term. We could use the more neutral groups instead.

+1 or maybe even sample_group to be consistent with sample_weight? This way we could use a naming convention in CV or resampling utilities to know which auxiliary data params should be consistently sliced and resampled with the input data and target value.

> @mblondel That makes sense for some ranking models (e.g. RankingSVM), but does it make sense for LambdaMART? Does it optimize AUC if you feed it binary relevance scores?

If you feed LambdaMART binary relevance scores, it just optimizes NDCG with rel in {0, 1}, which is different from AUC (but a good ranking metric anyway).
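
A self-contained toy illustration of the difference; the inline dcg helper just mirrors the standard definition:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def dcg(rel, k):
        return np.sum((2.0 ** rel[:k] - 1) / np.log2(np.arange(2, k + 2)))

    y_true = np.array([1, 0, 0, 1])           # binary relevance
    y_score = np.array([0.9, 0.8, 0.7, 0.1])  # predicted ranking scores

    order = np.argsort(y_score)[::-1]
    ideal = np.sort(y_true)[::-1]
    print(roc_auc_score(y_true, y_score))         # 0.5: half the pos/neg pairs are inverted
    print(dcg(y_true[order], 4) / dcg(ideal, 4))  # ~0.88: the top-ranked positive dominates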

@mblondel (Member)

> @mblondel That makes sense for some ranking models (e.g. RankingSVM), but does it make sense for LambdaMART? Does it optimize AUC if you feed it binary relevance scores?

The way I see it, ranking with a single query is similar to ordinal regression in the case when the supervision is given as relevance scores (from which you can derive a ranking). In the case of pairwise or listwise methods, the supervision can also take the form of pairwise or listwise orderings of the samples, i.e., without the notion of relevance score.

NDCG is nice because it focuses on the top k elements in the ranking whereas AUC takes into account the entire ranking (usually you care more about the top elements).

ranking = self.loss in ('ndcg')
if ranking:
    if query is None:
        raise ValueError("query must not be none with ranking measure")
Review comment (Member):

As @mblondel said, I think we should treat the X, y data as stemming from a single query in that case.

Review comment (Contributor, Author):

Thanks for the reminder, I will get around to it eventually.

@jwkvam (Contributor, Author) commented Jan 2, 2014

Sorry for the lack of updates; admittedly I've been kind of lazy. I benchmarked the code (there are some unpushed changes) against GBM, and the performance seems comparable apart from the execution times. I used 50 trees, a depth of 3 for GBRT and interaction.depth of 8 for GBM, a learning rate of 0.1, and no subsampling of features or data. GBRT seems to be overfitting slightly more than GBM; this is just a guess, but it could be because GBM does some sort of normalization of the gradient [1]. I tried doing something similar out of curiosity but didn't see much difference; it's possible I miscoded it, however.

Dataset  Software  Training NDCG       Validation NDCG     Training Time
MQ2007   GBM       0.7206621488609668  0.6737084491077485  10.3 s
MQ2007   GBRT      0.7405864270405591  0.6777565353942717  30.6 s
MQ2008   GBM       0.7971141434532627  0.7728533219225161   2.1 s
MQ2008   GBRT      0.8105098974258365  0.7762440078297614   9.6 s

[1] https://code.google.com/p/gradientboostedmodels/source/browse/gbm/src/pairwise.cpp#786

@@ -585,6 +665,22 @@ def fit(self, X, y):
        X, = check_arrays(X, dtype=DTYPE, sparse_format="dense",
                          check_ccontiguous=True)
        y = column_or_1d(y, warn=True)
        ranking = self.loss in ('ndcg')
Review comment (Member):

You need a ('ndcg',) here for the single element tuple. I wonder how this code can work otherwise...
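
For the record, the reason the buggy line still "works": ('ndcg') is not a tuple, just a parenthesized string, so in falls back to substring containment:

    # All of these hold in Python:
    assert ('ndcg') == 'ndcg'    # parentheses alone do not make a tuple
    assert 'ndcg' in ('ndcg')    # substring containment: True by accident
    assert 'ndcg' in ('ndcg',)   # the intended one-element tuple membership test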

@ogrisel (Member) commented Jan 13, 2014

@jwkvam could you please rebase this branch on top of master to take @pprett's recent refactoring into account?

Also, could you write a dataset loader for the MQ2007 and MQ2008 datasets, use them as part of a new example in the examples/applications folder, and compare the results with a pointwise linear regression model?

@ogrisel (Member) commented Jan 13, 2014

Also could you please run the MQ2008 benchmark with line_profiler enabled on the fit method and report the results here?

@jwkvam (Contributor, Author) commented Jan 25, 2014

@ogrisel I rebased so it's on top of @pprett's changes.

Other changes include:

  • The group argument is no longer necessary; if it isn't present, the entire dataset is treated as one group.
  • Followed gbm's approach of not modifying the target values by (2**y - 1); if users want that, they need to preprocess the labels themselves (see the sketch after this list). Unfortunately this can lead to degenerate NDCG cases: for example, if the relevance values for two samples are log(2) and -log(3), the optimal DCG is zero, which can easily lead to -Inf NDCG values. I like the power this gives to the user, but it can lead to bad things happening.
  • If all the relevance values in a group are identical, gbm makes the NDCG NaN. I like this approach and was going to incorporate it, since such groups can otherwise inflate the score of the overall set, but I thought I'd check with others first.
  • Modified ZeroEstimator so it can be used with LambdaMART with integer values. It was a quick hack, so I expect people may want a different approach.
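
For the second point, recovering the classic exponential gain under this convention means preprocessing the labels yourself, e.g. (sketch):

    import numpy as np

    # gbm-style convention: labels are used as gains directly, so apply
    # the classic exponential gain before fitting if you want it.
    relevance = np.array([0, 1, 2, 3])
    y = 2.0 ** relevance - 1.0   # [0., 1., 3., 7.]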

Running my same benchmark on MQ2008 takes 6.4 seconds now. I wasn't able to get line_profiler working, but here are the main culprits. I imagine I could speed up the NDCG code by presorting the labels and pushing the loops into Cython.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    2.309    0.046    2.329    0.047 {method 'build' of 'sklearn.tree._tree.DepthFirstTreeBuilder' objects}
       50    2.165    0.043    2.942    0.059 gradient_boosting.py:394(negative_gradient)
      154    2.118    0.014    2.620    0.017 gradient_boosting.py:364(__call__)
        4    0.448    0.112    0.486    0.121 npyio.py:880(savetxt)
        1    0.436    0.436    9.121    9.121 parse.py:3(<module>)

I'll add writing up that example to my TODO list.

I added LambdaMART to all the tests of the form for Cls in ...; they all pass aside from test_warm_start_oob_switch, and I still need to figure out why that one is failing.

@ogrisel (Member) commented Jan 26, 2014

Hi @jwkvam, could you please re-run some benchmarks using the max_leaf_nodes parameter instead of max_depth? It should be closer to GBM's behavior.

@jwkvam (Contributor, Author) commented Jan 27, 2014

@ogrisel yes, thanks for your continued feedback.

I set max_leaf_nodes=9 which is equivalent to GBM's interaction.depth=8.

Dataset  Software                Training NDCG       Validation NDCG     Training Time
MQ2007   GBM                     0.7206621488609668  0.6737084491077485  10.3 s
MQ2007   GBRT max_depth=3        0.7405864270405591  0.6777565353942717  21.7 s
MQ2007   GBRT max_leaf_nodes=9   0.730921178027      0.679968120327      21.6 s
MQ2008   GBM                     0.7971141434532627  0.7728533219225161   2.1 s
MQ2008   GBRT max_depth=3        0.8105098974258365  0.7762440078297614   6.1 s
MQ2008   GBRT max_leaf_nodes=9   0.794975081183      0.773451965618       5.9 s
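
For reference, the GBRT runs above correspond roughly to the following setup (a sketch using the LambdaMART class as it exists on this branch; X_train, y_train and qid_train stand in for data loaded elsewhere):

    from sklearn.ensemble import LambdaMART  # only available on this PR branch

    model = LambdaMART(n_estimators=50, learning_rate=0.1,
                       max_leaf_nodes=9, subsample=1.0)
    model.fit(X_train, y_train, group=qid_train)  # X_train etc. loaded elsewhere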

                 warm_start=False):

        super(LambdaMART, self).__init__(
            loss, learning_rate, n_estimators, min_samples_split,
Review comment (Member):

Please hardcode 'ndcg' for now, as there is no alternative ranking loss implemented for LambdaMART.

@fulmicoton

I started testing this implementation's performance and validating its results against Ranklib yesterday evening on the Yandex dataset.

It runs smoothly with subsample=1.0, but subsample < 1.0 fails with the following error.

  File "/tmp/1390867768644-0/script.py", line 51, in <module>
    model.fit(X, score, monitor=None, group=group)
  File "/home/dataiku/scikit-learn/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 1720, in fit
    return super(LambdaMART, self).fit(X, y, monitor, group)
  File "/home/dataiku/scikit-learn/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 901, in fit
    begin_at_stage, monitor, group)
  File "/home/dataiku/scikit-learn/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 958, in _fit_stages
    y_pred[~sample_mask], group=group_oob)
  File "/home/dataiku/scikit-learn/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 369, in __call__
    last_group = group[0]
IndexError: index out of bounds

Has someone encountered the same error? Also, let me know if you want access to the dataset (around 11 GB).

@ogrisel (Member) commented Jan 28, 2014

@poulejapon I was about to report the exact same error when trying on MSLR-10K:

Traceback (most recent call last):

  File "<ipython-input-47-8b8bb31a05fb>", line 1, in <module>
    get_ipython().run_cell_magic(u'time', u'', u'\nfrom sklearn.ensemble import LambdaMART\n\nlmart= LambdaMART(n_estimators=100, max_leaf_nodes=7,\n                  learning_rate=0.03,\n                  subsample=0.5, random_state=1, verbose=1)\nlmart.fit(X_train_small, y_train_small, group=qid_train_small)')
  File "/volatile/ogrisel/envs/py27/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2129, in run_cell_magic
    result = fn(magic_arg_s, cell)
  File "<string>", line 2, in time
  File "/volatile/ogrisel/envs/py27/local/lib/python2.7/site-packages/IPython/core/magic.py", line 191, in <lambda>
    call = lambda f, *a, **k: f(*a, **k)
  File "/volatile/ogrisel/envs/py27/local/lib/python2.7/site-packages/IPython/core/magics/execution.py", line 1045, in time
    exec code in glob, local_ns
  File "<timed exec>", line 7, in <module>
  File "/volatile/ogrisel/code/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 1720, in fit
    return super(LambdaMART, self).fit(X, y, monitor, group)
  File "/volatile/ogrisel/code/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 901, in fit
    begin_at_stage, monitor, group)
  File "/volatile/ogrisel/code/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 950, in _fit_stages
    random_state)
  File "_gradient_boosting.pyx", line 315, in sklearn.ensemble._gradient_boosting._ranked_random_sample_mask (sklearn/ensemble/_gradient_boosting.c:4479)
ValueError: Buffer dtype mismatch, expected 'int32' but got 'long'

@fulmicoton

Not quite the same. To get mine, I think you need to .astype(np.int32) your group vector. :)
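
In other words, the workaround (reusing the variable names from the tracebacks above) is something like:

    import numpy as np

    # Cast the group vector to int32 before fitting to avoid the buffer
    # dtype mismatch in _ranked_random_sample_mask (a workaround, not a fix).
    group = group.astype(np.int32)
    model.fit(X, score, monitor=None, group=group)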

@glouppe (Contributor) commented Feb 27, 2014

Btw, sorry to bother you with these design issues. Your contribution is really great! But we really have to be careful not to overcomplicate our implementations. Otherwise, with time and additional features, it will blow up to the point where nobody can understand what is going on...

@jwkvam (Contributor, Author) commented Feb 27, 2014

No, I agree completely. I'm not too happy with the new ZeroEstimator either, if you would like to help think of something better.

@jwkvam (Contributor, Author) commented Feb 28, 2014

@glouppe When you have some time could you look at the refactoring done in jwkvam@12b54ca?

Basically, I replaced sample_group with **kwargs; it seems cleaner to me, and the complexity introduced by LambdaMART now lives in that class. Unfortunately I had to introduce a mask parameter to the loss functions, but if people don't like it, I'll revert the commit.

@ogrisel (Member) commented Feb 28, 2014

I don't see the point in adding **kwargs to the __call__ method of loss functions that don't use it (e.g. LeastSquares). Is it really necessary?

@jwkvam (Contributor, Author) commented Feb 28, 2014

You're right, I removed them.

@glouppe (Contributor) commented Mar 5, 2014

@jwkvam I had a quick look and it really is less intrusive! Thanks for the refactoring :)

Since this is still a WIP, do you need a review of something specific? Do you need help with something? We can give a hand if needed, just ask for it :)

@jwkvam (Contributor, Author) commented Mar 10, 2014

@glouppe I just haven't had much time available; I'm not looking for any reviews yet. I should probably edit the initial comment to reflect the current status. Basically the plan I had was to:

  1. Determine if caching maximum DCGs is worthwhile.
  2. Run final benchmarks against Ranklib and GBM.
  3. Add a Yandex usage example.
  4. Fill gaps in documentation and tests.
  5. Ask for final reviews.

Sorry for how slowly this is going; if you would like to help out, I'd welcome it. Maybe just let me know what you are tackling?

@ogrisel (Member) commented Mar 10, 2014

Note the Yandex dataset is the small Yandex 2009 dataset from:

http://imat2009.yandex.ru/en/datasets

not the big 2013 Kaggle dataset about personalized search.

@jwkvam (Contributor, Author) commented Mar 10, 2014

Right, that's what I meant.

for j in range(i + 1, y_true.shape[0]):
    if y_true[i] != y_true[j]:
        if j < max_rank:
            ndcg_diff = ((y_true[j] - y_true[i]) / log(2 + i)
Review comment (Member):

You can precompute the logs in an array of size n_samples. Also make sure to use logarithms in base 2.
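
That is, something along these lines (a sketch, reusing y_true from the diff above):

    import numpy as np

    # Precompute the base-2 log discounts once per query instead of
    # calling log() inside the double loop.
    discounts = np.log2(np.arange(2, y_true.shape[0] + 2))  # ranks 1..n_samples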

@lazywei (Contributor) commented Jun 8, 2015

How is this PR going? I have some use cases for LambdaMART, and hence would love to see this merged. Thanks!

@agramfort (Member)

See the TODO list at the top. Also, it needs a rebase. Please take over if you can.

@mblondel (Member) commented Jun 8, 2015

I personally have had bad experiences with LambdaMART. In my experience, it performs either worse than or comparably to random forests or gradient boosting.

@Ulden commented Aug 10, 2016

So how are things going? Is LambdaMART available?

@haiatn (Contributor) commented Jul 29, 2023

Does anyone know if these commits are still relevant, since it has been ~7 years? If not, maybe it is better to close this and open an issue instead.

@agramfort (Member)

Agreed. I don't see anyone resurrecting this anytime soon.
