
[WIP] first cut at LambdaMART #2580


Closed · jwkvam wants to merge 21 commits into scikit-learn:master from jwkvam:lambdamart

Conversation

@jwkvam (Contributor) commented Nov 9, 2013

This PR is an attempt to implement LambdaMART [1]. I imagine the biggest hurdle will be reaching a conclusion on the correct API, since we need to pass in the queries somehow. In my implementation I use an extra keyword argument; I don't know whether this causes problems elsewhere. My hope is that this PR can serve as a catalyst to resolve that, and then I can finish up the PR.

TODO

  • tests
  • docs
  • gbm comparison
  • Ranklib comparison
  • support an optional cutoff parameter for NDCG, usually denoted as NDCG@k
  • rewrite the example to use the yandex 2009 dataset instead of MQ2007/8, which would require the unrar command to extract
  • add a max_rank=10 cutoff parameter for NDCG and LambdaMART and use it to compute the lambdas too
  • factorize a public function ndcg_score in sklearn.metrics in pure Python first (see this comment; a rough sketch appears after these lists)
  • benchmark python and cython DCG
  • implement pessimistic tie break for ndcg_score and bench it against no-tie break (default implementation sort order) and random tie break
  • clarify if the pessimistic tie break is only needed for the stability of the ndcg_score itself or for its derivatives in the LambdaMART optimization loop as well (see this comment)
  • investigate if pre-computing (caching) the ideal DCG for each query on the training set as done by Ranklib can speed up learning

Possibly for another PR:

  • abstract code to support measures other than NDCG
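
Since several of the TODO items above revolve around a pure-Python ndcg_score with a rank cutoff, here is a minimal sketch of what such a function could look like. The function names, the k=10 default, and the 2**rel - 1 gain convention are illustrative assumptions, not a settled API:

    import numpy as np

    def dcg_score(y_true, y_score, k=10):
        # Rank samples by predicted score (descending) and keep the top k.
        order = np.argsort(y_score)[::-1]
        gains = 2 ** y_true[order][:k] - 1
        # Position i (1-based) is discounted by log2(i + 1).
        discounts = np.log2(np.arange(2, gains.size + 2))
        return np.sum(gains / discounts)

    def ndcg_score(y_true, y_score, k=10):
        # Normalize by the ideal DCG, i.e. the DCG of ranking by the true labels.
        # (No guard here for all-zero relevance, which would divide by zero.)
        return dcg_score(y_true, y_score, k) / dcg_score(y_true, y_true, k)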

There was also a brief discussion on the mailing list [2]. Pinging @mblondel @ogrisel @pprett from that discussion.

[1] https://research.microsoft.com/en-us/um/people/cburges/tech_reports/msr-tr-2010-82.pdf
[2] http://sourceforge.net/mailarchive/forum.php?thread_name=CAP%2B3rpGVbSux5u4dZiatV3p1f1zUvndHXoi-oh4CjMRtSpjFsw%40mail.gmail.com&forum_name=scikit-learn-general

@coveralls

Coverage Status: Coverage remained the same when pulling 1afaf09 on jwkvam:lambdamart into c0e686b on scikit-learn:master.

@agramfort (Member)

One option we considered with @fabianp to support queries without changing the API too much is to have a 2-column y where the second column is the query index. It just means that y.ndim can now be 2.
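
To make that concrete, a minimal sketch of what the encoding could look like (purely illustrative, not an agreed-upon API):

    import numpy as np

    # Column 0: relevance label, column 1: query index.
    y = np.array([[2, 0],   # relevance 2, query 0
                  [0, 0],   # relevance 0, query 0
                  [3, 1],   # relevance 3, query 1
                  [1, 1]])  # relevance 1, query 1

    if y.ndim == 2:
        relevance, query = y[:, 0], y[:, 1]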

@pprett (Member) commented Nov 9, 2013

@agramfort I think that might indeed be a better solution than having an additional fit argument.

@mblondel (Member)

> One option we considered with @fabianp to support queries without changing the API too much is to have a 2-column y where the second column is the query index. It just means that y.ndim can now be 2.

Sounds good. Plus, such an API would be compatible with GridSearchCV.

In the future, we can create query-aware cross-validation generators.

@jwkvam I think I would use a dedicated class for LambdaMART.

@coveralls

Coverage Status: Coverage remained the same when pulling b217597 on jwkvam:lambdamart into c0e686b on scikit-learn:master.

@jwkvam (Contributor, Author) commented Nov 11, 2013

Thanks for the feedback, adding the queries into the labels seems reasonable to me.

If this has its own class, I'm trying to think of the ways it would differ from GradientBoostingRegressor. I don't think the RegressorMixin makes much sense; I could write a score function that depends on the loss, and maybe we could abstract that out into a mixin later? Would it be worthwhile to let fit() in BaseGradientBoosting take an additional keyword argument, so that I could split up y in LambdaMART's fit()? Or should I just leave all the fits alone?

@mblondel (Member)

In the case of ranking with a single query, LambdaMART could accept a 1d y too.

@ogrisel (Member) commented Nov 13, 2013

> One option we considered with @fabianp to support queries without changing the API too much is to have a 2-column y where the second column is the query index. It just means that y.ndim can now be 2.

> Sounds good. Plus, such an API would be compatible with GridSearchCV.

I am not strongly opposed to it, but I still find it a bit weird to mix integer coding for the relevance scores with floating point coding for the query ids.

> In the future, we can create query-aware cross-validation generators.

Indeed we will need it at some point.

@ogrisel (Member) commented Nov 13, 2013

@jwkvam I think it's ok not to use the regressor mixin and not to introduce a new ranking mixin yet. If we ever implement more ranking models in the future, we can devise a ranking mixin to factor out the common code at that point.

@mblondel (Member)

> I am not strongly opposed to it, but I still find it a bit weird to mix integer coding for the relevance scores with floating point coding for the query ids.

Also 2d y could be used for multi-output ranking.

@ogrisel (Member) commented Nov 13, 2013

I think we need a general concept of "sample metadata": data structures whose first axis is n_samples and that should be sliced or resampled consistently with the first axis of the input data when doing cross-validation or resampling. Potentially impacted data structures:

  • sample groups: integers to encode sample groupings per subject, per experiment, per session, or per query; useful for scoring learning-to-rank models, but also for informed cross-validation that takes the non-IIDness of the data into account (e.g. with LeavePLabelOut)
  • sample weights: floating point values to reweight the cost function, e.g. for boosting or for dealing with imbalanced datasets
  • sample unique ids: integer or string ids to trace the provenance of individual samples, especially useful when re-sampling or nudging the dataset or when introspecting prediction outcomes in cross-validation folds
  • the target variable y for supervised learning (integers for binary, multiclass or multilabel classification, floating point for regression with potentially several outputs, or even a mix of the two for multitask learning)

There might be more cases that I forgot.
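
The common thread is that all of these can be indexed with the same sample indices as X. A toy sketch of that invariant (hypothetical variable names):

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.rand(6, 3)
    y = rng.randint(0, 3, size=6)
    sample_weight = np.ones(6)
    sample_group = np.array([0, 0, 1, 1, 2, 2])

    # Every sample-wise array is sliced with the same indices as X
    # when building a cross-validation fold.
    train_idx = np.array([0, 1, 2, 3])
    X_tr, y_tr = X[train_idx], y[train_idx]
    sw_tr, sg_tr = sample_weight[train_idx], sample_group[train_idx]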

The policy until recently was to pass them as separate variables to the fit method, either as positional args (for the target variable(s) y) or as kwargs (e.g. for sample_weight).

However, some parts of our API are currently broken or too limited. For instance, transformers currently do not transform (or preserve) y or sample_weight, making it impossible to pipeline transformers that change the number of samples (e.g. nudging transformers for data expansion, or resampling transformers).

For now I would thus keep the query ids as a separate argument to fit and to the NDCG scoring function, and open a separate issue to devise how the API should evolve to better deal with sample-wise auxiliary data.

There is also the problem of feature-wise metadata, such as feature names, that could be interesting to better preserve or trace when working with FeatureUnion or feature selection transformers, but this is completely unrelated to this PR.

@ogrisel (Member) commented Nov 13, 2013

> I am not strongly opposed to it, but I still find it a bit weird to mix integer coding for the relevance scores with floating point coding for the query ids.

> Also 2d y could be used for multi-output ranking.

This is an even better reason not to implicitly stick the query membership info directly as a new column in y.

@mblondel (Member)

> I think it's ok not to use the regressor mixin and not to introduce a new ranking mixin yet

Alright, then LambdaMART will need to implement its own score method.

@mblondel (Member)

> For now I would thus keep the query ids as a separate argument to fit

In any case, query should be optional (if not provided, a single query is assumed). Another remark is that query is an information retrieval term. We could use the more neutral groups instead.
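
A minimal sketch of that default behaviour, assuming a fit(X, y, group=None) style signature (the helper name is hypothetical):

    import numpy as np

    def _check_group(group, n_samples):
        # With no group info, treat the whole dataset as a single query/group.
        if group is None:
            return np.zeros(n_samples, dtype=np.int32)
        return np.asarray(group, dtype=np.int32)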

@pprett (Member) commented Nov 13, 2013

@mblondel That makes sense for some ranking models (e.g. RankingSVM), but does it make sense for LambdaMART? Does it optimize AUC if you feed it binary relevance scores?

@ogrisel (Member) commented Nov 13, 2013

> In any case, query should be optional (if not provided, a single query is assumed). Another remark is that query is an information retrieval term. We could use the more neutral groups instead.

+1 or maybe even sample_group to be consistent with sample_weight? This way we could use a naming convention in CV or resampling utilities to know which auxiliary data params should be consistently sliced and resampled with the input data and target value.

> @mblondel That makes sense for some ranking models (e.g. RankingSVM), but does it make sense for LambdaMART? Does it optimize AUC if you feed it binary relevance scores?

If you feed LambdaMART binary relevance scores, it just optimizes NDCG with rel in {0, 1}, which is different from AUC (but a good ranking metric anyway).
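
A self-contained toy illustration of the difference; the inline dcg helper just mirrors the standard definition:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def dcg(rel, k):
        return np.sum((2.0 ** rel[:k] - 1) / np.log2(np.arange(2, k + 2)))

    y_true = np.array([1, 0, 0, 1])           # binary relevance
    y_score = np.array([0.9, 0.8, 0.7, 0.1])  # predicted ranking scores

    order = np.argsort(y_score)[::-1]
    ideal = np.sort(y_true)[::-1]
    print(roc_auc_score(y_true, y_score))         # 0.5: half the pos/neg pairs are inverted
    print(dcg(y_true[order], 4) / dcg(ideal, 4))  # ~0.88: the top-ranked positive dominates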

@mblondel (Member)

> @mblondel That makes sense for some ranking models (e.g. RankingSVM), but does it make sense for LambdaMART? Does it optimize AUC if you feed it binary relevance scores?

The way I see it, ranking with a single query is similar to ordinal regression in the case when the supervision is given as relevance scores (from which you can derive a ranking). In the case of pairwise or listwise methods, the supervision can also take the form of pairwise or listwise orderings of the samples, i.e., without the notion of relevance score.

NDCG is nice because it focuses on the top k elements in the ranking whereas AUC takes into account the entire ranking (usually you care more about the top elements).

ranking = self.loss in ('ndcg')
if ranking:
    if query is None:
        raise ValueError("query must not be none with ranking measure")
Review comment (Member):

As @mblondel said, I think we should treat the X, y data as stemming from a single query in that case.

Review comment (Contributor, Author):

Thanks for the reminder, I will get around to it eventually.

@jwkvam (Contributor, Author) commented Jan 2, 2014

Sorry for the lack of updates; admittedly I've been kind of lazy. I benchmarked the code (there are some unpushed changes) against GBM, and the performance seems comparable apart from the execution times. I used 50 trees, a depth of 3 for GBRT and interaction.depth of 8 for GBM, a learning rate of 0.1, and no subsampling of features or data. GBRT seems to be overfitting slightly more than GBM; this is just a guess, but it could be because GBM does some sort of normalization of the gradient [1]. I tried doing something similar out of curiosity but didn't see much difference; it's possible I miscoded it, however.

Dataset  Software  Training NDCG       Validation NDCG     Training Time
MQ2007   GBM       0.7206621488609668  0.6737084491077485  10.3 s
MQ2007   GBRT      0.7405864270405591  0.6777565353942717  30.6 s
MQ2008   GBM       0.7971141434532627  0.7728533219225161   2.1 s
MQ2008   GBRT      0.8105098974258365  0.7762440078297614   9.6 s

[1] https://code.google.com/p/gradientboostedmodels/source/browse/gbm/src/pairwise.cpp#786

@@ -585,6 +665,22 @@ def fit(self, X, y):
        X, = check_arrays(X, dtype=DTYPE, sparse_format="dense",
                          check_ccontiguous=True)
        y = column_or_1d(y, warn=True)
        ranking = self.loss in ('ndcg')
Review comment (Member):

You need a ('ndcg',) here for the single element tuple. I wonder how this code can work otherwise...
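
For the record, the reason the buggy line still "works": ('ndcg') is not a tuple, just a parenthesized string, so in falls back to substring containment:

    # All of these hold in Python:
    assert ('ndcg') == 'ndcg'    # parentheses alone do not make a tuple
    assert 'ndcg' in ('ndcg')    # substring containment: True by accident
    assert 'ndcg' in ('ndcg',)   # the intended one-element tuple membership test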

@ogrisel (Member) commented Jan 13, 2014

@jwkvam could you please rebase this branch on top of master to take @pprett's recent refactoring into account?

Also, could you write a dataset loader for the MQ2007 and MQ2008 datasets, use them as part of a new example in the examples/applications folder, and compare the results with a pointwise linear regression model?

@ogrisel (Member) commented Jan 13, 2014

Also could you please run the MQ2008 benchmark with line_profiler enabled on the fit method and report the results here?

@jwkvam (Contributor, Author) commented Jan 25, 2014

@ogrisel I rebased so it's on top of @pprett's changes.

Other changes include:

  • The group argument is no longer necessary; if it isn't present, the entire dataset is treated as one group.
  • Followed gbm's approach of not modifying the target values by (2**y - 1); if users want that, they need to preprocess the labels themselves (see the sketch after this list). Unfortunately this can lead to degenerate NDCG cases: for example, if the relevance values for two samples are log(2) and -log(3), the optimal DCG is zero, which can easily lead to -Inf NDCG values. I like the power this gives to the user, but it can lead to bad things happening.
  • If all the relevance values in a group are identical, gbm makes the NDCG NaN. I like this approach and was going to incorporate it, since such groups can otherwise inflate the score of the overall set, but I thought I'd check with others first.
  • Modified ZeroEstimator so it can be used with LambdaMART with integer values. It was a quick hack, so I expect people may want a different approach.
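
For the second point, recovering the classic exponential gain under this convention means preprocessing the labels yourself, e.g. (sketch):

    import numpy as np

    # gbm-style convention: labels are used as gains directly, so apply
    # the classic exponential gain before fitting if you want it.
    relevance = np.array([0, 1, 2, 3])
    y = 2.0 ** relevance - 1.0   # [0., 1., 3., 7.]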

Running my same benchmark on MQ2008 takes 6.4 seconds now. I wasn't able to get line_profiler working, but here are the main culprits. I imagine I could speed up the NDCG code by presorting the labels and pushing the loops into Cython.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    2.309    0.046    2.329    0.047 {method 'build' of 'sklearn.tree._tree.DepthFirstTreeBuilder' objects}
       50    2.165    0.043    2.942    0.059 gradient_boosting.py:394(negative_gradient)
      154    2.118    0.014    2.620    0.017 gradient_boosting.py:364(__call__)
        4    0.448    0.112    0.486    0.121 npyio.py:880(savetxt)
        1    0.436    0.436    9.121    9.121 parse.py:3(<module>)

I'll add writing up that example to my TODO list.

I added LambdaMART to all the tests of the form for Cls in ...; they all pass aside from test_warm_start_oob_switch, and I still need to figure out why that one is failing.

@ogrisel (Member) commented Jan 26, 2014

Hi @jwkvam, could you please re-run some benchmarks using the max_leaf_nodes parameter instead of max_depth? It should be closer to GBM's behavior.

@jwkvam (Contributor, Author) commented Jan 27, 2014

@ogrisel yes, thanks for your continued feedback.

I set max_leaf_nodes=9 which is equivalent to GBM's interaction.depth=8.

Dataset  Software                Training NDCG       Validation NDCG     Training Time
MQ2007   GBM                     0.7206621488609668  0.6737084491077485  10.3 s
MQ2007   GBRT max_depth=3        0.7405864270405591  0.6777565353942717  21.7 s
MQ2007   GBRT max_leaf_nodes=9   0.730921178027      0.679968120327      21.6 s
MQ2008   GBM                     0.7971141434532627  0.7728533219225161   2.1 s
MQ2008   GBRT max_depth=3        0.8105098974258365  0.7762440078297614   6.1 s
MQ2008   GBRT max_leaf_nodes=9   0.794975081183      0.773451965618       5.9 s
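
For reference, the GBRT runs above correspond roughly to the following setup (a sketch using the LambdaMART class as it exists on this branch; X_train, y_train and qid_train stand in for data loaded elsewhere):

    from sklearn.ensemble import LambdaMART  # only available on this PR branch

    model = LambdaMART(n_estimators=50, learning_rate=0.1,
                       max_leaf_nodes=9, subsample=1.0)
    model.fit(X_train, y_train, group=qid_train)  # X_train etc. loaded elsewhere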

                 warm_start=False):

        super(LambdaMART, self).__init__(
            loss, learning_rate, n_estimators, min_samples_split,
Review comment (Member):

Please hardcode 'ndcg' for now, as there is no alternative ranking loss implemented for LambdaMART.

@fulmicoton

I started testing this implementation's performance and validating its results against Ranklib yesterday evening on the Yandex dataset.

It runs smoothly with subsample=1.0, but subsample < 1.0 fails with the following error.

  File "/tmp/1390867768644-0/script.py", line 51, in <module>
    model.fit(X, score, monitor=None, group=group)
  File "/home/dataiku/scikit-learn/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 1720, in fit
    return super(LambdaMART, self).fit(X, y, monitor, group)
  File "/home/dataiku/scikit-learn/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 901, in fit
    begin_at_stage, monitor, group)
  File "/home/dataiku/scikit-learn/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 958, in _fit_stages
    y_pred[~sample_mask], group=group_oob)
  File "/home/dataiku/scikit-learn/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 369, in __call__
    last_group = group[0]
IndexError: index out of bounds

Has someone encountered the same error? Also, let me know if you want access to the dataset (around 11 GB).

@ogrisel (Member) commented Jan 28, 2014

@poulejapon I was about to report the exact same error when trying on MSLR-10K:

Traceback (most recent call last):

  File "<ipython-input-47-8b8bb31a05fb>", line 1, in <module>
    get_ipython().run_cell_magic(u'time', u'', u'\nfrom sklearn.ensemble import LambdaMART\n\nlmart= LambdaMART(n_estimators=100, max_leaf_nodes=7,\n                  learning_rate=0.03,\n                  subsample=0.5, random_state=1, verbose=1)\nlmart.fit(X_train_small, y_train_small, group=qid_train_small)')
  File "/volatile/ogrisel/envs/py27/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2129, in run_cell_magic
    result = fn(magic_arg_s, cell)
  File "<string>", line 2, in time
  File "/volatile/ogrisel/envs/py27/local/lib/python2.7/site-packages/IPython/core/magic.py", line 191, in <lambda>
    call = lambda f, *a, **k: f(*a, **k)
  File "/volatile/ogrisel/envs/py27/local/lib/python2.7/site-packages/IPython/core/magics/execution.py", line 1045, in time
    exec code in glob, local_ns
  File "<timed exec>", line 7, in <module>
  File "/volatile/ogrisel/code/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 1720, in fit
    return super(LambdaMART, self).fit(X, y, monitor, group)
  File "/volatile/ogrisel/code/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 901, in fit
    begin_at_stage, monitor, group)
  File "/volatile/ogrisel/code/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 950, in _fit_stages
    random_state)
  File "_gradient_boosting.pyx", line 315, in sklearn.ensemble._gradient_boosting._ranked_random_sample_mask (sklearn/ensemble/_gradient_boosting.c:4479)
ValueError: Buffer dtype mismatch, expected 'int32' but got 'long'

@fulmicoton

Not quite the same. To get mine, I think you need to .astype(np.int32) your group vector. :)
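
In other words, the workaround (reusing the variable names from the tracebacks above) is something like:

    import numpy as np

    # Cast the group vector to int32 before fitting to avoid the buffer
    # dtype mismatch in _ranked_random_sample_mask (a workaround, not a fix).
    group = group.astype(np.int32)
    model.fit(X, score, monitor=None, group=group)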

@glouppe (Contributor) commented Feb 27, 2014

Btw, sorry to bother you with these design issues. Your contribution is really great! But we really have to be careful not to overcomplicate our implementations. Otherwise, with time and additional features, it will blow up to the point where nobody can understand what is going on...

@jwkvam (Contributor, Author) commented Feb 27, 2014

No, I agree completely. I'm not too happy with the new ZeroEstimator either, if you would like to help think of something better.

@jwkvam (Contributor, Author) commented Feb 28, 2014

@glouppe When you have some time could you look at the refactoring done in jwkvam@12b54ca?

Basically, I replaced sample_group with **kwargs; it seems cleaner to me, and the complexity introduced by LambdaMART now lives in that class. Unfortunately I had to introduce a mask parameter to the loss functions, but if people don't like it, I'll revert the commit.

@ogrisel (Member) commented Feb 28, 2014

I don't see the point in adding **kwargs to the __call__ method of loss functions that don't use it (e.g. LeastSquares). Is it really necessary?

@jwkvam (Contributor, Author) commented Feb 28, 2014

You're right, I removed them.

@glouppe (Contributor) commented Mar 5, 2014

@jwkvam I had a quick look and it really is less intrusive! Thanks for the refactoring :)

Since this is still a WIP, do you need a review of something specific? Do you need help with something? We can give a hand if needed, just ask for it :)

@jwkvam (Contributor, Author) commented Mar 10, 2014

@glouppe I just haven't had much time available; I'm not looking for any reviews yet. I should probably edit the initial comment to reflect the current status. Basically the plan I had was to:

  1. Determine if caching maximum DCGs is worthwhile.
  2. Run final benchmarks against Ranklib and GBM.
  3. Add a Yandex usage example.
  4. Fill gaps in documentation and tests.
  5. Ask for final reviews.

Sorry for how slowly this is going; if you would like to help out, I'd welcome it. Maybe just let me know what you are tackling?

@ogrisel (Member) commented Mar 10, 2014

Note the Yandex dataset is the small Yandex 2009 dataset from:

http://imat2009.yandex.ru/en/datasets

not the big 2013 Kaggle dataset about personalized search.

@jwkvam (Contributor, Author) commented Mar 10, 2014

Right, that's what I meant.

for j in range(i + 1, y_true.shape[0]):
    if y_true[i] != y_true[j]:
        if j < max_rank:
            ndcg_diff = ((y_true[j] - y_true[i]) / log(2 + i)
Review comment (Member):

You can precompute the logs in an array of size n_samples. Also make sure to use logarithms in base 2.
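
That is, something along these lines (a sketch, reusing y_true from the diff above):

    import numpy as np

    # Precompute the base-2 log discounts once per query instead of
    # calling log() inside the double loop.
    discounts = np.log2(np.arange(2, y_true.shape[0] + 2))  # ranks 1..n_samples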

@lazywei (Contributor) commented Jun 8, 2015

How is this PR going? I have some use cases for LambdaMART, and hence would love to see this merged. Thanks!

@agramfort (Member)

See the TODO list at the top. Also, it needs a rebase. Please take over if you can.

@mblondel (Member) commented Jun 8, 2015

I personally have had bad experiences with LambdaMART. In my experience, it performs either worse than or comparably to random forests or gradient boosting.

@Ulden commented Aug 10, 2016

So how are things going? Is LambdaMART available?

@haiatn (Contributor) commented Jul 29, 2023

Does anyone know if these commits are still relevant, since it has been ~7 years? If not, maybe it is better to close this and open an issue instead.

@agramfort (Member)

Agreed. I don't see anyone resurrecting this anytime soon.
