Allow customization of morelike for experimentation
Closed, Resolved, Public

Description

Right now Cirrus supports morelike:<pageName> for finding similar pages. It uses the more_like_this query from Elasticsearch (which in turn uses MoreLikeThis from Lucene). We're using mostly default parameters for it, and Readership would like to experiment with changing them. We should allow that. Here is what I wrote in an email about what we could make configurable:

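$wgCirrusSearchMoreLikeThisConfig = array(
    'min_doc_freq' => 2,              // Minimum number of documents (per shard) that need a term for it to be considered
    'max_query_terms' => 25,          // Maximum number of terms selected for the generated query
    'min_term_freq' => 2,             // Minimum number of times a term must appear in the source page to be considered
    'percent_terms_to_match' => 0.3,  // Fraction of the selected terms that a result must match
    'min_word_len' => 0,              // Ignore words shorter than this (0 = no lower bound)
    'max_word_len' => 0,              // Ignore words longer than this (0 = no upper bound)
);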

Here is the reference for what they mean, plus any other options we might be able to set: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html

We only use the "text" field of the articles - no weighting based on, well, anything. See the text field in https://en.wikipedia.org/wiki/Barack_Obama?action=cirrusdump for example.
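To make the mapping from those settings to the search request concrete, here is a minimal sketch (in Python, purely for illustration) of how such a settings array could be turned into a more_like_this query body restricted to the "text" field. The function name build_mlt_query is hypothetical, and the way the seed page is referenced (a "like" entry with an "_id") is an assumption: the accepted key for this differs between Elasticsearch versions, and percent_terms_to_match is not accepted by every version either. This is not how CirrusSearch itself builds the query.

```python
def build_mlt_query(config, doc_id, fields=("text",)):
    """Build a more_like_this request body from a settings mapping (sketch)."""
    mlt = {
        "fields": list(fields),
        # Assumption: the seed page is referenced by its document ID.
        # The exact key for this varies between Elasticsearch versions.
        "like": [{"_id": doc_id}],
    }
    # Copy over only the tuning options that were actually set.
    for key, value in config.items():
        if value is not None:
            mlt[key] = value
    return {"query": {"more_like_this": mlt}}


# Same values as the settings array above.
config = {
    "min_doc_freq": 2,
    "max_query_terms": 25,
    "min_term_freq": 2,
    "percent_terms_to_match": 0.3,
    "min_word_len": 0,
    "max_word_len": 0,
}

# "12345" is a made-up document ID used only for this example.
query = build_mlt_query(config, "12345")
```

Passing a different fields tuple is all it would take to point the same query at another indexed field.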

Stuff we could do really, really easily:
1. Add URL parameters that override each of those options for easy experimenting.
2. Add URL parameters to use different fields like our weighted all field, the wikitext, or intro paragraphs (don't ask how we extract into paragraphs - it's a horrible hack), or the section headers, or the "secondary" text like the infoboxes and image subtitles.

This task is to do #1 and #2. #1 is harder because we need to come up with reasonable limits on the values of the parameters.
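One way #1 and #2 could fit together is sketched below (in Python, for illustration only): overrides are read from the query string, each numeric option is clamped to a bounded range, and the requested fields are checked against a whitelist. Everything specific here is an assumption made for the example: the "mlt_" parameter prefix, the numeric bounds in LIMITS, and the field names in ALLOWED_FIELDS are all hypothetical, and choosing real limits is exactly the open question in this task.

```python
from urllib.parse import parse_qs

# Defaults taken from the settings array in the description.
DEFAULTS = {
    "min_doc_freq": 2,
    "max_query_terms": 25,
    "min_term_freq": 2,
    "percent_terms_to_match": 0.3,
    "min_word_len": 0,
    "max_word_len": 0,
}

# Hypothetical (low, high) bounds; real limits still need to be decided.
LIMITS = {
    "min_doc_freq": (1, 50),
    "max_query_terms": (1, 100),
    "min_term_freq": (1, 20),
    "percent_terms_to_match": (0.0, 1.0),
    "min_word_len": (0, 50),
    "max_word_len": (0, 50),
}

# Hypothetical field names standing in for the fields listed in #2.
ALLOWED_FIELDS = {"text", "all", "source_text", "opening_text",
                  "heading", "auxiliary_text"}


def apply_overrides(query_string, prefix="mlt_"):
    """Return (config, fields) with URL overrides applied and clamped."""
    params = parse_qs(query_string)
    config = dict(DEFAULTS)
    for name, (low, high) in LIMITS.items():
        raw = params.get(prefix + name)
        if not raw:
            continue
        try:
            # Parse with the same type as the default (int or float).
            value = type(DEFAULTS[name])(raw[0])
        except ValueError:
            continue  # Ignore values that do not parse.
        if value != value:
            continue  # Ignore NaN, which cannot be clamped meaningfully.
        config[name] = min(max(value, low), high)
    requested = params.get(prefix + "fields", ["text"])[0].split(",")
    # Drop unknown fields; fall back to "text" if nothing valid remains.
    fields = [f for f in requested if f in ALLOWED_FIELDS] or ["text"]
    return config, fields


config, fields = apply_overrides("mlt_min_doc_freq=500&mlt_fields=all,bogus")
```

Clamping rather than rejecting out-of-range values keeps experimental URLs working even when someone overshoots a limit, at the cost of silently changing what was asked for; rejecting with an error would be the stricter alternative.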

  • Stakeholders: (1) Readership and (2) Editing
  • Benefits: (1) Allows readership to experiment with options to improve related article recommendations and (2) allows editing to experiment with options to improve next article to edit recommendations
  • Estimate: One or two days

Event Timeline

Manybubbles set the priority of this task to Needs Triage.
Manybubbles updated the task description.
Manybubbles moved this task to Search on the Discovery-ARCHIVED board.
Manybubbles subscribed.

The most obvious source for "more like this" referrals is the "See also" section, which is now ignored completely. Is there a way to take that into account?

Sure, but it's much more work. We could index a new field for it. I'd love to be able to do it in a way that works across all wikis, but I don't see that happening.

It's not fair to say it's completely ignored - it's just that Cirrus doesn't treat that section any differently from the rest of the article text.

Why not? Just tell tech ambassadors to list the possible names for "See also"-type sections on some MediaWiki page.

Change 220825 had a related patch set uploaded (by DCausse):
WIP: Add options to customize MoreLikeThis queries

https://gerrit.wikimedia.org/r/220825

Change 220825 merged by jenkins-bot:
Add options to customize MoreLikeThis queries

https://gerrit.wikimedia.org/r/220825