Right now Cirrus supports morelike:<pageName> for finding similar pages. It uses the more_like_this query from Elasticsearch (and in turn MoreLikeThis from Lucene). We're using pretty much default parameters for it, and Readership would like to experiment with changing them. We should allow that. Here is what I wrote in an email about what we can configure:
$wgCirrusSearchMoreLikeThisConfig = array(
    'min_doc_freq' => 2, // Minimum number of documents (per shard) that need a term for it to be considered
    'max_query_terms' => 25,
    'min_term_freq' => 2,
    'percent_terms_to_match' => 0.3,
    'min_word_len' => 0,
    'max_word_len' => 0,
);
Here is the reference for what they mean and any more we might be able to set: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
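To make the mapping concrete, here is a minimal sketch of how those options could end up in a more_like_this query body. The helper name buildMoreLikeThisQuery is hypothetical and not part of CirrusSearch, and passing the option names through unchanged is an assumption: names such as like_text (the older name; newer Elasticsearch versions use like) and percent_terms_to_match depend on the Elasticsearch version, so check them against the reference above.

```php
<?php
// Same defaults as the config above.
$wgCirrusSearchMoreLikeThisConfig = array(
    'min_doc_freq' => 2,
    'max_query_terms' => 25,
    'min_term_freq' => 2,
    'percent_terms_to_match' => 0.3,
    'min_word_len' => 0,
    'max_word_len' => 0,
);

// Hypothetical helper: builds the query body as a plain array.
// Assumption: option names are passed to Elasticsearch unchanged.
function buildMoreLikeThisQuery( array $config, array $fields, $likeText ) {
    $body = array(
        'fields' => $fields,
        'like_text' => $likeText, // older parameter name; newer versions use 'like'
    );
    // Array union keeps 'fields' and 'like_text' and appends the tuning options.
    return array( 'more_like_this' => $body + $config );
}

$query = buildMoreLikeThisQuery(
    $wgCirrusSearchMoreLikeThisConfig,
    array( 'text' ), // today only the "text" field is used
    'example page text'
);
echo json_encode( $query ), "\n";
```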
We only use the "text" field of the articles - no weighting based on, well, anything. See the text field in https://en.wikipedia.org/wiki/Barack_Obama?action=cirrusdump for example. Stuff we could do really, really easily:
1. Add URL parameters that override each of those options for easy experimenting.
2. Add URL parameters to use different fields like our weighted all field, the wikitext, or intro paragraphs (don't ask how we extract into paragraphs - it's a horrible hack), or the section headers, or the "secondary" text like the infoboxes and image subtitles.
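For option 2, one way to keep field selection safe is an allowlist. The sketch below is only an illustration: the function pickMoreLikeThisFields, the comma-separated parameter format, and the field names in $allowedFields are all assumptions chosen to mirror the fields described above, not confirmed index field names.

```php
<?php
// Hypothetical allowlist; field names are assumptions based on the description above.
$allowedFields = array( 'text', 'all', 'source_text', 'opening_text', 'heading', 'auxiliary_text' );

// Hypothetical helper: turns a comma-separated URL parameter into a list of
// allowed fields, falling back to the default when nothing valid is requested.
function pickMoreLikeThisFields( $param, array $allowed, array $default ) {
    if ( $param === null || $param === '' ) {
        return $default;
    }
    $requested = array_map( 'trim', explode( ',', $param ) );
    // Keep only requested fields that are on the allowlist, in request order.
    $fields = array_values( array_unique( array_intersect( $requested, $allowed ) ) );
    return $fields ?: $default;
}

$fields = pickMoreLikeThisFields( 'heading, bogus,opening_text', $allowedFields, array( 'text' ) );
echo implode( ',', $fields ), "\n";
```

Unknown field names are silently dropped here; rejecting the request with an error would be an equally reasonable choice for an experimentation feature.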
This task is to do #1 and #2. #1 is harder because we need to come up with reasonable limits on the values of the parameters.
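As a rough sketch of what "reasonable limits" for #1 could look like, the code below clamps each override into a per-option range before it replaces the configured default. Everything here is hypothetical: the function name applyMoreLikeThisOverrides, the assumption that overrides arrive as strings keyed by option name, and especially the ranges in $limits, which are placeholders to be tuned rather than recommendations.

```php
<?php
// Hypothetical limits as array( min, max ); the values are placeholders.
$limits = array(
    'min_doc_freq' => array( 1, 100 ),
    'max_query_terms' => array( 1, 100 ),
    'min_term_freq' => array( 1, 100 ),
    'percent_terms_to_match' => array( 0.0, 1.0 ),
    'min_word_len' => array( 0, 100 ),
    'max_word_len' => array( 0, 100 ),
);

// Hypothetical helper: applies numeric overrides, clamped to the limits.
// Unknown options and non-numeric values are ignored, keeping the default.
function applyMoreLikeThisOverrides( array $config, array $params, array $limits ) {
    foreach ( $limits as $name => $range ) {
        if ( !isset( $params[$name] ) || !is_numeric( $params[$name] ) ) {
            continue;
        }
        list( $min, $max ) = $range;
        // Float limits mark a fractional option; everything else is an integer.
        $value = is_float( $min ) ? (float)$params[$name] : (int)$params[$name];
        $config[$name] = max( $min, min( $max, $value ) );
    }
    return $config;
}

$defaults = array( 'max_query_terms' => 25, 'min_term_freq' => 2, 'percent_terms_to_match' => 0.3 );
$overrides = array( 'max_query_terms' => '500', 'percent_terms_to_match' => '0.5', 'min_term_freq' => 'abc' );
$effective = applyMoreLikeThisOverrides( $defaults, $overrides, $limits );
echo json_encode( $effective ), "\n";
```

Clamping rather than rejecting out-of-range values keeps experimenting easy, at the cost of hiding typos from whoever is tuning the parameters.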
- Stakeholders: (1) Readership and (2) Editing
- Benefits: (1) Allows Readership to experiment with options to improve related-article recommendations and (2) allows Editing to experiment with options to improve next-article-to-edit recommendations
- Estimate: One or two days