
Citation Usage: run third round of data collection
Closed, ResolvedPublic

Description

A/C

  • Resolve subtasks. The latest change will go out to production on February 7.
  • Collect data in the beta cluster. Data collection has been enabled on the beta cluster.
  • If everything looks good, then deploy.

Event Timeline


@bmansurov thanks for announcing the timeline. Is it unrealistic to have the patch reviewed and a deployment scheduled before All Hands week? This would give our collaborators two more weeks to start the analysis.

@DarTar The patches are up for review (I pinged @EBernhardson too). I may need your help to expedite this. Once they are reviewed, we probably need to deploy to the beta cluster in order to check that we don't have any regressions (this step can be skipped if we don't have time; I can check data locally, but it won't be as comprehensive). We are also discussing some issues from the previous round (in the Google document and in the comments at T212937). Hopefully they will be resolved soon too.

@bmansurov thanks for the detailed explanation. I defer to @tizianopiccardi and @Miriam on the best strategy to check for potential regressions from a data consumer perspective (it might be tricky to reproduce the various workflows that may trigger errors locally). @EBernhardson if there's anything I can do to help make a case for expediting this, let me know (bribe_mode="on").

Copying @RyanSteinberg @Lauren.maggio @toddleroux @Afandian for visibility.

Update: the patches were merged yesterday. We just need to close the open conversation about the remaining items.

@bmansurov I forgot to post a big thank you here. For the remaining items you mention, is there anything @RyanSteinberg and team or we can do to help?

@DarTar, yes I'm waiting for the list of identifiers besides ISBN and ISSN. More info: T212937#4893106.

Also waiting for @Miriam on T213969#4891216.

Sorry @bmansurov and thanks for the explanation ;) I think we should do the test in the beta cluster before deployment. @RyanSteinberg I might need your help to do these tests, as you are more familiar with the last changes requested.

Change 486329 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Labs: set wgWMECitationUsagePageLoadPopulationSize at 33.3%

https://gerrit.wikimedia.org/r/486329

Thanks @Miriam. I'm deploying the above patch later today. This will allow us to collect data on the beta cluster at the same sampling rate as in production (100% for CitationUsage and 33.3% for CitationUsagePageLoad). @RyanSteinberg, in order to test the data quality, you'll need to visit pages on the beta cluster and perform actions related to citations. Then verify that those actions are correctly captured in Hive. You may also want to do a bigger picture analysis of the data as you did with the previous round of data collection in production.

Thank you for the testing instructions @bmansurov. I will plan to review beta cluster pages and test data tomorrow. I'm still waiting to hear back from my team on T212937#4893106 and whether or not citation_identifier_label data is useful. In the meantime, I will update that task with a more comprehensive list of identifiers that I think should be used if citation_identifier_label remains. Sorry this has taken so long.

Change 486329 merged by jenkins-bot:
[operations/mediawiki-config@master] Labs: set wgWMECitationUsagePageLoadPopulationSize at 33.3%

https://gerrit.wikimedia.org/r/486329

Hi @bmansurov I interacted with a beta cluster page and expected to see usage data flow into event.citationusage. Am I looking in the right place or do I just need to be more patient?

Hi @RyanSteinberg, I forgot to mention that you need to follow these instructions for testing.

@bmansurov I don't think I have access to deployment-eventlog05.deployment-prep.eqiad.wmflabs or any of the wmflabs machines.

@RyanSteinberg I see. Let's see if @Miriam and @tizianopiccardi can verify the data. I think they should have access.

@RyanSteinberg @bmansurov

I can see all events on the client side. I'll do some tests there.
On the server side, I can see in the client-side-events.log file all the events generated by my session_token. However, I can't find the same events in the MySQL log database. My understanding is that there should be a table called CitationUsage_18810892, recording events from the latest version of the Citation Usage schema, but I can't find it. Am I doing something wrong @Ottomata @elukey? Thanks!

@Miriam, the eventlogging MySQL stuff in beta is very flaky. I just bounced it there, can you try again?

@Ottomata thanks, you're the best! I now see the table corresponding to the new version of the schema, but I don't see the events I generated :/.
I'll try to see if that works from another session, but if you say it's flaky let's maybe not spend too much time on it? I can just parse the log file for the test events I generate, to double check that on the server side everything looks good.

Hm, the consumer is now inserting events, and I see a few in your new table, but I don't know if current ones are coming in or not...they should be. Parsing the log file is probably better anyway, since these events will be imported into Hive (not MySQL) in prod.

Yes, it now works after starting a new session on the client. Thanks @Ottomata!

Hi @RyanSteinberg ! We have decided to turn on the data collection for a short period of time, so that you have real data samples to perform all the quality checks you might need on your side. @bmansurov suggested we can collect data for one day, at a sampling rate of maybe 1% for both schemas. Would that sound good to you? Would it be OK if we switch on this small data-collection in the coming days? Thanks.

Hi @Miriam. This sounds good to me. I should have some time to look at the sampled data this Friday. Thank you!

Change 492344 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Enable logging for CitationUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/492344

Change 492345 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Stop collecting data for CitaitonUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/492345

Data collection at 1% will run from 02/25, 14:30 EST until 02/27, 12:30 EST.

Change 492344 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable logging for CitationUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/492344

@Ottomata, @Nuria a heads up that we've just started collecting data for the CitationUsage and CitationUsagePageLoad schemas at 1%. For comparison, the previous sampling rates were 100% and 33% respectively.

Change 492345 merged by jenkins-bot:
[operations/mediawiki-config@master] Stop collecting data for CitaitonUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/492345

Mentioned in SAL (#wikimedia-operations) [2019-02-27T17:34:38Z] <niharika29@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Stop collecting data for CitaitonUsage and CitationUsagePageLoad T213969 (duration: 00m 55s)

@RyanSteinberg 2-day data collection is complete. Please check the data quality and let me know if everything looks good. I'll then start collecting data for one month.

Hi @bmansurov. I reviewed data today, specifically looking at section_id and freely_accessible elements.

section_id data looks improved: 16% of extClick actions in the old sample lacked section_id; today it's down to 8%. I manually spot-checked some of these NULL values and didn't find any links falling outside of the main page section.
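A missing-field rate like the one above can be computed along the following lines. This is only a minimal Python sketch, not the code from the linked notebook: it assumes events have already been loaded as dictionaries keyed by the schema field names (action, section_id), and the helper name missing_rate and the sample rows are invented for illustration.

```python
def missing_rate(events, field, action="extClick"):
    """Share of events with the given action whose `field` is None."""
    matching = [e for e in events if e.get("action") == action]
    if not matching:
        return 0.0
    missing = sum(1 for e in matching if e.get(field) is None)
    return missing / len(matching)


# Invented sample rows; only the field names come from the schema.
sample = [
    {"action": "extClick", "section_id": "References"},
    {"action": "extClick", "section_id": None},
    {"action": "other", "section_id": None},  # ignored: not an extClick
    {"action": "extClick", "section_id": "History"},
    {"action": "extClick", "section_id": "External_links"},
]

print(missing_rate(sample, "section_id"))  # 1 of 4 extClicks lacks section_id
```

The same helper could be pointed at page_id or revision_id, provided missing values are represented as None in the loaded data.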

freely_accessible also looks improved: old rate was 0.01%, new is 0.2%. Even though this is still a tiny number, it seems more reasonable.

I noticed a couple of other potential issues: page_id and revision_id are each missing in less than 0.5% of events. Although small, this might be worth looking into, as it seems odd. I also noticed page_title data is missing entirely, but I believe this may have been a deliberate choice.

Details of my review are available here: https://github.com/ryanmax/wiki-citation-usage/blob/master/data-regression-2019-03-01.ipynb

Hi @RyanSteinberg. Thanks for the analysis.

{"action":"extClick","citation_in_text_refs":null,"dom_interactive_time":1551142403587,"event_offset_time":4692,"ext_position":5,"footnote_number":null,"freely_accessible":false,"in_infobox":false,"link_occurrence":2,"link_text":"search for Monin (company) in Wikipedia","link_url":"https://en.wikipedia.org/w/index.php?search=Monin+%28company%29&title=Special%3ASearch&fulltext=1","mode":"desktop","namespace_id":0,"page_id":0,"page_title":null,"page_token":"cec842cf2f5bf35662a3","referrer":"https://en.wikipedia.org/wiki/Monin","revision_id":0,"section_id":null,"session_token":"885d58303ce425ca8ecc","skin":"vector","citation_identifier_label":null}

And the page in question Monin_(company) hasn't been created yet. Perhaps it's best to ignore pages when the page_id is 0?
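If events with page_id 0 are to be ignored, that can happen at analysis time. A minimal Python sketch, assuming events are dictionaries with the field names from the sample event above; the helper name has_real_page and the non-zero IDs below are made up for illustration.

```python
def has_real_page(event):
    """True when the event refers to an existing page (non-zero page_id).

    A missing or None page_id is treated the same as 0.
    """
    return bool(event.get("page_id"))


# Illustrative rows: the first mirrors the sample event, the second uses invented IDs.
events = [
    {"action": "extClick", "page_id": 0, "revision_id": 0},
    {"action": "extClick", "page_id": 12345, "revision_id": 67890},
    {"action": "extClick", "page_id": None, "revision_id": None},
]

kept = [e for e in events if has_real_page(e)]
print(len(kept), "of", len(events), "events kept")
```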

@RyanSteinberg also note that some internal links are being marked as external (T217567). It may be something to keep in mind while doing the analysis.

Please let me know when we're ready to start collecting data. Everything is ready on my end.

Just to make sure I'm tracking this correctly, it seems that some of the internal clicks are being mislabeled / incorrectly identified as external - is that correct?

I'm concerned this may be an issue unless we have a consistent approach (perhaps by "en.wikipedia.org") to distinguish those internal links that get miscoded and remove them from the data set. Is there a sense of volume of these incorrectly assigned links?

Just to make sure I'm tracking this correctly, it seems that some of the internal clicks are being mislabeled / incorrectly identified as external - is that correct?

Yes, that's correct.

I'm concerned this may be an issue unless we have a consistent approach (perhaps by "en.wikipedia.org") to distinguish those internal links that get miscoded and remove them from the data set. Is there a sense of volume of these incorrectly assigned links?

These links can be filtered out by excluding link URLs that start with https://en.wikipedia.org. I'm not sure about the volume. Maybe Ryan can calculate it?
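The prefix filter, and a rough estimate of its volume, could look like the Python sketch below. It assumes events are dictionaries with the action and link_url fields from the sample event; the function names and the sample URLs other than the en.wikipedia.org prefix are placeholders.

```python
INTERNAL_PREFIX = "https://en.wikipedia.org"


def is_miscoded_internal(event):
    """True for an extClick whose link URL points back at en.wikipedia.org."""
    url = event.get("link_url") or ""
    return event.get("action") == "extClick" and url.startswith(INTERNAL_PREFIX)


def miscoded_share(events):
    """Share of extClick events that match the internal prefix."""
    ext_clicks = [e for e in events if e.get("action") == "extClick"]
    if not ext_clicks:
        return 0.0
    return sum(is_miscoded_internal(e) for e in ext_clicks) / len(ext_clicks)


# Invented sample: one of four extClicks is an internal search link.
sample = [
    {"action": "extClick", "link_url": "https://en.wikipedia.org/w/index.php?search=Monin"},
    {"action": "extClick", "link_url": "https://example.org/paper"},
    {"action": "extClick", "link_url": "https://example.com/report.pdf"},
    {"action": "extClick", "link_url": "https://example.net/article"},
]

clean = [e for e in sample if not is_miscoded_internal(e)]
print(miscoded_share(sample), len(clean))
```

A plain prefix match is deliberately simple; it would also match a hostname that merely begins with en.wikipedia.org, so comparing parsed hostnames is the stricter option if that matters.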

Sorry guys, it's hard for me to find time for this project during the week.
New review: https://github.com/ryanmax/wiki-citation-usage/blob/master/data-regression-2019-03-05.ipynb

@Lauren.maggio, about 1.8% of extClick events in the sample are likely miscoded as external (start with https://en.wikipedia.org). Not sure if that number is too high. If so, perhaps instead of relying on markup for this notion of external-ness, we should consider calculating it based on each link's href attribute? @bmansurov, what do you think? It seems like it would solve the problem of miscoded internal and external links but might create others.

Ignoring page_id = 0 events seems reasonable. We can do that during analysis. Even combining these page_id 0 events with the miscoded external links only excludes 1.8% of the sample extClicks.

I also took a look at link_occurrence and ext_position data. @Lauren.maggio when an external link occurs more than once on a page, the position of the last occurrence is reported, potentially biasing data to indicate clicking occurs lower on the page. I don't think there's a better solution for this (counting the first occurrence would just skew for clicks higher on the page) but it's worth noting for methods, etc. Any use of ext_position data might just need to exclude links that occur more than once.
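Excluding repeated links from any ext_position analysis might be done as in this Python sketch. It assumes that link_occurrence holds the number of times the clicked link appears on the page, which is my reading of the discussion above rather than something confirmed by the schema documentation; the helper name and sample values are invented.

```python
def usable_for_position(event):
    """Keep only events whose link appears exactly once on the page,
    so that ext_position is unambiguous."""
    return (
        event.get("ext_position") is not None
        and event.get("link_occurrence") == 1
    )


# Invented sample rows using the schema field names.
sample = [
    {"action": "extClick", "ext_position": 5, "link_occurrence": 2},
    {"action": "extClick", "ext_position": 12, "link_occurrence": 1},
    {"action": "extClick", "ext_position": None, "link_occurrence": 1},
]

position_sample = [e for e in sample if usable_for_position(e)]
print([e["ext_position"] for e in position_sample])
```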

If so, perhaps instead of relying on markup for this notion of external-ness, we should consider calculating it based on each link's href attribute? @bmansurov, what do you think?

I think we should use both signals (i.e. the 'external' flag and the link URL) because other than the bug mentioned (T217567), the 'external' flag is pretty accurate. It's unfortunate that a related bug (T13477) has been open for many years and won't be fixed any time soon.

The instrumentation code only reports extClick events on links explicitly coded with class external. It's simple to exclude internal links that were miscoded as external, but what about the reverse? Links that are coded as internal but are really external won't be represented in click data at all. It looks like interwiki links are a potential problem here. For example interwiki doi links get the class extiw not external so would be missed. See ref 5 on Diamantane for an example. Interwiki doi alone represents a good number of links that my team would surely think of as external: see first 500. And reviewing the interwikimap, I see other base interwiki hostnames that seem "external": merriam-webster.com, handle.net, google.com, etc.

Instead of using the external class to define external links, could we instead define them based on whether or not the link's hostname matches the baseURI hostname? This would end up including data from clicks to other language wikis, wikimedia.org, wikidata.org, etc., but I think that would be a better notion of "external". What do others think?
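To make the hostname-based definition concrete, here is a minimal Python sketch of the comparison as it could be applied at analysis time to link_url and referrer values. It is not the instrumentation code (which runs in the browser); the function name is_external and the example URLs are illustrative only.

```python
from urllib.parse import urlsplit


def is_external(link_url, page_url):
    """True when the link's hostname differs from the page's hostname.

    Relative and fragment-only links have no hostname and count as internal.
    """
    link_host = urlsplit(link_url).hostname
    page_host = urlsplit(page_url).hostname
    if link_host is None:
        return False
    return link_host != page_host


page = "https://en.wikipedia.org/wiki/Diamantane"

print(is_external("https://doi.org/10.1000/xyz", page))            # different host
print(is_external("https://commons.wikimedia.org/wiki/Main_Page", page))  # sister project
print(is_external("https://en.wikipedia.org/w/index.php?search=Monin", page))  # same host
print(is_external("#cite_note-5", page))                           # fragment only
```

Under this definition, clicks to other language wikis and sister projects count as external, while anything on the page's own host does not.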

I was not aware of the DOI case. Thanks for bringing it up. I think in that case it makes sense to use the URL only and ignore the external flag. It will probably take a couple of weeks to get this adjustment made in code and shipped to production. If time is of concern, then let's derive whether a link is external or not during analysis. Please let me know what you prefer.

My team discussed this today and reached consensus that comparing links with the document's hostname is preferred. This new definition of external seems cleaner and well worth the wait. Of course I'm happy to hear from others if there are objections to this change. Thank you!

@RyanSteinberg Is this something you can address at analysis time? This is an increase in the scope of fixing the schema, and we don't have the resources at the moment to make it happen. If it breaks your research, let us know and we will reevaluate. Otherwise, let's plan to start the data collection soon with the fixes already in place.

Hi @leila. I don't think we ever collectively defined what an external link was in our schema. Using the external class, in my opinion, is a large problem that negatively impacts the strength of our research. I'm not sure how to account for it in analysis since the data for extClick events would miss a considerable number of links that editors clearly intend to be external reference links.

Count of interwiki links by interwiki prefix: these 37,735 links are all clearly external, but clicks on them would not be captured with our current tooling. I would argue that other link types (e.g. commons, wikidata, wiktionary, etc.) should in fact be included (anything except links back to en.wikipedia.org). Here's an unlimited report with page/link counts and below are a few examples from the 767,158 commons links that clearly should be included as extClicks:

https://en.wikipedia.org/?curid=3043886#cite_note-24
https://en.wikipedia.org/?curid=14920951#cite_note-4
https://en.wikipedia.org/?curid=4837705#cite_note-JtfGtmoAssessmentIsn143-9 (File:ISN link)
https://en.wikipedia.org/?curid=876106#cite_note-52

If changing the tooling code to compare each clicked link to the wikipedia hostname is too resource-intensive right now, what about a modest change to include the extiw CSS class? If I'm understanding wiki structure correctly, I think this would cover a broad swath of missing extClick events.

Thank you for your assistance!

@RyanSteinberg Lauren and I talked yesterday and based on the latest analysis, the effect of this particular bug is evenly distributed across articles and the decision is not to fix it and move forward with data collection. I'll send an email with more info about it today.

@bmansurov Please go ahead and schedule the data collection to start as early as 2019-03-20. Please ping me here if everything is clear and I'll let the enwiki folks know. Please schedule it for 1 month, similar to the previous time.

Change 496857 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Enable logging for CitationUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/496857

@Ottomata a heads up that we'll be collecting citation data starting March 20th which lasts one month. The sampling rate is 100% for the schema CitationUsage and 33.3% for the schema CitationUsagePageLoad as before (similar to the second round of data collection: T203253).

@bmansurov thanks! How is the staging done? In the second round, we gradually increased data collection from 1-15% per the village pump announcement, and then to 100%. Some staging is recommended for this round as well.

@Lauren.maggio can you make sure someone on your team is ready to start assessing the quality of the data when we start the staging and data collection on 2019-03-20? We should aim to catch any major issues in the first 24 hours, please.

@Miriam Can you send a village pump announcement about this? I read your previous one and you will do a much better job than I as you know more details.

@leila we did staging because we wanted to make sure that the back end can handle the load. Now that we know it can, we can safely use the intended sampling rates. I'm not sure of other reasons why staging is needed. Maybe @Miriam knows?

Talked to Miriam, and she made an announcement today. We'll wait two days and deploy on Thursday if everything is fine by then.

Yes, announcement just posted!

Change 496857 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/mediawiki-config@master] Enable logging for CitationUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/496857

Change 496857 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable logging for CitationUsage and CitationUsagePageLoad

https://gerrit.wikimedia.org/r/496857

@Lauren.maggio can you make sure someone on your team is ready to start assessing the quality of the data when we start the staging and data collection on 2019-03-20? We should aim to catch any major issues in the first 24 hours, please.

I reviewed the initial data and do not see any surprises.

@Miriam, an FYI that I'll be turning off data collection tomorrow. It's been a month.

Mentioned in SAL (#wikimedia-operations) [2019-04-23T11:43:11Z] <kartik@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:505643]] Turn off logging for CitationUsage and CitationUsagePageLoad (T213969) (duration: 00m 53s)

bmansurov claimed this task.
bmansurov moved this task from Backlog to Done (current quarter) on the Research board.

Data collection is over.

Change 626016 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/WikimediaEvents@master] citationUsage: Remove unused campaign code

https://gerrit.wikimedia.org/r/626016

Change 626019 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] labs: Remove old wgWMECitationUsage* settings for Beta Cluster

https://gerrit.wikimedia.org/r/626019

Krinkle claimed this task.
Krinkle added a project: Performance-Team.

Change 626016 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] citationUsage: Remove unused campaign code

https://gerrit.wikimedia.org/r/626016

Change 626019 merged by jenkins-bot:
[operations/mediawiki-config@master] labs: Remove old wgWMECitationUsage* settings for Beta Cluster

https://gerrit.wikimedia.org/r/626019