Nuria (Nuria)
Disabled

User Details

User Since
Nov 26 2014, 3:04 AM (521 w, 4 h)
Roles
Disabled
LDAP User
Nuria
MediaWiki User
Unknown

Recent Activity

Mon, Nov 4

Milimetric awarded T258511: Data Lake incremental Data Updates a Love token.
Mon, Nov 4, 5:20 PM · Patch-For-Review, Analytics, Epic, Product-Analytics

May 21 2023

Nuria added a comment to T207171: Have a way to show the most popular pages per country.

This data was released. Due to various technical factors, there are three distinct datasets:
https://analytics.wikimedia.org/published/datasets/

May 21 2023, 5:00 PM · Data-Engineering, Data-Engineering-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews
Nuria added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

This data was released. Due to various technical factors, there are three distinct datasets:
https://analytics.wikimedia.org/published/datasets/

May 21 2023, 4:53 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release

May 4 2023

Nuria created T335958: The soon-to-be-released pageview datasets should be linked from dumps page .
May 4 2023, 1:33 PM · Privacy Engineering, Data-Engineering

Feb 22 2023

Nuria closed T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data, a subtask of T189339: An expert panel to produce recommendations on open data sharing for public good, as Resolved.
Feb 22 2023, 3:13 PM · Data-Engineering-Icebox, Analytics-Radar, Privacy Engineering, Privacy, Data-release
Nuria closed T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data as Resolved.
Feb 22 2023, 3:13 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release

Jan 30 2023

Jdlrobson awarded T248884: Documentation of client side error logging capabilities on mediawiki a Love token.
Jan 30 2023, 5:19 PM · Instrument-ClientError, Observability-Logging, observability, Analytics-Radar, Documentation, Performance-Team (Radar), Wikimedia-Logstash, Better Use Of Data
phuedx awarded T248884: Documentation of client side error logging capabilities on mediawiki a Mountain of Wealth token.
Jan 30 2023, 11:06 AM · Instrument-ClientError, Observability-Logging, observability, Analytics-Radar, Documentation, Performance-Team (Radar), Wikimedia-Logstash, Better Use Of Data

Sep 1 2022

Nuria added a comment to T263908: Article on Carles Puigdemont has inflated pageviews in many projects.

This is a bot that was crawling this page, realized that it had been detected as a bot, and shifted its pattern a bit (notice these were automated pageviews a while back):

[Screenshot attachment: Screen Shot 2022-09-01 at 3.53.45 PM.png]

Sep 1 2022, 11:06 PM · Analytics-Kanban, Pageviews-Anomaly

Nov 2 2021

Nuria added a comment to T120242: Eventually Consistent MediaWiki State Change Events.

New developments in this area are of interest: a watermark-based change data capture framework from Netflix aims to do what this task is about, streaming data from source A to source B while taking an initial snapshot into account: https://arxiv.org/pdf/2010.12597v1.pdf

Nov 2 2021, 4:02 PM · Data-Engineering, Analytics, DBA, WMF-Architecture-Team, Platform Team Legacy (Later), Event-Platform, Services (later)

Oct 11 2021

Nuria added a comment to T280385: Apache Beam go prototype code for DP evaluation.

Love openDP @Htriedman

Oct 11 2021, 8:25 PM · Research-Freezer, Data-Engineering, Privacy Engineering, Privacy, Data-release

Jun 18 2021

Nuria added a comment to T280385: Apache Beam go prototype code for DP evaluation.

This path is the road less travelled, it seems, but that is not a reason not to attempt it.

Nice

Jun 18 2021, 7:35 PM · Research-Freezer, Data-Engineering, Privacy Engineering, Privacy, Data-release

Jun 4 2021

Nuria added a comment to T280385: Apache Beam go prototype code for DP evaluation.

It might also be possible to do a differentially-private count on a single machine within the Analytics cluster (likely on stat1007, which I think has a lot of RAM). This could be either with Privacy on Beam (using the local runner) or with Google’s Java/Go implementations of DP.

Part of this task is to make data releases of this type part of the regular cycle of data releases at WMF, so I do not think we should treat this project as a one-off data release. Rather, running it like any other data flow should be a core requirement.

Jun 4 2021, 9:12 PM · Research-Freezer, Data-Engineering, Privacy Engineering, Privacy, Data-release

May 26 2021

Nuria updated subscribers of T280385: Apache Beam go prototype code for DP evaluation.

My thoughts on the proposals:

May 26 2021, 4:25 PM · Research-Freezer, Data-Engineering, Privacy Engineering, Privacy, Data-release

May 14 2021

Nuria added a comment to T282584: Clean up EventLogging Schema: pages on meta.

There is no third party (to our knowledge, in the last 7 years) that has ever used this system, so it makes sense to deprecate it via notification.

May 14 2021, 11:10 PM · Analytics-Kanban, MediaWiki-extensions-EventLogging, Documentation, Analytics, Wikimedia-Developer-Portal

May 6 2021

Nuria created T282195: ApacheBeam prototype for DP noise addition with pageview privacy units on top of Spark.
May 6 2021, 10:59 PM · Research-Freezer, Data-Engineering-Radar, Privacy Engineering, Privacy, Data-release

May 3 2021

Nuria added a comment to T280385: Apache Beam go prototype code for DP evaluation.

what it would take to migrate some of this to the cluster where the Apache Spark runner could be tested

We probably do not want to install Beam on the cluster just for this experiment, so can we use Jupyter instead and run Beam on Python? https://beam.apache.org/get-started/quickstart-py/

May 3 2021, 8:59 PM · Research-Freezer, Data-Engineering, Privacy Engineering, Privacy, Data-release

Apr 20 2021

Nuria added a comment to T280385: Apache Beam go prototype code for DP evaluation.

User is more standard and has stronger guarantees but more complicated

Also, our privacy policy prevents us from keeping data at the user level, so DP notions that are user-centric will not really serve our use case. I doubt they serve the case of any service you can use while not authenticated.

Apr 20 2021, 10:07 PM · Research-Freezer, Data-Engineering, Privacy Engineering, Privacy, Data-release

Apr 16 2021

Nuria created T280385: Apache Beam go prototype code for DP evaluation.
Apr 16 2021, 5:36 PM · Research-Freezer, Data-Engineering, Privacy Engineering, Privacy, Data-release

Apr 2 2021

Nuria added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

Had more time to review this and appreciate the coolness of this: "I apply a flexible threshold in the tool based on an approach from Google where you calculate the likelihood that the real datapoint is within X% of the noisy data point and threshold based on that. So in the tool, noisy data points that are calculated to have less than a 50% chance of being within 25% of the actual value are greyed out. The 50%/25% parameters can be adjusted but are a reasonable starting place"

Apr 2 2021, 10:23 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release
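
The thresholding rule quoted in the comment above can be sketched as follows. This is a minimal illustration, assuming Laplace noise with a known scale and using the noisy value as a stand-in for the unknown true value when computing the 25% tolerance; the function names are hypothetical and this is not the code of the tool being discussed.

```python
import math


def prob_within_relative_error(noisy_value: float, scale: float,
                               rel_tol: float = 0.25) -> float:
    """P(|Laplace(0, scale)| <= rel_tol * |noisy_value|) = 1 - exp(-t / scale)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    tolerance = rel_tol * abs(noisy_value)
    return 1.0 - math.exp(-tolerance / scale)


def should_grey_out(noisy_value: float, scale: float,
                    rel_tol: float = 0.25, min_prob: float = 0.5) -> bool:
    """Grey out data points unlikely to be within rel_tol of the real value."""
    return prob_within_relative_error(noisy_value, scale, rel_tol) < min_prob


# Hypothetical values: noise scale 10, two noisy data points.
large_point_prob = prob_within_relative_error(40.0, 10.0)
small_point_greyed = should_grey_out(8.0, 10.0)
large_point_greyed = should_grey_out(40.0, 10.0)
```

With the 50%/25% defaults, small noisy counts are greyed out because the noise is large relative to the value, while larger counts pass.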

Jan 21 2021

Nuria added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

in a setting like the one you describe, what would the attacker know, and what would they be trying to find out?

Sorry let me clarify: what would be known to an attacker is the exact pageviews per project per article, see: https://dumps.wikimedia.org/other/pageview_complete/readme.html
An attack might try to remove the noise in order to find the pageviews per article, per country.

Jan 21 2021, 12:32 AM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release

Jan 20 2021

Nuria added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

Parking some thoughts from my conversation with @Isaac after his good work these past couple of weeks.

Jan 20 2021, 8:02 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release

Jan 19 2021

Nuria added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

Nice, @Isaac. I need to get back to this now that https://phabricator.wikimedia.org/T269256 is closed

Jan 19 2021, 4:21 AM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release
Nuria added a comment to T269256: Story Idea for Blog: Automated detection of wikipedia censorship events .

cc @Slaporte that blogpost about technical measures to detect censorship has been published

Jan 19 2021, 4:15 AM · Technical-blog-posts
Nuria closed T269256: Story Idea for Blog: Automated detection of wikipedia censorship events as Resolved.
Jan 19 2021, 4:14 AM · Technical-blog-posts

Jan 16 2021

Nuria added a comment to T269256: Story Idea for Blog: Automated detection of wikipedia censorship events .

@srodlund I see, how about (probably a reworked version of)

Jan 16 2021, 12:46 AM · Technical-blog-posts

Jan 15 2021

Nuria added a comment to T269256: Story Idea for Blog: Automated detection of wikipedia censorship events .

@srodlund on mobile especially, the initial paragraph: "The act of detecting anomalous events in a series of events (in this case a time series of Wikipedia pageviews) is called anomaly detection. The anomalies we are looking for are sudden drops in pageviews on a per-country basis." looks, I think, much too prominent. Can we remove it entirely so the blogpost starts at "About four years ago"?

Jan 15 2021, 9:50 PM · Technical-blog-posts
Nuria added a comment to T269256: Story Idea for Blog: Automated detection of wikipedia censorship events .

"derivative of logo" sounds good. No rush on publishing it whenever works for you.

Jan 15 2021, 12:25 AM · Technical-blog-posts

Jan 12 2021

Nuria added a comment to T269256: Story Idea for Blog: Automated detection of wikipedia censorship events .

Ping @srodlund

Jan 12 2021, 5:14 AM · Technical-blog-posts

Jan 6 2021

Nuria added a comment to T269256: Story Idea for Blog: Automated detection of wikipedia censorship events .

@srodlund I think it is almost final! I accepted all your corrections and elaborated a bit on the conclusion. Please take a second look. Let me know whether the tables should be turned into images (or HTML tables), or how you prefer to handle that.

Jan 6 2021, 3:35 AM · Technical-blog-posts

Jan 5 2021

Nuria added a comment to T271170: Unique devices numbers for all wikipedias missing for Agust and SEptember.

Thanks for the fast response!

Jan 5 2021, 6:51 PM · Analytics-Kanban, Analytics-Data-Quality, Analytics
Nuria closed T271170: Unique devices numbers for all wikipedias missing for Agust and SEptember as Resolved.
Jan 5 2021, 6:50 PM · Analytics-Kanban, Analytics-Data-Quality, Analytics

Jan 4 2021

Nuria created T271170: Unique devices numbers for all wikipedias missing for Agust and SEptember.
Jan 4 2021, 11:19 PM · Analytics-Kanban, Analytics-Data-Quality, Analytics

Dec 18 2020

Nuria added a comment to T269256: Story Idea for Blog: Automated detection of wikipedia censorship events .

@srodlund perfect, that gives me next week to finalize the text. The new year sounds great.

Dec 18 2020, 11:04 PM · Technical-blog-posts

Dec 2 2020

Nuria added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

@Aklapper I assigned this to myself again after my account was re-activated

Dec 2 2020, 4:59 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release
Nuria claimed T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.
Dec 2 2020, 4:59 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release
Nuria created T269256: Story Idea for Blog: Automated detection of wikipedia censorship events .
Dec 2 2020, 4:58 PM · Technical-blog-posts

Nov 27 2020

Nuria closed T268895: Reactivate nuria's Phabricator account as Resolved.
Nov 27 2020, 10:40 PM · Phabricator
Nuria added a comment to T268895: Reactivate nuria's Phabricator account.

Done both things, many thanks @Reedy

Nov 27 2020, 10:39 PM · Phabricator
Nuria added a comment to T268895: Reactivate nuria's Phabricator account.

Super thanks!

Nov 27 2020, 10:38 PM · Phabricator

Nov 18 2020

Nuria added a comment to T183291: Requesting account expiration extension.

To keep archives happy, WMF did the work of productionizing these scripts: https://wikitech.wikimedia.org/wiki/Analytics/Data_quality/Traffic_per_city_entropy

Nov 18 2020, 5:21 AM · Analytics-Radar, Analytics-Clusters

Nov 12 2020

Nuria added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

@JAllemandou Given that user fingerprinting on pageview_hourly data is not effective (and if it were to be it would be a problem) I *think* I am going to center my efforts - when, ahem, I can get to this - in other privacy 'units'

Nov 12 2020, 7:38 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release

Nov 7 2020

Nuria added a comment to T267454: Get list of most viewed articles by viewers from specific country .

This is WIP. please see: T207171: Have a way to show the most popular pages per country

Nov 7 2020, 3:36 AM · Data-Engineering-Wikistats, Analytics

Nov 6 2020

Nuria added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

Thanks @TedTed for all these pointers, on my end I need to digest all this info before I can get back to you, others here might have more questions.

Nov 6 2020, 10:56 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release
Nuria added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

We have IPs in a temporary dataset, called pageview_actor that feeds into pageview_hourly, so that's where we'd get the fingerprint Joseph is talking about. We could insert two steps in between these datasets,

Nov 6 2020, 5:17 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release

Nov 5 2020

Nuria added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

Say that field has value 5, does it mean that the page had 5 different views, potentially from 5 different users (but all with the same country, user_agent_map, etc.)?

Yes, exactly, same country, same (broadly) user agent and same article.

Nov 5 2020, 10:49 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release
Nuria updated subscribers of T267312: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo.

These approvals are now handled by @Ottomata

Nov 5 2020, 9:33 PM · Trust-and-Safety, SRE-Access-Requests, SRE
Nuria added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

@TedTed Super thanks for chiming in

Nov 5 2020, 8:57 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release
Nuria edited projects for T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data, added: Analytics; removed Analytics-Radar.
Nov 5 2020, 8:53 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release
Nuria created T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.
Nov 5 2020, 2:04 AM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release
Nuria added a comment to T189339: An expert panel to produce recommendations on open data sharing for public good.

for #3. see release: https://techblog.wikimedia.org/2020/10/01/mediawiki-history-the-best-dataset-on-wikimedia-content-and-contributors/

Nov 5 2020, 1:37 AM · Data-Engineering-Icebox, Analytics-Radar, Privacy Engineering, Privacy, Data-release
Nuria closed T210313: Statistics for views of individual Wikimedia images as Resolved.
Nov 5 2020, 1:27 AM · Multimedia, Analytics, Tool-Pageviews

Oct 29 2020

Nuria closed T236740: Remove postal code and longitude / latitude from geocoded data object on webrequest data as Resolved.
Oct 29 2020, 6:22 PM · Analytics-Kanban, Product-Analytics, Analytics

Oct 28 2020

Nuria moved T257692: Add data quality alarm for mobile-app data from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Oct 28 2020, 6:36 PM · Analytics-Kanban, Analytics, Product-Analytics
Nuria added a comment to T257692: Add data quality alarm for mobile-app data .

Code is merged now; when the entropy counts are rerun, alarms for May 18th will be resent.

Oct 28 2020, 6:36 PM · Analytics-Kanban, Analytics, Product-Analytics

Oct 27 2020

Nuria moved T257692: Add data quality alarm for mobile-app data from In Progress to In Code Review on the Analytics-Kanban board.
Oct 27 2020, 11:52 PM · Analytics-Kanban, Analytics, Product-Analytics
Nuria added a comment to T266467: Check home/HDFS leftovers of rodolfovalentim.

@Dzahn: I do not think so, he should be removed from LDAP

Oct 27 2020, 11:30 PM · Analytics

Oct 23 2020

Nuria added a comment to T266086: Nuria's volunteer account.

done!

Oct 23 2020, 10:52 PM · Analytics-Radar, SRE-Access-Requests, SRE
Nuria added a comment to T266086: Nuria's volunteer account.

NDA signed now but I do not have access to https://phabricator.wikimedia.org/L2?

Oct 23 2020, 10:43 PM · Analytics-Radar, SRE-Access-Requests, SRE
Nuria added a project to T266086: Nuria's volunteer account: Analytics.
Oct 23 2020, 10:42 PM · Analytics-Radar, SRE-Access-Requests, SRE
Nuria added a comment to T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung.

Also, @Rmaung please take a look at https://wikitech.wikimedia.org/wiki/Analytics/Data_Access_Guidelines and ask any questions you might have about it on task

Oct 23 2020, 10:38 PM · Analytics, SRE, SRE-Access-Requests
Nuria updated subscribers of T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung.

@Rmaung: can you describe what data you are looking to access? This is so we can see what the appropriate level of access is (cc @Ottomata).

Oct 23 2020, 10:36 PM · Analytics, SRE, SRE-Access-Requests

Oct 22 2020

Nuria added a comment to T263041: OperationError: The operation failed for an operation-specific reason in generateRandomSessionId .

If this is not super urgent, I can work on it in my volunteer capacity.

Oct 22 2020, 11:00 PM · MW-1.36-notes (1.36.0-wmf.33; 2021-03-02), Better Use Of Data, Analytics-Radar, Product-Data-Infrastructure, Event-Platform, JavaScript, MediaWiki-extensions-EventLogging, Wikimedia-production-error

Oct 21 2020

Nuria added a comment to T207171: Have a way to show the most popular pages per country.

I would implement the daily "top" first and, once that is in place, add the monthly job. Given the very different amounts of data needed for each, a different strategy might be needed for the second one.

Oct 21 2020, 11:38 PM · Data-Engineering, Data-Engineering-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews
Nuria added a comment to T207171: Have a way to show the most popular pages per country.

A daily release to provide quick information for editors interested in very targeted editing. I suspect that this could even be just a ranking of most popular articles that meet the privacy thresholds without including any raw count data

Nice, +1 to this idea

Oct 21 2020, 11:36 PM · Data-Engineering, Data-Engineering-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews
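
The idea endorsed above (a daily ranking of the most popular articles that meet a privacy threshold, published without raw counts) can be sketched as below. The function name, the threshold of 100 views and the sample data are hypothetical; the actual thresholds were not specified in this discussion.

```python
def top_articles_ranking(counts: dict[str, int], min_views: int,
                         top_n: int) -> list[str]:
    """Return article titles ordered by popularity, dropping low-count rows.

    Only titles are returned, so raw counts are never published.
    """
    eligible = [(title, views) for title, views in counts.items()
                if views >= min_views]
    # Sort by views descending, then by title for a deterministic order.
    eligible.sort(key=lambda item: (-item[1], item[0]))
    return [title for title, _ in eligible[:top_n]]


# Hypothetical per-country daily counts.
daily_counts = {"A": 500, "B": 1200, "C": 40, "D": 500}
ranking = top_articles_ranking(daily_counts, min_views=100, top_n=3)
```

Article "C" falls under the threshold and is dropped before ranking, so neither its count nor its presence is exposed.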
Nuria added a comment to T264945: Update Wikidata usage metric.

I found differences of <0.1% for recent months and <0.3% for older months. I think that's acceptable.

Oct 21 2020, 7:38 PM · Analytics-Kanban, Analytics
Nuria updated the task description for T266086: Nuria's volunteer account.
Oct 21 2020, 4:56 AM · Analytics-Radar, SRE-Access-Requests, SRE
Nuria created T266086: Nuria's volunteer account.
Oct 21 2020, 4:55 AM · Analytics-Radar, SRE-Access-Requests, SRE
Nuria updated Nuria.
Oct 21 2020, 4:36 AM

Oct 20 2020

Nuria added a comment to T265167: Request a Kerberos identity for sbisson.

For faster resolution of permissions issues, add SRE-Access-Requests to the ticket; that way the person on clinic duty will get to work on it soon after the ticket is filed. I understand that the process is a bit confusing, but permissions to access the prod infra (including analytics clusters) are handled by the SRE team at large.

Oct 20 2020, 5:21 PM · Analytics
Nuria added a comment to T262626: Remove http.client_ip from EventGate default schema (again).

If the intent is to decide whether all errors are from same user you can send the number of errors for that session of that type and that would tell you the piece of info you want to know.

Fleshing this out a bit more: the number of errors for a device does not need to be per session but rather can be a tally:

Oct 20 2020, 3:20 PM · Data-Engineering, Better Use Of Data, Analytics-Kanban, Product-Analytics, Product-Data-Infrastructure, observability, Privacy Engineering, Analytics, Event-Platform
Nuria added a comment to T265952: Retain nonsensitive mediawiki_api_request logging data.

We can keep data for longer than 90 days that has no identifying fields. Just need to submit a changeset that lists those fields. Please take a look at docs: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention

Oct 20 2020, 2:48 PM · Analytics
Nuria updated the task description for T236740: Remove postal code and longitude / latitude from geocoded data object on webrequest data.
Oct 20 2020, 4:14 AM · Analytics-Kanban, Product-Analytics, Analytics

Oct 19 2020

Nuria added a comment to T262626: Remove http.client_ip from EventGate default schema (again).

A hashed IP would still tell you how many IPs are involved, without revealing any individual IP

For it to be truly non-revealing over a 2^32 space, it will probably need to be salted.

Oct 19 2020, 9:13 PM · Data-Engineering, Better Use Of Data, Analytics-Kanban, Product-Analytics, Product-Data-Infrastructure, observability, Privacy Engineering, Analytics, Event-Platform
Nuria updated subscribers of T265167: Request a Kerberos identity for sbisson.

@elukey to create kerberos credentials

Oct 19 2020, 3:38 PM · Analytics

Oct 16 2020

Nuria added a comment to T207171: Have a way to show the most popular pages per country.

So, if you just say "this number is too low to be displayed" , I don't think that anyone will complain

This is actually very useful info, thank you.

Oct 16 2020, 3:05 PM · Data-Engineering, Data-Engineering-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews

Oct 15 2020

Nuria added a comment to T207171: Have a way to show the most popular pages per country.

Keep in mind that per-population data is not necessarily needed (it would be great to have at some point, but it feels like scope creep in this task).

Oct 15 2020, 11:13 PM · Data-Engineering, Data-Engineering-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews
Nuria added a comment to T188859: Wikistats 2.0: Add statistics for the geographical origin of the contributors.

I think that CKoerner_WMF.'s works fine with editors as well: "When Bethany started editing Malagasy Wikipedia in 2014, there were no Wikipedia editors in her home country of Madagascar" so I do not really see a strong use case for edits versus editors in this case

Oct 15 2020, 8:17 PM · Analytics-Kanban, Analytics, Data-Engineering-Wikistats
Nuria updated subscribers of T207171: Have a way to show the most popular pages per country.

Adding @Isaac because I think he is probably a good person to help explore whether more than a simple bucketization solution might be needed.

Oct 15 2020, 5:36 PM · Data-Engineering, Data-Engineering-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews
Nuria added a comment to T207171: Have a way to show the most popular pages per country.

We talked about doing some data analysis to quantify the issues with privacy and country splits. As discussed, we need to quantify the identification risk: an article with 1 pageview in "Greenlandic-language Wikipedia" might carry an identification risk of 1/55,000 (55,000 being the population of Greenland), while an article in Malaysian in San Marino might have an identification risk of 1/5 (5 citizens with Malaysian names in San Marino). So it is not the "number of pageviews" that defines the identification risk but rather the "possible population from which these pageviews are drawn".

Oct 15 2020, 3:13 PM · Data-Engineering, Data-Engineering-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews
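
The arithmetic in the comment above can be written down as a small sketch: the identification risk of a single pageview is taken as one over the size of the population it could plausibly come from. The population figures are the ones given in the comment; the function names and the 1% cut-off are hypothetical.

```python
def identification_risk(plausible_population: int) -> float:
    """Risk of identifying the reader behind one pageview: 1 / population."""
    if plausible_population <= 0:
        raise ValueError("population must be positive")
    return 1.0 / plausible_population


def exceeds_risk(plausible_population: int, max_risk: float) -> bool:
    """True if the risk is above a chosen (hypothetical) acceptable level."""
    return identification_risk(plausible_population) > max_risk


greenland_risk = identification_risk(55_000)  # Greenlandic Wikipedia example
san_marino_risk = identification_risk(5)      # San Marino example

# Same pageview count (1), very different risk:
flag_greenland = exceeds_risk(55_000, max_risk=0.01)
flag_san_marino = exceeds_risk(5, max_risk=0.01)
```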
Nuria added a project to T236740: Remove postal code and longitude / latitude from geocoded data object on webrequest data: Analytics-Kanban.
Oct 15 2020, 2:56 PM · Analytics-Kanban, Product-Analytics, Analytics
Nuria added a comment to T236740: Remove postal code and longitude / latitude from geocoded data object on webrequest data.

Moving to kanban and @razzi to work on this.

Oct 15 2020, 2:55 PM · Analytics-Kanban, Product-Analytics, Analytics
nshahquinn-wmf awarded T248884: Documentation of client side error logging capabilities on mediawiki a Cookie token.
Oct 15 2020, 9:09 AM · Instrument-ClientError, Observability-Logging, observability, Analytics-Radar, Documentation, Performance-Team (Radar), Wikimedia-Logstash, Better Use Of Data

Oct 14 2020

Nuria assigned T236740: Remove postal code and longitude / latitude from geocoded data object on webrequest data to razzi.
Oct 14 2020, 9:17 PM · Analytics-Kanban, Product-Analytics, Analytics
Nuria updated the task description for T236740: Remove postal code and longitude / latitude from geocoded data object on webrequest data.
Oct 14 2020, 9:16 PM · Analytics-Kanban, Product-Analytics, Analytics
Nuria updated subscribers of T236740: Remove postal code and longitude / latitude from geocoded data object on webrequest data.

Pinging @JAllemandou in case he can think of any reason why we should leave these fields, giving precision.

Oct 14 2020, 9:16 PM · Analytics-Kanban, Product-Analytics, Analytics
Nuria updated the task description for T236740: Remove postal code and longitude / latitude from geocoded data object on webrequest data.
Oct 14 2020, 9:14 PM · Analytics-Kanban, Product-Analytics, Analytics
Nuria closed T173604: Reportupdater: do not write execution control files in source directories as Resolved.
Oct 14 2020, 9:11 PM · Analytics-Kanban, good first task, Analytics
Nuria closed T173604: Reportupdater: do not write execution control files in source directories, a subtask of T193167: reportupdater TLC, as Resolved.
Oct 14 2020, 9:11 PM · Analytics
Nuria closed T255685: Renaming "analytics-cluster" tag to "analytics-systems" and make into a subproject of analytics as Declined.
Oct 14 2020, 9:09 PM · Analytics
Nuria placed T251788: Add folder creation for sqoop initial installation in puppet up for grabs.
Oct 14 2020, 9:07 PM · Analytics-Kanban, Analytics
Nuria added a comment to T188859: Wikistats 2.0: Add statistics for the geographical origin of the contributors.

@CKoerner_WMF just so you know, this data has been publicly available for about a year now; the task in question is to visualize it via Wikistats.

Oct 14 2020, 2:38 PM · Analytics-Kanban, Analytics, Data-Engineering-Wikistats
akosiaris awarded T219544: Make hadoop cluster able to push to swift a Love token.
Oct 14 2020, 11:09 AM · Patch-For-Review, Analytics-Kanban, Research, SRE, Discovery-ARCHIVED, Analytics

Oct 13 2020

Nuria assigned T188859: Wikistats 2.0: Add statistics for the geographical origin of the contributors to fdans.
Oct 13 2020, 10:21 PM · Analytics-Kanban, Analytics, Data-Engineering-Wikistats
Nuria added a comment to T188859: Wikistats 2.0: Add statistics for the geographical origin of the contributors.

This is scheduled to be added to wikistats Q2 2020 (Sep to Dec)

Oct 13 2020, 10:20 PM · Analytics-Kanban, Analytics, Data-Engineering-Wikistats
Nuria updated subscribers of T261461: Capture special mute events in Prefupdate table [4 hour spike].

@jwang I think @Mholloway might be able to help given that this seems to be an instrumentation issue.

Oct 13 2020, 9:47 PM · Anti-Harassment (The Letter Song), Product-Analytics, Analytics-Radar

Oct 12 2020

Nuria added a comment to T257692: Add data quality alarm for mobile-app data .

To sum up: the timeseries of entropy of os_family per access_method works well as a data quality timeseries for 'mobile web' (see green line in plot above) and 'mobile app' (orange line). For desktop, the timeseries is a lot better from April onwards, when filtering of automated agents was deployed (see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection#Why_do_we_need_more_sophisticated_bot_detection). The blue line in the above graph starts to clearly oscillate with a weekly cadence from April onwards. Now, as can also be seen in the above graph, there are still spikes due to undetected bots. Those are bots that elude our detection for a number of reasons (they are really well spread geographically, or their effect on pageviews is not as high as our thresholds).

Oct 12 2020, 6:53 PM · Analytics-Kanban, Analytics, Product-Analytics
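
For readers unfamiliar with the metric summarized above, here is a minimal sketch of computing the Shannon entropy of the os_family distribution for one access_method and one time bucket. The sample counts are made up, and the production job referenced in this task may compute it differently.

```python
import math


def shannon_entropy(counts: dict[str, int]) -> float:
    """Shannon entropy (in bits) of a categorical distribution given as counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical pageview counts per os_family for 'mobile app'.
normal_day = {"Android": 50, "iOS": 50}
# A traffic anomaly dominated by a single os_family lowers the entropy.
skewed_day = {"Android": 50, "iOS": 50, "Other": 900}

normal_entropy = shannon_entropy(normal_day)
skewed_entropy = shannon_entropy(skewed_day)
```

A sudden drop or jump in this value from one period to the next is what a data quality alarm would watch for.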
Nuria removed a project from T257692: Add data quality alarm for mobile-app data : Epic.
Oct 12 2020, 6:52 PM · Analytics-Kanban, Analytics, Product-Analytics
Nuria closed T234826: Repurpose db1108 as generic Analytics db replica, a subtask of T159170: Sunset MySQL data store for eventlogging, as Resolved.
Oct 12 2020, 5:30 PM · Analytics-Kanban, MediaWiki-extensions-EventLogging
Nuria closed T234826: Repurpose db1108 as generic Analytics db replica as Resolved.
Oct 12 2020, 5:30 PM · Analytics-Clusters, User-Elukey, Analytics-Kanban