
Data-Engineering-Icebox
Group · Active · Public

Details

Description

The Data-Engineering team is reorganizing its workboard, which has accumulated a large volume of open, stale tickets over the years. As part of this process, a large fraction of the existing Analytics tasks will be temporarily moved to this new Data-Engineering-Icebox tag, which serves as a holding area.

Many moves to this tag will be based on the team's evaluation and reconsideration. Regardless of the mechanism, moving a task from Analytics to Data-Engineering-Icebox is not intended to reflect any personal or team opinion on the validity or importance of that task. We aim to reduce the main Data Engineering workboard to active, ongoing work and current incoming issues. After the initial move of many tasks to the icebox (from June 9th to June 23rd, 2022), we will reorganize the columns of the main workboard into a more Sprint-inspired workflow, and we will also triage all icebox tasks one by one, by hand, during planning.

This one-by-one triage process will include:

  1. Reviewing whether tasks need to move back to the primary Data Engineering workboard (for example, tasks caught up in bulk filters or moved by hasty, imprecise decisions).
  2. Categorizing the tasks into Icebox columns by their nature (new features / decline / not a priority but nice to have).
  3. Pinging stakeholders (author, assignee, and/or others) for updates or resolution (or, in some rarer cases, closing immediately when it seems clear this wouldn't be controversial or contested by stakeholders).
  4. Un-tagging (from Analytics) tasks that are managed by other (sub-)teams on other boards and in which Data Engineering has no significant stake or role.

At the end of this first pass of triage, most tasks are expected to remain open in the Icebox for some time. The Icebox is not intended to be a graveyard! Over the next two quarters, we'll look for new ways to organize some of the ideas and requests captured in many of these tickets.

Recent Activity

Tue, Nov 12

Ottomata added a subtask for T204950: Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users: T258511: Data Lake incremental Data Updates.
Tue, Nov 12, 3:26 PM · Data-Engineering-Icebox
Ottomata added a subtask for T204950: Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users: T215858: Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema.
Tue, Nov 12, 3:25 PM · Data-Engineering-Icebox

Tue, Nov 5

lmata edited projects for T266886: Augment NEL reports with a computed timestamp-of-generation, added: SRE Observability (FY2024/2025-Q2); removed SRE Observability (FY2024/2025-Q1).
Tue, Nov 5, 5:11 PM · SRE Observability (FY2024/2025-Q2), Observability-Logging, Data-Engineering-Icebox, Analytics

Oct 25 2024

Ottomata closed T265966: Proposal: drop kafka-php dependency from MediaWiki as Resolved.

A quick codesearch (https://codesearch.wmcloud.org/search/?q=kafka-php&files=&excludeFiles=&repos=) and local grep yield no results, so this knot might have neatly tied itself.

Oct 25 2024, 4:05 PM · Data-Engineering-Icebox, Analytics-Radar, Platform Team Workboards (Clinic Duty Team), MediaWiki-General

Oct 24 2024

Ottomata reopened T204950: Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users as "Open".

I'd like to keep this open, but I will remove the Data-Services tag. It is something we would really like to do.

Oct 24 2024, 3:31 PM · Data-Engineering-Icebox
taavi closed T204950: Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users as Declined.

That seems unnecessarily complicated when I just want to get old stale tasks off of the Data-Services board. I'll just close this instead; if someone is interested in getting this through the process, they're free to re-open it.

Oct 24 2024, 2:50 PM · Data-Engineering-Icebox

Oct 22 2024

Ottomata added a comment to T204950: Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users.

@taavi many tickets were declined for complexity reasons, but we have new ways of potentially doing this. It needs to be prioritized though, so if you have desires/needs, please escalate them through https://www.mediawiki.org/wiki/Data_Platform_Engineering/Intake_Process

Oct 22 2024, 5:38 PM · Data-Engineering-Icebox
taavi added a comment to T204950: Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users.

Sorry to poke a many-years-old ticket, but what still needs to happen here? All of the subtasks have been resolved already.

Oct 22 2024, 5:27 PM · Data-Engineering-Icebox
taavi moved T204950: Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users from Datasets to Backlog on the Data-Services board.
Oct 22 2024, 5:26 PM · Data-Engineering-Icebox
taavi moved T173511: Implement technical details and process for "datasets_p" on wikireplica hosts from Datasets to Wiki replicas on the Data-Services board.
Oct 22 2024, 5:21 PM · Data-Engineering-Icebox, cloud-services-team, Analytics-Radar, Data-Services
elukey closed T242712: Deprecation (if possible) of the #central channel on irc.wikimedia.org as Declined.
Oct 22 2024, 10:32 AM · Data-Engineering-Icebox

Oct 14 2024

Pppery removed projects from T139019: statistics about edit conflicts according to page type: Growth-Team-Filtering, Growth-Team, StructuredDiscussions.
Oct 14 2024, 5:11 PM · Data-Engineering-Icebox, Analytics-Radar, research-ideas, Community-Tech (2015-2017), Two-Column-Edit-Conflict-Merge, MediaWiki-Page-editing
Gehel moved T262942: PoC on anomaly detection with Flink from Current work to Feature Requests on the Wikidata-Query-Service board.
Oct 14 2024, 2:40 PM · Data-Engineering-Icebox, Analytics-Radar, Wikidata, Wikidata-Query-Service

Oct 1 2024

Tgr added a comment to T261803: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests.

Yeah, the wider issue here is that setting the cookie on cross-site subresource requests is probably harmful. Browsers apply all kinds of restrictions to cookies on such requests (e.g. can write but cannot read, or can read and write but the expiration will be different from what you set, or you see different cookie values for each parent domain you are making the cross-site request from) which will probably mess up the stats.

Oct 1 2024, 6:35 PM · MediaWiki-Platform-Team (Radar), Data-Engineering-Icebox, Developer Productivity, Analytics-Radar, Traffic, WMF-General-or-Unknown, SRE
matmarex added a comment to T261803: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests.

This is still a problem today, and it makes for a distraction when debugging other cookie problems.

Oct 1 2024, 5:00 PM · MediaWiki-Platform-Team (Radar), Data-Engineering-Icebox, Developer Productivity, Analytics-Radar, Traffic, WMF-General-or-Unknown, SRE

Jul 30 2024

Manuel edited parent tasks for T278665: wmde-toolkit-analyzer-build.service fails on stat1007, added: T348609: [EPIC] Clarify team ownership of WMDE cronjobs on stats1007; removed: T351070: [EPIC] Clean up Wikidata Grafana cronjobs.
Jul 30 2024, 9:56 AM · Wikidata, Data-Engineering-Icebox, Analytics-Radar, WMDE-Analytics-Engineering

Jul 29 2024

colewhite claimed T266886: Augment NEL reports with a computed timestamp-of-generation.

I think this is doable for Logstash. I'll have a go at it.

Jul 29 2024, 9:00 PM · SRE Observability (FY2024/2025-Q2), Observability-Logging, Data-Engineering-Icebox, Analytics
Ottomata closed T252148: Add a "latest" partition to Hive tables as Declined.

Being bold and declining. In an Iceberg world, this won't be needed.

Jul 29 2024, 8:06 PM · Data-Engineering-Icebox, Analytics
CDanis added a comment to T266886: Augment NEL reports with a computed timestamp-of-generation.

Yep, Logstash presently, although it would be nice if we had them in Hive some day as well :)

Jul 29 2024, 4:52 PM · SRE Observability (FY2024/2025-Q2), Observability-Logging, Data-Engineering-Icebox, Analytics
Ottomata added a comment to T266886: Augment NEL reports with a computed timestamp-of-generation.

See also: T291645: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL

Jul 29 2024, 4:46 PM · SRE Observability (FY2024/2025-Q2), Observability-Logging, Data-Engineering-Icebox, Analytics
colewhite added a comment to T266886: Augment NEL reports with a computed timestamp-of-generation.

Are these reports currently in Logstash or are they in Hive?

Jul 29 2024, 4:11 PM · SRE Observability (FY2024/2025-Q2), Observability-Logging, Data-Engineering-Icebox, Analytics
Legoktm added a comment to T209899: The mass-message queue reports 0 when there are still queued messages.

I'm not sure what the best solution is here. The number is wrong and is probably going to be wrong going forwards. We can remove/hide it along the lines of T209899#6656593, but removing it is also an API breaking change and this seems a little trivial to trigger that.

Jul 29 2024, 3:11 PM · Data-Engineering-Icebox, Analytics-Radar, ChangeProp, WMF-JobQueue, MassMessage

Jul 25 2024

nshahquinn-wmf closed T221828: Mediawiki-history release - Backlog as Declined.

I suspect that this tracking task is no longer useful.

Jul 25 2024, 11:27 PM · Data-Engineering-Icebox, Analytics

Jul 22 2024

Aklapper added a project to T246723: Add historical page protection status to MediaWiki history: MediaWiki-Page-history.
Jul 22 2024, 6:40 AM · MediaWiki-Page-history, Data-Engineering-Icebox, Analytics

Jul 15 2024

elukey added a comment to T242712: Deprecation (if possible) of the #central channel on irc.wikimedia.org.

At this point I'd proceed with the following:

  • Announce to Wikitech that we want to get rid of #central
  • File a change to stop sending events (I guess on mediawiki-config?)
Jul 15 2024, 2:44 PM · Data-Engineering-Icebox
fnegri closed T280152: Mitigate breaking changes from the new Wiki Replicas architecture as Resolved.

Marking this as Resolved as all the main subtasks have been completed. There are 4 subtasks left that are follow-ups to this work.

Jul 15 2024, 9:59 AM · Data-Engineering-Icebox, cloud-services-team, Analytics-Radar, Data-Services

Jul 11 2024

AKanji-WMF added a parent task for T253050: Bring Banner History data into Fundraising infrastructure: T369773: Epic: Support retrieval of page and banner view data for FR Analytics.
Jul 11 2024, 12:10 AM · Data-Engineering-Icebox, Analytics-Radar, fundraising-tech-ops, Fundraising-Backlog

Jul 9 2024

Maintenance_bot removed a project from T251812: System administrator reviews API usage by client: Patch-For-Review.
Jul 9 2024, 8:47 PM · Data-Engineering-Icebox, Analytics, Story, MediaWiki-REST-API
KCVelaga_WMF moved T325790: Special:ContentTranslationStats is slow and getting crowded from Essential workstream to Special:CXStats on the LPL Analytics board.
Jul 9 2024, 6:48 AM · LPL Analytics, LPL Essential, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), ContentTranslation, Language-analytics, Data-Engineering-Icebox, Analytics, Technical-Debt

Jul 5 2024

akosiaris added a comment to T251812: System administrator reviews API usage by client.

All fluentbit images have (once more) been deleted from the registry using https://wikitech.wikimedia.org/wiki/Docker-registry#Deleting_images

Jul 5 2024, 8:23 PM · Data-Engineering-Icebox, Analytics, Story, MediaWiki-REST-API
gerritbot added a comment to T251812: System administrator reviews API usage by client.

Change #1052314 merged by jenkins-bot:

[operations/deployment-charts@master] api-gateway: Remove eventgate logging support

https://gerrit.wikimedia.org/r/1052314

Jul 5 2024, 3:38 PM · Data-Engineering-Icebox, Analytics, Story, MediaWiki-REST-API
gerritbot added a comment to T251812: System administrator reviews API usage by client.

Change #1052314 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] api-gateway: Remove eventgate logging support

https://gerrit.wikimedia.org/r/1052314

Jul 5 2024, 2:39 PM · Data-Engineering-Icebox, Analytics, Story, MediaWiki-REST-API
gerritbot added a comment to T251812: System administrator reviews API usage by client.

Change #1051407 merged by Alexandros Kosiaris:

[operations/docker-images/production-images@master] Revert "Resurrect fluent-bit image"

https://gerrit.wikimedia.org/r/1051407

Jul 5 2024, 2:27 PM · Data-Engineering-Icebox, Analytics, Story, MediaWiki-REST-API
KCVelaga_WMF moved T325790: Special:ContentTranslationStats is slow and getting crowded from Incoming to Essential workstream on the LPL Analytics board.
Jul 5 2024, 1:01 PM · LPL Analytics, LPL Essential, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), ContentTranslation, Language-analytics, Data-Engineering-Icebox, Analytics, Technical-Debt
KCVelaga_WMF added a project to T325790: Special:ContentTranslationStats is slow and getting crowded: LPL Analytics.
Jul 5 2024, 9:35 AM · LPL Analytics, LPL Essential, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), ContentTranslation, Language-analytics, Data-Engineering-Icebox, Analytics, Technical-Debt

Jul 3 2024

Pginer-WMF edited projects for T325790: Special:ContentTranslationStats is slow and getting crowded, added: LPL Essential; removed Language-Team (Language-2024-April-June).
Jul 3 2024, 9:24 AM · LPL Analytics, LPL Essential, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), ContentTranslation, Language-analytics, Data-Engineering-Icebox, Analytics, Technical-Debt

Jul 2 2024

akosiaris closed T251812: System administrator reviews API usage by client as Resolved.

I am resolving the task given comments from 4 years ago. However, to repeat: the functionality added in the course of this task 4 years ago is going to be removed, since it is unused and causes a maintenance burden.

Jul 2 2024, 4:21 PM · Data-Engineering-Icebox, Analytics, Story, MediaWiki-REST-API
akosiaris added a comment to T251812: System administrator reviews API usage by client.

4 years later, we don't see any data flowing in the Kafka topic created back then. This feature apparently has never been used. But it is costing us maintenance effort, as the image is on buster and we want to remove those images from the registry. Hence, after some discussions in the #wikimedia-serviceops IRC channel, we have decided to disable the functionality in api-gateway and delete the fluentbit docker image from our repo, as this pipeline is its only user. If anyone ever reaches this task and comment and is interested in the functionality implemented during work on this task, it can always be resurrected, assuming it's properly resourced.

Jul 2 2024, 4:07 PM · Data-Engineering-Icebox, Analytics, Story, MediaWiki-REST-API
gerritbot added a project to T251812: System administrator reviews API usage by client: Patch-For-Review.
Jul 2 2024, 4:00 PM · Data-Engineering-Icebox, Analytics, Story, MediaWiki-REST-API
gerritbot added a comment to T251812: System administrator reviews API usage by client.

Change #1051407 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/docker-images/production-images@master] Revert "Resurrect fluent-bit image"

https://gerrit.wikimedia.org/r/1051407

Jul 2 2024, 4:00 PM · Data-Engineering-Icebox, Analytics, Story, MediaWiki-REST-API
Marostegui added a comment to T158166: Discuss labsdb visibility of rev_text_id and ar_comment.

Would you mind creating a task for that field? Just to have a clearer task, as this one is a bit messy and can be confusing, given that some of the fields no longer exist.
Thank you!

Jul 2 2024, 8:59 AM · Data-Engineering-Icebox, Data-Services, Analytics-Radar
Zache added a comment to T158166: Discuss labsdb visibility of rev_text_id and ar_comment.

I am still interested in archive comments, as they make it possible to analyse, for example, whether there was a notability discussion before a page was deleted and how many users participated in the discussion with an edit (i.e., starting a discussion on Finnish Wikipedia generates semi-standard comment lines that can be filtered using SQL).

Jul 2 2024, 8:10 AM · Data-Engineering-Icebox, Data-Services, Analytics-Radar
Marostegui closed T158166: Discuss labsdb visibility of rev_text_id and ar_comment as Declined.

All those fields are gone:

  • ar_comment: T233135: Schema change for refactored actor and comment storage
  • rev_text_id: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DeletePagesForGood/+/958530

I cannot find the task for archive_text_id, but it is not present in the archive table.

Jul 2 2024, 7:53 AM · Data-Engineering-Icebox, Data-Services, Analytics-Radar

Jul 1 2024

fnegri added a comment to T158166: Discuss labsdb visibility of rev_text_id and ar_comment.

I've marked this as "Low" priority as there was no activity on the task since 2019. If someone is still interested in having those fields in replicas, please leave a comment below.

Jul 1 2024, 5:07 PM · Data-Engineering-Icebox, Data-Services, Analytics-Radar
ReleaseTaggerBot edited projects for T325790: Special:ContentTranslationStats is slow and getting crowded, added: MW-1.43-notes (1.43.0-wmf.12; 2024-07-02); removed MW-1.43-notes (1.43.0-wmf.8; 2024-06-04).
Jul 1 2024, 12:00 PM · LPL Analytics, LPL Essential, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), ContentTranslation, Language-analytics, Data-Engineering-Icebox, Analytics, Technical-Debt
Maintenance_bot removed a project from T325790: Special:ContentTranslationStats is slow and getting crowded: Patch-For-Review.
Jul 1 2024, 11:33 AM · LPL Analytics, LPL Essential, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), ContentTranslation, Language-analytics, Data-Engineering-Icebox, Analytics, Technical-Debt
gerritbot added a comment to T325790: Special:ContentTranslationStats is slow and getting crowded.

Change #1041413 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] Combine two slow queries into one

https://gerrit.wikimedia.org/r/1041413

Jul 1 2024, 11:04 AM · LPL Analytics, LPL Essential, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), ContentTranslation, Language-analytics, Data-Engineering-Icebox, Analytics, Technical-Debt
fnegri triaged T158166: Discuss labsdb visibility of rev_text_id and ar_comment as Low priority.
Jul 1 2024, 9:44 AM · Data-Engineering-Icebox, Data-Services, Analytics-Radar

Jun 25 2024

elukey added a comment to T242712: Deprecation (if possible) of the #central channel on irc.wikimedia.org.

The only users that I can see are in #central:

Jun 25 2024, 8:40 AM · Data-Engineering-Icebox
Jdforrester-WMF added a comment to T147137: Decide on JSON validation library.

In practice we still have two PHP JSON Schema implementations live in production (plus at least one other in JavaScript); I think we should probably keep this open until that is resolved.

Jun 25 2024, 8:19 AM · Data-Engineering-Icebox, Analytics-Radar, Multimedia