As part of the Wikistats 2 project we have developed the Edit Data Lake (see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits). The Edit Data Lake is a denormalized data store and the best dataset we have had to date for answering questions about content and contributors. At this time, this data is available only to the WMF, in the private Hadoop cluster.
This is the parent task for all the work to make the Data Lake data available on our public cloud infrastructure for our community at large; the more accessible that data is, the more impact it can have.
Description
Event Timeline
De-prioritizing until cloud infrastructure can support monitoring similar to what we can do in production.
Sorry to poke a many-years-old ticket, but what still needs to happen here? All of the subtasks have been resolved already.
@taavi many tickets were declined for complexity reasons, but we have new ways of potentially doing this. It needs to be prioritized though, so if you have desires/needs, please escalate them through https://www.mediawiki.org/wiki/Data_Platform_Engineering/Intake_Process
That seems unnecessarily complicated when I just want to get old stale tasks off of the Data-Services board. I'll just close this instead; if someone is interested in getting this through the process, they're free to re-open it.
I'd like to keep this open, but I will remove the Data-Services tag. It is something we would really like to do.