Looking at the code, it stores the stats values every couple of hours but it won't select any of them, only the most recent dates and let the rest rot. It's already holding 2.4M rows in Polish wikisource and I can't make schema changes live on the table due to this. This needs clean up.
Description
Details
Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
mediawiki: Reduce the frequency of flaggedrevs updates | operations/puppet | production | +1 -1 |
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Open | None | T31744 FlaggedRev installation (deployment) requests (tracking) | |||
Stalled | None | T143886 Activating Flagged revisions on ar.wikinews | |||
Stalled | None | T204354 Flagged Revisions for Vietnamese Wikipedia | |||
Stalled | None | T205145 Deploy FlaggedRevs on bn.wikibooks | |||
Stalled | None | T221933 Enable Flagged Revisions (for trial run purpose) at the Chinese Wikipedia | |||
Open | None | T185664 Code stewardship review: FlaggedRevs | |||
Resolved | Ladsgroup | T277883 Drop all low-use and unused features of FlaggedRevs to make it more maintainable | |||
Open | Ladsgroup | T323495 flaggedrevs_statistics is holding millions of rows while FR needs only a handful |
Event Timeline
I double checked, it doesn't need any non-most-recent row. Just to be safe, I'll delete anything older than 2022. Now deletion like that will cause read-only given the sheer size of crap in this lovely extension. I have to run it in batches.
There is some older tickets related to this: T163107, T44360 , T177951.
Couple of notes. The flaggedrevs_statistics table is currently the only place to see how Flaggedrews works over a longer period of time so even deleting or tidying the duplicate values should be done so that there is lest one row per day (even if the value is clearly broken = identical than last value).
An even better solution would move displaying stats to wikistats so that there would be formal crosswiki stats system and deprecate this but until then the least some broken logging is better than none.
@Ladsgroup I have used the data for generating the crosswiki reports on how flaggedrevs is working so even it is kind of broken it works for that. Also some data works on somewikis and not on others. No idea why.
We have dumps of this table: https://dumps.wikimedia.org/fiwiki/20221120/fiwiki-20221120-flaggedrevs_statistics.sql.gz you can simply download this one and look at old data. Anything older than a year can be accessed via this.
This would cover your usecase. Again, we need to simply and prune this table, this means sometimes barely used features have to go as well. Specially if they are not part of the standard extension.
And it would make the statistics gathering multiple times harder at same time. Also, even if there is millions rows the amount of the data in the disk is trivial.
The problem is not the disk size, I have explained in details why we need to be careful on what we keep in core databases. It's not the kind of problem you can throw hardware at: T294881#7743146
It's also adds a lot of maintenance burden, for example I can't run schema changes on it live, which turns a maint work that could take an hour to something that would take months requiring read-only times for each wiki.
Third, in any other extension with lower tech debt and steward, that'd be to some degree acceptable but FlaggedRevs has too many issues to count and simplifying its data storage and data model would help keeping it alive a lot.
Yeah, and i have explained here why the data for getting trends is important T44360#3235893.
In any case, you could tidy the database size down just to average the data to daily without losing the history
I believe something like this would work (though needs to be tested first and I don't have clue how it should be done in production).
CREATE TEMPORARY TABLE flaggedrevs_statistics_temp ( SELECT FLOOR(AVG(frs_timestamp)) as frs_timestamp, frs_stat_key, FLOOR(AVG(frs_stat_val)) as frs_stat_val FROM flaggedrevs_statistics GROUP BY frs_stat_key, FLOOR(frs_timestamp/1000000) ); TRUNCATE flaggedrevs_statistics; INSERT INTO flaggedrevs_statistics SELECT * FROM flaggedrevs_statistics_temp;
Also, I believe that updating (adding new rows) the stats is still running through cronjob or something and it could be running less than 12 per day if one would want to reduce the creation of new data without fixing the actual update script.
Yup, it has two sides:
- Clean up old data
- Store correctly onwards.
I can make it store only once per day and clean it up somehow. I can't run that query in production, it'll choke replication (large write queries do), but finding ways to do it in chunks shouldn't be hard.
Change 859589 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/puppet@production] mediawiki: Reduce the frequency of flaggedrevs updates
Having up-to-date data is important, especially for motivation. If I worked hard to get a counter in the namespaces section to zero, and it’s still non-zero after hours, it’s disappointing. A better solution would be still collecting the data frequently, but dropping some older rows (e.g. the previous one unless it’s the one kept as daily statistics).
It’s even more important to have the “list of 5 most active reviewers in the last hour” table up-to-date (it makes no sense to have 23-hours-old data about the most active reviewers in the “last” hour), but that bit seems to actually be live, despite the note about stale data on the top.
I'm starting to clean up the old data (and keeping one per day) but I also realized it has been storing a lot of rows per timestamp as well. For example for plwikisource, it keeps 38 rows per timestamp:
wikiadmin@10.64.32.196(plwikisource)> select * from flaggedrevs_statistics order by frs_timestamp limit 40; +----------------+--------------------------------+--------------+ | frs_timestamp | frs_stat_key | frs_stat_val | +----------------+--------------------------------+--------------+ | 20111013052535 | pendingLag-average | 0 | | 20111013052535 | reviewLag-average | 3854 | | 20111013052535 | reviewLag-median | 1150 | | 20111013052535 | reviewLag-percentile:35 | 395 | | 20111013052535 | reviewLag-percentile:45 | 680 | | 20111013052535 | reviewLag-percentile:55 | 1338 | | 20111013052535 | reviewLag-percentile:65 | 1570 | | 20111013052535 | reviewLag-percentile:75 | 2999 | | 20111013052535 | reviewLag-percentile:85 | 5140 | | 20111013052535 | reviewLag-percentile:90 | 10883 | | 20111013052535 | reviewLag-percentile:95 | 22113 | | 20111013052535 | reviewLag-sampleEndTimestamp | 1318483533 | | 20111013052535 | reviewLag-sampleSize | 38 | | 20111013052535 | reviewLag-sampleStartTimestamp | 1310707533 | | 20111013052535 | reviewedPages-NS:0 | 16739 | | 20111013052535 | reviewedPages-NS:10 | 1467 | | 20111013052535 | reviewedPages-NS:100 | 43019 | | 20111013052535 | reviewedPages-NS:102 | 385 | | 20111013052535 | reviewedPages-NS:104 | 324 | | 20111013052535 | reviewedPages-NS:12 | 20 | | 20111013052535 | reviewedPages-NS:14 | 1798 | | 20111013052535 | reviewedPages-NS:6 | 50 | | 20111013052535 | syncedPages-NS:0 | 16739 | | 20111013052535 | syncedPages-NS:10 | 1467 | | 20111013052535 | syncedPages-NS:100 | 43019 | | 20111013052535 | syncedPages-NS:102 | 385 | | 20111013052535 | syncedPages-NS:104 | 324 | | 20111013052535 | syncedPages-NS:12 | 20 | | 20111013052535 | syncedPages-NS:14 | 1798 | | 20111013052535 | syncedPages-NS:6 | 50 | | 20111013052535 | totalPages-NS:0 | 21759 | | 20111013052535 | totalPages-NS:10 | 1467 | | 20111013052535 | totalPages-NS:100 | 43020 | | 20111013052535 | totalPages-NS:102 | 385 | | 20111013052535 | totalPages-NS:104 | 456 | | 20111013052535 | totalPages-NS:12 | 20 | | 20111013052535 | totalPages-NS:14 | 1798 | | 20111013052535 | totalPages-NS:6 | 50 | | 20111019200751 | pendingLag-average | 0 | | 20111019200751 | reviewLag-average | 3677 | +----------------+--------------------------------+--------------+ 40 rows in set (0.050 sec)
Do we need all of these? Which one is safe to get rid of? Each type removed translates to millions of rows being cleaned up. Maybe syncedPages-NS*?
! In T323495#8419025, @Ladsgroup wrote:
Do we need all of these? Which one is safe to get rid of? Each type removed translates to millions of rows being cleaned up. Maybe syncedPages-NS*?
Important ones are
- pendingLag-average
- reviewedPages-*
- syncedPages-*
- totalPages-*
reviewLag-* values are known to be broken in some wikis and script is duplicating same data over and over. reviewLag can be detected to be broken by checking how many unique values there is. If the number is well below 500 then it is broken, but if i remember correctly in some wikis it has worked for some time and then broke.
I can check in which wikis it is working, but my guess is that in most of the wikis it is broken and reviewLags could be deleted.
MariaDB [fiwiki_p]> select count(distinct(frs_stat_val)) as number_of_distinct_values, sum(1) as number_of_rows, frs_stat_key from flaggedrevs_statistics group by frs_stat_key; +---------------------------+----------------+-------------------------------------+ | number_of_distinct_values | number_of_rows | frs_stat_key | +---------------------------+----------------+-------------------------------------+ | 42253 | 45213 | pendingLag-average | | 9 | 43228 | reviewLag-anon-average | | 9 | 43228 | reviewLag-anon-median | | 9 | 43228 | reviewLag-anon-percentile:35 | | 9 | 43228 | reviewLag-anon-percentile:45 | | 8 | 43228 | reviewLag-anon-percentile:55 | | 8 | 43228 | reviewLag-anon-percentile:65 | | 8 | 43228 | reviewLag-anon-percentile:75 | | 8 | 43228 | reviewLag-anon-percentile:85 | | 9 | 43228 | reviewLag-anon-percentile:90 | | 9 | 43228 | reviewLag-anon-percentile:95 | | 6 | 43228 | reviewLag-anon-sampleEndTimestamp | | 9 | 43228 | reviewLag-anon-sampleSize | | 6 | 43228 | reviewLag-anon-sampleStartTimestamp | | 157 | 1985 | reviewLag-average | | 85 | 1985 | reviewLag-median | | 77 | 1970 | reviewLag-percentile:35 | | 90 | 1970 | reviewLag-percentile:45 | | 88 | 1970 | reviewLag-percentile:55 | | 70 | 1970 | reviewLag-percentile:65 | | 67 | 1970 | reviewLag-percentile:75 | | 59 | 1970 | reviewLag-percentile:85 | | 58 | 1970 | reviewLag-percentile:90 | | 46 | 1970 | reviewLag-percentile:95 | | 192 | 1985 | reviewLag-sampleEndTimestamp | | 90 | 1985 | reviewLag-sampleSize | | 180 | 1985 | reviewLag-sampleStartTimestamp | | 10 | 43228 | reviewLag-user-average | | 1 | 43228 | reviewLag-user-median | | 1 | 43228 | reviewLag-user-percentile:35 | | 1 | 43228 | reviewLag-user-percentile:45 | | 3 | 43228 | reviewLag-user-percentile:55 | | 4 | 43228 | reviewLag-user-percentile:65 | | 4 | 43228 | reviewLag-user-percentile:75 | | 6 | 43228 | reviewLag-user-percentile:85 | | 8 | 43228 | reviewLag-user-percentile:90 | | 10 | 43228 | reviewLag-user-percentile:95 | | 6 | 43228 | reviewLag-user-sampleEndTimestamp | | 10 | 43228 | reviewLag-user-sampleSize | | 6 | 43228 | reviewLag-user-sampleStartTimestamp | | 40146 | 45213 | reviewedPages-NS:0 | | 5979 | 45213 | reviewedPages-NS:10 | | 16456 | 45213 | reviewedPages-NS:6 | | 245 | 41370 | reviewedPages-NS:828 | | 38899 | 45213 | syncedPages-NS:0 | | 6813 | 45213 | syncedPages-NS:10 | | 16506 | 45213 | syncedPages-NS:6 | | 254 | 41370 | syncedPages-NS:828 | | 39060 | 45213 | totalPages-NS:0 | | 6076 | 45213 | totalPages-NS:10 | | 17425 | 45213 | totalPages-NS:6 | | 248 | 41370 | totalPages-NS:828 | +---------------------------+----------------+-------------------------------------+
Here is some analysis.
- Columns reviewLag-anon-average, reviewLag-user-average, have meaningful data in some wikis (such as trwiki and ukwiki)
- data in other fields is pretty much repeating values
Personally i would drop the reviewLag-* values as in most of the wikis they are broken and it is likely some problems in updating the values in cases where it works least somehow.
Examples
- https://fiwiki-tools.toolforge.org/fr_stats_example.html?db=fiwiki_p,arwiki_p,ukwiki_p,trwiki_p&frs_key=reviewLag-anon-average
- https://fiwiki-tools.toolforge.org/fr_stats_example.html?db=fiwiki_p,arwiki_p,ukwiki_p,trwiki_p&frs_key=reviewLag-user-average
Number of distinct values in selected reviewLag columns in different wikis.
wiki reviewLag-anon-average reviewLag-anon-median reviewLag-anon-percentile:65 reviewLag-anon-sampleSize reviewLag-anon-sampleStartTimestamp reviewLag-average reviewLag-median reviewLag-percentile:65 reviewLag-percentile:65 reviewLag-user-average reviewLag-user-median reviewLag-user-percentile:65 reviewLag-user-sampleEndTimestamp alswiki_p 3 2 2 3 2 1 1 1 1 2 1 1 1 arwiki_p 17726 36 35 45 23 1 1 1 1 17578 1 6 23 bewiki_p 2627 935 924 456 2544 315 226 224 224 7952 7 10 2543 bnwiki_p 1 1 1 1 1 1 1 1 1 bswiki_p 5426 891 1067 678 7337 5 5 5 5 9799 1 16 7337 cawikinews_p 1 1 1 1 1 1 7 1 1 3 cewiki_p 8 5 6 8 4 82 1 1 183 ckbwiki_p 1 1 1 1 1 1 1 dewiki_p 10 6 6 9 1 2 2 2 2 6 1 2 1 dewiktionary_p 2 2 2 2 2 1 1 1 1 1 1 1 1 elwikinews_p 3 3 2 2 3 8 4 3 3 7 4 4 5 enwiki_p 1 1 1 1 1 1 1 1 1 enwikibooks_p 5 4 4 5 4 2 2 2 2 7 7 7 4 enwikinews_p 1508 418 414 167 24637 219 49 45 45 7751 209 584 26327 eowiki_p 2 2 2 2 2 1 1 1 1 43199 1 1 1 fawiki_p 1 1 1 1 1 1 1 fawikinews_p 4 4 3 4 7 2 2 1 1 75 1 1 10 fiwiki_p 9 9 8 9 6 157 85 70 70 10 1 4 6 frwikinews_p 4 3 2 3 2 1 1 1 1 3 1 1 1 hewikisource_p 1 1 1 1 1 1 3 2 2 3 hiwiki_p 1 1 1 1 1 1 1 1 1 huwiki_p 4 2 2 4 1 1 1 1 1 6 1 1 1 iawiki_p 4 3 4 4 6 1 1 1 1 6 1 1 5 idwiki_p 8 8 7 8 7 1 1 1 1 12 2 3 7 kawiki_p 645 326 312 330 931 14 12 12 12 5558 1 1 930 plwiki_p 9 4 4 10 1 2 1 2 2 387 1 1 1 plwikisource_p 982 59 61 44 356 82 26 25 25 784 1 1 355 plwiktionary_p 10280 1346 1404 949 2079 198 125 121 121 6910 1 1 2078 ptwikibooks_p 8 7 8 7 5 2 2 1 1 6 1 4 5 ptwikinews_p 8 6 6 7 4 1 1 1 1 10 1 3 3 ptwikisource_p 60 32 33 21 284 15 9 8 8 181 4 6 283 ruwiki_p 2 2 2 2 1 1 1 1 1 43166 6 6 1 ruwikinews_p 268 66 47 54 4050 100 26 26 26 11277 3 17 4049 ruwikisource_p 6 5 4 5 5 2 1 1 1 11 1 3 5 ruwiktionary_p 583 22 24 22 24 4 3 3 3 287 1 1 22 sqwiki_p 3 2 2 3 1 2 2 1 1 3 2 3 2 test2wiki_p 2 2 1 2 2 3 3 3 3 trwiki_p 37692 3894 3971 1975 4455 442 374 378 378 32386 1 2 4456 trwikiquote_p 2550 313 307 303 3855 22 17 18 18 9954 6 98 3855 ukwiki_p 3479 460 455 398 456 52 45 45 45 9307 1 89 456 vecwiki_p 349 93 90 76 3022 158 41 46 46 1145 1 4 3021
Code for generating table
Thanks, this is valid point.
It’s even more important to have the “list of 5 most active reviewers in the last hour” table up-to-date (it makes no sense to have 23-hours-old data about the most active reviewers in the “last” hour), but that bit seems to actually be live, despite the note about stale data on the top.
Afaik this doesn't change that list as if i understand correctly the code then the list of most active reviewers are generated on fly (?)
What's the difference between syncedPages-* and reviewedPages-*? They seem to have almost identical values in wikis I checked, even the percentage combines these two and I wonder if one can be deleted in favor of the other without much signal being lost.
What's the difference between syncedPages-* and reviewedPages-*?
totalPages-* = total pages in the namespace ( "pages" column Special:ValidationStatistics)
reviewedPages-* = least once reviewed pages in the namespace ( "Reviewed" column Special:ValidationStatistics)
syncedPages-* = number of pages which latest version is reviewed in the namespace ("Reviewed" column Special:ValidationStatistics)
"Synced/Reviewed" column is syncedPages / reviewedPages.
Outdated = reviewed - synced ( ie. pending changes )
Unreviewed = totalPages - reviewed
They seem to have almost identical values in wikis I checked, even the percentage combines these two and I wonder if one can be deleted in favor of the other without much signal being lost.
They cannot be deleted as they are used for calculating values. Also it depends on wiki (and how they are using flaggedrevs), and time when value is taken if there is significant difference in numbers. In example these are from ruwiki.
year total_pages_ns0_avg reviewed_pages_ns0_avg synced_pages_ns0_avg pending_changes_avg 2011 795339 716591 624527 92064 2012 846635 769079 658162 110917 2013 1020816 943144 778797 164347 2014 1131552 1058719 877582 181137 2015 1229720 1159341 968508 190833 2016 1322675 1267686 1035040 232646 2017 1404197 1342062 1089368 252694 2018 1481811 1419904 1140898 279006 2019 1551381 1490383 1188168 302215 2020 1645599 1579578 1255634 323944 2021 1735227 1667736 1351624 316112 2022 1827644 1756645 1418276 338369
Okay, I was a bit confused by the terminology in FR. If a synced page means it's fully reviewed and reviewed page means it has at least one reviewed revision, then I personally think "reviewedpage" doesn't bring much value. When dealing with FR in my home wiki, a page is fully reviewed or needs to be reviewed (whether it has one reviewed revision or not), the third piece in between doesn't bring much info to me. Is that correct?
Yes, I know, I just wanted to mention it for the sake of completeness.
The difference between unsynced but reviewed and unreviewed pages is that unreviewed pages are not “protected” by FlaggedRevs: their unsynced version is shown to readers instead of a synced one (because there’s no such version), making vandalism there more visible. Therefore it should be a priority to review them (although the practice is probably just the opposite, as reviewing a small diff takes less effort than reviewing a whole new article).
Change 859589 merged by Ladsgroup:
[operations/puppet@production] mediawiki: Reduce the frequency of flaggedrevs updates