Wikipedia:Wikipedia Signpost/2020-03-01/By the numbers

By the numbers

How many actions by administrators does it take to clean up spam?

Administrators clean up the messes left by other editors. The time and effort admins spend are key to building and maintaining the quality of the encyclopedia. Among the actions they take are blocking other editors, deleting articles, and protecting articles from vandalism. MER-C has collected the data for these and other admin actions taken in 2019, mostly on the English-language Wikipedia (enWiki). See GitHub for his methodology and this page for the raw data.

While the descriptive statistics themselves may be of interest, especially to administrators, our main purpose in examining them is to explore the burden that spam places on admins. Spam is not identical to paid editing – for example, an unpaid fan of an entertainer might wish to post the website of the entertainer's fanclub on dozens of pages. We believe, however, that most spam is inserted by editors, including paid editors, with a more serious conflict of interest.

As a rough indication of the importance of spam to admins, we summed the number of blocks, deletions, and protections related to various wiki-offenses on enWiki. Using an open proxy had the highest total number of actions for 2019 (387,984), spam had the second highest (81,699), followed by vandalism (68,039) and sockpuppeting/long-term abuse (46,029). Not all admin actions require the same amount of time or dedication: discovering and blocking open proxies may be fairly simple or automatic, and it is difficult to compare the time required for the three other major wiki-offenses. But as a first approximation, this simple measure tells us that spam is one of the most frequent problems for admins.
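
The totals quoted above can be reproduced from the tables later in this article. The sketch below assumes that each total is the sum of the all-blocks count, the all-namespaces deletion count and the all-namespaces protection count for that offense (the figures match under that reading). The dictionary layout and the function name are illustrative only, and the zero entries for open proxies are an assumption, since that reason appears only in the blocks table.

```python
# Counts copied from the 2019 tables in this article, per offense:
# (all blocks, deletions in all namespaces, protections in all namespaces).
ACTIONS_2019 = {
    # Assumption: open proxies have no deletion or protection rows, so use 0.
    "open proxy": (387_984, 0, 0),
    "spam": (38_112, 43_342, 245),
    "vandalism": (53_451, 6_116, 8_472),
    "sockpuppetry/long-term abuse": (30_541, 12_523, 2_965),
}


def total_actions(counts):
    """Sum blocks, deletions and protections for each offense."""
    return {offense: sum(parts) for offense, parts in counts.items()}


totals = total_actions(ACTIONS_2019)

# Print the offenses from most to fewest admin actions.
for offense, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{offense}: {total:,}")
```

Run as written, this gives 81,699 for spam, 68,039 for vandalism and 46,029 for sockpuppetry/long-term abuse, matching the figures in the text.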

Blocks

Other than a global lock, which prevents an editor from editing on all WMF sites, a block on English Wikipedia is the most serious action that an editor faces. The table below records all blocks on enWiki for 2019. The use of an open proxy or web host accounts for almost 70% of the more than half a million blocks. The large number of open proxy blocks may reflect the effectiveness of a bot, ProcseeBot, in uncovering proxy users.

Vandalism, spamming, and sockpuppeting and long-term abuse are responsible for the large majority (72.4%) of the remaining 168,649 blocks after the open proxy blocks are subtracted. Across all blocks, spamming is second only to vandalism among these reasons and is ahead of sockpuppeting. Dividing the data into registered accounts and anonymous (IP) editors, we see that blocks for spamming are predominantly of registered accounts, while blocks for vandalism are more evenly divided. Thus, for registered accounts, spamming is the most frequent reason for blocking. Many vandals may feel that registering an account is too time-consuming for editing that will almost surely get them blocked, whereas spammers may feel that their editing is more difficult for admins to discover if they have a registered account. Alternatively, vandals mostly target existing articles, whilst spammers often want to create an article on a non-notable business, and for that they need an account.

All enWiki blocks for 2019
| Reason | All blocks | IP blocks | Account blocks |
|---|---|---|---|
| Total | 556,633 | 448,515 | 108,118 |
| Open proxy/web host | 387,984 | 387,979 | 5 |
| Vandalism | 53,451 | 30,700 | 22,751 |
| Spamming | 38,112 | 970 | 37,142 |
| Sockpuppetry and long term abuse | 30,541 | 8,003 | 22,538 |
| Disruptive editing | 7,928 | 4,839 | 3,089 |
| Anonymous blocks | 6,029 | 5,814 | 215 |
| Not here to build the encyclopedia | 5,902 | 81 | 5,821 |
| Other inappropriate username | 5,782 | - | 5,782 |
| Unclassified | 5,385 | 2,384 | 3,001 |
| Triggering the edit filter | 4,630 | 1,894 | 2,736 |
| Range blocks | 3,313 | 3,260 | 53 |
| Promotional username soft blocks | 2,360 | - | 2,360 |
| BLP violations | 1,255 | 697 | 558 |
| Harassment | 1,047 | 560 | 487 |
| Edit warring | 872 | 285 | 587 |
| Unauthorized, malfunctioning bot or bot username | 339 | - | 339 |
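
The percentages discussed in this section follow directly from the blocks table. The sketch below recomputes them as a check. The variable names are illustrative, and the account-share figures are derived here from the table's columns; they are not quoted in the article itself.

```python
# Figures copied from the "All enWiki blocks for 2019" table.
TOTAL_BLOCKS = 556_633
OPEN_PROXY_BLOCKS = 387_984

# reason -> (all blocks, account blocks)
MAJOR_REASONS = {
    "vandalism": (53_451, 22_751),
    "spamming": (38_112, 37_142),
    "sockpuppetry/long-term abuse": (30_541, 22_538),
}

# Share of all blocks that are open proxy/web host blocks.
proxy_share = OPEN_PROXY_BLOCKS / TOTAL_BLOCKS

# Blocks left once the open proxy blocks are subtracted.
remaining = TOTAL_BLOCKS - OPEN_PROXY_BLOCKS

# Share of the remaining blocks explained by the three major reasons.
major_share = sum(all_blocks for all_blocks, _ in MAJOR_REASONS.values()) / remaining

# Fraction of each reason's blocks that hit registered accounts.
account_share = {
    reason: accounts / all_blocks
    for reason, (all_blocks, accounts) in MAJOR_REASONS.items()
}

print(f"open proxy share of all blocks: {proxy_share:.1%}")
print(f"remaining blocks: {remaining:,}")
print(f"three major reasons, share of remainder: {major_share:.1%}")
for reason, share in account_share.items():
    print(f"{reason}: {share:.1%} of blocks are account blocks")
```

The account shares illustrate the contrast drawn in the text: nearly all spam blocks are of registered accounts, while fewer than half of vandalism blocks are.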

Looking at global locks rather than just enWiki blocks shows an even stronger link to spamming. Just over 200 locks per day, or 73,474 for the year, are performed because of spamming, accounting for 72.7% of all global locks. Many of these locks are likely due to the use of spambots, which apparently find it easy to avoid Wikipedia's CAPTCHA screening at registration. These locks are normally performed by stewards.

All Global locks (and unlocks) for 2019
| Reason | Count | Percent |
|---|---|---|
| Total | 101,108 | 100% |
| Spamming | 73,474 | 72.7% |
| Long term abuse | 22,795 | 22.5% |
| Cross wiki abuse | 2,720 | 2.7% |
| Unclassified | 1,063 | 1.1% |
| Vandalism | 820 | 0.8% |
| Inappropriate username | 183 | 0.2% |
| Compromised | 53 | 0.1% |
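
As a quick consistency check, the Percent column and the "just over 200 locks per day" figure can be recomputed from the counts. This is a minimal sketch; the names are illustrative, and dividing by 365 days is an assumption about how the daily rate was derived.

```python
# Counts copied from the "All Global locks (and unlocks) for 2019" table.
GLOBAL_LOCKS = {
    "Spamming": 73_474,
    "Long term abuse": 22_795,
    "Cross wiki abuse": 2_720,
    "Unclassified": 1_063,
    "Vandalism": 820,
    "Inappropriate username": 183,
    "Compromised": 53,
}
DAYS_IN_2019 = 365  # assumption: daily rates use a 365-day year

total_locks = sum(GLOBAL_LOCKS.values())

# Percentage of all global locks attributed to each reason.
percent = {reason: 100 * count / total_locks for reason, count in GLOBAL_LOCKS.items()}

# Average number of spam-related locks per day.
spam_locks_per_day = GLOBAL_LOCKS["Spamming"] / DAYS_IN_2019

print(f"total locks: {total_locks:,}")
for reason, share in percent.items():
    print(f"{reason}: {share:.1f}%")
print(f"spam locks per day: {spam_locks_per_day:.1f}")
```

The rows sum to the stated total of 101,108, and the spam share rounds to the 72.7% given in the table.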

Deletions

Deletions are the most serious action that can be taken for articles, user pages including drafts, and files. Spamming itself is named as the cause in only 4.7% of deletions of articles on enWiki. However, other named causes may also be related to spamming or paid editing. For example, articles for deletion (AfD) discussions and expired proposed deletions (PRODs) together account for 27.8% of article deletions, and a major proportion of these may be due to spam.

Examining deletions in all namespaces presents a clearer picture. Spam is the fourth most frequent reason for deletion in all namespaces: 43,342 items for the year, or over 118 per day. Many of these deletions are likely draft articles, e.g. those being reviewed at WP:Articles for creation (AfC) or being prepared in user space. Abandoned drafts, which are also likely to be related to spam or paid editing, were responsible for 67,253 deletions for the year.

The overall picture appears to be one of multi-level screening for spam on enWiki. At the first level, large numbers of drafts are submitted and later abandoned as the authors discover that we consider the draft to be spam. This accounts for up to 67,253 deletions. At subsequent levels, drafts are deleted outright as spam at AfC or the draft stage, amounting to 43,342 deletions. Many of the 38,287 miscellany for deletion (MfD) discussions may also be related to spam or paid editing, as may some of the 25,297 expired PRODs. These four categories (which may include some double counting) add up to 174,179 possible deletions (477 per day) at the draft stage. After an article is accepted, it may later be deleted as spam (4,825 per year), as an expired PROD (9,271), or at an AfD debate (19,225). The admins' battle to clean up spam by deletion is spread over many stages and is clearly time-consuming.
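
The draft-stage total is simple addition over four all-namespace deletion categories, followed by a division to get a daily rate. The sketch below reproduces that arithmetic; the names are illustrative, and the 365-day divisor is an assumption about how the per-day figure was obtained.

```python
# All-namespace deletion counts for the four categories named in the text.
DRAFT_STAGE_DELETIONS = {
    "Abandoned draft": 67_253,
    "Spam": 43_342,
    "Deletion debate (MFD)": 38_287,
    "Expired PROD": 25_297,
}
DAYS_IN_2019 = 365  # assumption: daily rates use a 365-day year

# Upper bound on spam-related draft-stage deletions; as the text notes,
# the categories may overlap, so this can include some double counting.
possible_deletions = sum(DRAFT_STAGE_DELETIONS.values())
per_day = possible_deletions / DAYS_IN_2019

print(f"possible draft-stage deletions: {possible_deletions:,}")
print(f"per day: {per_day:.0f}")
```

Because an unknown share of the MfD and PROD deletions is unrelated to spam, the result is best read as a ceiling and not as an estimate.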

2019 deletions
| Reason | Articles | All namespaces |
|---|---|---|
| Total | 102,344 | 623,202 |
| Dependent on deleted page | 21,134 | 181,164 |
| Deletion debate (AFD) | 19,225 | 20,772 |
| Expired PROD | 9,271 | 25,297 |
| Maintenance | 8,789 | 46,909 |
| Deletion debate (RFD) | 5,792 | 7,316 |
| Created by block/ban evading sockpuppet | 5,727 | 12,523 |
| Fails to give reason for inclusion | 5,410 | 5,473 |
| Cross-namespace redirect | 5,125 | 5,180 |
| Spam | 4,825 | 43,342 |
| Author/user request | 3,384 | 21,409 |
| Unclassified | 2,686 | 18,128 |
| Copyright violations | 2,386 | 10,761 |
| Implausible redirect | 1,357 | 1,720 |
| Unclassified nukes | 1,161 | 4,941 |
| No content or context | 1,053 | 1,083 |
| Repost of deleted content | 1,000 | 1,397 |
| Unnecessary disambiguation | 692 | 705 |
| Vandalism | 663 | 6,116 |
| Copyright problems | 603 | 656 |
| Redundant | 455 | 2,943 |
| Test page | 366 | 3,563 |
| Deletion debate (MFD) | 335 | 38,287 |
| No reason given | 216 | 1,416 |
| Expired BLP PROD | 177 | 179 |
| Made up one day | 176 | 180 |
| Attack page | 125 | 1,627 |
| Patent nonsense | 118 | 1,062 |
| Foreign language | 54 | 55 |
| Abandoned draft | 22 | 67,253 |
| Misuse of Wikipedia as a webhost | 6 | 28,813 |
| Deletion debate (TFD) | 5 | 24,634 |
| File redirect to Commons | 5 | 178 |
| User page where user does not exist | 1 | 837 |
| Problems with non-free files | - | 23,248 |
| File moved to Commons | - | 13,796 |
| Deletion debate (CFD) | - | 9,947 |
| Empty category | - | 9,682 |
| Lack of copyright information (files) | - | 4,805 |
| Category renaming or merger | - | 3,196 |
| Deletion debate (FFD) | - | 2,201 |
| Corrupt file | - | 323 |

Protections

At first glance, article protection is not used extensively by admins to stop spamming. Spamming is the tenth most common reason for protection, both for articles and in all namespaces. Some protections caused by spamming might be included under other headings, for example unclassified disruptive editing, addition of unsourced material, unclassified salting, or unclassified.

Protections for 2019
| Reason | Articles | All namespaces |
|---|---|---|
| Total | 22,543 | 27,275 |
| Vandalism | 7,634 | 8,472 |
| Unclassified disruptive editing | 5,170 | 5,416 |
| Sock puppetry | 2,395 | 2,965 |
| Addition of unsourced material | 1,916 | 1,955 |
| BLP violations | 1,791 | 1,816 |
| Unclassified salting | 1,265 | 1,890 |
| Unclassified | 726 | 1,439 |
| Edit warring/content dispute | 837 | 897 |
| High risk page | 351 | 1,805 |
| Spamming | 197 | 245 |
| Arbitration enforcement | 197 | 202 |
| Copyright violations | 64 | 75 |
| User request | - | 98 |
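
The "tenth most common" ranking can be checked mechanically from the table. The sketch below ranks a reason by counting how many other reasons have a strictly larger count in the chosen column, so tied reasons share a rank. The function name and data layout are illustrative, and treating the "-" entry as zero is an assumption.

```python
# reason -> (article protections, protections in all namespaces),
# copied from the "Protections for 2019" table.
PROTECTIONS = {
    "Vandalism": (7_634, 8_472),
    "Unclassified disruptive editing": (5_170, 5_416),
    "Sock puppetry": (2_395, 2_965),
    "Addition of unsourced material": (1_916, 1_955),
    "BLP violations": (1_791, 1_816),
    "Unclassified salting": (1_265, 1_890),
    "Unclassified": (726, 1_439),
    "Edit warring/content dispute": (837, 897),
    "High risk page": (351, 1_805),
    "Spamming": (197, 245),
    "Arbitration enforcement": (197, 202),
    "Copyright violations": (64, 75),
    "User request": (0, 98),  # assumption: "-" in the table is treated as 0
}
ARTICLES, ALL_NAMESPACES = 0, 1  # column indexes into the tuples above


def rank_of(reason, column):
    """Return the 1-based rank of a reason; ties share the better rank."""
    value = PROTECTIONS[reason][column]
    larger = sum(1 for counts in PROTECTIONS.values() if counts[column] > value)
    return larger + 1


print("articles rank:", rank_of("Spamming", ARTICLES))
print("all namespaces rank:", rank_of("Spamming", ALL_NAMESPACES))
```

For articles, spamming and arbitration enforcement both have 197 protections, so under this tie rule they share tenth place.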

Conclusions

Spamming is the most common reason for actions by administrators other than for the use of an open proxy. It is the most common reason for blocking registered accounts and is cited for 72.7% of all global locks.

Spam is only the ninth most common reason for article deletions, but the fourth most common reason for deletion in all namespaces. It appears that the effort to delete spam crosses many of the classifications used for deletions, e.g. abandoned drafts, MfD, AfD, and expired PRODs.

The least important aspect of the use of admin tools against spam appears to be article protection.