Wikipedia:Wikipedia Signpost/2020-03-01/By the numbers
How many actions by administrators does it take to clean up spam?
Administrators clean up the messes left by other editors. The time and effort spent by admins is key to building and maintaining the quality of the encyclopedia. Among the actions they take are blocking other editors, deleting articles, and protecting articles from vandalism. MER-C has collected the data for these and other admin actions taken in 2019, mostly on the English-language Wikipedia (enWiki). See GitHub for his methodology and this page for the raw data.
While the descriptive statistics themselves may be of interest, especially to administrators, our main purpose in examining them is to explore the burden that spam places on admins. Spam is not identical to paid editing – for example, an unpaid fan of an entertainer might wish to post the website of the entertainer's fanclub on dozens of pages. We believe, however, that most spam is inserted by editors, including paid editors, with a more serious conflict of interest.
As a rough indication of the importance of spam to admins, we summed the number of blocks, deletions, and protections related to various wiki-offenses on enWiki. Using an open proxy had the highest total for 2019 (387,984 actions), spam had the second highest (81,699), followed by vandalism (68,039) and sockpuppeting/long-term abuse (46,029). Not all admin actions require the same amount of time or dedication: discovering and blocking open proxies may be fairly simple or automatic, and it is difficult to compare the time required for the three other major wiki-offenses. But as a first approximation, this simple measure shows that spam is one of the most frequent problems for admins.
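The totals quoted above can be reproduced from the per-reason tables in this article. The sketch below is only an illustration: the mapping of table rows to each offense (for example, counting "Created by block/ban evading sockpuppet" deletions and "Sock puppetry" protections under sockpuppeting) is our assumption, chosen because it matches the published totals, and the names `ACTIONS` and `total_actions` are hypothetical.

```python
# Counts copied from the blocks, deletions (all namespaces) and
# protections (all namespaces) tables in this article.
# Which rows belong to which offense is an assumption, not MER-C's documented method.
ACTIONS = {
    "open proxy": {"blocks": 387_984},
    "spam": {"blocks": 38_112, "deletions": 43_342, "protections": 245},
    "vandalism": {"blocks": 53_451, "deletions": 6_116, "protections": 8_472},
    "sockpuppeting/long-term abuse": {
        "blocks": 30_541,
        "deletions": 12_523,
        "protections": 2_965,
    },
}


def total_actions(counts):
    """Sum every kind of admin action recorded for one offense."""
    return sum(counts.values())


totals = {offense: total_actions(counts) for offense, counts in ACTIONS.items()}

# Rank offenses from most to fewest admin actions.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for offense, total in ranked:
    print(f"{offense}: {total:,}")
```

Treating every action as equal weight is the same simplification the text makes; a weighted sum would need per-action time estimates that the data does not provide.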
Blocks
Other than a global lock, which prevents an editor from editing on all WMF sites, a block on English Wikipedia is the most serious action that an editor faces. The table below records all blocks on enWiki for 2019. The use of an open proxy or web host accounts for almost 70% of the more than half a million blocks. The large number of open proxy blocks may reflect the effectiveness of a bot, ProcseeBot, in uncovering proxy users.
Vandalism, spamming, and sockpuppeting and long-term abuse are responsible for the large majority (72.4%) of the remaining 168,649 blocks after the open proxy blocks are subtracted. Across all blocks, spamming is the second most frequent of these reasons, behind vandalism and ahead of sockpuppeting. Dividing the data into registered accounts and anonymous (IP) editors, we see that blocks for spamming are predominantly for registered accounts, while blocks for vandalism are more evenly divided. Thus for registered accounts, spamming is the most frequent reason for blocking. Many vandals may feel that registering an account is too time-consuming for editing that will almost surely get them blocked, whereas spammers may feel that their editing is more difficult for admins to discover if they have a registered account. Alternatively, vandals may mostly target existing articles, whilst spammers often want to create an article on a non-notable business, and for that they need an account.
Reason | All Blocks | IP Blocks | Account Blocks |
---|---|---|---|
Total | 556,633 | 448,515 | 108,118 |
Open proxy/web host | 387,984 | 387,979 | 5 |
Vandalism | 53,451 | 30,700 | 22,751 |
Spamming | 38,112 | 970 | 37,142 |
Sockpuppetry and long term abuse | 30,541 | 8,003 | 22,538 |
Disruptive editing | 7,928 | 4,839 | 3,089 |
Anonymous blocks | 6,029 | 5,814 | 215 |
Not here to build the encyclopedia | 5,902 | 81 | 5,821 |
Other inappropriate username | 5,782 | - | 5,782 |
Unclassified | 5,385 | 2,384 | 3,001 |
Triggering the edit filter | 4,630 | 1,894 | 2,736 |
Range blocks | 3,313 | 3,260 | 53 |
Promotional username soft blocks | 2,360 | - | 2,360 |
BLP violations | 1,255 | 697 | 558 |
Harassment | 1,047 | 560 | 487 |
Edit warring | 872 | 285 | 587 |
Unauthorized, malfunctioning bot or bot username | 339 | - | 339 |
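The percentages discussed for the blocks table can be checked with a few lines of arithmetic. This is a minimal sketch using figures copied from the table; the variable and function names are ours and purely illustrative.

```python
# Figures copied from the enWiki blocks table for 2019.
TOTAL_BLOCKS = 556_633
OPEN_PROXY_BLOCKS = 387_984

# reason -> (all blocks, IP blocks, account blocks)
TOP_THREE = {
    "vandalism": (53_451, 30_700, 22_751),
    "spamming": (38_112, 970, 37_142),
    "sockpuppetry and long-term abuse": (30_541, 8_003, 22_538),
}

# Share of all blocks that are open proxy/web host blocks.
open_proxy_share = OPEN_PROXY_BLOCKS / TOTAL_BLOCKS

# Blocks left once open proxy blocks are subtracted.
remaining_blocks = TOTAL_BLOCKS - OPEN_PROXY_BLOCKS

# Share of the remaining blocks explained by the three main reasons.
top_three_share = sum(row[0] for row in TOP_THREE.values()) / remaining_blocks


def account_share(reason):
    """Fraction of a reason's blocks that hit registered accounts."""
    all_blocks, _ip_blocks, account_blocks = TOP_THREE[reason]
    return account_blocks / all_blocks


print(f"open proxy share of all blocks: {open_proxy_share:.1%}")
print(f"remaining blocks: {remaining_blocks:,}")
print(f"top three share of remaining: {top_three_share:.1%}")
for reason in TOP_THREE:
    print(f"{reason}: {account_share(reason):.1%} on registered accounts")
```

The account share makes the contrast in the text concrete: nearly all spam blocks fall on registered accounts, while vandalism blocks are split far more evenly between accounts and IP addresses.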
Looking at global locks rather than just enWiki blocks shows an even stronger link to spamming. Just over 200 locks per day, or 73,474 for the year, are performed because of spamming, accounting for 72.7% of all global locks. Many of these locks are likely due to the use of spam-bots, which apparently find it easy to avoid Wikipedia's CAPTCHA screening at registration. These locks are normally performed by stewards.
Reason | Count | Percent |
---|---|---|
Total | 101,108 | 100% |
Spamming | 73,474 | 72.7% |
Long term abuse | 22,795 | 22.5% |
Cross wiki abuse | 2,720 | 2.7% |
Unclassified | 1,063 | 1.1% |
Vandalism | 820 | 0.8% |
Inappropriate username | 183 | 0.2% |
Compromised | 53 | 0.1% |
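The percentage column and the "just over 200 locks per day" figure follow directly from the counts. A short sketch, assuming a 365-day year for 2019; `GLOBAL_LOCKS` and `DAYS_IN_2019` are illustrative names of our own.

```python
DAYS_IN_2019 = 365  # 2019 was not a leap year

# Counts copied from the global locks table.
GLOBAL_LOCKS = {
    "Spamming": 73_474,
    "Long term abuse": 22_795,
    "Cross wiki abuse": 2_720,
    "Unclassified": 1_063,
    "Vandalism": 820,
    "Inappropriate username": 183,
    "Compromised": 53,
}

total_locks = sum(GLOBAL_LOCKS.values())

# Percentage of all locks and average locks per day, per reason.
shares = {reason: count / total_locks for reason, count in GLOBAL_LOCKS.items()}
per_day = {reason: count / DAYS_IN_2019 for reason, count in GLOBAL_LOCKS.items()}

for reason, count in GLOBAL_LOCKS.items():
    print(f"{reason}: {count:,} ({shares[reason]:.1%}, {per_day[reason]:.1f} per day)")
```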
Deletions
Deletions are the most serious action that can be taken for articles, user pages including drafts, and files. Spamming itself is only named as the cause in 4.7% of deletions of articles on enWiki. However, other named causes may also be related to spamming or paid editing. For example, articles for deletion (AfD) discussions and expired proposed deletions (PRODs) together account for 27.8% of article deletions, and a major proportion of these may be due to spam.
Examining deletions in all namespaces presents a clearer picture. Spam is the fourth most frequent reason for deletion in all namespaces. Over 118 items per day, or 43,342 for the year, were deleted as spam. Many of these deletions are likely draft articles, e.g. those being reviewed at WP:Articles for creation (AfC) or being prepared in user space. Abandoned drafts, which are also likely to be related to spam or paid editing, were responsible for 67,253 deletions for the year.
The overall picture appears to be one of multi-level screening for deletion of spam on enWiki. In the first level, large numbers of drafts are submitted and later abandoned as the authors discover that we consider the draft to be spam. This includes up to 67,253 deletions. In subsequent screening levels, drafts are deleted outright as spam at AfC or the draft stage, amounting to 43,342 deletions. Many of the 38,287 miscellany for deletion (MfD) discussions may also be related to spam or paid editing, as are some of the 25,297 expired PRODs. These four categories (which may include some double counting) add up to a total of 174,179 possible deletions (477 per day) at the draft stage. After an article is accepted, it may later be deleted as spam (4,825 per year), as an expired PROD (9,271), or at an AfD debate (19,225). The battle of admins to clean up spam by deletion is spread out over many stages and is clearly time-consuming.
Reason | Articles | All Namespaces |
---|---|---|
Total | 102,344 | 623,202 |
Dependent on deleted page | 21,134 | 181,164 |
Deletion debate (AFD) | 19,225 | 20,772 |
Expired PROD | 9,271 | 25,297 |
Maintenance | 8,789 | 46,909 |
Deletion debate (RFD) | 5,792 | 7,316 |
Created by block/ban evading sockpuppet | 5,727 | 12,523 |
Fails to give reason for inclusion | 5,410 | 5,473 |
Cross-namespace redirect | 5,125 | 5,180 |
Spam | 4,825 | 43,342 |
Author/user request | 3,384 | 21,409 |
Unclassified | 2,686 | 18,128 |
Copyright violations | 2,386 | 10,761 |
Implausible redirect | 1,357 | 1,720 |
Unclassified nukes | 1,161 | 4,941 |
No content or context | 1,053 | 1,083 |
Repost of deleted content | 1,000 | 1,397 |
Unnecessary disambiguation | 692 | 705 |
Vandalism | 663 | 6,116 |
Copyright problems | 603 | 656 |
Redundant | 455 | 2,943 |
Test page | 366 | 3,563 |
Deletion debate (MFD) | 335 | 38,287 |
No reason given | 216 | 1,416 |
Expired BLP PROD | 177 | 179 |
Made up one day | 176 | 180 |
Attack page | 125 | 1,627 |
Patent nonsense | 118 | 1,062 |
Foreign language | 54 | 55 |
Abandoned draft | 22 | 67,253 |
Misuse of Wikipedia as a webhost | 6 | 28,813 |
Deletion debate (TFD) | 5 | 24,634 |
File redirect to Commons | 5 | 178 |
User page where user does not exist | 1 | 837 |
Problems with non-free files | - | 23,248 |
File moved to Commons | - | 13,796 |
Deletion debate (CFD) | - | 9,947 |
Empty category | - | 9,682 |
Lack of copyright information (files) | - | 4,805 |
Category renaming or merger | - | 3,196 |
Deletion debate (FFD) | - | 2,201 |
Corrupt file | - | 323 |
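The deletion figures quoted in this section can be recomputed from the table above. The sketch below is a rough check rather than a definitive method: it inherits the article's own caveat that the four draft-stage categories may double count, and the grouping names (`DRAFT_STAGE`, `ARTICLE_STAGE`) are hypothetical labels of ours.

```python
DAYS_IN_2019 = 365

# All-namespace counts for the four categories discussed as draft-stage screening.
DRAFT_STAGE = {
    "Abandoned draft": 67_253,
    "Spam": 43_342,
    "Deletion debate (MFD)": 38_287,
    "Expired PROD": 25_297,
}

# Article-namespace counts for deletions after an article is accepted.
ARTICLE_STAGE = {
    "Spam": 4_825,
    "Expired PROD": 9_271,
    "Deletion debate (AFD)": 19_225,
}
TOTAL_ARTICLE_DELETIONS = 102_344

# Upper bound on draft-stage deletions (categories may overlap).
draft_total = sum(DRAFT_STAGE.values())
draft_per_day = draft_total / DAYS_IN_2019

# Shares of article deletions quoted in the text.
spam_share = ARTICLE_STAGE["Spam"] / TOTAL_ARTICLE_DELETIONS
afd_prod_share = (
    ARTICLE_STAGE["Deletion debate (AFD)"] + ARTICLE_STAGE["Expired PROD"]
) / TOTAL_ARTICLE_DELETIONS

# Average spam deletions per day across all namespaces.
spam_per_day = DRAFT_STAGE["Spam"] / DAYS_IN_2019

print(f"draft-stage total: {draft_total:,} ({draft_per_day:.0f} per day)")
print(f"spam share of article deletions: {spam_share:.1%}")
print(f"AfD + expired PROD share of article deletions: {afd_prod_share:.1%}")
print(f"spam deletions per day, all namespaces: {spam_per_day:.1f}")
```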
Protections
At first glance, admins do not make extensive use of article protection to stop spamming. Spamming is the tenth most common reason for protection, both for articles and in all namespaces. Some protections caused by spamming might be included under other headings, for example, unclassified disruptive editing, addition of unsourced material, unclassified salting, or unclassified.
Reason | Articles | All Namespaces |
---|---|---|
Total | 22,543 | 27,275 |
Vandalism | 7,634 | 8,472 |
Unclassified disruptive editing | 5,170 | 5,416 |
Sock puppetry | 2,395 | 2,965 |
Addition of unsourced material | 1,916 | 1,955 |
BLP violations | 1,791 | 1,816 |
Unclassified salting | 1,265 | 1,890 |
Unclassified | 726 | 1,439 |
Edit warring/content dispute | 837 | 897 |
High risk page | 351 | 1,805 |
Spamming | 197 | 245 |
Arbitration enforcement | 197 | 202 |
Copyright violations | 64 | 75 |
User request | - | 98 |
Conclusions
Spamming is the most common reason for actions by administrators other than for the use of an open proxy. It is the most common reason for blocking registered accounts and is cited for 72.7% of all global locks.
Spam is only the ninth most common reason for article deletions, but the fourth most common reason for deletion in all namespaces. It appears that the effort to delete spam crosses many of the classifications used for deletions, e.g. abandoned drafts, MfD, AfD, and expired PRODs.
The least important aspect of the use of admin tools against spam appears to be article protection.
Discuss this story
"Other than a global lock, which prevents an editor from editing on all WMF sites, a block is the most serious action that an editor faces on Wikipedia."
I'd consider a site ban more serious than a block. --kingboyk (talk) 18:54, 1 March 2020 (UTC)
From a look at the numbers above it appears that the workload of all editors in fighting spam would be made substantially lighter if edits by only registered users were permitted. The spammers would then be much easier to trace as it would not be so easy for them to hop to another IP address. In lieu of that I would like to see administrators take a much more proactive approach to semi-protection. Xxanthippe (talk) 05:02, 2 March 2020 (UTC).