BotWikiAwk is a framework and libraries for creating and running bots on Wikipedia.
Features
- Bot management tools compatible with bots written in any language
- Libraries for bots written in awk
- Non-SQL. Data files in plain-text
- Manage batches of articles of any size, 50 for WP:BRFA or 50k+ for production runs
- Runs using GNU parallel making full use of multi-core CPUs
- Or runs on the Toolforge grid across 40+ distributed computers
- Dry-run mode: diffs can be checked before uploading
- Inline colorized diffs on the command-line
- Re-run individual pages via a cached copy of the page (download wikisource once, run the bot many times)
- Installs in a single directory, easily removed
- Includes complete example bots and skeleton bots
- Includes a general awk library developed over years of writing bots
- Includes a command-line interface to the MediaWiki API
- In development and private use since 2016. Public June 2018
Overview
BotWikiAwk contains two elements:
- A library of routines for writing bots in awk
- An integrated set of tools for running and managing bots written in any language
Why awk? Awk is a small, elegant language whose interpreter is a single binary file. It is a POSIX tool installed on most unix computers. The language syntax is simple and forgiving. It is usually associated with one-line scripts, but since about 2012 the GNU version has become more powerful. While not a general-purpose language, awk is primarily a text-processing language, which is exactly the work bots do. Areas that awk cannot support (eg. networking) are handled by external programs.
BotWikiAwk is batch oriented. After creating a master list of articles, it carves out batches, each assigned a unique name called a project ID. Each utility takes as input the project ID and the action to take for the project. Projects can be any size, up to the full size of the master list ie. a single project.
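As a rough illustration of awk handing work to an external program, the sketch below runs a command from inside awk and reads its output with getline. The function name run_external is hypothetical and is not part of BotWikiAwk; in a bot the command could be a download tool rather than echo.

```shell
# awk has no networking of its own, so it can run an external command
# and read that command's output line by line with getline.
# run_external is a hypothetical name used only for this illustration.
run_external() {
    awk -v cmd="$1" 'BEGIN {
        while ((cmd | getline line) > 0)
            print "got: " line
        close(cmd)
    }'
}

run_external "echo hello"    # prints: got: hello
```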
Requirements
- A Wikipedia account with bot flag permissions
- GNU awk (version 4.1+)
- GNU wget (version 1.13+)
- GNU parallel (sudo apt-get install parallel) - not required on Toolforge
- openssl for login authentication (if writing to pages)
- wdiff (sudo apt-get install wdiff) - small utility for inline diffs
- GNU tac (sudo apt-get install tac) - small utility reverse cat
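A quick way to see which of these tools are absent is to test for each command on the PATH. This is a minimal sketch: the helper name missing_tools is hypothetical, and the command names (gawk, wget, parallel, openssl, wdiff, tac) are assumed from the list above. It does not check version numbers.

```shell
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not a BotWikiAwk utility.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Command names assumed from the requirements list; no output means all were found.
missing_tools gawk wget parallel openssl wdiff tac
```

On Toolforge, parallel may be reported as missing, which is expected since it is not required there.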
Setup
If installing on Toolforge see special instructions.
- Download (zip) or Clone the project:
git clone https://github.com/greencardamom/BotWikiAwk
- Create an AWKPATH environment variable in .bash_profile eg.
export AWKPATH=.:/home/adminuser/BotWikiAwk/lib:/usr/local/share/awk
- If on Toolforge see special instructions
- Add BotWikiAwk to the PATH eg.
PATH=$PATH:/home/adminuser/BotWikiAwk/bin
- Log out and back in so environment vars are set.
- cd to ~/BotWikiAwk and run
./setup.sh
- Edit
~/BotWikiAwk/lib/botwiki.awk
- Change #1) StopButton URL
- Change #2) UserPage URL
- Read the SETUP file for additional instructions
- For Wikipedia edit authorization: add your OAuth key/secrets to bin/wikiget.awk -- see EDITSETUP
New bot
To create a new bot:
makebot ~/botname
The path should point to a new directory, botname, that has not been created yet, with "botname" being the name of your bot (no spaces recommended). The path can be anywhere, but if it differs from the default ~/BotWikiAwk/bots directory, also update ~/BotWikiAwk/lib/botwiki.awk section #3 following the "mybot" example.
I find locating the bot outside the ~/BotWikiAwk directories makes it easier to upgrade BotWikiAwk later. One can simply delete everything and re-clone it (saving only the original botwiki.awk file).
makebot will prompt for the type of bot skeleton. If the bot will be doing operations on CS1|2 templates choose #2.
Writing bot
See ~/BotWikiAwk/example-bots
<to be expanded>
Running bot
In summary, the process works by running four utilities:
- wikiget downloads a list of page titles the bot will operate on eg. 10k page titles from a category
- project -c creates a new project (or batch) to process eg. the first 50 pages
- runbot executes the bot in dry-run mode on a given project
- bug -dc to view diffs for individual pages, to see what changes the bot made
- bug -r to re-run for individual pages
- when satisfied the bot is running well, runbot again in live mode to upload changes. Repeat with larger project sizes until done.
The utility programs (wikiget, project, runbot and bug) have many options, listed with -h
Example bot
The easiest way to demonstrate BotWikiAwk is by running a real bot.
0. Create the bot using the existing example, accdate, a bot for removing |access-date= in CS1|2 templates.
- Make the bot:
makebot ~/BotWikiAwk/bots/accdate
- Copy in the pre-written example bot:
cp ~/BotWikiAwk/example-bots/accdate.awk ~/BotWikiAwk/bots/accdate
- cd to the bot directory
cd ~/BotWikiAwk/bots/accdate
- All utilities work only from the bot's home directory, except wikiget, which can run anywhere.
A. Make a master list of pages to process, called an "auth" file. Here the list comes from a category, using the "-c" option.
wikiget -c "Category:Pages using citations with accessdate and no URL" > meta/accdate20181102.auth
- The file ends in .auth (required) and is located in the bot's meta subdirectory.
- In this case '20181102' is today's date but it can be any identifying string of numbers or letters.
- The "accdate" portion of the filename can also be anything, though it's helpful to use the bot name.
- Manually edit meta/accdate20181102.auth to remove unwanted pages eg. pages in the "Template:" or "Wikipedia:" namespaces.
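For long lists, the manual edit can be approximated with a filter. The sketch below assumes the .auth file holds one page title per line with the namespace prefix at the start of the line. The function name filter_namespaces is hypothetical, and the two prefixes are just the examples named above.

```shell
# Drop titles in the Template: and Wikipedia: namespaces from a title list.
# $1 = path to a file with one page title per line (assumed layout).
# filter_namespaces is a hypothetical helper, not a BotWikiAwk utility.
filter_namespaces() {
    grep -v -E '^(Template|Wikipedia):' "$1"
}

# Possible usage (write to a temporary file first, then replace the original):
# filter_namespaces meta/accdate20181102.auth > /tmp/filtered.auth
# mv /tmp/filtered.auth meta/accdate20181102.auth
```

It is still worth skimming the result by eye, since other unwanted namespaces may be present.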
B. Create (-c) a batch (called a 'project') of 50 articles to process
project -c -p accdate20181102.00001-00050
- The project ID (-p) is composed of the name created in Step A (accdate20181102) followed by a "." followed by a line range (00001-00050), which means line #1 -> line #50 in the file meta/accdate20181102.auth ie. the first 50 articles to process.
- The project ID is referenced by every utility to identify which project is being worked on.
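To make the naming scheme concrete, the sketch below splits a project ID at its last "." and prints the matching lines of a master list. It illustrates what the ID denotes and is not how the project utility is implemented. The function name project_lines is hypothetical.

```shell
# Print the lines of a master list that a project ID refers to.
# $1 = project ID such as accdate20181102.00001-00050
# $2 = path to the matching .auth file
# project_lines is a hypothetical helper, not a BotWikiAwk utility.
project_lines() {
    range=${1##*.}      # text after the last "." eg. 00001-00050
    first=${range%-*}   # 00001
    last=${range#*-}    # 00050
    # Adding 0 forces a numeric comparison, so leading zeros are harmless.
    awk -v first="$first" -v last="$last" \
        'NR >= first + 0 && NR <= last + 0' "$2"
}

# Possible usage:
# project_lines accdate20181102.00001-00050 meta/accdate20181102.auth
```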
C. Run the bot in dry-run mode
runbot accdate20181102.00001-00050 auth dryrun
D. Look at resulting local diffs
- Find which pages the bot modified as recorded in the "discovered" file in the meta directory
cat meta/accdate20181102.00001-00050/discovered
- For each, visually check the diff with bug -dc
bug -p accdate20181102.00001-00050 -n "Theory of relativity" -dc
- The bot can be re-run for individual pages
bug -p accdate20181102.00001-00050 -n "Theory of relativity" -r
- Further info is available with -v, which shows the location of the data directory
bug -p accdate20181102.00001-00050 -n "Theory of relativity" -v
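When many pages were modified, typing a bug -dc command per page gets tedious. The sketch below prints one command per title so they can be reviewed and run one at a time. It assumes the discovered file holds one page title per line, which may not match the real file layout, and the function name print_diff_commands is hypothetical. Titles containing double quotes would need extra escaping.

```shell
# Print one "bug -dc" command per title listed in a file.
# $1 = project ID
# $2 = file with one page title per line (assumed layout of "discovered")
# print_diff_commands is a hypothetical helper, not a BotWikiAwk utility.
print_diff_commands() {
    while IFS= read -r title; do
        [ -n "$title" ] || continue     # skip blank lines
        printf 'bug -p %s -n "%s" -dc\n' "$1" "$title"
    done < "$2"
}

# Possible usage:
# print_diff_commands accdate20181102.00001-00050 \
#     meta/accdate20181102.00001-00050/discovered
```

Printing the commands instead of executing them keeps the helper free of side effects.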
E. Push changes to Wikipedia
- If project was previously run in dry-run mode, first delete it and recreate
project -x -p accdate20181102.00001-00050
project -c -p accdate20181102.00001-00050
- Then run in live mode (CAUTION: don't do this for the demonstration)
runbot accdate20181102.00001-00050 auth
- If project has never been created before just create it new and run
project -c -p accdate20181102.00001-00050
runbot accdate20181102.00001-00050 auth
F. Repeat
- Repeat steps B->E, increasing the size of the batch and using "bug -dc" to spot check diffs until confidence is high. Once confidence is high, only the last part of step E is required. As can be seen, each project run is a 2-step process: create the project defining its size, then run the bot on the project.
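Project IDs for growing batches can be generated rather than typed by hand. This is a sketch under two assumptions that the text does not confirm: that line numbers are always zero-padded to five digits as in the example, and that each new batch starts on the line after the previous batch ended. The function name project_id is hypothetical.

```shell
# Print the project ID for a batch of $3 lines starting at line $2
# of the master list named $1, using five-digit zero padding (assumed).
# project_id is a hypothetical helper, not a BotWikiAwk utility.
project_id() {
    printf '%s.%05d-%05d\n' "$1" "$2" "$(( $2 + $3 - 1 ))"
}

project_id accdate20181102 1 50      # accdate20181102.00001-00050
project_id accdate20181102 51 500    # accdate20181102.00051-00550
```

The printed ID can then be passed to project -c -p and runbot as in steps B and E.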