- Google's Bard chatbot launched in March in response to OpenAI's ChatGPT.
- The bot has been tested internally by Googlers, and now contractors are testing it as well.
- Some contractors say they're not given enough time to accurately rate the chatbot's responses.
Google has tasked some of its large contract workforce with helping to evaluate the quality of responses produced by its AI-powered chatbot, and contractors say they're often not given enough time to rate the responses' accuracy.
The company released the chatbot Bard in limited beta in March, after the launch of OpenAI's ChatGPT, and the bot works similarly: You prompt it with a question or a task, and it will return a humanlike response.
Contractors with the firm Appen are now helping to improve Google's AI chatbot. These workers are not told explicitly that their assignments pertain to Bard, but internal discussions about the new work date back to February 7, around the time Google first unveiled Bard. Internal documents reviewed by Insider include instructions telling raters to review the quality of responses produced by a theoretical "AI chatbot."
As so-called "raters," these contractors typically evaluate Google's search algorithms and the relevance of ads placed in search results, as well as flag harmful websites so they don't appear in search results.
Since January, much of the raters' work has shifted toward reviewing AI prompts, according to four raters who spoke to Insider on the condition of anonymity because they're not authorized to speak to the press. These raters expressed frustration with the rating process, saying they aren't given enough time to accurately grade the chatbot's responses and sometimes resort to best guesses so they can still get paid.
Bard drew criticism after the bot gave an incorrect answer during its announcement event. Google has said the chatbot will get better over time and should not be seen as a replacement for search.
In the run-up to the launch, Google in February also asked its full-time employees to spend two to four hours testing the bot, asking it questions and flagging answers that didn't meet the company's standards for accuracy and other measures. Employees could rewrite responses to questions on any topic, and Bard would learn from those responses.
Google and Appen did not respond to requests for comment.
Not enough time
An instruction document for raters viewed by Insider says they will be provided with a "Prompt from a user (e.g. a question, instruction, statement) to an AI chatbot along with two potential machine-generated Responses to the Prompt." The rater then assesses which response is better.
They may also elaborate in a text box why they chose one response over another, which can help the bot learn what attributes to look for in acceptable responses. Among other things, responses should be coherent and accurate, and based on up-to-date information.
Contractors said they're given a set amount of time to complete each task, such as reviewing a prompt, and that allotment can vary wildly, from as little as 60 seconds to several minutes. Raters said it's difficult to rate a response when they aren't well versed in the topic the chatbot is discussing, such as a technical subject like blockchain.
Because each assigned task represents billable time, some workers say they will complete the tasks even if they realize they cannot accurately assess the chatbot responses.
"Some people are going to say that's still 60 seconds of work, and I can't recoup this time having sat here and figured out I don't know enough about this, so I'm just going to give it my best guess so I can keep that pay and keep working," one rater said.
Another expressed a similar sentiment, saying that they want to get the facts right and provide the best quality chatbot experience they can but are simply not given enough time to research a topic before they need to provide an assessment. "A lot of us are at our breaking point, honestly."
"Three hours of research to complete a 60-second task, that's a great way to frame the problem we're facing right now," said one of the raters.
Contractors are demanding better working conditions
Contractors who work for Google through outsourcing firms have been increasingly agitating for better working conditions.
In February, raters visited the Googleplex to deliver a petition to Prabhakar Raghavan, Google's head of search, advocating for better wages. Google raters who work for Appen make between $14 and $14.50 an hour, despite supporting the search and advertising business that generates most of Google's revenue.
The Alphabet Workers Union (AWU) currently represents raters as a "solidarity union," meaning the labor group supports them and assists with activism but does not formally represent the workers or negotiate a collective-bargaining agreement.
In Austin, Texas, contractors for YouTube announced plans late last year to unionize with the AWU. The group estimates that Google employs more than 200,000 people as contractors who aren't recorded in the company's official head count.
Got a tip? This reporter can be reached via email at tmaxwell@businessinsider.com, Signal at 540.955.7134, or Twitter at @tomaxwell.