Voice Controlled Personal Assistant Using Raspberry Pi
ISSN 2229-5518
Abstract–The purpose of this paper is to illustrate the implementation of a Voice Command System as an Intelligent Personal Assistant (IPA) that can perform numerous tasks or services for an individual. These tasks or services are based on user input, location awareness, and the ability to access information from a variety of online sources (such as weather or traffic conditions, news, stock prices, user schedules, retail prices, the current time, travel assistance, events, and notifications from social applications; in addition, the user can ask the system questions, invoke its machine learning, or retrieve information from Wikipedia).

A Raspberry Pi is used as the main hardware to implement this model, which works on the primary input of a user's voice. The voice input is converted into text using a speech-to-text engine. The text thus produced is used for query processing and for fetching relevant information. Once the information is fetched, it is converted back to speech using text-to-speech conversion, and the relevant output is given to the user. Additionally, some extra modules were implemented that work on the concept of keyword matching.

The system can help the visually impaired connect with the world by giving them access to Wikipedia, a calculator, email, and music, all through their voice. It can also keep people secure, since it can be used as a surveillance system that captures the voice of the person standing at the door and checks it for similarity. It can likewise be a source of entertainment and information for the blind or visually impaired. The model interacts with other systems by means of IoT, thus providing a fully automated system. Many experiments were carried out and their results documented.

…hence voice command systems are omnipresent in computing devices.

There have been some very good innovations in the field of speech recognition. Some of the latest innovations are due to improvements in, and the heavy usage of, big data and deep learning in this field [1]. These innovations have led the technology industry to apply deep learning methods in building and using speech recognition systems: Google was able to reduce the word error rate by 6% to 10% relative for systems whose word error rate was between 17% and 52% [2].

Text-to-speech conversion is the process of converting machine-recognized text into a language that a listener can identify when the text is read out loud. It is a two-step process, divided into a front end and a back end [3]. The first part is responsible for converting numbers and abbreviations into a written-word format; this is also referred to as text normalization. The second part involves processing the signal into an understandable one [4].

Speech recognition is the ability of a machine, for instance a computer, to understand words and sentences spoken in any language [5]. These words or sentences are then converted into a format that the machine can understand. Speech recognition is basically implemented using vocabulary systems [6]. A speech recognition system may be a small-vocabulary, many-user system or a large-vocabulary, small-user system [7].
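As a rough illustration of the front-end normalization step described above, the following Python sketch expands small numbers and a few abbreviations into written words. The function names, the abbreviation table, and the supported number range are our own illustrative assumptions, not part of any particular TTS engine:

```python
# Minimal sketch of a TTS front end: text normalization.
# The abbreviation table and the 0-99 number range are illustrative
# assumptions, not the rules of any real TTS engine.

UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}

def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 99 (enough for a demo)."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] + ("" if unit == 0 else " " + UNITS[unit])

def normalize(text: str) -> str:
    """Replace digits and known abbreviations with written words."""
    words = []
    for token in text.split():
        low = token.lower()
        if low in ABBREVIATIONS:
            words.append(ABBREVIATIONS[low])
        elif token.isdigit() and int(token) < 100:
            words.append(number_to_words(int(token)))
        else:
            words.append(token)
    return " ".join(words)

print(normalize("Dr. Smith lives at 42 Baker St."))
```

A real front end would also handle dates, currency, and context-dependent abbreviations; the back end then turns the normalized words into an audio signal.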
Keywords–Personal Assistant, Text to Speech, Speech to Text, Raspberry Pi, Voice Command System, Query Processing, Machine Learning.

II. SYSTEM ARCHITECTURE

Existing Systems
IJSER © 2017
http://www.ijser.org
International Journal of Scientific & Engineering Research Volume 8, Issue 11, November-2017 1612
…after the command, the output is given in the form of voice, i.e., speech.

Hardware Implementation

The Raspberry Pi is the heart of the voice command system, as it is involved in every step, from processing data to connecting the components together [8]. The Raspbian OS is mounted onto the SD card, which is then loaded into the card slot to provide a functioning operating system.

The Raspberry Pi needs a constant 5 V, 2.1 A power supply. This can be provided either through an AC supply using a micro-USB charger or through a power bank.

Ethernet is used to provide an internet connection to the voice command system. Since the system relies on online text-to-speech conversion, online query processing, and online speech-to-text conversion, a constant connection is needed to achieve all this.

A monitor gives the developer an additional way to look at the code and make edits if needed. It is not required for any sort of communication with the end user.

Speakers: once the query put forward by the user has been processed, the text output of that query is converted to speech using the online text-to-speech converter. This speech, which is the audio output, is then sent to the user through the speakers running on the audio out.

Flow of Events in Voice Command System

First, when the user starts the system, he uses a microphone to send in the input. Essentially, the system takes sound input from the user and feeds it to the computer for further processing. That sound input is then fed to the speech-to-text converter, which converts the audio input into text output that the computer can recognize and process.

That text is then parsed and searched for keywords. Our voice command system is built around keywords: it searches the text for keywords to match, and once keywords are matched, it gives the relevant output.

This output is in the form of text. It is then converted to speech output using a text-to-speech converter, which involves an optical character recognition (OCR) system. OCR categorizes and identifies the text, and the text-to-speech engine then converts it to audio output. This output is transmitted via the speakers, which are connected to the audio jack of the Raspberry Pi, as shown in Figure 2.
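The keyword-matching step in the flow above can be sketched as a simple dispatcher. The module names, keywords, and replies below are illustrative only; in the real system the input would come from the speech-to-text engine and the reply would be spoken through the text-to-speech engine:

```python
# Sketch of the keyword-matching dispatch described above.
# Module names, keywords, and replies are illustrative assumptions,
# not the exact tables used in the paper's system.

MODULES = [
    ("weather", ["weather"], "Fetching the weather report..."),
    ("joke", ["joke", "knock knock"], "Here is a joke..."),
    ("define", ["define"], "Which word should I define?"),
]

UNCLEAR_REPLIES = ["I beg your pardon", "Say that again"]

def dispatch(transcript: str) -> tuple:
    """Return (module_name, reply) for the first module whose keyword
    appears in the transcribed command; fall back to 'unclear'."""
    text = transcript.lower()
    for name, keywords, reply in MODULES:
        if any(kw in text for kw in keywords):
            return name, reply
    # The "unclear" module has no keywords and the lowest priority:
    # it runs only when no other module matched.
    return "unclear", UNCLEAR_REPLIES[0]

print(dispatch("what is the weather like today"))
print(dispatch("blargh"))
```

A fallback entry with an empty keyword list mirrors the "Unclear" module described later: it is consulted last, after every other module has failed to match.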
Speech To Text Engine

AVS is a Speech-To-Text (STT) engine which is used to convert the commands given by the user as audio input into text form, so that these commands can be interpreted properly by the modules. To use the AVS engine, an application has to be created in the Amazon developer console, and the generated API key has to be used to access the speech engine. It requires a continuous internet connection, as data is sent over the Amazon servers.

Text To Speech Engine

AVS is also used as a Text-To-Speech (TTS) engine, to create a spoken version of the text in a computer document, such as a help file or a web page. TTS can enable the reading of computer display information for a visually challenged person, or may simply be used to augment the reading of a text message. To use the AVS engine, an application has to be created in the Amazon developer console, and the generated API key has to be used to access the speech engine. It requires a continuous internet connection, as data is sent over the Amazon servers.

Query Processor

The Voice Command System has a module for query processing which works, in general, like many query processors do: it takes input from the users, searches for relevant outputs, and then presents the user with the appropriate output. In this system we use the Wolfram Alpha site as the source for implementing query processing. The queries that can be passed to this module include retrieving information about famous personalities, simple mathematical calculations, descriptions of general objects, etc.

Wikipedia

News

…a corresponding error message is generated.

Weather

This module tells the user about the weather conditions of the location whose station identifier is specified in the user's profile. This module can be executed using the keyword "weather". The weather information is taken from the Weather Underground service and includes details such as temperature, wind speed, and wind direction. It generates an error message if the information cannot be retrieved for the specified location.

Movies

The voice command system searches for a relevant movie on the keyword "movie". It is implemented using a function which asks the user which movie they want to know more about. It searches for the top five results matching the movie name and confirms the movie among the listed names. On confirmation of the movie, it gives detailed information including the rating, producer, director, cast, runtime, and genres. In case of failure, it generates the error "Unable to find information on the requested movie".

Define

The Define module is used to fetch the definitions of a word specified by the user. The execution of this module is started by the keyword "define". The user is then asked for the word whose definition has to be provided. The module checks for the definitions of the specified word using the Yandex Dictionary API. All the available definitions are provided as a
response from the API, and the same is given as output to the user in audio form. If the connection to the API cannot be established properly, then the error message "Unable to reach dictionary API" is generated for the user.

Find My iPhone

The Find My iPhone module lets users find their iPhone by voice. This is done using the user's iCloud ID and password to make the connection to the server. After the password is authenticated, the module checks the status of the iPhone and then connects to it through iCloud. After the connection, a ring command is sent to the iPhone, starting a sound and a notification on the phone so that it can be located. If a failure occurs, the system generates an error that the iPhone was not found. An error is also generated when there are multiple iPhones associated with a single Apple ID.

Joke

The Joke module can be used for entertainment by the user. This module works on the keywords "joke" or "knock knock". The jokes used in this module are predefined in a text file, from which the jokes are read in random order. A start and an end line are present in every joke to differentiate it from the others in the file. All the lines of a joke are spoken by the system in the specified order only.

Unclear

This module is used to generate an error message if the keyword specified by the user does not match the keywords present in any module of the system. This module has an empty array of keywords and is allocated the lowest priority, so that it is executed only after the keyword has been checked against all the other modules of the system. The messages generated by this module include statements like "I beg your pardon", "Say that again", "I'm sorry, could you repeat that", and "My apologies, could you try saying that again", which are selected in random order each time this module is executed.

Other Command Specific Modules

IV. RESULT

The Voice Command System works on the idea and the logic it was designed with. Our personal assistant uses a button to take a command. Each command given to it is matched against the names of the modules written in the program code. If the command matches any set of keywords, then that set of actions is performed by the Voice Command System. The Find My iPhone, Wikipedia, and Movies modules are based on API calls. We have used open-source text-to-speech and speech-to-text converters, which give us customizability. If the system is unable to match any of the given commands with the keywords provided for each command, then the system apologizes for not being able to perform the said task. All in all, the system works along the expected lines, with all the features that were initially proposed. Additionally, the system also provides enough promise for the future, as it is highly customizable and new modules can be added at any time without disturbing the working of the current modules.

V. CONCLUSIONS AND FUTURE WORK

In this paper, we introduced the idea and rationale behind the Voice Command System, discussed the flaws in current systems and the ways of resolving those flaws, and laid out the architecture of the presented Voice Command System. Many modules come from open-source systems and have been customized for the presented system. This helps get the best performance from the system in terms of space and time complexity.

The Voice Command System has enormous scope in the future, just as Siri, Google Now, and Cortana have become popular in the mobile industry. This makes the transition to a complete voice command system smooth. Additionally, it also paves the way for a Connected Home using the Internet of Things, voice command systems, and computer vision.

Acknowledgment

The author would like to thank Prof. Dr. Shilong Ma of Beijing University of Aeronautics and Astronautics (BUAA), China, for his helpful discussions and effective support.
[5] Singh, Bhupinder, Neha Kapur, and Puneet Kaur. "Speech recognition with hidden Markov model: a review." International Journal of Advanced Research in Computer Science and Software Engineering 2.3 (2012).
[6] Lamere, Paul, et al. "The CMU SPHINX-4 speech recognition system." IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2003), Hong Kong. Vol. 1. 2003.
[7] Black, Alan W., and Kevin A. Lenzo. "Flite: a small fast run-time synthesis engine." 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis. 2001.
[8] Lee, Chin-Hui, Frank K. Soong, and Kuldip K. Paliwal, eds. Automatic Speech and Speaker Recognition: Advanced Topics. Vol. 355. Springer Science & Business Media, 2012.
[9] Rich, Elaine, Kevin Knight, and Shivashankar B. Nair. Artificial Intelligence. McGraw-Hill, 2013.
[10] Hickson, Steven. "Voice Control and Face Recognition on the Raspberry Pi" [online]. Available: http://stevenhickson.blogspot.com.
[11] Raspbian [online]. Available: http://www.raspbian.org/.
[12] Wobcke, Wayne, Anh Nguyen, Van Ho, and Alfred Krzywicki. "The Smart Personal Assistant: An Overview." School of Computer Science and Engineering, University of New South Wales, Sydney NSW 2052, Australia (journal) [online].
[13] "Intelligent Personal Assistant", "Siri", "Cortana" [online]. Wikipedia.
[14] "Pineapple and Raspberry Pi", circuit designs and comparison [online]. Google.