


Voice-Controlled Intelligent Personal Assistant

Mikhail Skorikov, Kazi Noshin Jahin Omar, and Riasat Khan

Department of Electrical and Computer Engineering


North South University
Dhaka, Bangladesh
{mikhail.skorikov, kazi.omar, riasat.khan}@northsouth.edu

Abstract. Virtual intelligent personal assistants perform various tasks and services by taking text, voice, and gesture inputs from different individuals. At present, personal assistants such as Google Assistant, Amazon's Alexa, and Apple's Siri are commonly used because they come embedded in our smartphones by default. This paper proposes an Android application-based intelligent personal assistant that differs from the common services provided by the default assistants. The application was built using React Native and the Rasa chatbot development tool, which comes with a transformer-based pre-trained language model. This model was then trained to recognize the various tasks that users want performed and to execute the code corresponding to those tasks. Survey results confirm that the personal assistant application successfully executes different commands and provides services for its users by interpreting their voice commands. The provided features may overlap with existing works, but the set of available features can be easily extended by developers using the Python language and the React Native framework.

Keywords: artificial intelligence · Android · intelligent personal assistant · natural language processing · speech recognition · speech synthesis

1 Introduction
The world has advanced quite rapidly with the widespread availability and use
of computing technologies. One field of computing that is reaching new heights is
the domain of artificial intelligence (AI). Even so, AI systems are not yet at the level of being called truly intelligent. This is a natural consequence of limited computing power and the difficulty of modeling abstract concepts that humans grasp easily. The domain of natural language processing (NLP) has advanced by leaps and bounds, but we are still quite far from achieving prolonged natural conversations. To demonstrate and advocate for such advancements in AI technologies, we developed an intelligent personal assistant (IPA) that lives inside the smartphone and assists with general tasks without explicit instructions. The personal assistant application interacts naturally with the smart device's user, follows the conversation, and performs actions in response to the user's voice commands.

The idea of IPAs for phones and other platforms is not new, considering that almost every smartphone comes equipped with its operating system's IPA built in. Examples of prominent intelligent or virtual personal assistants are Google Assistant, Apple's Siri, Amazon's Alexa, and Microsoft's Cortana. There have also been many attempts to develop assistants whose domains are deliberately restricted to improve performance.
Iannizzotto et al. [9] have designed an architecture for intelligent assistants in smart home environments using a Raspberry Pi device. Their prototype consisted of the Raspberry Pi along with a small screen, microphone, camera, and speaker. Using these components, the device can 'see' the user while speaking to them. The device's screen displays a virtual red fox that moves its mouth while speaking and can make several expressions. Such features, while not strictly functional, help users form a positive impression of the assistant. The authors used several existing software tools to make the entire system work, such as Text-to-Speech (TTS) and Speech-to-Text (STT) systems and the Mycroft smart assistant [13]. In the end, they seamlessly integrated various services and independent systems into a full-fledged intelligent visual assistant that received positive evaluations in testing.
Matsuyama et al. [11] bring a social focus to virtual assistants. Their assistant, made to help conference attendees find their seats and meet like-minded people, speaks with the user and builds rapport through analysis of visual, vocal, and verbal cues. The proposed assistant can generate animated behavior that matches the level of rapport with the user, making the user more comfortable with the assistant while also providing personalized recommendations. Their work covers a small task domain, and their emphasis is on the social aspect of conversations.
Felix et al. [6] have built an Android application intended to help people with visual impairments. The proposed application uses AI technologies to increase the independence of blind people and help them interact with their surroundings, using voice input and output as the interface. Their system leverages Google Cloud APIs to identify objects, perform text recognition, and maintain conversations with the user. It can act as an audiobook reader while also being able to answer queries such as the weather. Their system focuses on helping the visually impaired.
In [3], Chowdhury et al. presented a restricted-domain assistant that uses a finite state automaton to perform language processing over a small vocabulary. They implemented and trained their own Automated Speech Recognition (ASR) module for two languages, English and Bengali. The scope of their assistant is limited to opening and closing the Facebook and Google Chrome apps on the phone, so the required data was very small. Their focus was on building the system with speech recognition and user intent identification as the primary features.
Khattar et al. [10] have created a smart home virtual assistant based on the
Raspberry Pi. The device is extended through components such as microphones,
speakers, and cameras placed at various locations around the house. Existing
modules such as Google Speech Recognition API, Google Text-to-Speech (TTS)


API, and the Eigenface algorithm are used to provide speech recognition, text-
to-speech, and face recognition features. The assistant can be commanded by voice to control home appliances and to answer basic queries about the weather, the stock market, and word definitions.
This paper is organized as follows. Section 2 discusses the proposed system with appropriate visualizations for clarity. Section 3 discusses the results of our work in detail. Finally, Section 4 concludes the paper with some directions for future work.

2 Proposed System
This paper proposes a system for a voice-controlled AI assistant. The design of
the system is entirely software-based, needing only a smartphone and an inter-
net connection to operate. A server is required to host the assistant’s processing
modules, which will communicate with the phone application to take appropri-
ate actions. The system can be described as a chatbot extended to have voice
capabilities.

2.1 Software Components


The software dependencies and components of the proposed assistant are described in Table 1. The two most critical software components are the Rasa chatbot development tool [2], which consists of Rasa Natural Language Understanding (NLU) and Rasa Core for dialogue management, and the React Native mobile development framework [5].

2.2 Hardware Dependencies


The base hardware requirements for this system on the user side are only enough memory and storage space to run the smartphone application. The server hardware requirements will depend on the number of users actively using the application. At a bare minimum, the server needs at least 2 GB of memory and a moderately powerful processor.

Table 1. Software Components

Component          Library/Program used       Description
Speech-to-Text     Native function of phone   Converts user voice input to text
Text-to-Speech     Native function of phone   Converts the assistant's text output to voice
Intent Classifier  Rasa NLU                   Identifies the intention of a given text
Dialogue Manager   Rasa Core                  Selects the response and action to take based on a predicted or given intent
Action Executor    Phone app and server       Executes the selected action either on the phone or on the server
Phone App          React Native               The interface for users, connecting all services with the core assistant

2.3 System Architecture

The assistant is designed to be an online one, where natural language processing is done on the server and relevant actions are taken on the phone. Fig. 1 illustrates how the system is structured and how the components interact with each other.

Fig. 1. System Architecture.

2.4 System Workflow

The system works in the following manner: a voice input from the user is converted to its text form, which is then passed to the intent classifier. Next, the intent classifier outputs the intent along with any extracted entities, and the dialogue manager chooses what to do. Then, a text response is generated from templates, in addition to any further actions in code, and is finally converted to voice. The sequential working procedure of the proposed system is illustrated in Fig. 2. This sequence of events occurs every time the user talks to the personal assistant through the application.

Fig. 2. System Flowchart.
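To make this loop concrete, the following Python sketch strings the stages together in one function. Every helper here (speech_to_text, classify_intent, select_action, run_action, text_to_speech) is an illustrative stub standing in for the phone's native services and the server-side Rasa components described above; it is a minimal sketch of the control flow, not code from the application.

    # Minimal, self-contained sketch of one conversational turn.
    # The stub functions are illustrative placeholders, not project code.

    def speech_to_text(audio: bytes) -> str:
        # In the real app this is the phone's native STT service.
        return "set an alarm for 7 am tomorrow"

    def classify_intent(text: str):
        # In the real app this is Rasa NLU running on the server.
        return "set_alarm", {"time": "7 am", "day": "tomorrow"}

    def select_action(intent: str, entities: dict):
        # In the real app this is the Rasa Core dialogue manager.
        return "action_set_alarm", "Okay, alarm set for 7 am tomorrow."

    def run_action(action: str, entities: dict) -> None:
        # In the real app the phone (or the action server) executes this.
        print(f"executing {action} with {entities}")

    def text_to_speech(reply: str) -> str:
        # In the real app this is the phone's native TTS service.
        return reply

    def handle_utterance(audio: bytes) -> str:
        text = speech_to_text(audio)
        intent, entities = classify_intent(text)
        action, reply = select_action(intent, entities)
        run_action(action, entities)
        return text_to_speech(reply)

    print(handle_utterance(b""))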

2.5 System Features


The features of the assistant are concisely described in Table 2. The proposed automated personal assistant can perform several tasks for the user based on voice commands. For instance, it can set alarms and reminders and look up the definition of any specified word. It can report the user's location and weather information and read incoming text messages aloud. The assistant can play videos from YouTube, and it can report the most trending local and international news. It should be noted that these features are not aimed at assisting particular groups of people; they are meant for anyone's general-purpose use.

3 Results and Discussion


The final application was tested with ten unbiased individuals who rated the
performance of the system in terms of various features.

3.1 Server
We have leveraged a Virtual Private Server (VPS) from IBM Cloud [4], which hosts our core assistant services. Two application programming interfaces (APIs) are hosted on the server: the Rasa chatbot API and the Rasa Action Handler API.

The server hosts the intent classification model as well as the dialogue man-
ager. The action execution is performed by the phone, barring the weather and
definition features, but the text responses come from the dialogue manager on
the server.

Intent Classifier The intent classification model is trained by providing a list of intents with many examples. For instance, a greeting intent may be given examples such as "Hello" and many variations thereof. The intent classifier comes with a pre-trained language model that is then trained on these provided intents. The final result is a robust model that handles intent classification quite well. Each of the features listed in Table 2 needs its own intent, along with relevant parameters and supporting intents.
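As a brief illustration of how such a trained model is used, the Rasa server exposes an HTTP endpoint that returns the predicted intent and entities for a given text. The sketch below assumes a Rasa server started with its API enabled and reachable at a placeholder local address; the example utterance is likewise made up.

    import requests

    # Query the trained Rasa NLU model over HTTP (server URL is a placeholder).
    RASA_URL = "http://localhost:5005"

    resp = requests.post(f"{RASA_URL}/model/parse",
                         json={"text": "set an alarm for 7 am tomorrow"})
    result = resp.json()

    # The response contains the predicted intent and any extracted entities.
    print(result["intent"]["name"], result["intent"]["confidence"])
    for entity in result.get("entities", []):
        print(entity["entity"], entity["value"])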

Entity Extractor Entity extraction is the process of retrieving useful information such as dates, numbers, and names from a text. Most of our features require such information: for alarms, we need the alarm time and day; for the weather, the user may ask about a specific place or time; and so on. In this work, we have used an advanced natural language processing library called spaCy [8] and its models to extract the entities in our pipeline.
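A minimal sketch of this kind of extraction with spaCy is shown below. It assumes the small English model has been downloaded separately and is meant only to illustrate the mechanism, not to reproduce the exact pipeline configuration used in the assistant.

    import spacy

    # Load a small pre-trained English pipeline (download it beforehand with:
    # python -m spacy download en_core_web_sm).
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Remind me to call Alice next Tuesday at 5 pm in Dhaka")

    # Print each detected entity with its label (DATE, TIME, PERSON, GPE, ...).
    for ent in doc.ents:
        print(ent.text, ent.label_)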

Dialogue Manager The dialogue management and action selection are handled by a Rasa component called Rasa Core, which uses a transformer-based neural network model to identify the most likely action to take for a given intent. Rasa Core also has several other methods to determine the next action, one of which is mapping to stories. Stories are sequences of conversations and operations that take place in a chat environment. A story is written as a sequence of intents and responses and can be chained to form very complex dialogues.

Table 2. System Features

Feature          Description
Weather          Answer queries about the weather
Reminder         Set reminders; the assistant will remind the user with a notification on the phone
Alarm            Set alarms; the assistant will set an alarm in the phone's default alarm app
Read Aloud SMS   Read incoming text messages from the phone's default Messaging app out loud
News             Read aloud and display the latest or most trending local or international news
YouTube          Play YouTube videos on the phone from the given search term
Definition       Find the definition of the given word and read it aloud
Location         Display the current location of the user

Fig. 3. Screens of the phone application: (a) weather response from asking for the weather 3 days into the future; (b) the assistant reading and displaying the latest local news; (c) the assistant displaying the definition of "assistant" after having spoken it; (d) the list of videos found on YouTube for "tree", about to play the top result automatically.

3.2 Phone Application


The phone application is built using React Native, a cross-platform mobile development framework. It takes in voice input, uses the native speech-to-text service to convert it to text, and then makes an API call to the Rasa NLU service. The text response from the API is then converted to synthetic voice using the native text-to-speech service, while a JSON object containing action data for execution is also received.

Hot-Word Detection Hot-word or wake-word detection is the application's always-listening mechanism: it waits for a specific word that activates voice input so the user can start a conversation. We have used the Porcupine [16] wake-word detection engine, which allows a limited number of hot-words to be set up, all predefined by the engine. The words selected for our application are blueberry and bumblebee.
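Picovoice also provides a Python binding for Porcupine, which is convenient for illustrating the detection loop even though the mobile app itself uses the engine through its React Native integration. In the sketch below, the access key and the audio source are placeholders; only the two built-in keywords match our configuration.

    import pvporcupine

    # Create a detector for the two built-in wake words used by the app.
    # The access key is a placeholder; Picovoice issues these per developer.
    porcupine = pvporcupine.create(
        access_key="YOUR_PICOVOICE_ACCESS_KEY",
        keywords=["blueberry", "bumblebee"],
    )

    def next_audio_frame():
        # Placeholder microphone read: Porcupine expects porcupine.frame_length
        # 16-bit PCM samples per call at porcupine.sample_rate Hz.
        return [0] * porcupine.frame_length

    try:
        for _ in range(100):  # a real app would loop indefinitely
            keyword_index = porcupine.process(next_audio_frame())
            if keyword_index >= 0:
                print("Wake word:", ["blueberry", "bumblebee"][keyword_index])
                break
    finally:
        porcupine.delete()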

Speech-to-Text (STT) and Text-to-Speech (TTS) The application uses the phone's native STT and TTS services made available by Google. Though this reduces the complexity of implementation, there are concerns over the performance of these services. Our tests show that the clarity and volume of the voice are critical to successfully executing commands. Although there are better-performing alternatives such as Mozilla DeepSpeech [7] and Mozilla TTS [12], the increased storage burden on the user side led us to keep the default services.

Action Execution Once an action is identified, we have the option of executing it either on the server or on the phone. Phone execution is more reliable and faster, since most of the features are native to smartphones. So once the action is identified, the API returns a JSON object along with a text response for the user, and the phone application is responsible for using its native functions to execute the required action. With the exception of the weather and definition search features, the remaining features are all executed on the phone, although the server must still pass the extracted entities needed for execution to the phone. Fig. 3 shows screenshots of some samples of the executed actions.

Constraints Finally, the complete system performs as intended, with certain restrictions and performance reductions. The most significant limitation of our application is that the app must remain open in the phone's app list for it to function. The user also has to manually grant the permission to display over other applications in the app settings, as there is no alternative to setting it manually.

3.3 Communication

The application communicates with the server through API calls. The server also communicates internally through API calls, but this is handled entirely by the framework and did not need to be designed by us.

REST API The app makes two API calls each time the user speaks to the assistant. First, it sends the text retrieved from the speech to the server as a REST request with a JSON body. The server responds with a text response intended for speech synthesis. The phone application also has to request the conversation tracker, a JSON object that stores all the conversation history and data. When a task is expected, the tracker has a member called slots that acts as the assistant's memory, and the slot task-to-do holds the action the phone application has to execute. The slots also hold any filled values for time, date, and other entities. Using these details, the app can execute actions, after which it makes a request to reset the slots.
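For illustration, the two calls can be reproduced with a few lines of Python against Rasa's standard REST channel and tracker endpoints. The server address and sender identifier below are placeholders, and the task-to-do slot name is taken from our configuration; the exact request bodies used by the React Native app may differ.

    import requests

    RASA_URL = "http://localhost:5005"   # placeholder server address
    SENDER = "demo-user"                 # placeholder conversation id

    # 1) Send the transcribed user text to the assistant via Rasa's REST channel.
    reply = requests.post(
        f"{RASA_URL}/webhooks/rest/webhook",
        json={"sender": SENDER, "message": "what is the weather in Dhaka tomorrow"},
    ).json()
    for message in reply:
        print("assistant:", message.get("text"))

    # 2) Fetch the conversation tracker and read the slots that act as memory.
    tracker = requests.get(f"{RASA_URL}/conversations/{SENDER}/tracker").json()
    slots = tracker.get("slots", {})
    print("task to execute on the phone:", slots.get("task-to-do"))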

RASA Internally, the server runs two services. Both are Sanic-based web servers [14]: one hosts the assistant, and the other hosts the action execution server. The action server is necessary for running custom code when a text response alone is not sufficient. For most of the features, a text response cannot be generated without first running some specific code; for example, to respond with weather details, the weather API needs to be called first. The communication between the action server and the assistant happens within the framework, with minimal configuration required.
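Custom actions of this kind are written in Python with the Rasa SDK. The sketch below outlines what a weather action can look like; the action name, the location slot, and the fetch_weather helper are assumed for illustration and stand in for the real AccuWeather call, so this is a minimal sketch of the mechanism rather than the project's code.

    from typing import Any, Dict, List, Text

    from rasa_sdk import Action, Tracker
    from rasa_sdk.executor import CollectingDispatcher

    def fetch_weather(location: Text) -> Text:
        # Placeholder for the real weather API call.
        return f"It looks sunny in {location}."

    class ActionTellWeather(Action):
        # Runs on the action server when the dialogue manager predicts it.

        def name(self) -> Text:
            return "action_tell_weather"   # assumed action name

        def run(self, dispatcher: CollectingDispatcher, tracker: Tracker,
                domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
            location = tracker.get_slot("location") or "Dhaka"  # assumed slot
            dispatcher.utter_message(text=fetch_weather(location))
            return []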

Other APIs The features implemented in this work required API calls to several online services. For location and YouTube, we used Google Cloud Services APIs, which are not free of cost once the number of requests exceeds a certain threshold. For the weather service, we subscribed to the AccuWeather [1] APIs, which also incur a cost beyond a certain number of requests. For definitions, we use Owlbot [15], a free-to-use API for English word definitions. For news, we use web scraping to retrieve news information from a popular local news site.
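As an example of one of these calls, a word definition can be fetched from Owlbot with a single authenticated request. The token below is a placeholder, and the endpoint path and response fields follow Owlbot's public documentation as we understand it, so treat this as a sketch rather than a guaranteed interface.

    import requests

    # Look up a word definition with the Owlbot dictionary API.
    OWLBOT_TOKEN = "YOUR_OWLBOT_TOKEN"  # placeholder per-user token

    def define(word: str) -> None:
        resp = requests.get(
            f"https://owlbot.info/api/v4/dictionary/{word}",
            headers={"Authorization": f"Token {OWLBOT_TOKEN}"},
        )
        resp.raise_for_status()
        # Each entry is assumed to carry a part of speech and a definition.
        for entry in resp.json().get("definitions", []):
            print(entry.get("type"), "-", entry.get("definition"))

    define("assistant")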

3.4 Survey

After designing the personal assistant application, several of its important features were tested to evaluate its performance. The survey was conducted with the intent of testing the performance of the assistant in real-life settings. It was completed by ten individuals aged between 20 and 22 who were unrelated to the development of the assistant and hence were not biased in the way they phrased their instructions, as the developers would be.

Fig. 4. Performance rating of the features

The survey contained questions regarding the user's comfort with keeping the phone application open in the background at all times, whether the speech of the user was properly understood, and the performance rating of each feature out of 5. Fig. 4 shows the distribution of average scores for the features. According to the survey results, half of the respondents said they were uncomfortable keeping the app always open, and all of the users agreed that their speech was properly understood by the assistant.

4 Conclusion and Future Work


This paper proposes the development of an intelligent personal assistant for Android phones. The assistant is designed like a chatbot with extended abilities and a voice. The proposed assistant can perform several actions and interpret queries from voice input using natural language processing. Although the final app has some minor limitations, test users have responded positively about the usefulness and performance of the assistant. In the future, the capabilities of the assistant can be extended to include more unique features, and the sophistication of the existing features can be increased. The app can also be improved by enabling it to run in the background.

References
1. AccuWeather: AccuWeather APIs. https://developer.accuweather.com/apis (2021)
2. Bocklisch, T., Faulkner, J., Pawlowski, N., Nichol, A.: Rasa: Open source language
understanding and dialogue management. arXiv preprint arXiv:1712.05181 (2017)
3. Chowdhury, S.S., Talukdar, A., Mahmud, A., Rahman, T.: Domain specific intelli-
gent personal assistant with bilingual voice command processing. In: IEEE Region
10 Conference (TENCON). pp. 731–734 (2018)
4. Coyne, L., Gopalakrishnan, S., et al.: IBM private, public, and hybrid cloud
storage solutions. http://www.redbooks.ibm.com/redpapers/pdfs/redp4873.pdf
(2014)
5. Facebook: React Native. https://github.com/facebook/react-native (2021)
6. Felix, S.M., Kumar, S., Veeramuthu, A.: A smart personal ai assistant for visually
impaired people. In: 2018 2nd International Conference on Trends in Electronics
and Informatics (ICOEI). pp. 1245–1250 (2018)
7. Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger,
R., Satheesh, S., Sengupta, S., Coates, A., et al.: Deep speech: Scaling up end-to-
end speech recognition. arXiv preprint arXiv:1412.5567 (2014)
8. Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom
embeddings, convolutional neural networks and incremental parsing (2017)
9. Iannizzotto, G., Bello, L.L., Nucita, A., Grasso, G.M.: A vision and speech enabled,
customizable, virtual assistant for smart environments. In: 2018 11th International
Conference on Human System Interaction (HSI). pp. 50–56. IEEE (2018)
10. Khattar, S., Sachdeva, A., Kumar, R., Gupta, R.: Smart home with virtual assis-
tant using raspberry pi. In: 2019 9th International Conference on Cloud Comput-
ing, Data Science & Engineering (Confluence). pp. 576–579. IEEE (2019)
11. Matsuyama, Y., Bhardwaj, A., Zhao, R., Romeo, O., Akoju, S., Cassell, J.: Socially-
aware animated intelligent personal assistant agent. In: Proceedings of the 17th
annual meeting of the special interest group on discourse and dialogue. pp. 224–
227 (2016)
12. Mozilla: TTS. https://github.com/mozilla/TTS (2021)
13. MycroftAI: Mycroft Core. https://github.com/MycroftAI/mycroft-core (2021)
14. Sanic Community Organization: Sanic. https://github.com/sanic-org/sanic (2021)
15. Owlbot: Owlbot Dictionary API. https://owlbot.info/ (2021)
16. Picovoice: Porcupine. https://github.com/Picovoice/porcupine (2021)
