1 Introduction
The world has advanced rapidly with the widespread availability and use
of computing technologies. One field of computing that is reaching new heights
is artificial intelligence (AI). Yet even state-of-the-art AI systems fall short
of what could genuinely be called intelligence, a natural consequence of limited
computing power and the difficulty of modeling abstract concepts that humans
grasp easily. The domain of natural language processing (NLP) has advanced
by leaps and bounds, but we are still far from achieving prolonged natural
conversations. To demonstrate and advocate for such advancements in AI
technologies, we developed an intelligent personal assistant (IPA) that lives
inside the smartphone and assists in general tasks without explicit instructions.
The personal assistant application interacts naturally with the smart device's
user, follows the conversation, and performs actions in response to the user's
voice commands.
The idea of IPAs for phones and other platforms is not new; almost every
smartphone now ships with its operating system's IPA built in. Prominent
examples of intelligent or virtual personal assistants are Google Assistant,
Apple's Siri, Amazon's Alexa, and Microsoft's Cortana. There have also been
many attempts to develop assistants whose domains are deliberately restricted
to improve performance.
Iannizzotto et al. [9] have designed an architecture for intelligent assistants in
smart home environments using a Raspberry Pi device. Their prototype consisted
of the Raspberry Pi along with a small screen, microphone, camera, and speaker.
Using these components, the device can 'see' the user while speaking to them.
The screen displays a virtual red fox that moves its mouth while speaking and
can make several expressions; such features, while not strictly necessary for
functionality, help users form a positive impression of the assistant. The
authors combined several existing software tools, such as a Text-to-Speech (TTS)
system, Speech-to-Text (STT), and the Mycroft smart assistant [13]. In the end,
they seamlessly integrated these services and independent systems into a
full-fledged intelligent visual assistant that received positive test evaluations.
Matsuyama et al. [11] bring a social focus to virtual assistants. Their
assistant, built to help conference attendees find their seats and meet
like-minded people, speaks with the user and builds rapport with them through
analysis of visual, vocal, and verbal cues. The proposed assistant can generate
animated behavior that matches its level of rapport with the user, making the
user more comfortable with the assistant while also providing personalized
recommendations. Their work covers a small task domain, with the emphasis
placed on the social aspect of conversations.
Felix et al. [6] have built an Android application intended to help people with
visual impairments. The application uses AI technologies to increase the
independence of blind users and help them interact with their surroundings,
with voice input and output as the interface. Their system leverages Google
Cloud APIs to identify objects, perform text recognition, and maintain
conversations with the user. It can act as an audiobook reader while also
answering queries such as the current weather. The system maintains its focus
on helping the visually impaired.
In [3], Chowdhury et al. presented a restricted-domain assistant that uses a
finite-state automaton to perform language processing over a small vocabulary.
They implemented and trained their own Automated Speech Recognition (ASR)
module for two languages, English and Bengali. The scope of their assistant is
limited to opening and closing the Facebook and Google Chrome apps on the
phone, so the required training data was very small. Their focus was on building
the system with speech recognition and user intent identification as the
primary features.
Khattar et al. [10] have created a smart home virtual assistant based on the
Raspberry Pi. The device is extended with components such as microphones,
speakers, and cameras placed at various locations around the house.
2 Proposed System
This paper proposes a system for a voice-controlled AI assistant. The design of
the system is entirely software-based, needing only a smartphone and an internet
connection to operate. A server is required to host the assistant's processing
modules, which communicate with the phone application to take appropriate
actions. The system can be described as a chatbot extended with voice
capabilities.
The server's hardware requirements will depend on the number of users actively
using the application. At a bare minimum, it should have at least 2 GB of
memory and a moderately powerful processor.
The system works in the following manner: a voice input from the user is
converted to text, which is passed to the intent classifier. The intent
classifier outputs the intent along with any entities, and the dialogue manager
chooses what to do. A text response is then generated from templates, in
addition to any further actions executed in code, and is finally converted back
to voice. The sequential working procedure of the proposed system is
illustrated in Fig. 2. This sequence of events occurs every time the user talks
to the personal assistant through the application.
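
As a minimal sketch of one such turn, the snippet below strings the stages
together with stub functions standing in for the actual STT engine, intent
classifier, dialogue manager, and TTS engine; every name and the example
utterance are illustrative rather than taken from the implementation.

    # One assistant turn: voice in, voice out. All components are stubs.

    def speech_to_text(audio: bytes) -> str:
        return "what is the weather tomorrow"  # stub STT result

    def classify_intent(text: str) -> tuple[str, dict]:
        return "ask_weather", {"date": "tomorrow"}  # stub intent and entities

    def dialogue_manager(intent: str, entities: dict) -> str:
        # The real dialogue manager picks an action and fills a response template.
        return f"The weather {entities.get('date', 'today')} looks clear."

    def text_to_speech(text: str) -> bytes:
        return text.encode()  # stub; a real TTS engine returns synthesized audio

    def handle_utterance(audio: bytes) -> bytes:
        text = speech_to_text(audio)                # 1. speech-to-text
        intent, entities = classify_intent(text)    # 2. intent classification
        reply = dialogue_manager(intent, entities)  # 3. dialogue management
        return text_to_speech(reply)                # 4. text-to-speech

    print(handle_utterance(b"\x00").decode())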
3.1 Server
We have leveraged a Virtual Private Server (VPS) from IBM Cloud [4], which
contains our core assistant services. Two application programming interfaces
(APIs) are hosted on the server: the Rasa chatbot API and the Rasa Action
Handler API.
The server hosts the intent classification model as well as the dialogue
manager. Action execution is performed by the phone, barring the weather and
definition features, while the text responses come from the dialogue manager
on the server.
Feature         Description
Weather         Answer queries about the weather
Reminder        Set reminders; the assistant reminds the user with a notification on the phone
Alarm           Set alarms; the assistant sets an alarm in the phone's default alarm app
Read Aloud SMS  Read incoming text messages from the phone's default Messaging app out loud
News            Read aloud and display the latest or most trending news, either local or international
YouTube         Play YouTube videos on the phone from the given search term
Definition      Find the definition of the given word and read it aloud
Location        Display the current location of the user
(Figure: application screenshots. (a) Weather response when asking for the
weather 3 days into the future; (b) the assistant reading and displaying the
latest local news; (c) the assistant displaying the definition of "assistant"
after having spoken it; (d) the list of videos found on YouTube for "tree",
with the top result about to play automatically.)
3.3 Communication
The application communicates with the server through API calls. The server
internally communicates also through API calls, but it is entirely handled by
the framework and thus unnecessary for us to design.
REST API The app makes two API calls each time the user speaks to the
assistant. First, it sends the text retrieved from the speech to the server as
a REST request with a JSON body. The server responds with a text response
intended for speech synthesis. The phone application then requests the
conversation tracker, a JSON object that stores the entire conversation history
and data. When a task is expected, the tracker has a member called slots that
acts as the assistant's memory, and the slot member task-to-do holds the value
of the action the phone application has to execute. The slots also hold any
values for time, date, and other entities that have been filled. Using these
details, the app can execute actions, after which it makes a request to reset
the slots.
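
A minimal sketch of this exchange, written in Python for brevity (the phone app
itself is built with React Native): the endpoint paths follow Rasa's standard
HTTP API, which must be enabled on the server, while the server address,
conversation id, and utterance are illustrative assumptions.

    import requests

    SERVER = "http://localhost:5005"  # assumed Rasa server address
    SENDER = "user1"                 # illustrative conversation id

    # 1. Send the recognized speech; the reply text is fed to speech synthesis.
    reply = requests.post(
        f"{SERVER}/webhooks/rest/webhook",
        json={"sender": SENDER, "message": "remind me to call mom at 5 pm"},
    ).json()
    print(reply[0]["text"])  # text response to be spoken aloud

    # 2. Fetch the conversation tracker and read the slots.
    tracker = requests.get(f"{SERVER}/conversations/{SENDER}/tracker").json()
    task = tracker["slots"].get("task-to-do")  # slot name as described above
    # ... the app would execute `task` here, using the time/date slots ...

    # 3. One way to reset the slots afterwards: append a reset event.
    requests.post(
        f"{SERVER}/conversations/{SENDER}/tracker/events",
        json=[{"event": "reset_slots"}],
    )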
RASA Internally, the server runs two web servers. Both are Sanic-based [14];
one hosts the assistant and the other hosts an action execution server. The
action server is necessary for running custom code whenever a text response
alone is not sufficient. For most of the features, a text response cannot be
generated without first running some specific code; for example, to respond
with weather details, the weather API needs to be called first. The
communication between the action server and the assistant happens within the
framework, with minimal configuration required.
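
As an illustration of such custom code, the sketch below shows what a weather
action might look like with the Rasa SDK; the action name, the city slot, and
the get_weather helper are hypothetical stand-ins, not the paper's actual
implementation.

    from typing import Any, Dict, List, Text

    from rasa_sdk import Action, Tracker
    from rasa_sdk.executor import CollectingDispatcher


    def get_weather(city: Text) -> Text:
        # Hypothetical helper; the real system calls the AccuWeather API here.
        return f"Clear skies expected in {city} tomorrow."


    class ActionGetWeather(Action):
        # Custom action executed by the Sanic-based action server.

        def name(self) -> Text:
            return "action_get_weather"  # hypothetical action name

        def run(
            self,
            dispatcher: CollectingDispatcher,
            tracker: Tracker,
            domain: Dict[Text, Any],
        ) -> List[Dict[Text, Any]]:
            city = tracker.get_slot("city") or "Dhaka"  # hypothetical slot
            dispatcher.utter_message(text=get_weather(city))
            return []  # no extra events; the reply above goes to the user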
Other APIs The features implemented in this work required API calls to many
online services. For location and YouTube, we used Google Cloud Services APIs,
which are not free of cost once usage exceeds a certain quota. For the weather
service, we subscribed to the AccuWeather [1] APIs, which likewise charge
beyond a certain number of calls. For definitions, we use Owlbot [15], a
free-to-use API for English word definitions. For the news, we use web scraping
to retrieve information from a popular local news site.
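
For instance, the definition feature reduces to a single authenticated HTTP
request. The endpoint path, token header, and response shape below follow
Owlbot's published v4 dictionary API and should be treated as assumptions; the
token is a placeholder.

    import requests

    WORD = "assistant"
    TOKEN = "YOUR_OWLBOT_TOKEN"  # placeholder; Owlbot issues per-user tokens

    resp = requests.get(
        f"https://owlbot.info/api/v4/dictionary/{WORD}",  # assumed endpoint shape
        headers={"Authorization": f"Token {TOKEN}"},
    )
    resp.raise_for_status()
    definition = resp.json()["definitions"][0]["definition"]  # assumed schema
    print(definition)  # this text is what the assistant would read aloud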
3.4 Survey
After the personal assistant application was designed, some of its important
features were tested to evaluate its performance. The survey was conducted with
the intent of testing the assistant in real-life settings. It was completed by
ten individuals aged between 20 and 22, all unconnected to the development of
the assistant and therefore not biased in how they phrased their instructions,
as the developers would be.
The survey contained questions on the user's comfort with keeping the phone
application open in the background at all times, whether the user's speech was
properly understood, and a performance rating of each feature out of 5. Fig. 4
shows the distribution of average scores for the features. According to the
results, half of the respondents said they were uncomfortable keeping the app
always open, while all of them agreed that their speech was properly understood
by the assistant.
References
1. AccuWeather: AccuWeather APIs. https://developer.accuweather.com/apis (2021)
2. Bocklisch, T., Faulkner, J., Pawlowski, N., Nichol, A.: Rasa: Open source language understanding and dialogue management. arXiv preprint arXiv:1712.05181 (2017)
3. Chowdhury, S.S., Talukdar, A., Mahmud, A., Rahman, T.: Domain specific intelligent personal assistant with bilingual voice command processing. In: IEEE Region 10 Conference (TENCON). pp. 731–734 (2018)
4. Coyne, L., Gopalakrishnan, S., et al.: IBM private, public, and hybrid cloud storage solutions. http://www.redbooks.ibm.com/redpapers/pdfs/redp4873.pdf (2014)
5. Facebook: React Native. https://github.com/facebook/react-native (2021)
6. Felix, S.M., Kumar, S., Veeramuthu, A.: A smart personal AI assistant for visually impaired people. In: 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI). pp. 1245–1250 (2018)
7. Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., et al.: Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014)
8. Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing (2017)
9. Iannizzotto, G., Bello, L.L., Nucita, A., Grasso, G.M.: A vision and speech enabled, customizable, virtual assistant for smart environments. In: 2018 11th International Conference on Human System Interaction (HSI). pp. 50–56. IEEE (2018)
10. Khattar, S., Sachdeva, A., Kumar, R., Gupta, R.: Smart home with virtual assistant using Raspberry Pi. In: 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence). pp. 576–579. IEEE (2019)
11. Matsuyama, Y., Bhardwaj, A., Zhao, R., Romeo, O., Akoju, S., Cassell, J.: Socially-aware animated intelligent personal assistant agent. In: Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue. pp. 224–227 (2016)
12. Mozilla: TTS. https://github.com/mozilla/TTS (2021)
13. MycroftAI: Mycroft Core. https://github.com/MycroftAI/mycroft-core (2021)
14. Sanic Community Organization: Sanic. https://github.com/sanic-org/sanic (2021)
15. Owlbot: Owlbot Dictionary API. https://owlbot.info/ (2021)
16. Picovoice: Porcupine. https://github.com/Picovoice/porcupine (2021)