CS232 Team Project Final: Infoglut
Julie L. Gill
Luisa M. Morales
Igor D. Pokryshevskiy
INFORMATION OVERLOAD A.K.A INFOGLUT 2
Abstract
The problem of infoglut is growing, and computer scientists and engineers need to find a solution to this
tough and complex problem. Our intelligent embedded system is a web portal, which
allows a user to access all of his or her various social media, email, news, and blogs in
one place from any computer. Using a system based on the concept of genetic
algorithms, the portal learns and continuously refines favorite keywords and relationships
between the user’s preferences and actions and displays only the most relevant
information.
The Internet has granted us entry into a vast repository and exchange of information,
more than can practically be measured. The problem is that we
often cannot separate the information that may increase our knowledge and improve
our lives from that which has no value and leads us astray from productive goals. This
dilemma is defined as Infoglut: the mixing of quality information with irrelevant and
time-wasting data. “It's not so much about technology as it is about deciding which
information is of value and to whom, and then configuring the technology accordingly.
Infoglut causes an increase in the time one must spend sorting through
extraneous information to find what one needs. Employees may spend enormous
amounts of time searching through, deleting, or organizing emails, a mundane task made
2009). While much of the information in e-mails may be non-critical, typically all e-
mails must be processed and answered, so this introduces a heavy time-load for going
through information that may not be critical or important to the employee’s job. Often,
humans have neither the time nor the capability to read and process all of the information
that is available, so it is necessary to have a system that will reduce the amount of
information that needs to be processed. In order not to lose any data and to understand the
system by which data and information can be filtered and irrelevant data can be held
aside.
It is necessary for this system to have an intelligent algorithm that will ensure that
any and all relevant data makes it to the user, and irrelevant information remains unseen.
The definition of relevant depends on the user, and the solution is to gather data and
decide what the particular user believes to be relevant (each algorithm must be
implemented and held accountable for one user at a time, as all data the user seeks is
relative). The system will have the capabilities to learn about the user and make future
decisions based on the user’s preferences. Over time, it will be able to hone its definition
of relevant and “solicited” information, resulting in a highly personal and focused set of
Google and other search engines attempted to improve the relevance of the data
a user obtains for a given query when they implemented
proven to be helpful, it is often more important how "good" a user is at searching. Often
the user will need to be an intelligent and knowledgeable searcher if they seek
information from respected sources and not from those who paid for a top spot or have,
over the course of time, been able to generate more clicks, leading to a higher spot in the
search results.
In comparison, RSS feeds allow users who are not sophisticated in terms of
web searching to swiftly access data from reliable sources. RSS is a step up from
Search Engine Optimization because it sorts the data you seek, the data is from sources
you have inquired about, and it eliminates the hassle of pop-ups and frustrating
process to keep up with the numerous online postings when there may be 30 or more
postings in a given day from each feed you chose to subscribe to.
We believe Neural Networks and Genetic Algorithms will aid in the processing
and sorting of data. “Neural networks come with panoply of various learning algorithms,
both supervised and unsupervised, that help implement a nonlinear mapping between
input and output variables” (Pedrycz 2002). A system that is designed to sift through the
massive amounts of data that are put on the Internet and into the world every minute must
be automated, to an extent, to learn what the user does and does not want to see. The
information in order to display or hide the latest information. Based on the volume of
intelligence will aid in the labor-intensive process of making decisions about data. The
system must learn as it runs in order to develop a figurative sense of what decisions it
should make without the engineer (or user) needing to specify and spell out each case,
are...applied to the solutions [in order to] generate a new set of solutions for the next
iteration called a generation" (Fang et al., 2008). Through the use of tree structures and
more accurate generation (or population) while continuously learning from previous
generations and the user as time (and user interests) move forward. Through the use of
building blocks and arrays of strings the genetic algorithm is able to grab the "fittest"
individuals within a generation of data and reproduce the information in a more compact
a more accurate representation of the data sought by the user. Genetic algorithms have a
incremental learning, the system will continuously learn from its encounters with the user
and, through arrays of strings, the continuous data flow will be represented in trees with
the most important information, relatively speaking in terms of each individual user,
pushed to the top, while the rest of information trickles down the hierarchy.
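To make the generational idea concrete, the following is a minimal sketch of a genetic algorithm over keyword sets. It is an illustration under stated assumptions, not the design of the system described here: the vocabulary, the set of "liked" keywords, the fitness function, and every name in the code are hypothetical. Each individual is a bit string that marks which keywords are selected, and the fittest half of each generation survives and reproduces.

```python
import random

# Hypothetical vocabulary and user feedback, invented for illustration only.
VOCAB = ["software", "engineering", "sports", "recipes",
         "python", "travel", "news", "music"]
LIKED = {"software", "engineering", "python"}  # keywords the user engaged with


def fitness(individual):
    """Reward selected keywords the user liked and penalize the others."""
    return sum(1 if VOCAB[i] in LIKED else -1
               for i, bit in enumerate(individual) if bit)


def crossover(a, b, rng):
    """Single-point crossover of two bit strings."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(individual, rng, rate=0.1):
    """Flip each bit with a small probability."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in individual]


def evolve(generations=30, size=20, seed=0):
    """Return the best individual and the best fitness seen per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in VOCAB] for _ in range(size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: size // 2]  # the fittest half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness), history
```

Because the parents are carried over unchanged, the best fitness in `history` never decreases from one generation to the next, which mirrors the idea of each generation learning from the previous one.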
This section denotes the outcomes acquired through the creation of an intelligent
embedded system, which classifies, analyzes, and prioritizes content under a scalable
framework while addressing the challenges of extreme computing to efficiently target the
Our proposed system functions as a portal connecting all of the service providers
of any given user including email, news, and social media, each of which is contained
within a module on the main screen. The portal functions as an embedded system
because all information generated is stored on a server in a cloud and accessed through
any browser. Along with being a platform-independent embedded system, it is scalable,
allowing for its functionality to be accessed from any computer or mobile device via
For the purposes of the remaining sections, “keyword” and “filter”, including all their
respective derivations, are redefined as follows. “Keyword” represents the result
acquired through the semantic analysis of any given piece of content. That result
is, in its totality, a simplified meaning of the analyzed content and NOT a stand-alone
word. “Filter” is the process by which new content is compared to previously analyzed
learning, the system develops an understanding of each user and which data will appeal
to him or her. Each module will have its own hierarchy of most popular keywords, which
particular module. The modules will analyze keywords against each other in order to
classify and prioritize the relevant and desired information and, as the user interacts with
the filtered data, the system will develop an understanding of user preferences based on
In the field of extreme computing, programs often run slowly, occupy large amounts of
space, and consume an immense amount of energy; this
system will limit the space used, optimize speed, and make more efficient use of
energy. The biggest load the system will experience is the analysis of new content against
previously classified content. In order to decrease the energy exerted by the system and
increase performance, the limited number of keywords for each module will allow for
constructed through user behavior, relative to content. Although not specifically cited,
our system uses Google searches entered by the user to complement hierarchies
analyzed and classified when the threshold of frequency and duration spent on the search
Unlike a generic history display, this “Recently Viewed Pages” section keeps
track of the pages where a user finds information and that he or she uses on a regular basis. To
save and display the recently viewed sites, the system records the time that a user spends
on a given webpage as well as the number of related links clicked and the comparison of
the site to the keywords of its linked modules. With this data, the system decides whether a
given page is important and relevant enough to the user to display. For example, if a user
will move up on the display module. However, if, during the fifteen minutes on
PageX.com, the user clicks on several related links and spends time on those pages,
PageX.com will move up in the hierarchy. If a user spends ten minutes on PageZ.com
several times in a day, PageZ.com will move up higher than pages, such as PageX.com,
which were visited only once for a longer period of time. In addition, the titles and
keywords on a given page are compared to the other keywords of its linked modules and,
if there are more matches, the page will be given a higher priority.
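The ranking signals described above (time on page, repeat visits, related links clicked, and keyword matches with linked modules) can be combined into a single score. The sketch below is one possible way to do so; the weights, the `PageStats` type, and the function names are assumptions made for illustration and are not taken from the system itself.

```python
from dataclasses import dataclass, field


@dataclass
class PageStats:
    url: str
    visits: int = 0            # separate visits in the day
    minutes: float = 0.0       # total minutes spent across all visits
    related_clicks: int = 0    # related links followed from the page
    page_keywords: set = field(default_factory=set)


def page_score(page, module_keywords,
               w_visit=10.0, w_minute=1.0, w_click=3.0, w_match=2.0):
    """Weighted sum of the four signals; the weights are hypothetical."""
    matches = len(page.page_keywords & module_keywords)
    return (w_visit * page.visits + w_minute * page.minutes
            + w_click * page.related_clicks + w_match * matches)


def rank_pages(pages, module_keywords):
    """Return pages ordered from highest to lowest score."""
    return sorted(pages, key=lambda p: page_score(p, module_keywords),
                  reverse=True)
```

With these weights, a page visited three times for ten minutes each outranks a page visited once for fifteen minutes, and extra related-link clicks or keyword matches raise a page further.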
Social Media:
Facebook, Twitter, and other social media services are also available in the
portal. These social media sites can be linked to either personal or work emails as well as
to blogs, news sites, and Google searches. Based on the keyword hierarchies of the
linked modules, postings that have a certain threshold of keyword matches will be
displayed to the user. In addition, if the user interacts with a Facebook friend twenty
times in a workday, the posts of that friend will get priority in the display module over
those that only meet the threshold of keyword matches. For example, if a user often
searches for “software engineering” in Google and on Twitter, and person A posts a blurb
about his latest software engineering endeavors, it will move up in the hierarchy because
of the keywords. However, if the user re-tweets the posts of person B five times in one
day, person B’s post will move ahead of person A’s in the hierarchy and be displayed to the user.
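The two rules in this section (a keyword-match threshold for display, and priority for people the user interacts with often) can be sketched as follows. The thresholds, the function name, and the choice to show posts from frequent contacts even without a keyword match are assumptions for illustration only.

```python
def rank_posts(posts, keywords, interactions,
               match_threshold=1, interaction_threshold=5):
    """Order posts for display.

    posts: list of (author, text) pairs.
    keywords: set of lowercase keywords from the linked modules.
    interactions: dict mapping author to today's interaction count.
    """
    shown = []
    for author, text in posts:
        matches = len(set(text.lower().split()) & keywords)
        frequent = interactions.get(author, 0) >= interaction_threshold
        if frequent or matches >= match_threshold:
            shown.append((frequent, matches, author, text))
    # Frequent contacts come first, then posts with more keyword matches.
    shown.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [(author, text) for _, _, author, text in shown]
```

In the example from the text, person A's post about software engineering passes the keyword threshold, but person B's post is listed first once the user has re-tweeted person B five times in the day.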
News Feeds:
The portal will contain a subsection designated for News, which will include three portions: BLOGS, SOCIAL, and NEWS.
The BLOGS portion will consist of blogs the user currently has, blogs the user
follows, and, as time progresses, blogs the user may be interested in following. Personal
blogs will fall under a separate hierarchy; the hierarchy will update based on blogs most
frequented. Blogs the user follows will fall under a hierarchy of their own; the hierarchy
will be based on the frequency of the user’s visits to each blog, the time spent on each
blog, and the frequency of updates made to each blog. For example: the user follows
yourblog.com, hisblog.com, and herblog.com. On a given day the user spends five, three,
and ten minutes on each blog, respectively; however, the user frequents each blog four,
two, and one time per day, respectively. Based on this, yourblog.com would be on the top
of the hierarchy, followed by herblog.com and hisblog.com; each blog would then
alternate in the hierarchy based on which blog is most up-to-date. Blogs that might be of
interest to the user will be placed in the hierarchy based on keywords within the blogs.
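The blog example above can be checked with a short calculation. The text does not state an exact formula, so the sketch below assumes that a blog's rank is driven by minutes per visit multiplied by visits per day; that assumption reproduces the ordering given in the example. Update frequency, which the text uses to alternate blogs afterwards, is not modeled here.

```python
def rank_blogs(stats):
    """stats maps blog name to (minutes_per_visit, visits_per_day)."""
    return sorted(stats,
                  key=lambda blog: stats[blog][0] * stats[blog][1],
                  reverse=True)


stats = {
    "yourblog.com": (5, 4),   # 20 minutes per day in total
    "hisblog.com": (3, 2),    # 6 minutes per day in total
    "herblog.com": (10, 1),   # 10 minutes per day in total
}
print(rank_blogs(stats))
# ['yourblog.com', 'herblog.com', 'hisblog.com']
```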
The SOCIAL portion will consist of sites the user frequents and of which he or
she is an active member. The hierarchy will run the same way it does for blogs followed
by the user. Furthermore, the NEWS portion will consist of a section for sites frequented
by the user (or of which the user is a member), which will function under the algorithm in
place for blogs followed by the user, and sites the user may want to visit, which will
function the same as it does for blogs. NEWS is strictly limited to news sites such as
Email:
[Figure: email module diagram. Labels: Emails (IMAP); Test (Pass: recommended pile; Fail: glut pile); keyword heaps; User actions (Browsing, Manipulating, Sending); Submit changes (SMTP and IMAP).]
Our website portal also adds email functionality. This component is implemented
by means of a delivery and filtering system. There will be an interface with two profiles,
personal and work. Each profile will be associated with the user’s email accounts
(the user may have more than one personal or work email account). An email account
can be linked with another module (social networking, news, internet activity/searches)
and the module’s respective keyword heap. The heaps linked to a specific email account
The emails may be pushed to the portal using the IMAP protocol, so that emails
stay on the user’s mail server and can be manipulated without saving a copy outside of
the email server. As the emails arrive, they are filtered against the keyword heaps. Each
the body of the email matches a certain number of keywords (determined by the
If it passes either test, it will be delivered into the preferred pile. If it fails both
tests, it will be delivered to a glut pile. These piles will manifest themselves as
expandable folders or headings, with the preferred section being darker and more visible
to the user. It is important to note that in order to protect data from imperfect filtering,
the filter must never delete unimportant emails on delivery, but rather allow the user to
move the emails to the preferred pile if he or she feels that the filter failed to detect an
important email. The user may browse and manipulate emails from the component on
the website. All changes are submitted to the mail server regularly, just like with any
mail client.
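The delivery rule described above (pass either test to reach the preferred pile, otherwise go to the glut pile, and never delete) can be sketched as follows. Only the body-keyword test is described in the surviving text; the subject-line test used here as the second test, the thresholds, and all names are hypothetical. The sketch works on in-memory messages and does not talk to a mail server.

```python
from dataclasses import dataclass


@dataclass
class Email:
    sender: str
    subject: str
    body: str


def count_matches(text, keywords):
    """Number of distinct keywords that appear as words in the text."""
    return len(set(text.lower().split()) & keywords)


def sort_email(email, keywords, body_threshold=2, subject_threshold=1):
    """Return 'preferred' if either test passes, otherwise 'glut'."""
    body_ok = count_matches(email.body, keywords) >= body_threshold
    # Hypothetical second test: the source does not say what it is.
    subject_ok = count_matches(email.subject, keywords) >= subject_threshold
    return "preferred" if body_ok or subject_ok else "glut"


def deliver(emails, keywords):
    """Place every email in exactly one pile; nothing is ever deleted."""
    piles = {"preferred": [], "glut": []}
    for email in emails:
        piles[sort_email(email, keywords)].append(email)
    return piles


def rescue(piles, email):
    """Let the user move a misfiled email from the glut pile to preferred."""
    piles["glut"].remove(email)
    piles["preferred"].append(email)
```

Keeping the glut pile as a real folder, rather than discarding its contents, is what lets `rescue` correct the filter when it misses an important email.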
The following section describes what a user will experience when first interacting with
To start, the user connects all accounts to the system and can then choose to
modify a set of basic settings. The standard settings include links from his
or her work email accounts to the news module, from personal email to the social media
module, and from Google searches to news articles. Base keywords can be added by the user
so that the system takes less time to learn; however, the system can begin learning by
picking out commonly used words from the modules, excluding simple connective words
such as “the,” “is,” “of,” etc., and then proceeding to refine the filters as the user interacts
with the system. With all of the information the system acquires and continuously learns,
the user has the ability to veto, change, and add his or her own keywords and preferences.
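A minimal sketch of this bootstrapping step is shown below: count word frequencies across a module's content, skip simple connective words, and then apply the user's additions and vetoes. The stop-word list, the function name, and the parameters are assumptions chosen for illustration.

```python
import re
from collections import Counter

# A small, hypothetical stop-word list; a real one would be longer.
STOP_WORDS = {"the", "is", "of", "a", "an", "and", "to", "in", "for", "on"}


def bootstrap_keywords(documents, top_n=5, user_keywords=(), vetoed=()):
    """Return user-supplied keywords followed by the most common learned words."""
    counts = Counter()
    for doc in documents:
        for word in re.findall(r"[a-z']+", doc.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    for word in vetoed:              # the user can veto learned keywords
        counts.pop(word, None)
    learned = [word for word, _ in counts.most_common(top_n)]
    user = list(user_keywords)       # user-supplied keywords come first
    return user + [word for word in learned if word not in user]
```

Base keywords supplied by the user are placed ahead of learned ones, so the filter has something to work with before enough content has been seen.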
[Figure: Settings]
For example, when a user receives an email, it must pass through a filter of linked
keyword hierarchies. If the email is linked to Facebook, then the words in the email will
the email module. If it contains a certain number of the keywords, it can be sent through
to the user. Upon reaching the user, the system monitors the actions of the user and will
confirm or deny the correct placement of the keyword on the hierarchy. If the user
immediately deletes the email and the keyword was near the top, the system will bump it
down to a lower priority. Likewise, if a keyword was in the middle of the hierarchy and a
user spends a lot of time on the particular piece of information or replies to the email,
the system will move it up to a higher priority.
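The feedback loop in this example can be sketched as a small reordering function. The hierarchy is modeled here as a plain list with the highest-priority keyword first; the action names, the one-position step, and the function name are hypothetical and serve only to illustrate the idea.

```python
def apply_feedback(hierarchy, keyword, action, step=1):
    """Return a new hierarchy with the keyword moved according to the action.

    hierarchy: list of keywords, index 0 is the highest priority.
    action: 'deleted' demotes the keyword; 'read_long' or 'replied' promote it.
    """
    ranked = list(hierarchy)
    current = ranked.index(keyword)
    if action == "deleted":
        target = min(current + step, len(ranked) - 1)
    elif action in ("read_long", "replied"):
        target = max(current - step, 0)
    else:
        target = current             # unknown actions leave the order unchanged
    ranked.insert(target, ranked.pop(current))
    return ranked
```

For instance, `apply_feedback(["a", "b", "c"], "a", "deleted")` moves "a" one place down, while a reply to an email matched by "b" moves "b" one place up.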
References