CS232 Team Project Final: Infoglut

A research paper compiled by three students participating in Dr. Joseph's CS232: Computer Organization class in the Spring 2010 term.

Uploaded by

Seidenberg School of Computer Science and Information Systems


Running head: INFORMATION OVERLOAD A.K.A INFOGLUT

Information Overload a.k.a Infoglut

Julie L. Gill

Luisa M. Morales

Igor D. Pokryshevskiy

Abstract

The problem of infoglut is growing, and computer scientists and engineers need to find a

solution. Deciding which information is good and which is bad, or unnecessary, is a

tough and complex problem. Our intelligent embedded system is a web portal that

allows a user to access all of his or her various social media, email, news, and blogs in

one place from any computer. Using a system based on the concept of genetic

algorithms, the portal learns and continuously refines favorite keywords and relationships

between the user’s preferences and actions and displays only the most relevant

information.

The Internet has granted us entry into a repository and exchange of information.

The volume of information available is effectively immeasurable. The problem is that we

often cannot separate the information that may increase our knowledge and improve

our lives from that which has no value and leads us astray from productive goals. This

dilemma is defined as Infoglut, the mixing of quality information with irrelevant and

time-wasting data. “It's not so much about technology as it is about deciding which

information is of value and to whom, and then configuring the technology accordingly.

It's bringing a human dimension back to an inhumane consequence of information

technology” (Denning, 2006).

Infoglut causes an increase in the time one must spend sorting through

extraneous information to find what one needs. Employees may spend enormous

amounts of time searching through, deleting, or organizing emails, a mundane task made

complicated by a very high volume of data. According to an article in IEEE Spectrum,

“Information workers typically receive 50 to 200 work-related e-mails daily” (Zeldez,

2009). While much of the information in e-mails may be non-critical, typically all

e-mails must be processed and answered, which imposes a heavy time cost for going

through information that may not be critical or important to the employee’s job. Often,

humans have neither the time nor the capability to read and process all of the information

that is available, so it is necessary to have a system that will reduce the amount of

information that needs to be processed. In order not to lose any data and to understand the

important data, it is necessary to employ the help of an intelligent embedded software

system by which data and information can be filtered and irrelevant data can be held

aside.

It is necessary for this system to have an intelligent algorithm that will ensure that

any and all relevant data makes it to the user, and irrelevant information remains unseen.

The definition of relevant depends on the user, and the solution is to gather data and

decide what the particular user believes to be relevant (each algorithm must be

implemented and held accountable for one user at a time, as all data the user seeks is

relative). The system will have the capabilities to learn about the user and make future

decisions based on the user’s preferences. Over time, it will be able to hone its definition

of relevant and “solicited” information, resulting in a highly personal and focused set of

information for any one user.

An attempt to improve the relevance of the data a user retrieves was made

by Google and other search engines through result-ranking algorithms, which in turn

gave rise to Search Engine Optimization. Although the implementation of such optimization has

proven to be helpful, it is often more important how "good" a user is at searching. Often

the user will need to be an intelligent and knowledgeable searcher if they seek

information from respected sources and not from those who paid for a top spot or have,

over the course of time, been able to generate more clicks, leading to a higher spot in the

search results.

In comparison, RSS feeds allow users who are not sophisticated web searchers

to swiftly access data from reliable sources. RSS feeds are a step up from

Search Engine Optimization because they sort the data you seek, the data come from sources

you have inquired about, and they eliminate the hassle of pop-ups and frustrating

advertisements run through search engines. However, it is a difficult and overwhelming

process to keep up with the numerous online postings when there may be 30 or more

postings in a given day from each feed you chose to subscribe to.

We believe Neural Networks and Genetic Algorithms will aid in the processing

and sorting of data. “Neural networks come with panoply of various learning algorithms,

both supervised and unsupervised, that help implement a nonlinear mapping between

input and output variables” (Pedrycz, 2002). A system that is designed to sift through the

massive amounts of data that are put on the Internet and into the world every minute must

be automated, to an extent, to learn what the user does and does not want to see. The

system will be able to make decisions based on previous preferences of viewing

information in order to display or hide the latest information. Given the volume of

data that must be processed, in addition to computational intelligence, artificial

intelligence will aid in the labor-intensive process of making decisions about data. The

system must learn as it runs in order to develop a figurative sense of what decisions it

should make without the engineer (or user) needing to specify and spell out each case,

leading to time efficiency and improved functionality and reliability.

Genetic Algorithms randomly generate "an initial set of solutions called a

population. A set of genetic operators such as selection, crossover, and mutation,

are...applied to the solutions [in order to] generate a new set of solutions for the next

iteration called a generation" (Fang et al., 2008). Through the use of tree structures and

population-based incremental learning, the genetic algorithm is enabled to produce a

more accurate generation (or population) while continuously learning from previous

generations and the user as time (and user interests) move forward. Through the use of

building blocks and arrays of strings the genetic algorithm is able to grab the "fittest"

individuals within a generation of data and reproduce the information in a more compact

and direct manner. Population-based incremental learning leads to greater efficiency and

a more accurate representation of the data sought by the user. Genetic algorithms have a

tendency to lack a condensed way of sorting data, but through population-based

incremental learning, the system will continuously learn from its encounters with the user

and, through arrays of strings, the continuous data flow will be represented in trees with

the most important information, relatively speaking in terms of each individual user,

pushed to the top, while the rest of the information trickles down the hierarchy.
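The selection, crossover, and mutation cycle quoted above can be illustrated with a minimal sketch. It is not the portal's actual algorithm: the bit-string encoding (one bit per candidate keyword), the tournament selection, the parameter values, and the `interest` scores are all assumptions made for illustration.

```python
import random

def evolve(fitness, genome_len, pop_size=20, generations=40,
           mutation_rate=0.05, seed=0):
    """Tiny generational GA over bit lists; returns the fittest genome seen."""
    rng = random.Random(seed)
    # Randomly generated initial population of candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def select(pool):
        # Tournament selection: keep the fitter of two random individuals.
        a, b = rng.sample(pool, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        next_gen = []
        while len(next_gen) < pop_size:
            p1, p2 = select(population), select(population)
            cut = rng.randrange(1, genome_len)       # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]               # bit-flip mutation
            next_gen.append(child)
        population = next_gen                        # the next generation
        best = max(population + [best], key=fitness)
    return best

# Hypothetical interest scores: bit i set means keyword i is displayed.
interest = [3, -1, 2, -2, 1]

def fitness(genome):
    return sum(score for bit, score in zip(genome, interest) if bit)

best = evolve(fitness, len(interest), seed=1)
```

Passing a fixed `seed` makes a run repeatable, which helps when comparing parameter choices.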

Our System Design (iWingNow Portal):

This section describes the design of an intelligent

embedded system that classifies, analyzes, and prioritizes content under a scalable

framework while addressing the challenges of extreme computing, in order to counter the

hampering effects of infoglut.

Our proposed system functions as a portal connecting all of the service providers

of any given user including email, news, and social media, each of which is contained

within a module on the main screen. The portal functions as an embedded system

because all information generated is stored on a server in a cloud and accessed through

any browser. Along with being a platform-independent embedded system, it is scalable,

allowing its functionality to be accessed from any computer or mobile device via

mobile applications developed for it.

For the purposes of the remaining sections, “keyword” and “filter”, including all their

respective derivations, are defined as follows: “Keyword” will represent the result

acquired through the semantic analysis of any given piece of content. The result acquired

is, in its totality, a simplified meaning of the analyzed content and NOT a stand-alone

word. “Filter” is the process by which new content is compared to previously analyzed

content in order to maintain a continuous and real-time hierarchy of displayed

information, which is then pushed to each individual end user.

Using the framework of genetic algorithms and population-based incremental

learning, the system develops an understanding of each user and which data will appeal

to him or her. Each module will have its own hierarchy of most popular keywords, which

can be moved up or down in the tree-structure depending on factors relevant to the

particular module. The modules will analyze keywords against each other in order to

classify and prioritize the relevant and desired information and, as the user interacts with

the filtered data, the system will develop an understanding of user preferences based on

behavior, resulting in the generation or updating of a module’s hierarchy.
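A per-module keyword hierarchy of this kind could be kept as a bounded table of scores. The sketch below is illustrative only: the class name `KeywordHierarchy`, the numeric scores, and the rule of dropping the weakest keyword when the limit is exceeded are assumptions, not details given in this paper.

```python
class KeywordHierarchy:
    """Bounded keyword scores for one module (hypothetical structure)."""

    def __init__(self, max_keywords=50):
        self.max_keywords = max_keywords
        self.scores = {}

    def adjust(self, keyword, delta):
        # Move a keyword up (positive delta) or down (negative delta).
        self.scores[keyword] = self.scores.get(keyword, 0.0) + delta
        if len(self.scores) > self.max_keywords:
            # Keep the hierarchy small by dropping the weakest keyword.
            weakest = min(self.scores, key=self.scores.get)
            del self.scores[weakest]

    def ranked(self):
        # Highest score first; ties broken alphabetically.
        return sorted(self.scores, key=lambda k: (-self.scores[k], k))

hierarchy = KeywordHierarchy(max_keywords=2)
hierarchy.adjust("python", 3)
hierarchy.adjust("golf", 1)
hierarchy.adjust("compilers", 2)   # "golf" is dropped as the weakest entry
first_ranking = hierarchy.ranked()
hierarchy.adjust("compilers", 2)   # "compilers" overtakes "python"
second_ranking = hierarchy.ranked()
```

Capping the number of keywords per module keeps each comparison against new content cheap.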

In the field of extreme computing, programs are constrained by speed, storage

space, and the immense amount of energy they consume in order to run; this

system will limit the space used, optimize speed, and give way for a more efficient use of

energy. The biggest load the system will experience is the analysis of new content against

previously classified content. In order to decrease the energy exerted by the system and

increase performance, the number of keywords for each module will be limited, allowing for

efficient classification processes.

The Algorithms Behind Each Module:

This section contains a detailed description of the learning module (algorithm)

constructed through user behavior, relative to content. Although not specifically cited,

our system uses Google searches entered by the user to complement hierarchies

generated in an effort to maintain real-time, user-specific content. Content is

analyzed and classified only when the frequency of a search and the time spent on it

meet the threshold for generating a new keyword.

For example, if a user spends fifteen minutes on PageX.com and twenty minutes on PageY.com, PageY.com

will move up on the display module. However, if, during the fifteen minutes on

PageX.com, the user clicks on several related links and spends time on those pages,

PageX.com will move up in the hierarchy. If a user spends ten minutes on PageZ.com

several times in a day, PageZ.com will move up higher than PageX.com and PageY.com,

which were visited only once for a longer period of time. In addition, the titles and

keywords on a given page are compared to the other keywords of its linked modules and,

if there are more matches, the page will be given a higher priority.
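One scoring rule that reproduces the page examples in this section is to total the minutes spent on a site across all visits, counting time spent on related links reached from it. This is an assumed rule chosen to match the examples, not a formula stated in this paper; the function name and tuple layout are hypothetical, and the keyword-match bonus is left out.

```python
from collections import defaultdict

def rank_pages(visits):
    """Rank sites by total minutes; visits are (site, minutes, related_minutes)."""
    totals = defaultdict(float)
    for site, minutes, related_minutes in visits:
        # Time on related links counts toward the page that led to them.
        totals[site] += minutes + related_minutes
    return sorted(totals, key=lambda site: (-totals[site], site))

# Fifteen minutes on PageX.com versus twenty on PageY.com.
single_visits = rank_pages([("PageX.com", 15, 0), ("PageY.com", 20, 0)])

# Same visits, plus ten minutes on links reached from PageX.com.
with_related = rank_pages([("PageX.com", 15, 10), ("PageY.com", 20, 0)])

# Three ten-minute visits to PageZ.com outweigh one longer visit elsewhere.
repeated = rank_pages([("PageX.com", 15, 0), ("PageY.com", 20, 0),
                       ("PageZ.com", 10, 0), ("PageZ.com", 10, 0),
                       ("PageZ.com", 10, 0)])
```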

Social Media:

Facebook, Twitter, and other social media services are also available in the

portal. These social media sites can be linked to either personal or work emails as well as

to blogs, news sites, and Google searches. Based on the keyword hierarchies of the

linked modules, postings that have a certain threshold of keyword matches will be

displayed to the user. In addition, if the user interacts with a Facebook friend twenty

times in a workday, the posts of that friend will get priority in the display module over

those that only meet the threshold of keyword matches. For example, if a user often

searches for “software engineering” in Google and on Twitter, and person A posts a blurb

about his latest software engineering endeavors, it will move up in the hierarchy because

of the keywords. However, if the user re-tweets the posts of person B five times in one

day, person B’s post will move ahead of person A’s in the hierarchy and be displayed to

the user first.
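The two rules for social media posts can be sketched as follows. Several points are assumptions for illustration: keywords are treated as single lowercase words, a frequently contacted author's posts are shown even without keyword matches, and the threshold values and the `rank_posts` name are hypothetical.

```python
def rank_posts(posts, keywords, interactions,
               match_threshold=1, priority_interactions=5):
    """Return displayable (author, text) posts, frequent contacts first."""
    ranked = []
    for author, text in posts:
        matches = len(set(text.lower().split()) & keywords)
        frequent = interactions.get(author, 0) >= priority_interactions
        if frequent or matches >= match_threshold:
            # Sort key: frequent contacts first, then more keyword matches.
            ranked.append((0 if frequent else 1, -matches, author, text))
    ranked.sort(key=lambda item: item[:2])   # stable sort keeps arrival order
    return [(author, text) for _, _, author, text in ranked]

posts = [
    ("A", "my latest software engineering project"),
    ("B", "lunch was great"),
    ("C", "weather today"),
]
keywords = {"software", "engineering"}
interactions = {"B": 5}                      # five re-tweets of person B today
displayed = rank_posts(posts, keywords, interactions)
```

Here person B's post is listed before person A's, and person C's post is withheld because it meets neither rule.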

News Feeds:

The portal will contain a section designated for News, which will include three

subsections: BLOGS, SOCIAL, and NEWS.

The BLOGS portion will consist of blogs the user currently has, blogs the user

follows, and, as time progresses, blogs the user may be interested in following. Personal

blogs will fall under a separate hierarchy; the hierarchy will update based on blogs most

frequented. Blogs the user follows will fall under a hierarchy of their own; the hierarchy

will be based on the frequency of the user’s visits to each blog, the time spent on each

blog, and the frequency of updates made to each blog. For example, suppose the user follows

yourblog.com, hisblog.com, and herblog.com. On a given day the user spends five, three,

and ten minutes on each blog, respectively; however, the user visits each blog four,

two, and one times per day, respectively. Based on this, yourblog.com would be at the top

of the hierarchy, followed by herblog.com and hisblog.com; each blog would then

alternate in the hierarchy based on which blog is most up-to-date. Blogs that might be of

interest to the user will be placed in the hierarchy based on keywords within the blogs

currently being viewed and from Google searches.
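The blog example can be reproduced if the minutes are read as time per visit and multiplied by the number of visits per day (20, 6, and 10 minutes for yourblog.com, hisblog.com, and herblog.com). That reading is an assumption: ranking by minutes alone or by visits alone would not give the stated order. The `hours_since_update` values and the function name are hypothetical; recency is used here only to break ties.

```python
def rank_blogs(stats):
    """stats maps blog -> (minutes_per_visit, visits_per_day, hours_since_update)."""
    def sort_key(blog):
        minutes, visits, hours_since_update = stats[blog]
        # More total minutes first; fresher blogs win ties.
        return (-(minutes * visits), hours_since_update, blog)
    return sorted(stats, key=sort_key)

blog_stats = {
    "yourblog.com": (5, 4, 6),    # 5 min x 4 visits = 20
    "hisblog.com": (3, 2, 1),     # 3 min x 2 visits = 6
    "herblog.com": (10, 1, 3),    # 10 min x 1 visit = 10
}
blog_order = rank_blogs(blog_stats)
```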

The SOCIAL portion will consist of sites the user frequents and of which he or

she is an active member. The hierarchy will run the same way it does for blogs followed

by the user. Furthermore, the NEWS portion will consist of a section for sites frequented

by the user (or of which the user is a member), which will function under the algorithm in

place for blogs followed by the user, and a section for sites the user may want to visit, which will

function under the algorithm in place for suggested blogs. NEWS is strictly limited to news sites such as

CNN.com, Economist.com, wsj.com, BBC.com, and the like.

Email:

[Figure: email module flow. Labels: Emails (IMAP); Test against keyword heaps; Pass: recommended pile; Fail: glut pile; User actions (browsing, manipulating, sending); Submit changes (SMTP and IMAP).]

Our website portal also adds email functionality. This component is implemented

by means of a delivery and filtering system. There will be an interface with two profiles,

personal and work. Each profile will be associated with the user’s email accounts

(the user may have more than one personal or work email account). An email account

can be linked with another module (social networking, news, internet activity/searches)

and the module’s respective keyword heap. The heaps linked to a specific email account

are used as a basis for filtering the emails.

The emails may be pushed to the portal using the IMAP protocol, so that emails

stay on the user’s mail server and can be manipulated without saving a copy outside of

the email server. As the emails arrive, they are filtered against the keyword heaps. Each

email must pass one of two tests:

• the subject line matches with at least one keyword

• the body of the email matches a certain number of keywords (determined by the

changeable threshold variable)

If it passes either test, it will be delivered into the preferred pile. If it fails both

tests, it will be delivered to a glut pile. These piles will manifest themselves as

expandable folders or headings, with the preferred section being darker and more visible

to the user. It is important to note that in order to protect data from imperfect filtering,

the filter must never delete unimportant emails on delivery, but rather allow the user to

move the emails to the preferred pile if he or she feels that the filter failed to detect an

important email. The user may browse and manipulate emails from the component on

the website. All changes are submitted to the mail server regularly, just like with any

mail client.
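The two delivery tests and the two piles can be sketched as a single function. The tokenizer, the default threshold of three, and the name `classify_email` are assumptions for illustration; keywords are treated as single lowercase words, which is simpler than the semantic keywords defined earlier. Nothing is deleted: every email lands in one pile or the other.

```python
import re

def tokenize(text):
    # Lowercase words and numbers; punctuation is ignored.
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def classify_email(subject, body, keyword_heap, body_threshold=3):
    """Return "preferred" if either test passes, otherwise "glut"."""
    keywords = {keyword.lower() for keyword in keyword_heap}
    # Test 1: the subject line matches at least one keyword.
    if tokenize(subject) & keywords:
        return "preferred"
    # Test 2: the body matches at least body_threshold distinct keywords.
    if len(tokenize(body) & keywords) >= body_threshold:
        return "preferred"
    return "glut"

heap = {"budget", "deadline", "compiler"}
by_subject = classify_email("Budget review", "see attached", heap)
by_body = classify_email("Hello", "The compiler deadline and budget.", heap)
rejected = classify_email("Hello", "Win a free cruise!", heap)
```

Lowering `body_threshold` makes the filter more permissive, which may suit the early period before the heaps have been refined.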

End User Experience:

The following section describes what a user will experience when first interacting with

our system, followed by a detailed example based on the email module.

To start, the user connects all accounts to the system and can then choose to

modify a set of basic settings. The standard settings include links from his

or her work email accounts to the news module, from personal email to the social media

module, and from Google searches to news articles. Base keywords can be added by the user

so that the system takes less time to learn; however, the system can begin learning by

picking out commonly used words from the modules, excluding simple connective words

such as “the,” “is,” “of,” etc., and then proceeding to refine the filters as the user interacts

with the system. With all of the information the system acquires and continuously learns,

the user has the ability to veto, change, and add his or her own keywords and preferences.
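The cold-start step, picking out commonly used words while skipping connective words, might look like the sketch below. The stop-word list is a small illustrative sample and `seed_keywords` is a hypothetical name; a real deployment would need a fuller list.

```python
import re
from collections import Counter

# Illustrative sample only; extends the paper's "the," "is," "of" examples.
STOP_WORDS = {"the", "is", "of", "a", "an", "and", "to", "in", "on", "for", "it"}

def seed_keywords(documents, top_n=5):
    """Return the top_n most frequent non-connective words across documents."""
    counts = Counter()
    for text in documents:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_n)]

documents = [
    "The compiler is fast",
    "The compiler and the linker",
    "Linker of the compiler",
]
initial_keywords = seed_keywords(documents, top_n=2)
```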

[Figure: portal mock-up. Labels: Manage links to this module; Work Email | Personal Email; Facebook | Twitter; Search the Internet; Settings.]

For example, when a user receives an email, it must pass through a filter of linked

keyword hierarchies. If the email is linked to Facebook, then the words in the email will

be compared to the structure of preferred keywords in the Facebook module as well as in

the email module. If it contains a certain number of the keywords, it can be sent through

to the user. Once the email reaches the user, the system monitors the user’s actions, which

confirm or deny the correct placement of the keyword on the hierarchy. If the user

immediately deletes the email and the keyword was near the top, the system will bump it

down to a lower priority. Likewise, if a keyword was in the middle of the hierarchy and a

user spends a lot of time on the particular piece of information, replies to the email, or

stars it, the keywords will be given a higher priority.
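The feedback loop in this example can be sketched as a table of action weights applied to the keywords that matched an email. The weights, the action names, and the function name are assumptions chosen for illustration; this paper does not specify how far a keyword moves.

```python
# Hypothetical weights: deleting lowers priority, engagement raises it.
ACTION_WEIGHTS = {"delete": -2.0, "reply": 2.0, "star": 1.0, "long_read": 1.0}

def apply_feedback(scores, email_keywords, action):
    """Adjust scores in place for the matched keywords; return the new ranking."""
    delta = ACTION_WEIGHTS.get(action, 0.0)
    for keyword in email_keywords:
        if keyword in scores:
            scores[keyword] += delta
    return sorted(scores, key=lambda k: (-scores[k], k))

keyword_scores = {"sale": 5.0, "thesis": 3.5, "seminar": 2.0}

# The user immediately deletes an email matched on "sale".
after_delete = apply_feedback(keyword_scores, ["sale"], "delete")

# The user replies to an email matched on "seminar".
after_reply = apply_feedback(keyword_scores, ["seminar"], "reply")
```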

References

Barstow, D. (1987). Artificial Intelligence and Software Engineering. Proceedings of the
9th International Conference on Software Engineering. 201-211. Accessed from
http://delivery.acm.org/10.1145/50000/41786/p200-barstow.pdf?key1=41786&key2=4189929621&coll=ACM&dl=ACM&CFID=83071988&CFTOKEN=93933442.
Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multi-objective
genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6:182–197.
Denning, P. J. (2006). The profession of IT: Infoglut. Communications of the ACM.
15-19. Accessed from
http://delivery.acm.org/10.1145/1140000/1139936/p15-denning.html?key1=1139936&key2=4889968621&coll=ACM&dl=ACM&CFID=80237519&CFTOKEN=33449546.
Fang, H., Wang, Q., Tu, Y. C., & Horstemeyer, M. (2008). An Efficient Non-dominated Sorting
Method for Evolutionary Algorithms. Massachusetts Institute of Technology. Accessed from
http://www.mitpressjournals.org/doi/abs/10.1162/evco.2008.16.3.355?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%3dncbi.nlm.nih.gov
Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning.
Boston, MA: Addison Wesley.
Lueg, C., & Sam M. (2007). Users Dealing with Spam and Spam Filters: Some
Observations and Recommendations. Proceedings of the 7th ACM SIGCHI New
Zealand Chapter's International Conference on Computer-Human Interaction:
Design Centered HCI. 67-72. Accessed from
http://delivery.acm.org/10.1145/1280000/1278970/p67-lueg.pdf?key1=1278970&key2=6460039621&coll=ACM&dl=ACM&CFID=83073213&CFTOKEN=20697899.
Pedrycz, W. (2002). Computational intelligence as an emerging paradigm of software
engineering. SEKE, 7-14. Accessed from
http://delivery.acm.org/10.1145/570000/568763/p7-pedrycz.pdf?key1=568763&key2=1660078621&coll=ACM&dl=ACM&CFID=80238591&CFTOKEN=93428931.
Stergiou, C., & Siganos, D. Neural Networks. Imperial College London. Accessed from
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html.
Zeldez, N. (2009). How to Beat Information Overload. IEEE Spectrum. Accessed from
http://spectrum.ieee.org/computing/it/how-to-beat-information-overload/0.