
1 Introduction

… the returned documents written in HTML and laying out the text and graphics on the user's computer screen on the client side. The operation of the Web relies on the structure of its hypertext documents. Hypertext allows Web page authors to link their documents to other related documents residing on computers anywhere in the world. To view these documents, one simply follows the links (called hyperlinks). The idea of hypertext was invented by Ted Nelson in 1965 [403], who also created the well-known hypertext system Xanadu (http://xanadu.com/). Hypertext that also allows other media (e.g., image, audio and video files) is called hypermedia.

1.2 A Brief History of the Web and the Internet

Creation of the Web: The Web was invented in 1989 by Tim Berners-Lee, who, at that time, worked at CERN (Conseil Européen pour la Recherche Nucléaire, or European Laboratory for Particle Physics) in Switzerland. He coined the term World Wide Web, wrote the first World Wide Web server, httpd, and the first client program (a browser and editor), WorldWideWeb. It began in March 1989 when Tim Berners-Lee submitted a proposal titled "Information Management: A Proposal" to his superiors at CERN. In the proposal, he discussed the disadvantages of hierarchical information organization and outlined the advantages of a hypertext-based system. The proposal called for a simple protocol that could request information stored in remote systems through networks, and for a scheme by which information could be exchanged in a common format and documents of individuals could be linked by hyperlinks to other documents. It also proposed methods for reading text and graphics using the display technology at CERN at that time. The proposal essentially outlined a distributed hypertext system.

Mosaic and Netscape Browsers: The next significant event in the development of the Web was the arrival of Mosaic. In February of 1993, Marc Andreessen from the University of Illinois' NCSA (National Center for Supercomputing Applications) and his team released the first "Mosaic for X" graphical Web browser for UNIX. A few months later, different versions of Mosaic were released for the Macintosh and Windows operating systems. This was an important event. For the first time, a Web client, with a consistent and simple point-and-click graphical user interface, was implemented for the three most popular operating systems available at the time. It soon made big splashes outside the academic circle where it had begun. In mid-1994, Silicon Graphics founder Jim Clark collaborated with Marc Andreessen, and they founded the company Mosaic Communications (later renamed Netscape Communications). Within a few months, the Netscape browser was released to the public, which started the explosive growth of the Web. Internet Explorer from Microsoft entered the market in August 1995 and began to challenge Netscape. The creation of the World Wide Web by Tim Berners-Lee, followed by the release of the Mosaic browser, are often regarded as the two most significant contributing factors to the success and popularity of the Web.

Internet: The Web would not be possible without the Internet, which provides the communication network for the Web to function. The Internet started with the computer network ARPANET in the Cold War era. It was produced as the result of a project in the United States aimed at maintaining control over its missiles and bombers after a nuclear attack. It was supported by the Advanced Research Projects Agency (ARPA), which was part of the Department of Defense in the United States. The first ARPANET connections were made in 1969, and in 1972 it was demonstrated at the First International Conference on Computers and Communication, held in Washington D.C. At the conference, ARPA scientists linked computers together from 40 different locations.

In 1973, Vinton Cerf and Bob Kahn started to develop the protocol later to be called TCP/IP (Transmission Control Protocol/Internet Protocol). In the next year, they published the paper "Transmission Control Protocol," which marked the beginning of TCP/IP. This new protocol allowed diverse computer networks to interconnect and communicate with each other. In subsequent years, many networks were built, and many competing techniques and protocols were proposed and developed. However, ARPANET was still the backbone of the entire system. During this period, the network scene was chaotic. In 1982, TCP/IP was finally adopted, and the Internet, which is a connected set of networks using the TCP/IP protocol, was born.

Search Engines: With information being shared worldwide, there was a need for individuals to find information in an orderly and efficient manner. Thus began the development of search engines. The search system Excite was introduced in 1993 by six Stanford University students. EINet Galaxy was established in 1994 as part of the MCC Research Consortium at the University of Texas. Jerry Yang and David Filo created Yahoo! in 1994, which started out as a listing of their favorite Web sites and offered directory search. In subsequent years, many search systems emerged, e.g., Lycos, Infoseek, AltaVista, Inktomi, Ask Jeeves, Northern Light, etc. Google was launched in 1998 by Sergey Brin and Larry Page based on their research project at Stanford University. Microsoft started to commit to search in 2003 and launched the MSN search engine in spring 2005; it had used search engines from other companies before that. Yahoo! provided its own general search capability in 2004, after it purchased Inktomi in 2003.

W3C (The World Wide Web Consortium): W3C was formed in December 1994 by MIT and CERN as an international organization to lead the development of the Web. W3C's main objective was to promote standards for the evolution of the Web and interoperability between WWW products by producing specifications and reference software. The first International Conference on the World Wide Web (WWW) was also held in 1994 and has been a yearly event ever since. From 1995 to 2001, the growth of the Web boomed. Investors saw commercial opportunities and became involved. Numerous businesses started on the Web, which led to irrational developments. Finally, the bubble burst in 2001. However, the development of the Web did not stop; it has only become more rational since.

1.3 Web Data Mining

The rapid growth of the Web in the last decade makes it the largest publicly accessible data source in the world. The Web has many unique characteristics, which make mining useful information and knowledge a fascinating and challenging task. Let us review some of these characteristics.

1. The amount of data/information on the Web is huge and still growing. The coverage of the information is also very wide and diverse. One can find information on almost anything on the Web.

2. Data of all types exist on the Web, e.g., structured tables, semi-structured Web pages, unstructured texts, and multimedia files (images, audio, and video).

3. Information on the Web is heterogeneous. Due to the diverse authorship of Web pages, multiple pages may present the same or similar information using completely different words and/or formats. This makes integration of information from multiple pages a challenging problem.

4. A significant amount of information on the Web is linked. Hyperlinks exist among Web pages within a site and across different sites. Within a site, hyperlinks serve as information organization mechanisms. Across different sites, hyperlinks represent implicit conveyance of authority to the target pages. That is, pages that are linked (or pointed) to by many other pages are usually high-quality or authoritative pages, simply because many people trust them.

5. The information on the Web is noisy. The noise comes from two main sources. First, a typical Web page contains many pieces of information, e.g., the main content of the page, navigation links, advertisements, copyright notices, privacy policies, etc. For a particular application, only part of the information is useful; the rest is considered noise. To perform fine-grained Web information analysis and data mining, the noise should be removed. Second, because the Web does not have quality control of information, i.e., one can write almost anything that one likes, a large amount of information on the Web is of low quality, erroneous, or even misleading.

6. The Web is also about services. Most commercial Web sites allow people to perform useful operations at their sites, e.g., to purchase products, to pay bills, and to fill in forms.

7. The Web is dynamic. Information on the Web changes constantly. Keeping up with the change and monitoring the change are important issues for many applications.

8. The Web is a virtual society. The Web is not only about data, information and services, but also about interactions among people, organizations and automated systems. One can communicate with people anywhere in the world easily and instantly, and also express one's views on anything in Internet forums, blogs and review sites.

All these characteristics present both challenges and opportunities for mining and discovery of information and knowledge from the Web. In this book, we focus only on mining textual data. For mining of images, videos and audio, please refer to [143, 441].

To explore information mining on the Web, it is necessary to know data mining, which has been applied in many Web mining tasks. However, Web mining is not entirely an application of data mining. Due to the richness and diversity of information and other Web-specific characteristics discussed above, Web mining has developed many of its own algorithms.

1.3.1 What is Data Mining?

Data mining is also called knowledge discovery in databases (KDD). It is commonly defined as the process of discovering useful patterns or knowledge from data sources, e.g., databases, texts, images, the Web, etc. The patterns must be valid, potentially useful, and understandable. Data mining is a multi-disciplinary field involving machine learning, statistics, databases, artificial intelligence, information retrieval, and visualization.

There are many data mining tasks. Some of the common ones are supervised learning (or classification), unsupervised learning (or clustering), association rule mining, and sequential pattern mining. We will study all of them in this book.

A data mining application usually starts with an understanding of the application domain by data analysts (data miners), who then identify suitable data sources and the target data. With the data, data mining can be performed, which is usually carried out in three main steps:

Pre-processing: The raw data is usually not suitable for mining for various reasons. It may need to be cleaned in order to remove noise or abnormalities. The data may also be too large and/or involve many irrelevant attributes, which calls for data reduction through sampling and attribute selection. Details about data pre-processing can be found in any standard data mining textbook.

Data mining: The processed data is then fed to a data mining algorithm, which will produce patterns or knowledge.

Post-processing: In many applications, not all discovered patterns are useful. This step identifies the useful ones for the application. Various evaluation and visualization techniques are used to make the decision.

The whole process (also called the data mining process) is almost always iterative. It usually takes many rounds to achieve final satisfactory results, which are then incorporated into real-world operational tasks. Traditional data mining uses structured data stored in relational tables, spreadsheets, or flat files in tabular form. With the growth of the Web and text documents, Web mining and text mining are becoming increasingly important and popular. Web mining is the focus of this book.

1.3.2 What is Web Mining?

Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data. Although Web mining uses many data mining techniques, as mentioned above it is not purely an application of traditional data mining due to the heterogeneity and semi-structured or unstructured nature of Web data. Many new mining tasks and algorithms were invented in the past decade. Based on the primary kinds of data used in the mining process, Web mining tasks can be categorized into three types: Web structure mining, Web content mining and Web usage mining.

Web structure mining: Web structure mining discovers useful knowledge from hyperlinks (or links for short), which represent the structure of the Web. For example, from the links, we can discover important Web pages, which, incidentally, is a key technology used in search engines. We can also discover communities of users who share common interests. Traditional data mining does not perform such tasks because there is usually no link structure in a relational table.

Web content mining: Web content mining extracts or mines useful information or knowledge from Web page contents. For example, we can automatically classify and cluster Web pages according to their topics. These tasks are similar to those in traditional data mining. However, we can also discover patterns in Web pages to extract useful data such as descriptions of products, postings of forums, etc., for many purposes. Furthermore, we can mine customer reviews and forum postings to discover consumer sentiments. These are not traditional data mining tasks.

Web usage mining: Web usage mining refers to the discovery of user access patterns from Web usage logs, which record every click made by each user. Web usage mining applies many data mining algorithms. One of the key issues in Web usage mining is the pre-processing of clickstream data in usage logs in order to produce the right data for mining (see the sketch at the end of this subsection).

In this book, we will study all three types of mining. However, due to the richness and diversity of information on the Web, there are a large number of Web mining tasks. We will not be able to cover them all. We will focus only on some important tasks and their algorithms.

The Web mining process is similar to the data mining process. The difference is usually in the data collection. In traditional data mining, the data is often already collected and stored in a data warehouse. For Web mining, data collection can be a substantial task, especially for Web structure and content mining, which involves crawling a large number of target Web pages. We will devote a whole chapter to crawling. Once the data is collected, we go through the same three-step process: data pre-processing, Web data mining and post-processing. However, the techniques used for each step can be quite different from those used in traditional data mining.
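To make the clickstream pre-processing step more concrete, below is a minimal sketch (not taken from the book) that groups raw clicks into per-user sessions using a simple idle-timeout rule. The record layout, field names and the 30-minute threshold are assumptions chosen for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical minimal click record: (user_id, timestamp, url). A real server
# log would first need parsing, robot filtering and page-view identification.
def sessionize(clicks, gap=timedelta(minutes=30)):
    """Split each user's clicks into sessions whenever the idle time exceeds `gap`."""
    by_user = defaultdict(list)
    for user, ts, url in sorted(clicks, key=lambda c: (c[0], c[1])):
        by_user[user].append((ts, url))

    sessions = []
    for user, events in by_user.items():
        current = [events[0]]
        for prev, curr in zip(events, events[1:]):
            if curr[0] - prev[0] > gap:
                # Idle gap too long: close the current session and start a new one.
                sessions.append((user, [url for _, url in current]))
                current = []
            current.append(curr)
        sessions.append((user, [url for _, url in current]))
    return sessions

# Example: user "u1" produces two sessions because of the long idle gap.
log = [
    ("u1", datetime(2007, 1, 1, 10, 0), "/home"),
    ("u1", datetime(2007, 1, 1, 10, 5), "/products"),
    ("u1", datetime(2007, 1, 1, 14, 0), "/home"),
]
print(sessionize(log))  # [('u1', ['/home', '/products']), ('u1', ['/home'])]
```

The per-session URL sequences produced this way are the kind of input that usage-mining and sequential pattern algorithms actually consume.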

2 Association Rules and Sequential Patterns


Association rules are an important class of regularities in data. Mining of association rules is a fundamental data mining task. It is perhaps the most important model invented and extensively studied by the database and data mining community. Its objective is to find all co-occurrence relationships, called associations, among data items. Since it was first introduced in 1993 by Agrawal et al. [9], it has attracted a great deal of attention. Many efficient algorithms, extensions and applications have been reported. The classic application of association rule mining is market basket data analysis, which aims to discover how items purchased by customers in a supermarket (or a store) are associated. An example association rule is

Cheese → Beer  [support = 10%, confidence = 80%].

The rule says that 10% of customers buy Cheese and Beer together, and those who buy Cheese also buy Beer 80% of the time. Support and confidence are two measures of rule strength, which we will define later.

This mining model is in fact very general and can be used in many applications. For example, in the context of the Web and text documents, it can be used to find word co-occurrence relationships and Web usage patterns, as we will see in later chapters.

Association rule mining, however, does not consider the sequence in which the items are purchased. Sequential pattern mining takes care of that. An example of a sequential pattern is: 5% of customers buy bed first, then mattress, and then pillows. The items are not purchased at the same time, but one after another. Such patterns are useful in Web usage mining for analyzing clickstreams in server logs. They are also useful for finding language or linguistic patterns in natural language texts.
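As a rough illustration of how such rule-strength measures are computed from raw transactions, the sketch below evaluates the support and confidence of a Cheese → Beer rule on a tiny, made-up transaction set (the data and helper names are invented for this example, not taken from the book).

```python
# Toy transaction database: each transaction is the set of items bought together.
transactions = [
    {"Cheese", "Beer", "Bread"},
    {"Cheese", "Beer"},
    {"Cheese", "Milk"},
    {"Beer", "Diaper"},
    {"Bread", "Milk"},
    {"Cheese", "Beer", "Diaper"},
    {"Milk"},
    {"Cheese"},
    {"Beer", "Bread"},
    {"Cheese", "Beer", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"Cheese", "Beer"}, transactions))       # 0.4 in this toy data
print(confidence({"Cheese"}, {"Beer"}, transactions))  # 4/6 ≈ 0.67 here
```

In practice one does not evaluate candidate rules by hand like this; algorithms such as Apriori prune the space of itemsets using a minimum support threshold before rules are generated.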

2.1 Basic Concepts of Association Rules


The problem of mining association rules can be stated as follows: Let I = {i1, i2, …, im} be a set of items. Let T = (t1, t2, …, tn) be a set of transactions (the database), where each transaction ti is a set of items such that ti ⊆ I. An association rule is an implication of the form

X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
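The chapter goes on to define the strength of such a rule formally; as a forward reference, the standard definitions over the n transactions in T can be written as follows (standard notation, not a quotation from the book).

```latex
% Standard support and confidence of a rule X -> Y over the transaction set T, |T| = n.
\[
\mathrm{support}(X \rightarrow Y) =
  \frac{\left|\{\, t_i \in T : X \cup Y \subseteq t_i \,\}\right|}{n},
\qquad
\mathrm{confidence}(X \rightarrow Y) =
  \frac{\left|\{\, t_i \in T : X \cup Y \subseteq t_i \,\}\right|}
       {\left|\{\, t_i \in T : X \subseteq t_i \,\}\right|}.
\]
```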
