
Chapter 1: Big Data (UC3M)

The document provides an overview of statistical learning, categorizing it into supervised and unsupervised learning, and discusses its applications in predicting wages, stock market movements, and gene expression data. It also outlines the historical development of statistical learning methods from the 19th century to recent advancements. Additionally, it emphasizes the relevance of statistical learning across various disciplines and the importance of applying these methods to real-world problems.


1. Overview 2. History 3. Premises

Big data for Business

CHAPTER 1: INTRODUCTION

Department of Statistics
Universidad Carlos III de Madrid

Bachelor in Business Administration


Bachelor in Finance and Accounting


An overview of Statistical Learning

• Statistical learning refers to a vast set of tools for understanding data.

• These tools can be classified as:

Supervised learning: involves building a statistical model for predicting, or estimating, an output based on one or more inputs.

Unsupervised learning: there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data.


Wage data

We examine a number of factors that relate to wages for a group of males from the Atlantic region of the United States.

[Figure: three panels plotting Wage against Age, Year, and Education Level.]

The Wage data involves predicting a continuous or quantitative output value. This is often referred to as a regression problem.
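
A regression problem of this kind can be sketched in a few lines. The following is only an illustration, not the method used in the course: it fits a straight line by ordinary least squares to a handful of made-up (age, wage) pairs and then predicts a quantitative output for a new input. The function name `fit_line` and all numbers are assumptions chosen for the example; they are not taken from the Wage data.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical toy data (NOT the real Wage data): age in years, wage in thousands.
ages = [25, 30, 35, 40, 45]
wages = [60, 70, 80, 90, 100]

intercept, slope = fit_line(ages, wages)
predicted = intercept + slope * 50  # quantitative prediction for a new age
print(intercept, slope, predicted)
```

The output is a number on a continuous scale, which is what distinguishes regression from the classification setting. A straight line would not capture the rise and later decline of wage with age seen in the real data; it is used here only because it is the simplest case.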


Wage data

• Left: wage as a function of age. On average, wage increases with age until about 60 years, at which point it begins to decline.

• Center: wage as a function of year. There is a slow but steady increase of approximately $10,000 in the average wage between 2003 and 2009.

• Right: boxplots displaying wage as a function of education, with 1 indicating the lowest level (no high school diploma) and 5 the highest level (an advanced graduate degree). On average, wage increases with the level of education.

Stock Market data


We examine a stock market data set that contains the daily movements
in the Standard & Poor’s 500 (S&P) stock index over a 5-year period
between 2001 and 2005.

[Figure: three panels (Yesterday, Two Days Previous, Three Days Previous), each showing boxplots of the percentage change in the S&P grouped by today's direction (Down or Up).]

The Stock Market data involves predicting a categorical or qualitative output value. This is often referred to as a classification problem.
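
To make the idea of a qualitative output concrete, here is a minimal sketch of a classifier. The slides do not name a classification method at this point, so k-nearest neighbours is used purely as an assumed example. The function `predict_direction` and the training rows are hypothetical, and the toy data are deliberately well separated, unlike real S&P returns.

```python
def predict_direction(train, new_lags, k=3):
    """Majority vote among the k training days closest to new_lags."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = sorted(train, key=lambda row: sq_dist(row[0], new_lags))[:k]
    labels = [label for _, label in nearest]
    # sorted() makes the choice deterministic if two labels ever tie
    return max(sorted(set(labels)), key=labels.count)

# Hypothetical training rows: ((change 1 day ago, change 2 days ago), direction).
train = [
    ((0.5, 0.2), "Up"), ((0.4, 0.1), "Up"), ((0.6, 0.3), "Up"),
    ((-0.5, -0.2), "Down"), ((-0.4, -0.3), "Down"), ((-0.6, -0.1), "Down"),
]

print(predict_direction(train, (0.45, 0.15)))
print(predict_direction(train, (-0.5, -0.25)))
```

The prediction is one of a fixed set of labels ("Up" or "Down") rather than a number, which is what makes this a classification problem.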

Stock Market data

• The left-hand panel displays two boxplots of the previous day’s percentage
changes in the stock index.

• The two plots look almost identical, suggesting that there is no simple strategy
for using yesterday’s movement in the S&P to predict today’s returns.

• The remaining panels, which display boxplots for the percentage changes 2 and
3 days previous to today, similarly indicate little association between past and
present returns.
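
The visual comparison of boxplots described above can also be made numerically. The sketch below computes the five numbers a boxplot is built from (minimum, quartiles, maximum) for two groups of lagged returns. The helper `five_number_summary` and both lists are made up for illustration; they are not the S&P data.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the ingredients of a boxplot."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical percentage changes on the previous day, split by today's direction.
prev_change_when_up = [-2.0, -1.0, 0.0, 1.0, 2.0]
prev_change_when_down = [-2.1, -0.9, 0.1, 1.0, 1.9]

up_summary = five_number_summary(prev_change_when_up)
down_summary = five_number_summary(prev_change_when_down)
print(up_summary)
print(down_summary)
```

When the two summaries are nearly the same, as in this toy case, knowing the previous day's change says little about today's direction.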


Gene Expression data


We consider the NCI60 data set, which consists of 6,830 gene expression
measurements for each of 64 cancer cell lines.

[Figure: two panels plotting the cell lines on the axes Z1 and Z2.]

Instead of predicting a particular output variable, we are interested in determining whether there are groups, or clusters, among the cell lines based on their gene expression measurements. This is often referred to as a clustering problem.
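
A clustering problem can be sketched as follows. The slides do not say which method produced the groups, so k-means (Lloyd's algorithm) is used here only as an assumed example. The function `kmeans`, the points, and the starting centers are all hypothetical and unrelated to the NCI60 measurements.

```python
def kmeans(points, centers, n_iter=10):
    """Lloyd's algorithm: alternate assignment and centroid update (n_iter >= 1)."""
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for each point.
        labels = [
            min(range(len(centers)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
            for point in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for c in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center as is
        centers = new_centers
    return labels, centers

# Hypothetical two-dimensional points; no output variable is supplied.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

No labels are given to the algorithm; the group memberships are inferred from the inputs alone, which is the defining feature of unsupervised learning.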

Gene Expression data

• Each point corresponds to one of the 64 cell lines. Left: there appear to be four groups of cell lines, which we have represented using different colors.

• Right: the same as the left panel, except that each of the 14 different types of cancer is represented by a different colored symbol. Cell lines corresponding to the same cancer type tend to be nearby in the two-dimensional space.

A Brief History of Statistical Learning


• At the beginning of the nineteenth century, Legendre and Gauss developed the
method of least squares, the earliest form of what is now known as linear regression.
• In 1936, Fisher proposed linear discriminant analysis.
• In the 1940s, various authors proposed logistic regression.
• In the 1970s, Nelder and Wedderburn developed generalized linear models.
• By the 1980s, computing technology improved sufficiently that non-linear
methods were no longer prohibitive. Breiman, Friedman, Olshen and Stone
introduced classification and regression trees.
• In 1986, Hastie and Tibshirani proposed generalized additive models.
• Inspired by the advent of machine learning and other disciplines,
statistical learning emerged as a new subfield in statistics.
• In recent years, progress has been marked by the increasing availability of
powerful and relatively user-friendly software, like R.


Four premises

• Many statistical learning methods are relevant and useful in a wide range of academic and non-academic disciplines, beyond just the statistical sciences.

• Statistical learning should not be viewed as a series of black boxes.

• While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box!

• Interest is focused on applying statistical learning methods to real-world problems.
