COEN413 Machine Learning-2
• Attributes describe an object
– An object is also known as a record, point, case, sample, entity, or instance
• Characteristics of a data set:
– Size: Number of objects
– Dimensionality: Number of attributes
– Sparsity: Number of populated object-attribute pairs
Types of Attributes
• There are different types of attributes
– Categorical
• Examples: eye color, zip codes, words, rankings (e.g., good, fair, bad),
height in {tall, medium, short}
• Nominal (no order or comparison) vs Ordinal (ordered, but differences between values are not meaningful)
– Numeric
• Examples: dates, temperature, time, length, value, count.
• Discrete (e.g., counts) vs Continuous (e.g., temperature)
• Special case: Binary attributes (yes/no, exists/not exists)
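To make the distinction concrete, here is a minimal sketch in Python (assuming pandas, which the slides do not mention; the column names and values are invented) of nominal, ordinal, numeric, and binary attributes:

```python
import pandas as pd

# Hypothetical records; each column illustrates one attribute type.
df = pd.DataFrame({
    "eye_color": ["brown", "blue", "green"],   # nominal: no order
    "height":    ["short", "tall", "medium"],  # ordinal: ordered categories
    "count":     [3, 0, 7],                    # discrete numeric
    "temp_c":    [21.5, 19.0, 23.2],           # continuous numeric
    "exists":    [True, False, True],          # binary
})

# Nominal: categories without any order.
df["eye_color"] = pd.Categorical(df["eye_color"])

# Ordinal: categories with an explicit order, so comparisons make sense.
df["height"] = pd.Categorical(
    df["height"], categories=["short", "medium", "tall"], ordered=True
)

# Valid only because the attribute is ordinal.
print(df["height"] > "short")
```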
Types of data
• Numeric data: Each object is a point in a multidimensional space
• Categorical data: Each object is a vector of categorical values
• Set data: Each object is a set of values (with or without counts)
– Sets can also be represented as binary vectors or vectors of counts (see the sketch after this list)
• Ordered sequences: Each object is an ordered sequence of values.
• Graph data: Web graph and HTML Links
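As noted above, a set-valued object can be turned into a binary vector or a vector of counts over a fixed vocabulary. A minimal sketch (the vocabulary and basket are invented):

```python
vocabulary = ["apple", "bread", "milk", "beer"]  # hypothetical item universe

basket = ["milk", "bread", "milk"]  # one object: a multiset of items

# Binary vector: 1 if the item occurs in the set, 0 otherwise.
binary_vector = [1 if item in basket else 0 for item in vocabulary]

# Count vector: how many times each item occurs.
count_vector = [basket.count(item) for item in vocabulary]

print(binary_vector)  # [0, 1, 1, 0]
print(count_vector)   # [0, 1, 2, 0]
```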
What can you do with the data?
• Suppose you are a search engine and you have a toolbar log consisting of
– pages browsed,
– queries,
– pages clicked,
– ads clicked,
each with a user id and a timestamp. What information would you like to get out of the data? For example: ad click prediction, query reformulations.
Why data mining?
• Commercial point of view
– Data has become the key competitive advantage of companies
• Examples: Facebook, Google, Amazon
– Being able to extract useful information out of the data is key to exploiting it commercially.
• Scientific point of view
– Scientists are in an unprecedented position: they can collect terabytes of information
• Examples: Sensor data, astronomy data, social network data, gene data
– We need the tools to analyze such data to get a better understanding of the world and advance
science
• Scale (in data size and feature dimension)
– Why not use traditional analytic methods?
– Enormity of data, curse of dimensionality
– The amount and complexity of the data do not allow for manual processing. We need
automated techniques.
What is Data Mining again?
• “Data mining is the analysis of (often large) observational data sets to
find unsuspected relationships and to summarize the data in novel ways
that are both understandable and useful to the data analyst” (Hand,
Mannila, Smyth)
• Recommendations:
– Users who buy this item often buy that item as well
– Users who watched James Bond movies, also watched Jason Bourne
movies.
[Figure: a clustering of points in which intracluster distances are minimized and intercluster distances are maximized]
Clustering: Application 1
• Bioinformatics applications:
– Goal: Group genes and tissues together such that genes are
co-expressed in the same tissues
Clustering: Application 2
• Document Clustering:
– Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
– Approach: Identify frequently occurring terms in each document, form a
similarity measure based on the frequencies of different terms, and use
it to cluster.
– Gain: Information Retrieval can utilize the clusters to relate
a new document or search term to clustered documents.
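A minimal sketch of this approach, assuming scikit-learn is available (the slides do not prescribe a library) and using invented documents; TF-IDF weights stand in for the term-frequency similarity measure:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented toy corpus covering two topics.
docs = [
    "data mining finds patterns in large data sets",
    "clustering groups similar documents together",
    "the football match ended in a draw",
    "the team scored a late goal to win the match",
]

# Frequencies of important terms define the similarity between documents.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Cluster the documents; similar documents should share a label.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```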
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes, one of the
attributes is the class.
• Find a model for class attribute as a function of the
values of other attributes.
• The important thing is to find the right metrics and ask the right questions
• It helps our understanding of the world, and can lead to models of the
phenomena we observe.
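A minimal sketch of the definition above, assuming scikit-learn; the records, attribute names, and class values are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Training set: each record is [refund (0/1), taxable_income];
# the class attribute is cheat (0/1). All values are hypothetical.
X_train = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95]]
y_train = [0, 0, 0, 0, 1]

# Find a model for the class attribute as a function of the other attributes.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Apply the model to a previously unseen record.
print(model.predict([[0, 90]]))
```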
Connections of Data Mining with other areas
• Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure: Venn diagram placing Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database Systems]
Cultures
• Databases: concentrate on large-scale (non-
main-memory) data.
• AI (machine learning): concentrate on complex
methods, small data.
– In today’s world data is more important than
algorithms
• Statistics: concentrate on models.
Models vs. Analytic Processing
• To a database person, data mining is
an extreme form of analytic processing
– queries that examine large amounts
of data.
– Result is the query answer.
• To a statistician, data mining is the
inference of models.
– Result is the parameters of the model.
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the confluence of Database Technology, Statistics, Machine Learning, Pattern Recognition, Visualization, Algorithms, and Distributed Computing]
Commodity Clusters
• Web data sets can be very large
– Tens to hundreds of terabytes
– Cannot mine on a single server
• Standard architecture emerging:
– Cluster of commodity Linux nodes, Gigabit Ethernet interconnect
– Google GFS; Hadoop HDFS; Kosmix KFS
• Typical usage pattern
– Huge files (100s of GB to TB)
– Data is rarely updated in place
– Reads and appends are common
• How to organize computations on this architecture?
– Map-Reduce paradigm
Map-Reduce paradigm
• Map the data into key-value pairs
– E.g., map a document to word-count pairs
• Group by key
– E.g., group all pairs with the same word, collecting a list of counts
• Reduce by aggregating
– E.g., sum all the counts to produce the total count for each word.
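A minimal single-machine sketch of the word-count example above, in plain Python (a real job would run on a framework such as Hadoop; the documents are invented):

```python
from collections import defaultdict

documents = ["the cat sat", "the dog sat"]  # toy input

# Map: turn each document into (word, 1) key-value pairs.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# Group by key: collect the list of counts for each word.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce: aggregate each group by summing its counts.
totals = {word: sum(counts) for word, counts in groups.items()}
print(totals)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```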
The data analysis pipeline
• Mining is not the only step in the analysis process
Data → Preprocessing → Data Mining → Post-processing → Result
• Preprocessing: real data is noisy, incomplete, and inconsistent. Data cleaning is required to
make sense of the data.
– Techniques: Sampling, Dimensionality Reduction, Feature selection.
– Dirty work, but it is often the most important step of the analysis.
• Post-Processing: Make the data actionable and useful to the user
– Statistical analysis of importance
– Visualization.
– Pre- and Post-processing are often data mining tasks as well
Data Quality
• Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data
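A minimal sketch of how two of these problems might be spotted, assuming pandas (the table is invented):

```python
import pandas as pd

# Hypothetical records with quality problems baked in.
df = pd.DataFrame({
    "age":    [25, None, 42, 42],
    "income": [50_000, 60_000, None, None],
})

print(df.isna().sum())        # missing values per attribute
print(df.duplicated().sum())  # number of exact duplicate records
```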
Sampling
• Sampling is the main technique employed for data
selection.
– It is often used for both the preliminary investigation of
the data and the final data analysis.
• Stratified sampling
– Split the data into several partitions; then draw random samples from each partition
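A minimal sketch of stratified sampling with invented data: partition the records by a group attribute, then draw the same fraction from each partition so that rare groups stay represented:

```python
import random
from collections import defaultdict

random.seed(0)
# Invented, imbalanced data: 90 "spam" records, 10 "ham" records.
records = [("spam", i) for i in range(90)] + [("ham", i) for i in range(10)]

# Split the data into partitions (strata) by label.
strata = defaultdict(list)
for label, value in records:
    strata[label].append((label, value))

# Draw a 10% random sample from every stratum.
fraction = 0.1
sample = []
for group in strata.values():
    sample.extend(random.sample(group, max(1, int(fraction * len(group)))))

print(len(sample), sorted({label for label, _ in sample}))
```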
Sample Size