Lecture Notes 1a
Lecture 1 (a)
SSM Seoul
Oct. to Nov. 2023
• Data Science
➢ To understand the data to
derive “insights”
Oct. 28, 2023 Math behind ML Tech :: Lecture 1 (a) :: J. Rhee 3
AI, ML, DL vs Data Science
• Data Science
➢ an area, a field of study
➢ (relatively) scientific study
• Data Mining
➢ a technique
➢ (relatively) business process
➢ finding trends (in a data set) previously
unknown and using these trends to
identify future patterns
Machine Learning
• Machine Learning (or Statistical Learning) refers to a
huge set of tools for understanding data.
➢ In Supervised Learning, a statistical model is
built to predict an output based upon inputs
by training the model with a dataset having
both inputs and a labeled response (output).
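A minimal sketch of the supervised-learning idea above, assuming a single input and a straight-line model fitted by ordinary least squares. The function names and the five (input, response) pairs are made up for illustration; they are not data from the lecture.

```python
# Toy supervised learning: fit y ≈ intercept + slope * x by ordinary least squares.
# The (x, y) pairs below are made-up illustrative values, not real measurements.

def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

def predict(intercept, slope, x):
    """Predict the response for a new input x."""
    return intercept + slope * x

# Training set: inputs together with their labeled responses.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_linear_regression(train_x, train_y)
prediction = predict(intercept, slope, 6.0)  # an input the model has not seen
print(f"intercept={intercept:.2f}, slope={slope:.2f}, prediction={prediction:.2f}")
```

The model is trained only on the labeled pairs and is then asked for the output at an input that was not in the training set, which is the "predict an output based upon inputs" step.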
→ Hubble’s conclusion:
The universe is expanding.
interpretation
“Accelerating” Universe
https://youtu.be/BE1i20DzaAc
Fall 2022!
• However, recall!
➢ Kepler’s Laws of Planetary Motion
➢ Hubble’s Law
• Other names
➢ Predictor (X): feature, input variable, independent variable, field
➢ Response (Y): output variable, dependent variable, target variable, outcome variable
Model Fit
1. Model to fit the data.
• Linear Regression
• Logistic Regression
• Other ML models
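As an illustration of fitting one of the models listed above, here is a hedged sketch of logistic regression with a single predictor, trained by plain gradient descent on the mean log-loss. The data, the learning rate, the number of steps, and all function names are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_regression(xs, labels, learning_rate=0.1, n_steps=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by gradient descent on the mean log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, labels):
            error = sigmoid(b0 + b1 * x) - y  # derivative of the log-loss w.r.t. the linear score
            grad0 += error
            grad1 += error * x
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1

# Made-up binary labels that tend to switch from 0 to 1 as x grows.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic_regression(xs, labels)
p_low = sigmoid(b0 + b1 * 0.5)   # fitted probability of class 1 at a small x
p_high = sigmoid(b0 + b1 * 4.0)  # fitted probability of class 1 at a large x
print(f"b0={b0:.2f}, b1={b1:.2f}, p_low={p_low:.2f}, p_high={p_high:.2f}")
```

Linear regression is fitted to a numeric response, while logistic regression is fitted to a binary response and returns a probability.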
Data Mining: from a Process Perspective
Data Analytics
Social Network Analytics
Text Mining
Application Examples
• Astronomy
➢ Photometric redshift
▪ Neural networks, boosted decision trees
➢ Gravitational wave: detection and parameter estimation
▪ CNNs
➢ Gravitational lens: detection and parameter estimation
▪ NNs, CNNs, and ResNets (residual NNs)
➢ Fundamental cosmological parameter estimation
▪ 3D CNNs (for a fast mapping between the dark matter and visible galaxies)
• Particle Physics
➢ Particle identification
➢ Event selection and high-level physics tasks
➢ Reconstruction
➢ Jet classification
➢ Tracking
➢ Fast simulation
• Statistical Physics
• Many-Body Quantum Matter
• Quantum Computing
• AI acceleration with classical and quantum hardware
• Nearly all of the objects that you can see in this photograph are:
➢ Planets
➢ Stars
➢ Galaxies
• HUDF (Hubble Ultra Deep Field) Image
➢ exposure time: ~ 600 hours between July 2002 and September 2012
➢ the size of this image
▪ horizontal length: 1/13 of the diameter of the full Moon in the sky
▪ area: 1/150 of the area of the full Moon in the sky
➢ about 30,000 galaxies in this particular image
➢ various types of galaxies: spiral, elliptical, and irregular
Generative Adversarial Network (GAN)
• New approach?
➢ Which is heavier?
• 911 calls
• https://www.kaggle.com/competitions
• https://dacon.io/competitions
• The Problem of Overfitting
➢ Statistical models can produce highly complex explanations of relationships between input and output variables.
➢ The “fit” appears to be excellent.
➢ However, when used with new data, models of great complexity do not perform well.
• This model fits the data with no error, but does it really show an appropriate trend?
• Overfitting: a model follows the errors (noise) too closely.
➢ The method yields a small training MSE but a large test MSE.
➢ A good model should follow the signal (trend), not the errors (noise).
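The overfitting behaviour described above can be reproduced with a small sketch. It assumes a made-up linear trend with hand-picked noise. As the "model of great complexity" it uses a polynomial that passes through every training point (Lagrange interpolation, chosen here only for illustration), and as the simple model a least-squares straight line. All names and numbers are hypothetical.

```python
def trend(x):
    """The assumed underlying signal for this toy example."""
    return 2.0 * x + 1.0

# Made-up training and test sets: signal plus hand-picked noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
train_noise = [0.8, -0.9, 0.7, -0.6, 0.9, -0.8, 0.6, -0.7]
train_y = [trend(x) + e for x, e in zip(train_x, train_noise)]

test_x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
test_noise = [0.3, -0.2, 0.4, -0.4, 0.2, -0.3, 0.1]
test_y = [trend(x) + e for x, e in zip(test_x, test_noise)]

def fit_line(xs, ys):
    """Simple model: least-squares straight line."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return lambda x: intercept + slope * x

def fit_interpolating_polynomial(xs, ys):
    """Very flexible model: the polynomial through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared error of a fitted model on a data set."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

line_train_mse, line_test_mse = mse(line, train_x, train_y), mse(line, test_x, test_y)
poly_train_mse, poly_test_mse = mse(poly, train_x, train_y), mse(poly, test_x, test_y)
print(f"line: train MSE={line_train_mse:.3f}, test MSE={line_test_mse:.3f}")
print(f"poly: train MSE={poly_train_mse:.3f}, test MSE={poly_test_mse:.3f}")
```

The interpolating polynomial reproduces the training responses exactly, so its training MSE is zero, yet it swings widely between the training points and does worse than the straight line on the test set.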
Data Partitions and Their Roles
• Training Partition
➢ Typically the largest partition.
➢ Used to build the various models.
▪ This process is called “training”.
➢ Overfitting issue might arise.
• Validation Partition
➢ Does not participate in the training process.
➢ Instead, this is used to assess the predictive
performance of each model.
➢ This is also used for fine-tuning the
hyperparameters of models (e.g., the number of
nearest neighbors, learning rate, etc.).
➢ Finally, we compare the results of models
using the validation data and choose the best
model.
• Test Partition
➢ Completely new data that never participates in the
training or validation processes.
➢ Used to assess the performance of the chosen
(best) model.
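The three roles above can be sketched as follows. The 60/20/20 split, the synthetic data, the candidate values of k, and the helper names (`partition`, `knn_predict`, `mse`) are all assumptions for illustration. The hyperparameter being tuned is the number of nearest neighbors in a simple k-nearest-neighbor regression.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle once, then cut into training / validation / test partitions."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def knn_predict(train, x, k):
    """Average the responses of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error on `data` of a k-NN model built from `train`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

# Synthetic (input, response) records: a linear signal plus Gaussian noise.
rng = random.Random(42)
records = [(i / 10, 2 * (i / 10) + 1 + rng.gauss(0, 0.5)) for i in range(100)]

train, valid, test = partition(records)

# Training partition builds each candidate model; validation partition compares them.
candidate_ks = [1, 3, 5, 9, 15]
validation_mse = {k: mse(train, valid, k) for k in candidate_ks}
best_k = min(validation_mse, key=validation_mse.get)

# Test partition is touched only once, to assess the chosen model.
test_mse = mse(train, test, best_k)
print(f"best k={best_k}, validation MSE={validation_mse[best_k]:.3f}, test MSE={test_mse:.3f}")
```

Because the test partition plays no part in choosing k, its MSE is a fairer estimate of how the chosen model performs on new data than the validation MSE that selected it.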
• Bias: the error that is introduced by approximating a real-life problem (which may be
extremely complicated) by a much simpler model. Generally, more flexible methods
result in less bias.
• Irreducible error: random error, which is independent of x and has mean zero.
• The challenge lies in finding a method for which both the variance and the squared
bias are low.
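The three quantities above fit together in the standard decomposition of the expected test MSE at a point x_0. The notation is an assumption added here: the response is y = f(x) + ε, the fitted model is \hat{f}, and ε is the irreducible error with mean zero.

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\varepsilon)
```

The first term measures how much \hat{f}(x_0) would change if the model were trained on a different training set. No choice of method can push the expected test MSE below the last term, Var(ε).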
(Figure: Test MSE and Training MSE)
Core Ideas in Data Mining
• Classification
• Prediction
• Time Series Forecasting
• Association Rules & Recommenders
• Data & Dimension Reduction (including Cluster
Analysis)
• Data Exploration
• Visualization
• Why Anaconda?
➢ It allows Python programming in an “interactive” mode via
Jupyter Notebook, which is quite useful for data science.
➢ It allows multiple Python environments to be installed side by side
(e.g., different versions of Python and libraries).
• Python 3
➢ https://www.anaconda.com/products/distribution
▪ Download and install