Presentation
Big Data refers to extremely large and complex datasets that cannot
be effectively processed using traditional data management tools. It
involves collecting, storing, processing, and analyzing vast amounts of
structured, semi-structured, and unstructured data to extract
meaningful insights.
Data types that can be obtained:
• Sensor data: temperature, humidity, pressure
• Telemetry data: location, speed, acceleration
• Operational data: machine status, energy consumption
• Environmental data: air quality, soil moisture
• Demographic data: population statistics
• Economic data: employment rates, GDP figures
• Health data: disease prevalence, healthcare outcomes
• Geographic data: maps, spatial datasets
Open-Source Tools
Types of Storage:
Open-source visualization tools:
1. Power BI: allows users to create reports and visualizations on their local machines.
2. Tableau Public: a free version that allows you to create and share visualizations online.
3. Looker: used for exploring, analyzing, and sharing real-time business analytics.
5. Kibana: allows users to visualize and explore data stored in Elasticsearch using interactive dashboards, graphs, charts, maps, and tables.
Which visualization tool should I choose for my needs?
Machine Learning
Definition:
Machine Learning (ML) is a subset of Artificial
Intelligence that focuses on developing algorithms that
enable computers to learn and make decisions from
data without being explicitly programmed.
Types of ML:
01 Supervised Learning
02 Unsupervised Learning
03 Reinforcement Learning
1- Supervised Learning
Supervised learning is when a model is trained on labeled data, meaning it knows the correct answer during training.
Classification Algorithms:
There are two types of learners in machine learning classification:
Eager learners:
Eager learners are ML algorithms that build a model during the training phase and are ready to make predictions as soon as training is complete. These algorithms "eagerly" build the entire model before being presented with any new data.
• Logistic Regression
• Support Vector Machine
• Decision Trees
Logistic Regression
Logistic Regression is a statistical method used for binary
classification problems. It predicts the probability that a given
input belongs to a specific class (usually labeled as 0 or 1) based
on one or more input features.
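As a minimal sketch (assuming scikit-learn is available; the dataset here is synthetic and purely illustrative):

```python
# Minimal sketch: binary classification with logistic regression (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 2 classes, a few numeric features.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)                  # weights are learned during training (eager learner)
print(model.predict_proba(X_test[:3]))       # predicted probability of each class
print(model.score(X_test, y_test))           # accuracy on held-out data
```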
Decision Tree
A decision tree is a ML method used for classification
or regression. It works like a flowchart, where each
step (called a node) asks a question or makes a
decision to split the data into groups until it reaches a
final decision.
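A minimal sketch of a decision tree classifier, assuming scikit-learn and using its bundled Iris dataset for illustration:

```python
# Minimal sketch: a decision tree classifier (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# The fitted tree is a flowchart of questions on feature values.
print(export_text(tree, feature_names=["sepal len", "sepal wid", "petal len", "petal wid"]))
```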
Lazy learners:
Lazy learners or instance-based learners, don’t create any model immediately from the training data, and this is where
the lazy aspect comes from. They just memorize the training data, and each time there is a need to make a prediction,
they search for the nearest neighbor from the whole training data, which makes them very slow during prediction.
• K-Nearest Neighbor(KNN).
• Case-based reasoning.
K-Nearest Neighbor (KNN)
The KNN algorithm predicts the label or value for a new data
point by looking at the closest K data points (neighbors) in the
training dataset.
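A minimal KNN sketch, again assuming scikit-learn and the Iris dataset; note that fit() mostly just stores the data, and the work happens at prediction time:

```python
# Minimal sketch: K-Nearest Neighbors (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# A lazy learner: fit() essentially memorizes the training data.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Prediction searches for the 5 closest training points and votes on their labels.
print(knn.predict([[5.1, 3.5, 1.4, 0.2]]))
```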
Types of Regression
1. Simple Linear Regression
It assumes a linear relationship between the independent and dependent variables, meaning that a change in the independent variable produces a proportional change in the dependent variable.
3. Polynomial Regression
It’s used to model non-linear relationships between the dependent variable and the independent variables. It adds polynomial terms to the linear regression model to capture more complex relationships.
Data requirements:
• No missing values
• Data in numeric format
• Data stored as pandas DataFrames or Series, or NumPy arrays
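A minimal sketch contrasting simple linear and polynomial regression, assuming scikit-learn and NumPy with synthetic data:

```python
# Minimal sketch: linear vs. polynomial regression (scikit-learn and NumPy assumed).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Synthetic non-linear data: y = x^2 plus noise, numeric and without missing values.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=100)

linear = LinearRegression().fit(X, y)                                          # straight-line fit
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear R^2:", linear.score(X, y))
print("polynomial R^2:", poly.score(X, y))                                     # captures the curvature
```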
Types of Clustering:
1) K-Means:
It divides the data into a fixed number (K) of clusters based on the similarity of the data points.
2) DBSCAN:
Parameters:
🔹 ε (epsilon): maximum distance between two points for them to be considered neighbors.
🔹 MinPts (minimum points): minimum number of points required to form a dense cluster.
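A minimal sketch of both clustering methods, assuming scikit-learn; the data and parameter values are illustrative:

```python
# Minimal sketch: K-Means and DBSCAN clustering (scikit-learn assumed; toy data).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

# Synthetic data with 3 separated groups, for illustration.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-Means: the number of clusters K is fixed up front.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("K-Means labels:", set(kmeans.labels_))

# DBSCAN: clusters emerge from density; eps plays the role of epsilon, min_samples of MinPts.
dbscan = DBSCAN(eps=0.8, min_samples=5).fit(X)
print("DBSCAN labels (-1 = noise):", set(dbscan.labels_))
```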
Dimensionality Reduction
It’s the process of reducing the number of features (variables) in a dataset while preserving as much important information as possible.
It helps to:
• Improve computational efficiency.
• Reduce overfitting by eliminating redundant features.
• Enhance visualization (especially for high-dimensional data).
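The slides don't name a specific technique here, but principal component analysis (PCA) is a common choice; a minimal sketch assuming scikit-learn:

```python
# Minimal sketch: dimensionality reduction with PCA (an assumed example method; scikit-learn).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)                # 64 features per sample (8x8 images)

# Keep enough components to explain ~95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)              # far fewer features, most information preserved
```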
Types of Outliers:
• Univariate outliers exist in a single variable in isolation. They are extreme or abnormal values that deviate from the
typical range of values for that specific feature.
• Multivariate outliers are found by combining the values of multiple variables at the same time.
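The slides don't prescribe a detection method, but the IQR rule is a common way to flag univariate outliers; a minimal sketch assuming pandas:

```python
# Minimal sketch: flagging univariate outliers with the IQR rule (pandas assumed; toy data).
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95, 11, 10, 12, -40])   # two extreme values on purpose

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(outliers)   # values far outside the typical range (95 and -40)
```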
How RL Works:
1. The agent observes the state of the environment.
2. It chooses an action based on its current policy.
3. The environment responds by changing the state and giving a reward.
4. The agent updates its knowledge (policy) to maximize future rewards.
5. The process repeats until the agent learns the best strategy.
Types of Reinforcement Learning:
1. Model-Free RL: The agent learns through trial and error without knowing the environment's exact rules.
⚬ Examples: Q-Learning, Deep Q-Networks (DQN).
2. Model-Based RL: The agent builds a model of the environment and uses it to make decisions.
⚬ Examples: Monte Carlo Tree Search (MCTS).
3. On-Policy vs. Off-Policy Learning:
⚬ On-Policy: Learns from the actions it actually takes (e.g., SARSA).
⚬ Off-Policy: Learns about one policy while following another (e.g., Q-Learning).
Popular Algorithms in RL:
1. Q-Learning: A value-based method that learns the optimal action-value function Q(s,a) (a minimal tabular sketch follows this list).
2. Deep Q-Networks (DQN): An extension of Q-Learning that uses neural networks to approximate the Q-function, enabling it to handle high-dimensional inputs like images.
3. Policy Gradient Methods: Algorithms like REINFORCE and Actor-Critic learn policies directly by optimizing the expected return.
4. Proximal Policy Optimization (PPO): A popular policy gradient algorithm known for its stability and efficiency.
5. Monte Carlo Tree Search (MCTS): Often used in combination with RL for games like Go and Chess.
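To make the observe-act-reward-update loop and Q-Learning concrete, here is the tabular sketch promised above, on a tiny made-up corridor environment (the environment and all hyperparameter values are illustrative assumptions):

```python
# Minimal sketch: tabular Q-Learning on a made-up 5-cell corridor (illustrative only).
import random

N_STATES, ACTIONS = 5, [0, 1]            # actions: 0 = left, 1 = right; goal is the rightmost cell
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount factor, exploration rate
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action):
    """Environment: move left/right; reward 1 only when the goal cell is reached."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # Observe the state, choose an action (epsilon-greedy policy).
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Update knowledge: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print([max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)])  # learned policy: mostly "right"
```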
Deep Learning
Definition:
DL is a subset of ML, which itself is a subfield of AI. It involves the
use of artificial neural networks to model and solve complex
problems. These neural networks are inspired by the structure and
function of the human brain, consisting of layers of interconnected
nodes (neurons) that process data.
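As a minimal illustration (assuming Keras/TensorFlow is installed; the data and layer sizes are arbitrary), a small feed-forward network of interconnected neurons might look like:

```python
# Minimal sketch: a small feed-forward neural network (Keras/TensorFlow assumed; synthetic data).
import numpy as np
from tensorflow import keras

# Synthetic binary-classification data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X.sum(axis=1) > 0).astype("float32")

# Layers of interconnected nodes (neurons) that process the data.
model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))      # [loss, accuracy]
```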
Recommendation Systems
1) Collaborative Filtering:
Memory-based:
Memory-based systems represent users and items as a matrix. They are an extension of the k-nearest neighbors (KNN) algorithm because they aim to find their “nearest neighbors,” which can be similar users or similar items.
Model-based:
One of the most commonly used model-based collaborative filtering algorithms
is matrix factorization. This dimensionality reduction method decomposes the
user-item matrix into two smaller matrices—one for users and another for
items. The 2 matrices are then multiplied together to predict the missing values
(or the recommendations) in the larger matrix.
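A minimal matrix factorization sketch in NumPy; the ratings, number of latent factors, and hyperparameters are illustrative assumptions:

```python
# Minimal sketch: matrix factorization for collaborative filtering (NumPy; illustrative values).
import numpy as np

# Tiny user-item rating matrix; 0 means "missing / not yet rated".
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

n_users, n_items, k = R.shape[0], R.shape[1], 2       # k latent factors
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))          # user-factor matrix
Q = rng.normal(scale=0.1, size=(n_items, k))          # item-factor matrix

lr, reg = 0.01, 0.02
for _ in range(2000):                                  # SGD over observed entries only
    for u, i in zip(*R.nonzero()):
        err = R[u, i] - P[u] @ Q[i]
        pu = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])

print(np.round(P @ Q.T, 1))                            # the product fills in the missing (0) entries
```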
2) Content-based:
Content-based filtering recommends items by comparing the features of items
with a user's profile or preferences. It focuses on the attributes of the items
themselves.
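A minimal content-based sketch using TF-IDF features and cosine similarity, assuming scikit-learn; the item descriptions are made up:

```python
# Minimal sketch: content-based filtering with TF-IDF + cosine similarity (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "Movie A": "space adventure with robots and aliens",
    "Movie B": "romantic comedy set in Paris",
    "Movie C": "robots fight aliens in deep space",
}

vectors = TfidfVectorizer().fit_transform(items.values())   # item attributes -> feature vectors
sim = cosine_similarity(vectors)

# A user who liked "Movie A" gets the most similar other item recommended.
liked = 0
best = max((i for i in range(len(items)) if i != liked), key=lambda i: sim[liked, i])
print(list(items)[best])                                    # -> "Movie C"
```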
4) Context-Aware Filtering:
It’s a recommendation system approach that takes into account the context in which a
user is making a decision or interacting with a system. The context could refer to various
factors such as time, location, device, or mood.
While traditional recommendation systems focus mainly on user preferences and item
attributes, context-aware systems consider the surrounding conditions to refine the
recommendations further.
3. Autoencoders (Variational Autoencoders, VAE): Learn hidden patterns in user preferences and reconstruct missing data (a minimal sketch follows this list).
• How it works:
⚬ Compresses (encodes) user preferences into a smaller representation.
⚬ Reconstructs (decodes) preferences to fill in missing interactions.
• Used for: Handling missing data (e.g., cold start problems).
4. Recurrent Neural Networks (RNN - LSTM, GRU): Model user behavior over time.
• How it works:
⚬ Takes a sequence of user interactions (e.g., past clicks).
⚬ Uses LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Units) to remember past preferences.
⚬ Predicts the next item the user might like.
• Used for: Sequential recommendations (e.g., music, news).
5. Convolutional Neural Networks (CNN): Extract important features from content (images, text, etc.).
• How it works:
⚬ Uses convolutional layers to analyze item characteristics.
⚬ Matches similar items based on extracted features.
• Used for: Content-based recommendations.
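Here is the autoencoder sketch promised above: it reconstructs a toy user-item interaction matrix so the reconstructed scores can be ranked as recommendations (Keras/TensorFlow assumed; data and layer sizes are illustrative):

```python
# Minimal sketch: an autoencoder that reconstructs user-item interactions (Keras assumed; toy data).
import numpy as np
from tensorflow import keras

n_users, n_items = 100, 40
rng = np.random.default_rng(0)
interactions = (rng.random((n_users, n_items)) > 0.8).astype("float32")   # sparse 0/1 matrix

model = keras.Sequential([
    keras.Input(shape=(n_items,)),
    keras.layers.Dense(8, activation="relu"),                # encoder: compress preferences
    keras.layers.Dense(n_items, activation="sigmoid"),       # decoder: reconstruct all item scores
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(interactions, interactions, epochs=10, batch_size=16, verbose=0)

# Reconstructed scores for items the user has not interacted with can be ranked as recommendations.
scores = model.predict(interactions[:1], verbose=0)[0]
print(np.argsort(scores)[::-1][:5])                          # top-5 item indices for user 0
```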
8. Recommendation by Clustering
Clustering-based recommendation is a technique in recommendation systems that groups users or items into clusters based on their similarities. Once the clusters are formed, recommendations are made based on the preferences of other users or items in the same cluster (see the sketch after this list).
• Build vector representations of items using techniques like TF-IDF.
• Apply clustering algorithms to group similar users or items.
• Associate a user with a cluster of similar items.
• Recommend items from the same cluster.
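A minimal sketch of these four steps, assuming scikit-learn and a handful of made-up item descriptions:

```python
# Minimal sketch: clustering-based recommendation with TF-IDF + K-Means (scikit-learn; toy items).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

items = [
    "space robots aliens adventure",
    "robots aliens space battle",
    "romance comedy paris",
    "paris love story comedy",
]

# 1) Vector representations of items with TF-IDF.
X = TfidfVectorizer().fit_transform(items)

# 2) Cluster similar items.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# 3) Associate the user with the cluster of an item they liked, 4) recommend from that cluster.
liked = 0
recommendations = [i for i, lab in enumerate(labels) if lab == labels[liked] and i != liked]
print(recommendations)   # indices of items in the same cluster as the liked one
```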
THANK YOU
FOR YOUR INTEREST