
Mid Summary

Artificial neural networks are inspired by biological neural networks and are made up of interconnected nodes that can learn patterns from data. They use a technique called backpropagation to calculate error gradients and update weights to improve accuracy on new data. Fuzzy logic systems model human reasoning using degrees of membership rather than binary logic.

Uploaded by

Mahmoud Hussam
Copyright
© All Rights Reserved

• Intelligence: The capacity to learn, understand, solve problems, and make decisions.

• Artificial Intelligence (AI): The science of designing machines to perform tasks that
would require intelligence if done by humans.
• A machine is considered intelligent if it can achieve human-level performance in
cognitive tasks.
• Machine learning: Enables computers to adapt and learn from data, examples, or
analogies, rather than relying on explicit programming.
• Artificial Neural Networks (ANNs): Computer systems inspired by
the interconnected neuron structures of the brain.
• Generalization: It is the capability of the model to perform
accurately on new, unseen data after being trained on a subset of
data.

Artificial Neural Networks


• Neurons are simple, highly interconnected processors
analogous to the biological neurons in the brain.
• A neuron receives a number of input signals but never produces more than
one output signal.
• Each link has a numerical weight associated with it. Weights are
the basic means of long-term memory in ANNs.
• A neural network ‘learns’ through repeated adjustments of these
weights.
• The neuron computes the weighted sum of its inputs, $X$, and passes it through a transfer (activation) function to produce its output:
$X = \sum_{i=1}^{n} x_i w_i$

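• Common activation functions:
o Step and sign (hard limit) functions: used in decision-making neurons for classification and pattern recognition tasks.
o Sigmoid function: maps any input in (−∞, ∞) to a value between 0 and 1; used in back-propagation networks.
o Linear activation function: the output equals the neuron's weighted input; used in linear approximation.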
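
The activation functions above can be sketched in a few lines of Python. This is only an illustration: the function names and the choice to treat a net input of exactly 0 as "active" are assumptions, not something the notes specify.

```python
import math

def step(x):
    """Hard limiter: 1 if the net input is non-negative, else 0 (assumed convention)."""
    return 1 if x >= 0 else 0

def sign(x):
    """Hard limiter: +1 if the net input is non-negative, else -1 (assumed convention)."""
    return 1 if x >= 0 else -1

def sigmoid(x):
    """Squashes any real net input to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(x):
    """Output equals the neuron's weighted input."""
    return x

def neuron_output(inputs, weights, activation):
    """Weighted sum X = sum(x_i * w_i), passed through the chosen activation."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return activation(net)

# Hypothetical inputs and weights, chosen only for illustration.
inputs, weights = [1.0, 0.5], [0.4, -0.2]
for fn in (step, sign, sigmoid, linear):
    print(fn.__name__, neuron_output(inputs, weights, fn))
```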

• The first algorithm for training a simple ANN, the perceptron, was introduced by Frank Rosenblatt in 1958.


• The perceptron consists of a single neuron with adjustable weights and a hard limiter.
• The $n$-dimensional input space is divided by a hyperplane into two decision regions.
• The hyperplane is defined by the linearly separable function:
$\sum_{i=1}^{n} x_i w_i - \theta = 0$
• How does the perceptron learn? By making small adjustments in the weights
to reduce the difference between the actual and desired outputs.

• The initial weights are assigned randomly in the range [−0.5, 0.5].
• The perceptron training algorithm for classification tasks:
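
A minimal Python sketch of such a training loop is given here. It assumes a step activation, a learning rate `alpha` of 0.1, and that the threshold θ is adjusted like a weight attached to a fixed input of −1; the function names, the seed and the epoch limit are hypothetical choices made for the example.

```python
import random

def step(x):
    return 1 if x >= 0 else 0

def predict(weights, theta, inputs):
    """Actual output: step(sum(x_i * w_i) - theta)."""
    return step(sum(x * w for x, w in zip(inputs, weights)) - theta)

def train_perceptron(samples, alpha=0.1, max_epochs=1000, seed=0):
    rng = random.Random(seed)
    n = len(samples[0][0])
    # Initial weights and threshold drawn from [-0.5, 0.5].
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    theta = rng.uniform(-0.5, 0.5)
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, desired in samples:
            error = desired - predict(weights, theta, inputs)
            if error != 0:
                mistakes += 1
                # Small adjustment that reduces the difference between
                # the actual and the desired output.
                weights = [w + alpha * x * error for w, x in zip(weights, inputs)]
                theta -= alpha * error  # threshold seen as a weight on input -1
        if mistakes == 0:  # every pattern classified correctly
            break
    return weights, theta

AND = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
OR = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w_and, t_and = train_perceptron(AND)
w_or, t_or = train_perceptron(OR)
print([predict(w_and, t_and, x) for x, _ in AND])
print([predict(w_or, t_or, x) for x, _ in OR])
```

Running the same loop on XOR would hit the epoch limit without ever classifying all four patterns correctly, because no single hyperplane separates them.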

• A perceptron can represent a function only if there is a line (in general, a hyperplane) that separates
the inputs of one class from the inputs of the other, that is, only if the function is linearly separable.
Therefore, a perceptron can learn the operations AND and OR, but not XOR.
• A single-layer perceptron makes decisions in the same way regardless of the activation
function it uses (hard or soft limiter).
• To cope with problems that are not linearly separable, we use multilayer neural
networks, for example multilayer perceptrons trained with the back-propagation algorithm.
• A multilayer perceptron is a feedforward neural network with one or more hidden layers.
Typically, the network consists of an input layer of source neurons, at least one middle or
hidden layer of computational neurons, and an output layer of computational neurons.
The input signals are propagated in a forward direction on a layer-by-layer basis.
• The input layer rarely includes computing neurons.
• With one hidden layer, any continuous function of the input signals can be represented, and with
two hidden layers even discontinuous functions can be represented.
• A hidden layer is called so because it ‘hides’ its desired output. Neurons in the hidden
layer cannot be observed through the input/output behavior of the network.
• Experimental neural networks may have five or even six layers, including three or four
hidden layers, and utilize millions of neurons, but most practical applications use only
three layers, because each additional layer increases the computational burden
exponentially.
• Back-propagation was first proposed in 1969 but was ignored until the 1980s.
• The back-propagation algorithm has two phases. First, a training input pattern is presented
to the input layer and propagated from layer to layer until the output layer generates the output pattern. Second, if this pattern differs
from the desired output, an error is calculated and propagated backwards through the
network from the output layer to the input layer. The weights are modified as the error is propagated.

• A back-propagation network is a multilayer network with three or four layers. The layers
are fully connected, that is, every neuron in each layer is connected to every neuron
in the adjacent forward layer.
• Sigmoid activation function, with output between 0 and 1, and its derivative:
$Y^{sigmoid} = \frac{1}{1 + e^{-X}}, \qquad \frac{dY^{sigmoid}}{dX} = Y^{sigmoid}\,(1 - Y^{sigmoid})$
• ReLU (Rectified Linear Unit) is preferred due to some of the limitations with sigmoid,
such as the vanishing gradient problem.
• Vanishing Gradient Problem:
o Neural Network Training: Neural networks are trained by adjusting weights. This
is done by backpropagation, which uses gradients from the loss function.
o Gradients indicate the direction and magnitude to adjust the weights for better
accuracy.
o In calculus, the gradient is a vector that points in the direction of the steepest
increase of a function. In the context of neural networks, the gradient tells how
much the loss will change if we change the weights by a small amount.
o In deep networks, especially with certain activation functions, gradients can
become extremely small.
o When gradients are tiny, weight updates are negligible and the network learns very slowly.
• ReLU is the most commonly used activation function in deep learning models.
It returns 0 for any negative input and returns the input unchanged for any positive value $x$:
$f(x) = \max(0, x)$
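
To see why sigmoid gradients vanish while ReLU gradients do not, the short Python sketch below multiplies the same local derivative over several layers. It is a deliberate simplification: weights are ignored and every layer is assumed to see the same net input, so the numbers only indicate the trend. All function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """dY/dX = Y * (1 - Y); its largest value is 0.25, reached at x = 0."""
    y = sigmoid(x)
    return y * (1.0 - y)

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Derivative of max(0, x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def chained_gradient(grad_fn, x, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return grad_fn(x) ** depth

for depth in (1, 5, 10):
    print(depth,
          chained_gradient(sigmoid_grad, 0.0, depth),
          chained_gradient(relu_grad, 1.0, depth))
```

Even in the best case for the sigmoid (a net input of 0), ten layers shrink the factor to below one millionth, whereas the ReLU factor stays at 1 for positive inputs.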

• The back-propagation training algorithm
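
As a rough illustration of the two phases (forward pass, then backward error propagation), here is a small pure-Python sketch of a network with one hidden layer of sigmoid neurons trained on XOR. The hidden-layer size, learning rate, epoch count and seed are assumptions chosen for the example, and thresholds are updated as weights attached to a fixed input of −1.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_h, th_h, w_o, th_o):
    """Phase 1: propagate the input pattern layer by layer."""
    hidden = [sigmoid(sum(xi * w for xi, w in zip(x, ws)) - th)
              for ws, th in zip(w_h, th_h)]
    out = sigmoid(sum(h * w for h, w in zip(hidden, w_o)) - th_o)
    return hidden, out

def sse(data, params):
    """Sum of squared errors over one pass through the training set."""
    return sum((d - forward(x, *params)[1]) ** 2 for x, d in data)

def train(data, n_hidden=4, alpha=0.5, epochs=5000, seed=1):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    th_h = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    th_o = rng.uniform(-0.5, 0.5)
    initial = sse(data, (w_h, th_h, w_o, th_o))
    for _ in range(epochs):
        for x, d in data:
            hidden, y = forward(x, w_h, th_h, w_o, th_o)
            # Phase 2: error gradients, output layer first, then hidden layer.
            delta_o = y * (1 - y) * (d - y)
            delta_h = [h * (1 - h) * delta_o * w for h, w in zip(hidden, w_o)]
            w_o = [w + alpha * h * delta_o for w, h in zip(w_o, hidden)]
            th_o -= alpha * delta_o
            for j in range(n_hidden):
                w_h[j] = [w + alpha * xi * delta_h[j] for w, xi in zip(w_h[j], x)]
                th_h[j] -= alpha * delta_h[j]
    params = (w_h, th_h, w_o, th_o)
    return params, initial, sse(data, params)

XOR = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
params, sse_before, sse_after = train(XOR)
print("SSE before:", sse_before, "SSE after:", sse_after)
```

How far the error falls depends on the initial weights and the learning rate, so the printed values should be read as one run rather than a guaranteed result.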

• The sum of the squared errors is a useful indicator of the network’s performance.
• When the sum of squared errors over an entire pass through the training set, or epoch, is
sufficiently small, the network is considered to have converged.
• The network obtains different weights and threshold values when it starts from different
initial conditions. It typically still converges, although after a different number of iterations;
convergence is not guaranteed in general.
• Decision boundaries are difficult to draw for neurons with a sigmoid activation function.
However, each neuron in the hidden and output layers can be represented
by a McCulloch and Pitts model, using a sign function.
• Problems with back-propagation learning:
a. It does not seem to function in the biological world: biological neurons do not work
backward to adjust the strengths of their interconnections, so back-propagation does not emulate
brain-like learning.
b. Calculations are extensive and, as a result, training is slow.
• Learning is accelerated when the sigmoid is replaced by a hyperbolic tangent:
$Y^{tanh} = \frac{2a}{1 + e^{-bX}} - a$, where $a$ and $b$ are constants ($a = 1.716$, $b = 0.667$)
• Training is also accelerated by including a momentum term in the delta rule (generalized delta rule):
$\Delta w_{jk}(p) = \beta \times \Delta w_{jk}(p-1) + \alpha \times y_j(p) \times \delta_k(p)$
where the momentum constant $\beta \in [0, 1)$ is typically set to 0.95.
• Adjusting the learning rate parameter $\alpha$ during training can accelerate convergence. A small $\alpha$
changes the weights only slightly from one iteration to the next and leads to a smooth learning curve.
• On the other hand, if $\alpha$ is made larger to speed up the
training process, the larger changes in the weights may cause instability and oscillations.
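
The generalized delta rule can be illustrated with a tiny Python function. The default values (`alpha = 0.1`, `beta = 0.95`) and the constant neuron output and error gradient used in the loop are assumptions made for the example.

```python
def momentum_update(prev_delta_w, y_j, delta_k, alpha=0.1, beta=0.95):
    """Weight correction: beta * previous correction + alpha * y_j * delta_k."""
    return beta * prev_delta_w + alpha * y_j * delta_k

# Hypothetical constant output y_j = 1.0 and error gradient delta_k = 0.5.
delta_w = 0.0
history = []
for p in range(3):
    delta_w = momentum_update(delta_w, y_j=1.0, delta_k=0.5)
    history.append(delta_w)
print(history)
```

With an unchanging gradient, each correction adds to the previous one, so steps grow in directions where successive gradients agree. This is how the momentum term speeds up training.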

Fuzzy Expert Systems
• Fuzzy logic is the theory of fuzzy sets, sets that calibrate vagueness. Fuzzy logic is based
on the idea that all things admit of degrees.
• Boolean or conventional logic uses sharp distinctions.
• Fuzzy logic reflects how people think. It attempts to model common sense and decisions
• Fuzzy, or multi-valued logic was introduced in the 1930s by Jan Lukasiewicz, a Polish
logician and philosopher. In 1965, Lotfi Zadeh rediscovered fuzziness, identified and
explored it.
• Why fuzzy? As Zadeh said, the term is concrete, immediate and descriptive.
• Why logic? Fuzziness rests on fuzzy set theory, and fuzzy logic is just a small part of that theory. Fuzzy logic is defined
as a set of mathematical principles for knowledge representation based on degrees of
membership rather than on the crisp membership of classical binary logic.
• Fuzzy logic uses the continuum of logical values between 0 (completely false) and 1
(completely true).
• Crisp set theory is governed by a logic that uses one of only two
values: true or false. This logic cannot represent vague concepts
• In Fuzzy logic, a proposition is not either true or false, but may be
partly true (or partly false) to any degree. This degree is usually taken
as a real number in the interval [0,1]
• In a plot of a fuzzy set, the horizontal axis represents the universe of discourse: the range of
all possible values applicable to a chosen variable.
• The vertical axis represents the membership value of the fuzzy set.
• What is a fuzzy set? A set with fuzzy boundaries.
• Let $X$ be the universe of discourse and its elements be denoted as $x$. In
classical set theory, crisp set $A$ of $X$ is defined by the function $f_A(x)$, called the characteristic
function of $A$:
$f_A(x): X \to \{0, 1\}, \qquad f_A(x) = \begin{cases} 1, & \text{if } x \in A \\ 0, & \text{if } x \notin A \end{cases}$
• In fuzzy theory, fuzzy set $A$ of universe $X$ is defined by the function $\mu_A(x)$, called the
membership function of set $A$:
$\mu_A(x): X \to [0, 1]$, where
$\mu_A(x) = 1$ if $x$ is totally in $A$;
$\mu_A(x) = 0$ if $x$ is not in $A$;
$0 < \mu_A(x) < 1$ if $x$ is partially in $A$.

• For any element $x$ of universe $X$, the membership function $\mu_A(x)$ equals the degree to which $x$
is an element of set $A$.
This degree, a value between 0 and 1, is called the degree of membership, or
membership value, of element $x$ in set $A$.
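
The difference between a characteristic function and a membership function can be sketched in Python with a hypothetical fuzzy set "tall". The 180 cm threshold and the 160 to 190 cm ramp are invented purely for illustration.

```python
def crisp_tall(height_cm, threshold=180.0):
    """Characteristic function f_A(x): returns only 0 or 1."""
    return 1 if height_cm >= threshold else 0

def fuzzy_tall(height_cm, low=160.0, high=190.0):
    """Membership function mu_A(x): a degree in [0, 1], linear between low and high."""
    if height_cm <= low:
        return 0.0   # not in A
    if height_cm >= high:
        return 1.0   # totally in A
    return (height_cm - low) / (high - low)  # partially in A

for h in (150, 175, 179, 181, 200):
    print(h, crisp_tall(h), fuzzy_tall(h))
```

The crisp set jumps from 0 to 1 between 179 cm and 181 cm, while the fuzzy set changes gradually, which is what "a set with fuzzy boundaries" means in practice.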

• How to represent a fuzzy set in a computer? First, the membership function must be determined, typically by
asking experts.
• A newer technique forms fuzzy sets with artificial
neural networks, which learn from available system operation
data and then derive the fuzzy sets automatically.
• Fuzzy subset $A$ can be expressed as a set of pairs:
$A = \{(x_1, \mu_A(x_1)), (x_2, \mu_A(x_2)), \dots, (x_n, \mu_A(x_n))\}$
or, equivalently:
$A = \{\mu_A(x_1)/x_1\}, \{\mu_A(x_2)/x_2\}, \dots, \{\mu_A(x_n)/x_n\}$
• Typical functions that can be used are sigmoid, Gaussian
and pi. They can represent the real data in fuzzy sets,
but they also increase the computation time.
In practice, most applications use linear fit functions.
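
One possible way to store a fuzzy subset in code is as a list of (element, membership) pairs produced by a linear fit function. The sketch below assumes a trapezoidal shape; the function name `trapezoid`, its breakpoints and the sample universe are hypothetical.

```python
def trapezoid(x, a, b, c, d):
    """Linear fit membership: 0 outside (a, d), 1 on [b, c], linear ramps between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

universe = [150, 160, 170, 180, 190]
# Fuzzy subset A as pairs (x_i, mu_A(x_i)).
A = [(x, trapezoid(x, 150, 170, 180, 190)) for x in universe]
# The same subset in mu_A(x_i)/x_i notation.
notation = ", ".join(f"{mu}/{x}" for x, mu in A)
print(A)
print(notation)
```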

IBM – Python for Data Science


➢ Module 1 - Python Basics
• Expressions describe a type of operation that computers perform.
• In a mathematical operation, the numbers are called operands, and the math symbols
are called operators
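• A // B → floor (integer) division
• Variables are used to store values.
• B = A.replace('old', 'new') → returns a new string with 'old' replaced by 'new'
• A.find('sequence') → first index of the sequence, or -1 if it is not found
• The r prefix, as in r'...', tells Python to treat the string as a raw string, so backslashes are not interpreted as escape characters.
• re.search(pattern, s1) → searches for a pattern in
the string; it returns a match object if the pattern is found and None otherwise, so the result can be used as a Boolean test
• The re.findall() function finds all occurrences of a specified pattern within a string.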
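
The string operations listed above can be tried out with a short script. The sample strings are made up for illustration.

```python
import re

A = "Michael Jackson is the best"

B = A.replace("Michael", "Janet")   # new string; A itself is unchanged
idx = A.find("el")                  # index of the first occurrence
missing = A.find("xyz")             # -1 when the sequence is absent

raw = r"C:\new\table"               # raw string: \n and \t stay as two characters each

match = re.search(r"Jack", A)       # match object, or None if nothing is found
found = match is not None
numbers = re.findall(r"\d+", "a1 b22 c333")  # list of all matches

print(7 // 2, B, idx, missing, len(raw), found, numbers)
```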

➢ Module 2 - Python Data Structures


• A list is an ordered collection of different objects such as integers, strings, and even
other lists. The position of each element within a list is called its index. An
index is used to access and refer to items within a list.
• L[-i] = L[len(L) - i], i > 0
• L = ["a", 1]

L.extend(['b', 2]) → L = ["a", 1, 'b', 2] (adds each element of the argument)
L.append(['b', 2]) → L = ["a", 1, ['b', 2]] (adds the argument as a single element; shown for the original L)
• A.split() → splits the string A into a list of substrings
• B = A[:] → clones the list A
• Elements are added to a set with S.add('D')
• "element" in A # checks whether the element is in the set A
• set2.difference(set1) → the elements in set2 but not in set1
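
A short script that exercises the list and set operations above. Variable names and sample values are illustrative only.

```python
L = ["a", 1]

L_ext = L[:]                 # clone first so the original L is untouched
L_ext.extend(["b", 2])       # ["a", 1, "b", 2]

L_app = L[:]
L_app.append(["b", 2])       # ["a", 1, ["b", 2]]

last = L_ext[-1]             # same element as L_ext[len(L_ext) - 1]

words = "red,green,blue".split(",")   # string -> list of substrings

clone = L[:]
clone[0] = "z"               # changing the clone does not change L

S = {"A", "B", "C"}
S.add("D")
has_d = "D" in S             # membership test

set1 = {"A", "B"}
set2 = {"B", "C", "D"}
only_in_set2 = set2.difference(set1)  # {"C", "D"}

print(L_ext, L_app, last, words, L, has_d, only_in_set2)
```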

➢ Module 3 - Python Programming Fundamentals

➢ Module 4 - Working with Data in Python

➢ Module 5 - Working with Numpy Arrays & Simple APIs

Google - Introduction to Machine Learning
• ML is the process of training a piece of software, called a model, to make useful
predictions or generate content from data
• A model is a mathematical relationship derived from data that an ML system uses to make
predictions
• Types of ML Systems, based on how they learn to make predictions:
1. Supervised learning:
Models can make predictions after seeing lots of data with the correct answers and
then discovering the connections between the elements in the data that produce the
correct answers.
o Regression: predicts a numeric value
o Classification: predicts the likelihood that something belongs to a category.
▪ Binary classification: outputs a value from a class that contains only two values
▪ Multiclass classification: outputs a value from a class that contains more than two values
2. Unsupervised learning:
Models make predictions from data that does not contain any correct answers.
The goal is to identify meaningful patterns in the data.
A commonly used unsupervised model employs a technique called clustering, which finds data points that
demarcate natural groupings.
3. Reinforcement learning
o Makes predictions by getting rewards or penalties based on actions
performed within an environment.
o It generates a policy that defines the best strategy for getting the most
rewards.
o It is used to train robots to perform tasks, like walking and playing games.
4. Generative AI
o Creates content from user input, for example, creating novels, images,
music…etc.
o Can take a variety of inputs and create a variety of outputs, like text,
images, audio, and video or combinations of them.
o Learns patterns in data with the goal to produce new but similar data.
o To produce unique and creative outputs, generative models are initially
trained using an unsupervised approach, where the model learns to mimic
the data it's trained on. The model is sometimes trained further using
supervised or reinforcement learning on specific data.
• Data is the driving force of ML. Data comes in the form of words and numbers stored in
tables, or as the values of pixels and waveforms captured in images and audio files. We
store related data in datasets.
• Datasets are made up of individual examples that contain features and a label. You could
think of an example as analogous to a single row in a spreadsheet. Features are the
values that a supervised model uses to predict the label. The label is the "answer," or the
value we want the model to predict.
• Labeled examples contain both features and a label. In contrast, unlabeled
examples contain features, but no label. Once a model has been created, it predicts the
label from the features.

• A dataset is characterized by its size and diversity. Size indicates the number of examples.
Diversity indicates the range those examples cover. Good datasets are both large and
highly diverse.
• A large dataset doesn’t guarantee sufficient diversity, and a dataset that is highly diverse
doesn't guarantee sufficient examples.
• Datasets with more features can help a model discover additional patterns and make
better predictions.
• A model is the complex collection of numbers that define the mathematical relationship
from specific input feature patterns to specific output label values. The model discovers
these patterns through training.
• To train a model, we give the model a dataset with labeled examples. The model learns
the mathematical relationship between the features and the label so that it can make the
best predictions on unseen data. It does so by gradually adjusting itself based on the difference between the predicted and actual
values (the loss).
• When we evaluate a model, we use a labeled dataset, but we only give the model the
dataset's features. We then compare the model's predictions to the label's true values.
Depending on the model's predictions, we might do more training and evaluating before
deploying the model in a real-world application.
• Once we're satisfied with the results from evaluating the model, we can use the model to
make predictions, called inferences, on unlabeled examples.

Google - Machine Learning Crash Course


➢ Framing:
• A label is the thing we're predicting - the 𝑦 variable in simple linear regression
• A feature is an input variable - the 𝑥 variable in simple linear regression.
• An example is a particular instance of data, $\vec{x}$ (a vector of features). There are two types:
o A labeled example includes both feature(s) and the label, and is used to train the
model.
o An unlabeled example contains features but not the label.
• After training the model with labeled examples, the model is used to predict the label
on unlabeled examples.
• A model defines the relationship between features and label. It has two phases:
o Training means creating or learning the model. That is, you show the model
labeled examples and enable the model to gradually learn the relationships
between features and label.
o Inference means applying the trained model to unlabeled examples. That is,
you use the trained model to make useful predictions (𝑦′)
• A regression model predicts continuous values.
• A classification model predicts discrete values.

➢ Descending into ML:


• Linear regression:
$y' = b + \sum_{i=1}^{n} w_i x_i$
$y'$ is the predicted label.
$b$ is the bias (the y-intercept), sometimes referred to as $w_0$.
$w_i$ is the weight of feature $i$, which is the same concept as the "slope" $m$.
$x_i$ is a feature.
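
In code, the linear regression prediction is a bias plus a weighted sum of the features. The function name and the sample weights, bias and features below are hypothetical.

```python
def predict(features, weights, bias):
    """y' = b + sum(w_i * x_i) over all features."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Two features with made-up weights and bias.
y_pred = predict(features=[2.0, 3.0], weights=[0.5, -1.0], bias=1.0)
print(y_pred)  # 1.0 + 0.5*2.0 + (-1.0)*3.0 = -1.0
```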
• Training a model simply means learning (determining) good values for all the weights
and the bias from labeled examples.
• Empirical risk minimization: in supervised learning, the process of a machine
learning algorithm building a model by examining many examples and attempting to
find a model that minimizes loss.
• Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad
the model's prediction was on a single example. If the model's prediction is perfect,
the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a
set of weights and biases that have low loss, on average, across all examples.
• Squared loss (also known as $L_2$ loss) for a single example: $L_2 = (\text{observation} - \text{prediction}(x))^2 = (y - y')^2$
• Mean square error (MSE):
$MSE = \frac{1}{N} \sum_{(x,y) \in D} (y - \text{prediction}(x))^2$
where:
$(x, y)$ is an example in which
$x$ is the set of features that the model uses to make predictions, and
$y$ is the example's label (for example, temperature).
$\text{prediction}(x)$ is a function of the weights and bias combined with the set of features $x$.
$D$ is a data set containing many labeled examples, which are $(x, y)$ pairs.
$N$ is the number of examples in $D$.
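
The MSE formula translates almost directly into Python. The three-example data set and the model `2x + 1` below are invented for the illustration.

```python
def mse(dataset, prediction):
    """Average of squared errors (y - prediction(x))^2 over all (x, y) in D."""
    return sum((y - prediction(x)) ** 2 for x, y in dataset) / len(dataset)

# Hypothetical data set D with a single feature per example.
D = [(1.0, 3.0), (2.0, 5.0), (3.0, 8.0)]

def model(x):
    return 2.0 * x + 1.0  # predicts 3, 5 and 7

loss = mse(D, model)
print(loss)  # squared errors are 0, 0 and 1, so the mean is 1/3
```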
➢ Reducing Loss:

• A mechanism to find the convergence point is the gradient descent algorithm:

1. The first stage in gradient descent is to pick a starting value (a starting point) for
$w_1$. The starting point does not matter much (for example, $w_1 = 0$).
2. The gradient of the loss is equal to the derivative (slope) of the curve.
When there are multiple weights, the gradient is a vector of partial derivatives
with respect to the weights, $\nabla f(x_1, \dots, x_n) = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)$.
The gradient always points in the direction of steepest increase in the loss
function. The gradient descent algorithm takes a step in the direction of the
negative gradient in order to reduce loss as quickly as possible.
3. To determine the next point along the loss function curve, the gradient descent
algorithm subtracts some fraction of the gradient from the starting point, that is, it steps along the negative gradient.
4. Gradient descent then repeats this process, edging ever closer to the minimum.
• The gradient vector has both a direction and a magnitude. Gradient descent algorithms
multiply the gradient by the learning rate (step size) to determine the next point.
• Hyperparameters are the knobs that programmers tweak in machine learning
algorithms. (tuning the learning rate)
• If the learning rate is too small, learning will take too long. Conversely, if the learning rate is too large,
the next point will perpetually bounce haphazardly across the bottom of the well.
• There is a Goldilocks learning rate for every regression problem. The Goldilocks
value is related to how flat the loss function is. If you know the gradient of the loss
function is small then you can safely try a larger learning rate, which compensates for
the small gradient and results in a larger step size.
• The ideal learning rate in one dimension is $\frac{1}{f''(x)}$, the inverse of the second derivative of $f(x)$ at $x$.
• In practice, finding a "perfect" (or near-perfect) learning rate is not essential for
successful model training. The goal is to find a learning rate large enough that
gradient descent converges efficiently, but not so large that it never converges.
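
The effect of the learning rate can be seen on a one-dimensional example. The sketch below assumes the loss $f(w) = (w - 3)^2$, whose gradient is $2(w - 3)$ and whose second derivative is 2, so the ideal rate $1/f''$ is 0.5. The loss, the starting point and the four rates are chosen only for illustration.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly step along the negative gradient; returns every visited point."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        path.append(w)
    return path

def grad(w):
    return 2.0 * (w - 3.0)  # derivative of the assumed loss (w - 3)^2

ideal = gradient_descent(grad, w0=0.0, learning_rate=0.5, steps=1)
moderate = gradient_descent(grad, w0=0.0, learning_rate=0.1, steps=50)
too_small = gradient_descent(grad, w0=0.0, learning_rate=0.01, steps=50)
too_large = gradient_descent(grad, w0=0.0, learning_rate=1.1, steps=20)

print(ideal[-1], moderate[-1], too_small[-1], too_large[-1])
```

With the ideal rate the minimum at 3 is reached in a single step. The moderate rate gets close after 50 steps, the very small rate is still far away after the same number of steps, and the overly large rate overshoots further on every step.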
• In gradient descent, a batch is the set of examples you use to calculate the gradient in
a single training iteration.
• In a large data set, redundancy becomes more likely as the batch size grows. Some redundancy can be
useful to smooth out noisy gradients.
• Stochastic gradient descent (SGD) uses only a single example (a batch size of 1)
per iteration, chosen at random. Given enough iterations, SGD works but is very noisy.
• Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise
between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random.
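
A minimal sketch of mini-batch SGD for a one-feature linear model with squared loss follows. Setting `batch_size=1` gives plain SGD. The synthetic, noise-free data (`y = 2x + 1`), the learning rate, the epoch count and the seed are assumptions made for the example.

```python
import random

def minibatch_sgd(data, batch_size, learning_rate=0.1, epochs=200, seed=0):
    """Fits y' = w*x + b by averaging the squared-loss gradient over each batch."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # batches are drawn at random
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Synthetic, noise-free data generated from y = 2x + 1.
data = [(i / 100, 2 * (i / 100) + 1) for i in range(100)]

w_mini, b_mini = minibatch_sgd(data, batch_size=10)  # mini-batch SGD
w_sgd, b_sgd = minibatch_sgd(data, batch_size=1)     # SGD
print(w_mini, b_mini, w_sgd, b_sgd)
```

Because the data here contains no noise, both variants settle near the true weight and bias; with real, noisy data the single-example updates would fluctuate much more than the mini-batch ones.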
➢ TensorFlow:
➢ Generalization
• Generalization refers to a model's ability to adapt properly to new, previously unseen
data, drawn from the same distribution as the one used to create the model.
• The ML fine print: three basic assumptions guide generalization:
1. Examples are drawn independently and identically (i.i.d.) at random from the
distribution.
2. The distribution is stationary: it doesn't change over time.
3. Examples are always drawn from the same distribution, including the training, validation, and test
sets.
• An overfit model gets a low loss during training but does a poor job predicting new
data.
In other words, overfitting is an undesirable machine learning behavior that occurs when the
model gives accurate predictions for training data but not for new
data.
• Overfitting is caused by making the model more complex than necessary. The fundamental
tension of ML is between fitting the data well and fitting the data as simply as possible.

• Ockham's razor (prefer the simplest explanation that fits) led to the development of generalization bounds: a statistical
description of a model's ability to generalize to new data based on the complexity of the
model and the model's performance on training data.
• To check that the model makes good predictions on new, previously unseen data, the data set
is divided into two subsets:
training set - a subset to train a model.
test set - a subset to test the trained model.
Good performance on the test set is a useful indicator of good performance on new data in general, assuming that the test set is large enough and that the same test set is not used over and over.

➢ Training, Test and validation Sets


• Make sure that your test set meets the following two conditions:
Is large enough to yield statistically meaningful results.
Is representative of the data set as a whole. In other words, do not pick a test set with
different characteristics than the training set.
• Never train on test data. If you are seeing surprisingly good results on your
evaluation metrics, it might be a sign that you are accidentally training on the test set.
• Setting the learning rate too high can make test loss significantly higher than training loss.
• Use the validation set to evaluate results from the training set. Then, use the test set
to double-check your evaluation after the model has "passed" the validation set.
• In this improved workflow:
1. Pick the model that does best on the validation set.
2. Double-check that model against the test set.
This is a better workflow because it creates fewer exposures to the test set.
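
One simple way to produce the three subsets is to shuffle the examples once and slice them. The 70/15/15 proportions, the function name and the seed below are assumptions for illustration, not recommendations from the course.

```python
import random

def split_dataset(examples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffles once, then slices into training, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # shuffle so each subset is representative
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains
    return train, val, test

examples = list(range(100))  # stand-in for 100 labeled examples
train_set, val_set, test_set = split_dataset(examples)
print(len(train_set), len(val_set), len(test_set))
```

The model is then tuned against `val_set` only, and `test_set` is consulted once at the end, which keeps exposure to the test set to a minimum.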
