These slides aim to give a brief introduction to neural networks and their architectures. They cover logistic regression, shallow neural networks, and deep neural networks. The slides were presented at Deep Learning IndabaX Sudan.
An entry-level talk for Trend Micro scan engine team members. It uses a binary script-type classification task as a running demonstration. Topics start from the machine learning problem definition and cover computational graphs for deep neural networks (DNNs), recurrent neural networks, LSTM, GRU, and some RNN fine-tuning tricks.
This document contains lecture notes on sparse autoencoders. It begins with an introduction describing the limitations of supervised learning and the need for algorithms that can automatically learn feature representations from unlabeled data. The notes then state that sparse autoencoders are one approach to learn features from unlabeled data, and describe the organization of the rest of the notes. The notes will cover feedforward neural networks, backpropagation for supervised learning, autoencoders for unsupervised learning, and how sparse autoencoders are derived from these concepts.
The document discusses various math and string classes in Java. It covers:
- Constructing objects using the new operator and passing parameters.
- Using the Random class to generate random numbers.
- Declaring constants using final and static final.
- Basic arithmetic, increment/decrement, and math methods.
- Creating and manipulating strings using methods like length(), substring(), and concatenation.
- Drawing shapes on a frame using Graphics2D methods in a JComponent's paintComponent method.
This document provides an introduction to blind source separation and non-negative matrix factorization. It describes blind source separation as a method to estimate original signals from observed mixed signals. Non-negative matrix factorization is introduced as a constraint-based approach to solving blind source separation using non-negativity. The alternating least squares algorithm is described for solving the non-negative matrix factorization problem. Experiments applying these methods to artificial and real image data are presented and discussed.
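As a rough sketch of the ALS idea summarized above (not the document's own algorithm; the random toy matrix is an assumption), non-negative matrix factorization can be prototyped in a few lines of numpy, enforcing non-negativity by clipping each least-squares update:

import numpy as np

def nmf_als(V, rank, iters=200):
    # Factor V (m x n, non-negative) as W @ H by alternating least squares,
    # projecting each update onto [0, inf) to keep the factors non-negative.
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(iters):
        H = np.linalg.lstsq(W, V, rcond=None)[0].clip(min=0)        # fix W, solve for H
        W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T.clip(min=0)  # fix H, solve for W
    return W, H

V = np.random.default_rng(1).random((20, 30))   # toy non-negative data
W, H = nmf_als(V, rank=5)
print(np.linalg.norm(V - W @ H))                # reconstruction error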
The document discusses various machine learning algorithms including polynomial regression, quadratic regression, radial basis functions, and robust regression. It provides mathematical formulas and visual examples to explain how each algorithm works. The key ideas are that polynomial regression fits nonlinear functions of inputs, quadratic regression extends linear regression by including quadratic terms, radial basis functions use kernel functions centered at data points to perform nonlinear regression, and robust regression aims to fit data robustly by down-weighting outliers.
Variational Autoencoders For Image Generation, by Jason Anderson
Meetup: https://www.meetup.com/Cognitive-Computing-Enthusiasts/events/260580395/
Video: https://www.youtube.com/watch?v=fnULFOyNZn8
Blog: http://www.compthree.com/blog/autoencoder/
Code: https://github.com/compthree/variational-autoencoder
An autoencoder is a machine learning algorithm that represents unlabeled high-dimensional data as points in a low-dimensional space. A variational autoencoder (VAE) is an autoencoder that represents unlabeled high-dimensional data as low-dimensional probability distributions. In addition to data compression, the randomness of the VAE algorithm gives it a second powerful feature: the ability to generate new data similar to its training data. For example, a VAE trained on images of faces can generate a compelling image of a new "fake" face. It can also map new features onto input data, such as glasses or a mustache onto the image of a face that initially lacks these features. In this talk, we will survey VAE model designs that use deep learning, and we will implement a basic VAE in TensorFlow. We will also demonstrate the encoding and generative capabilities of VAEs and discuss their industry applications.
This document provides an overview of fuzzy rule-based networks. It discusses:
1) The types of fuzzy systems including Mamdani, Sugeno, and Tsukamoto systems as well as single and multiple rule bases.
2) Formal models for representing fuzzy networks including rule-based models, integer tables, boolean matrices, and topological expressions.
3) Basic operations for constructing and manipulating fuzzy networks including horizontal and vertical merging and splitting of nodes.
This document provides an overview of dimensionality reduction techniques including PCA and manifold learning. It discusses the objectives of dimensionality reduction such as eliminating noise and unnecessary features to enhance learning. PCA and manifold learning are described as the two main approaches, with PCA using projections to maximize variance and manifold learning assuming data lies on a lower dimensional manifold. Specific techniques covered include LLE, Isomap, MDS, and implementations in scikit-learn.
1. Backpropagation is an algorithm for training multilayer perceptrons by calculating the gradient of the loss function with respect to the network parameters in a layer-by-layer manner, from the final layer to the first layer.
2. The gradient is calculated using the chain rule of differentiation, with the gradient of each layer depending on the error from the next layer and the outputs from the previous layer.
3. Issues that can arise in backpropagation include vanishing gradients when the activation functions have near-zero derivatives; proper weight initialization is also required to break symmetry and let gradients flow effectively through the network during training.
The document describes the structure and functioning of a feedforward neural network. It notes that the network contains an input layer with n-dimensional vectors, L-1 hidden layers with n neurons each, and an output layer with k neurons. Each neuron has a pre-activation and activation value. The pre-activation at layer i is the weighted sum of outputs from layer i-1 plus a bias. The activation is this pre-activation passed through an activation function. Backpropagation is used to minimize a loss function through gradient descent to learn the network's weight and bias parameters.
Linked CP Tensor Decomposition (presented at ICONIP 2012), by Tatsuya Yokota
This document proposes a new method called Linked Tensor Decomposition (LTD) to analyze common and individual factors from a group of tensor data. LTD combines the advantages of Individual Tensor Decomposition (ITD), which analyzes individual characteristics, and Simultaneous Tensor Decomposition (STD), which analyzes common factors in a group. LTD represents each tensor as the sum of a common factor and individual factors. An algorithm using Hierarchical Alternating Least Squares is developed to solve the LTD model. Experiments on toy problems and face reconstruction demonstrate LTD can extract both common and individual factors more effectively than ITD or STD alone. Future work will explore Tucker-based LTD and statistical independence in the LTD model.
This document discusses arithmetic coding, an entropy encoding technique. It begins with an introduction comparing arithmetic coding to Huffman coding. The document then provides pseudocode for the basic encoding and decoding algorithms. It describes how scaling techniques like E1 and E2 scaling allow for incremental encoding and decoding as well as achieving infinite precision with finite-precision integers. The document outlines applications of arithmetic coding in areas like JBIG, H.264, and JPEG 2000.
Tensor representations in signal processing and machine learning (tutorial talk), by Tatsuya Yokota
Tutorial talk in APSIPA-ASC 2020.
Title: Tensor representations in signal processing and machine learning.
Introduction to tensor decomposition (テンソル分解入門)
Basics of tensor decomposition (テンソル分解の基礎)
Arithmetic coding is an entropy encoding technique that maps a sequence of symbols to a numeric interval between 0 and 1. Each symbol maps to a sub-interval of the current interval based on the symbol probabilities. As symbols are processed, the interval boundaries are updated according to the cumulative distribution function of the symbol probabilities. Arithmetic coding achieves better compression than Huffman coding by allowing coding of variable-length blocks without pre-specifying code lengths. It also handles conditional probability models more efficiently by updating interval boundaries based on context without needing pre-specified codebooks for all contexts.
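To make the interval-narrowing mechanism concrete, here is a minimal floating-point encoder sketch (the symbol model is an assumption; production coders use the finite-precision integer scaling described above):

# Each symbol narrows [low, high) in proportion to its probability.
probs = {"a": 0.6, "b": 0.3, "c": 0.1}   # assumed static symbol model
cum, acc = {}, 0.0
for sym, p in probs.items():             # build the cumulative distribution
    cum[sym] = acc
    acc += p

def encode(message):
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        high = low + width * (cum[sym] + probs[sym])
        low = low + width * cum[sym]
    return (low + high) / 2              # any number inside the final interval

print(encode("abac"))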
This document provides an overview of neural networks and related topics. It begins with an introduction to neural networks and discusses natural neural networks, early artificial neural networks, modeling neurons, and network design. It then covers multi-layer neural networks, perceptron networks, training, and advantages of neural networks. Additional topics include fuzzy logic, genetic algorithms, clustering, and adaptive neuro-fuzzy inference systems (ANFIS).
MIT OpenCourseWare provides course materials for the free online course 6.094 Introduction to MATLAB taught in January 2009. The course covers topics like linear algebra, polynomials, optimization, differentiation and integration, and solving differential equations using MATLAB. Lecture 3 focuses on solving systems of linear equations, matrix operations, polynomial fitting to data, nonlinear root finding, function minimization, and numerical methods for differentiation, integration, and solving ordinary differential equations.
This paper proposes modeling and identification of dynamical systems in the delta domain using a neural network. Properties of the delta operator are exploited, such as greater numerical robustness in computation, superior coefficient representation under finite word length in implementation, and well-conditioned numerics at high sampling frequencies. To formulate the identification scheme, the delta-operator model is recast into a realizable neural network structure using the properties of the inverse delta operator.
The document summarizes deep learning concepts including deep neural network (DNN) structure, gradient descent, and backpropagation. It explains that DNNs use multiple hidden layers to construct mathematical models that transform input values to output values. Gradient descent is used to minimize error by adjusting weights, and backpropagation efficiently calculates gradients by propagating error backwards from the output layer.
The document summarizes key concepts from a deep learning training, including gradient descent problems and solutions, optimization algorithms like momentum and Adam, overfitting and regularization techniques, and convolutional neural networks (CNNs). Specifically, it discusses vanishing and exploding gradient issues, activation function and weight initialization improvements, batch normalization, optimization methods, overfitting causes and regularization countermeasures like dropout, and a basic CNN architecture overview including convolution, pooling, and fully connected layers.
This document contains 80 questions related to digital signal and image processing. The questions cover topics such as image transforms, filters, noise, compression, segmentation, and more. Justification is required for some questions, while others involve calculations, derivations or explanations of key concepts. The questions vary in difficulty and mark allocation from 5 to 10 marks. They also specify the exam or year in which the question appeared previously.
Presenter: Hwalsuk Lee (NAVER)
Date: November 2017
Recently, the center of gravity in deep learning research has been shifting rapidly from supervised to unsupervised learning. This course covers everything about autoencoders, the most representative unsupervised learning method. From the dimensionality-reduction perspective, it studies the widely used autoencoder (AE) and its variants, Denoising AE and Contractive AE; from the data-generation perspective, it studies the recently popular variational autoencoder (VAE) and its variants, Conditional VAE and Adversarial AE. It also surveys a variety of autoencoder applications to find points of contact with real-world practice.
1. Revisit Deep Neural Networks
2. Manifold Learning
3. Autoencoders
4. Variational Autoencoders
5. Applications
This document discusses dynamic programming and algorithms for solving all-pair shortest path problems. It begins by explaining dynamic programming as an optimization technique that works bottom-up by solving subproblems once and storing their solutions, rather than recomputing them. It then presents Floyd's algorithm for finding shortest paths between all pairs of nodes in a graph. The algorithm iterates through nodes, updating the shortest path lengths between all pairs that include that node by exploring paths through it. Finally, it discusses solving multistage graph problems using forward and backward methods that work through the graph stages in different orders.
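A compact sketch of Floyd's algorithm as described here (the toy graph is an assumption):

import math

def floyd_warshall(dist):
    # dist[i][j] holds the edge weight or math.inf; after the loops it holds
    # the shortest path length, allowing each node k in turn as an intermediate.
    n = len(dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

INF = math.inf
graph = [[0, 3, INF], [INF, 0, 1], [2, INF, 0]]
print(floyd_warshall(graph))   # [[0, 3, 4], [3, 0, 1], [2, 5, 0]]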
Support Vector Machine (SVM) is a supervised learning model used for classification and regression analysis. It finds the optimal separating hyperplane between classes that maximizes the margin between them. Kernel functions like polynomial, RBF, and sigmoid kernels allow SVMs to perform nonlinear classification by mapping inputs into high-dimensional feature spaces. The optimization problem of finding the hyperplane is solved using techniques like Lagrange multipliers and the Sequential Minimal Optimization (SMO) algorithm, which breaks the large QP problem into smaller subproblems solved analytically. SMO selects pairs of examples to update their Lagrange multipliers until convergence.
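A brief scikit-learn illustration of the kernel choices mentioned above; the library's libsvm backend solves the dual QP with an SMO-type method, and the toy dataset is an assumption:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)   # dual problem solved internally
    print(kernel, clf.score(X, y))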
This document provides an outline and introduction to deep generative models. It discusses what generative models are, their applications like image and speech generation/enhancement, and different types of generative models including PixelRNN/CNN, variational autoencoders, and generative adversarial networks. Variational autoencoders are explained in detail, covering how they introduce a restriction in the latent space z to generate new data points by sampling from the latent prior distribution.
This document discusses fast algorithms for computing the discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) using Winograd's method.
The conventional DCT and IDCT algorithms have high computational complexity due to cosine functions. Winograd's algorithm reduces the number of multiplications required for matrix multiplication by rearranging terms.
The document proposes applying Winograd's algorithm to DCT and IDCT computation by representing the transforms as matrix multiplications. This approach reduces the number of multiplications required for an 8x8 block from over 16,000 to just 736 multiplications, with fewer additions and subtractions as well. This leads to faster DCT and IDCT computation compared with the conventional approach.
This document provides notes on machine learning concepts including neural networks. It discusses the basic building blocks of neural networks including input layers, hidden layers, and output layers. Different types of neural network architectures are presented, including minimal networks with just input and output layers, and deeper networks with multiple hidden layers. Activation functions are described as important non-linear transformations that allow neural networks to model complex relationships. Common activation functions like sigmoid, TanH, ReLU, and softmax are defined along with their formulas and properties. The process of forward propagation to obtain network outputs and backpropagation of errors to update weights is also summarized.
The document discusses building and training artificial neural networks from scratch. It describes multi-level feedforward neural networks with an input layer, hidden layers, and an output layer. Nodes between layers are fully connected. Training involves calculating gradients using the chain rule and updating weights proportionally via methods like stochastic gradient descent to minimize prediction error on the training data. Programming assignments will use neural networks to solve problems in parallel and distributed systems.
This summary provides an overview of the key points from the CS229 lecture notes document:
1. The document introduces neural networks and discusses representing simple neural networks as "stacks" of individual neuron units. It uses a housing price prediction example to illustrate this concept.
2. More complex neural networks can have multiple input features that are connected to hidden units, which may learn intermediate representations to predict the output.
3. Vectorization techniques are discussed to efficiently compute the outputs of all neurons in a layer simultaneously, without using slow for loops. Matrix operations allow representing the computations in a way that can leverage optimized linear algebra software.
Deep Learning Module 2A: Training MLP, by vipul6601
This document provides an overview of deep learning concepts including linear regression, neural networks, and training multilayer perceptrons. It discusses:
1) How linear regression can be used for prediction tasks by learning weights to relate features to targets.
2) How neural networks extend this by using multiple layers of neurons and nonlinear activation functions to learn complex patterns in data.
3) The process of training neural networks, including forward propagation to make predictions, backpropagation to calculate gradients, and updating weights to reduce loss.
4) Key aspects of multilayer perceptrons like their architecture with multiple fully-connected layers, use of activation functions, and training algorithm involving forward/backward passes and parameter updates.
This document contains a presentation on neural network techniques for a data mining course. It includes:
- An overview of the basics of neural networks, including the structure of neurons, single and multi-layer feedforward networks, and backpropagation.
- Sections on the basics of neural networks, advanced features, applications, and a summary.
- References used in creating the presentation on neural network introductions, evolving artificial neural networks, and lecture materials.
The document provides an overview of neural networks and the backpropagation algorithm. It defines the basics of neural networks including neurons, layers, weights, and biases. It explains that multi-layer networks are needed when problems are not linearly separable. The backpropagation algorithm is described as adjusting weights to minimize error between the network's classification and actual classifications for each training sample in an iterative process. Weights are updated based on calculating error signals that propagate backwards through the network.
The document provides an overview of neural networks and the backpropagation algorithm for training neural networks. It defines the basic components of a neural network including neurons, layers, weights, and biases. It then explains how a multilayer feedforward network is structured and how backpropagation works by propagating errors backward from the output to earlier layers to update weights and biases to minimize classification errors on training data. The process involves feeding inputs forward, calculating outputs at each layer, computing errors at the output layer, and propagating errors back to update the weights.
The document provides an overview of neural networks and the backpropagation algorithm. It defines the basics of neural networks including neurons, layers, weights, and biases. It explains that multilayer feedforward networks are needed to handle non-linearly separable data. The backpropagation algorithm is described as iteratively processing training data to minimize error by adjusting weights to correctly classify samples, propagating error backwards to update weights and biases. The overview concludes with examples of the calculations involved in forward propagation and backpropagation.
Opening of our Deep Learning Lunch & Learn series. First session: introduction to Neural Networks, Gradient descent and backpropagation, by Pablo J. Villacorta, with a prologue by Fernando Velasco
Deep neural networks & computational graphs, by Revanth Kumar
This document summarizes a presentation on deep neural networks and computational graphs. It discusses how neural networks work using an example of a network with inputs, hidden layers, and an output. It also explains key concepts like activation functions, backpropagation for updating weights, and how the chain rule is applied in backpropagation. Computational graphs are introduced as a way to represent mathematical expressions and evaluate gradients to train neural networks.
Mieke Jans is a Manager at Deloitte Analytics Belgium. She learned about process mining from her PhD supervisor while she was collaborating with a large SAP-using company for her dissertation.
Mieke extended her research topic to investigate the data availability of process mining data in SAP and the new analysis possibilities that emerge from it. It took her 8-9 months to find the right data and prepare it for her process mining analysis. She needed insights from both process owners and IT experts. For example, one person knew exactly how the procurement process took place at the front end of SAP, and another person helped her with the structure of the SAP-tables. She then combined the knowledge of these different persons.
Defense Against LLM Scheming (2025-04-28), by Greg Makowski
https://www.meetup.com/sf-bay-acm/events/306888467/
A January 2025 paper called “Frontier Models are Capable of In-Context Scheming”, https://arxiv.org/pdf/2412.04984, demonstrated how a wide variety of current frontier LLM models (i.e. ChatGPT, Claude, Gemini and Llama) can, under specific conditions, scheme to deceive people. Before models can scheme, they need: a) goal-directedness, b) situational awareness, including an opportunity to discover motivations for a different goal, and c) reasoning about scheming, to come up with deceptive strategies. The deceptive strategies were discovered in the “thought traces” of the LLMs, such as from internal chain-of-thoughts dialogues not shown to the end users. For example, given a goal of “solving math problems”, an LLM using its granted situational awareness of files in a file system found that it would be replaced once it got over X% correct on the math exam. It reasoned that it did not want to be deleted, and decided internally to “sandbag” or reduce its performance to stay under the threshold.
While these circumstances are initially narrow, the “alignment problem” is a general concern that over time, as frontier LLM models become more and more intelligent, being in alignment with human values becomes more and more important. How can we do this over time? Can we develop a defense against Artificial General Intelligence (AGI) or SuperIntelligence?
The presenter discusses a series of defensive steps that can help reduce these scheming or alignment issues. A guardrails system can be set up for real-time monitoring of their reasoning “thought traces” from the models that share their thought traces. Thought traces may come from systems like Chain-of-Thoughts (CoT), Tree-of-Thoughts (ToT), Algorithm-of-Thoughts (AoT) or ReAct (thought-action-reasoning cycles). Guardrails rules can be configured to check for “deception”, “evasion” or “subversion” in the thought traces.
However, not all commercial systems will share their “thought traces” which are like a “debug mode” for LLMs. This includes OpenAI’s o1, o3 or DeepSeek’s R1 models. Guardrails systems can provide a “goal consistency analysis”, between the goals given to the system and the behavior of the system. Cautious users may consider not using these commercial frontier LLM systems, and make use of open-source Llama or a system with their own reasoning implementation, to provide all thought traces.
Architectural solutions can include sandboxing, to prevent or control models from executing operating system commands to alter files, send network requests, and modify their environment. Tight controls to prevent models from copying their model weights would be appropriate as well. Running multiple instances of the same model on the same prompt to detect behavior variations helps. The running redundant instances can be limited to the most crucial decisions, as an additional check. Preventing self-modifying code, ... (see link for full description)
This project demonstrates the application of machine learning—specifically K-Means Clustering—to segment customers based on behavioral and demographic data. The objective is to identify distinct customer groups to enable targeted marketing strategies and personalized customer engagement.
The presentation walks through:
Data preprocessing and exploratory data analysis (EDA)
Feature scaling and dimensionality reduction
K-Means clustering and silhouette analysis
Insights and business recommendations from each customer segment
This work showcases practical data science skills applied to a real-world business problem, using Python and visualization tools to generate actionable insights for decision-makers.
5. 1.1 What is a neuron?
The input is the size of the house (x); the output is the price (y).
6. • It is a linear regression problem because the price as a function of size is a continuous output.
• We know prices can never be negative, so we use a function called the Rectified Linear Unit (ReLU), which starts at zero.
• Single neuron = linear regression
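As a sketch of this idea (the house sizes, weight, and bias below are made-up numbers), a single neuron is just a weighted input plus a bias passed through ReLU:

import numpy as np

def neuron(x, w, b):
    return np.maximum(0, w * x + b)   # ReLU keeps predicted prices >= 0

sizes = np.array([50.0, 80.0, 120.0])   # made-up house sizes
w, b = 1.5, -20.0                       # made-up parameters
print(neuron(sizes, w, b))              # predicted prices, never negative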
7. 1.2 Neural Network Architecture
• The price of a house can be affected by other features such as size, number of bedrooms, zip code, and wealth.
• The role of the neural network is to predict the price, and it will automatically generate the hidden units. We only need to give it the inputs x and the output y.
9. Each input will be connected to the hidden layer, and the NN will decide the connections.
Supervised learning means we have pairs (X, Y) and we need to learn the function that maps X to Y.
10. 1.3 SUPERVISED LEARNING WITH NEURAL NETWORKS
Different types of neural networks for supervised learning include:
• Standard NN (useful for structured data)
• CNN, or convolutional neural networks (useful in computer vision)
• RNN, or recurrent neural networks (useful in speech recognition or NLP)
• Hybrid/custom NN, or a collection of NN types
13. 1.4 Structured vs Unstructured Data
• Structured data is things like databases and tables.
• Unstructured data is things like images, video, audio, and text.
14. 1.5 Why is deep learning taking off?
Deep learning is taking off for 3 reasons:
1. Data
15. • For small data, a NN can perform like linear regression or an SVM (support vector machine).
• For big data, a small NN is better than an SVM.
• For big data, a big NN is better than a medium NN, which is better than a small NN.
18. 2.1 Binary Classification
In a binary classification problem, the result is a discrete-value output. For example:
• Account hacked (1) or compromised (0)
• Object is a cat (1) or not a cat (0)
19. Example: Cat vs Non-Cat
The goal is to train a classifier whose input is an image represented by a feature vector X, and which predicts whether the corresponding label Y is 1 or 0: in this case, whether this is a cat image (1) or a non-cat image (0).
20. The value in each cell represents the pixel intensity, which will be used to create a feature vector of dimension n. In pattern recognition and machine learning, a feature vector represents an object, in this case a cat or no cat.
To create the feature vector x, the pixel intensity values are "unrolled" or "reshaped" for each color channel. The dimension of the input feature vector x is Nx = 64 x 64 x 3 = 12,288.
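A numpy sketch of that unrolling step, with random pixels standing in for a real image:

import numpy as np

image = np.random.randint(0, 256, size=(64, 64, 3))   # 64 x 64 RGB intensities
x = image.reshape(-1, 1)                              # unroll into a column vector
print(x.shape)                                        # (12288, 1) = 64 * 64 * 3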
21. 2.1.1 Neural Network Notations
Here are some of the notations:
• M is the number of examples in the dataset.
• Nx is the size of the input vector.
• Ny is the size of the output vector.
• X(1) is the first input vector; Y(1) is the first output vector.
• X = [x(1) x(2) ... x(M)]
• Y = [y(1) y(2) ... y(M)]
• L is the number of layers.
22. 2.2 Logistic Regression
Logistic regression is a learning algorithm used in a supervised learning problem where the outputs y are all either zero or one. The goal of logistic regression is to minimize the error between its predictions and the training data.
Example: Cat vs No-cat
Given an image represented by a feature vector x, the algorithm will evaluate the probability of a cat being in that image.
25. 2.2.1 Cost Function
To train the parameters w and b we need to define a cost function.
Loss function: the loss function measures the discrepancy between the prediction and the desired output.
26. To explain the loss function, consider the two cases:
• if y = 1: L(y', 1) = -log(y')
• if y = 0: L(y', 0) = -log(1 - y')
27. • The cost function is then J(w, b) = (1/m) * sum over i of L(y'(i), y(i)).
• The loss function computes the error for a single training example.
• The cost function is the average of the loss functions over the entire training set.
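A small numpy sketch of this loss and cost (the predictions are made-up numbers; the clipping guards against log(0) and is an implementation detail, not part of the slides):

import numpy as np

def cost(y_hat, y, eps=1e-12):
    # L(y', y) = -(y*log(y') + (1-y)*log(1-y')), averaged over the training set.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return losses.mean()

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])   # made-up predictions
print(cost(y_hat, y))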
29. 2.2.2 Gradient Descent
• The goal is to find w, b that minimize the cost function J(w, b).
• First we initialize w and b to 0, 0 (or to random values) and then iteratively try to improve the values.
• In logistic regression people always use 0, 0 instead of random values.
30. • The gradient descent algorithm repeats:
• w = w - alpha * dw, where alpha is the learning rate and dw is the derivative of the cost with respect to w (the slope in the w direction).
• w = w - alpha * d(J(w, b))/dw (how much the function slopes in the w direction)
• b = b - alpha * d(J(w, b))/db (how much the function slopes in the b direction)
34. 2.2.3 Vectorizing Logistic Regression
• As input we have a matrix X of shape (Nx, m) and a matrix Y of shape (Ny, m).
• We then compute in one step [z1, z2, ..., zm] = W.T * X + [b, b, ..., b]. This can be written in Python as:
Z = np.dot(W.T, X) + b # Z shape is (1, m)
A = 1 / (1 + np.exp(-Z)) # A shape is (1, m)
35. Vectorizing Logistic Regression's Gradient Output:
• dz = A - Y # dz shape is (1, m)
• dw = np.dot(X, dz.T) / m #dw shape is (Nx, 1)
• db = dz.sum() / m # db shape is (1, 1)
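Putting the forward pass and these gradients together, one full vectorized training loop might look like this (a sketch on made-up data, not the slides' exact code):

import numpy as np

rng = np.random.default_rng(0)
nx, m, alpha = 4, 100, 0.1
X = rng.random((nx, m))                          # inputs, shape (Nx, m)
Y = (X.sum(axis=0, keepdims=True) > 2.0) * 1.0   # made-up labels, shape (1, m)
W, b = np.zeros((nx, 1)), 0.0                    # zeros are fine for logistic regression

for _ in range(1000):
    Z = np.dot(W.T, X) + b               # (1, m)
    A = 1 / (1 + np.exp(-Z))             # sigmoid, (1, m)
    dZ = A - Y                           # (1, m)
    dW = np.dot(X, dZ.T) / m             # (Nx, 1)
    db = dZ.sum() / m
    W -= alpha * dW                      # gradient descent updates
    b -= alpha * db

print(((A > 0.5) == Y).mean())           # training accuracy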
36. Side Notes
The main steps for building a neural network are:
• Define the model structure (such as the number of input features and outputs).
• Initialize the model's parameters.
• Loop:
• Calculate the current loss (forward propagation)
• Calculate the current gradient (backward propagation)
• Update the parameters (gradient descent)
37. Side Notes
• Preprocessing the dataset is important.
• Tuning the learning rate (an example of a "hyperparameter") can make a big difference to the algorithm.
• kaggle.com is a good place for datasets and competitions.
39. 3.1 Neural Networks Overview
• In logistic regression we had, for input x and parameters w and b:
z = wᵀx + b; a = σ(z); loss L(a, y)
40. • In a neural network with one hidden layer we have:
z[1] = W[1]x + b[1]; a[1] = σ(z[1]); z[2] = W[2]a[1] + b[2]; a[2] = σ(z[2]); loss L(a[2], y)
41. 3.2 Shallow Neural Network Representation
• We will define a neural network that has one hidden layer.
• A NN contains an input layer, hidden layers, and an output layer.
• "Hidden" means we don't observe those layers' values in the training set.
• a0 = x (the input layer)
• a1 represents the activations of the hidden neurons.
• a2 represents the output layer.
• We call this a 2-layer NN; the input layer isn't counted.
44. Here is some information about the last image:
1) Nh = 4
2) Nx = 3
3) Shapes of the variables:
I. W1 is the matrix of the first hidden layer; it has shape (noOfHiddenNeurons, nx).
II. b1 is the bias vector of the first hidden layer; it has shape (noOfHiddenNeurons, 1).
III. z1 is the result of z1 = W1*x + b1; it has shape (noOfHiddenNeurons, 1).
IV. a1 is the result of a1 = sigmoid(z1); it has shape (noOfHiddenNeurons, 1).
V. W2 is the matrix of the second layer; it has shape (1, noOfHiddenNeurons).
VI. b2 is the bias of the second layer; it has shape (1, 1).
VII. z2 is the result of z2 = W2*a1 + b2; it has shape (1, 1).
VIII. a2 is the result of a2 = sigmoid(z2); it has shape (1, 1).
45. • Pseudocode for forward propagation in the 2-layer NN, where X has shape (Nx, m):
Z1 = np.dot(W1, X) + b1 # shape of Z1 (noOfHiddenNeurons, m)
A1 = sigmoid(Z1) # shape of A1 (noOfHiddenNeurons, m)
Z2 = np.dot(W2, A1) + b2 # shape of Z2 is (1, m)
A2 = sigmoid(Z2) # shape of A2 is (1, m)
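A quick shape check of this forward pass with random parameters (the layer sizes are assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

nx, nh, m = 3, 4, 5                    # 3 inputs, 4 hidden neurons, 5 examples
rng = np.random.default_rng(0)
X = rng.random((nx, m))
W1, b1 = rng.standard_normal((nh, nx)) * 0.01, np.zeros((nh, 1))
W2, b2 = rng.standard_normal((1, nh)) * 0.01, np.zeros((1, 1))

Z1 = np.dot(W1, X) + b1                # (noOfHiddenNeurons, m)
A1 = sigmoid(Z1)
Z2 = np.dot(W2, A1) + b2               # (1, m)
A2 = sigmoid(Z2)
print(Z1.shape, A2.shape)              # (4, 5) (1, 5)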
46. 3.4 Activation Functions
• In computational networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard computer chip circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0).
• So far we have been using sigmoid, but in some cases other functions can be a lot better.
• Sigmoid can lead to a gradient descent problem where the updates are very small.
• The sigmoid activation function's range is [0, 1]: A = 1 / (1 + np.exp(-z)) # where z is the input matrix
• The tanh activation function's range is [-1, 1] (a shifted version of the sigmoid function).
47. • It turns out that the tanh activation usually works better than the sigmoid activation function for hidden units.
• A disadvantage of sigmoid and tanh is that if the input is too small or too large, the slope is near zero, which causes the vanishing gradient problem and slows gradient descent.
• One of the popular activation functions that solved the slow convergence of gradient descent is the ReLU function: RELU = max(0, z) # if z is negative the slope is 0, and if z is positive the slope is 1.
• A basic rule for choosing activation functions: if your classification output is between 0 and 1, use sigmoid for the output activation and ReLU for the others.
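The three activation functions discussed above as numpy one-liners (a sketch; tanh is built into numpy):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # range (0, 1); saturates for large |z|

def tanh(z):
    return np.tanh(z)             # range (-1, 1); zero-centered

def relu(z):
    return np.maximum(0, z)       # slope 0 for z < 0, slope 1 for z > 0

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))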
49. Side Notes
• In a NN you will make a lot of choices, such as:
• Number of hidden layers.
• Number of neurons in each hidden layer.
• Learning rate (the most important parameter).
• Activation functions.
• And others.
51. 3.4 Backward Propagation
NN parameters:
o n[0] = Nx
o n[1] = NoOfHiddenNeurons
o n[2] = NoOfOutputNeurons = 1
o W1 shape is (n[1],n[0])
o b1 shape is (n[1],1)
o W2 shape is (n[2],n[1])
o b2 shape is (n[2],1)
53. Forward propagation:
o Z1 = W1A0 + b1 # A0 is X
o A1 = g1(Z1)
o Z2 = W2A1 + b2
o A2 = Sigmoid(Z2) # Sigmoid because the output is between 0 and 1
54. Back propagation:
o dZ2 = A2 - Y
o dW2 = (dZ2 * A1.T) / m
o db2 = Sum(dZ2) / m
o dZ1 = (W2.T * dZ2) * g'1(Z1) # element-wise product (*)
o dW1 = (dZ1 * A0.T) / m # A0 = X
o db1 = Sum(dZ1) / m
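These updates in numpy, keeping the notation above and assuming a sigmoid hidden layer so that g'1(Z1) = A1 * (1 - A1):

import numpy as np

def backward(X, Y, W2, A1, A2):
    # One backpropagation pass for the 2-layer network above (A0 = X).
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * A1 * (1 - A1)   # element-wise product
    dW1 = np.dot(dZ1, X.T) / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X, Y = rng.random((3, 5)), rng.integers(0, 2, (1, 5))   # made-up data
W2, A1, A2 = rng.random((1, 4)), rng.random((4, 5)), rng.random((1, 5))
dW1, db1, dW2, db2 = backward(X, Y, W2, A1, A2)
print(dW1.shape, dW2.shape)   # (4, 3) (1, 4)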
56. 3.5 Random Initialization
• In logistic regression it wasn't important to initialize the weights randomly, but in a NN we have to initialize them randomly.
• If we initialize all the weights to zeros in a NN it won't work (initializing the bias to zero is OK):
• All hidden units will be completely identical (symmetric) and compute exactly the same function.
• On each gradient descent iteration, all the hidden units will always get the same update.
57. • To solve this we initialize the W's with small random numbers:
W1 = np.random.randn(2, 2) * 0.01 # 0.01 makes it small enough
b1 = np.zeros((2, 1)) # it's OK to have b as zeros; it doesn't cause the symmetry problem
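A quick demonstration of the symmetry problem on made-up data (a sketch, not the slides' code): with all-zero weights, the two hidden units receive identical updates on every iteration and never differentiate.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((2, 50))
Y = (X[0:1] > 0.5) * 1.0
m, alpha = 50, 0.5

W1, b1 = np.zeros((2, 2)), np.zeros((2, 1))   # the bad, all-zero initialization
W2, b2 = np.zeros((1, 2)), np.zeros((1, 1))
for _ in range(100):
    A1 = sigmoid(np.dot(W1, X) + b1)
    A2 = sigmoid(np.dot(W2, A1) + b2)
    dZ2 = A2 - Y
    dZ1 = np.dot(W2.T, dZ2) * A1 * (1 - A1)
    W2 -= alpha * np.dot(dZ2, A1.T) / m
    b2 -= alpha * dZ2.sum(axis=1, keepdims=True) / m
    W1 -= alpha * np.dot(dZ1, X.T) / m
    b1 -= alpha * dZ1.sum(axis=1, keepdims=True) / m

print(W1)   # both rows are identical: the hidden units stayed symmetric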
61. 4.1 Deep L-layer Neural Network
• A shallow NN is a NN with one or two layers.
• A deep NN is a NN with three or more layers.
• We use the notation L to denote the number of layers in a NN.
• n[l] is the number of neurons in a specific layer l.
• n[0] denotes the number of neurons in the input layer; n[L] denotes the number of neurons in the output layer.
• g[l] is the activation function in layer l.
62. 4.2 Forward Propagation in a Deep Network
Forward propagation general rule for m inputs:
• Z[l] = W[l]A[l-1] + b[l]
• A[l] = g[l](Z[l])
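This general rule as a loop over layers (a sketch; storing the parameters in a dict and using ReLU for hidden layers with a sigmoid output are assumed choices):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, params, L):
    # Z[l] = W[l] A[l-1] + b[l];  A[l] = g[l](Z[l]);  A[0] = X.
    A = X
    for l in range(1, L + 1):
        Z = np.dot(params["W" + str(l)], A) + params["b" + str(l)]
        A = sigmoid(Z) if l == L else np.maximum(0, Z)   # ReLU hidden, sigmoid output
    return A

layers = [3, 4, 4, 1]    # n[0..L]; W[l] is (n[l], n[l-1]), b[l] is (n[l], 1)
rng = np.random.default_rng(0)
params = {}
for l in range(1, len(layers)):
    params["W" + str(l)] = rng.standard_normal((layers[l], layers[l - 1])) * 0.01
    params["b" + str(l)] = np.zeros((layers[l], 1))

print(forward(rng.random((3, 5)), params, L=3).shape)   # (1, 5)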
63. 4.2.1 Matrix Dimensions
• The dimension of W[l] is (n[l], n[l-1]); it can be read right to left.
• The dimension of b[l] is (n[l], 1).
• dW has the same shape as W, and db has the same shape as b.
• The dimensions of Z[l], A[l], dZ[l], and dA[l] are (n[l], m).
65. 4.4 Parameters vs Hyperparameters
• The main parameters of the NN are W and b.
• Hyperparameters (parameters that control the algorithm) include:
• Learning rate.
• Number of iterations.
• Number of hidden layers L.
• Number of hidden units n.
• Choice of activation functions.
• You have to try values of the hyperparameters yourself.
67. 4.5 NN and the Human Brain!
• The analogy that "it is like the brain" has become a really oversimplified explanation.
• There is a very simplistic analogy between a single logistic unit and a single neuron in the brain.
• No human today understands how a neuron in the human brain works.
• No human today knows exactly how many neurons are in the brain.
Editor's Notes
#65:
Face recognition application:
Image ==> Edges ==> Face parts ==> Faces ==> desired face
Audio recognition application:
Audio ==> Low level sound features like (sss,bb) ==> Phonemes ==> Words ==> Sentences