Study Structure

Data science study material

Python Tutorial:

1. Operator, Keywords
2. Variables, Expression
3. Data Types (Int, String, List, Tuple, Set, Dictionary)
4. Loops & Control Flow (conditional statements, for, if, if-else, else, while, break, continue, pass)
5. Functions (Built-in, User-Defined)
6. OOP Concepts
7. DSA in Python
8. Exception Handling
9. File Handling
10. Packages and libraries
11. Collections
12. Exercises
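
A tiny illustrative sketch tying a few of the topics above together (built-in data types, a loop with control flow, a user-defined function, and exception handling); the subject names and marks are made up.

def average(scores):
    """Return the mean of a list of numbers, or None for an empty list."""
    try:
        return sum(scores) / len(scores)
    except ZeroDivisionError:
        return None

# Built-in data types
marks = {"maths": [70, 85], "physics": [60, 75]}   # dict of lists
subjects = ("maths", "physics")                     # tuple

# Loop and control flow
for subject in subjects:
    avg = average(marks[subject])
    if avg is not None and avg >= 70:
        print(f"{subject}: pass ({avg:.1f})")
    else:
        print(f"{subject}: needs work ({avg})")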
Data Preprocessing
1. Pandas
2. Numpy
3. Matplotlib
4. Seaborn
5. Data Cleaning
6. Handling Missing Values
7. Inconsistent Data
8. Data Transformation
9. Data Reduction (PCA)
10. Handling Outlier
11. Feature engineering
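
A minimal preprocessing sketch with pandas and scikit-learn covering missing values, encoding, scaling, and PCA-based data reduction; the column names and values are hypothetical.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "age": [25, None, 40, 35],
    "salary": [30000, 52000, None, 61000],
    "city": ["Delhi", "Pune", "Delhi", None],
})

# Handling missing values
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Encoding a categorical column
df = pd.get_dummies(df, columns=["city"])

# Feature scaling
scaled = StandardScaler().fit_transform(df)

# Data reduction with PCA (keep 2 components)
reduced = PCA(n_components=2).fit_transform(scaled)
print(reduced.shape)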
Visualization
Matplotlib: Line Plot, Scatter Plot, Bar Plot, Pie Chart, Donut Chart, Gantt Chart, Error Bar Graph; Advanced Plots: Stacked Plot, Area Plot, 3D Plot, Box Plot
Seaborn: Heat Map, Pair Plot, Count Plot, Swarm Plot, Point Plot, Violin Plot, KDE Plot
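
A short sketch combining Matplotlib and Seaborn on a toy DataFrame (line plot, scatter plot, and a Seaborn box plot); the data is randomly generated for illustration.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": np.arange(10),
    "y": np.random.randn(10).cumsum(),
    "group": ["A", "B"] * 5,
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(df["x"], df["y"])                      # line plot (Matplotlib)
axes[1].scatter(df["x"], df["y"])                   # scatter plot (Matplotlib)
sns.boxplot(data=df, x="group", y="y", ax=axes[2])  # box plot (Seaborn)
plt.tight_layout()
plt.show()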

Web Development
1. Django
2. Flask
3. Postman
4. Github
5. Web Scraping
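
A minimal web-scraping sketch using requests and BeautifulSoup; the URL is a placeholder, and real scraping should respect robots.txt and the site's terms of use.

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com", timeout=10)  # placeholder URL
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# Collect the target and text of every link on the page
for link in soup.find_all("a"):
    print(link.get("href"), link.get_text(strip=True))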

Statistics for Machine learning


Machine Learning
1. Introduction
2. Data Preprocessing (Cleaning, EDA, Outlier handling, Scaling, Encoding, Imbalanced data)
3. Supervised Learning (Regression & Classification, Gradient Descent)
4. Unsupervised Learning
5. Reinforcement Learning
5a. Algorithms / Models:
Regression: Linear, Polynomial, KNN Regressor, Stepwise, SVR, Decision Tree Regressor, Random Forest, XGBoost Regressor, Ridge, Lasso, Elastic Net
Classification: Logistic Regression, KNN, SVC, Naive Bayes, Decision Tree, Random Forest, CatBoost, XGBoost Classifier
Unsupervised: K-Means, Mini-Batch K-Means, Mean-Shift, Spectral Clustering, DBSCAN, Fuzzy C-Means, OPTICS, Hierarchical Clustering
Reinforcement: Q-Learning, SARSA, ARIMA, MDP, TD Learning, Monte Carlo Control

6. Dimensionality Reduction (PCA, LDA, GDA, overfitting & underfitting)


7. Ensembling.
8. Time-Series Analysis
9. Gradient Descent Algorithms
(GD, SGD, Mini-Batch GD, Optimization Technique for GD, Momentum-based GD)
10. Projects
11. Interview Questions.
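
A small supervised-learning sketch with scikit-learn (train/test split, fitting a Random Forest classifier from the list above, and scoring it); it uses the built-in iris dataset so it is self-contained.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))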

Evaluation and Model Selection


1. Bias Variance Trade-off
2. Model Evaluation Techniques
3. Importance of splitting the dataset into train, test & validation sets
4. Cross-Validation techniques
Regression Evaluation Techniques: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, Mean Absolute Percentage Error, R2 Score
Classification Evaluation Techniques: Accuracy Score, Precision, Recall, F1 Score, Confusion Matrix, ROC-AUC
5. Hyperparameter Tuning (GridSearchCV & RandomizedSearchCV)
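
A sketch of cross-validation and hyperparameter tuning with GridSearchCV on the built-in iris dataset; the parameter grid is illustrative only.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation for a single model
print(cross_val_score(SVC(), X, y, cv=5).mean())

# Grid search over a small hyperparameter grid
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)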

Natural Language processing


1. Text Processing
2. Remove Stop Words
3. Tokenization (Sentence & Word tokens)
4. Stemming
5. Lemmatization
6. Large Language Models
7. Sequence-to-sequence tasks
8. Vector Space Model
9. Capstone Projects
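
A minimal NLTK sketch covering tokenization, stop-word removal, stemming, and lemmatization; the sample sentence is made up, and the download calls fetch the required corpora once (the set of packages needed can vary by NLTK version).

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required corpora/models
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The cats were running quickly across the gardens."
tokens = nltk.word_tokenize(text.lower())

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in filtered])          # e.g. 'run', 'quickli'
print([lemmatizer.lemmatize(t) for t in filtered])  # e.g. 'cat', 'garden'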
Deep Learning (Neural Networks)
1. Artificial Neural Networks
 Basic Neural Networks
 Single-Layer Perceptron
 Multi-Layer Perceptron
 Forward & backward propagation
 Feed-forward neural networks
 Cost functions in neural networks
 How gradient descent works
 Vanishing and exploding gradients problem
 Batch Normalization in DL
 Difference between the Sequential and Functional APIs
 Choosing the optimal number of epochs
 Fine-tuning & hyperparameters
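
A minimal multi-layer perceptron using the Keras Sequential API; the layer sizes and the randomly generated toy data are illustrative only.

import numpy as np
from tensorflow import keras

# Toy binary-classification data (200 samples, 10 features)
X = np.random.randn(200, 10)
y = (X.sum(axis=1) > 0).astype("float32")

model = keras.Sequential([
    keras.layers.Input(shape=(10,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2, verbose=0)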

2. Activation Functions (Pytorch & Tensorflow)

3. Convolutional Neural Networks (Pooling, Padding, Stride)


 Digital image processing
 Pooling layers
 ConvNets
 CNNs for image classification
 Different CNN architectures
 Pre-trained models for image classification
 Difference between object detection and image segmentation
 YOLO v2 object detection
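
A minimal convolutional network in Keras showing convolution with padding and stride plus a pooling layer; the (28, 28, 1) input shape mirrors an MNIST-style grayscale image and is an assumption.

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(32, kernel_size=3, strides=1, padding="same", activation="relu"),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Conv2D(64, kernel_size=3, strides=2, padding="valid", activation="relu"),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),   # 10-class image classification
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()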

4. Recurrent Neural Networks (seq2seq, LSTM, Gated LSTM)


 Time-series data
 RNN architecture
 Sentiment analysis using RNNs
 Time-series forecasting using RNNs
 The short-term memory problem in RNNs
 Bi-directional RNN architecture
 Intro to Long Short-Term Memory (LSTM)
 LSTM architecture
 LSTM: derivation of backpropagation through time
 Text generation with LSTM and GRU (Gated Recurrent Unit)
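
A minimal bidirectional LSTM in Keras for a sequence task such as sentiment analysis; the vocabulary size, sequence length, and random token data are placeholders.

import numpy as np
from tensorflow import keras

vocab_size, seq_len = 5000, 100
X = np.random.randint(0, vocab_size, size=(500, seq_len))  # toy token IDs
y = np.random.randint(0, 2, size=(500,))                   # toy binary labels

model = keras.Sequential([
    keras.layers.Input(shape=(seq_len,)),
    keras.layers.Embedding(vocab_size, 32),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=64, verbose=0)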

5. Generative Adversarial Networks (GANs using Keras, mode collapse)


 Generative learning
 Autoencoders
 Types of autoencoders
 Linear, Stacked, Convolutional, Recurrent, Denoising, Sparse autoencoders
 Variational Autoencoder
 Contractive Autoencoder (CAE)
 Autoencoders with TensorFlow & PyTorch
 Basics of Generative Adversarial Networks (GANs)
 Building a GAN using Keras
 CycleGAN
 StyleGAN
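
A minimal dense autoencoder in Keras that compresses 784-dimensional inputs (for example, flattened 28x28 images) into a 32-dimensional latent code; the data here is random and only for illustration.

import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 784).astype("float32")  # toy data in [0, 1]

autoencoder = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(32, activation="relu"),    # latent representation
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(784, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)  # reconstruct the input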

6. Deep Q-Learning (also using it with Keras)


7. TensorFlow(Keras)
8. Application of Deep Learning
9. Capstone Projects.

ML – Deployment
1. Using web app (Streamlit) on Heroku
2. Using Flask
3. Models as APIs using FastAPI (see the sketch after this list)
4. AWS Cloud Computing
 Intro to cloud computing
 Auto Scaling & DNS
 Virtual Private Cloud (VPC)
 Simple Storage Service (S3)
 Databases and in-memory data stores (RDS, DynamoDB)
 Application services
 AWS Lambda & CLI
 SNS
 SQS
 CloudWatch
 Athena
 QuickSight and Kinesis

5. ML Applications
6. ML Projects
7. Data Visualization
8. Interview Questions
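
As referenced in item 3 above, here is a minimal sketch of serving a model as an API with FastAPI; to stay self-contained it trains a tiny scikit-learn model at startup, whereas in practice you would load a previously saved model instead.

from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

app = FastAPI()

# Small stand-in model; replace with loading your trained model.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

class Features(BaseModel):
    values: list[float]   # one row of input features (4 values for iris)

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])
    return {"prediction": int(prediction[0])}

# Run with: uvicorn main:app --reload   (assuming this file is saved as main.py)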

SQL
1. Basics of SQL
2. Creating SQL Databases
3. Refining Selections (SELECT DISTINCT, LIKE, NOT LIKE, ILIKE, LIMIT, BETWEEN, BETWEEN ... AND)
4. SQL Queries (Statements & Functions)
5. SQL Clauses
6. SQL Operators
7. SQL Aggregate Functions
8. SQL View
9. SQL Indexes
10. Miscellaneous
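
A small sketch of the refining clauses listed above (SELECT DISTINCT, LIKE, BETWEEN, LIMIT) using Python's built-in sqlite3 module; the table and rows are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, city TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Asha", "Delhi", 50000), ("Ravi", "Pune", 65000),
     ("Meena", "Delhi", 72000), ("Arun", "Chennai", 48000)],
)

print(conn.execute("SELECT DISTINCT city FROM employees").fetchall())
print(conn.execute("SELECT name FROM employees WHERE name LIKE 'A%'").fetchall())
print(conn.execute(
    "SELECT name, salary FROM employees "
    "WHERE salary BETWEEN 50000 AND 70000 ORDER BY salary LIMIT 2"
).fetchall())
conn.close()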

Power BI Tutorial
1. Introduction & Setup
2. Power BI Query
3. Power BI Dashboards & Visualization
4. Power BI KPI chart formatting
5. Power BI clustered bar chart formatting
6. Showing trends with line charts
7. DAX Introduction.
Statistics
1. Statistics foundation for data Science.
2. Probability and statistical inferences
3. Hypothesis Testing
Types of Sampling Distribution
Degrees of Freedom
Z-Test
t-Test
Chi-Square Test
4. Confidence Interval
5. Correlation and Covariance
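
A short hypothesis-testing sketch with SciPy: a one-sample t-test and a 95% confidence interval for the mean; the sample values and the null-hypothesis mean of 50 are made up.

import numpy as np
from scipy import stats

sample = np.array([52.1, 48.9, 50.5, 53.2, 49.8, 51.7, 50.2, 52.8])

# H0: population mean = 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")   # reject H0 if p < 0.05

# 95% confidence interval for the mean (t-distribution, n-1 degrees of freedom)
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print("95% CI:", ci)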

Data Structures and Algorithms


1. Array Basics
2. Problem Solving Techniques
3. Time Complexity & Bit Manipulation.
4. Sorting
5. Searching & String Algorithm
6. Linked list
7. Two Pointer Techniques
8. Stack & Queue -Implementation & Problems.
9. Tree, Trie, Ternary Search tree, Heap Data structure.
10. Recursion & Greedy Algorithm
11. Combinatorial problems with backtracking
12. Hashing
13. Graph Theory
14. Dynamic Programming
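
A small sketch of the two-pointer technique: checking whether a sorted array contains a pair summing to a target value in O(n) time; the arrays are illustrative.

def has_pair_with_sum(arr, target):
    left, right = 0, len(arr) - 1
    while left < right:
        current = arr[left] + arr[right]
        if current == target:
            return True
        elif current < target:
            left += 1    # need a larger sum
        else:
            right -= 1   # need a smaller sum
    return False

print(has_pair_with_sum([1, 3, 4, 6, 8, 11], 14))  # True (3 + 11)
print(has_pair_with_sum([1, 3, 4, 6, 8, 11], 2))   # False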

Big Data (Hadoop + Spark)


1. Introduction to Big Data
2. The power of Spark
3. Data wrangling with Spark
4. Debugging and Optimization
5. Introduction to Data Lakes.
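
A minimal data-wrangling sketch with PySpark; the CSV path and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wrangling-demo").getOrCreate()

df = spark.read.csv("data/sales.csv", header=True, inferSchema=True)  # placeholder path

result = (
    df.filter(F.col("amount") > 0)                 # drop invalid rows
      .groupBy("region")                           # aggregate per region
      .agg(F.sum("amount").alias("total_sales"),
           F.countDistinct("customer_id").alias("customers"))
      .orderBy(F.desc("total_sales"))
)
result.show()
spark.stop()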

Question and Answer Session:


1. The four pillars of deep learning are artificial neural networks, backpropagation, activation functions, and gradient descent.
2. Three types of neural networks are feedforward networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs).
3. Applications of deep learning: Computer Vision (object detection & recognition, image classification & image segmentation), NLP (automatic text generation, language translation, sentiment analysis & speech recognition), and Reinforcement Learning (games, robotics & control systems).

4. Challenges in deep learning include data availability, computational resources, training time, interpretability, and overfitting.

5. Transfer learning is a technique in which a pre-trained model, trained on a sufficiently large dataset, is reused for a similar task related to the one it was originally trained on.

6. Fine-tuning in transfer learning refers to taking a model pre-trained on one task and training it further on a new, specific task.

7. Tackle Underfitting problem:


 Increase the training time / dataset size
 Increase the complexity of the model
 Add more features to the data
 Reduce the regularization parameter

8. Tackle Overfitting problem


 Reduce the number of features / use dimensionality reduction
 Reduce the complexity of the model
 Use data augmentation techniques
 Remove outliers from the dataset

9. what is the difference between Descriptive and inferential statistics?


Descriptive statistics describe the features of a dataset, whereas inferential statistics use a random sample of data taken from a population to make inferences or predictions about that population.

10. Standard deviation is a measure of the amount of variation or dispersion in a set of values.

11. A p-value is a measure of the probability that the observed difference could have occurred by random chance. A p-value < 0.05 is commonly treated as statistically significant.

12. What is the difference between correlation and causation?


Correlation measures the relationship between two variables, whereas causation indicates that one variable causes an effect on the other.

13. Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event.

14. The normal (Gaussian) distribution is a bell-shaped curve in which data is symmetrically distributed; the mean, median & mode of the dataset are equal and located at the center.

15. The Law of Large Numbers states that as the size of a sample increases, the sample mean gets closer to the expected value (the population mean).

16. A confidence interval is a range of values derived from a dataset that is believed to contain the true value of an unknown parameter at a specified confidence level (for example, 95%).
17. Central Limit Theorem states that the distribution of the sample mean approaches a normal
distribution as the sample size becomes larger, regardless of the population’s distribution.
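
A small simulation sketch of items 15 and 17: sample means drawn from a skewed exponential population stay near the population mean (Law of Large Numbers), and their spread shrinks roughly like 1/sqrt(n) (Central Limit Theorem). The sample sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
population_mean = 1.0                     # mean of Exponential(scale=1.0)

for n in (10, 100, 1000):
    sample_means = rng.exponential(scale=1.0, size=(5000, n)).mean(axis=1)
    print(f"n={n:>5}: mean of sample means = {sample_means.mean():.4f}, "
          f"std of sample means = {sample_means.std():.4f}")
# The mean of the sample means stays near 1.0 (LLN) and their spread shrinks
# roughly like 1/sqrt(n) (CLT).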

18. Cross-validation is a technique used to validate an ML model by splitting the data into training and validation sets multiple times and averaging the results.

19. Gradient descent is an optimization algorithm used to minimize the cost function in ML by iteratively adjusting model parameters in the direction of the steepest descent of the cost function.
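
A minimal NumPy sketch of gradient descent fitting a straight line y = w*x + b by repeatedly stepping opposite to the gradient of the mean-squared-error cost; the learning rate and synthetic data are illustrative.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=100)   # true w=3, b=2 plus noise

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    y_pred = w * x + b
    error = y_pred - y
    dw = 2 * np.mean(error * x)    # dMSE/dw
    db = 2 * np.mean(error)        # dMSE/db
    w -= lr * dw                   # step in the direction of steepest descent
    b -= lr * db

print(f"learned w = {w:.2f}, b = {b:.2f}")   # close to 3 and 2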
-------------------------------------------------------------------------------------------------------------------------
20. AWS – Amazon Web Service Cloud Practitioner Essentials
Objective:
1. Working definition of AWS
2. Difference between on-premises, hybrid cloud and all-in-cloud.
3. Basic Global infrastructure of AWS
4. Benefits of AWS
5. Core AWS Services
6. Use Cases
7. Shared Responsibility Model
8. Core Security Service within the AWS cloud.
9. Basics of AWS cloud migration.
10. Core Billing, Account management and pricing model.
Module – 1: Intro to Amazon Web Services
The three cloud deployment models are cloud-based, on-premises, and hybrid.
Cloud-Based:
Run all parts of the application in the cloud.
Migrate existing applications to the cloud.
Design and build new applications in the cloud.
In a cloud-based deployment model, you can migrate existing applications to the cloud, or you can
design and build new applications in the cloud. You can build those applications on low-level
infrastructure that requires your IT staff to manage them. Alternatively, you can build them using
higher-level services that reduce the management, architecting, and scaling requirements of the core
infrastructure.

On-Premises:
Deploy resources by using virtualization and resource management tools.
Increase resource utilization by using application management and virtualization technologies.
On-premises deployment is also known as a private cloud deployment. In this model, resources are
deployed on premises by using virtualization and resource management tools.
Hybrid:
Connect cloud-based resources to on-premises infrastructure.
Integrate cloud-based resources with legacy IT applications.
In a hybrid deployment, cloud-based resources are connected to on-premises infrastructure. You might
want to use this approach in a number of situations. For example, you have legacy applications that are
better maintained on premises, or government regulations require your business to keep certain records
on premises.
Benefits of Cloud:
1. Trade Upfront expense for variable expense:
Upfront expense refers to data centers, physical servers, and other resources that you would need to
invest in before using them. Variable expense means you only pay for computing resources you
consume instead of investing heavily in data centers and servers before you know how you’re going to
use them.
By taking a cloud computing approach that offers the benefit of variable expense, companies can
implement innovative solutions while saving on costs.
2. Stop spending money running and maintaining data centers:
A benefit of cloud computing is the ability to focus less on these tasks and more on your applications
and customers.
3. Stop guessing capacity:
With cloud computing, you don’t have to predict how much infrastructure capacity you will need
before deploying an application.
For example, you can launch Amazon EC2 instances when needed, and pay only for the compute time
you use. Instead of paying for unused resources or having to deal with limited capacity, you can access
only the capacity that you need. You can also scale in or scale out in response to demand.
4. Benefit from massive economies of scale:
By using cloud computing, you can achieve a lower variable cost than you can get on your own.
Because usage from hundreds of thousands of customers can aggregate in the cloud, providers, such as
AWS, can achieve higher economies of scale. The economy of scale translates into lower pay-as-you-
go prices.
5. Increase speed and agility:
The flexibility of cloud computing makes it easier for you to develop and deploy applications.
This flexibility provides you with more time to experiment and innovate. When computing in data
centers, it may take weeks to obtain new resources that you need. By comparison, cloud computing
enables you to access new resources within minutes.
6. Go global in Minutes:
The global footprint of the AWS Cloud enables you to deploy applications to customers around the
world quickly, while providing them with low latency. This means that even if you are located in a
different part of the world than your customers, customers are able to access your applications with
minimal delays.
Later in this course, you will explore the AWS global infrastructure in greater detail. You will examine
some of the services that you can use to deliver content to customers around the world.
A. What is cloud computing?
It is the on-demand delivery of IT resources and applications through the internet with pay-as-you-go pricing.
B. What is another name for on-premises deployment? Private Cloud Deployment
C. How does the scale of cloud computing help you to save costs?
The aggregate cloud usage from a large number of customers results in lower pay-as-you-go prices.
Amazon Elastic Compute Cloud (Amazon EC2) provides secure, resizable compute capacity in the cloud as Amazon EC2 instances.
Imagine you are responsible for the architecture of your company's resources and need to support new websites. With traditional on-premises resources, you have to do the following:
 Spend money upfront to purchase hardware.
 Wait for the servers to be delivered to you.
 Install the servers in your physical data center.
 Make all the necessary configurations.
By comparison, with an Amazon EC2 instance you can use a virtual server to run applications in the AWS Cloud.
You can provision and launch an Amazon EC2 instance within minutes.
You can stop using it when you have finished running a workload.
You pay only for the compute time you use when an instance is running, not when it is stopped or terminated.
You can save costs by paying only for the server capacity that you need or want.
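
A sketch of provisioning an EC2 instance with boto3, the AWS SDK for Python; the AMI ID, key pair name, and region are placeholders, not real values, and valid AWS credentials are assumed.

import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")   # placeholder region

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="t2.micro",
    KeyName="my-key-pair",             # placeholder key pair
    MinCount=1,
    MaxCount=1,
)
instance = instances[0]
instance.wait_until_running()
instance.reload()
print(instance.id, instance.state)

# Stop the instance when the workload is finished; billing for compute time stops too.
instance.stop()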
AWS EC2 Instance Families:
1. General Purpose:
General purpose instances provide a balance of compute, memory, and
networking resources. You can use them for a variety of workloads, such as:
application servers
gaming servers
backend servers for enterprise applications
small and medium databases
Suppose that you have an application in which the resource needs for compute,
memory, and networking are roughly equivalent. You might consider running it on
a general purpose instance because the application does not require optimization
in any single resource area.
2. Compute Optimized:
instances are ideal for compute-bound applications that benefit from high-
performance processors. Like general purpose instances, you can use compute
optimized instances for workloads such as web, application, and gaming servers.

However, the difference is compute optimized applications are ideal for high-
performance web servers, compute-intensive applications servers, and dedicated
gaming servers. You can also use compute optimized instances for batch
processing workloads that require processing many transactions in a single group.
3. Memory Optimized Instances:
are designed to deliver fast performance for workloads that process large
datasets in memory. In computing, memory is a temporary storage area. It holds
all the data and instructions that a central processing unit (CPU) needs to be able
to complete actions. Before a computer program or application is able to run, it is
loaded from storage into memory. This preloading process gives the CPU direct
access to the computer program.

Suppose that you have a workload that requires large amounts of data to be
preloaded before running an application. This scenario might be a high-
performance database or a workload that involves performing real-time
processing of a large amount of unstructured data. In these types of use cases,
consider using a memory optimized instance. Memory optimized instances enable
you to run workloads with high memory needs and receive great performance.
4. Accelerated Computing Instance: use hardware accelerators, or coprocessors, to
perform some functions more efficiently than is possible in software running on
CPUs. Examples of these functions include floating-point number calculations,
graphics processing, and data pattern matching.

In computing, a hardware accelerator is a component that can expedite data


processing. Accelerated computing instances are ideal for workloads such as
graphics applications, game streaming, and application streaming.
5. Storage Optimized Instance: are designed for workloads that require high, sequential
read and write access to large datasets on local storage. Examples of workloads
suitable for storage optimized instances include distributed file systems, data
warehousing applications, and high-frequency online transaction processing
(OLTP) systems.

In computing, the term input/output operations per second (IOPS) is a metric that
measures the performance of a storage device. It indicates how many different
input or output operations a device can perform in one second. Storage optimized
instances are designed to deliver tens of thousands of low-latency, random IOPS
to applications.

You can think of input operations as data put into a system, such as records
entered into a database. An output operation is data generated by a server. An
example of output might be the analytics performed on the records in a database.
If you have an application that has a high IOPS requirement, a storage optimized
instance can provide better performance over other instance types not optimized
for this kind of use case.
Pricing Policies:
On-Demand Instances are ideal for short-term, irregular workloads that cannot be
interrupted. No upfront costs or minimum contracts apply. The instances run continuously
until you stop them, and you pay for only the compute time you use.

Sample use cases for On-Demand Instances include developing and testing applications
and running applications that have unpredictable usage patterns. On-Demand Instances
are not recommended for workloads that last a year or longer because these workloads
can experience greater cost savings using Reserved Instances.
Reserved Instances are a billing discount applied to the use of On-Demand Instances in
your account. There are two available types of Reserved Instances:

 Standard Reserved Instances


 Convertible Reserved Instances
You can purchase Standard Reserved and Convertible Reserved Instances for a 1-
year or 3-year term. You realize greater cost savings with the 3-year option.

Standard Reserved Instances: This option is a good fit if you know the EC2
instance type and size you need for your steady-state applications and in which
AWS Region you plan to run them. Reserved Instances require you to state the
following qualifications:
 Instance type and size: For example, m5.xlarge
 Platform description (operating system): For example, Microsoft Windows
Server or Red Hat Enterprise Linux
 Tenancy: Default tenancy or dedicated tenancy
You have the option to specify an Availability Zone for your EC2 Reserved
Instances. If you make this specification, you get EC2 capacity reservation. This
ensures that your desired amount of EC2 instances will be available when you
need them.
Convertible Reserved Instances: If you need to run your EC2 instances in
different Availability Zones or different instance types, then Convertible Reserved
Instances might be right for you. Note: You trade in a deeper discount when you
require flexibility to run your EC2 instances.

At the end of a Reserved Instance term, you can continue using the Amazon EC2
instance without interruption. However, you are charged On-Demand rates until
you do one of the following:
Terminate the instance.
Purchase a new Reserved Instance that matches the instance attributes (instance
family and size, Region, platform, and tenancy).
AWS offers Savings Plans for a few compute services, including Amazon EC2. EC2
Instance Savings Plans reduce your EC2 instance costs when you make an
hourly spend commitment to an instance family and Region for a 1-year or 3-year
term. This term commitment results in savings of up to 72 percent compared to
On-Demand rates. Any usage up to the commitment is charged at the discounted
Savings Plans rate (for example, $10 per hour). Any usage beyond the
commitment is charged at regular On-Demand rates.

The EC2 Instance Savings Plans are a good option if you need flexibility in your
Amazon EC2 usage over the duration of the commitment term. You have the
benefit of saving costs on running any EC2 instance within an EC2 instance family
in a chosen Region (for example, M5 usage in N. Virginia) regardless of Availability
Zone, instance size, OS, or tenancy. The savings with EC2 Instance Savings Plans
are similar to the savings provided by Standard Reserved Instances.

Unlike Reserved Instances, however, you don't need to specify up front what EC2
instance type and size (for example, m5.xlarge), OS, and tenancy to get a
discount. Further, you don't need to commit to a certain number of EC2 instances
over a 1-year or 3-year term. Additionally, the EC2 Instance Savings Plans don't
include an EC2 capacity reservation option.

Later in this course, you'll review AWS Cost Explorer, which you can use to
visualize, understand, and manage your AWS costs and usage over time. If you're
considering your options for Savings Plans, you can use AWS Cost Explorer to
analyze your Amazon EC2 usage over the past 7, 30, or 60 days. AWS Cost
Explorer also provides customized recommendations for Savings Plans. These
recommendations estimate how much you could save on your monthly Amazon
EC2 costs, based on previous Amazon EC2 usage and the hourly commitment
amount in a 1-year or 3-year Savings Plan.
Spot Instances are ideal for workloads with flexible start and end times, or that
can withstand interruptions. Spot Instances use unused Amazon EC2 computing
capacity and offer you cost savings at up to 90% off of On-Demand prices.

Suppose that you have a background processing job that can start and stop as
needed (such as the data processing job for a customer survey). You want to start
and stop the processing job without affecting the overall operations of your
business. If you make a Spot request and Amazon EC2 capacity is available, your
Spot Instance launches. However, if you make a Spot request and Amazon EC2
capacity is unavailable, the request is not successful until capacity becomes
available. The unavailable capacity might delay the launch of your background
processing job.

After you have launched a Spot Instance, if capacity is no longer available or


demand for Spot Instances increases, your instance may be interrupted. This
might not pose any issues for your background processing job. However, in the
earlier example of developing and testing applications, you would most likely
want to avoid unexpected interruptions. Therefore, choose a different EC2
instance type that is ideal for those tasks.
Dedicated Hosts are physical servers with Amazon EC2 instance capacity that is
fully dedicated to your use.

You can use your existing per-socket, per-core, or per-VM software licenses to help
maintain license compliance. You can purchase On-Demand Dedicated Hosts and
Dedicated Hosts Reservations. Of all the Amazon EC2 options that were covered,
Dedicated Hosts are the most expensive.

Module – 2: Compute in the cloud


Module -3: Global Infrastructure and Reliability
Module -4: Networking
Module -5: Storage and Databases
Module -6: Security
Module -7: Monitoring and Analytics
Module-8: Pricing and Support
Module-9: Migration and Innovation
Module-10: The Cloud Journey
Module-11: AWS Certified Cloud Practitioner Basics

----------------------------------------------------------------------------------------------------------------------------
SQL Fundamental:

1. Understand databases and their structure

2. Extract information from databases using SQL code
* A relational database defines the relationships between tables of data inside the database.
Tables:
* Tables are the backbone of a database.
* Field names should be (a) lowercase, (b) contain no spaces, (c) be singular, and (d) no two fields should share the same name.
Introduction to Data:
* Different types of data are stored differently and take up different amounts of space.
* Some operations only apply to certain data types.