Data Science Essentials
(CS227)

Dr. S. V. Vimala, M.E., Ph.D.
Assistant Professor / CSE, PTU
UNIT-I
• Introduction: Data Science - Epicycles of
Analysis - Stating and Refining the
Question - Exploratory Data Analysis - Using
Models to Explore Data - Inference: A
Primer - Formal Modeling - Inference vs.
Prediction: Implications for Modeling
Strategy - Interpreting Results.
Introduction to Data Science
• Big Data is a collection of data that is huge in volume and keeps growing
exponentially with time.
• In other words, big data is simply data of enormous size.
• E.g.: social media (Facebook), stock market data, e-commerce shopping,
healthcare, the Internet of Things, advertising and marketing, credit card
companies, mobile phone companies, etc.
Big Data characteristics:
• Huge volume of data: Rather than thousands or millions of rows, Big
Data can be billions of rows and millions of columns.
• Complexity of data types and structures: Big Data reflects the variety of
new data sources, formats, and structures, including digital traces left
on the web (emails, texts, blog posts, Twitter posts, photographs,
comments under YouTube videos, or likes on Facebook) and other digital
repositories for subsequent analysis.
• Speed of new data creation and growth: Big Data can
describe high-velocity data, with rapid data
ingestion (transporting data from one or more sources
to a target site).
 For example, in 2012 Facebook users posted 700
status updates per second worldwide, which can be
leveraged to infer latent interests or political views of
users and show relevant ads. For instance, an update
in which a woman changes her relationship status
from "single" to "engaged" would trigger ads on
bridal dresses, wedding planning, or name-changing
services.
• As the world entered the period of big data,
the need for its storage also grew.
• The main focus was on building a framework
and solutions to store data.
• Now that Hadoop and other frameworks have
successfully solved the problem of storage,
the focus has shifted to the processing of this
data.
• Big Data is data of such size and complexity that
traditional data management tools
(RDBMS) cannot store or process it efficiently.
• Big Data problems require new tools and
technologies to store, manage, and realize the
business benefit.
• Data Science is the secret sauce here.
• Data science helps individuals and organizations
tackle enormous data sets and extract valuable
information from them.
What is Data?
• Data is a collection of information.
• One purpose of Data Science is to structure data, making it interpretable
and easy to work with.
• Data can be categorized into six different groups:
Structured data
Unstructured data
Natural language
Machine-generated
Graph-based
Audio, video, and images.
Structured Data:
 Structured data is data that depends on a data model and
resides in a fixed field within a record.
 Structured data is organized and easier to work with.
 It is commonly stored in tables within databases or Excel files.
 SQL, or Structured Query Language, is the preferred way to
manage and query data that resides in databases; a short example follows.
Unstructured Data:
• Unstructured data is not organized;
its content is context-specific.
• One example of unstructured data is your regular email.
• We must organize such data for analysis purposes.
Natural language:
• Natural language is a special type of unstructured
data; it’s challenging to process because it requires
knowledge of specific data science techniques and
linguistics.
Machine-generated data:
 is information that’s automatically created by a
computer, process, application, or other machine
without human intervention.
 The analysis of machine data relies on highly
scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call
detail records, and network event logs; a small parsing sketch follows.
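Web server logs are a typical case. The sketch below parses one line of a common Apache-style access log with Python's re module; the sample line itself is made up.

# Parse one line of an Apache-style access log (machine-generated data).
import re

line = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)
match = pattern.match(line)
if match:
    # Pull a few structured fields out of the raw log text.
    print(match.group("ip"), match.group("path"), match.group("status"))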
Graph-based or network data:
 Graph or network data is, in short, data that focuses on the
relationship or adjacency of objects.
 Graph-based data is a natural way to represent social
networks.
 For example, imagine the connecting edges showing
“friends” on Facebook.
 Graph databases are used to store graph-based data and are
queried with specialized query languages such as SPARQL; see the sketch below.
Introduction to Data Science
Definition of Data science:
 Data Science is collecting, analyzing and interpreting data to gather
insights into the data that can help decision-makers make informed
decisions.
 Data science is the practice of mining large sets of raw data, both
structured and unstructured, to identify unseen patterns, derive
meaningful information, and make business decisions from them.
 Data science is a multidisciplinary field of study that applies techniques
and tools to draw meaningful information and actionable insights out of
noisy data. Data science uses subjects like mathematics, statistics,
computer science and artificial intelligence, to gain insights from big
data.

Data science is used across a variety of
industries for smarter planning and decision
making.
By using Data Science, companies are able to
make:
• Better decisions (should we choose A or B)
• Predictive analysis (what will happen next?)
• Pattern discoveries (find patterns, or maybe
hidden information, in the data)
Examples:
• Have you ever wondered how Amazon and eBay suggest items for you to
buy?
• How Gmail filters your emails in the spam and non-spam categories?
• How Netflix predicts the shows of your liking?
• How do they do it? These are the few questions we ponder from time to
time.
• In reality, performing such tasks is impossible without the availability of
data.
• Data science is all about using data to solve problems.
• The problem could be decision making such as identifying which email is
spam and which is not.
• Or a product recommendation such as which movie to watch?
• Or predicting the outcome such as who will be the next President of the
USA?
• So, the core job of a data scientist is to understand the data, extract
useful information out of it and apply this in solving the problems.
Introduction to Data Science
Where is Data Science Needed?
• Data Science is used in many industries in the world
today, e.g. banking, consultancy, healthcare, and
manufacturing.
Examples of where Data Science is needed:
• For route planning: to discover the best routes to ship goods
• To foresee delays for flights/ships/trains etc. (through
predictive analysis)
• To create promotional offers
• To find the best-suited time to deliver goods
• To forecast the next year's revenue for a company
• To analyze the health benefits of training
• To predict who will win elections
Data Science can be applied in nearly every part of a
business where data is available. Examples are:
• Consumer goods
• Stock markets
• Industry
• Politics
• Logistic companies
• E-commerce
• Fraud and Risk Detection
• Healthcare
• Internet Search
• Targeted Advertising
• Website Recommendations
Introduction to Data Science
Prerequisites for Data Science:
A Data Scientist requires expertise in several
backgrounds:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases
• Machine learning (ML) is defined as a
discipline of artificial intelligence (AI) that
provides machines the ability to automatically
learn from data and past experiences to
identify patterns and make predictions with
minimal human intervention.
• Statistics is a set of mathematical methods
and tools that enable us to answer important
questions about data.
Who is a Data Scientist?
• Data Scientist is one who practices the art of Data Science.
• The now-popular term ‘Data Scientist’ was coined
by DJ Patil and Jeff Hammerbacher.
• Data scientists are those who crack complex data
problems with their strong expertise in certain scientific
disciplines.
• They work with several elements related to mathematics,
statistics, computer science, etc (though they may not be
an expert in all these fields).
What skills does a Data Scientist possess?
• Be innovative and distinctive in their approach,
applying various techniques intelligently to extract data
and gain useful insights for solving business problems and
challenges.
• Have the ability to locate and interpret rich data sources.
• Have hands-on experience in data mining
techniques such as graph analysis, pattern detection,
decision trees, clustering, and statistical analysis.
• Develop operational models, systems and tools by
applying experimental and iterative methods and
techniques.
• Analyze data from a variety of sources and perspectives
and find out hidden insights.
• Perform Data Conditioning – that is, converting
data into a useful form by applying statistical,
mathematical tools and predictive analysis.
• Research, analyze, execute, and present statistical
methods to gain practical insights.
• Manage large amounts of data even during
hardware, software and bandwidth limitations.
• Create visualizations that will help anyone
understand the trends in data analysis with ease.
• Be a team leader and communicate effectively with
other business analysts, product managers, and
engineers.
Data Science is also an Art!
• Data science is not only a science or a technique, it is also an ‘Art’.
• Data Science is the art of listening to your intuition while facing
huge amounts of data, classifying it, evaluating it, and reaching
conclusions.
• Not everyone is blessed with this art! Data scientists need to be
really creative in visualizing the data in various graphical forms and
present the highly complex data in a very simple and friendly way!
• If a data scientist is able to convert terrifying petabytes of
structured as well as unstructured data (images, videos, log files,
etc.) into a very easy and simple format, they are an ‘Artist’!
Data Science Life Cycle:
Step 1: Define Problem Statement (Business Requirement)
• Before you even begin a Data Science
project, you must define the problem you’re
trying to solve. At this stage, you should be
clear with the objectives of your project.
• The success of any project depends on the
quality of the questions asked about the dataset.
• Asking questions about the dataset will help in
narrowing down to the correct data to acquire.
Step 2: Data Collection
• As the name suggests, at this stage you must acquire all
the data needed to solve the problem.
• It is a well-known fact that there is no Data
Science without data.
• Data is the key ingredient of any
Data Science project.
• Now the question comes where to get the data from.
• Collecting data is not very easy because most of the
time you won’t find data sitting in a database, waiting
for you.
• Instead, you’ll have to go out, do some research and
collect the data or scrape it from the internet.
• Data could come from various sources: logs from web
servers, online repositories, databases, social media,
Excel sheets; in short, data can come
from any source.
• Data is everywhere: newspapers, journals,
online repositories, websites - everything is made up of
data.
• If the right questions have been asked in the prior step,
then it becomes easy to narrow down to the
correct data sources; a small sketch follows.
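As a minimal sketch of this step, the snippet below pulls data from two of the sources mentioned above - a local spreadsheet-style file and a web URL - using pandas. The file name and URL are placeholders, not real resources.

# Collect data from two typical sources with pandas (pip install pandas).
import pandas as pd

local = pd.read_csv("patients.csv")                   # data sitting in a CSV file
remote = pd.read_csv("https://example.com/data.csv")  # data fetched from the web

# Club the sources together for downstream cleaning and analysis.
combined = pd.concat([local, remote], ignore_index=True)
print(combined.shape)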
Step 3: Data Cleaning
• If you ask a Data Scientist what their least favorite process
in Data Science is, they’re most probably going to tell you
that it is Data Cleaning.
• Data acquired in the previous step might not give a clear
analytical picture or reveal patterns; it may or
may not be in the required format.
• Data might also be obtained from different sources, but for
analysis it needs to be clubbed together. This is
also referred to as structuring the data.
• Apart from this, data might have missing values, which will
obstruct analysis and model building. There are
various methods for treating missing and duplicate
values.
• So the data needs to be structured and
cleaned before any further processing. This step is
also known as Data Cleaning or Data Wrangling.
• Data cleaning is the process of removing redundant,
missing, duplicate, and unnecessary data.
• Data preparation is the most time-consuming but
arguably the most essential step in the complete
life cycle: your model will only be as accurate as your
data.
• To prevent wrongful predictions, it is
important to get rid of any inconsistencies in the data;
a minimal sketch follows.
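A minimal pandas sketch of the cleaning operations just described - dropping duplicates, treating missing values, and removing a useless column. The small DataFrame is invented for illustration.

# Common data-cleaning steps with pandas; the data is invented.
import pandas as pd

df = pd.DataFrame({
    "glucose": [148, 85, None, 85],
    "bp": [72, 66, 64, 66],
    "income": [None, None, None, None],  # blank column, useless for analysis
})

df = df.drop_duplicates()         # duplicate value treatment
df = df.drop(columns=["income"])  # remove the unnecessary column
df["glucose"] = df["glucose"].fillna(df["glucose"].median())  # missing value treatment
print(df)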
Step 4: Exploratory Data Analysis (EDA)
• EDA plays an important role in summarizing the
clean data, helping to identify its structure, outliers,
anomalies, and patterns.
• This is where you retrieve useful insights and study the
behavior of the data.
• The distribution of data within individual variables is
explored graphically using bar graphs, and
relations between distinct features are captured via
graphical representations like scatter plots and heat
maps.
• Many data visualization techniques are used extensively
to explore each feature individually and
in combination with other features, as in the sketch below.
Step 5: Data Modelling
This stage seems to be the most interesting one to
almost all data scientists.
 Many people call it “a stage where magic
happens”. But remember magic can happen
only if you have correct props and technique.
In terms of data science “Data” is that prop
and data preparation is that technique. So
before jumping to this step make sure to
spend sufficient amount of time in prior
steps.
• This stage is all about building a model that best solves
your problem.
• A model can be a Machine Learning Algorithm that is
trained and tested using the data.
• This stage always begins with a process called Data
Splicing (i.e., data splitting), where you split your entire
data set into two portions.
• One for training the model (training data set) and the
other for testing the efficiency of the model (testing
data set).
• This is followed by building the model by using the
training data set and finally evaluating the model by
using the test data set.
• There are various ways to do this. The simplest is a hold-out
split, where the whole data set is divided into two parts, one
for training and the other for testing. A more robust variant is
the k-fold method, where the data is split into k parts and the
model is trained and validated k times, each time holding out a
different part as test data.
• There are different kinds of algorithms, from regression to
classification, SVMs (Support Vector Machines),
clustering, etc. Your model can be any such Machine Learning
algorithm.
• Models are selected based on the business problem.
• It is essential to identify the ask: is it a
classification problem, a regression or prediction problem,
time-series forecasting, or a clustering problem? Once the
problem type is sorted out, the model can be implemented,
as in the sketch below.
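A minimal modelling sketch with scikit-learn: split the data, fit a classifier, and score it on the held-out portion. The built-in iris dataset stands in for your own cleaned data.

# Train/test split plus a decision tree (pip install scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# Data splicing: hold out 20% of the rows as the testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)         # build the model on the training set
print(model.score(X_test, y_test))  # evaluate on unseen test data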
Step 6: Optimization and Deployment
Optimization:
 This is the last stage of the Data Science life cycle.
 You have followed each and every step and built a model
that you feel is the best fit.
 But how can you decide how well your model is performing?
 This is where optimization comes in.
 You test your model and find how well it is performing by
checking its accuracy. In short, you check the efficiency of the
model and try to optimize it for more accurate
predictions.
 For classification problems, precision, recall, and F1-score
can be used. For regression problems, R², MAPE (Mean
Absolute Percentage Error), or RMSE (Root Mean Square Error)
can be used; a minimal sketch follows.
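A small sketch of computing these metrics with scikit-learn; the true labels and predictions below are invented arrays.

# Classification and regression metrics (scikit-learn); values are invented.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             r2_score, mean_squared_error)

# Classification: true vs. predicted class labels.
y_true, y_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))

# Regression: true vs. predicted numeric values.
yr_true, yr_pred = [3.0, 5.0, 2.5], [2.8, 5.3, 2.4]
print(r2_score(yr_true, yr_pred))
print(mean_squared_error(yr_true, yr_pred) ** 0.5)  # RMSE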
Deployment:
 The users must validate the performance of the
models, and if there are any issues with the
model, they must be fixed in this stage.
 Deployment deals with the launch of your
model, letting people outside the project
benefit from it. You can also obtain feedback
from organizations and people to learn their
needs and then work further on your model.
• Lastly, visualization of the findings should be done.
• It should be in line with the business questions.
• It should be meaningful to the organisation and the
stakeholders.
• The presentation through visualization should be such that
it triggers action in the audience.
• All the above steps make a complete Data Science
project.
• Python and R are the most used languages for Data
Science.
Diabetes Prevention:
What if we could predict the occurrence of diabetes and take appropriate
measures beforehand to prevent it?
In this use case, we will predict the occurrence of diabetes, making use of the
entire life cycle.
Step 1:
• First, we will collect the data based on the medical history of the patient
attributes:
• npreg – Number of times pregnant
• glucose – Plasma glucose concentration
• bp – Blood pressure
• skin – Triceps skinfold thickness
• bmi – Body mass index
• ped – Diabetes pedigree function
• age – Age
• income – Income
Step 2:
• Now, once we have the data, we need to clean and
prepare the data for data analysis.
• This data has a lot of inconsistencies like missing
values, blank columns, abrupt values and incorrect
data format which need to be cleaned.
• Here, we have organized the data into a single
table under different attributes – making it look
more structured.
• Let’s have a look at the sample data below.
• This data has a lot of inconsistencies.
• In the column npreg, “one” is written in words, whereas it
should be in the numeric form like 1.
• In the column bp, one of the values is 6600, which is impossible
(at least for humans), as blood pressure cannot reach such a huge value.
• As you can see, the Income column is blank and also makes
no sense for predicting diabetes. Therefore, it is redundant
and should be removed from the table.
• So, we will clean and preprocess this data by removing the
outliers, filling in the null values, and normalizing the data
types. If you remember, this is our second phase, which is
data preprocessing.
• Finally, we get the clean data, which can be
used for analysis; a small sketch of these steps follows.
Step 3:
• Now let’s do some analysis as discussed earlier in Phase 3.
• We use visualization techniques like histograms, line graphs, and box plots
to get a fair idea of the distribution of the data.
Step 4:
• Now, based on insights derived from the previous step, the
best fit for this kind of problem is a decision tree. Let's
see how it works.
• Here, the most important parameter is the
level of glucose, so it is our root node.
• Now, the current node and its value
determine the next important parameter to
be taken. It goes on until we get the result in
terms of pos or neg.
• Pos means the tendency of having diabetes is
positive, and neg means the tendency of having
diabetes is negative; a sketch of such a tree follows.
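A hedged sketch of fitting such a tree with scikit-learn. The file diabetes.csv, its column names (matching the attributes listed earlier), and the pos/neg label column are assumptions, not given in the slides.

# Decision tree for the diabetes use case (scikit-learn).
# 'diabetes.csv' and its columns are hypothetical stand-ins.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("diabetes.csv")
X = df[["npreg", "glucose", "bp", "skin", "bmi", "ped", "age"]]
y = df["type"]  # "pos" or "neg"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Print the learned rules; glucose is expected to appear near the root.
print(export_text(tree, feature_names=list(X.columns)))
print("test accuracy:", tree.score(X_test, y_test))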
Step 5:
• In this phase, we check whether our results are
appropriate.
• We also look for performance constraints, if
any. If the results are not accurate, then we need
to replan and rebuild the model.
Step 6:
• Once we have executed the project successfully,
we will share the output for full deployment.
