Introduction to Data Science
(CS227)
Data science essentials
Dr. S. V. VIMALA, M.E., Ph.D.
Assistant Professor/CSE, PTU
UNIT-I
• Introduction: Data Science - Epicycles of
Analysis - Stating and Refining the
Question - Exploratory Data Analysis - Using
Models to Explore Data - Inference: A
Primer - Formal Modeling - Inference vs.
Prediction: Implications for Modeling
Strategy - Interpreting Results.
Introduction to Data Science
• Big Data is a collection of data that is huge in volume and still growing
exponentially with time.
• Big Data is still data, only at a much larger scale.
• Examples: social media (Facebook), stock market data, e-commerce shopping,
healthcare, the Internet of Things, advertising and marketing, credit card
companies, mobile phone companies, etc.
Big Data characteristics:
• Huge volume of data: Rather than thousands or millions of rows, Big
Data can be billions of rows and millions of columns.
• Complexity of data types and structures: Big Data reflects the variety of
new data sources, formats, and structures, including digital traces
left on the web (emails, texts, blog posts, Twitter posts, photographs,
comments under YouTube videos, or likes on Facebook) and in other digital
repositories for subsequent analysis.
• Speed of new data creation and growth: Big Data can
describe high-velocity data, with rapid data
ingestion (transporting data from one or more sources
to a target site).
For example, in 2012 Facebook users posted 700
status updates per second worldwide, which can be
leveraged to infer latent interests or political views of
users and show relevant ads. For instance, an update
in which a woman changes her relationship status
from "single" to "engaged" would trigger ads on
bridal dresses, wedding planning, or name-changing
services.
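To get a feel for what this velocity means, the per-second figure above can be scaled up to a full day. The sketch below is only back-of-the-envelope arithmetic in Python. It assumes the rate stays constant at 700 updates per second, which is a simplification, since real traffic varies through the day.

```python
# Rate quoted above for 2012 (status updates per second, worldwide).
updates_per_second = 700

# Assumption: the rate is constant over the whole day.
seconds_per_day = 24 * 60 * 60
updates_per_day = updates_per_second * seconds_per_day

print(f"{updates_per_day:,} status updates per day")  # 60,480,000
```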
• As the world entered the era of big data,
the need for storage also grew.
• At first, the main focus was on building frameworks
and solutions to store data.
• Now that Hadoop and other frameworks
have successfully solved the problem of
storage, the focus has shifted to the processing of this
data.
• Big Data is data of such size and complexity that
traditional data management tools
(RDBMS) cannot store or process it efficiently.
• Big Data problems require new tools and
technologies to store and manage the data and to realize the
business benefit.
• Data Science is the secret sauce here.
• Data science helps individuals and organizations
tackle enormous data sets and extract valuable
information from them.
What is Data?
• Data is a collection of information.
• One purpose of Data Science is to structure data, making it interpretable
and easy to work with.
• Data can be categorized into the following groups:
Structured data
Unstructured data
Natural language
Machine-generated
Graph-based
Audio, video, and images.
Structured Data:
Structured data is data that depends on a data model and
resides in a fixed field within a record.
Structured data is organized and easier to work with.
It is typically stored in tables within databases or Excel files.
SQL, or Structured Query Language, is the preferred way to
manage and query data that resides in databases.
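As a small illustration of structured data and SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name, column names, and rows are hypothetical and chosen only for the example.

```python
import sqlite3

# In-memory database; the table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Chennai"), ("Ravi", "Puducherry"), ("Meena", "Chennai")],
)

# Every record has the same fixed fields, so a query can filter on them.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Chennai",)
).fetchall()
print(rows)
conn.close()
```

Because each record follows the same data model, the query can refer to the city field by name without inspecting the content of each row.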
Unstructured Data:
• Unstructured data is not organized.
• Its content is context-specific.
• One example of unstructured data is a regular email.
• We must organize such data before it can be analyzed.
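One way to picture "organizing" an email is to pull its parts into named fields. The sketch below uses Python's standard email package. The message text and addresses are made up for the example, and the chosen field names (sender, recipient, subject, body) are only one possible layout.

```python
from email import message_from_string

# Hypothetical raw email: headers, a blank line, then free-form text.
raw_email = """\
From: alice@example.com
To: bob@example.com
Subject: Project meeting

Hi Bob, can we move the meeting to Friday?
"""

msg = message_from_string(raw_email)

# Headers map cleanly to fields; the body remains free text.
record = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
    "body": msg.get_payload().strip(),
}
print(record)
```

The headers become structured fields, but the body is still unstructured text that needs further processing before analysis.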
Natural language:
• Natural language is a special type of unstructured
data; it’s challenging to process because it requires
knowledge of specific data science techniques and
linguistics.
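A very first step in processing natural language is often to split text into words and count them. The sketch below does this with a simple regular expression. It is only an illustration with a made-up sentence: it ignores grammar, word meaning, and context, which is where the linguistic knowledge mentioned above comes in.

```python
import re
from collections import Counter

# Hypothetical sample text.
text = "Data science is fun. Data science is also hard!"

# Lowercase the text and keep runs of letters as tokens.
tokens = re.findall(r"[a-z']+", text.lower())
counts = Counter(tokens)

print(tokens)
print(counts["data"], counts["hard"])
```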
Machine-generated data:
Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine
without human intervention.
The analysis of machine data relies on highly
scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call
detail records, network event logs.
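To show what working with machine data can look like, the sketch below parses one web server log line into named fields with a regular expression. The log line is hypothetical and is assumed to follow the Common Log Format. Real logs may use other layouts and would need a different pattern.

```python
import re

# Hypothetical line, assumed to be in the Common Log Format.
log_line = (
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326'
)

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

# Turn the raw line into a dictionary of fields (empty if it does not match).
match = pattern.match(log_line)
entry = match.groupdict() if match else {}
print(entry)
```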
Graph-based or network data:
Graph or network data is, in short, data that focuses on the
relationship or adjacency of objects.
Graph-based data is a natural way to represent social
networks.
For example, the connecting edges can represent
“friends” on Facebook.
Graph databases are used to store graph-based data and are
queried with specialized query languages such as SPARQL.
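The idea of adjacency can be sketched without a graph database by using a plain Python dictionary. The names and friendships below are hypothetical, and the helper mutual_friends is introduced only for this illustration.

```python
# Hypothetical friendship edges; each pair is an undirected "friends" link.
friendships = [("Anu", "Bala"), ("Anu", "Chitra"), ("Bala", "Divya")]

# Adjacency list: person -> set of direct friends.
graph = {}
for a, b in friendships:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def mutual_friends(graph, x, y):
    """Return the people adjacent to both x and y."""
    return graph.get(x, set()) & graph.get(y, set())


print(sorted(graph["Anu"]))
print(mutual_friends(graph, "Chitra", "Bala"))
```

A dedicated graph database plays the same role at a much larger scale and adds a query language on top.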
Definition of Data science:
Data Science is collecting, analyzing and interpreting data to gather
insights into the data that can help decision-makers make informed
decisions.
Data science is the practice of mining large sets of raw data, both
structured and unstructured, to identify unseen patterns, derive
meaningful information, and make business decisions from them.
Data science is a multidisciplinary field of study that applies techniques
and tools to draw meaningful information and actionable insights out of
noisy data. Data science uses subjects like mathematics, statistics,
computer science and artificial intelligence, to gain insights from big
data.
Data science is used across a variety of
industries for smarter planning and decision
making.
By using Data Science, companies are able to:
• Make better decisions (should we choose A or B?)
• Perform predictive analysis (what will happen next?)
• Discover patterns (find patterns, or
hidden information, in the data)
Examples:
• Have you ever wondered how Amazon and eBay suggest items for you to
buy?
• How Gmail sorts your emails into spam and non-spam categories?
• How Netflix predicts the shows you will like?
• How do they do it? These are a few of the questions we ponder from time to
time.
• In reality, such tasks are impossible without the availability of
data.
• Data science is all about using data to solve problems.
• The problem could be decision making such as identifying which email is
spam and which is not.
• Or a product recommendation such as which movie to watch?
• Or predicting the outcome such as who will be the next President of the
USA?
• So, the core job of a data scientist is to understand the data, extract
useful information out of it and apply this in solving the problems.
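The spam example above can be made concrete with a toy sketch. It counts how often each word appears in a few labeled messages and labels a new message by which class its words were seen in more often. The training messages are invented, and this word-counting rule is only an illustration. It is not how Gmail or any production filter works.

```python
import re
from collections import Counter

# Tiny hypothetical training set: (message, label).
training = [
    ("win a free prize now", "spam"),
    ("free offer click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

# Count how often each word appears under each label.
word_counts = {"spam": Counter(), "ham": Counter()}
for message, label in training:
    word_counts[label].update(re.findall(r"[a-z]+", message.lower()))


def classify(message):
    """Label a message by which class its words were seen in more often."""
    words = re.findall(r"[a-z]+", message.lower())
    spam_score = sum(word_counts["spam"][w] for w in words)
    ham_score = sum(word_counts["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"


print(classify("claim your free prize"))
print(classify("agenda for the team meeting"))
```

The point is the workflow: understand the data, extract a useful signal (word counts), and apply it to a decision.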
Where is Data Science Needed?
• Data Science is used in many industries in the world
today, e.g. banking, consultancy, healthcare, and
manufacturing.
Examples of where Data Science is needed:
• For route planning: To discover the best routes to ship
• To foresee delays for flight/ship/train etc. (through
predictive analysis)
• To create promotional offers
• To find the best suited time to deliver goods
• To forecast the next year's revenue for a company
• To analyze the health benefits of training
• To predict who will win elections.
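As an illustration of one item in the list above, forecasting next year's revenue, the sketch below fits a straight line to past yearly figures with ordinary least squares and extends it one year ahead. The revenue numbers are hypothetical, and a straight-line trend is an assumption made for simplicity. Real forecasts usually need more data and richer models.

```python
# Hypothetical yearly revenue figures (year -> revenue in millions).
revenue = {2019: 10.0, 2020: 12.0, 2021: 14.0, 2022: 16.0}

years = list(revenue)
values = list(revenue.values())
n = len(years)
mean_x = sum(years) / n
mean_y = sum(values) / n

# Ordinary least-squares slope and intercept for y = intercept + slope * x.
slope = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(years, values)
) / sum((x - mean_x) ** 2 for x in years)
intercept = mean_y - slope * mean_x

# Extend the fitted line one year past the data.
forecast_2023 = intercept + slope * 2023
print(round(forecast_2023, 2))  # 18.0
```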
Data Science can be applied in nearly every part of a
business where data is available. Examples are:
• Consumer goods
• Stock markets
• Industry
• Politics
• Logistics companies
• E-commerce
• Fraud and Risk Detection
• Healthcare
• Internet Search
• Targeted Advertising
• Website Recommendations
Prerequisites for Data Science:
A Data Scientist requires expertise in several
areas:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases
• Machine learning (ML) is defined as a
discipline of artificial intelligence (AI) that
provides machines the ability to automatically
learn from data and past experiences to
identify patterns and make predictions with
minimal human intervention.
• Statistics is a set of mathematical methods
and tools that enable us to answer important
questions about data.
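A few of the most basic statistical tools are available in Python's standard statistics module. The sketch below summarizes a small set of hypothetical exam scores. The numbers are made up for the example.

```python
import statistics

# Hypothetical exam scores.
scores = [62, 75, 75, 81, 90, 97]

mean_score = statistics.mean(scores)      # average value
median_score = statistics.median(scores)  # middle value
mode_score = statistics.mode(scores)      # most frequent value
spread = statistics.stdev(scores)         # sample standard deviation

print("mean:", mean_score)
print("median:", median_score)
print("mode:", mode_score)
print("sample std dev:", round(spread, 2))
```

Each number answers a different question about the data: the typical value, the middle value, the most common value, and how widely the scores vary.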
Who is a Data Scientist?
• A Data Scientist is one who practices the art of Data Science.
• The highly popular term ‘Data Scientist’ was coined
by DJ Patil and Jeff Hammerbacher.
• Data scientists are those who crack complex data
problems with their strong expertise in certain scientific
disciplines.
• They work with several elements related to mathematics,
statistics, computer science, etc. (though they may not be
experts in all these fields).
What skills does a Data Scientist possess?
• Be very innovative and distinctive in their approach,
applying various techniques intelligently to extract data
and gain useful insights for solving business problems and
challenges.
• Have the ability to locate and interpret rich data sources.
• Have hands-on experience in data mining
techniques such as graph analysis, pattern detection,
decision trees, clustering, or statistical analysis.
• Develop operational models, systems and tools by
applying experimental and iterative methods and
techniques.
• Analyze data from a variety of sources and perspectives
and find out hidden insights.
• Perform Data Conditioning – that is, converting
data into a useful form by applying statistical,
mathematical tools and predictive analysis.
• Research, analyze, execute, and present statistical
methods to gain practical insights.
• Manage large amounts of data even during
hardware, software and bandwidth limitations.
• Create visualizations that will help anyone
understand the trends in data analysis with ease.
• Be a team leader and communicate effectively with
business analysts, product managers, and
engineers.
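Data conditioning, mentioned in the list above, can be sketched in two small steps: filling in missing values and rescaling the numbers to a common range. The readings below are hypothetical. Mean imputation and min-max scaling are just two common choices picked for the example, not the only way to condition data.

```python
# Hypothetical raw readings; None marks a missing value.
raw = [12.0, None, 18.0, 15.0, None, 30.0]

# Step 1: fill missing entries with the mean of the observed values.
observed = [v for v in raw if v is not None]
mean_value = sum(observed) / len(observed)
filled = [mean_value if v is None else v for v in raw]

# Step 2: min-max scaling so every value lies between 0 and 1.
low, high = min(filled), max(filled)
scaled = [(v - low) / (high - low) for v in filled]

print(filled)
print([round(v, 3) for v in scaled])
```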
Data Science is also an Art!