02 Data Science

Different terminologies used in data science.

Algorithms
An algorithm is a set of instructions designed to perform a specific task: a sequence of steps we give a computer so it can take values and manipulate them into a usable form.
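As a small illustration (the function name and numbers below are invented for the example), an algorithm for finding the largest value in a list is just a fixed sequence of steps written in Python:

# A simple algorithm: find the largest value in a list of numbers.
def find_maximum(values):
    largest = values[0]            # start with the first value
    for v in values[1:]:           # examine every remaining value
        if v > largest:            # keep the biggest one seen so far
            largest = v
    return largest

print(find_maximum([4, 17, 9, 2]))   # prints 17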
Big Data
Big data is a term that refers to data sets, or combinations of data sets, whose size (volume), complexity (variability), rate of growth (velocity), consistency (veracity) and value make them difficult to capture, manage, process or analyze with conventional technologies and tools.
Machine Learning
A process where a computer uses an algorithm to gain understanding about a set of data, then makes predictions
based on its understanding. There are many types of machine learning techniques; most are classified as either
supervised or unsupervised techniques.
Classification
Classification is a supervised machine learning problem. It deals with categorizing a data point based on its similarity to
other data points. You take a set of data where every item already has a category and look at common traits between each
item. You then use those common traits as a guide for what category the new item might have.
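As a rough sketch of a supervised classifier (assuming the scikit-learn library is installed; the tiny data set and labels are invented for illustration), a k-nearest-neighbours model learns categories from labelled points and assigns a category to a new point:

from sklearn.neighbors import KNeighborsClassifier

# Labelled training data: [height_cm, weight_kg] -> 0 = "child", 1 = "adult"
X = [[120, 25], [130, 30], [170, 70], [180, 85]]
y = [0, 0, 1, 1]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                      # learn from items that already have a category
print(model.predict([[165, 60]]))    # predict the category of a new item -> [1]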
Database
As simply as possible, this is a storage space for data. We mostly use databases with a Database Management System (DBMS), such as MySQL or SQLite, and interact with them using the SQL query language. These are computer applications that allow us to interact with a database to collect and
analyze the information inside.
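A minimal sketch using Python's built-in sqlite3 module (the table and values are invented for the example) shows how a DBMS lets us store and query data:

import sqlite3

conn = sqlite3.connect(":memory:")          # an in-memory database for the example
cur = conn.cursor()
cur.execute("CREATE TABLE employee (id TEXT, name TEXT, salary INTEGER)")
cur.execute("INSERT INTO employee VALUES ('K001', 'SUJIT', 50000)")
cur.execute("SELECT name, salary FROM employee WHERE salary > 40000")
print(cur.fetchall())                       # [('SUJIT', 50000)]
conn.close()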
Data Warehouse
A data warehouse is a system used to do quick analysis of business trends using data from many sources. They’re
designed to make it easy for people to answer important statistical questions without a Ph.D. in database architecture.
Data Wrangling
Data wrangling, or data munging, is the process of converting data, often through the use of scripting languages, to make it easier to work with. Also known as data preparation, it involves transforming and cleaning data so it can be used for analysis or other purposes. The goal of data wrangling is to make data more valuable and appropriate for downstream uses, such as analytics or machine learning.
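A small wrangling sketch with the pandas library (assumed to be installed; the messy records are invented) shows typical steps such as removing duplicates, filling missing values and fixing data types:

import pandas as pd

raw = pd.DataFrame({
    "name": ["Amitav", "Sudipta", "Sudipta", "Soumya"],
    "age":  ["45", "17", "17", None],          # ages arrived as text, one missing
})

clean = raw.drop_duplicates().copy()            # remove the repeated record
clean["age"] = clean["age"].fillna("0")         # fill the missing value
clean["age"] = clean["age"].astype(int)         # convert text to numbers
print(clean)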
Web Analytics
Statistical or machine learning methods applied to web data such as page views, hits, clicks, and conversions (sales),
generally with a view to learning what web presentations are most effective in achieving the organizational goal (usually
sales). This goal might be to sell products and services on a site, to serve and sell advertising space, to purchase
advertising on other sites or to collect contact information. Key challenges in web analytics are the volume and constant flow of data, along with the navigational complexity and sometimes lengthy gaps that precede users' relevant web decisions.
Artificial Intelligence (AI)
A discipline involving research and development of machines that are aware of their surroundings. Most work in A.I.
centers on using machine awareness to solve problems or accomplish some task. In case you didn’t know, A.I. is already
here: think self-driving cars, robot surgeons, and the bad guys in your favorite video game.
Business Intelligence (BI)
Similar to data analysis, but more narrowly focused on business metrics. The technical side of BI involves learning
how to effectively use software to generate reports and find important trends. It’s descriptive, rather than predictive. It
is a set of methodologies, processes and theories that transform raw data into useful information to help companies make better decisions.
Data Analytics
Analytics is the systematic computational analysis of data or statistics. It is used for the discovery, interpretation, and
communication of meaningful patterns in data.
Application of Data Science

Data Science is the area of study which involves extracting insights from vast amounts of data by the use of various
scientific methods, algorithms, and processes. It helps you to discover hidden patterns from the raw data.
• Airline Route Planning
• Fraud and Risk Detection
• Healthcare
• Internet Search
• Targeted Advertising
• Website Recommendations
• Advanced Image Recognition
• Speech Recognition
• Gaming
TYPES OF DATA

1. Unstructured data: Word, PDF, text, images, audio and video.
2. Semi-structured data: XML data.
3. Metadata: data about data.
4. Structured data: relational data.

Unstructured Big Data
Any data with unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, audio and videos.
Semi-Structured Data
Semi-structured data can contain both forms of data. It appears to have structure, but that structure is not formally defined in the way a table definition in a relational DBMS is. An example of semi-structured data is data represented in an XML file. Web pages written in HTML are also an example of semi-structured data.
Personal data stored in an XML file
<rec><name>Amitav</name><gender>Male</gender><age>45</age></rec>
<rec><name>Sudipta</name><gender>Male</gender><age>17</age></rec>
<rec><name>Soumya</name><gender>Male</gender><age>15</age></rec>
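These records can be read with Python's built-in xml.etree.ElementTree module; the wrapping <people> root element below is added only so the snippet forms a single well-formed document:

import xml.etree.ElementTree as ET

xml_text = """
<people>
  <rec><name>Amitav</name><gender>Male</gender><age>45</age></rec>
  <rec><name>Sudipta</name><gender>Male</gender><age>17</age></rec>
</people>
"""

root = ET.fromstring(xml_text)
for rec in root.findall("rec"):                  # each <rec> is one semi-structured record
    print(rec.findtext("name"), rec.findtext("age"))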
Meta Data
Metadata is defined as the data providing information about one or more aspects of the data. It is used to summarize
basic information about data which can make tracking and working with specific data easier.
There are three main types of metadata:
• Descriptive metadata describes a resource for purposes such as identification and discovery. It can include elements such as the title of the book, an abstract and keywords.

• Structural metadata indicates how compound objects are put together e.g. how pages are ordered to form
chapters.

• Administrative metadata provides information to help manage a resource, such as when and how it was
created, file type and other technical information, and who can access it.
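As an illustrative sketch (the field names and values are invented), the three kinds of metadata for a book could be recorded in Python as a simple dictionary:

book_metadata = {
    "descriptive":    {"title": "Introduction to Data Science",
                       "keywords": ["data", "statistics"]},      # identifies the resource
    "structural":     {"chapters": ["Ch1", "Ch2"],
                       "pages_per_chapter": [20, 25]},           # how parts fit together
    "administrative": {"created": "2024-01-15",
                       "file_type": "PDF",
                       "access": ["librarian", "student"]},      # management information
}
print(book_metadata["descriptive"]["title"])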
Structured Data
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. In other words, it is all data that can be stored in a SQL database in the form of tables with rows and columns.
An employee table is an example of structured data

Employee_Id   Employee_name   Gender   Dept      Salary
K001          SUJIT           M        FINANCE   50,000
K002          DIPTA           M        ADMIN     60,000
K003          SOUMYA          F        FINANCE   55,000
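As a sketch, the same structured table can be loaded into a pandas DataFrame (pandas assumed to be installed), where the fixed rows and columns make filtering and aggregation straightforward:

import pandas as pd

employees = pd.DataFrame({
    "Employee_Id":   ["K001", "K002", "K003"],
    "Employee_name": ["SUJIT", "DIPTA", "SOUMYA"],
    "Gender":        ["M", "M", "F"],
    "Dept":          ["FINANCE", "ADMIN", "FINANCE"],
    "Salary":        [50000, 60000, 55000],
})

# Average salary in the FINANCE department -> 52500.0
print(employees[employees["Dept"] == "FINANCE"]["Salary"].mean())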
Data Visualization

Data visualization is the practice of translating information into a visual context, such as a map or graph, to
make data easier for the human brain to understand and pull insights from. The main goal of data
visualization is to make it easier to identify patterns, trends and outliers in large data sets. The term is often
used interchangeably with information graphics, information visualization and statistical graphics.

Data visualization is one of the steps of the data science process, which states that after data has been
collected, processed and modeled, it must be visualized for conclusions to be made.

When a data scientist is writing advanced predictive analytics or machine learning algorithms, it's important
to be able to visualize the outputs to monitor results and ensure that the models are performing as
intended.
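As a minimal sketch with the matplotlib library (assumed installed; the numbers are invented), predicted versus actual values can be plotted to check visually how a model is performing:

import matplotlib.pyplot as plt

actual    = [10, 20, 30, 40, 50]        # invented actual sales figures
predicted = [12, 18, 33, 39, 48]        # the model's predictions

plt.scatter(actual, predicted)           # each point compares one prediction to reality
plt.plot([0, 55], [0, 55])               # a perfect model would fall on this line
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Model predictions vs actual values")
plt.show()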
Visualizations help businesses in many ways. Some examples include the following:
•They help isolate factors that affect customer behavior.

•They identify products or services that need to be improved.

•They make data more memorable for stakeholders.

•They help organizations understand when and where to place specific products.

•They can predict sales or revenue volumes.


Benefits of data visualization

•Actionable insights. A broad spectrum of an organization's personnel can understand visuals presented
in business intelligence dashboards. This lets users absorb information quickly, get better insights and figure
out the next steps faster.
•Exploration of complex relationships. Visualization platforms with advanced capabilities can display
complex relationships among data points and metrics, allowing an organization to make faster data-based
decisions.
•Compelling storytelling. Data dashboards that are visually compelling will maintain the audience's interest
with information they can understand.
•Accessibility. Visualization tools make data more accessible and understandable, so that laypersons or
semi-technical users who aren't data scientists can interpret and analyze it.
•Interactivity. Interactive dashboards have the functionality to allow users to click on various aspects of data
displays to get more information. This is especially useful for those with less expertise on the subject area
covered by the data. Static displays don't allow this.
Disadvantages of data visualization

•Complexity. A highly complicated visualization could appear cluttered or make it difficult to glean valuable
insights. More complexity also means users need training on the tools being used or risk creating the wrong
visual type for the data being used.

•Potential for misinterpretation. Users might have good intentions when using a visualization platform, but
they can draw incorrect conclusions from detailed visualizations.

•Data privacy and security. Users must consider the security and privacy of the data being visualized. A
platform might be susceptible to cyberattacks, thus compromising the security of data being used, or a data
set could be used that isn't compliant with data privacy regulations.

•Bias. Visualizations and the data they're based on should be scrutinized to ensure they aren't intentionally or
unintentionally biased. Failing to do so could compromise the credibility of those analyses. For example, a
data set that leaves out key demographics within a population could lead to a biased visualization of that data.
Ethics in Data Science

Ethics in Data Science refers to the responsible and ethical use of the data throughout the entire data
lifecycle. This includes the collection, storage, processing, analysis, and interpretation of various data.
•Privacy: Respecting an individual's data by maintaining confidentiality and obtaining consent.
•Transparency: Clearly communicating how data is collected, processed and used.
•Fairness and Bias: Ensuring fairness in data-driven processes and addressing biases that may arise in
algorithms, preventing discrimination against certain groups.
•Accountability: Holding individuals and organizations accountable for their actions and decisions based on
data.
•Security: Implementing robust security measures to protect sensitive data from unauthorized access and breaches.
•Data Quality: Ensuring the accuracy, completeness and reliability of the data to prevent misinformation.
Transparency and Documentation

Transparent documentation serves as the backbone of ethical decision-making in data science. It involves:
•Data Sources: Clearly outlining where the data originates from, including its collection methods and any
third-party sources involved.

•Methodologies: Describing the techniques, algorithms, and processes used for data analysis and model
creation. This transparency aids in understanding how conclusions are drawn.

•Transformations: Documenting any modifications or preprocessing steps applied to the data before
analysis. It ensures reproducibility and validates the accuracy of results.
Transparency and Accountability

•Transparency means being open about what is happening and letting everyone know how the data is being used; it is like an open window.

•When it comes to data science, data is the most valuable asset, so maintaining transparency in how the data is used means telling people where the data comes from and how it is being used.

•Accountability, on the other hand, means responsibility: taking responsibility for how the data is handled.

•Together, transparency and accountability create trust and reliability. Transparency builds
understanding, allowing others to see the 'why' and 'how' behind actions.

•Accountability ensures that those responsible for managing data are answerable for their actions and
decisions, fostering a sense of responsibility and trustworthiness in data practices.
Bias Mitigation

Identifying and mitigating biases in data and algorithms is critical for fair outcomes. This includes:
•Data Audits: Regularly auditing datasets for inherent biases based on demographics or historical imbalances (a small audit sketch follows this list).

•Algorithm Fairness: Assessing algorithms to detect and rectify biases in decision-making processes to
ensure fairness across diverse groups.

•Diverse Representation: Actively seeking diverse perspectives and inclusivity in datasets and model development to avoid reinforcing existing biases.
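The data-audit sketch referred to above, in Python with pandas assumed and an invented data set, counts how well each demographic group is represented and compares outcomes across groups before a model is trained:

import pandas as pd

applicants = pd.DataFrame({
    "gender":   ["M", "M", "M", "F", "M", "F"],
    "approved": [1, 1, 0, 0, 1, 0],
})

# Share of records per group: a heavily skewed split can signal sampling bias.
print(applicants["gender"].value_counts(normalize=True))

# Approval rate per group: large gaps warrant a closer fairness review.
print(applicants.groupby("gender")["approved"].mean())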
Data Privacy and Consent

Respecting data privacy laws and obtaining informed consent are foundational principles:
•Informed Consent: Clearly communicating to individuals how their data will be used, ensuring they
understand and agree to its usage.

•Anonymization: Stripping personally identifiable information whenever possible to protect individual identities (a short anonymization sketch follows this list).

•Compliance: Adhering to legal frameworks such as the GDPR (General Data Protection Regulation, a European Union law), HIPAA (Health Insurance Portability and Accountability Act, which protects patients' health information) or the CCPA (California Consumer Privacy Act) to ensure lawful and ethical data handling.
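The anonymization sketch referred to above, in Python with pandas assumed and an invented data set, drops a direct identifier and replaces names with one-way hashes (strictly speaking this is pseudonymization rather than full anonymization):

import hashlib
import pandas as pd

records = pd.DataFrame({
    "name":  ["Amitav", "Sudipta"],
    "email": ["a@example.com", "s@example.com"],
    "age":   [45, 17],
})

anonymized = records.drop(columns=["email"])                  # remove a direct identifier
anonymized["name"] = anonymized["name"].apply(
    lambda n: hashlib.sha256(n.encode()).hexdigest()[:12]     # replace names with hashes
)
print(anonymized)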
Security Measures

Safeguarding data against breaches or unauthorized access involves robust security protocols:
•Encryption: Protecting data through encryption methods to ensure confidentiality, especially for sensitive information (a brief encryption sketch follows this list).

•Access Control: Implementing strict access controls to limit data access to authorized personnel only.

•Regular Audits: Conducting periodic security audits and assessments to identify vulnerabilities and rectify
them promptly.
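The encryption sketch mentioned above, assuming the third-party cryptography package is installed, encrypts a sensitive value with a symmetric key so that only holders of the key can read it:

from cryptography.fernet import Fernet

key = Fernet.generate_key()                    # in practice, store this key in a secure vault
cipher = Fernet(key)

token = cipher.encrypt(b"patient_id=12345")    # ciphertext safe to store or transmit
print(token)
print(cipher.decrypt(token))                   # only the key holder can recover the value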
Ethical Decision-making

Considering the broader ethical implications of data usage and model outcomes involves:
•Societal Impact Assessment: Evaluating the potential societal consequences of deploying models or
algorithms on different groups or communities.

•Ethical Frameworks: Using established ethical frameworks to guide decision-making and identify
potential ethical dilemmas.

•Continuous Evaluation: Regularly assessing the ethical implications of data usage and model outcomes
throughout the project lifecycle.
The Importance of Ethical Data Usage

Data scientists are at the heart of data: they hold data that can drive powerful decisions and shape the future. Because data is so valuable, maintaining ethical standards is not merely an obligation but a fundamental part of a data scientist's role in ensuring responsible data usage.
Ethical data usage is the main building block of trust. When individuals provide their data to organizations or platforms, they expect it to be handled with integrity and basic ethics. Respecting their privacy is the most important part, as it strengthens the organization's reputation.
Seven Global Privacy Principles
1. Notice (Transparency): Inform individuals about the purposes for which information is collected
2. Choice: Offer individuals the opportunity to choose (or opt out) whether and how personal information they
provide is used or disclosed
3. Consent: Only disclose personal information to third parties consistent with the principles of notice and choice
4. Security: Take reasonable measures to protect personal information from loss, misuse, and unauthorized access, disclosure, alteration, and destruction
5. Data Integrity: Assure the reliability of personal information for its intended use and take reasonable precautions to ensure information is accurate, complete, and current
6. Access: Provide individuals with access to the personal information held about them
7. Accountability: A firm must be accountable for following the principles and must include mechanisms for assuring compliance
