02 Data Science

Different terminologies used in data science.

Algorithms
An algorithm is a set of instructions designed to perform a specific task: a sequence of steps we give a computer so it can take values and manipulate them into a usable form.
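As a small illustration (the function name and numbers below are invented for the example), an algorithm for finding the largest value in a list is just a fixed sequence of steps written in Python:

# A simple algorithm: find the largest value in a list of numbers.
def find_maximum(values):
    largest = values[0]            # start with the first value
    for v in values[1:]:           # examine every remaining value
        if v > largest:            # keep the biggest one seen so far
            largest = v
    return largest

print(find_maximum([4, 17, 9, 2]))   # prints 17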
Big Data
Big data is a term that refers to data sets, or combinations of data sets, whose size (volume), complexity (variability), rate of growth (velocity), consistency (veracity) and value make them difficult to capture, manage, process or analyze with conventional technologies and tools.
Machine Learning
A process where a computer uses an algorithm to gain understanding about a set of data, then makes predictions
based on its understanding. There are many types of machine learning techniques; most are classified as either
supervised or unsupervised techniques.
Classification
Classification is a supervised machine learning problem. It deals with categorizing a data point based on its similarity to
other data points. You take a set of data where every item already has a category and look at common traits between each
item. You then use those common traits as a guide for what category the new item might have.
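As a rough sketch of a supervised classifier (assuming the scikit-learn library is installed; the tiny data set and labels are invented for illustration), a k-nearest-neighbours model learns categories from labelled points and assigns a category to a new point:

from sklearn.neighbors import KNeighborsClassifier

# Labelled training data: [height_cm, weight_kg] -> 0 = "child", 1 = "adult"
X = [[120, 25], [130, 30], [170, 70], [180, 85]]
y = [0, 0, 1, 1]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                      # learn from items that already have a category
print(model.predict([[165, 60]]))    # predict the category of a new item -> [1]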
Database
As simply as possible, this is a storage space for data. We mostly use databases with a Database Management System (DBMS), such as MySQL or SQLite, and interact with them using the SQL query language. These are computer applications that allow us to interact with a database to collect and
analyze the information inside.
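A minimal sketch using Python's built-in sqlite3 module (the table and values are invented for the example) shows how a DBMS lets us store and query data:

import sqlite3

conn = sqlite3.connect(":memory:")          # an in-memory database for the example
cur = conn.cursor()
cur.execute("CREATE TABLE employee (id TEXT, name TEXT, salary INTEGER)")
cur.execute("INSERT INTO employee VALUES ('K001', 'SUJIT', 50000)")
cur.execute("SELECT name, salary FROM employee WHERE salary > 40000")
print(cur.fetchall())                       # [('SUJIT', 50000)]
conn.close()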
Data Warehouse
A data warehouse is a system used to do quick analysis of business trends using data from many sources. They’re
designed to make it easy for people to answer important statistical questions without a Ph.D. in database architecture.
Data Wrangling
Data wrangling, or data munging, is the process of converting data, often through the use of scripting languages, to make it easier to work with. Also known as data preparation, it involves transforming and cleaning data so it can be used for analysis or other purposes. The goal of data wrangling is to make data more valuable and appropriate for downstream uses, such as analytics or machine learning.
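A small wrangling sketch with the pandas library (assumed to be installed; the messy records are invented) shows typical steps such as removing duplicates, filling missing values and fixing data types:

import pandas as pd

raw = pd.DataFrame({
    "name": ["Amitav", "Sudipta", "Sudipta", "Soumya"],
    "age":  ["45", "17", "17", None],          # ages arrived as text, one missing
})

clean = raw.drop_duplicates().copy()            # remove the repeated record
clean["age"] = clean["age"].fillna("0")         # fill the missing value
clean["age"] = clean["age"].astype(int)         # convert text to numbers
print(clean)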
Web Analytics
Statistical or machine learning methods applied to web data such as page views, hits, clicks, and conversions (sales),
generally with a view to learning what web presentations are most effective in achieving the organizational goal (usually
sales). This goal might be to sell products and services on a site, to serve and sell advertising space, to purchase
advertising on other sites or to collect contact information. Key challenges in web analytics are the volume and constant flow of data, along with the navigational complexity and sometimes lengthy gaps that precede users' relevant web decisions.
Artificial Intelligence (AI)
A discipline involving research and development of machines that are aware of their surroundings. Most work in A.I.
centers on using machine awareness to solve problems or accomplish some task. In case you didn’t know, A.I. is already
here: think self-driving cars, robot surgeons, and the bad guys in your favorite video game.
Business Intelligence (BI)
Similar to data analysis, but more narrowly focused on business metrics. The technical side of BI involves learning
how to effectively use software to generate reports and find important trends. It’s descriptive, rather than predictive. It
is a set of methodologies, processes and theories that transform raw data into useful information to help companies make better decisions.
Data Analytics
Analytics is the systematic computational analysis of data or statistics. It is used for the discovery, interpretation, and
communication of meaningful patterns in data.
Application of Data Science

Data Science is the area of study which involves extracting insights from vast amounts of data by the use of various
scientific methods, algorithms, and processes. It helps you to discover hidden patterns from the raw data.
• Airline Route Planning
• Fraud and Risk Detection
• Healthcare
• Internet Search
• Targeted Advertising
• Website Recommendations
• Advanced Image Recognition
• Speech Recognition
• Gaming
TYPES OF DATA

1. Unstructured data: Word, PDF, text, images, audio and video.
2. Semi-structured data: XML data.
3. Metadata: data about data.
4. Structured data: relational data.

Unstructured Big Data
Any data with unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, audio and videos.
Semi-Structured Data
Semi-structured data can contain both forms of data. It appears to have structure, but that structure is not formally defined in the way a table definition in a relational DBMS is. An example of semi-structured data is data represented in an XML file. Web pages written in HTML are also an example of semi-structured data.
Personal data stored in an XML file
<rec><name>Amitav</name><gender>Male</gender><age>45</age></rec>
<rec><name>Sudipta</name><gender>Male</gender><age>17</age></rec>
<rec><name>Soumya</name><gender>Male</gender><age>15</age></rec>
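These records can be read with Python's built-in xml.etree.ElementTree module; the wrapping <people> root element below is added only so the snippet forms a single well-formed document:

import xml.etree.ElementTree as ET

xml_text = """
<people>
  <rec><name>Amitav</name><gender>Male</gender><age>45</age></rec>
  <rec><name>Sudipta</name><gender>Male</gender><age>17</age></rec>
</people>
"""

root = ET.fromstring(xml_text)
for rec in root.findall("rec"):                  # each <rec> is one semi-structured record
    print(rec.findtext("name"), rec.findtext("age"))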
Meta Data
Metadata is defined as the data providing information about one or more aspects of the data. It is used to summarize
basic information about data which can make tracking and working with specific data easier.
There are three main types of metadata:
• Descriptive metadata describes a resource for purposes such as identification and discovery. It can include elements such as the title of the book, an abstract and keywords.

• Structural metadata indicates how compound objects are put together e.g. how pages are ordered to form
chapters.

• Administrative metadata provides information to help manage a resource, such as when and how it was
created, file type and other technical information, and who can access it.
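As an illustrative sketch (the field names and values are invented), the three kinds of metadata for a book could be recorded in Python as a simple dictionary:

book_metadata = {
    "descriptive":    {"title": "Introduction to Data Science",
                       "keywords": ["data", "statistics"]},      # identifies the resource
    "structural":     {"chapters": ["Ch1", "Ch2"],
                       "pages_per_chapter": [20, 25]},           # how parts fit together
    "administrative": {"created": "2024-01-15",
                       "file_type": "PDF",
                       "access": ["librarian", "student"]},      # management information
}
print(book_metadata["descriptive"]["title"])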
Structured Data
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. In other words, it is all data that can be stored in a SQL database in the form of tables with rows and columns.
An employee table is an example of structured data

Employee_Id   Employee_name   Gender   Dept      Salary
K001          SUJIT           M        FINANCE   50,000
K002          DIPTA           M        ADMIN     60,000
K003          SOUMYA          F        FINANCE   55,000
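As a sketch, the same structured table can be loaded into a pandas DataFrame (pandas assumed to be installed), where the fixed rows and columns make filtering and aggregation straightforward:

import pandas as pd

employees = pd.DataFrame({
    "Employee_Id":   ["K001", "K002", "K003"],
    "Employee_name": ["SUJIT", "DIPTA", "SOUMYA"],
    "Gender":        ["M", "M", "F"],
    "Dept":          ["FINANCE", "ADMIN", "FINANCE"],
    "Salary":        [50000, 60000, 55000],
})

# Average salary in the FINANCE department -> 52500.0
print(employees[employees["Dept"] == "FINANCE"]["Salary"].mean())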
Data Visualization

Data visualization is the practice of translating information into a visual context, such as a map or graph, to
make data easier for the human brain to understand and pull insights from. The main goal of data
visualization is to make it easier to identify patterns, trends and outliers in large data sets. The term is often
used interchangeably with information graphics, information visualization and statistical graphics.

Data visualization is one of the steps of the data science process, which states that after data has been
collected, processed and modeled, it must be visualized for conclusions to be made.

When a data scientist is writing advanced predictive analytics or machine learning algorithms, it's important
to be able to visualize the outputs to monitor results and ensure that the models are performing as
intended.
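As a minimal sketch with the matplotlib library (assumed installed; the numbers are invented), predicted versus actual values can be plotted to check visually how a model is performing:

import matplotlib.pyplot as plt

actual    = [10, 20, 30, 40, 50]        # invented actual sales figures
predicted = [12, 18, 33, 39, 48]        # the model's predictions

plt.scatter(actual, predicted)           # each point compares one prediction to reality
plt.plot([0, 55], [0, 55])               # a perfect model would fall on this line
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Model predictions vs actual values")
plt.show()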
Visualizations help businesses in many ways. Some examples include the following:
•They help isolate factors that affect customer behavior.

•They identify products or services that need to be improved.

•They make data more memorable for stakeholders.

•They help organizations understand when and where to place specific products.

•They can predict sales or revenue volumes.


Benefits of data visualization

•Actionable insights. A broad spectrum of an organization's personnel can understand visuals presented
in business intelligence dashboards. This lets users absorb information quickly, get better insights and figure
out the next steps faster.
•Exploration of complex relationships. Visualization platforms with advanced capabilities can display
complex relationships among data points and metrics, allowing an organization to make faster data-based
decisions.
•Compelling storytelling. Data dashboards that are visually compelling will maintain the audience's interest
with information they can understand.
•Accessibility. Visualization tools make data more accessible and understandable, so that laypersons or
semi-technical users who aren't data scientists can interpret and analyze it.
•Interactivity. Interactive dashboards have the functionality to allow users to click on various aspects of data
displays to get more information. This is especially useful for those with less expertise on the subject area
covered by the data. Static displays don't allow this.
Disadvantages of data visualization

•Complexity. A highly complicated visualization could appear cluttered or make it difficult to glean valuable
insights. More complexity also means users need training on the tools being used or risk creating the wrong
visual type for the data being used.

•Potential for misinterpretation. Users might have good intentions when using a visualization platform, but
they can draw incorrect conclusions from detailed visualizations.

•Data privacy and security. Users must consider the security and privacy of the data being visualized. A
platform might be susceptible to cyberattacks, thus compromising the security of data being used, or a data
set could be used that isn't compliant with data privacy regulations.

•Bias. Visualizations and the data they're based on should be scrutinized to ensure they aren't intentionally or
unintentionally biased. Failing to do so could compromise the credibility of those analyses. For example, a
data set that leaves out key demographics within a population could lead to a biased visualization of that data.
Ethics in Data Science

Ethics in Data Science refers to the responsible and ethical use of the data throughout the entire data
lifecycle. This includes the collection, storage, processing, analysis, and interpretation of various data.
•Privacy: Respecting an individual's data by maintaining confidentiality and obtaining consent.
•Transparency: Clearly communicating how data is collected, processed and used.
•Fairness and Bias: Ensuring fairness in data-driven processes and addressing biases that may arise in
algorithms, preventing discrimination against certain groups.
•Accountability: Holding individuals and organizations accountable for their actions and decisions based on
data.
•Security: Implementing robust security measures to protect sensitive data from unauthorized access and breaches.
•Data Quality: Ensuring the accuracy, completeness and reliability of the data to prevent misinformation.
Transparency and Documentation

Transparent documentation serves as the backbone of ethical decision-making in data science. It involves:
•Data Sources: Clearly outlining where the data originates from, including its collection methods and any
third-party sources involved.

•Methodologies: Describing the techniques, algorithms, and processes used for data analysis and model
creation. This transparency aids in understanding how conclusions are drawn.

•Transformations: Documenting any modifications or preprocessing steps applied to the data before
analysis. It ensures reproducibility and validates the accuracy of results.
Transparency and Accountability

•Transparency means being open about what is happening and letting everyone know how the data is being used; it is like an open window.

•When it comes to data science, data is the most valuable asset, so maintaining transparency in how the data is used means telling people where the data comes from and how it is being used.

•Accountability, on the other hand, means responsibility: taking responsibility for how the data is handled.

•Together, transparency and accountability create trust and reliability. Transparency builds
understanding, allowing others to see the 'why' and 'how' behind actions.

•Accountability ensures that those responsible for managing data are answerable for their actions and
decisions, fostering a sense of responsibility and trustworthiness in data practices.
Bias Mitigation

Identifying and mitigating biases in data and algorithms is critical for fair outcomes. This includes:
•Data Audits: Regularly auditing datasets for inherent biases based on demographics or historical imbalances (a small audit sketch follows this list).

•Algorithm Fairness: Assessing algorithms to detect and rectify biases in decision-making processes to
ensure fairness across diverse groups.

•Diverse Representation: Actively seeking diverse perspectives and inclusivity in datasets and model development to avoid reinforcing existing biases.
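The data-audit sketch referred to above, in Python with pandas assumed and an invented data set, counts how well each demographic group is represented and compares outcomes across groups before a model is trained:

import pandas as pd

applicants = pd.DataFrame({
    "gender":   ["M", "M", "M", "F", "M", "F"],
    "approved": [1, 1, 0, 0, 1, 0],
})

# Share of records per group: a heavily skewed split can signal sampling bias.
print(applicants["gender"].value_counts(normalize=True))

# Approval rate per group: large gaps warrant a closer fairness review.
print(applicants.groupby("gender")["approved"].mean())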
Data Privacy and Consent

Respecting data privacy laws and obtaining informed consent are foundational principles:
•Informed Consent: Clearly communicating to individuals how their data will be used, ensuring they
understand and agree to its usage.

•Anonymization: Stripping personally identifiable information whenever possible to protect individual identities (a short anonymization sketch follows this list).

•Compliance: Adhering to legal frameworks such as the GDPR (General Data Protection Regulation, a European Union law), HIPAA (Health Insurance Portability and Accountability Act, which protects patients' health information) or the CCPA (California Consumer Privacy Act) to ensure lawful and ethical data handling.
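The anonymization sketch referred to above, in Python with pandas assumed and an invented data set, drops a direct identifier and replaces names with one-way hashes (strictly speaking this is pseudonymization rather than full anonymization):

import hashlib
import pandas as pd

records = pd.DataFrame({
    "name":  ["Amitav", "Sudipta"],
    "email": ["a@example.com", "s@example.com"],
    "age":   [45, 17],
})

anonymized = records.drop(columns=["email"])                  # remove a direct identifier
anonymized["name"] = anonymized["name"].apply(
    lambda n: hashlib.sha256(n.encode()).hexdigest()[:12]     # replace names with hashes
)
print(anonymized)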
Security Measures

Safeguarding data against breaches or unauthorized access involves robust security protocols:
•Encryption: Protecting data through encryption methods to ensure confidentiality, especially for sensitive information (a brief encryption sketch follows this list).

•Access Control: Implementing strict access controls to limit data access to authorized personnel only.

•Regular Audits: Conducting periodic security audits and assessments to identify vulnerabilities and rectify
them promptly.
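The encryption sketch mentioned above, assuming the third-party cryptography package is installed, encrypts a sensitive value with a symmetric key so that only holders of the key can read it:

from cryptography.fernet import Fernet

key = Fernet.generate_key()                    # in practice, store this key in a secure vault
cipher = Fernet(key)

token = cipher.encrypt(b"patient_id=12345")    # ciphertext safe to store or transmit
print(token)
print(cipher.decrypt(token))                   # only the key holder can recover the value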
Ethical Decision-making

Considering the broader ethical implications of data usage and model outcomes involves:
•Societal Impact Assessment: Evaluating the potential societal consequences of deploying models or
algorithms on different groups or communities.

•Ethical Frameworks: Using established ethical frameworks to guide decision-making and identify
potential ethical dilemmas.

•Continuous Evaluation: Regularly assessing the ethical implications of data usage and model outcomes
throughout the project lifecycle.
The Importance of Ethical Data Usage

Data scientists are at the heart of data: they hold data that can drive powerful decisions and shape the future. Because data is so valuable, maintaining ethical standards is not merely an obligation but a fundamental part of a data scientist's role in ensuring responsible data usage.
Ethical data usage is the main building block of trust. When individuals provide their data to organizations or platforms, they expect it to be handled with integrity and basic ethics. Respecting their privacy is the most important part, as it strengthens the organization's reputation.
Seven Global Privacy Principles
1. Notice (Transparency): Inform individuals about the purposes for which information is collected
2. Choice: Offer individuals the opportunity to choose (or opt out) whether and how personal information they
provide is used or disclosed
3. Consent: Only disclose personal information to third parties consistent with the principles of notice and choice
4. Security: Take reasonable measures to protect personal information from loss, misuse, and unauthorized access, disclosure, alteration, and destruction
5. Data Integrity: Assure the reliability of personal information for its intended use and take reasonable precautions to ensure information is accurate, complete, and current
6. Access: Provide individuals with access to the personal information held about them
7. Accountability: A firm must be accountable for following the principles and must include mechanisms for assuring compliance
