Data Analytics
BASIC DEFINITIONS
Data: Data is a set of values of qualitative or quantitative variables. It is information in raw or unorganized form,
such as facts, figures, characters, or symbols.
Information: Information is data that has been organized so that it is meaningful.
Analytics: Analytics is the discovery, interpretation, and communication of meaningful patterns or summaries in
data.
Data Analytics (DA): Data analytics is the process of examining data sets in order to draw conclusions about the
information they contain.
Analytics is not a tool or technology; rather, it is a way of thinking about and acting on data.
WHAT IS DATA ANALYTICS?
The data analytics practice encompasses many separate processes, which can comprise a data pipeline:
Collecting and ingesting the data
Categorizing the data into structured/unstructured forms, which might also define next actions
Managing the data, usually in databases, data lakes, and/or data warehouses
Storing the data in hot, warm, or cold storage
Performing ETL (extract, transform, load)
Analyzing the data to extract patterns, trends, and insights
Sharing the data with business users or consumers, often in a dashboard or via specific storage
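The pipeline steps above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the order data, the column names, and the in-memory buffers standing in for a real source and target are all hypothetical.

```python
import io
import pandas as pd

# Extract: read raw records (an in-memory CSV stands in for a real source).
raw_csv = io.StringIO(
    "order_id,region,amount\n"
    "1,north,100\n"
    "2,south,\n"
    "3,north,250\n"
)
raw = pd.read_csv(raw_csv)

# Transform: drop rows with a missing amount and tidy the region label.
clean = raw.dropna(subset=["amount"]).copy()
clean["region"] = clean["region"].str.title()

# Load: write the cleaned table to a target (here, an in-memory buffer).
target = io.StringIO()
clean.to_csv(target, index=False)

# Analyze: total amount per region.
totals = clean.groupby("region")["amount"].sum()
print(totals)
```

In practice the source and target would be files, databases, or a data warehouse rather than in-memory buffers.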
PRIMARY DATA AND SECONDARY DATA
1. Primary Data Collection:
Primary data collection involves the collection of original data directly from the source or through direct
interaction with the respondents.
Surveys and Questionnaires
Interviews
Observations
Experiments
Focus Groups
2. Secondary Data Collection:
Secondary data collection involves using existing data collected by someone else for a purpose different from the
original intent.
Published Sources
Online Databases
Government and Institutional Records
Publicly Available Data
Past Research Studies
ETL (EXTRACT TRANSFORM LOAD)
DIAGNOSTIC ANALYTICS
Definition: Diagnostic analytics aims to determine the root causes and reasons behind certain events or trends
observed in the data.
Key Characteristics: Involves data exploration, drill-down analysis, and correlation identification. Diagnostic analytics
answers the question of "why did it happen."
Examples: Data mining techniques, regression analysis, cohort analysis.
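One way to sketch correlation identification is to check which factors move together with an outcome. The weekly figures and column names below are hypothetical, and a strong correlation only suggests where to look further; it does not prove a cause.

```python
import pandas as pd

# Hypothetical weekly figures: ad spend, discount rate, and sales.
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "discount": [5, 3, 4, 5, 3],
    "sales":    [110, 205, 290, 410, 500],
})

# Correlation of every other column with sales hints at which factor moves with it.
corr_with_sales = df.corr()["sales"].drop("sales")
print(corr_with_sales.sort_values(ascending=False))
```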
DESCRIPTIVE ANALYTICS
Definition: Descriptive analytics focuses on summarizing historical data to gain insights into past events and
understand the current state.
Key Characteristics: Involves data aggregation, visualization, and reporting. Descriptive analytics answers the
questions of "what happened" and "what is happening."
Examples: Bar charts, line graphs, dashboards displaying key performance indicators (KPIs).
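A minimal sketch of data aggregation for descriptive analytics, assuming a small hypothetical revenue table:

```python
import pandas as pd

# Hypothetical revenue records by month and region.
sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Mar"],
    "region":  ["East", "West", "East", "West", "East"],
    "revenue": [100, 150, 120, 130, 170],
})

# Aggregate: total and average revenue per region.
summary = sales.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)
```

A table like this is the kind of summary that a bar chart or KPI dashboard would then display.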
PREDICTIVE ANALYTICS
Definition: Predictive analytics leverages historical data to make predictions about future outcomes or events.
Key Characteristics: Involves statistical modeling, machine learning algorithms, and pattern recognition. Predictive
analytics answers the question of "what is likely to happen."
Examples: Forecasting models, time series analysis, classification algorithms.
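As a rough illustration of a forecasting model, the sketch below fits a straight-line trend to hypothetical monthly sales and extrapolates one month ahead. Real predictive models are usually more elaborate; the data and the choice of a linear fit are assumptions made for brevity.

```python
import numpy as np

# Hypothetical monthly sales for months 1 to 6.
months = np.array([1, 2, 3, 4, 5, 6])
sales = np.array([100, 110, 120, 130, 140, 150])

# Fit a straight line sales = slope * month + intercept by least squares.
slope, intercept = np.polyfit(months, sales, deg=1)

# Extrapolate the fitted line to month 7.
forecast = slope * 7 + intercept
print(round(forecast, 2))
```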
PRESCRIPTIVE ANALYTICS
Definition: Prescriptive analytics recommends the best course of action based on predictive models, optimization
techniques, and business rules.
Key Characteristics: Involves simulation, optimization algorithms, and decision support systems. Prescriptive
analytics answers the question of "what should be done."
Examples: Optimization models, simulation tools, decision support systems.
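A toy sketch of the prescriptive idea: take a predictive model's output and search for the action with the best result. The demand function, unit cost, and price range here are all hypothetical, and a plain search over candidates stands in for a real optimization algorithm.

```python
# Hypothetical demand model: units sold fall linearly as price rises.
def predicted_units(price):
    return max(0, 100 - 4 * price)

unit_cost = 5  # assumed cost to produce one unit

# Evaluate each candidate price and recommend the most profitable one.
candidate_prices = range(5, 26)
profits = {p: (p - unit_cost) * predicted_units(p) for p in candidate_prices}
best_price = max(profits, key=profits.get)
print(best_price, profits[best_price])
```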
ROLE OF A DATA ANALYST
A data analyst's role is to answer specific questions or address particular challenges that the business has
already identified.
To do this, they examine large datasets to identify trends and patterns. They then visualize their
findings as charts, graphs, and dashboards.
CAREER OPPORTUNITIES
1. Data Scientist
2. Business Intelligence Analyst
3. Data Engineer
4. Business Analyst
5. Marketing Analytics Manager
6. Financial Analyst
7. Quantitative Analyst
8. Risk Analyst
9. Data Governance Analyst
10. Data Visualization Engineer
Steps involved in data analytics.
CORE COMPONENTS OF PANDAS: SERIES AND DATA FRAME
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
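A small sketch of how the two core components relate, using made-up names and ages: a Series is one labelled column of values, and a DataFrame is a table whose columns are Series that share an index.

```python
import pandas as pd

# A Series is a single labelled column of values.
ages = pd.Series([25, 29, 19], name="age")

# A DataFrame is a table whose columns are Series sharing one index.
people = pd.DataFrame({
    "first_name": ["Sam", "Ziva", "Kia"],
    "age": ages,
})

# Selecting one column of a DataFrame gives back a Series.
print(type(people["age"]).__name__)
print(people.shape)
```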
FILE HANDLING WITH PANDAS
https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html
SAMPLE READING AND WRITING .CSV FILE
# Writing a .csv file
import pandas as pd
# Create a dataframe
raw_data = {'first_name': ['Sam', 'Ziva', 'Kia', 'Robin'],
            'degree': ['PhD', 'MBA', '', 'MS'],
            'age': [25, 29, 19, 21]}
df = pd.DataFrame(raw_data)
df
# Save the dataframe
df.to_csv(r'Example1.csv')

# Reading a .csv file
import pandas as pd
# Read csv file
df = pd.read_csv(r'D:\Python\Tutorial\Example1.csv')
df
EXPLORING A DATASET USING PANDAS
Download the dataset from: https://drive.google.com/file/d/1q7qK03njlzZRQ7PyYoprn12uVnE1gjH6/view
import pandas as pd
data_1 = pd.read_csv(r'<datasetpath>')
data_1.head(6)  # first 6 rows
data_1.describe()  # summary statistics of the numeric columns
data_1.memory_usage(deep=True)  # memory used by each column, in bytes
data_1['Gender'] = data_1.Gender.astype('category')  # store repeated labels compactly
data_1.loc[0:4, ['Name', 'Age', 'State']]  # rows labelled 0 to 4 (inclusive), three columns
data_1['DOB'] = pd.to_datetime(data_1['DOB'])  # parse date strings into datetimes
data_1['State'].value_counts()  # number of rows per state
data_1.drop_duplicates(inplace=True)  # remove repeated rows
data_1.groupby(by='State').Salary.mean()  # average salary per state
data_1.sort_values(by='Name', inplace=True)  # sort rows by name
data_1['City temp'] = data_1['City temp'].fillna(38.5)  # fill missing temperatures with 38.5
Ref: https://www.analyticsvidhya.com/blog/2021/05/pandas-functions-13-most-important/
Convert list into series of elements
# convert a list into a Series; the default index runs from 0 to 4
import pandas as pd
my_data = [10, 20, 30, 40, 50]
pd.Series(data=my_data)
Convert dictionary into series of elements
import pandas as pd
d = {'a': 10, 'b': 20, 'c': 30, 'd': 40}
# dictionary keys become the index and the corresponding values become the Series values
pd.Series(d)
DATA MANIPULATION: DROP MISSING ELEMENTS
import pandas as pd
import numpy as np
d = {'A': [1, 2, np.nan], 'B': [1, np.nan, np.nan], 'C': [1, 2, 3]}
# np.nan marks a missing element in the DataFrame
df = pd.DataFrame(d)  # the dictionary is converted into a DataFrame
df.dropna()  # drop every row that contains a missing value
df.dropna(axis=1)  # drop every column that contains a missing value
DATA MANIPULATION: FILLING SUITABLE VALUE
df['A'].fillna(value=df['A'].mean())
# Select column "A" and fill its missing values with the mean of column A
df['A'].fillna(value=df['A'].std())
# Select column "A" and fill its missing values with the standard deviation of column A
# fillna returns a new Series; assign the result back to df['A'] to keep it
REPLACING A VALUE
import pandas as pd
df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]})
print(df.replace({1000: 10, 2000: 60}))  # replace 1000 with 10 and 2000 with 60 wherever they occur
GROUPBY() FUNCTION
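A minimal sketch of groupby(), assuming a small hypothetical salary table: the rows are split by the values of one column, and an aggregate is then computed for each group.

```python
import pandas as pd

# Hypothetical salary records.
df = pd.DataFrame({
    "State":  ["Goa", "Goa", "Kerala", "Kerala", "Kerala"],
    "Salary": [30000, 50000, 45000, 60000, 30000],
})

# Split rows by State, then apply an aggregate to each group.
mean_salary = df.groupby("State")["Salary"].mean()
group_sizes = df.groupby("State").size()
print(mean_salary)
print(group_sizes)
```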
FINDING MAXIMUM VALUE IN EACH LABEL
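One possible way to find the maximum value in each label is to combine groupby() with max(), or with idxmax() when the whole row is wanted. The column names label and value are hypothetical.

```python
import pandas as pd

# Hypothetical labelled values.
df = pd.DataFrame({
    "label": ["a", "b", "a", "b", "c"],
    "value": [3, 7, 9, 2, 5],
})

# Largest value within each label.
max_per_label = df.groupby("label")["value"].max()

# Full row holding the maximum for each label.
rows_with_max = df.loc[df.groupby("label")["value"].idxmax()]
print(max_per_label)
print(rows_with_max)
```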
FINDING UNIQUE VALUES & NUMBER OF OCCURRENCES IN A DATAFRAME
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
# col1, col2 & col3 are column labels; each column has its own values
df['col2'].unique()  # returns the unique values in the column
df['col2'].value_counts()  # counts the occurrences of each value
STATISTICAL FUNCTIONS
import numpy as np
import pandas as pd
s = pd.Series([1,2,3,4,5,4])
print(s.pct_change())  # fractional change from the previous element
df = pd.DataFrame(np.random.randn(5, 2))
print(df.pct_change())
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print(s1.cov(s2))  # covariance between two Series
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(frame['a'].corr(frame['b']))  # correlation between two columns
print(frame.corr())  # pairwise correlation matrix
s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b']  # so there's a tie
print(s)
print(s.rank())  # tied values share the average of their ranks