Lect1

Uploaded by

shashwat
Copyright
© All Rights Reserved

Python for Data Science

CSE, PEC
About the course

◻ What is data science?

◻ Data all around

◻ Why Python?
Data All Around
Lots of data is being collected
and warehoused
Web data, e-commerce
Financial transactions, bank/credit transactions
Online trading and purchasing
Social Network
Limitations of the File-Based Approach

• Separated and Isolated Data
• Duplication of data
• Data Dependence
• Difficulty in representing data from the user’s view
• Data Inflexibility
• Incompatible file formats
Big Data
Hype cycle 2014
Hype cycle 2022
What is Data Science?
• An area that manages, manipulates, extracts,
and interprets knowledge from tremendous
amounts of data
• Data science (DS) is a multidisciplinary field of
study whose goal is to address the challenges of
big data
• Mathematics
• Statistics
• Machine learning and artificial intelligence
• Specialized programming
• Data science principles apply to all data – big
and small
What is Data Science?
• Investigate and analyze a large amount of data
to help decision makers
• Science, engineering, economics, politics,
finance, and education
• Computer Science
• Pattern recognition, visualization, data warehousing, High
performance computing, Databases, AI
• Mathematics
• Mathematical Modeling
• Statistics
• Statistical and Stochastic modeling, Probability.
What is data science?
Data science produces insights. Machine learning produces predictions.
Why Python
• Python libraries
• Data manipulation and pre-processing
• Data Summary
• Visualization
• ML Libraries
Applications
• Banking: all transactions
• Airlines: reservations, schedules
• Universities: registration, grades
• Sales: customers, products, purchases
• Online retailers: order tracking, customized
recommendations
• Manufacturing: production, inventory, orders, supply
chain
• Human resources: employee records, salaries, tax
deductions
DATA SCIENCE APPLICATION EXAMPLES
Types of Data We Have
• Relational Data (Tables/Transaction/Legacy
Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
• You can afford to scan the data only once
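The one-scan constraint on streaming data can be illustrated with a running summary. The sketch below assumes a hypothetical `sensor_stream` generator and uses Welford's method, a standard single-pass technique that the slides do not name, to keep a mean and variance without storing the stream.

```python
class RunningStats:
    """Mean and sample variance computed in a single pass (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Each value is seen exactly once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


def sensor_stream():
    # Hypothetical stream: a generator can only be consumed once.
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)

print(stats.n, stats.mean, stats.variance)
```

Memory use stays constant regardless of stream length, which is the point of the single-scan restriction.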
Roles
What To Do With These Data?
• Aggregation and Statistics
• Data warehousing and OLAP
• Indexing, Searching, and Querying
• Keyword based search
• Pattern matching (XML/RDF)
• Knowledge discovery
• Data Mining
• Statistical Modeling
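Two of the operations listed above, aggregation and keyword-based search, can be sketched with the standard library alone. The `sales` records and the helper names `total_by`, `build_index`, and `keyword_search` are hypothetical; a real system would more likely rely on a database or a search engine.

```python
from collections import defaultdict

# Hypothetical transaction records.
sales = [
    {"customer": "A", "product": "laptop bag", "amount": 40.0},
    {"customer": "B", "product": "laptop", "amount": 900.0},
    {"customer": "A", "product": "mouse", "amount": 25.0},
    {"customer": "C", "product": "gaming laptop", "amount": 1500.0},
]


def total_by(records, key, value):
    """Aggregation: sum `value` for each distinct `key`."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record[value]
    return dict(totals)


def build_index(records, field):
    """Indexing: map each word to the positions of records containing it."""
    index = defaultdict(set)
    for position, record in enumerate(records):
        for word in record[field].lower().split():
            index[word].add(position)
    return index


def keyword_search(index, query):
    """Searching: positions of records that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


totals = total_by(sales, "customer", "amount")
index = build_index(sales, "product")
print(totals)
print(keyword_search(index, "laptop"))
print(keyword_search(index, "gaming laptop"))
```

Building the index once and querying it many times is what separates indexed search from rescanning every record per query.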
What are we going to do with data?

• Descriptive analysis and visualization
• Supervised learning (in particular, regression and classification)
• Unsupervised learning (in particular, clustering and dimensionality reduction)
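As a small illustration of descriptive analysis and of regression as a supervised task, the sketch below summarizes a made-up dataset and fits a least-squares line to it. The `hours` and `score` values and the `fit_line` helper are assumptions for illustration only; in practice a library such as scikit-learn would typically be used.

```python
import statistics

# Hypothetical data: study hours (feature) and exam score (target).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [52.0, 54.0, 59.0, 61.0, 64.0]

# Descriptive analysis: summarize the target variable.
summary = {
    "mean": statistics.mean(score),
    "median": statistics.median(score),
    "stdev": statistics.stdev(score),
}


def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Supervised learning (regression): learn from labelled pairs, then predict.
slope, intercept = fit_line(hours, score)
predicted = slope * 6.0 + intercept

print(summary)
print(slope, intercept, predicted)
```

The summary describes data already observed, while the fitted line is used to estimate a score for an unseen input (6 hours).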
Data Acquisition
• Exploring and defining the methods of obtaining data
• What data is needed to achieve the goal?
• How much data is needed?
• Where and how can this data be found?
• What legal and privacy concerns should be considered?
Data sources

Web scraping
Secondary data

GitHub
Kaggle
KDnuggets
UCI Machine Learning Repository
US Government’s Open Data
FiveThirtyEight
Amazon Web Services
BuzzFeed
Data is Plural
Harvard HCI
Application Programming Interface (API).
HTTP request/response cycle
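The HTTP request/response cycle behind an API call can be sketched end to end without touching a real service. The example below starts a throwaway local server (the `DemoHandler` endpoint and its JSON payload are hypothetical), sends a GET request to it, and decodes the JSON response. A real API would add its own URL, authentication, and rate limits.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical API endpoint that answers every GET with a JSON body."""

    def do_GET(self):
        body = json.dumps({"path": self.path, "rows": [1, 2, 3]}).encode("utf-8")
        self.send_response(200)                      # status line
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                       # response body

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_json(url):
    """Client side: send a request, read the response, decode the JSON."""
    request = Request(url, headers={"Accept": "application/json"})
    opener = build_opener(ProxyHandler({}))  # bypass any configured proxy
    with opener.open(request, timeout=5) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    port = server.server_address[1]
    status, payload = fetch_json(f"http://127.0.0.1:{port}/data")
finally:
    server.shutdown()
    server.server_close()

print(status, payload)
```

The client never sees how the server produced the data; it only sees a status code, headers, and a body, which is what makes an API a stable contract between the two sides.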
Types of Secondary data
• Administrative and Monitoring Data
• Geospatial Data
• traditional satellites, micro- and nano-satellites, and unmanned aerial
vehicles (UAVs, e.g. drones)
• Remote Sensing
• sensors and the Internet of Things (IoT)
• Telecom Data
• call detail records, social media data
• Crowd-sourced Data
• mobile apps.
Cleaned vs. raw data
• Already cleaned, filtered, and ready to use
Data Cleaning
• Ensuring Valid Analysis
• outliers, missing values, typos, erroneous survey codes, illogical
values, duplicates, etc.
• Making the Dataset Usable and
Understandable
• code and document the dataset to make it as self-explanatory as possible
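A few of the checks named above (duplicates, missing values, illogical values) can be sketched as a single filtering pass. The `raw` rows and the plausible age range of 0 to 120 are assumptions made for illustration; the right rules depend on the dataset and its codebook.

```python
# Hypothetical raw survey rows: (respondent_id, age).
raw = [(1, 34), (2, None), (3, 29), (3, 29), (4, -5), (5, 41), (6, 38), (7, 240)]


def clean(rows, low=0, high=120):
    """Split rows into kept rows and dropped rows with a documented reason."""
    seen = set()
    kept, dropped = [], []
    for row in rows:
        _, age = row
        if row in seen:
            dropped.append((row, "duplicate"))
        elif age is None:
            dropped.append((row, "missing"))
        elif not low <= age <= high:
            dropped.append((row, "illogical"))  # outside the assumed range
        else:
            kept.append(row)
        seen.add(row)
    return kept, dropped


kept, dropped = clean(raw)
print(kept)
for row, reason in dropped:
    print(row, reason)
```

Recording why each row was dropped, rather than silently deleting it, serves the second goal on the slide: the cleaned dataset stays understandable to whoever uses it next.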
Data issues
• ID Variables
• must uniquely and fully identify each observation
• Illogical Values
• Typos
• Survey Codes and Missing Values
• like "Do not know" or "Decline to answer"
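Two of these issues lend themselves to a short check: whether an ID variable is unique and complete, and whether survey codes have been separated from real values. In the sketch below the code values -88 and -99, the `records` list, and the helper names are all hypothetical; an actual survey defines its own codes.

```python
from collections import Counter

# Assumed survey codes; a real codebook defines its own values.
SURVEY_CODES = {-88: "Do not know", -99: "Decline to answer"}

# Hypothetical household records.
records = [
    {"id": "H001", "income": 52000},
    {"id": "H002", "income": -88},
    {"id": "H003", "income": -99},
    {"id": "H002", "income": 61000},
    {"id": None, "income": 47000},
]


def check_ids(rows, key="id"):
    """Return (IDs that occur more than once, number of rows with no ID)."""
    counts = Counter(row[key] for row in rows if row[key] is not None)
    duplicated = sorted(k for k, count in counts.items() if count > 1)
    missing = sum(1 for row in rows if row[key] is None)
    return duplicated, missing


def recode(rows, field, codes=SURVEY_CODES):
    """Replace survey codes with None and keep the reason in a new field."""
    recoded = []
    for row in rows:
        row = dict(row)  # copy so the raw data is left untouched
        value = row[field]
        row[field + "_code"] = codes.get(value)
        if value in codes:
            row[field] = None
        recoded.append(row)
    return recoded


duplicated, missing = check_ids(records)
recoded = recode(records, "income")
valid = [row["income"] for row in recoded if row["income"] is not None]

print(duplicated, missing)
print(sum(valid) / len(valid))
```

Leaving the codes in place would pull the average income down, because -88 and -99 would be treated as real amounts.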
