CCPS521 Win2023 Week01 Intro
Data Science (CCPS521)
Session 1:
An Introduction
What is in this Session
Today's Agenda:
- Ice breaker: Introduce yourself. What kind of work do you usually do? How
familiar are you with IT administration (OS, RDBMS, networking, other)?
Do you or your organization use data science and data science tools, and
what do you expect from this course?
- Who Am I? https://ca.linkedin.com/in/riyad-husein-7b53b77
- Skill-set Survey
- Course Outline
- Session 01: Introduction to Data Science
- Lab 01: Entity Relationship Diagram (ERD)
What do you know about
• R and RStudio:
• MySQL and/or RDBMS Engines (Mention Name):
• Python:
• Big Data technologies like Hadoop:
• HDFS
• MapReduce
• YARN
• Spark
• Hive
• Pig
• Other Hadoop-related tools: Sqoop, HBase, etc.
• Linear Regression & Logistic Regression
• Classification & Clustering
Course Outlines
Week 01
❖ Introduction to Data Science and Data Management
❖ Data Modeling and ER model
❖ Relational Databases, SQL
Week 02
❖ Big data, Computational Frameworks
❖ Data Science tools and platforms and Introduction to Python
Course Outlines
Weeks 03-05
❖ Statistical inference
❖ Data Visualization
❖ Data Analytics
❖ Regression Analysis
Week 06
❖ Clustering and Classification
❖ Classification Models
Course Outlines
Weeks 07-08
❖ Neural Networks
❖ Support Vector Machine
Weeks 09-10
❖ Finding Text Similarities
❖ Processing Text Data
Course Outlines
Weeks 10-11
❖ Network Models
❖ Recommendation Systems
❖ Web Mining and Social Networks
❖ Community Detection
Weeks 12-13
❖ Review and Final Exam
Data Science
https://www.youtube.com/playlist?list=PLMrJAkhIeNNQV7wi9r7Kut8liLFMWQOXn
• Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes, and systems to
extract knowledge or insights from data in various forms, either structured or unstructured,[1][2] similar to data mining.
• Data science is a "concept to unify statistics, data analysis and their related methods" in order to "understand and analyze actual
phenomena" with data.[3] It employs techniques and theories drawn from many fields within the broad areas
of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning,
classification, cluster analysis, data mining, databases, and visualization.
• Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and
now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and
the data deluge.[4][5]
• When Harvard Business Review called it "The Sexiest Job of the 21st Century" [6] the term became a buzzword, and is now often
applied to business analytics,[7] or even arbitrary use of data, or used as a sexed-up term for statistics.[8] While many university
programs now offer a data science degree, there exists no consensus on a definition or curriculum contents. [7] Because of the
current popularity of this term, there are many "advocacy efforts" surrounding it.
Source Wikipedia: https://en.wikipedia.org/wiki/Data_science
Other sources:
Springer : https://link.springer.com/content/pdf/10.1007/s42979-021-00765-8.pdf?pdf=button%20sticky
Scholar: https://scholar.smu.edu/cgi/viewcontent.cgi?article=1114&context=datasciencereview
YouTube: https://www.youtube.com/watch?v=pzo13OPXZS4&list=PLMrJAkhIeNNQV7wi9r7Kut8liLFMWQOXn
Data Science
Data science is typically a “concept to unify statistics, data analysis, and
their related methods” to understand and analyze the actual
phenomena with data. According to Cao et al. [17] “data science is the
science of data” or “data science is the study of data”, where a data
product is a data deliverable, or data-enabled or guided, which can be a
discovery, prediction, service, suggestion, insight into decision-making,
thought, model, paradigm, tool, or system.
Source: https://link.springer.com/content/pdf/10.1007/s42979-021-00765-8.pdf?pdf=button%20sticky
Cao et al. : https://dl.acm.org/doi/pdf/10.1145/3076253
What is Data Mining
What is Data Mining? https://www.youtube.com/watch?v=EH3bp5335IU
Data mining is the computing process of discovering patterns in large data
sets involving methods at the intersection of machine learning, statistics,
and database systems.[1] It is an interdisciplinary subfield of computer
science.[1][2][3] The overall goal of the data mining process is to extract
information from a data set and transform it into an understandable
structure for further use.[1] Aside from the raw analysis step, it involves
database and data management aspects, data pre-processing, model and
inference considerations, interestingness
metrics, complexity considerations, post-processing of discovered
structures, visualization, and online updating.[1] Data mining is the analysis
step of the "knowledge discovery in databases" process, or KDD.
Source: https://en.wikipedia.org/wiki/Data_mining
What is Data Mining
Data mining, also known as knowledge discovery in data (KDD), is the
process of uncovering patterns and other valuable information from
large data sets. Given the evolution of data warehousing technology
and the growth of big data, adoption of data mining techniques has
rapidly accelerated over the last couple of decades, assisting companies
by transforming their raw data into useful knowledge. However, even
though the technology continuously evolves to handle data at large
scale, leaders still face challenges with scalability and automation.
Source: https://www.ibm.com/topics/data-mining
What is Data Mining
Data mining is the process of discovering useful patterns and trends in large data sets.
https://learning.oreilly.com/library/view/discovering-knowledge-in/9781118873571/c01.xhtml#:~:text=Data%20mining%20is,large%20data%20sets.
Data mining: Is a cluster of non-parametric techniques for automatically extracting useful
information and relationships from immense quantities of data (Larose 2014; Han and Kamber
2011).
Data mining is named because it is data-driven, and thus the data miner is essentially the data
explorer or data discoverer. Due to its exploratory character, data mining is named knowledge
discovery in databases (KDD).
Further, because data mining utilizes machine learning algorithms, many people view it as a process
of data modeling by machine learning (Leskovec et al., 2020).
Chong Ho Alex Yu. (2022). Data Mining and Exploration : From
Traditional Statistics to Modern Data Science. CRC Press. (Page 26)
Data Science
Data science: is the umbrella term that first appeared in 2013, encompassing data
mining, data analytics, and big data analytics. A common approach to defining data
science is to view it as a fusion of statistics/data analysis, computer science, and
domain (contextual) knowledge.
Data mining: The term emerged around 1990. It is a specific application of data
science.
Data analytics: Is a modern term for data analysis. It refers to research and
evaluation utilizing modern data science methods, whereas data analysis
refers to analytical activities based upon traditional statistical methods.
Big data analytics: The term was coined between 2010 and 2011. Big data analytics, as
the name implies, is specific to analytics with big data. Nonetheless, data science
includes analytics at all scales: big, medium, and small datasets.
Resources (Need active login to TMU libraries):
Chong Ho Alex Yu. (2022). Data Mining and Exploration : From Traditional Statistics to Modern
Data Science. CRC Press. (Page 22)
Machine Learning
Machine learning (ML) is a field of inquiry devoted to understanding
and building methods that 'learn', that is, methods that leverage data
to improve performance on some set of tasks.[1] It is seen as a part
of artificial intelligence.
Machine learning algorithms build a model based on sample data,
known as training data, in order to make predictions or decisions
without being explicitly programmed to do so.[2] Machine learning
algorithms are used in a wide variety of applications, such as in
medicine, email filtering, speech recognition, agriculture,
and computer vision, where it is difficult or unfeasible to develop
conventional algorithms to perform the needed tasks.[3][4]
Source: https://en.wikipedia.org/wiki/Machine_learning
Machine Learning
Machine learning: is a branch of artificial intelligence (AI) and
computer science which focuses on the use of data and algorithms to
imitate the way that humans learn, gradually improving its accuracy.
Sources: https://www.ibm.com/topics/machine-learning ,
https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained
Definition: A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E.
Mitchell, T. M. (1997). Machine learning (Vol. 1). McGraw-hill New York.
What is machine learning?
https://www.youtube.com/watch?v=WXHM_i-fgGo ,
https://www.youtube.com/watch?v=ukzFI9rgwfU
Machine Learning
What is supervised learning?
• Supervised learning is a machine learning approach that’s defined by its use of labeled
datasets. These datasets are designed to train or “supervise” algorithms into classifying
data or predicting outcomes accurately. Using labeled inputs and outputs, the model can
measure its accuracy and learn over time.
• Supervised learning can be separated into two types of problems when data mining:
classification and regression:
• Classification problems use an algorithm to accurately assign test data into specific
categories, such as separating apples from oranges. Or, in the real world, supervised
learning algorithms can be used to classify spam in a separate folder from your inbox.
Linear classifiers, support vector machines, decision trees and random forest are all
common types of classification algorithms.
• Regression is another type of supervised learning method that uses an algorithm to
understand the relationship between dependent and independent variables. Regression
models are helpful for predicting numerical values based on different data points, such as
sales revenue projections for a given business. Some popular regression algorithms are
linear regression, logistic regression and polynomial regression.
Source: https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning
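To make the regression branch above concrete, here is a minimal pure-Python sketch (the data points are invented for illustration) that fits a line y ≈ w·x + b by ordinary least squares:

```python
# Ordinary least squares for one feature: w and b minimize the squared
# error between the predictions w*x + b and the labeled outputs y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # toy labels, roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# slope = covariance(x, y) / variance(x); intercept follows from the means
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x
print(w, b)   # slope close to 2, intercept close to 0
```

The same supervised idea, with labeled categories instead of numeric targets, underlies the classification algorithms listed above.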
Machine Learning
What is unsupervised learning?
• Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled data
sets. These algorithms discover hidden patterns in data without the need for human intervention
(hence, they are “unsupervised”).
• Unsupervised learning models are used for three main tasks: clustering, association and
dimensionality reduction:
• Clustering is a data mining technique for grouping unlabeled data based on their similarities or
differences. For example, K-means clustering algorithms assign similar data points into groups,
where the K value represents the number of groups (and thus the granularity). This technique is helpful for
market segmentation, image compression, etc.
• Association is another type of unsupervised learning method that uses different rules to find
relationships between variables in a given dataset. These methods are frequently used for market
basket analysis and recommendation engines, along the lines of “Customers Who Bought This
Item Also Bought” recommendations.
• Dimensionality reduction is a learning technique used when the number of features (or
dimensions) in a given dataset is too high. It reduces the number of data inputs to a manageable
size while also preserving the data integrity. Often, this technique is used in the preprocessing
data stage, such as when autoencoders remove noise from visual data to improve picture quality.
Source: https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning
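The clustering task described above can be sketched in a few lines of pure Python. This is a toy K-means (Lloyd's algorithm) on made-up 2-D points, not production code:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: returns final centroids and a cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data

    def nearest(p):
        # index of the centroid with the smallest squared Euclidean distance
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))

    for _ in range(iters):
        # assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # update step: move each centroid to its cluster's mean (skip empty)
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return centroids, [nearest(p) for p in points]

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
cents, labels = kmeans(pts, 2)
print(cents, labels)   # centroids near (1.03, 0.97) and (8.0, 8.0)
```

With k = 2 on these well-separated points, the algorithm recovers the two obvious groups; the K in K-means is simply the number of clusters you ask for.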
Machine Learning
The main difference between supervised and unsupervised learning:
Labeled data
The main distinction between the two approaches is the use of labeled
datasets. To put it simply, supervised learning uses labeled input and output
data, while an unsupervised learning algorithm does not.
In supervised learning, the algorithm “learns” from the training dataset by
iteratively making predictions on the data and adjusting for the correct
answer. While supervised learning models tend to be more accurate than
unsupervised learning models, they require upfront human intervention to
label the data appropriately. For example, a supervised learning model can
predict how long your commute will be based on the time of day, weather
conditions and so on. But first, you’ll have to train it to know that rainy
weather extends the driving time.
Source: https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning
Machine Learning
The main difference between supervised and unsupervised learning:
Labeled data
Unsupervised learning models, in contrast, work on their own to
discover the inherent structure of unlabeled data. Note that they still
require some human intervention for validating output variables. For
example, an unsupervised learning model can identify that online
shoppers often purchase groups of products at the same time.
However, a data analyst would need to validate that it makes sense for
a recommendation engine to group baby clothes with an order of
diapers, applesauce and sippy cups.
Source: https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning
Machine Learning Life Cycle
https://www.datacamp.com/blog/machine-learning-lifecycle-explained
Machine Learning Life Cycle
The 6 steps in a standard machine learning life cycle:
1. Planning
2. Data Preparation
3. Model Engineering
4. Model Evaluation
5. Model Deployment
6. Monitoring and Maintenance
Source: https://www.datacamp.com/blog/machine-learning-lifecycle-explained
Data Science Life Cycle
https://learn.microsoft.com/en-us/azure/architecture/data-science-process/lifecycle
Data Science Life Cycle
Five lifecycle stages
The TDSP lifecycle is composed of five major stages that are
executed iteratively. These stages include:
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Source: https://learn.microsoft.com/en-us/azure/architecture/data-science-process/lifecycle
Artificial Intelligence
Artificial intelligence: Is the science and engineering of making intelligent
machines, especially intelligent computer programs. It is related to the
similar task of using computers to understand human intelligence, but AI
does not have to confine itself to methods that are biologically observable.
https://www.ibm.com/topics/artificial-intelligence
Artificial intelligence (AI) is the ability of a computer or a robot controlled by
a computer to do tasks that are usually done by humans because they
require human intelligence and discernment. Although there are no AIs that
can perform the wide variety of tasks an ordinary human can do, some AIs
can match humans in specific tasks.
https://www.britannica.com/technology/artificial-intelligence
Artificial intelligence (AI) is a set of technologies that enable computers to
perform a variety of advanced functions, including the ability to see,
understand and translate spoken and written language, analyze data, make
recommendations, and more. https://cloud.google.com/learn/what-is-artificial-intelligence
Deep Learning – An Introduction
Source: https://medium.com/intro-to-artificial-intelligence/deep-learning-series-1-intro-to-deep-learning-abb1780ee20
What is Big Data
A Bit of History on Data (00:30 – 9:51): https://www.youtube.com/watch?v=gq_T7EgQXkI
Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage
and process it within a tolerable elapsed time for its user population.
Merv Adrian Article in Teradata Magazine Q1/2011
Source: http://docshare04.docshare.tips/files/20905/209055375.pdf
• The term "Big Data" was first used by NASA researchers in a 1997 article:
https://www.nas.nasa.gov/assets/pdf/techreports/1997/nas-97-010.pdf
• It was first mentioned in the media in 2005: A Short History of Big Data:
https://datafloq.com/read/big-data-history/239
• Big data is a term for data sets that are so large or complex that traditional data processing application
software is inadequate to deal with them. Big data challenges include capturing data, data storage, data
analysis, search, sharing, transfer, visualization, querying, updating and information privacy.
• Lately, the term "big data" tends to refer to the use of predictive analytics, user behavior analytics, or certain
other advanced data analytics methods that extract value from data, and seldom to a particular size of data
set. "There is little doubt that the quantities of data now available are indeed large, but that’s not the most
relevant characteristic of this new data ecosystem."[2] Analysis of data sets can find new correlations to "spot
business trends, prevent diseases, combat crime and so on."[3] Scientists, business executives, practitioners
of medicine, advertising and governments alike regularly meet difficulties with large data-sets in areas
including Internet search, fintech, urban informatics, and business informatics. Scientists encounter
limitations in e-Science work, including meteorology, genomics,[4] connectomics, complex physics
simulations, biology and environmental research.[5]
Source: https://en.wikipedia.org/wiki/Big_data
What is Big Data
The trend is for every individual’s data footprint to grow, but perhaps
more significantly, the amount of data generated by machines as a part
of the Internet of Things will be even greater than that generated by
people. Machine logs, RFID readers, sensor networks, vehicle GPS
traces, retail transactions—all of these contribute to the growing
mountain of data. The volume of data being made publicly available
increases every year, too. Organizations no longer have to merely
manage their own data; success in the future will be dictated to a large
extent by their ability to extract value from other organizations’ data.
Source: Hadoop, The Definitive Guide by Tom White, 4th edition (2015), page 4
What is Big Data
1 Byte = 8 bits; a bit holds 0 or 1 (On/Off, Yes/No), etc.
Kilobyte = 2^10 bytes = 1,024 (decimal prefix: 10^3)
Megabyte = 2^20 bytes (10^6)
Gigabyte = 2^30 bytes (10^9)
Terabyte = 2^40 bytes (10^12)
Petabyte = 2^50 bytes (10^15)
Exabyte = 2^60 bytes (10^18)
Zettabyte = 2^70 bytes (10^21)
Yottabyte = 2^80 bytes (10^24)
Brontobyte (informal) = 2^90 bytes (10^27)
Source: https://en.wikipedia.org/wiki/Exabyte
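The unit table above is easy to verify directly, since each binary unit is just the next power of 2^10 (a quick Python check):

```python
# Each binary storage unit is 2^10 times the previous one; the decimal (SI)
# prefixes are the nearby powers of 10^3.
units = ["Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]
sizes = {name: 2 ** (10 * (i + 1)) for i, name in enumerate(units)}

print(sizes["Kilobyte"])   # 1024
print(sizes["Terabyte"])   # 1099511627776, i.e. roughly 10^12
```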
Big Data Systems
The six Vs
• Volume: Big data implies enormous volumes of data. Data used to be created mainly by employees; now it is generated by machines, networks, and human interaction on systems like social media, so the volume of data to be analyzed is massive. Yet Inderpal states that volume is less of a problem than other Vs such as veracity.
• Variety: Variety refers to the many sources and types of data, both structured and unstructured. We used to store data from sources like spreadsheets and databases; now data arrives as emails, photos, videos, monitoring devices, PDFs, audio, etc. This variety of unstructured data creates problems for storing, mining, and analyzing data.
• Velocity: Velocity is the pace at which data flows in from sources such as business processes, machines, networks, and human interaction with social media sites, mobile devices, etc. The flow of data is massive and continuous. This real-time data can help researchers and businesses make decisions that provide strategic competitive advantage and ROI, if you are able to handle the velocity. Inderpal suggests that sampling data can help deal with issues like volume and velocity.
• Veracity: Veracity refers to the biases, noise, and abnormality in data: is the data being stored and mined meaningful to the problem being analyzed? Inderpal feels that veracity is the biggest challenge in data analysis compared with volume and velocity. In scoping out your big data strategy, have your team and partners work to keep your data clean, with processes that keep "dirty data" from accumulating in your systems.
• Validity: Related to veracity is validity: is the data correct and accurate for the intended use? Clearly, valid data is key to making the right decisions.
• Volatility: Volatility refers to how long data remains valid and how long it should be stored. In a world of real-time data, you need to determine at what point data is no longer relevant to the current analysis.
Resources:
The 10 Vs of Big Data: https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx
https://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/
https://www.ibm.com/analytics/hadoop/big-data-analytics
Big Data Facilitation, Utilization, and Monetization: Exploring the 3Vs in a New Product Development Process by Jeff S. Johnson , Scott B. Friend, and Hannah S. Lee
http://onlinelibrary.wiley.com.ezproxy.lib.ryerson.ca/doi/10.1111/jpim.12397/pdf
What is Data Lake
What is a Data Lake? https://www.youtube.com/watch?v=aC9_fDoMH6M ,
https://www.youtube.com/watch?v=LxcH6z8TFpI
A data lake is a method of storing data within a system or repository, in its
natural format,[1] that facilitates the collocation of data in various schemata
and structural forms, usually object blobs or files. The idea of data lake is to
have a single store of all data in the enterprise ranging from raw data (which
implies exact copy of source system data) to transformed data which is used
for various tasks including reporting, visualization, analytics and machine
learning. The data lake includes structured data from relational databases
(rows and columns), semi-structured data (CSV, logs, XML, JSON),
unstructured data (emails, documents, PDFs) and even binary data (images,
audio, video) thus creating a centralized data store accommodating all forms
of data.
Sources:
https://en.wikipedia.org/wiki/Data_lake
https://docs.microsoft.com/en-us/azure/architecture/data-guide/scenarios/data-lake
Data Scientist
The Data Scientist https://www.youtube.com/watch?v=i2jwZcWicSY
A data scientist (DST) is someone who knows how to extract meaning
from and interpret data, which requires both tools and methods from
statistics and machine learning, as well as being human. DST spends a
lot of time in the process of collecting, cleaning, and munging data,
because data is never clean. This process requires persistence,
statistics, and software engineering skills—skills that are also necessary
for understanding biases in the data, and for debugging logging output
from code.
Source: https://datasciencedegree.wisconsin.edu/data-science/what-do-data-scientists-do/
Deep Learning
Introduction to Deep Learning: What Is Deep Learning?
https://www.youtube.com/watch?v=3cSjsTKtN9M
Introduction to Deep Learning: Machine Learning vs Deep Learning
https://www.youtube.com/watch?v=-SgkLEuhfbg&vl=en
Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of
machine learning methods based on learning data representations, as opposed to task-specific algorithms.
Learning can be supervised, semi-supervised or unsupervised.
Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural
networks have been applied to fields including computer vision, speech recognition, natural language
processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design,
medical image analysis, material inspection and board game programs, where they have produced results
comparable to and in some cases superior to human experts.
Deep learning models are vaguely inspired by information processing and communication patterns in biological
nervous systems, yet they differ in various ways from the structural and functional properties of biological brains
(especially human brains), which makes them incompatible with neuroscience evidence.
Source: https://en.wikipedia.org/wiki/Deep_learning
The Internet of Things (IoT)
The Internet of Things (IoT) is the ultimate change agent, enabling
commerce and industry to connect, measure, and manage products,
information, operations, and the enterprise. Big Data is getting bigger
due to IoT, and this highly distributed, unstructured data is generated
by a wide range of sensors, beacons, applications, websites, social
media, weather data, computers, smartphones, and more. An important
question is: how can IoT data be monetized? Analytics and management of
Big Data is one of the most promising IoT opportunities for revenue
growth and value creation today. Without the ability to leverage IoT
analytics, organizations and governments will be left with basic sensor
data and unable to truly harness the tangible value of IoT. The course
includes real-world case scenarios where IoT adoption has huge
potential to deliver monetization, efficiency, productivity, profitability,
competitive advantage and economic prosperity.
Internet of Things (IOT)
https://www.youtube.com/watch?v=PXncS2_63o4
Relational Databases
A relational database is a type of database that stores and provides access to data points
that are related to one another. Relational databases are based on the relational model, an
intuitive, straightforward way of representing data in tables. In a relational database, each
row in the table is a record with a unique ID called the key. The columns of the table hold
attributes of the data, and each record usually has a value for each attribute, making it
easy to establish the relationships among data points.
https://www.oracle.com/ca-en/database/what-is-a-relational-database/
Relational Database Management Systems (RDBMS)
• What is a database management system (DBMS)?
• A database typically requires a comprehensive database software
program known as a database management system (DBMS). A DBMS
serves as an interface between the database and its end users or
programs, allowing users to retrieve, update, and manage how the
information is organized and optimized. A DBMS also facilitates
oversight and control of databases, enabling a variety of
administrative operations such as performance monitoring, tuning,
and backup and recovery.
• Some examples of popular database software or DBMSs include
MySQL, Microsoft Access, Microsoft SQL Server, FileMaker Pro, Oracle
Database, and dBASE.
• https://www.oracle.com/ca-en/database/what-is-database/
Relational Database Management Systems (RDBMS)
• What is Structured Query Language (SQL)?
• SQL is a programming language used by nearly all relational
databases to query, manipulate, and define data, and to provide
access control. SQL was first developed at IBM in the 1970s, with
Oracle as a major contributor, which led to the ANSI SQL standard.
SQL has since spurred many extensions from companies such as IBM,
Oracle, and Microsoft. Although SQL is still widely used today, new
programming languages are beginning to appear.
https://www.oracle.com/ca-en/database/what-is-database/
Relational Database Management Systems (RDBMS)
• What is a MySQL database?
• MySQL is an open source relational database management system based on SQL. It was
designed and optimized for web applications and can run on any platform. As new and different
requirements emerged with the internet, MySQL became the platform of choice for web
developers and web-based applications. Because it’s designed to process millions of queries and
thousands of transactions, MySQL is a popular choice for ecommerce businesses that need to
manage multiple money transfers. On-demand flexibility is the primary feature of MySQL.
• MySQL is the DBMS behind some of the top websites and web-based applications in the world,
including Airbnb, Uber, LinkedIn, Facebook, Twitter, and YouTube.
https://www.oracle.com/ca-en/database/what-is-database/
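The relational ideas above (tables, keys, SQL queries) can be tried without installing MySQL by using Python's built-in sqlite3 module. The table and rows below are hypothetical examples, and SQLite's SQL dialect differs from MySQL's in minor ways:

```python
import sqlite3

# In-memory SQLite database standing in for MySQL: same core SQL, no server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A table with a primary key (the unique ID) and attribute columns.
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, program TEXT)")
cur.executemany(
    "INSERT INTO student (id, name, program) VALUES (?, ?, ?)",
    [(1, "Amina", "Data Science"), (2, "Li", "Data Science"), (3, "Omar", "CS")],
)
conn.commit()

# Parameterized query: the ? placeholder keeps user input out of the SQL text.
rows = cur.execute(
    "SELECT name FROM student WHERE program = ? ORDER BY id", ("Data Science",)
).fetchall()
print(rows)
conn.close()
```

Using `?` placeholders rather than string formatting is the standard way to avoid SQL injection, in SQLite and MySQL alike.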