UNIT 1
CS3352 Unit I
Data
In computing, data is information that has been translated into a form that is efficient for movement or
processing
Data Science
Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data
produced today. It adds methods from computer science to the repertoire of statistics.
Facets of data
In data science and big data you’ll come across many different types of data, and each of them tends to require
different tools and techniques. The main categories of data are these:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
Let’s explore all these interesting data types.
Structured data
• Structured data is data that depends on a data model and resides in a fixed field within a record. As such, it’s often easy to store structured data in tables within databases or Excel files.
• SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases (a minimal sketch follows below).
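For instance, the following short Python sketch uses the built-in sqlite3 module to show structured data living in a table with fixed fields and being queried with SQL; the customers table and its columns are hypothetical, chosen only for illustration.

import sqlite3

# Structured data fits naturally into a table with fixed fields.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Alice", "FR"), ("Bob", "NL")],
)

# SQL is used to query the structured data.
for row in conn.execute("SELECT name FROM customers WHERE country = 'NL'"):
    print(row)          # ('Bob',)
conn.close()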
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying. One example of unstructured data is your regular email.
Natural language
• Natural language is a special type of unstructured data; it’s challenging to process because it requires knowledge of specific data science techniques and linguistics.
• The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don’t generalize well to other domains.
• Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text.
Machine-generated data
• Machine-generated data is information that’s automatically created by a computer, process, application, or other machine without human intervention.
• Machine-generated data is becoming a major data resource and will continue to do so.
• The analysis of machine data relies on highly scalable tools, due to its high volume and speed. Examples of machine data are web server logs, call detail records, network event logs, and telemetry.
Streaming data
• The data flows into the system when an event happens instead of being loaded into a data store in a batch.
• Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock market.
The data science process
The data science process consists of six steps.
1. The first step of this process is setting a research goal. The main purpose here is making sure all the stakeholders understand the what, how, and why of the project. In every serious project this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this step includes
finding suitable data and getting access to the data from the data owner. The result is data in its raw
form, which probably needs polishing and transformation before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data from a raw
form into data that’s directly usable in your models. To achieve this, you’ll detect and correct different
kinds of errors in the data, combine data from different data sources, and transform it. If you have
successfully completed this step, you can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data.
You’ll look for patterns, correlations, and deviations based on visual and descriptive techniques. The
insights you gain from this phase will enable you to start modeling.
5. Finally, we get to model building (often referred to as “data modeling” throughout this book). It is now
that you attempt to gain the insights or make the predictions stated in your project charter. Now is the
time to bring out the heavy guns, but remember research has taught us that often (but not always) a
combination of simple models tends to outperform one complicated model. If you’ve done this phase
right, you’re almost done.
6. The last step of the data science model is presenting your results and automating the analysis, if needed.
One goal of a project is to change a process and/or make better decisions. You may still need to convince
the business that your findings will indeed change the business process as expected. This is where you can shine in your influencer role. The importance of this step is more apparent in projects on a strategic
and tactical level. Certain projects require you to perform the business process over and over again, so
automating the project will save time.
Retrieving data
• The next step in data science is to retrieve the required data. Sometimes you need to go into the field and design a data collection process yourself, but most of the time you won’t be involved in this step.
• Many companies will have already collected and stored the data for you, and what they don’t have can often be bought from third parties.
• More and more organizations are making even high-quality data freely available for public and commercial use.
• Data can be stored in many forms, ranging from simple text files to tables in a database. The objective now is acquiring all the data you need.
• Most companies have a program for maintaining key data, so much of the cleaning work may already be done. This data can be stored in official data repositories such as databases, data marts, data warehouses, and data lakes maintained by a team of IT professionals.
• Data warehouses and data marts are home to preprocessed data, whereas data lakes contain data in its natural or raw format.
• Finding data even within your own company can sometimes be a challenge. As companies grow, their data becomes scattered around many places. Data may also be dispersed as people change positions and leave the company.
• Getting access to data is another difficult task. Organizations understand the value and sensitivity of data and often have policies in place so everyone has access to what they need and nothing more.
• These policies translate into physical and digital barriers called Chinese walls. These “walls” are mandatory and well-regulated for customer data in most countries.
External Data
• If data isn’t available inside your organization, look outside your organization. Some companies provide data so that you, in turn, can enrich their services and ecosystem. Such is the case with Twitter, LinkedIn, and Facebook.
• More and more governments and organizations share their data for free with the world.
• There are many open data providers that can get you started.
Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from.
• The first type is the interpretation error, such as when you take the value in your data for granted, like saying that a person’s age is greater than 300 years.
• The second type of error points to inconsistencies between data sources or against your company’s standardized values.
An example of this class of errors is putting “Female” in one table and “F” in another when they represent
the same thing: that the person is female.
Sometimes you’ll use more advanced methods, such as simple modeling, to find and identify data errors;
diagnostic plots can be especially insightful. For example, in the figure we use a measure to identify data points
that seem out of place. We do a regression to get acquainted with the data and detect the influence of individual
observations on the regression line.
• Data collected by machines or computers isn’t free from errors. Some errors arise from human sloppiness, whereas others are due to machine or hardware failure.
• Detecting data errors when the variables you study don’t have many classes can be done by tabulating the data with counts.
• When you have a variable that can take only two values, “Good” and “Bad”, you can create a frequency table and see if those are truly the only two values present. In the table, the values “Godo” and “Bade” point out that something went wrong in at least 16 cases.
Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:

if x == "Godo":      # common typo for "Good"
    x = "Good"
elif x == "Bade":    # common typo for "Bad"
    x = "Bad"
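The same tabulate-and-fix idea can also be sketched with pandas, assuming the data sits in a tabular data set; the quality column and its values here are hypothetical.

import pandas as pd

# Hypothetical column with two valid classes plus typos we want to catch.
df = pd.DataFrame({"quality": ["Good", "Bad", "Godo", "Good", "Bade", "Bad"]})

# Tabulating the data with counts reveals classes that shouldn't exist.
print(df["quality"].value_counts())

# Fix the detected typos with a simple replacement rule.
df["quality"] = df["quality"].replace({"Godo": "Good", "Bade": "Bad"})
print(df["quality"].value_counts())

The first value_counts() call reveals the unexpected classes; the second confirms that only “Good” and “Bad” remain.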
Redundant Whitespace
• Whitespace tends to be hard to detect but causes errors just like other redundant characters would.
• Redundant whitespace causes mismatches between strings, such as “FR ” versus “FR”, which can lead to dropping the observations that couldn’t be matched.
• If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in most programming languages. They all provide string functions that will remove leading and trailing whitespace. For instance, in Python you can use the strip() function to remove leading and trailing spaces, as in the short sketch below.
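A minimal illustration in plain Python:

# " FR " and "FR" look the same to the eye but don't match as strings.
country = " FR "
print(country == "FR")          # False
print(country.strip() == "FR")  # True: leading and trailing whitespace removed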
Outliers
An outlier is an observation that seems to be distant from other observations or, more specifically, one
observation that follows a different logic or generative process than the other observations. The easiest way to
find outliers is to use a plot or a table with the minimum and maximum values.
The plot on the top shows no outliers, whereas the plot on the bottom shows possible outliers on the upper side
when a normal distribution is expected.
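As a rough sketch of the minimum/maximum check, assuming pandas and made-up measurements, with a simple distance-from-the-mean rule added on top (the 2-standard-deviation cutoff is an arbitrary illustration, not a fixed rule):

import pandas as pd

# Hypothetical measurements; 510 looks out of place when values around 50 are expected.
values = pd.Series([48, 52, 47, 51, 49, 510, 50, 53])

# The easiest check: the minimum and maximum values.
print(values.min(), values.max())

# Flag points far from the rest of the observations.
z = (values - values.mean()) / values.std()
print(values[z.abs() > 2])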
Integrating data
Your data comes from several different places, and in this substep we focus on integrating these different
sources. Data varies in size, type, and structure, ranging from databases and Excel files to text documents.
Joining Tables
• Joining tables allows you to combine the information of one observation found in one table with the information that you find in another table. The focus is on enriching a single observation.
• Let’s say that the first table contains information about the purchases of a customer and the other table contains information about the region where your customer lives.
• Joining the tables allows you to combine the information so that you can use it for your model, as shown in the figure.
To join tables, you use variables that represent the same object in both tables, such as a date, a country name,
or a Social Security number. These common fields are known as keys. When these keys also uniquely define
the records in the table they are called primary keys.
The number of resulting rows in the output table depends on the exact join type that you use.
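A minimal sketch with pandas; the tables and column names are hypothetical. Changing how="inner" to "left" or "outer" changes which rows survive the join, which is why the join type determines the number of resulting rows.

import pandas as pd

# Hypothetical tables: purchases per customer and the region each customer lives in.
purchases = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [120, 80, 200]})
regions = pd.DataFrame({"customer_id": [1, 2], "region": ["North", "South"]})

# customer_id is the key; an inner join keeps only customers present in both tables.
enriched = purchases.merge(regions, on="customer_id", how="inner")
print(enriched)          # customer 3 is dropped because it has no region record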
Appending Tables
• Appending or stacking tables is effectively adding observations from one table to another table.
• One table contains the observations from the month January and the second table contains observations from the month February. The result of appending these tables is a larger one with the observations from January as well as February.
Figure. Appending data from tables is a common operation but requires an equal structure in the tables being appended.
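A minimal pandas sketch of appending, using hypothetical January and February tables with the same structure:

import pandas as pd

# Two tables with identical columns, one per month.
january = pd.DataFrame({"customer_id": [1, 2], "amount": [120, 80]})
february = pd.DataFrame({"customer_id": [1, 3], "amount": [95, 200]})

# Appending (stacking) puts February's observations below January's.
combined = pd.concat([january, february], ignore_index=True)
print(combined)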
Transforming data
Certain models require their data to be in a certain shape, so you may need to transform your data into a form that is suitable for data modeling.
Relationships between an input variable and an output variable aren’t always linear. Take, for instance, a relationship of the form y = a·e^(bx). Taking the logarithm of y gives log(y) = log(a) + b·x, which turns the estimation into a simple linear problem. Transforming the variables in this way greatly simplifies the estimation. Other times you might want to combine two variables into a new variable.
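A short sketch of this log transformation, assuming NumPy and synthetic data generated from y = a·e^(bx):

import numpy as np

# Synthetic data from y = a * exp(b * x) with a = 2.0 and b = 0.3.
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.3 * x)

# After taking log(y), the relationship log(y) = log(a) + b*x is linear,
# so an ordinary linear fit recovers the parameters.
b_hat, log_a_hat = np.polyfit(x, np.log(y), 1)
print(np.exp(log_a_hat), b_hat)   # approximately 2.0 and 0.3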
The figure shows how reducing the number of variables makes it easier to understand the key values. It also shows how two variables account for 50.6% of the variation within the data set (component1 = 27.8% + component2 = 22.8%). These variables, called “component1” and “component2,” are both combinations of the original variables. They’re the principal components of the underlying data structure.
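A hedged sketch of principal components with scikit-learn on made-up data; the explained-variance ratios printed at the end play the same role as the 27.8% and 22.8% quoted above.

import numpy as np
from sklearn.decomposition import PCA

# Made-up data set: five variables that are partly driven by two underlying factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(100, 2))
X = np.hstack([factors, factors @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(100, 3))])

# Keep the first two principal components and inspect how much variation they explain.
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)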
• Dummy variables can take only two values: true (1) or false (0). They’re used to indicate the absence or presence of a categorical effect that may explain the observation.
• In this case you’ll make separate columns for the classes stored in one variable and indicate it with 1 if the class is present and 0 otherwise.
• An example is turning one column named Weekdays into the columns Monday through Sunday. You use an indicator to show if the observation was on a Monday; you put 1 on Monday and 0 elsewhere.
• Turning variables into dummies is a technique that’s used in modeling and is popular with, but not exclusive to, economists.
Figure. Turning variables into dummies is a data transformation that breaks a variable that has multiple classes into multiple variables, each having only two possible values: 0 or 1.
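A minimal sketch with pandas; the weekday column is hypothetical.

import pandas as pd

# Hypothetical column with the day of the week for each observation.
df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"]})

# get_dummies creates one column per class, with 1 where the class is present and 0 elsewhere.
dummies = pd.get_dummies(df["weekday"], dtype=int)
print(dummies)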
Exploratory data analysis
During exploratory data analysis you take a deep dive into the data (see the figure below). Information becomes much easier to grasp when shown in a picture; therefore, you mainly use graphical techniques to gain an understanding of your data and the interactions between variables.
The goal isn’t to cleanse the data, but it’s common that you’ll still discover anomalies you missed before,
forcing you to take a step back and fix them.
• The visualization techniques you use in this phase range from simple line graphs or histograms, as shown in the figure below, to more complex diagrams such as Sankey and network graphs.
• Sometimes it’s useful to compose a composite graph from simple graphs to get even more insight into the data. Other times the graphs can be animated or made interactive to make it easier and, let’s admit it, way more fun.
The techniques we described in this phase are mainly visual, but in practice they’re certainly not limited to
visualization techniques. Tabulation, clustering, and other modeling techniques can also be a part of exploratory
analysis. Even building simple models can be a part of this step.
Data modeling
Building a model is an iterative process. The way you build your model depends on whether you go with classic statistics or the somewhat more recent machine learning school, and on the type of technique you want to use. Either way, most models consist of the following main steps:
• Selection of a modeling technique and variables to enter into the model
• Execution of the model
• Diagnosis and model comparison
Model execution
• Once you’ve chosen a model you’ll need to implement it in code.
• Most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn. These packages implement several of the most popular modeling techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process. As you can see in the following code, it’s fairly easy to use linear regression with StatsModels or Scikit-learn.
• Doing this yourself would require much more effort even for the simple techniques. The following listing shows the execution of a linear prediction model.
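A minimal sketch of such a linear prediction model with StatsModels, using hypothetical price and order-size data:

import numpy as np
import statsmodels.api as sm

# Hypothetical data: order size grows with price, plus some noise.
rng = np.random.default_rng(1)
price = rng.uniform(1, 10, size=100)
size = 3 * price + rng.normal(scale=2, size=100)

# Ordinary least squares with an intercept term.
X = sm.add_constant(price)
results = sm.OLS(size, X).fit()
print(results.params)     # intercept and slope, close to 0 and 3
print(results.summary())  # full regression output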
Mean square error is a simple measure: check for every prediction how far it was from the truth, square this error, and take the mean of these squared errors over all predictions, i.e. MSE = (1/n) * Σ (y_i − ŷ_i)^2, where y_i is the true value and ŷ_i the prediction.
The figure above compares the performance of two models for predicting the order size from the price. The first model is size = 3 * price and the second model is size = 10.
• To estimate the models, we use 800 randomly chosen observations out of 1,000 (or 80%), without showing the other 20% of the data to the model.
• Once the model is trained, we predict the values for the remaining 20% of observations and compare the predictions with the known true values using an error measure.
• Then we choose the model with the lowest error. In this example we choose model 1 because it has the lowest total error. A minimal sketch of this train/test comparison follows below.
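The 80/20 split and error comparison can be sketched as follows, assuming scikit-learn; the data and both models are the hypothetical ones from the example above.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical data set of 1,000 observations where size is roughly 3 * price.
rng = np.random.default_rng(2)
price = rng.uniform(1, 10, size=1000)
size = 3 * price + rng.normal(scale=2, size=1000)

# Hold out 20% of the observations; the models never see them during estimation.
price_train, price_test, size_train, size_test = train_test_split(
    price, size, test_size=0.2, random_state=0)

# Model 1: size = 3 * price; model 2: size = 10 (a constant guess).
error_model1 = mean_squared_error(size_test, 3 * price_test)
error_model2 = mean_squared_error(size_test, np.full_like(size_test, 10.0))
print(error_model1, error_model2)   # model 1 has the lower error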
Many models make strong assumptions, such as independence of the inputs, and you have to verify that these
assumptions are indeed met. This is called model diagnostics.
Presentation and automation
• Sometimes people get so excited about your work that you’ll need to repeat it over and over again because they value the predictions of your models or the insights that you produced.
• This doesn’t always mean that you have to redo all of your analysis all the time. Sometimes it’s sufficient that you implement only the model scoring; other times you might build an application that automatically updates reports, Excel spreadsheets, or PowerPoint presentations. The last stage of the data science process is where your soft skills will be most useful, and yes, they’re extremely important.
Data mining
Data mining is the process of discovering actionable information from large sets of data. Data mining uses
mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be
discovered by traditional data exploration because the relationships are too complex or because there is too
much data.
These patterns and trends can be collected and defined as a data mining model. Mining models can be applied
to specific scenarios, such as:
• Forecasting: Estimating sales, predicting server loads or server downtime
• Risk and probability: Choosing the best customers for targeted mailings, determining the probable break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes
• Recommendations: Determining which products are likely to be sold together, generating recommendations
• Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
• Grouping: Separating customers or events into clusters of related items, analyzing and predicting affinities
Building a mining model is part of a larger process that includes everything from asking questions about the
data and creating a model to answer those questions, to deploying the model into a working environment. This
process can be defined by using the following six basic steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models
The following diagram describes the relationships between each step in the process, and the technologies in
Microsoft SQL Server that you can use to complete each step.
Defining the Problem
The first step in the data mining process is to clearly define the problem, and consider ways that data can be utilized to provide an answer to the problem.
This step includes analyzing business requirements, defining the scope of the problem, defining the metrics by
which the model will be evaluated, and defining specific objectives for the data mining project. These tasks
translate into questions such as the following:
• What are you looking for? What types of relationships are you trying to find?
• Does the problem you are trying to solve reflect the policies or processes of the business?
• Do you want to make predictions from the data mining model, or just look for interesting patterns and associations?
• Which outcome or attribute do you want to try to predict?
• What kind of data do you have and what kind of information is in each column? If there are multiple tables, how are the tables related? Do you need to perform any cleansing, aggregation, or processing to make the data usable?
• How is the data distributed? Is the data seasonal? Does the data accurately represent the processes of the business?
Preparing Data
• The second step in the data mining process is to consolidate and clean the data that was identified in the Defining the Problem step.
• Data can be scattered across a company and stored in different formats, or may contain inconsistencies such as incorrect or missing entries.
• Data cleaning is not just about removing bad data or interpolating missing values, but about finding hidden correlations in the data, identifying sources of data that are the most accurate, and determining which columns are the most appropriate for use in analysis.
Exploring Data
Exploration techniques include calculating the minimum and maximum values, calculating mean and standard
deviations, and looking at the distribution of the data. For example, you might determine by reviewing the
maximum, minimum, and mean values that the data is not representative of your customers or business
processes, and that you therefore must obtain more balanced data or review the assumptions that are the basis
for your expectations. Standard deviations and other distribution values can provide useful information about
the stability and accuracy of the results.
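A small pandas sketch of these summary statistics; the order amounts are made up for illustration.

import pandas as pd

# Hypothetical order amounts; one value is clearly out of line with the rest.
orders = pd.Series([12.5, 14.0, 13.2, 250.0, 12.9, 13.5])

# describe() reports count, mean, standard deviation, min, max, and quartiles,
# which is often enough to judge whether the data looks plausible.
print(orders.describe())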
Building Models
In this step you define a mining structure, which specifies the columns of source data to use, and then build one or more mining models on top of it. The mining structure is linked to the source of data, but does not actually contain any data until you process it. When you process the mining structure, SQL Server Analysis Services generates aggregates and other statistical information that can be used for analysis. This information can be used by any mining model that is based on the structure.
Data warehousing
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed
by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad
hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data
consolidations.
Although a data warehouse and a traditional database share some similarities, they are not the same concept. The main difference is that in a database, data is collected for multiple transactional purposes. In a data warehouse, however, data is collected on an extensive scale to perform analytics. Databases provide real-time data, while warehouses store data to be accessed for big analytical queries.
Data warehousing integrates data and information collected from various sources into one comprehensive database. For example, a data warehouse might combine customer information from an organization’s point-of-sale systems, its mailing lists, website, and comment cards. It might also incorporate confidential information about employees, salary information, etc. Businesses use these components of a data warehouse to analyze customers.
Data mining is one of the features of a data warehouse that involves looking for meaningful data patterns in
vast volumes of data and devising innovative strategies for increased sales and profits.
Data Mart
A data mart is a subset of a data warehouse built to serve a particular department, region, or business unit. Every department of a business has a central repository or data mart to store data. The data from the data mart is stored in the operational data store (ODS) periodically. The ODS then sends the data to the enterprise data warehouse (EDW), where it is stored and used.
Summary
In this chapter you learned the data science process consists of six steps:
• Setting the research goal—Defining the what, the why, and the how of your project in a project charter.
• Retrieving data—Finding and getting access to the data needed in your project. This data is either found within the company or retrieved from a third party.
• Data preparation—Checking and remediating data errors, enriching the data with data from other data sources, and transforming it into a suitable format for your models.
• Data exploration—Diving deeper into your data using descriptive statistics and visual techniques.
• Data modeling—Using machine learning and statistical techniques to achieve your project goal.
• Presentation and automation—Presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.