CS3352 Unit I

Foundations of Data Science (Anna University)

UNIT I – INTRODUCTION


FOUNDATIONS OF DATA SCIENCE
INTRODUCTION
Data Science: Benefits and uses – Facets of data – Data Science Process: Overview – Defining research goals –
Retrieving data – Data preparation – Exploratory data analysis – Build the model – Presenting findings and
building applications – Data Mining – Data Warehousing – Basic Statistical Descriptions of Data

Data
In computing, data is information that has been translated into a form that is efficient for movement or
processing.

Data Science
Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data
produced today. It adds methods from computer science to the repertoire of statistics.

Benefits and uses of data science


Data science and big data are used almost everywhere in both commercial and noncommercial settings.
• Commercial companies in almost every industry use data science and big data to gain insights into their
customers, processes, staff, competition, and products.
• Many companies use data science to offer customers a better user experience, as well as to cross-sell,
up-sell, and personalize their offerings.
• Governmental organizations are also aware of data's value. Many governmental organizations not only
rely on internal data scientists to discover valuable information, but also share their data with the public.
• Nongovernmental organizations (NGOs) use it to raise money and defend their causes.
• Universities use data science in their research but also to enhance the study experience of their students.
The rise of massive open online courses (MOOCs) produces a lot of data, which allows universities to
study how this type of learning can complement traditional classes.

Facets of data
In data science and big data you’ll come across many different types of data, and each of them tends to require
different tools and techniques. The main categories of data are these:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
Let’s explore all these interesting data types.

Structured data
• Structured data is data that depends on a data model and resides in a fixed field within a record. As such,
it's often easy to store structured data in tables within databases or Excel files.
• SQL, or Structured Query Language, is the preferred way to manage and query data that resides in
databases.


Unstructured data
Unstructured data is data that isn't easy to fit into a data model because the content is context-specific or
varying. One example of unstructured data is your regular email.

Natural language
• Natural language is a special type of unstructured data; it's challenging to process because it requires
knowledge of specific data science techniques and linguistics.


• The natural language processing community has had success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis, but models trained in one domain don't
generalize well to other domains.
• Even state-of-the-art techniques aren't able to decipher the meaning of every piece of text.

Machine-generated data
• Machine-generated data is information that's automatically created by a computer, process,
application, or other machine without human intervention.
• Machine-generated data is becoming a major data resource and will continue to do so.
• The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs, and telemetry.

Graph-based or network data


• "Graph data" can be a confusing term because any data can be shown in a graph.
• Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
• The graph structures use nodes, edges, and properties to represent and store graphical data.
• Graph-based data is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path between two people
(see the sketch below).
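
A minimal sketch of such calculations, assuming the networkx library and a small, made-up friendship
network (neither is mentioned in the original text):

    import networkx as nx

    # Build a small, hypothetical social network
    G = nx.Graph()
    G.add_edges_from([("Ann", "Bob"), ("Bob", "Carl"), ("Carl", "Dana"),
                      ("Ann", "Dana"), ("Dana", "Eve")])

    # Influence of a person, approximated here by degree centrality
    print(nx.degree_centrality(G))

    # Shortest path between two people
    print(nx.shortest_path(G, source="Ann", target="Eve"))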

Audio, image, and video


• Audio, image, and video are data types that pose specific challenges to a data scientist.
• Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for
computers.
• MLBAM (Major League Baseball Advanced Media) announced in 2014 that it would increase video
capture to approximately 7 TB per game for the purpose of live, in-game analytics.
• Recently a company called DeepMind succeeded at creating an algorithm that's capable of learning
how to play video games.
• This algorithm takes the video screen as input and learns to interpret everything via a complex process
of deep learning.

Streaming data
• The data flows into the system when an event happens instead of being loaded into a data store in a
batch.
• Examples are the "What's trending" on Twitter, live sporting or music events, and the stock market.

Data Science Process


Overview of the data science process
The typical data science process consists of six steps through which you'll iterate:

1. The first step of this process is setting a research goal. The main purpose here is making sure all the
stakeholders understand the what, how, and why of the project. In every serious project this will result
in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this step includes
finding suitable data and getting access to the data from the data owner. The result is data in its raw
form, which probably needs polishing and transformation before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data from a raw
form into data that’s directly usable in your models. To achieve this, you’ll detect and correct different
kinds of errors in the data, combine data from different data sources, and transform it. If you have
successfully completed this step, you can progress to data visualization and modeling.


4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data.
You’ll look for patterns, correlations, and deviations based on visual and descriptive techniques. The
insights you gain from this phase will enable you to start modeling.
5. Finally, we get to model building (often referred to as “data modeling” throughout this book). It is now
that you attempt to gain the insights or make the predictions stated in your project charter. Now is the
time to bring out the heavy guns, but remember research has taught us that often (but not always) a
combination of simple models tends to outperform one complicated model. If you’ve done this phase
right, you’re almost done.
6. The last step of the data science model is presenting your results and automating the analysis, if needed.
One goal of a project is to change a process and/or make better decisions. You may still need to convince
the business that your findings will indeed change the business process as expected. This is where you
can shine in your influencer role. The importance of this step is more apparent in projects on a strategic
and tactical level. Certain projects require you to perform the business process over and over again, so
automating the project will save time.

Defining research goals


A project starts by understanding the what, the why, and the how of your project. The outcome should be a clear
research goal, a good understanding of the context, well-defined deliverables, and a plan of action with a
timetable. This information is then best placed in a project charter.

Spend time understanding the goals and context of your research


• An essential outcome is the research goal that states the purpose of your assignment in a clear and
focused manner.
• Understanding the business goals and context is critical for project success.
• Continue asking questions and devising examples until you grasp the exact business expectations,
identify how your project fits in the bigger picture, appreciate how your research is going to change the
business, and understand how they'll use your results.

Create a project charter


A project charter requires teamwork, and your input covers at least the following:
• A clear research goal
• The project mission and context
• How you're going to perform your analysis
• What resources you expect to use
• Proof that it's an achievable project, or proofs of concept
• Deliverables and a measure of success
• A timeline

Retrieving data
• The next step in data science is to retrieve the required data. Sometimes you need to go into the field
and design a data collection process yourself, but most of the time you won't be involved in this step.
• Many companies will have already collected and stored the data for you, and what they don't have can
often be bought from third parties.
• More and more organizations are making even high-quality data freely available for public and
commercial use.
• Data can be stored in many forms, ranging from simple text files to tables in a database. The objective
now is acquiring all the data you need.

Start with data stored within the company (Internal data)


• Most companies have a program for maintaining key data, so much of the cleaning work may already
be done. This data can be stored in official data repositories such as databases, data marts, data
warehouses, and data lakes maintained by a team of IT professionals.
• Data warehouses and data marts are home to preprocessed data, whereas data lakes contain data in its
natural or raw format.
• Finding data even within your own company can sometimes be a challenge. As companies grow, their
data becomes scattered around many places. The data may be dispersed as people change positions and
leave the company.
• Getting access to data is another difficult task. Organizations understand the value and sensitivity of
data and often have policies in place so everyone has access to what they need and nothing more.
• These policies translate into physical and digital barriers called Chinese walls. These "walls" are
mandatory and well regulated for customer data in most countries.

External Data
• If data isn't available inside your organization, look outside your organization. Companies provide
data so that you, in turn, can enrich their services and ecosystem. Such is the case with Twitter,
LinkedIn, and Facebook.
• More and more governments and organizations share their data for free with the world.
• Many open data providers are available that can get you started.

Data Preparation (Cleansing, Integrating, Transforming Data)


Your model needs the data in a specific format, so data transformation will always come into play. It’s a good
habit to correct data errors as early on in the process as possible. However, this isn’t always possible in a realistic
setting, so you’ll need to take corrective actions in your program.

Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so
your data becomes a true and consistent representation of the processes it originates from.
• The first type of error is the interpretation error, such as when you take a value in your data for granted,
like saying that a person's age is greater than 300 years.
• The second type of error points to inconsistencies between data sources or against your company's
standardized values. An example of this class of errors is putting "Female" in one table and "F" in
another when they represent the same thing: that the person is female.

Overview of common errors


Sometimes you'll use more advanced methods, such as simple modeling, to find and identify data errors;
diagnostic plots can be especially insightful. For example, you can use a measure to identify data points
that seem out of place: fit a regression to get acquainted with the data and detect the influence of individual
observations on the regression line.

Data Entry Errors


• Data collection and data entry are error-prone processes. They often require human intervention, and
humans introduce errors into the chain through typos or lapses of concentration.


• Data collected by machines or computers isn't free from errors either. Some errors arise from human
sloppiness, whereas others are due to machine or hardware failure.
• Detecting data errors when the variables you study don't have many classes can be done by tabulating
the data with counts.
• When you have a variable that can take only two values, "Good" and "Bad", you can create a frequency
table and see if those are truly the only two values present. In such a table, the values "Godo" and "Bade"
point out that something went wrong in at least 16 cases.

Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:

    if x == "Godo":
        x = "Good"
    if x == "Bade":
        x = "Bad"
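
A minimal pandas sketch of the tabulate-and-fix approach described above; the DataFrame and the column
name status are illustrative assumptions, not part of the original text:

    import pandas as pd

    # Hypothetical data containing the misspellings "Godo" and "Bade"
    df = pd.DataFrame({"status": ["Good", "Bad", "Godo", "Good", "Bade", "Bad"]})

    # Frequency table: reveals the unexpected values next to "Good" and "Bad"
    print(df["status"].value_counts())

    # The same if-then-else fix, expressed as a vectorized mapping
    df["status"] = df["status"].replace({"Godo": "Good", "Bade": "Bad"})
    print(df["status"].value_counts())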

Redundant Whitespace
• Whitespace tends to be hard to detect but causes errors like other redundant characters would.
• A redundant whitespace causes a mismatch between strings such as "FR " and "FR", which can lead to
dropping observations that couldn't be matched.
• If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in most
programming languages. They all provide string functions that will remove leading and trailing
whitespace. For instance, in Python you can use the strip() function to remove leading and trailing
spaces, as sketched below.
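
A minimal sketch; the pandas usage and the column name country_code are illustrative assumptions, not
from the original text:

    import pandas as pd

    # Remove leading and trailing whitespace from a plain Python string
    code = "FR "
    print(code.strip())        # "FR"

    # The same fix applied to a whole column of a DataFrame
    df = pd.DataFrame({"country_code": ["FR ", " DE", "NL"]})
    df["country_code"] = df["country_code"].str.strip()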

Fixing Capital Letter Mismatches


Capital letter mismatches are common. Most programming languages make a distinction between "Brazil"
and "brazil". In this case you can solve the problem by applying a function that returns both strings in
lowercase, such as .lower() in Python: "Brazil".lower() == "brazil".lower() should result in True.

Impossible Values and Sanity Checks


Here you check the value against physically or theoretically impossible values such as people taller than 3
meters or someone with an age of 299 years. Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120

Outliers
An outlier is an observation that seems to be distant from other observations or, more specifically, one
observation that follows a different logic or generative process than the other observations. The easiest way to
find outliers is to use a plot or a table with the minimum and maximum values.
For example, a distribution plot may show no outliers at all, or it may show possible outliers on the upper
side when a normal distribution is expected (see the sketch below).
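
A minimal sketch of spotting an outlier with a minimum/maximum table and a histogram; the data, and the
use of numpy, pandas, and matplotlib, are illustrative assumptions:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    values = pd.Series(rng.normal(loc=50, scale=5, size=1000))
    values.iloc[0] = 250                      # one implausible observation

    # A table with the minimum and maximum quickly exposes the outlier
    print(values.describe()[["min", "max"]])

    # A plot of the distribution makes it visible as well (requires matplotlib)
    values.plot(kind="hist", bins=50)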


Dealing with Missing Values


Missing values aren't necessarily wrong, but you still need to handle them separately; certain modeling
techniques can't handle missing values. They might be an indicator that something went wrong in your data
collection or that an error happened in the ETL process. Common techniques data scientists use are omitting
the observations, setting a default (null) value, or imputing a value such as the mean or an estimate from a
model, as sketched below.
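
A minimal sketch of two common options, omission and mean imputation; pandas and the column name age
are illustrative assumptions:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 31, np.nan, 40]})   # hypothetical data

    # Option 1: omit the observations with missing values
    complete = df.dropna(subset=["age"])

    # Option 2: impute a value, for example the mean of the observed ages
    df["age"] = df["age"].fillna(df["age"].mean())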

Integrating data
Your data comes from several different places, and in this substep we focus on integrating these different
sources. Data varies in size, type, and structure, ranging from databases and Excel files to text documents.

The Different Ways of Combining Data


You can perform two operations to combine information from different data sets.
• Joining
• Appending or stacking

Joining Tables
• Joining tables allows you to combine the information of one observation found in one table with the
information that you find in another table. The focus is on enriching a single observation.
• Let's say that the first table contains information about the purchases of a customer and the other table
contains information about the region where your customer lives.
• Joining the tables allows you to combine the information so that you can use it for your model, as
shown in the figure.

Figure. Joining two tables on the item and region key



To join tables, you use variables that represent the same object in both tables, such as a date, a country name,
or a Social Security number. These common fields are known as keys. When these keys also uniquely define
the records in the table they are called primary keys.
The number of resulting rows in the output table depends on the exact join type that you use; a small sketch
follows.
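
A minimal sketch of such a join with pandas; the tables, key, and column names are illustrative assumptions,
not from the original text:

    import pandas as pd

    purchases = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [120, 80, 200]})
    regions   = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["North", "South", "East"]})

    # Join on the common key so every purchase is enriched with the customer's region
    enriched = pd.merge(purchases, regions, on="customer_id", how="inner")
    print(enriched)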

Appending Tables
• Appending or stacking tables is effectively adding observations from one table to another table.
• One table contains the observations from the month of January and the second table contains observations
from the month of February. The result of appending these tables is a larger one with the observations from
January as well as February.

Figure. Appending data from tables is a common operation but requires an equal structure in the tables being
appended.
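
A minimal sketch of appending two months of observations with pandas; the tables and column names are
illustrative assumptions:

    import pandas as pd

    january  = pd.DataFrame({"customer_id": [1, 2], "amount": [120, 80]})
    february = pd.DataFrame({"customer_id": [3, 4], "amount": [200, 60]})

    # Stacking requires an equal column structure in both tables
    observations = pd.concat([january, february], ignore_index=True)
    print(observations)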

Transforming data

Certain models require their data to be in a certain shape, so you transform your data so that it takes a
suitable form for data modeling.

Relationships between an input variable and an output variable aren't always linear. Take, for instance, a
relationship of the form y = ae^(bx). Taking the logarithm turns this into a linear relationship,
log(y) = log(a) + bx, which simplifies the estimation problem dramatically (see the sketch below). Other
times you might want to combine two variables into a new variable.
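
A minimal sketch of the log transform, using numpy on synthetic data (the data and parameter values are
illustrative assumptions):

    import numpy as np

    # Hypothetical data generated from y = a * exp(b * x)
    a, b = 2.0, 0.5
    x = np.linspace(0.1, 10, 50)
    y = a * np.exp(b * x)

    # Taking the log turns the relationship into a straight line: log(y) = log(a) + b*x
    slope, intercept = np.polyfit(x, np.log(y), deg=1)
    print(slope, np.exp(intercept))   # recovers approximately b and a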


Reducing the Number of Variables


• Having too many variables in your model makes the model difficult to handle, and certain techniques
don't perform well when you overload them with too many input variables. For instance, all the
techniques based on a Euclidean distance perform well only up to 10 variables.
• Data scientists use special methods to reduce the number of variables but retain the maximum amount
of data.


The figure shows how reducing the number of variables makes it easier to understand the key values. It also
shows how two variables account for 50.6% of the variation within the data set (component1 = 27.8% +
component2 = 22.8%). These variables, called "component1" and "component2", are both combinations of
the original variables. They're the principal components of the underlying data structure; a small sketch
follows.
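
A minimal sketch of principal component analysis with scikit-learn; the random data set is an illustrative
assumption, so the explained-variance percentages will differ from the 27.8% and 22.8% in the figure:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))          # hypothetical data set with 10 variables

    pca = PCA(n_components=2)
    components = pca.fit_transform(X)       # "component1" and "component2"

    # Fraction of the variation in the data captured by each principal component
    print(pca.explained_variance_ratio_)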

Turning Variables into Dummies

• Dummy variables can only take two values: true (1) or false (0). They're used to indicate the absence or
presence of a categorical effect that may explain the observation.
• In this case you'll make separate columns for the classes stored in one variable and indicate it with 1 if
the class is present and 0 otherwise.
• An example is turning one column named Weekdays into the columns Monday through Sunday. You
use an indicator to show if the observation was on a Monday; you put 1 on Monday and 0 elsewhere.
• Turning variables into dummies is a technique that's used in modeling and is popular with, but not
exclusive to, economists.

Figure. Turning variables into dummies is a data transformation that breaks a variable that has multiple
classes into multiple variables, each having only two possible values: 0 or 1
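
A minimal sketch using pandas get_dummies; the weekday column is an illustrative assumption:

    import pandas as pd

    df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"]})

    # Each class becomes its own 0/1 column
    dummies = pd.get_dummies(df["weekday"], dtype=int)
    print(dummies)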

Exploratory data analysis

During exploratory data analysis you take a deep dive into the data. Information becomes much easier to
grasp when shown in a picture; therefore you mainly use graphical techniques to gain an understanding of
your data and the interactions between variables.
The goal isn't to cleanse the data, but it's common that you'll still discover anomalies you missed before,
forcing you to take a step back and fix them.


• The visualization techniques you use in this phase range from simple line graphs or histograms to more
complex diagrams such as Sankey and network graphs.
• Sometimes it's useful to compose a composite graph from simple graphs to get even more insight into
the data. Other times the graphs can be animated or made interactive to make it easier and, let's admit
it, way more fun.

The techniques we described in this phase are mainly visual, but in practice they're certainly not limited to
visualization techniques. Tabulation, clustering, and other modeling techniques can also be a part of
exploratory analysis. Even building simple models can be a part of this step; a small histogram sketch follows.
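
A minimal sketch of a simple exploratory histogram; the data and the use of numpy, pandas, and matplotlib
are illustrative assumptions:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    df = pd.DataFrame({"price": rng.gamma(shape=2.0, scale=10.0, size=500)})

    # A histogram is often enough to see the shape and spread of a variable
    df["price"].hist(bins=30)
    plt.xlabel("price")
    plt.ylabel("frequency")
    plt.show()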

Build the models


• With clean data in place and a good understanding of the content, you're ready to build models with the
goal of making better predictions, classifying objects, or gaining an understanding of the system that
you're modeling.
• This phase is much more focused than the exploratory analysis step, because you know what you're
looking for and what you want the outcome to be.

Building a model is an iterative process. The way you build your model depends on whether you go with classic
statistics or the somewhat more recent machine learning school, and the type of technique you want to use.
Either way, most models consist of the following main steps:
• Selection of a modeling technique and variables to enter in the model
• Execution of the model
• Diagnosis and model comparison

Model and variable selection


You’ll need to select the variables you want to include in your model and a modeling technique. You’ll need
to consider model performance and whether your project meets all the requirements to use your model, as well
as other factors:
• Must the model be moved to a production environment and, if so, would it be easy to implement?
• How difficult is the maintenance on the model: how long will it remain relevant if left untouched?
• Does the model need to be easy to explain?

Model execution
• Once you've chosen a model you'll need to implement it in code.


• Most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn.
These packages use several of the most popular techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the
process. As you can see in the following code, it's fairly easy to use linear regression with StatsModels
or Scikit-learn.
• Doing this yourself would require much more effort even for the simple techniques. The following
listing sketches the execution of a linear prediction model.
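
A minimal sketch of such a linear prediction model with StatsModels and Scikit-learn; the synthetic data is
an illustrative assumption, not the book's original listing:

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    # Hypothetical data: the target y depends linearly on two predictors
    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 2))
    y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)

    # Ordinary least squares with StatsModels (a constant term is added explicitly)
    results = sm.OLS(y, sm.add_constant(X)).fit()
    print(results.summary())

    # The same model with Scikit-learn
    reg = LinearRegression().fit(X, y)
    print(reg.coef_, reg.intercept_)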

Model diagnostics and model comparison


• You'll be building multiple models from which you then choose the best one based on multiple
criteria. Working with a holdout sample helps you pick the best-performing model.
• A holdout sample is a part of the data you leave out of the model building so it can be used to evaluate
the model afterward.
• The principle here is simple: the model should work on unseen data. You use only a fraction of your
data to estimate the model and the other part, the holdout sample, is kept out of the equation.
• The model is then unleashed on the unseen data and error measures are calculated to evaluate it.
• Multiple error measures are available; the error measure used in the example below is the mean square
error.

Formula for mean square error:

    MSE = (1/n) * Σ (yᵢ − ŷᵢ)²

Mean square error is a simple measure: for every prediction, check how far it was from the truth, square this
error, and take the mean of the squared errors over all predictions.


As an example, compare the performance of two models to predict the order size from the price: the first
model is size = 3 * price and the second model is size = 10.
• To estimate the models, we use 800 randomly chosen observations out of 1,000 (or 80%), without
showing the other 20% of data to the model.
• Once the model is trained, we predict the values for the other 20% of the observations, for which we
already know the true value, and calculate the model error with an error measure.
• Then we choose the model with the lowest error. In this example we choose model 1 because it has the
lowest total error. A small sketch of this holdout comparison follows.
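
A minimal sketch of this holdout comparison with numpy; the synthetic price/size data mirrors the example
above but is itself an illustrative assumption:

    import numpy as np

    rng = np.random.default_rng(7)
    price = rng.uniform(1, 10, size=1000)                 # 1,000 hypothetical observations
    size = 3 * price + rng.normal(scale=1.0, size=1000)

    # Hold out 20% of the observations for evaluation
    idx = rng.permutation(1000)
    train, test = idx[:800], idx[800:]

    # Model 1: size = 3 * price; Model 2: size = 10 (the two models from the example)
    pred1 = 3 * price[test]
    pred2 = np.full(len(test), 10.0)

    mse1 = np.mean((size[test] - pred1) ** 2)
    mse2 = np.mean((size[test] - pred2) ** 2)
    print(mse1, mse2)   # choose the model with the lowest error (model 1 here)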

Many models make strong assumptions, such as independence of the inputs, and you have to verify that these
assumptions are indeed met. This is called model diagnostics.

Presenting findings and building applications

• Sometimes people get so excited about your work that you'll need to repeat it over and over again
because they value the predictions of your models or the insights that you produced.
• This doesn't always mean that you have to redo all of your analysis all the time. Sometimes it's sufficient
that you implement only the model scoring; other times you might build an application that automatically
updates reports, Excel spreadsheets, or PowerPoint presentations. The last stage of the data science
process is where your soft skills will be most useful, and yes, they're extremely important.

Data mining
Data mining is the process of discovering actionable information from large sets of data. Data mining uses
mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be
discovered by traditional data exploration because the relationships are too complex or because there is too
much data.
These patterns and trends can be collected and defined as a data mining model. Mining models can be applied
to specific scenarios, such as:
• Forecasting: Estimating sales, predicting server loads or server downtime

• Risk and probability: Choosing the best customers for targeted mailings, determining the probable
break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes
• Recommendations: Determining which products are likely to be sold together, generating
recommendations
• Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
• Grouping: Separating customers or events into clusters of related items, analyzing and predicting
affinities

Building a mining model is part of a larger process that includes everything from asking questions about the
data and creating a model to answer those questions, to deploying the model into a working environment. This
process can be defined by using the following six basic steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models

The following diagram describes the relationships between each step in the process, and the technologies in
Microsoft SQL Server that you can use to complete each step.

Defining the Problem

The first step in the data mining process is to clearly define the problem, and consider ways that data can be
utilized to provide an answer to the problem.

This step includes analyzing business requirements, defining the scope of the problem, defining the metrics by
which the model will be evaluated, and defining specific objectives for the data mining project. These tasks
translate into questions such as the following:
• What are you looking for? What types of relationships are you trying to find?
• Does the problem you are trying to solve reflect the policies or processes of the business?
• Do you want to make predictions from the data mining model, or just look for interesting patterns and
associations?
• Which outcome or attribute do you want to try to predict?


• What kind of data do you have and what kind of information is in each column? If there are multiple
tables, how are the tables related? Do you need to perform any cleansing, aggregation, or processing to
make the data usable?
• How is the data distributed? Is the data seasonal? Does the data accurately represent the processes of the
business?

Preparing Data
• The second step in the data mining process is to consolidate and clean the data that was identified in the
Defining the Problem step.
• Data can be scattered across a company and stored in different formats, or may contain inconsistencies
such as incorrect or missing entries.
• Data cleaning is not just about removing bad data or interpolating missing values, but about finding
hidden correlations in the data, identifying sources of data that are the most accurate, and determining
which columns are the most appropriate for use in analysis.

Exploring Data
Exploration techniques include calculating the minimum and maximum values, calculating mean and standard
deviations, and looking at the distribution of the data. For example, you might determine by reviewing the
maximum, minimum, and mean values that the data is not representative of your customers or business
processes, and that you therefore must obtain more balanced data or review the assumptions that are the basis
for your expectations. Standard deviations and other distribution values can provide useful information about
the stability and accuracy of the results.

Building Models
The mining structure is linked to the source of data, but does not actually contain any data until you process it.
When you process the mining structure, SQL Server Analysis Services generates aggregates and other
statistical information that can be used for analysis. This information can be used by any mining model that is
based on the structure.

Exploring and Validating Models


Before you deploy a model into a production environment, you will want to test how well the model performs.
Also, when you build a model, you typically create multiple models with different configurations and test all
models to see which yields the best results for your problem and your data.

Deploying and Updating Models


After the mining models exist in a production environment, you can perform many tasks, depending on your
needs. The following are some of the tasks you can perform:
• Use the models to create predictions, which you can then use to make business decisions.
• Create content queries to retrieve statistics, rules, or formulas from the model.
• Embed data mining functionality directly into an application. You can include Analysis Management
Objects (AMO), which contains a set of objects that your application can use to create, alter, process,
and delete mining structures and mining models.
• Use Integration Services to create a package in which a mining model is used to intelligently separate
incoming data into multiple tables.
• Create a report that lets users directly query against an existing mining model.
• Update the models after review and analysis. Any update requires that you reprocess the models.
• Update the models dynamically as more data comes into the organization; making constant changes to
improve the effectiveness of the solution should be part of the deployment strategy.


Data warehousing
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed
by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad
hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data
consolidations.

Characteristics of data warehouse


The main characteristics of a data warehouse are as follows:
• Subject-Oriented
A data warehouse is subject-oriented since it provides topic-wise information rather than the overall
processes of a business. Such subjects may be sales, promotion, inventory, etc.
• Integrated
A data warehouse is developed by integrating data from varied sources into a consistent format. The
data must be stored in the warehouse in a consistent and universally acceptable manner in terms of
naming, format, and coding. This facilitates effective data analysis.
• Non-Volatile
Data once entered into a data warehouse must remain unchanged. All data is read-only. Previous
data is not erased when current data is entered. This helps you to analyze what has happened and when.
• Time-Variant
The data stored in a data warehouse is documented with an element of time, either explicitly or
implicitly. An example of time variance in a data warehouse is exhibited in the primary key, which must
have an element of time like the day, week, or month.

Database vs. Data Warehouse

Although a data warehouse and a traditional database share some similarities, they are not the same thing.
The main difference is that in a database, data is collected for multiple transactional purposes, whereas in a
data warehouse, data is collected on an extensive scale to perform analytics. Databases provide real-time data,
while warehouses store data to be accessed for big analytical queries.

Data Warehouse Architecture


Usually, data warehouse architecture comprises a three-tier structure.
Bottom Tier
The bottom tier or data warehouse server usually represents a relational database system. Back-end tools are
used to cleanse, transform and feed data into this layer.
Middle Tier
The middle tier represents an OLAP server that can be implemented in two ways.
The ROLAP or Relational OLAP model is an extended relational database management system that maps
operations on multidimensional data to standard relational operations.
The MOLAP or Multidimensional OLAP model directly operates on multidimensional data and operations.
Top Tier
This is the front-end client interface that gets data out from the data warehouse. It holds various tools like
query tools, analysis tools, reporting tools, and data mining tools.

How Data Warehouse Works

Data Warehousing integrates data and information collected from various sources into one comprehensive
database. For example, a data warehouse might combine customer information from an organization’s point-
of-sale systems, its mailing lists, website, and comment cards. It might also incorporate confidential
information about employees, salary information, and so on. Businesses use such components of a data
warehouse to analyze customers.

Data mining is one of the features of a data warehouse that involves looking for meaningful data patterns in
vast volumes of data and devising innovative strategies for increased sales and profits.

Types of Data Warehouse


There are three main types of data warehouse.

Enterprise Data Warehouse (EDW)


This type of warehouse serves as a key or central database that facilitates decision-support services throughout
the enterprise. The advantage to this type of warehouse is that it provides access to cross-organizational
information, offers a unified approach to data representation, and allows running complex queries.

Operational Data Store (ODS)


This type of data warehouse refreshes in real-time. It is often preferred for routine activities like storing
employee records. It is required when data warehouse systems do not support reporting needs of the business.

Data Mart
A data mart is a subset of a data warehouse built to maintain a particular department, region, or business unit.
Every department of a business has a central repository or data mart to store data. The data from the data mart
is stored in the ODS periodically. The ODS then sends the data to the EDW, where it is stored and used.

Summary
In this chapter you learned the data science process consists of six steps:
• Setting the research goal – Defining the what, the why, and the how of your project in a project
charter.
• Retrieving data – Finding and getting access to the data needed in your project. This data is either found
within the company or retrieved from a third party.
• Data preparation – Checking and remediating data errors, enriching the data with data from other data
sources, and transforming it into a suitable format for your models.
• Data exploration – Diving deeper into your data using descriptive statistics and visual techniques.
• Data modeling – Using machine learning and statistical techniques to achieve your project goal.
• Presentation and automation – Presenting your results to the stakeholders and industrializing your
analysis process for repetitive reuse and integration with other tools.
