What Is a Data Warehouse?

A data warehouse architecture defines the overall data communication, processing, and presentation architecture for end users within an enterprise. Common architectures include the basic data warehouse, the warehouse with a staging area, and the warehouse with a staging area and data marts. This document also explains data scales, types of data collection methods, the steps of data processing, data marts, data lakes, and knowledge discovery in databases with its advantages and disadvantages.

What is a data warehouse? Draw a figure and explain its architecture.

A data warehouse architecture defines the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.

Three common architectures are:

o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts

The main components of these architectures are:

o Operational System: An operational system is the system used in data warehousing to process the day-to-day transactions of an organization.
o Flat Files: A flat file system is a system of files in which transactional data is stored, and every file in the system must have a different name.
o Metadata: A set of data that defines and gives information about other data.
o Staging Area: The data warehouse staging area is a temporary location where records from the source systems are copied.

Properties of Data Warehouse Architectures

1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be easy to upgrade as the volume of data that has to be managed and processed, and the number of user requirements that have to be met, progressively increase.
3. Extensibility: The architecture should be able to accommodate new operations and technologies without redesigning the whole system.
4. Security: Monitoring access is necessary because of the strategic data stored in the data warehouse.
5. Administerability: Data warehouse management should not be complicated.

(Figure: Types of Data Warehouse Architectures)
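
To make the flow concrete, here is a minimal Python sketch of the "with staging area and data marts" architecture, using an in-memory SQLite database. The table names, columns, and the single cleaning rule are illustrative assumptions, not a prescribed design.

# Minimal sketch: operational source -> staging area -> warehouse -> data mart.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# 1. Operational system: day-to-day transactional records.
cur.execute("CREATE TABLE ops_orders (order_id INTEGER, region TEXT, amount REAL)")
cur.executemany("INSERT INTO ops_orders VALUES (?, ?, ?)",
                [(1, "EU", 120.0), (2, "US", 80.0), (3, None, 45.0)])

# 2. Staging area: raw copy of the source, cleaned before loading.
cur.execute("CREATE TABLE stg_orders AS SELECT * FROM ops_orders")
cur.execute("DELETE FROM stg_orders WHERE region IS NULL")  # simple quality rule

# 3. Warehouse: integrated, query-oriented store.
cur.execute("CREATE TABLE dw_sales AS SELECT order_id, region, amount FROM stg_orders")

# 4. Data mart: department-specific subset (here, the EU sales team).
cur.execute("CREATE TABLE mart_sales_eu AS SELECT * FROM dw_sales WHERE region = 'EU'")

print(cur.execute("SELECT * FROM mart_sales_eu").fetchall())  # [(1, 'EU', 120.0)]
con.close()

The staging table lets records be cleaned before they reach the warehouse, and each data mart is simply a department-specific slice of the warehouse.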


Explain data scales and the types of data scales with examples.

In statistics and data analysis, the data scale refers to the level of measurement used to quantify data points. Essentially, it tells us which comparisons and calculations we can meaningfully make based on the data's values. There are four main types of data scales, each with its own characteristics and limitations:

1. Nominal Scale
- Characteristics: Categorizes data into distinct groups without any inherent order or ranking. Imagine sorting books by genre: each genre (fantasy, history, etc.) is distinct, but there is no order between them.
- Examples: Eye color (blue, green, brown), blood type (A, B, AB, O), job titles (doctor, teacher, engineer).
- Operations allowed: Counting and identifying frequencies within each category.

2. Ordinal Scale
- Characteristics: Data points are ranked or ordered, but the intervals between ranks are not necessarily equal. Think of movie ratings (1-5 stars): while we know 4 stars is "better" than 2 stars, the difference in quality might not be the same between all levels.
- Examples: Customer satisfaction ratings (poor, average, good, excellent), socioeconomic status (low, middle, high), degree of injury (minor, moderate, severe).
- Operations allowed: Ranking, identifying median and mode, comparing relative order.

3. Interval Scale
- Characteristics: Data points are ordered with equal intervals between them, but there is no true zero point. Consider temperature in Celsius: the difference between 20°C and 30°C is the same as between 0°C and 10°C, but a temperature of 0°C does not mean "no heat" at all.
- Examples: Temperature (Celsius, Fahrenheit), calendar years, IQ scores.
- Operations allowed: All operations of ordinal scales plus calculations like addition, subtraction, and finding the mean and standard deviation.

4. Ratio Scale
- Characteristics: Data points are ordered with equal intervals and have a true zero point, meaning the absence of the measured quantity. Imagine money: a balance of $0 truly means no money, and the difference between $10 and $20 is the same as between $20 and $30.
- Examples: Age, time, distance, salary, height, and weight.
- Operations allowed: All operations of interval scales plus multiplication and division, so ratios ("twice as much") are meaningful.
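
The following short Python sketch (using pandas, with made-up values) illustrates which operations are meaningful on each scale.

# Illustrative sketch of the four measurement scales (data is made up).
import pandas as pd

nominal = pd.Series(["A", "B", "O", "A", "AB"])            # blood types
ordinal = pd.Series(["poor", "good", "excellent", "good"])  # satisfaction ratings
interval = pd.Series([20.0, 30.0, 0.0, 10.0])               # temperature in Celsius
ratio = pd.Series([10.0, 20.0, 30.0])                       # account balance in $

# Nominal: only counting / frequencies are meaningful.
print(nominal.value_counts().to_dict())

# Ordinal: ranking is meaningful, but differences between ranks are not.
order = pd.CategoricalDtype(["poor", "average", "good", "excellent"], ordered=True)
print(ordinal.astype(order).sort_values().tolist())

# Interval: differences and means are meaningful, ratios are not
# (0 degrees C does not mean "no heat").
print(interval.mean(), interval.max() - interval.min())

# Ratio: a true zero makes ratios meaningful ($20 is twice $10).
print(ratio.iloc[1] / ratio.iloc[0])
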
What Are the Different Data Collection Methods?

Primary and secondary methods of data collection are two approaches used to gather
information for research or analysis purposes. Let's explore each data collection method
in detail:

1. Primary Data Collection:

Primary data collection involves the collection of original data directly from the source or
through direct interaction with the respondents.

a. Surveys and Questionnaires: Researchers design structured questionnaires or surveys to collect data from individuals or groups.
b. Interviews: Interviews involve direct interaction between the researcher and the
respondent.
c. Observations: Researchers observe and record behaviors, actions, or events in their
natural setting.
d. Experiments: Experimental studies involve the manipulation of variables to observe their
impact on the outcome.
e. Focus Groups: Focus groups bring together a small group of individuals who discuss
specific topics in a moderated setting.

2. Secondary Data Collection:

Secondary data collection involves using existing data collected by someone else for a
purpose different from the original intent.

a. Published Sources: Researchers refer to books, academic journals, magazines, newspapers, government reports, and other published materials that contain relevant data.

b. Online Databases: Numerous online databases provide access to a wide range of secondary data, such as research articles, statistical information, economic data, and social surveys.

c. Government and Institutional Records: Government agencies, research institutions, and organizations often maintain databases or records that can be used for research purposes.

d. Publicly Available Data: Data shared by individuals, organizations, or communities on
public platforms, websites, or social media can be accessed and utilized for research.

e. Past Research Studies: Previous research studies and their findings can serve as
valuable secondary data sources.

Explain the steps of data processing?

Data processing involves transforming raw data into valuable information, and it usually follows these key steps:

1. Data Collection: This first step gathers data from various sources like sensors, databases, websites, surveys, or experiments. The chosen method depends on your specific data needs and goals.

2. Data Preparation: Here, you make the raw data usable for analysis. This often involves:
- Cleaning: Removing errors, inconsistencies, and missing values.
- Transformation: Formatting data into a consistent structure, converting units, and handling outliers.
- Integration: Combining data from multiple sources if needed.

3. Data Input: The prepared data is then loaded into a chosen platform for analysis, like a data warehouse, spreadsheet, or statistical software.

4. Data Processing: This is where you analyze and manipulate the data to extract insights. This can involve:
- Descriptive statistics: Summarizing the data through measures like mean, median, and standard deviation.
- Data visualization: Creating charts, graphs, and other visual representations to understand patterns and trends.
- Modeling: Building statistical or machine learning models to predict future outcomes or relationships within the data.

5. Data Output: The extracted insights are presented in a clear and concise way, often through reports, dashboards, or visualizations.

6. Data Storage: Finally, the processed data is saved securely for future use, analysis, or reference.
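
A minimal pandas sketch of these six steps is shown below; the file name "sensor_readings.csv" and its columns are hypothetical.

# Minimal sketch of the six data processing steps with pandas.
import pandas as pd

# 1. Collection: gather raw data from a source (hypothetical file/columns).
raw = pd.read_csv("sensor_readings.csv")          # e.g. columns: timestamp, temp_f

# 2. Preparation: clean, transform, integrate.
clean = raw.dropna().drop_duplicates()            # remove missing values / duplicates
clean["temp_c"] = (clean["temp_f"] - 32) * 5 / 9  # convert units

# 3. Input: load the prepared data into the analysis platform
# (here it simply stays in a DataFrame).

# 4. Processing: descriptive statistics on the cleaned data.
summary = clean["temp_c"].agg(["mean", "median", "std"])

# 5. Output: present the insight in a readable form.
print(summary)

# 6. Storage: save the processed data for future use.
clean.to_csv("sensor_readings_clean.csv", index=False)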

Explain Data Mart in Detail?

A data mart is a subject-oriented, integrated, time-variant,
non-volatile collection of data in support of decision-making
processes for a specific department or business unit within
an organization.
1. Subject-oriented: Data marts are built around specific topics or areas of
interest, such as marketing, sales, finance, or human resources.
2. Integrated: Data marts integrate data from various sources, both internal
and external, into a single, consistent format.
3. Time-variant: Data marts typically track data over time, allowing users to
analyze trends and patterns.
4. Non-volatile: Unlike operational databases that are constantly being
updated, data marts are relatively static.

5. Decision-making support: Ultimately, the purpose of a data mart is to support decision-making processes within a specific department or business unit.
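
As an illustration, the following pandas sketch derives a small, subject-oriented sales mart from a made-up warehouse table; the columns and values are assumptions for the example only.

# Sketch: deriving a subject-oriented sales data mart from warehouse data.
import pandas as pd

warehouse = pd.DataFrame({
    "date":   ["2024-01-05", "2024-01-05", "2024-02-10"],
    "dept":   ["sales", "hr", "sales"],
    "region": ["EU", "EU", "US"],
    "amount": [120.0, 0.0, 80.0],
})

# Subject-oriented: keep only the sales subject area.
sales_mart = warehouse[warehouse["dept"] == "sales"].copy()

# Time-variant: keep the date so trends can be analysed per month.
sales_mart["month"] = pd.to_datetime(sales_mart["date"]).dt.to_period("M")

# Decision-making support: e.g. monthly revenue per region for the sales department.
print(sales_mart.groupby(["month", "region"])["amount"].sum())
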
Data Lake Explained in Detail?
A data lake is essentially a giant container that can hold a massive
amount of data in its raw, native format. Imagine it like a digital
warehouse, but instead of neatly organizing everything into shelves
and categories, it just throws everything in together.
Advantages:
- Scalability: Data lakes can scale up easily to accommodate whatever amount of data you throw at them.
- Flexibility: You can store any type of data in a data lake, regardless of its structure or format. This makes them ideal for organizations that deal with a lot of diverse data.
- Accessibility: Data lakes are designed to be easily accessible to data analysts and scientists.
- Cost-effectiveness: Compared to data warehouses, data lakes are typically more cost-effective, especially for storing large amounts of data.

Challenges:
- Complexity: Managing a data lake can be complex, especially as it grows in size.
- Data quality: Because data lakes store everything, it is easy for low-quality or irrelevant data to creep in.
- Security: Ensuring the security of all that data in a data lake is crucial.
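
The following Python sketch illustrates the "store raw now, apply structure later" (schema-on-read) idea behind a data lake; a local folder stands in for cloud object storage, and the file names and contents are made up.

# Sketch of schema-on-read: dump raw files first, structure them when reading.
import json, pathlib
import pandas as pd

lake = pathlib.Path("lake")
lake.mkdir(exist_ok=True)

# Ingest: store data in its native format, with no upfront schema.
(lake / "clicks.json").write_text(json.dumps([{"user": "u1", "page": "/home"}]))
(lake / "sales.csv").write_text("order_id,amount\n1,120.0\n")

# Analyse: the schema is applied only when the data is read.
clicks = pd.read_json(lake / "clicks.json")
sales = pd.read_csv(lake / "sales.csv")
print(clicks.head(), sales["amount"].sum())
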
Explain KDD in detail with advantages and disadvantages, with a diagram and example.

In the context of computer science, "data mining" can be referred to as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Data mining, also known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.

KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets. Its main steps are:

Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection. It includes (1) cleaning in the case of missing values, (2) cleaning noisy data, where noise is a random or variance error, and (3) cleaning with data discrepancy detection and data transformation tools.

Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common source (data warehouse). It is carried out using data migration tools, data synchronization tools, and the ETL (Extract, Transform, Load) process.

Data Selection: Data selection is defined as the process where the data relevant to the analysis is decided upon and retrieved from the data collection. Techniques such as neural networks, decision trees, naive Bayes, clustering, and regression can be used for this.

Data Transformation: Data transformation is defined as the process of transforming data into the appropriate form required by the mining procedure. It is a two-step process:
1. Data mapping: Assigning elements from the source base to the destination to capture transformations.
2. Code generation: Creation of the actual transformation program.

Data Mining: Data mining is defined as the application of techniques to extract potentially useful patterns. It transforms task-relevant data into patterns and decides the purpose of the model, using classification or characterization.

Pattern Evaluation: Pattern evaluation is defined as identifying the truly interesting patterns that represent knowledge, based on given interestingness measures.

Knowledge Representation: This involves presenting the results in a way that is meaningful and can be used to make decisions.
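
The sketch below walks through the KDD steps on a tiny made-up dataset using pandas and scikit-learn; the data, column names, and the choice of a decision tree are illustrative assumptions, not part of the KDD definition.

# Hedged sketch of the KDD steps on a small, made-up dataset.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Raw data from two hypothetical sources.
src_a = pd.DataFrame({"id": [1, 2, 3, 4], "age": [25, None, 40, 35]})
src_b = pd.DataFrame({"id": [1, 2, 3, 4], "income": [30, 45, 80, 60],
                      "bought": [0, 0, 1, 1]})

# 1. Data cleaning: remove records with missing values.
src_a = src_a.dropna()

# 2. Data integration: combine the sources into one collection.
data = src_a.merge(src_b, on="id")

# 3. Data selection: keep only the attributes relevant to the task.
selected = data[["age", "income", "bought"]]

# 4. Data transformation: put the data in the form the miner needs.
X = selected[["age", "income"]]
y = selected["bought"]

# 5. Data mining: extract a pattern (here, a classification model).
model = DecisionTreeClassifier(max_depth=2).fit(X, y)

# 6. Pattern evaluation: judge how accurate/interesting the pattern is.
print("training accuracy:", accuracy_score(y, model.predict(X)))

# 7. Knowledge representation: present the result for decision-making.
print("predicted purchase for age=30, income=70:",
      model.predict(pd.DataFrame({"age": [30], "income": [70]}))[0])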

Advantages of KDD

1. Improved decision-making
2. Increased efficiency
3. Better customer service
4. Fraud detection
5. Predictive modeling

Disadvantages of KDD

1. Privacy concerns
2. Complexity
3. Data quality
4. High cost
