Data Management & Data Architecture

The document outlines various tools and technologies used in data analytics, including programming languages, data visualization tools, database management systems, and cloud platforms. It also discusses the evolution of data analytics from manual processes to advanced analytics and machine learning, highlighting real-life examples from companies like Netflix and Amazon. Additionally, it emphasizes the importance of data management and architecture in ensuring efficient data analysis and insights generation.

Tools of Data Analytics:-

In the field of data analytics, a wide range of tools and technologies are available to help professionals collect, clean, process, analyze, and visualize data. These tools empower analysts to make sense of vast datasets and extract valuable insights efficiently. Here are some of the essential tools of the trade in data analytics:

Programming Languages:

Python: Python is a versatile and popular programming language for data analytics. It offers libraries such as Pandas, NumPy, Matplotlib, and Seaborn for data manipulation, analysis, and visualization.
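
For illustration, a minimal sketch of how these libraries work together (the file name and column names here are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

# Load a CSV file into a DataFrame (hypothetical file and columns)
sales = pd.read_csv("monthly_sales.csv")

# Clean and summarize with Pandas
sales = sales.dropna(subset=["region", "revenue"])
summary = sales.groupby("region")["revenue"].sum()

# Visualize with Matplotlib
summary.plot(kind="bar", title="Revenue by Region")
plt.tight_layout()
plt.show()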

R: R is another widely used programming language for statistical analysis and data visualization. It has a rich ecosystem of packages like ggplot2, dplyr, and tidyr.

Data Visualization Tools:

Tableau: Tableau is a powerful data visualization tool that allows users to create interactive and shareable dashboards. It's known for its ease of use and ability to connect to various data sources.

Power BI: Microsoft's Power BI is a business intelligence tool that provides robust data visualization and reporting capabilities. It's particularly well-suited for organizations using Microsoft products.

D3.js: D3.js is a JavaScript library for creating custom data visualizations. It gives analysts full control over the visualization design and interactivity.

Database Management Systems:

SQL: SQL (Structured Query Language) is essential for querying and managing relational databases. It's used for data retrieval, manipulation, and data cleaning.

MySQL, PostgreSQL, SQLite: These are popular open-source relational database management systems (RDBMS) often used in data analytics projects.
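
As an illustration only, SQL can also be run from Python against SQLite via the standard library; the table and values below are invented for the example:

import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory SQLite database
cur = conn.cursor()

# Create and populate a small example table
cur.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "North", 120.0), (2, "South", 80.5), (3, "North", 60.0)],
)

# A typical analytical query: total order amount per region
cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
print(cur.fetchall())
conn.close()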

NoSQL databases: For handling unstructured or semi-structured data, NoSQL databases like MongoDB, Cassandra, and Elasticsearch are valuable.

Big Data Tools:

Hadoop: Hadoop is an open-source framework for distributed storage and processing of big data. Hadoop's ecosystem includes tools like HDFS, MapReduce, and Hive.

Spark: Apache Spark is another big data framework known for its speed
and versatility. It's used for data processing, machine learning, and graph
analytics.

Statistical Analysis Software:

IBM SPSS (Statistical Package for the Social Sciences): SPSS is a statistical software package that provides advanced statistical analysis, data mining, and predictive analytics capabilities.

SAS: SAS (Statistical Analysis System) offers a suite of analytics solutions for data analysis, machine learning, and statistical modeling.

Machine Learning Libraries:

Scikit-Learn (SciPy Toolkit): Scikit-Learn is a Python library that provides tools for machine learning, including classification, regression, clustering, and model evaluation. SciPy itself is used to solve scientific and mathematical problems.
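
A small, self-contained classification sketch with Scikit-Learn, using its bundled iris dataset (the model choice and parameters are just one reasonable option):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load a built-in dataset and split it for training and evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a classifier and evaluate it on the held-out data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))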

TensorFlow and PyTorch: These libraries are popular for deep learning
and neural network development.

Data Cleaning and Preprocessing Tools:

OpenRefine: OpenRefine (formerly Google Refine) is a tool for cleaning and transforming messy data, making it suitable for analysis.

Trifacta: Trifacta offers data wrangling capabilities that simplify the process of cleaning and structuring data.

Note: Data wrangling is the process of converting raw data into a usable form.

Cloud Platforms:-

Amazon Web Services (AWS): AWS provides a wide range of cloud-based data analytics services, including Amazon Redshift, Amazon Athena, and Amazon SageMaker.

Google Cloud Platform (GCP): GCP offers BigQuery, Dataflow, and other services for data analytics and machine learning.

Microsoft Azure: Azure provides services like Azure SQL Data Warehouse and Azure Machine Learning for data analytics.

The Evolution of Data Analytics:-


Early Stages:

Manual Data Analysis (Pre-20th Century): Before the advent of computers, data analysis was a manual and time-consuming process. Analysts relied on charts, graphs, and basic statistical methods for insights.

Emergence of Computers:

1950s-1960s: Mainframe Era: With the rise of mainframe computers, organizations started using rudimentary data processing for business applications. This marked the beginning of more systematic data handling.

1970s: Decision Support Systems (DSS): Decision Support Systems emerged, integrating computer systems with data analysis tools to assist in decision-making.

Database Management Systems (DBMS):


1980s: Rise of Relational Databases: The advent of relational database
management systems (RDBMS) streamlined data storage and retrieval,
laying the foundation for structured data handling.

Data Warehousing and Business Intelligence:

1990s: Data Warehousing and BI: Data warehousing gained prominence, allowing organizations to consolidate and analyze data from various sources. Business Intelligence tools facilitated more accessible reporting and analysis.

Big Data Era:

Early 2000s: Big Data Emergence: With the proliferation of the internet,
social media, and sensors, the volume of data exploded. The term "Big
Data" emerged, emphasizing the challenges and opportunities posed by
massive datasets.

Mid-2000s to Early 2010s: Hadoop and NoSQL: Technologies like Hadoop and NoSQL databases addressed the scalability issues associated with handling large volumes of unstructured and semi-structured data.

Advanced Analytics and Machine Learning:

2010s-2020s: Advanced Analytics: The integration of machine learning and advanced analytics became more prevalent. Predictive analytics, data mining, and artificial intelligence played a crucial role in deriving insights from complex datasets.

Current Trends:

Real-time Analytics and Edge Computing: The need for real-time insights led to the development of technologies like edge computing, allowing data analysis closer to the source.

Augmented Analytics: The use of machine learning to automate data preparation, insight discovery, and the sharing of insights has become more widespread, making analytics more accessible to non-experts.

Exponential Growth in Data: The sheer volume of data generated continues to grow exponentially, challenging organizations to develop strategies for effective data management and analysis.

REAL-LIFE EXAMPLES OF DATA ANALYTICS:-

Netflix's Content Recommendation: Netflix, the popular streaming service, uses data analytics to recommend content to its users. They collect data on user preferences, viewing history, and more. By analyzing this data, they can suggest movies and TV shows that users are likely to enjoy. This personalized recommendation system has contributed to their tremendous growth and customer satisfaction.

Amazon's Supply Chain Optimization: Amazon employs data analytics to optimize its supply chain. They use historical data, real-time data, and predictive analytics to forecast demand, manage inventory efficiently, and reduce shipping times. This has allowed them to offer fast and reliable delivery to their customers.

Healthcare Predictive Analytics: Hospitals and healthcare providers are using data analytics to predict patient outcomes, identify disease trends, and improve patient care.

For example: 1. The Cleveland Clinic uses predictive analytics to identify patients at risk of readmission, allowing them to provide early interventions and reduce healthcare costs.

2. Mount Sinai Health System in New York demonstrates the power of using data to prevent readmissions, improve patient outcomes, and drive cost savings.

Uber's Dynamic Pricing: Uber uses real-time data analytics to implement surge pricing during periods of high demand. This data-driven approach helps balance supply and demand, ensuring that customers can get a ride when they need one and drivers are available during peak times.

Walmart's Inventory Management: Walmart utilizes data analytics to manage its vast inventory efficiently. They collect data on sales, weather patterns, and even social media trends. By analyzing this data, they can make informed decisions about inventory levels, reducing costs and ensuring products are available when customers want them.

Sports Analytics: In the sports industry, data analytics has become increasingly important. Teams use analytics to assess player performance, make strategic decisions during games, and even identify potential talent. The "Moneyball" story, which revolves around the Oakland Athletics' use of analytics to build a competitive baseball team, is a well-known example.

E-commerce Personalization: Companies like eBay and Amazon employ data analytics to personalize the online shopping experience. They analyze customer behavior and browsing history to recommend products and tailor the website interface, leading to increased sales and customer satisfaction.

DATA MANAGEMENT IN DATA ANALYTICS:

Data management in data analytics refers to the process of collecting, organizing, storing, and maintaining data in a way that makes it accessible, accurate, and secure for analysis. Effective data management is essential for ensuring that analytics can be performed efficiently and that the insights derived are trustworthy. Here are the key components involved in data management for analytics:

Data Collection: Gathering data from various sources, such as databases, sensors, social media, business transactions, and external data providers. This step is crucial to ensure that the data is comprehensive and relevant to the analysis.

Data Storage: Organizing data in databases or data warehouses. It could involve cloud storage, on-premises databases, or hybrid systems, depending on the organization's needs. A well-structured storage system helps in easily retrieving and querying the data when needed.

Data Cleaning: Ensuring that the data is accurate, complete, and free from
inconsistencies. Data cleaning involves removing duplicates, handling missing
values, correcting errors, and standardizing data formats to improve the quality
of data before analysis.
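
A brief Pandas sketch of typical cleaning steps; the records and column names are invented for illustration:

import pandas as pd

# Invented raw records with typical quality problems
raw = pd.DataFrame({
    "customer": ["Alice", "alice ", "Bob", None],
    "signup_date": ["2023-01-05", "2023-01-05", "2023-02-10", "2023-03-01"],
    "spend": [100.0, 100.0, None, 250.0],
})

clean = (
    raw.assign(customer=raw["customer"].str.strip().str.title())  # standardize text values
       .dropna(subset=["customer"])                               # drop rows missing key fields
       .drop_duplicates()                                         # remove duplicate records
)
clean["signup_date"] = pd.to_datetime(clean["signup_date"])       # standardize the date format
clean["spend"] = clean["spend"].fillna(clean["spend"].median())   # impute missing numeric values
print(clean)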

Data Integration: Combining data from multiple sources to create a unified view. This might involve merging datasets from different departments, applications, or external providers to form a comprehensive dataset that is useful for analysis.

Data Security: Implementing measures to protect data from unauthorized access, corruption, or loss. This includes encryption, access control, data masking, and regular backups, especially when dealing with sensitive or personal information.

Data Governance: Establishing policies and procedures that ensure data is managed consistently, ethically, and in compliance with regulations. It also ensures that there is accountability in the data management process.

Data Accessibility: Ensuring that the right people have access to the data they
need, when they need it, without compromising security. This can include
using tools like dashboards, APIs, or data lakes that allow for easy access and
sharing of data across different teams.

Data Quality Assurance: Continuously monitoring and improving data quality to ensure that data remains reliable and useful for analysis over time. This can involve setting data quality metrics and using automated tools for monitoring data health.

Data Analysis and Reporting: Once the data is organized, cleaned, and made
accessible, it’s ready for analysis. Effective data management helps ensure that
data analysts and data scientists can extract meaningful insights, generate
reports, and build predictive models without running into data quality or
accessibility issues.

DESIGN DATA ARCHITECTURE:

Data Sources
Internal Data: CRM, ERP, databases, logs, etc.
External Data: APIs, third-party services, IoT, etc.

Data Ingestion
Batch: ETL/ELT, scheduled batch jobs
Real-Time: stream processing, real-time feeds
Hybrid: combination of batch and stream processing, custom pipelines

Data Storage Layer
Data Lake: Raw data (AWS S3, Azure Data Lake)
Data Warehouse: Structured, optimized for queries (Snowflake, Redshift, BigQuery)
NoSQL: Unstructured data (MongoDB, Cassandra, etc.)
Data Marts: Department-specific data (Marketing, Finance)

Data Processing Layer
ETL/ELT: Data transformation (Apache Spark, Databricks)
Batch Processing: Big data processing (Hadoop, Spark)
Stream Processing: Real-time analytics (Flink, Spark Streaming)
Data Wrangling & Cleansing: Prepare data for analysis

Data Analytics Layer
BI Tools: Dashboards, reporting (Power BI, Tableau)
Data Science: Predictive analytics (ML models)
Prescriptive Analytics: Actionable insights (AI)
Self-Service BI: Empower business users to analyze data

Data Presentation Layer
Dashboards & Reports: Interactive visualizations
APIs: Enable programmatic data access
Alerts & Notifications: Real-time triggers for users

Data Governance & Security
Data Quality: Ensure accuracy and consistency
Data Security: Encryption, role-based access (RBAC)
Compliance: Regulatory adherence (GDPR, HIPAA, etc.)
Data architecture design is a set of standards composed of policies, rules, models, and conventions that govern what type of data is collected, where it is collected from, how the collected data is arranged and stored, and how it is utilized and secured in systems and data warehouses for further analysis.
 Data architecture design is important for creating a vision of the interactions occurring between data systems.
 Data architecture also describes the type of data structures applied to manage data and provides an easy way for data preprocessing.
Designing a data architecture involves creating a blueprint for how data will be collected, processed, stored, and consumed to meet business and analytics goals. The layered outline above shows how to design a modern data architecture for analytics, with an emphasis on scalability, flexibility, and efficiency.

Key Components of the Data Architecture:

Data Sources:

1. Internal: Includes databases (SQL, NoSQL), CRMs, ERPs, and operational systems that generate data.
2. External: APIs, external data providers, social media, web scraping, and third-party services.

Data Ingestion Layer:

Data ingestion refers to the process of importing, transferring, or loading data from various external sources into a system or storage infrastructure.

1. Batch Processing: Periodic data ingestion processes using ETL tools or batch jobs.
2. Real-Time Ingestion: Tools like AWS Kinesis or Google Pub/Sub for continuous data flow from IoT, event logs, or real-time sources.
3. Hybrid: A combination of both batch and real-time ingestion processes to handle different data types and requirements.

Data Storage Layer:

1. Data Lake: Centralized, cost-effective storage for raw, unprocessed data (e.g., AWS S3 (Simple Storage Service), Google Cloud Storage).
2. Data Warehouse: Structured data storage optimized for analytical querying (e.g., Amazon Redshift, Snowflake, BigQuery).
3. NoSQL Database: Used for semi-structured or unstructured data that doesn't fit relational models (e.g., MongoDB, Cassandra).
4. Data Marts: Small, subject-specific data stores for departments like finance, marketing, etc.

Data Processing Layer:

1. ETL/ELT: Extract, Transform, Load (or Extract, Load, Transform) processes for converting raw data into structured and clean data for analytics; a minimal sketch follows this list.
2. Batch Processing: Uses big data frameworks like Apache Spark or Hadoop to process large datasets.
3. Stream Processing: Stream processing is the processing of data in real time. Real-time analytics and data transformation using Apache Flink or Spark Streaming.
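
As a rough illustration of the ETL pattern mentioned in item 1, here is a toy batch job in Python/Pandas; the file path, column names, and target table are assumptions, and SQLite merely stands in for a warehouse:

import sqlite3
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extract: read raw data from a source file (hypothetical CSV)
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean and reshape the raw records
    df = df.dropna(subset=["order_id"]).drop_duplicates("order_id")
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["revenue"] = df["quantity"] * df["unit_price"]
    return df[["order_id", "order_date", "revenue"]]

def load(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    # Load: write the curated table into the target store
    df.to_sql("fact_orders", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    connection = sqlite3.connect("warehouse.db")
    load(transform(extract("raw_orders.csv")), connection)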

Data Analytics Layer:

1. BI Tools: Tools like Power BI and Tableau are used for reporting and building interactive dashboards.
2. Data Science: Predictive analytics powered by machine learning models, using tools like Python (scikit-learn, TensorFlow), R, or cloud ML platforms like AWS SageMaker or Azure ML.
3. Prescriptive Analytics: Optimizing business processes or providing actionable insights using algorithms and AI.

Data Presentation Layer:

1. Dashboards & Reports: Visualize data trends, KPIs (key performance indicators), and actionable insights for business decision-makers.
2. APIs: Expose data and analytical models via APIs to other systems, services, or applications.
3. Alerts & Notifications: Real-time alerts and notifications based on predefined business rules or triggers.

Data Governance & Security:

1. Data Quality: Ensure that data is accurate, clean, and consistently meets the organization's standards.
2. Security: Apply security measures such as encryption, role-based access control (RBAC), and other security protocols to protect sensitive data.
3. Compliance: Ensure compliance with data regulations (GDPR, HIPAA, etc.).
4. Data Lineage: Track the flow of data from its source to its consumption to ensure transparency and traceability.

Tools and Technologies:

 Data Ingestion: Kafka, AWS Kinesis, Apache Flink, AWS Glue, Talend
 Data Storage: AWS S3 (Data Lake), Snowflake, Amazon Redshift, BigQuery, MongoDB, Cassandra
 Data Processing: Apache Spark, Databricks, Apache Hadoop, Apache Flink
 Data Analytics: Tableau, Power BI, Jupyter Notebooks, AWS SageMaker, TensorFlow, scikit-learn
 Data Governance: Apache Atlas, Amundsen, DataHub
 Security: AWS KMS, Azure Key Vault, IAM (Identity and Access Management), data encryption technologies

Advanced Analytics Techniques

Advanced analytics techniques refer to a set of sophisticated and complex methods used to analyze and interpret data, uncover patterns, trends, and insights, and make informed business decisions. These techniques go beyond basic statistical analysis and traditional business intelligence approaches. Advanced analytics leverages cutting-edge computational and statistical methods to handle large volumes of data and extract meaningful information. Some common advanced analytics techniques include:

Machine Learning (ML): ML algorithms enable computers to learn from data and make predictions or decisions without explicit programming. Supervised learning, unsupervised learning, and reinforcement learning are common ML approaches.

Predictive Analytics: This involves using statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data. Predictive analytics is widely used in forecasting and risk assessment.

Data Mining: Data mining involves the extraction of patterns and knowledge from large datasets. It includes techniques such as clustering, association rule mining, and anomaly detection.

Natural Language Processing (NLP): NLP techniques enable computers to understand, interpret, and generate human-like text. This is especially useful for analyzing unstructured data, such as social media posts or customer reviews.

Text Analytics: This involves analyzing and extracting insights from textual data. Techniques include sentiment analysis, named entity recognition, and topic modeling.

Big Data Analytics: Big data analytics involves processing and analyzing massive datasets that traditional databases cannot handle. Technologies like Hadoop and Spark are commonly used in big data analytics.

Prescriptive Analytics: This type of analytics goes beyond predicting future outcomes and recommends actions to achieve desired outcomes. It combines predictive analytics, optimization, and simulation.

Time Series Analysis: This technique is used to analyze time-ordered data points to identify patterns, trends, and seasonality. It is commonly applied in finance, economics, and forecasting.
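
A short Pandas sketch of one basic time-series technique, a rolling mean that smooths seasonality to expose the trend (the series itself is synthetic):

import numpy as np
import pandas as pd

# Synthetic monthly series: trend + seasonality + noise
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
values = (np.linspace(100, 160, 36)
          + 10 * np.sin(np.arange(36) * 2 * np.pi / 12)
          + np.random.default_rng(0).normal(0, 3, 36))
series = pd.Series(values, index=idx)

# A 12-month rolling mean smooths out seasonality and reveals the trend
trend = series.rolling(window=12, center=True).mean()
print(pd.DataFrame({"observed": series, "trend": trend}).head(15))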

Cluster Analysis: This involves grouping similar data points together based on certain characteristics. It is often used for customer segmentation and anomaly detection.
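
A compact customer-segmentation sketch using k-means from Scikit-Learn; the two features and their values are invented:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented customer features: annual spend and number of visits
X = np.array([[200, 4], [220, 5], [1500, 30], [1600, 28], [800, 15], [780, 14]])

# Scale features so both contribute equally, then cluster into 3 segments
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print("Segment labels:", kmeans.labels_)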

Optimization Techniques: These methods aim to find the best solution to a problem by optimizing certain parameters. Linear programming, integer programming, and genetic algorithms are examples of optimization techniques.

Simulation Modeling: This involves creating computer models that simulate the behavior of complex systems. It is used to understand how changes in variables can impact the overall system.

Spatial Analytics: This involves analyzing geographic or spatial data to identify patterns and trends. Geographic Information System (GIS) technology is commonly used in spatial analytics.

Fraud Analytics: Utilizing advanced analytics to detect and prevent fraudulent activities. Machine learning algorithms can analyze patterns in transactions to identify anomalies indicative of fraud.

Social Network Analysis (SNA): SNA examines relationships and interactions between entities, such as individuals or organizations, to reveal patterns and structures within social networks.

Survival (Reliability) Analysis: This is commonly used in maintenance and reliability engineering to predict the remaining useful life of equipment and assets, helping organizations optimize maintenance schedules.

Cohort Analysis: Involves grouping individuals with shared characteristics to analyze trends and behaviors over time. It's often used in marketing to understand customer behavior.
A/B Testing (Split Testing): This is a statistical method used to compare
two versions of a product or webpage to determine which performs better.
It's commonly used in marketing and product development.
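
A minimal A/B-test calculation using a two-proportion z-test from statsmodels; the visitor and conversion counts are hypothetical:

from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions and visitors for variants A and B
conversions = [120, 150]
visitors = [2400, 2380]

# Two-sided test of whether the conversion rates differ
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")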

Bayesian Analysis: A statistical method based on Bayes' theorem that updates probabilities as new data becomes available. It's particularly useful in situations with limited data.
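
A tiny worked example of the Bayesian update, with all probabilities assumed for illustration:

# Bayes' theorem sketch: updating a defect-rate belief after a positive test
prior = 0.01           # assumed prior probability that an item is defective
sensitivity = 0.95     # assumed P(test positive | defective)
false_positive = 0.10  # assumed P(test positive | not defective)

evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence
print(f"P(defective | positive test) = {posterior:.3f}")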

Deep Learning: A subset of machine learning that involves neural networks with multiple layers (deep neural networks). Deep learning is especially effective in tasks like image and speech recognition.

Quantitative Risk Analysis: Involves assessing and quantifying risks using statistical models and simulations. It helps organizations understand the potential impact of risks and prioritize risk management strategies.

Pattern Recognition: This involves identifying and classifying patterns within data. It's used in various applications, such as image recognition, speech recognition, and medical diagnosis.

Bayesian Networks: A probabilistic graphical model that represents a set of variables and their probabilistic dependencies. It's used for reasoning under uncertainty and is applied in various fields, including healthcare and finance.

Ensemble Learning: This involves combining multiple machine learning models to improve predictive performance and reduce overfitting.

Predictive Maintenance: This uses advanced analytics to predict when equipment or machinery is likely to fail, allowing for proactive maintenance and minimizing downtime.

Organizations use advanced analytics techniques to gain a competitive advantage, improve decision-making processes, and uncover hidden insights within their data. These techniques are applied across various industries, including finance, healthcare, marketing, manufacturing, and more.

Data Management:
 Data management is the process of handling tasks like extracting, storing, transferring, processing, and securing data at low cost.
 The main motive of data management is to manage and safeguard people's and organizations' data in an optimal way so that they can easily create, access, update, and delete the data.
 Data management is an essential process for the growth of every enterprise; without it, policies and decisions cannot be made for business advancement. The better the data management, the better the productivity of the business.
 Large volumes of data, like big data, are hard to manage with traditional approaches, so optimal technologies and tools such as Hadoop, Scala, Tableau, and AWS must be used for data management, which can further be used in big data analysis to uncover patterns and improvements.
 Data management is achieved by training employees appropriately and through maintenance by DBAs, data analysts, and data architects.

Data Collection:
 Data collection is the process of acquiring, collecting, extracting, and storing voluminous amounts of data, which may be in structured or unstructured form such as text, video, audio, XML files, records, or image files, and which is used in later stages of data analysis.
 In the process of data analysis, data collection is the initial step, performed before analyzing the data for patterns or useful information.
 The data to be analyzed must be collected from different valid sources.
 The data collected initially is known as raw data; it is not directly useful, but cleaning its impurities and using it for further analysis turns it into information, and the insight obtained from that information is known as knowledge.
 The main goal of data collection is to collect information-rich data.
 Data collection starts with asking some questions such as what type of
data is to be collected and what is the source of collection.
Various sources of Data:
The data sources are divided mainly into two types known as:
1. Primary data
2. Secondary data

1. Primary data:
The data which is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly by performing techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise, it becomes a burden during data processing.
A few methods of collecting primary data:
1. Interview method:
The data collected through this process comes from interviewing the target audience; the person conducting the interview is called the interviewer, and the person who answers is the interviewee. Some basic business- or product-related questions are asked and noted down in the form of notes, audio, or video, and this data is stored for processing. Interviews can be structured or unstructured, such as personal or formal interviews conducted over the telephone, face to face, by email, etc.
2. Survey method:
The survey method is a research process where a list of relevant questions is asked and the answers are noted down in the form of text, audio, or video. Surveys can be carried out in both online and offline modes, for example through website forms and email, and the responses are then stored for analysis. Examples are online surveys or surveys through social media polls.
3. Observation method:
The observation method is a method of data collection in which the researcher keenly observes the behavior and practices of the target audience using some data collection tool and stores the observed data in the form of text, audio, video, or other raw formats. In this method, data may also be collected by posing a few questions to the participants. For example, observing a group of customers and their behavior towards a product; the data obtained is then sent for processing.
4. Experimental method:
The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD, and FD.
 CRD - Completely Randomized Design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing experiments.
 RBD - Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA); a simplified sketch follows this list. RBD originated in the agriculture sector.
 LSD - Latin Square Design is an experimental design that is similar to CRD and RBD but arranges treatments in rows and columns. It is an arrangement of N x N squares with an equal number of rows and columns, in which each letter occurs exactly once in each row and column. Hence the differences can be found with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin square design.
 FD - Factorial Design is an experimental design in which each experiment involves two or more factors, each with several possible values (levels), and trials are performed over combinations of these factor levels.
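
As referenced in the RBD item above, analysis of variance can be computed with SciPy; this simplified one-way ANOVA ignores blocking, and the three treatment groups are invented yields:

from scipy import stats

# Invented yields under three treatments (e.g., plots in an agricultural trial)
treatment_a = [20.1, 21.5, 19.8, 22.0]
treatment_b = [23.4, 24.1, 22.8, 23.9]
treatment_c = [19.0, 18.7, 20.2, 19.5]

# One-way ANOVA: do the treatment means differ significantly?
f_stat, p_value = stats.f_oneway(treatment_a, treatment_b, treatment_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")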

2. Secondary data:
Secondary data is data that has already been collected and is reused for some valid purpose. This type of data is derived from previously recorded primary data and has two types of sources, namely internal sources and external sources.
Internal source:
These types of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time required to obtain data from internal sources are low.
External source:
The data which cannot be found within the organization and is obtained through external third-party resources is external source data. The cost and time required are higher because this involves a huge amount of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
Other sources:
 Sensor data: With the advancement of IoT devices, the sensors in these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
 Satellite data: Satellites collect large volumes of images and data, in terabytes per day, through their onboard cameras, which can be processed to extract useful information.
 Web traffic: Thanks to fast and cheap internet access, many formats of data uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data through the keywords and queries that are searched most often.

Data Architecture Design:


 Data architecture design is a set of standards composed of policies, rules, models, and conventions that govern what type of data is collected, where it is collected from, how the collected data is arranged and stored, and how it is utilized and secured in systems and data warehouses for further analysis.
 Data architecture design is important for creating a vision of the interactions occurring between data systems.
 Data architecture also describes the type of data structures applied to manage data and provides an easy way for data preprocessing.
 The data architecture is formed by dividing it into three essential models, which are then combined:

 Conceptual model:
It is a business model which uses the Entity Relationship (ER) model to represent relations between entities and their attributes.
 Logical model:
It is a model where problems are represented in a logical form such as rows and columns of data, classes, XML tags, and other DBMS techniques.
 Physical model:
The physical model holds the database design, such as which type of database technology is suitable for the architecture.

Data Architect:
 A data architect is responsible for the design, creation, management, and deployment of the data architecture and defines how data is to be stored and retrieved; other decisions are made by internal bodies.
Factors that influence Data Architecture:
A few influences that can have an effect on data architecture are business policies, business requirements, technology in use, business economics, and data processing needs.

 Business requirements:
These include factors such as the expansion of the business, the performance of system access, data management, transaction management, making use of raw data by converting it into image files and records, and then storing it in data warehouses. Data warehouses are the main aspect of storing business transactions.
 Business policies:
The policies are rules that describe the way data is processed. These policies are made by internal organizational bodies and other government agencies.
 Technology in use:
This includes drawing on examples of previously completed data architecture designs as well as existing licensed software purchases and database technology.
 Business economics:
Economic factors such as business growth and loss, interest rates, loans, the condition of the market, and the overall cost will also have an effect on the design of the architecture.
 Data processing needs:
These include factors such as mining of the data, large continuous transactions, database management, and other data preprocessing needs.
