Data Management & Data Architecture
Programming Languages and Frameworks:
Spark: Apache Spark is a big data framework known for its speed and versatility. It is used for data processing, machine learning, and graph analytics.
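For illustration, here is a minimal sketch of a Spark job in Python (a sketch only, assuming a local pyspark installation; the column names and values are made up):

    from pyspark.sql import SparkSession

    # Start a local Spark session
    spark = SparkSession.builder.appName("demo").getOrCreate()

    # Build a small DataFrame and run a simple aggregation
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("alice", 29)],
        ["name", "age"],
    )
    df.groupBy("name").avg("age").show()

    spark.stop()

The same DataFrame API scales from a single laptop to a cluster, which is a large part of Spark's appeal.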
TensorFlow and PyTorch: These libraries are popular for deep learning
and neural network development.
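As a small illustration, the following sketch (PyTorch, with random stand-in data) defines a tiny feed-forward network and runs one training step:

    import torch
    import torch.nn as nn

    # A tiny network: 4 inputs -> 8 hidden units -> 1 output
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # One training step on random data (illustrative only)
    x, y = torch.randn(16, 4), torch.randn(16, 1)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"loss: {loss.item():.4f}")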
Note: Data wrangling is the process of converting raw data into a usable form.
Emergence of Computers:
Early 2000s: Big Data Emergence: With the proliferation of the internet,
social media, and sensors, the volume of data exploded. The term "Big
Data" emerged, emphasizing the challenges and opportunities posed by
massive datasets.
Current Trends:
Data Cleaning: Ensuring that the data is accurate, complete, and free from
inconsistencies. Data cleaning involves removing duplicates, handling missing
values, correcting errors, and standardizing data formats to improve the quality
of data before analysis.
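As a hedged sketch of these steps in Python with pandas (the table and its values are hypothetical):

    import pandas as pd

    # Raw data with a duplicate row, a missing value, and inconsistent date strings
    raw = pd.DataFrame({
        "customer": ["Ann", "Ann", "Bob", "Cara"],
        "amount": [100.0, 100.0, None, 250.0],
        "order_date": ["2024-01-05", "2024-01-05", "2024/02/10", "2024-02-10"],
    })

    clean = (
        raw.drop_duplicates()  # remove duplicate rows
           .assign(
               amount=lambda d: d["amount"].fillna(d["amount"].median()),      # handle missing values
               order_date=lambda d: pd.to_datetime(d["order_date"],
                                                   format="mixed"),            # standardize date formats (pandas >= 2.0)
           )
    )
    print(clean)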
Data Accessibility: Ensuring that the right people have access to the data they
need, when they need it, without compromising security. This can include
using tools like dashboards, APIs, or data lakes that allow for easy access and
sharing of data across different teams.
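For instance, a minimal sketch of exposing a dataset through an API might look like the following (assumes the Flask package; the /sales endpoint and SALES data are hypothetical, and a real deployment would sit behind authentication):

    from flask import Flask, jsonify

    app = Flask(__name__)

    # Hypothetical in-memory dataset standing in for a governed data store
    SALES = [
        {"region": "north", "revenue": 1200},
        {"region": "south", "revenue": 950},
    ]

    @app.route("/sales")
    def sales():
        # In production this endpoint would enforce access controls
        return jsonify(SALES)

    if __name__ == "__main__":
        app.run(port=5000)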
Data Analysis and Reporting: Once the data is organized, cleaned, and made
accessible, it’s ready for analysis. Effective data management helps ensure that
data analysts and data scientists can extract meaningful insights, generate
reports, and build predictive models without running into data quality or
accessibility issues.
Data Sources:
Data Management:
Data management is the process of handling tasks such as extracting, storing, transferring, processing, and securing data at the lowest possible cost.
The main motive of data management is to manage and safeguard personal and organizational data in an optimal way, so that users can easily create, access, update, and delete it.
Data management is an essential process for the growth of every enterprise; without it, the policies and decisions needed for business advancement cannot be made. The better the data management, the better the productivity of the business.
Large volumes of data, such as big data, are hard to manage with traditional methods, so optimal technologies and tools such as Hadoop, Scala, Tableau, and AWS must be used. These can then be applied to big data analysis to discover patterns and drive improvements.
Data management is achieved by training employees appropriately, together with ongoing maintenance by DBAs, data analysts, and data architects.
Data Collection:
Data collection is the process of acquiring, extracting, and storing a voluminous amount of data, which may be in structured or unstructured form such as text, video, audio, XML files, records, or image files, for use in later stages of data analysis.
In the process of data analysis, data collection is the initial step, carried out before any patterns or useful information in the data can be analyzed. The data to be analyzed must be collected from valid sources.
The data collected is known as raw data, which is not useful on its own; once the impurities are cleaned out and the data is used for further analysis, it becomes information, and the insight obtained from that information is known as knowledge.
The main goal of data collection is to collect information-rich data.
Data collection starts with asking questions such as: what type of data is to be collected, and from what source?
Various sources of Data:
The data sources are divided mainly into two types known as:
1. Primary data
2. Secondary data
1. Primary data:
Data that is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly by applying techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise, it becomes a burden during data processing.
A few methods of collecting primary data:
1. Interview method:
In this method, data is collected by interviewing the target audience; the person conducting the interview is called the interviewer, and the person answering is known as the interviewee. Basic business- or product-related questions are asked, recorded in the form of notes, audio, or video, and the data is stored for processing. Interviews can be both structured and unstructured, such as personal interviews or formal interviews conducted by telephone, face to face, or by email.
2. Survey method:
The survey method is a research process in which a list of relevant questions is asked and the answers are recorded in the form of text, audio, or video. Surveys can be conducted in both online and offline modes, for example through website forms or email, and the survey answers are then stored for data analysis. Examples include online surveys and polls on social media.
3. Observation method:
The observation method is a method of data collection in which the researcher keenly observes the behavior and practices of the target audience using a data collection tool and stores the observed data in the form of text, audio, video, or other raw formats. In this method, data is collected directly by posing a few questions to the participants. For example, a researcher might observe a group of customers and their behavior towards certain products. The data obtained is then sent for processing.
4. Experimental method:
The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD, and FD, described below; a short code sketch of randomized assignment and a Latin square follows this list.
CRD - Completely Randomized Design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing experimental treatments.
RBD - Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks, and results are drawn using a technique known as analysis of variance (ANOVA). RBD originated in the agricultural sector.
LSD - Latin Square Design is an experimental design similar to CRD and RBD but arranged in rows and columns. It is an N x N arrangement with an equal number of rows and columns, in which each letter occurs exactly once in every row and every column. Hence, differences can be found with fewer errors in the experiment. A solved Sudoku grid is an example of a Latin square.
FD - Factorial Design is an experimental design in which each experiment involves two or more factors, each with several possible values (levels), and trials are performed over combinations of these factor levels.
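To make these designs concrete, here is a minimal Python sketch (the unit and treatment names are made up for illustration): it randomly assigns treatments to experimental units, as in a CRD, and builds a cyclic Latin square of the kind used in an LSD.

    import random

    def crd_assignment(units, treatments, seed=0):
        # Completely Randomized Design: each unit gets a random treatment
        rng = random.Random(seed)
        return {u: rng.choice(treatments) for u in units}

    def latin_square(n):
        # Cyclic construction: symbol (i + j) % n + 1 appears exactly once
        # in every row and every column
        return [[(i + j) % n + 1 for j in range(n)] for i in range(n)]

    print(crd_assignment(["plot1", "plot2", "plot3", "plot4"], ["A", "B"]))
    for row in latin_square(4):
        print(row)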
2. Secondary data:
Secondary data is data that has already been collected and is reused for some valid purpose. This type of data was previously derived from primary data, and it comes from two types of sources, named internal sources and external sources.
Internal source:
This type of data can easily be found within the organization, such as market records, sales records, transactions, customer data, and accounting resources. The cost and time consumed in obtaining data from internal sources are low.
External source:
Data that cannot be found within the organization and must be obtained through external third-party resources is external source data. The cost and time consumption are greater because such sources contain a huge amount of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
Other sources:
Sensor data: With the advancement of IoT devices, the sensors on these devices collect data that can be used for sensor data analytics to track the performance and usage of products; a small sketch appears after this list.
Satellite data: Satellites collect terabytes of images and data daily through their onboard cameras and sensors, which can be mined for useful information.
Web traffic: Thanks to fast and cheap internet access, data of many formats uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data on the keywords and queries searched most often.
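As a small illustration of sensor data analytics (simulated readings from a hypothetical temperature sensor; the 3-sigma threshold is an assumption):

    import random
    import statistics

    # Simulate 50 temperature readings around 20 degrees, then inject a spike
    random.seed(1)
    readings = [20 + random.gauss(0, 0.5) for _ in range(50)]
    readings[30] = 35.0

    mean = statistics.mean(readings)
    stdev = statistics.stdev(readings)

    # Flag readings more than 3 standard deviations from the mean
    anomalies = [(i, round(r, 2)) for i, r in enumerate(readings)
                 if abs(r - mean) > 3 * stdev]
    print(anomalies)  # the injected spike at index 30 is flagged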
Data Architecture:
Data architecture is commonly described at three levels of abstraction: conceptual, logical, and physical.
Conceptual model:
It is a business-level model that uses the Entity Relationship (ER) model to describe the entities, their attributes, and the relationships between them.
Logical model:
It is a model in which the data is represented in logical form, such as rows and columns of tables, classes, XML tags, and other DBMS constructs.
Physical model:
The physical model holds the actual database design, such as which type of database technology will be suitable for the architecture and how the data is physically stored; a small sketch contrasting the logical and physical levels appears below.
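A brief sketch of the logical and physical levels side by side (Python standard library only; the Customer entity, its attributes, and the SQLite choice are hypothetical):

    import sqlite3
    from dataclasses import dataclass

    # Logical model: the Customer entity from the conceptual ER diagram,
    # expressed as a class with typed attributes (rows and columns)
    @dataclass
    class Customer:
        customer_id: int
        name: str
        email: str

    # Physical model: a concrete technology choice (SQLite) and storage layout
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    c = Customer(1, "Ann", "ann@example.com")
    conn.execute("INSERT INTO customer VALUES (?, ?, ?)",
                 (c.customer_id, c.name, c.email))
    print(conn.execute("SELECT * FROM customer").fetchall())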
Data Architect:
A data architect is responsible for the design, creation, management, and deployment of the data architecture, and defines how data is to be stored and retrieved; other decisions are made by internal bodies.
Factors that influence Data Architecture:
A few influences that can affect data architecture are business requirements, business policies, the technology in use, business economics, and data processing needs.
Business requirements:
These include factors such as the expansion of the business, the performance of system access, data management, transaction management, making use of raw data by converting it into image files and records, and then storing it in data warehouses. Data warehouses are a key component for storing business transactions.
Business policies:
Policies are rules that describe how data is to be processed. These policies are made by internal organizational bodies and other government agencies.
Technology in use:
This includes drawing on examples of previously completed data architecture designs, as well as existing licensed software purchases and database technology already in use.
Business economics:
Economic factors such as business growth and loss, interest rates, loans, market conditions, and overall cost will also have an effect on the architecture design.
Data processing needs:
These include factors such as data mining, large volumes of continuous transactions, database management, and other data preprocessing needs.