Database & Big Data
MySQL
Advantages
• Ease of use. Developers can install MySQL in
minutes, and the database is easy to manage.
• Reliability. MySQL is one of the most mature and
widely used databases.
Disadvantages
• MySQL is inefficient for workloads that need to
store very large volumes of data.
• MySQL does not have good developing and
PostgreSQL
Advantages
• Reduced costs. As a true open source product, PostgreSQL does
not cost anything: there are no license fees.
Disadvantage
• Slower performance.
MariaDB
Advantage
• It is more scalable and offers a higher query speed.
Disadvantage
• It cannot support complex data types.
SQLite
Advantage
• SQLite source code is in the public domain and is free
for everyone to use for any purpose.
Disadvantage
• It is not designed for high-concurrency or large-scale
applications. It lacks advanced security features.
CouchDB
Advantage
• Scalability. The architectural design of CouchDB makes it
extremely adaptable when partitioning databases and scaling
data onto multiple nodes.
Disadvantages
• CouchDB uses a large amount of storage space for overhead, which
is a major disadvantage compared with other databases.
• Temporary views on huge datasets are very slow.
Big Data
• Big data is the term used to describe data collections that are so
enormous (terabytes or more) and complex (from sensor data to
social media data) that traditional data management software,
hardware, and analysis processes are incapable of dealing with them.
• Organizations collect and use data from a variety of sources, including
business applications, social media, sensors and controllers that are
part of the manufacturing process, systems that manage the physical
environment, and many more.
Sources of an organization’s useful data
• An organization has many sources of
useful data.
• Much of this data is unstructured and does
not fit neatly into traditional relational
database management systems.
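A minimal sketch of why such data does not fit a fixed relational schema. The three records, their field names, and the `facts` table are hypothetical, invented only for illustration. Records from different sources have different shapes, so a table designed for one source leaves the others mostly empty and has no column for the rest of their fields.

```python
import json
import sqlite3

# Hypothetical records from three sources; each one has a different shape.
records = [
    {"source": "business_app", "customer_id": 17, "amount": 42.5},
    {"source": "social_media", "text": "Love the new product!", "tags": ["brand", "review"]},
    {"source": "sensor", "readings": [21.4, 21.9, 22.3], "unit": "C"},
]

# A fixed relational schema designed around the business application only.
columns = ["source", "customer_id", "amount"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (source TEXT, customer_id INTEGER, amount REAL)")

leftover = []  # fields that have no column to go into
for rec in records:
    # Missing fields become NULL; fields outside the schema are not stored.
    conn.execute("INSERT INTO facts VALUES (?, ?, ?)", [rec.get(c) for c in columns])
    extra = {k: v for k, v in rec.items() if k not in columns}
    if extra:
        leftover.append(extra)

rows = conn.execute(
    "SELECT source, customer_id, amount FROM facts ORDER BY rowid"
).fetchall()
for row in rows:
    print(row)
print("Fields with no column:", json.dumps(leftover))
```

Only the business application record fills every column; the social media and sensor records are stored with NULL values and their actual content ends up in `leftover`.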
Characteristics of Big Data
• Volume refers to the amount
of data collected.
• Velocity refers to the speed
at which the data is
collected and analyzed.
• Variety refers to the types of
data collected.
Big Data Uses
Here are some examples of how organizations are employing big data to improve their day-to-day
operations, planning, and decision making.
• Retail organizations monitor social networks such as Facebook, Google, LinkedIn, Twitter, and
Yahoo to engage brand advocates, identify brand adversaries (and attempt to reverse their
negative opinions), and even enable passionate customers to sell their products.
• Advertising and marketing agencies track comments on social media to understand consumers’
responsiveness to ads, campaigns, and promotions.
• Hospitals analyze medical data and patient records to try to identify patients likely to need
readmission within a few months of discharge, with the goal of engaging with those patients in
the hope of preventing another expensive hospital stay.
Data Cleansing
• Data cleansing is the process of detecting and then
correcting or deleting incomplete, incorrect, inaccurate, or
irrelevant records that reside in a database.
• The goal of data cleansing is to improve the quality of the
data used in decision making.
Specific steps to perform Data cleansing
There are specific steps an organization might take to perform data
cleansing before adding this data to a data warehouse.
Data cleansing is a key part of the overall data management process
and one of the core components of data preparation work.
Steps:
• Identify and correct data by cross-checking it against a
validated data set.
• Clear formatting.
• Remove irrelevant data.
• Remove duplicates.
• Filter missing values.
• Delete outliers.
• Convert data types.
• Validate data.
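The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: the records, the field names, the `VALID_COUNTRIES` reference set, and the age range used as the outlier rule are all hypothetical.

```python
# Hypothetical raw records; field names and values are invented for illustration.
raw = [
    {"name": "  Alice ", "country": "us", "age": "34", "fax": "n/a"},
    {"name": "Bob", "country": "US", "age": "29"},
    {"name": "Bob", "country": "US", "age": "29"},      # duplicate
    {"name": "Carol", "country": "XX", "age": "41"},    # not in the validated set
    {"name": "Dave", "country": "GB", "age": ""},       # missing value
    {"name": "Erin", "country": "GB", "age": "410"},    # outlier
    {"name": "Frank", "country": "US", "age": "38"},
]
VALID_COUNTRIES = {"US", "GB"}  # stand-in for a validated reference data set


def clean(records, valid_countries, max_age=120):
    """Apply the cleansing steps to a list of dict records."""
    cleaned, seen = [], set()
    for rec in records:
        # Clear formatting: trim whitespace and normalize case.
        name = rec.get("name", "").strip()
        country = rec.get("country", "").strip().upper()
        age_text = rec.get("age", "").strip()
        # Filter missing values.
        if not name or not country or not age_text:
            continue
        # Convert data types: age arrives as text and is stored as an integer.
        try:
            age = int(age_text)
        except ValueError:
            continue
        # Delete outliers: a simple range rule (an assumption, not a standard).
        if not 0 <= age <= max_age:
            continue
        # Validate by cross-checking against the validated data set.
        if country not in valid_countries:
            continue
        # Remove duplicates.
        key = (name, country, age)
        if key in seen:
            continue
        seen.add(key)
        # Remove irrelevant data: only the three needed fields are kept.
        cleaned.append({"name": name, "country": country, "age": age})
    return cleaned


result = clean(raw, VALID_COUNTRIES)
for row in result:
    print(row)
```

Of the seven input records, only Alice, Bob, and Frank remain; the other records are dropped by the validation, missing-value, outlier, and duplicate checks.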
Data cleansing in an organization
• Inspection and profiling.
• Cleaning.
• Verification.
• Reporting.
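A rough sketch of how these four phases could fit together. The function names `profile_rows` and `clean_rows`, the sample rows, and the choice of "missing required fields" and "exact duplicates" as the only quality checks are assumptions made for illustration.

```python
def profile_rows(rows, required):
    """Inspection and profiling: count missing required fields and exact duplicates."""
    missing = sum(1 for r in rows if any(r.get(f) in (None, "") for f in required))
    distinct = {tuple(sorted(r.items())) for r in rows}
    return {"rows": len(rows), "missing": missing, "duplicates": len(rows) - len(distinct)}


def clean_rows(rows, required):
    """Cleaning: drop rows with missing required fields and repeated rows."""
    out, seen = [], set()
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen or any(r.get(f) in (None, "") for f in required):
            continue
        seen.add(key)
        out.append(r)
    return out


# Hypothetical input data.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "a@example.com"},
    {"id": 3, "email": "c@example.com"},
]
required = ["id", "email"]

before = profile_rows(rows, required)    # inspection and profiling
cleaned = clean_rows(rows, required)     # cleaning
after = profile_rows(cleaned, required)  # verification: profile the result again
assert after["missing"] == 0 and after["duplicates"] == 0

# Reporting: summarize what changed.
print(f"rows: {before['rows']} -> {after['rows']}")
print(f"missing: {before['missing']} -> {after['missing']}")
print(f"duplicates: {before['duplicates']} -> {after['duplicates']}")
```

Running the same profile before and after cleaning gives the verification step something concrete to check and gives the report its numbers.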
Accuracy of Data
• The cost of performing data cleansing to achieve 100% database
accuracy can be prohibitively expensive.
• Accuracy can be measured with percent error, which is the
percentage difference between the sample's measured observation and
the true measure of the population. If the measurement is far from
the true value of the population, the percent error is high and the
accuracy is low, and vice versa.
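As a small worked example, percent error is commonly computed as |measured − true| / |true| × 100. The function name `percent_error` and the sample values below are chosen only for illustration.

```python
def percent_error(measured, true_value):
    """Return the percent error of a measurement relative to the true value."""
    if true_value == 0:
        raise ValueError("percent error is undefined when the true value is 0")
    return abs(measured - true_value) * 100 / abs(true_value)


# A measurement close to the true value: low percent error, high accuracy.
print(percent_error(98.0, 100.0))  # 2.0
# A measurement far from the true value: high percent error, low accuracy.
print(percent_error(50.0, 100.0))  # 50.0
```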
Concerns raised by performing data
cleansing and how to address them.
• One of the primary challenges you’ll encounter in a data cleansing
effort is resistance to change. Many organizations are accustomed to their
existing data processes and may be hesitant to disrupt them.
Employees might be resistant to adopting new tools and procedures,
fearing they will add complexity to their workflow. To overcome this
concern or challenge, it’s crucial to emphasize the benefits of data
cleansing and communicate the positive impact it will have on daily
operations. Provide training and support to help your team transition
smoothly and highlight how clean data can make their jobs more
efficient and effective.
• Data security is a paramount concern for any organization, and data
cleansing can raise legitimate security worries. Sharing and processing
data, even for the purpose of cleaning, can be perceived as a risk. To
address this challenge, implement robust data security measures.
Ensure the data is encrypted during transmission and storage, and
limit access to those who need it. Collaborate closely with your IT and
cybersecurity teams to establish a secure data cleaning process that
complies with relevant regulations and safeguards sensitive
information.
After all of these data cleansing steps are complete, the data is
added to the data warehouse.