Introduction To Data Management - Week 1 - 2024

Data management involves organizing data to optimize its use within policy and enable informed decision making. It includes creating naming standards, carefully recording details and decisions, backing up data in multiple locations, and planning for long-term storage. Both structured and unstructured data require management, with structured data being easier to analyze due to its predefined structure in databases, while unstructured data like text and media is less structured and more challenging to analyze at scale. Effective data management benefits organizations by improving decisions, credibility, operations and access to capital.

Uploaded by

Mjay
Copyright
© All Rights Reserved

Introduction to Data Management

1. Overview
Data management is the practice of collecting, keeping, and using data securely,
efficiently, and cost-effectively. The goal of data management is to help people,
organisations, and connected things optimise the use of data within the bounds of
policy and regulation so that they can make decisions and take actions that maximise
the benefit to the organisation.

Thus, data management is a way to organise and maintain your information. It involves
creating naming standards for consistency and easier location, carefully recording all
the details (including decisions made during the project), saving copies of your data in
more than one place to prevent loss, and making plans to store your data after project
completion.

1.1 Content Resources


Introduction to data management and data analytics
• Watch the YouTube video and discover what Data Management is and how it
can help a business.
https://www.youtube.com/watch?v=JHl_R0ajlZE

2. Data Management Overview


The core intention of data management is to assist people with the administrative side
of data, which allows for more time to focus on the research or task itself. The great
thing about data management is that it benefits everyone – from college students
organising their computer files to experienced researchers working with terabytes of
data. Wherever you are in the research process, data management is an important
part of conducting research.

Here are some ways that you can apply data management skills in your own life:
• Save your files in open formats that everyone can read, whether or not they have
access to particular software. Examples of open formats are plain text files (.txt),
PNG files (.png), and CSV files (.csv). Proprietary formats are the opposite: they
are tied to specific software and include Word (.doc or .docx) and Excel (.xlsx)
documents.
• Keep your computer organised by creating and sticking to a file naming
convention. This increases the chance of finding the correct file when you need
it.
• Document your research process, decisions, and changes. This allows you to
have a reference for questions and keep track of what still needs to be done. This
helps your future self remember what you did. If you do not write it down, you will
forget it!
• Write down your database and search entries (queries). This way you know what
search terms get you the best results.
• Back up your data regularly, in multiple places and on more than one media type.
Once data has been collected, it will need to be saved. Make sure the correct data
source and file naming convention are used.
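
As a minimal sketch of saving collected data in an open format, the following Python example writes records to a CSV file and reads them back using only the standard library. The function names and the record fields in the usage note are hypothetical and chosen only for illustration.

```python
import csv
from pathlib import Path


def save_records_as_csv(records, path):
    """Write a list of dictionaries to a CSV file, one row per record."""
    if not records:
        raise ValueError("no records to save")
    # The keys of the first record become the column headers.
    fieldnames = list(records[0].keys())
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    return Path(path)


def load_records_from_csv(path):
    """Read the CSV file back into a list of dictionaries (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, save_records_as_csv([{"respondent_id": "1", "age": "34"}], "survey.csv") produces a file that any spreadsheet program or text editor can open.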

3. Data Source Usage


3.1 Analytics and Operations
Most of our experience with data sources occurs in the context of analytics, but the
most common usage is automated operations. All IT systems that operate
automatically run continuous queries against databases. As people, we do not see these
systems, but they make up most data source instances.

When we think about data sources, we should remember that they include not only the
ones we see, but also virtually every system we do not see.

3.2 Financial vs. Nonfinancial


Potentially everything a business does has a financial impact. Lawyers' fees for
fighting a sexual harassment lawsuit will affect your bottom line, for example. If you
must settle or pay damages, the effect will be greater.
Even so, looking at examples of financial data and nonfinancial data shows that there
is a difference. Financial data examples include advertising costs, sales revenue,
employee compensation, and the value of assets. Examples of nonfinancial
information include environmental impact, your relationship with your vendors,
diversity in the workplace, and social responsibility. They may have financial impacts,
but it is impossible to quantify them purely by assigning them a dollar figure.

3.3 Nonfinancial Data and Investment


The focus of any business decision is usually profit and loss. How much will it cost
us? What are the potential rewards? What's the risk of loss? However, there are also
times when nonfinancial information is required for an investment decision.
• Does the company meet the requirements of current legislation on, say,
handling harassment or workplace bullying? If new legislation is in the wings,
will you still be complying?
• Does the company follow industry standards and best practices?
• Is staff morale high? Is the business one that can attract talented recruits?
• Does the local community see it as a friend or a despoiler?
• Are relationships with clients and suppliers good?

Nonfinancial data is also important for internal decision-making in a business. Cutting
employee benefits and bonuses might improve the bottom line in the short term, but
if it damages employee morale and loyalty, it will hurt in the long run.

3.3.1 Examples of Nonfinancial Information Benefits


Measuring whether sales revenue rises or falls between this quarter and the last is
simple. Measuring customer loyalty, employee commitment or environmental impact
takes more work, but it offers rewards:
• Improved management decisions.
• Better stakeholder confidence.
• Increased credibility within the community.
• A lower risk of problems.
• Improved operations.
• Greater access to capital, as you are seen as a safe, reliable investment.

4. Structured vs. Unstructured Data

Structured data consists of clearly defined data types with patterns that make them
easily searchable, while unstructured data ("everything else") consists of data that is
usually not as easily searchable, including formats like audio, video, and social media
postings.

The "versus" in unstructured data vs. structured data does not denote conflict between
the two. Customers select one or the other not based on their data structure, but on
the applications that use them: relational databases for structured data, and almost any
other type of application for unstructured data.

However, there is a growing tension between the ease of analysis of structured data
versus the more challenging analysis of unstructured data. Structured data analytics
is a mature process and technology. Unstructured data analytics is a nascent industry
with a lot of new investment in research and development, but it is not yet a mature
technology. The structured data vs. unstructured data issue within corporations is
deciding if they should invest in analytics for unstructured data, and if it is possible to
aggregate the two into better business intelligence.

4.1 What Is Structured Data?


Structured data usually resides in relational databases (RDBMS). Fields store
length-delineated data like phone numbers, identification numbers, or ZIP codes. Records
can also contain text strings of variable length, like names, and these remain simple to
search. Data may be human- or machine-generated, as long as it is created within an
RDBMS structure. This format is eminently searchable, both with human-generated
queries and via algorithms that use data types and field names, such as alphabetical
or numeric, currency, or date.

Common relational database applications with structured data include airline
reservation systems, inventory control, sales transactions, and ATM activity.
Structured Query Language (SQL) enables queries on this type of structured data
within relational databases.
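
To make the idea of querying structured data concrete, here is a small sketch that uses Python's built-in sqlite3 module with an in-memory database. The sales table, its columns, and the sample rows are invented for illustration; a real sales system would have its own schema.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
connection = sqlite3.connect(":memory:")

# Every column has a predefined name and type: this is what makes the data "structured".
connection.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, amount REAL, sold_on TEXT)"
)
connection.executemany(
    "INSERT INTO sales (product, amount, sold_on) VALUES (?, ?, ?)",
    [
        ("notebook", 45.0, "2024-02-01"),
        ("pen", 5.5, "2024-02-01"),
        ("notebook", 45.0, "2024-02-02"),
    ],
)


def total_sales_by_product(conn):
    """Return (product, total amount) pairs, sorted by product name."""
    rows = conn.execute(
        "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
    )
    return rows.fetchall()


totals = total_sales_by_product(connection)
print(totals)  # [('notebook', 90.0), ('pen', 5.5)]
```

Because the fields are predefined, a single SQL statement can group and total the records without any custom parsing.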

Some relational databases store or point to unstructured data, such as customer
relationship management (CRM) applications. The integration can be awkward at best
since memo fields do not lend themselves to traditional database queries. Still, most
of the CRM data is structured.

4.2 What Is Unstructured Data?


Unstructured data is essentially everything else. Unstructured data has an internal
structure but is not structured via predefined data models or schemas. It may be textual
or non-textual, and human- or machine-generated. It may also be stored within a
non-relational (NoSQL) database.
Typical human-generated unstructured data includes:
• Text files: Word processing, spreadsheets, presentations, emails, logs.
• Email: Email has some internal structure thanks to its metadata, and we
sometimes refer to it as semi-structured. However, its message field is
unstructured, and traditional analytics tools cannot parse it.
• Social Media: Data from Facebook, Twitter, LinkedIn.
• Website: YouTube, Instagram, photo-sharing sites.
• Mobile data: Text messages, locations.
• Communications: Chat, IM, phone recordings, collaboration software.
• Media: MP3, digital photos, audio, and video files.
• Business applications: MS Office documents, productivity applications.
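
The point about email being semi-structured can be sketched with Python's standard email package: the headers (metadata) parse into named fields, while the message body comes back as free text. The sample message and the split_email helper below are hypothetical.

```python
from email import message_from_string

# A made-up message used only for illustration.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Delivery delay
Date: Mon, 05 Feb 2024 09:30:00 +0200

Hi Bob, the shipment is running late again. Can we talk tomorrow?
"""


def split_email(raw_text):
    """Separate the structured headers from the unstructured body."""
    message = message_from_string(raw_text)
    # Headers behave like named fields and are easy to search or filter on.
    metadata = {
        "from": message["From"],
        "to": message["To"],
        "subject": message["Subject"],
        "date": message["Date"],
    }
    # The body is plain text with no predefined fields.
    body = message.get_payload()
    return metadata, body


metadata, body = split_email(RAW_EMAIL)
```

The metadata dictionary could be loaded into a database table directly, whereas the body would need text analysis before it yields anything comparable.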

Typical machine-generated unstructured data includes:


• Satellite imagery: Weather data, landforms, military movements.
• Scientific data: Oil and gas exploration, space exploration, seismic imagery,
atmospheric data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.

The most inclusive Big Data analysis makes use of both structured and unstructured
data.

4.3 Structured Vs. Unstructured Data: What’s The Difference?


Besides the obvious difference between storing in a relational database and storing
outside of one, the biggest difference between structured and unstructured data is the
ease of analysis. Mature analytics tools exist for structured data, but analytics tools
for mining unstructured data are nascent and developing.

Users can run simple content searches across textual unstructured data. However, its
lack of orderly internal structure defeats the purpose of traditional data mining tools,
and the enterprise gets little value from potentially valuable data sources like rich
media, network or weblogs, customer interactions, and social media data.

On top of this, there is simply much more unstructured data than structured.
Unstructured data makes up 80% or more of enterprise data and is growing at a
rate of 55% to 65% per year. Without the tools to analyse this massive data
category, organisations are leaving vast amounts of valuable data on the business
intelligence table.

Besides understanding the different types of data, you need to understand how to
save the data in a file format that is easily accessible. This is related to the naming
convention of the file.

5. What is a File Naming Convention?


A File Naming Convention (FNC) is a framework for naming your files in a way that
describes what they contain and how they relate to other files.

It is essential to establish an FNC before you begin collecting data to prevent a backlog
of unorganised files that could lead to misplaced or lost data. Deciding on an FNC within
a group is helpful for effective communication and consistency in your work. Consider
the important elements of your research to establish meaningful file names (date of
creation, version of the file, project title, etc.).

5.1 File Naming Conventions


When you are working, you may not always think about how you name your files. You
may end up with a folder that looks like this:
[Image not reproduced here.]
Source: https://xkcd.com/1459/

Thankfully, there are many ways to name your files. To start, you can choose from the
elements below that are meaningful to you and your project:

• Project lead's last name or initials
• File creator's last name or initials
• Project name/acronym
• Date file created/generated (in YYYY-MM-DD format)
• Version number (with leading zeroes)
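
The elements above can be combined into a file name programmatically. The sketch below shows one possible convention (project, creator initials, date, version, joined with underscores); the ordering, the separator, the three-digit version width, and the build_file_name helper are all assumptions rather than a prescribed standard.

```python
from datetime import date


def build_file_name(project, creator_initials, created, version, extension):
    """Join the chosen naming elements into a single, sortable file name."""
    parts = [
        project,
        creator_initials,
        created.isoformat(),   # ISO dates are always YYYY-MM-DD
        f"v{version:03d}",     # leading zeroes keep versions in sorted order
    ]
    return "_".join(parts) + "." + extension.lstrip(".")


name = build_file_name("WATERQ", "MJ", date(2024, 2, 5), 2, "csv")
print(name)  # WATERQ_MJ_2024-02-05_v002.csv
```

Putting the date in YYYY-MM-DD form and padding the version number means that an alphabetical sort of the folder is also a chronological sort.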

Now that you understand file naming conventions, let’s look at how documentation
works.

6. What is Documentation?
Documentation means capturing the work that you do in a way that enables others
to understand what you did so they can duplicate the process. To do this, your
documentation must include information about what was done, how it was
done, why it was done, when it was performed, where it was performed,
and who performed the work.

6.1 Documentation Formats
Documentation can take on a variety of formats, though all formats should be similar
in content. All forms of documentation must include basic information about the data
that allows for its correct interpretation and reuse by yourself in the future and other
researchers. Different fields of study may choose one format over another.
• README file - a file that contains critical information about data file(s), including
citation information, file organisation structure, variable definitions,
methodological information, code (if applicable), data collection information,
software/instruments used and versions, licensing information, etc. Often in .txt
file format.
• README tab - like a README file, except this is often created in connection
with a spreadsheet.
• Data Dictionary - a file that provides critical information about a data file by
describing the names, definitions, and attributes of the data elements. This is
often created for tabular data, though it can be made for all dataset formats.
• Codebook - a file that documents the layout and structure of a data file and
contains the response codes that are used to record survey responses and
other information. This is most often done in social science research.
• Commented Code - in-line comments in computer code that explain what the code
does and why, where this is not obvious from reading the code itself.
• Lab Notebooks - a detailed record of all activities done while conducting
research, including experimental materials and conditions, protocols, and
results. Includes both e-lab notebooks and physical notebooks.
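
As an illustration of the Data Dictionary format, the following sketch drafts a starting point from the header row of a CSV file. It only fills in what can be read mechanically (variable name, an example value, a count of missing values); the definitions still have to be written by a person. The function name and the output fields are hypothetical.

```python
import csv
import io


def draft_data_dictionary(csv_text):
    """Return one dictionary entry per column of the given CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        # Treat empty cells (and cells missing from short rows) as missing values.
        values = [row[name] for row in rows if row[name] not in ("", None)]
        entries.append(
            {
                "variable": name,
                "example": values[0] if values else "",
                "missing_values": len(rows) - len(values),
                "definition": "TODO: describe this variable",
            }
        )
    return entries


entries = draft_data_dictionary("site,ph\nA,7.1\nB,\n")
```

Each entry can then be completed with units, allowed values, and a plain-language definition before it is saved alongside the data file.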

After the data has been documented, one should ensure the data is secured/backed
up.

7. What is Data Security / Backup?


Security involves maintaining the integrity of the data on the storage system and
backups, as well as ensuring that sensitive or confidential data is managed in a way
that is compliant with university, state, and federal regulations, in addition to the
requirements of the funder.
Backups refer to the creation of additional copies of your data that can be used to
restore data if the original is damaged or deleted. The general rule of thumb is that
you should have three copies of your data:
Original + Local Copy + Remote Copy
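
The "Original + Local Copy + Remote Copy" rule can be sketched as a small script that copies a file to several locations and confirms each copy with a checksum. In this sketch the remote location is assumed to be reachable as an ordinary folder (for example a mounted network drive or a cloud-synced directory); the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(original, backup_dirs):
    """Copy the original into each backup folder and verify every copy."""
    original = Path(original)
    expected = sha256_of(original)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / original.name
        shutil.copy2(original, target)  # copy2 also preserves timestamps
        # A backup is only useful if it is identical to the original.
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies
```

Calling back_up("results.csv", ["local_backup", "remote_backup"]) would leave three verified copies in total, counting the original.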

7.1 Security and Backup Resources


Using Passwords Effectively
This guide from Privacy and Information Security provides a wealth of information
about what makes good passwords and how to best use passwords to protect your
accounts and information.

7.2 Backup Your Data


Use the 3-2-1 rule to keep your data safe: keep three copies of your data, on two
different types of storage media, with one copy stored off-site.
[Image not reproduced here.]

Cloud-based storage solutions such as Google Drive can also be used to back up
data. Once the data is secured and has been successfully backed up, data
preservation takes place.

8. What is Preservation?
Preservation means following procedures to keep data files for an extended period
by choosing durable formats, archiving files locally, and/or submitting data files to a
data repository.

8.1 Why should you preserve your data?


Data preservation refers to maintaining access to data and files over time. For data to
be preserved, at minimum, it must be stored in a secure location, stored across
multiple locations, and saved in file formats that will likely have the greatest utility in
the future:

[Image not reproduced here.]
Source: http://www.library.illinois.edu/sc/services/data_management/file_formats.html

Data preservation may seem like a lot of work, but there’s good news! A lot of
institutions run data repositories, which support the preservation and discovery of
research data. Find out if Wits has a data repository and check how researchers on
campus have preserved and shared their data.

In today’s digital economy, data is a kind of capital, an economic factor of production
in digital goods and services. Just as an automaker can’t manufacture a new model if
it lacks the necessary financial capital, it can’t make its cars autonomous if it lacks the
data to feed the onboard algorithms. This new role for data has implications for
competitive strategy as well as for the future of computing.

Given this central and mission-critical role of data, strong management practices and
a robust management system are essential for every organisation, regardless of size
or type.

9. Data Management Systems Today


Today’s organisations need a data management solution that provides an efficient way
to manage data across a diverse but unified data tier. Data management systems are
built on data management platforms and can include databases, data lakes and data
warehouses, big data management systems, data analytics, and more.

All these components work together as a “data utility” to deliver the data management
capabilities an organisation needs for its apps and the analytics and algorithms that
use the data originated by those apps. Although current tools help database
administrators (DBAs) automate many of the traditional management tasks, manual
intervention is still often required because of the size and complexity of most database
deployments. Whenever manual intervention is required, the chance of errors
increases. Reducing the need for manual data management is a key objective of a
new data management technology, the autonomous database.

10. Data Management Platforms


A data management platform is the foundational system for collecting and analysing
large volumes of data across an organisation. Commercial data platforms typically
include software tools for management, developed by the database vendor or by third-
party vendors. These data management solutions help IT teams and Database
administrators (DBAs) perform typical tasks such as:

• Identifying, alerting, diagnosing, and resolving faults in the database system
or underlying infrastructure
• Allocating database memory and storage resources
• Making changes in the database design
• Optimising responses to database queries for faster application performance

11. Data Management Challenges


Most of the challenges in data management today stem from the faster pace of
business and the increasing proliferation of data. The ever-expanding variety, velocity,
and volume of data available to organisations are pushing them to seek more-effective
management tools to keep up. Some of the top challenges organisations face include
the following:

• Lack of data insight: Data from an increasing number and variety of sources such as
sensors, smart devices, social media, and video cameras is being collected and
stored. But none of that data is useful if the organisation doesn’t know what data it
has, where it is, and how to use it. Data management solutions need scale and
performance to deliver meaningful insights promptly.
• Difficulty maintaining data-management performance levels: Organisations are
capturing, storing, and using more data all the time. To maintain peak response
times across this expanding tier, organisations need to continuously monitor the
type of questions the database is answering and change the indexes as the queries
change, without affecting performance.
• Challenges complying with changing data requirements: Compliance regulations are
complex and multijurisdictional, and they change constantly. Organisations need to
be able to easily review their data and identify anything that falls under new or
modified requirements. In particular, personally identifiable information (PII) must
be detected, tracked, and monitored for compliance with increasingly strict global
privacy regulations.
• Need to easily process and convert data: Collecting and identifying the data itself
doesn’t provide any value; the organisation needs to process it. If it takes a lot of
time and effort to convert the data into what is needed for analysis, that analysis
won’t happen. As a result, the potential value of that data is lost.
• The constant need to store data effectively: In the new world of data management,
organisations store data in multiple systems, including data warehouses and
unstructured data lakes that store any data in any format in a single repository. An
organisation’s data scientists need a way to quickly and easily transform data from
its original format into the shape, format, or model they need it to be in for a wide
array of analyses.
• Demand to continually optimise IT agility and costs: With the availability of cloud
data management systems, organisations can now choose whether to keep and
analyse data in on-premises environments, in the cloud, or in a hybrid mixture of
the two. IT organisations need to evaluate how closely their on-premises and cloud
environments match in order to maintain maximum IT agility and lower costs.
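
To give a feel for what detecting personally identifiable information (PII) can involve, here is a deliberately simple sketch that flags fields which look like email addresses or phone numbers. The two regular expressions are rough assumptions made for illustration: they will miss many real cases and can raise false alarms (a date such as 2024-02-05 also matches the phone pattern), so this is not a substitute for a proper compliance tool.

```python
import re

# Loose, illustrative patterns only; real PII detection needs far more care.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s-]{7,}\d")


def find_possible_pii(records):
    """Return (record index, field name, kind) for every suspicious value."""
    findings = []
    for index, record in enumerate(records):
        for field, value in record.items():
            text = str(value)
            if EMAIL_PATTERN.search(text):
                findings.append((index, field, "email"))
            if PHONE_PATTERN.search(text):
                findings.append((index, field, "phone"))
    return findings


sample = [
    {"note": "call 011 555 0100"},
    {"contact": "jane@example.com"},
    {"amount": "42"},
]
findings = find_possible_pii(sample)
```

A report like this tells the organisation where possible PII sits, which is the first step towards tracking and monitoring it.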

The increasingly popular cloud database platforms allow businesses to scale up or
down quickly and cost-effectively. Some are available as a service, allowing
organisations to save even more.

12. Data Management Best Practices


Addressing data management challenges requires a comprehensive, well-thought-out
set of best practices. Although specific best practices vary depending on the type of
data involved and the industry, the following best practices address the major data
management challenges organisations face today:

• Create a discovery layer to identify your data: A discovery layer on top of the
organisation’s data tier allows analysts and data scientists to search and browse for
datasets to make the data usable.
• Develop a data science environment to efficiently repurpose your data: A data
science environment automates as much of the data transformation work as
possible, streamlining the creation and evaluation of data models. A set of tools
that eliminates the need for the manual transformation of data can expedite the
hypothesizing and testing of new models.
• Use autonomous technology to maintain performance levels across your expanding
data tier: Autonomous data capabilities use AI and machine learning to continuously
monitor database queries and optimise indexes as the queries change. This allows
the database to maintain rapid response times and frees DBAs and data scientists
from time-consuming manual tasks.
• Use discovery to stay on top of compliance requirements: New tools use data
discovery to review data and identify the chains of connection that need to be
detected, tracked, and monitored for multijurisdictional compliance. As compliance
demands increase globally, this capability is going to be increasingly important to
risk and security officers.
• Ensure you are using a converged database: A converged database is a database
that has native support for all modern data types and the latest development
models built into one product. The best converged databases can run many kinds
of workloads, including graphs, IoT, blockchain, and machine learning.
• Ensure the database platform has the performance, scale, and availability to
support a business: The goal of bringing data together is to be able to analyse it to
make better, more timely decisions. A scalable, high-performance database platform
allows enterprises to rapidly analyse data from multiple sources using advanced
analytics and machine learning so they can make better business decisions.
• Use a common query layer to manage multiple and diverse forms of data storage:
New technologies are enabling data management repositories to work together,
making the differences between them disappear. A common query layer that spans
the many kinds of data storage enables data scientists, analysts, and applications
to access data without needing to know where it is stored and without needing to
manually transform it into a usable format.

13. Data Management Evolves


With data’s new role as business capital, organisations are discovering what digital
startups and disruptors already know: Data is a valuable asset for identifying trends,
making decisions, and taking action before project completion.

Businesses must embrace data management tools and best practices to increase
operational efficiency and craft smarter, more reliable growth strategies.

Data management is the process of gathering, storing, analysing, and sharing
data within a larger organisation. Prospects and customers create tons of valuable
information every day, but it’s estimated that only 32% of that data is used to a
company’s benefit.

The sheer amount of available data can make it difficult to know what to look for and
how to use it. Pouring it all into a big pile doesn’t magically reveal insights into
customer behaviours or improvements to existing business operations. A company
must develop a baseline strategy for data management to convert countless data
points into measurable, actionable insights and results.

Data management provides companies with a means to easily evaluate important
information in meaningful ways. Customers create new data points every time they
use the product or like a social media post.

Relevant data is created through several sources, including:


• Products
• Websites
• Marketing channels
• Customer relationship management (CRM) software
• Accounting and payment platforms

Companies that actively engage with their data better understand their customers than
those that don’t. Combining these sources of data allows companies to build more
complete customer profiles and make better-informed business decisions. The lack of
a defined data management strategy leads to a situation where executives, team
leaders, and even base-level employees make important decisions based on limited,
dated, or even inaccurate data.

14. Where is data management used?


Digital data is used by companies across verticals and of various sizes. Even small
companies have websites, product analytics, and more that can be collected and
sorted. This means data management practices are leveraged by companies as large
as Facebook and Amazon or as small as startups.

Companies with a vast volume of information at their disposal will find data
management particularly beneficial. As a rule, the more relevant data a business has
at its disposal, the more accurate its assessment of it will be. A chief reason that
Netflix’s “You might like…” recommendation system works so well is that it’s
informed by the data of millions of consumers. However, before Netflix was able to
leverage that data into an executable strategy, they had to collect, sort, and analyse it
first.

14.1 Real-world examples of data management

Chameleon
• Product success platform Chameleon used to manage their event tracking
manually through Google Sheets. This created a situation in which the resource
was constantly out-of-date and inaccurate. There was no way to verify whether
the information in the resource reflected the current state of the product. They
could no longer trust their data, making it of limited strategic value.
• Chameleon adopted the Amplitude-acquired tool Iteratively to assist in data
verification. Integrating Iteratively with their existing analytics stack allowed
them to build schemas and adopt naming conventions to help confirm and
validate events within their product. This greatly improved the trustworthiness
of their data. Chameleon was also able to create defined processes for data
handling, resulting in increased collaboration between teams.

Flipp
• Planning and shopping app Flipp initially adopted Amplitude to enhance the
level of personalisation in their marketing campaigns. They achieved their goal,
but the Flipp team discovered an additional benefit of using the data
management solution: data democratisation. Their growth marketing team was
able to access reliable data faster than ever before. This allowed them to react
to campaigns more quickly than if they had waited for another team to find and
send over the data.

Instacart
• Grocery delivery service Instacart struggled for a time with data efficiency.
Their data management infrastructure consisted of self-built tools and an
internal database. Getting the tools to speak to each other and respond to
requests was a frustrating process that required a great deal of time and effort.
Additionally, Instacart’s data volume had grown beyond the capacity of its
existing management system.
• Instacart adopted Amplitude to unite the data from these tools through a single
solution. Amplitude was also able to handle its growing data load with ease.
This vast infrastructure improvement allowed the Instacart team to focus on
product improvements instead of getting bogged down in the development and
maintenance of their tools.

15. Unlock the power of data management


Data management is critical for organising and making sense of a company’s vast
amounts of data. Once there is a data management process in place, a business can
use the data to understand key customer insights and turn them into actions that drive
conversion and retention.

16. Summary
In today’s digital economy, companies have access to more data than ever before.
This data creates a foundation of intelligence for important business decisions. To
ensure employees have the right data for decision-making, companies must invest in
data management solutions that improve visibility, reliability, security, and scalability.

Data management is the practice of collecting, organising, protecting, and storing an
organisation’s data so it can be analysed for business decisions. As organisations
create and consume data at unprecedented rates, data management solutions
become essential for making sense of the vast quantities of data. Today’s leading data
management software ensures that reliable, up-to-date data is always used to drive
decisions. The software helps with everything from data preparation to cataloging,
search, and governance, allowing people to quickly find the information they need for
analysis.
