Introduction To Data Management - Week 1 - 2024
1. Overview
Data management is the practice of collecting, keeping, and using data securely,
efficiently, and cost-effectively. The goal of data management is to help people,
organisations, and connected things optimise the use of data within the bounds of
policy and regulation so that they can make decisions and take actions that maximise
the benefit to the organisation.
Thus, data management is a way to organise and maintain your information: creating naming standards for consistency and easier location, carefully recording all details (including decisions made during the project), saving copies of your data in more than one place to prevent loss, and making plans to store your data after project completion.
Here are some ways that you can apply data management skills in your own life:
• Save your files in formats that everyone can access, whether or not they have access to particular software. Examples of open-access formats are plain text files (.txt), PNG files (.png), and CSV files (.csv). Proprietary formats are the opposite of open access and include file formats such as Word (.doc or .docx) and Excel (.xlsx) documents. A small sketch of saving data in open formats follows this list.
• Keep your computer organised by creating and sticking to a file naming
convention. This increases the chance of finding the correct file when you need
it.
• Document your research process, decisions, and changes. This allows you to
have a reference for questions and keep track of what still needs to be done. This
helps your future self remember what you did. If you do not write it down, you will
forget it!
• Write down your database searches and queries. This way you know which search terms give you the best results.
• Back up your data regularly, in multiple places and on more than one type of media.
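As a small illustration of the open-formats tip above, the sketch below saves the same made-up records as a CSV file and a plain-text note using only Python's standard library; the file names and columns are invented for the example.

```python
import csv

# A small, made-up table of observations for the example.
rows = [
    {"participant_id": "P001", "score": 42},
    {"participant_id": "P002", "score": 37},
]

# Save as CSV: an open, plain-text format that any tool can read.
with open("survey_scores.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["participant_id", "score"])
    writer.writeheader()
    writer.writerows(rows)

# Save the accompanying notes as plain text (.txt), another open format.
with open("survey_scores_notes.txt", "w", encoding="utf-8") as f:
    f.write("Scores collected during the Week 1 pilot; see the README for details.\n")
```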
Once data has been collected, it will need to be saved. Make sure the correct data
source and file naming convention are used.
When we think about data sources, we should remember that these are not only the systems we see, but virtually every system we do not see as well.
However, there is a growing tension between the ease of analysis of structured data
versus the more challenging analysis of unstructured data. Structured data analytics
is a mature process and technology. Unstructured data analytics is a nascent industry
with a lot of new investment in research and development, but it is not yet a mature
technology. The structured vs. unstructured data question within corporations is whether to invest in analytics for unstructured data, and whether it is possible to aggregate the two into better business intelligence.
The most inclusive Big Data analysis makes use of both structured and unstructured
data.
Users can run simple content searches across textual unstructured data. However, its lack of orderly internal structure defeats the purpose of traditional data mining tools, and the enterprise gets little value from potentially valuable data sources like rich media, network and web logs, customer interactions, and social media data.
On top of this, there is simply much more unstructured data than structured. Unstructured data makes up 80% or more of enterprise data and is growing at a rate of 55% to 65% per year. Without the tools to analyse this massive data category, organisations are leaving vast amounts of valuable data on the business intelligence table.
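To make the structured vs. unstructured contrast concrete, here is a minimal, hypothetical sketch: the structured records can be filtered with an exact query on named fields, while the free text only supports the kind of simple keyword search mentioned above.

```python
# Structured data: rows with named fields can be queried precisely.
orders = [
    {"customer": "A", "amount": 120.0, "region": "Gauteng"},
    {"customer": "B", "amount": 80.0, "region": "Western Cape"},
]
large_orders = [o for o in orders if o["amount"] > 100]  # exact, repeatable filter

# Unstructured data: free text only supports a rough keyword search.
support_emails = [
    "The delivery was late and the package was damaged.",
    "Great service, thank you!",
]
complaints = [msg for msg in support_emails
              if "late" in msg.lower() or "damaged" in msg.lower()]

print(large_orders)   # [{'customer': 'A', 'amount': 120.0, 'region': 'Gauteng'}]
print(complaints)     # ['The delivery was late and the package was damaged.']
```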
Besides understanding the different types of data, you need to understand how to
save the data in a file format that is easily accessible. This is related to the naming
convention of the file.
It is essential to establish a file naming convention (FNC) before you begin collecting data, to prevent a backlog of unorganised files that could lead to misplaced or lost data. Deciding on an FNC within a group is helpful for effective communication and consistency in your work. Consider
the important elements of your research to establish meaningful file names (date of
creation, version of the file, project title, etc.).
Thankfully, there are many ways to name your files. To start, combine the elements that are meaningful to you and your project, such as the project title, the date of creation, a short description of the contents, and a version number.
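As a minimal sketch (the element order and separators are assumptions, not a prescribed standard), a small helper like the one below can assemble those elements into a consistent file name.

```python
from datetime import date

def build_file_name(project: str, description: str, version: int,
                    extension: str = "csv") -> str:
    """Combine the project title, date of creation, a short description,
    and a version number into one consistent file name."""
    today = date.today().isoformat()                     # e.g. 2024-02-19
    safe_project = project.lower().replace(" ", "_")
    safe_description = description.lower().replace(" ", "_")
    return f"{safe_project}_{safe_description}_{today}_v{version:02d}.{extension}"

# Produces something like "week1_survey_raw_scores_2024-02-19_v01.csv",
# depending on today's date.
print(build_file_name("Week1 Survey", "raw scores", 1))
```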
Now that you understand file naming conventions, let’s look at how documentation
works.
6. What is Documentation?
Documentation means capturing the work that you do in a way that enables others
to understand what you did so they can duplicate the process. To do this, your
documentation must include information about what was done, how it was done, why it was done, when it was performed, where it was performed, and who performed the work.
6.1 Documentation Formats
Documentation can take on a variety of formats, though all formats should be similar in content. All forms of documentation must include basic information about the data that allows for its correct interpretation and reuse, both by your future self and by other researchers. Different fields of study may prefer one format over another.
• README file - a file that contains critical information about data file(s), including
citation information, file organisation structure, variable definitions,
methodological information, code (if applicable), data collection information,
software/instruments used and versions, licensing information, etc. Often in .txt
file format.
• README tab - like a README file, except this is often created in connection
with a spreadsheet.
• Data Dictionary - a file that provides critical information about a data file by describing the names, definitions, and attributes of the data elements. This is often created for tabular data, though it can be made for all dataset formats (see the sketch after this list).
• Codebook - a file that documents the layout and structure of a data file and
contains the response codes that are used to record survey responses and
other information. This is most often done in social science research.
• Commented Code - in-line comments in computer code that describe the code's function in ways that are not obvious from reading the code itself.
• Lab Notebooks - a detailed record of all activities done while conducting
research, including experimental materials and conditions, protocols, and
results. Includes both e-lab notebooks and physical notebooks.
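For instance, a minimal data dictionary for the hypothetical survey_scores.csv file from the earlier example could itself be written as a small CSV; the variable definitions and allowed values below are illustrative assumptions.

```python
import csv

# Each row documents one variable: its name, definition, type, and allowed values.
data_dictionary = [
    {"variable": "participant_id", "definition": "Anonymised participant code",
     "type": "text", "allowed_values": "P001-P999"},
    {"variable": "score", "definition": "Total questionnaire score",
     "type": "integer", "allowed_values": "0-60"},
]

with open("survey_scores_data_dictionary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["variable", "definition", "type", "allowed_values"])
    writer.writeheader()
    writer.writerows(data_dictionary)
```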
After the data has been documented, one should ensure the data is secured/backed
up.
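A minimal local back-up sketch is shown below; it assumes the data lives in a folder called data/ and the second copy goes to a folder called backup/ (both folder names are made up for the example).

```python
import hashlib
import shutil
from pathlib import Path

def back_up(source: Path, destination: Path) -> None:
    """Copy every file from source to destination and verify each copy
    by comparing SHA-256 checksums."""
    destination.mkdir(parents=True, exist_ok=True)
    for file in source.iterdir():
        if file.is_file():
            target = destination / file.name
            shutil.copy2(file, target)
            original = hashlib.sha256(file.read_bytes()).hexdigest()
            copy = hashlib.sha256(target.read_bytes()).hexdigest()
            assert original == copy, f"Backup of {file.name} is corrupted"

back_up(Path("data"), Path("backup"))
```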
Cloud-based storage solutions such as Google Drive can also be used to back up
data. Once the data is secured and has been successfully backed up, data
preservation takes place.
8. What is Preservation?
Preservation means following procedures to keep data files for an extended period
by choosing durable formats, archiving files locally, and/or submitting data files to a
data repository.
Source: http://www.library.illinois.edu/sc/services/data_management/file_formats.html
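As a minimal sketch of archiving files locally before submitting them to a repository (the file and archive names are assumptions carried over from the earlier examples), the snippet below bundles the data, its documentation, and a short manifest into one zip archive.

```python
import zipfile
from pathlib import Path

# Hypothetical files produced earlier in these notes.
files_to_preserve = [
    Path("survey_scores.csv"),
    Path("survey_scores_data_dictionary.csv"),
    Path("survey_scores_notes.txt"),
]

with zipfile.ZipFile("week1_survey_archive_v01.zip", "w", zipfile.ZIP_DEFLATED) as archive:
    for file in files_to_preserve:
        archive.write(file, arcname=file.name)
    # Include a simple manifest listing the archived files.
    archive.writestr("MANIFEST.txt", "\n".join(f.name for f in files_to_preserve))
```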
Data preservation may seem like a lot of work, but there’s good news! A lot of
institutions run data repositories, which support the preservation and discovery of
research data. Find out if Wits has a data repository and check how researchers on
campus have preserved and shared their data.
Given this central and mission-critical role of data, strong management practices and
a robust management system are essential for every organisation, regardless of size
or type.
The components of a data management system work together as a "data utility" to deliver the data management capabilities an organisation needs for its apps and for the analytics and algorithms that use the data generated by those apps. Although current tools help database
administrators (DBAs) automate many of the traditional management tasks, manual
intervention is still often required because of the size and complexity of most database
deployments. Whenever manual intervention is required, the chance of errors
increases. Reducing the need for manual data management is a key objective of a
new data management technology, the autonomous database.
The following practices help an organisation put these capabilities in place:
• Create a discovery layer to identify your data: A discovery layer on top of the organisation's data tier allows analysts and data scientists to search and browse for datasets to make the data usable.
• Develop a data science environment to efficiently repurpose your data: A data science environment automates as much of the data transformation work as possible, streamlining the creation and evaluation of data models. A set of tools that eliminates the need for the manual transformation of data can expedite the hypothesising and testing of new models.
• Use autonomous technology to maintain performance levels across your expanding data tier: Autonomous data capabilities use AI and machine learning to continuously monitor database queries and optimise indexes as the queries change. This allows the database to maintain rapid response times and frees DBAs and data scientists from time-consuming manual tasks.
• Use discovery to stay on top of compliance requirements: New tools use data discovery to review data and identify the chains of connection that need to be detected, tracked, and monitored for multijurisdictional compliance. As compliance demands increase globally, this capability is going to be increasingly important to risk and security officers.
• Ensure you are using a converged database: A converged database has native support for all modern data types and the latest development models built into one product. The best converged databases can run many kinds of workloads, including graphs, IoT, blockchain, and machine learning.
• Ensure the database platform has the performance, scale, and availability to support the business: The goal of bringing data together is to be able to analyse it to make better, more timely decisions. A scalable, high-performance database platform allows enterprises to rapidly analyse data from multiple sources using advanced analytics and machine learning so they can make better business decisions.
• Use a common query layer to manage multiple and diverse forms of data storage: New technologies are enabling data management repositories to work together, making the differences between them disappear. A common query layer that spans the many kinds of data storage enables data scientists, analysts, and applications to access data without needing to know where it is stored and without needing to manually transform it into a usable format (see the sketch after this list).
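As one illustration of the common-query-layer idea, the sketch below uses DuckDB (a tool chosen for this example, not one named in these notes) to run a single SQL query over a CSV file and a Parquet file without first loading either into a separate database; the file and column names are hypothetical.

```python
import duckdb  # pip install duckdb (pandas is needed for .df())

# DuckDB reads both files directly from disk, so the analyst does not need to
# know or care which storage format holds which portion of the data.
events = duckdb.sql("""
    SELECT user_id, event_name
    FROM 'events_2023.csv'
    UNION ALL
    SELECT user_id, event_name
    FROM 'events_2024.parquet'
""").df()

print(events.head())
```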
Businesses must embrace data management tools and best practices to increase
operational efficiency and craft smarter, more reliable growth strategies.
The sheer amount of available data can make it difficult to know what to look for and
how to use it. Pouring it all into a big pile doesn’t magically reveal insights into
customer behaviours or improvements to existing business operations. A company
must develop a baseline strategy for data management to convert countless data
points into measurable, actionable insights and results.
Companies that actively engage with their data understand their customers better than those that don't. Combining multiple sources of data allows companies to build more complete customer profiles and make better-informed business decisions. The lack of
a defined data management strategy leads to a situation where executives, team
leaders, and even base-level employees make important decisions based on limited,
dated, or even inaccurate data.
Companies with a vast volume of information at their disposal will find data management particularly beneficial. As a rule, the more relevant data a business has at its disposal, the more accurate its assessments will be. A chief reason that Netflix's "You might like…" recommendation system works so well is that it is informed by the data of millions of consumers. However, before Netflix was able to leverage that data into an executable strategy, it had to collect, sort, and analyse it first.
14.1 Real-world examples of data management
Chameleon
• Product success platform Chameleon used to manage their event tracking
manually through Google Sheets. This created a situation in which the resource
was constantly out-of-date and inaccurate. There was no way to verify whether
the information in the resource reflected the current state of the product. They
could no longer trust their data, making it of limited strategic value.
• Chameleon adopted the Amplitude-acquired tool Iteratively to assist in data
verification. Integrating Iteratively with their existing analytics stack allowed
them to build schemas and adopt naming conventions to help confirm and
validate events within their product. This greatly improved the trustworthiness
of their data. Chameleon was also able to create defined processes for data
handling, resulting in increased collaboration between teams.
Flipp
• Planning and shopping app Flipp initially adopted Amplitude to enhance the
level of personalisation in their marketing campaigns. They achieved their goal,
but the Flipp team discovered an additional benefit of using the data
management solution: data democratisation. Their growth marketing team was
able to access reliable data faster than ever before. This allowed them to react
to campaigns more quickly than if they had waited for another team to find and
send over the data.
Instacart
• Grocery delivery service Instacart struggled for a time with data efficiency.
Their data management infrastructure consisted of self-built tools and an
internal database. Getting the tools to speak to each other and respond to
requests was a frustrating process that required a great deal of time and effort.
Additionally, Instacart’s data volume had grown beyond the capacity of its
existing management system.
• Instacart adopted Amplitude to unite the data from these tools through a single
solution. Amplitude was also able to handle its growing data load with ease.
This vast infrastructure improvement allowed the Instacart team to focus on
product improvements instead of getting bogged down in the development and
maintenance of their tools.
16. Summary
In today’s digital economy, companies have access to more data than ever before.
This data creates a foundation of intelligence for important business decisions. To
ensure employees have the right data for decision-making, companies must invest in
data management solutions that improve visibility, reliability, security, and scalability.