DA(Unit 1)
UNIT-1
JAYATI BHARDWAJ
Assistant Professor, CSE
Course Outcomes
Syllabus as per University
Data Analytics
• Data analytics (DA) is the process of examining data sets in order to find trends and draw
conclusions about the information they contain.
• Data analytics is done with the aid of specialized systems and software.
• Data analytics predominantly refers to an assortment of applications, from basic business
intelligence (BI), reporting and online analytical processing (OLAP) to various forms of advanced
analytics.
• It's similar in nature to business analytics.
• Data analytics initiatives can help businesses increase revenue, improve operational efficiency,
optimize marketing campaigns and bolster customer service efforts. Analytics also enable
organizations to respond quickly to emerging market trends
Why Data Analytics?
Data Analytics Tools
Data Collection
• In the process of big data analysis, “Data collection” is the initial step
before starting to analyze the patterns or useful information in data.
• The data which is to be analyzed must be collected from different
valid sources.
• The data which is collected is known as raw data, which is not useful as it is; cleaning and utilizing that data for analysis forms information, and the information obtained is known as “knowledge”.
• The main goal of data collection is to collect information-rich data.
Data could be…
1. RDBMS: A relational database is a collection of tables, each of which is assigned a unique
name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of
tuples (records or rows). Each tuple in a relational table represents an object identified by a unique
key and described by a set of attribute values.
Data could be…
2. Data Warehouses: A data warehouse is a repository of information collected from
multiple sources, stored under a unified schema, and that usually resides at a single site. Data
warehouses are constructed via a process of data cleaning, data integration, data transformation, data
loading, and periodic data refreshing.
Data could be…
3. Transactional Databases: In general, a transactional database consists of a file where each
record represents a transaction. A transaction typically includes a unique transaction identity number (trans
ID) and a list of the items making up the transaction (such as items purchased in a store).
4. Sequence Databases: A sequence database stores sequences of ordered events, with or without a concrete notion of time. Examples include customer shopping sequences, Web click streams, and biological sequences.
5. Time-Series Databases: A time-series database stores sequences of values or events obtained over repeated measurements of time (e.g., hourly, daily, weekly). Examples include data collected from the stock exchange, inventory control, and the observation of natural phenomena (such as temperature and wind).
Data could be…
6. Spatial Databases and Spatiotemporal Databases
Spatial databases contain spatial-related information. Examples include geographic(map)
databases, very large-scale integration (VLSI) or computer-aided design databases, and
medical and satellite image databases.
A spatial database that stores spatial objects that change with time is called a
Spatiotemporal database, from which interesting information can be mined. For example,
we may be able to group the trends of moving objects and identify some strangely moving
vehicles, or distinguish a bioterrorist attack from a normal outbreak of the flu based on the
geographic spread of a disease with time.
Data could be…
7. Text Databases and Multimedia Databases
Text databases are databases that contain word descriptions for objects. These word descriptions are
usually not simple keywords but rather long sentences or paragraphs, such as product specifications,
error or bug reports, warning messages, summary reports, notes, or other documents.
Multimedia databases store image, audio, and video data. They are used in applications such as
picture content-based retrieval, voice-mail systems, video-on-demand systems, the World Wide Web,
and speech-based user interfaces that recognize spoken commands. Multimedia databases must
support large objects, because data objects such as video can require gigabytes of storage.
Data could be…
8. Heterogeneous Databases and Legacy Databases
A heterogeneous database consists of a set of interconnected, autonomous component databases. The
components communicate in order to exchange information and answer queries. Objects in one component
database may differ greatly from objects in other component databases, making it difficult to assimilate their
semantics into the overall heterogeneous database.
A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as
relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia
databases, or file systems
Data could be…
9. Data Streams
Many applications involve the generation and analysis of a new kind of data, called stream data, where
data flow in and out of an observation platform (or window) dynamically.
Such data streams have the following unique features: huge or possibly infinite volume, dynamically
changing, flowing in and out in a fixed order, allowing only one or a small number of scans, and
demanding fast (often real-time) response time.
Typical examples of data streams include various kinds of scientific and engineering data, time-series data,
and data produced in other dynamic environments, such as power supply, network traffic, stock exchange,
telecommunications, Web click streams, video surveillance, and weather or environment monitoring.
Data could be…
10. The World Wide Web
The World Wide Web and its associated distributed information services, where data objects are linked together
to facilitate interactive access. Users seeking information of interest traverse from one object via links to
another. Such systems provide ample opportunities and challenges for data mining. For example, understanding
user access patterns will not only help improve system design (by providing efficient access between highly
correlated objects), but also leads to better marketing decisions (e.g., by placing advertisements in frequently
visited documents, or by providing better customer/user classification and behavior analysis).
Data Collection
• Most of the data collected is of two types:
• Qualitative data: Data that is represented in a verbal or narrative format is qualitative data. A simple way to look at qualitative data is to think of it in the form of words. These types of data are collected through focus groups, surveys, interviews, open-ended questionnaires, and observations.
• Quantitative data: Quantitative data is data that is expressed in numerical terms, in which the
numeric values could be large or small. Numerical values may correspond to a specific category or
label. These types of data are collected through surveys and questionnaires, analytics tools, environmental sensors, and the manipulation of pre-existing quantitative data.
Nominal Data
• These are the set of values that don’t possess a natural ordering.
• Ex.-The color of a smart phone can be considered as a nominal data type as we can’t compare one
color with others. It is not possible to state that ‘Red’ is greater than ‘Blue’.
• The gender of a person is another one where we can’t differentiate between male, female, or others.
• Mobile phone categories whether it is midrange, budget segment, or premium smart phone is also
nominal data type.
• Nominal data types in statistics are not quantifiable and cannot be measured through numerical units. Nominal types of statistical data are valuable while conducting qualitative research, as they extend freedom of opinion to subjects.
Ordinal Data
• These types of values have a natural ordering while maintaining their class of values.
• If we consider the size of a clothing brand then we can easily sort them according to their name tag
in the order of small < medium < large.
• The grading system while marking candidates in a test can also be considered as an ordinal data
type where A+ is definitely better than B grade.
• These categories help us decide which encoding strategy can be applied to which type of data.
• Data encoding for qualitative data is important because machine learning models can’t handle these values directly; they need to be converted to numerical types, as the models are mathematical in nature.
• For the nominal data type, where there is no comparison among the categories, one-hot encoding can be applied (similar to binary encoding, and practical when the categories are few); for the ordinal data type, label encoding, a form of integer encoding, can be applied. A small sketch of both encodings follows.
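The sketch below is illustrative only and not from the slides: it uses a hypothetical toy DataFrame with a nominal "color" column and an ordinal "size" column to show one-hot and label encoding in Python with pandas.

# One-hot encoding for nominal data, label (integer) encoding for ordinal data.
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Blue", "Red", "Green"],       # nominal: no natural order
    "size":  ["small", "large", "medium", "small"]  # ordinal: small < medium < large
})

# One-hot encoding: each nominal category becomes its own 0/1 column.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: map ordinal categories to integers that preserve the natural order.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

print(pd.concat([df, one_hot], axis=1))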
Discrete Data
• Numerical values that are integers or whole numbers fall under this category.
• The number of speakers in the phone, the number of cameras, the number of cores in the processor, and the number of SIMs supported are some examples of the discrete data type.
• Discrete data types in statistics cannot be measured – they can only be counted, as the objects included in discrete data have a fixed value.
• The value can be represented in decimal, but it has to be whole.
• Discrete data is often identified through charts, including bar charts, pie charts, and tally charts.
Continuous Data
• Fractional numbers are considered continuous values.
• These can take the form of the operating frequency of the processors, the Android version of the phone, the Wi-Fi frequency, the temperature of the cores, and so on.
• Unlike discrete data, which has whole, fixed values, continuous data can be broken down into smaller pieces and can take any value.
• For example, volatile values such as temperature and the weight of a human can be included in the
continuous value.
• Continuous types of statistical data are represented using a graph that easily reflects value
fluctuation by the highs and lows of the line through a certain period of time.
Data Collection
Primary data:
The data which is Raw, original, and extracted directly from the official sources is known as primary
data. This type of data is collected directly by performing techniques such as questionnaires,
interviews, and surveys. The data collected must be according to the demand and requirements of the
target audience on which analysis is performed.
A few methods of collecting primary data:
1.Interview method
2.Survey method
3.Observation method
4.Experimental method: CRD- Completely Randomized design
RBD- Randomized Block Design
LSD – Latin Square Design
FD- Factorial design
Secondary data:
Secondary data is data which has already been collected and is reused for some valid purpose. This type of data is derived from previously recorded primary data, and it has two types of sources: internal and external.
1. Internal source: These types of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time consumed in obtaining data from internal sources is less.
2. External source: Data which can’t be found within the organization and is obtained through external third-party resources is external source data. The cost and time consumption is greater because such sources contain a huge amount of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
Secondary data:
3. Other sources:
• Sensor data: With the advancement of IoT devices, the sensors of these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
• Satellite data: Satellites collect a large volume of images and data, in terabytes on a daily basis, through surveillance cameras; this can be used to extract useful information.
• Web traffic: Due to fast and cheap internet access, data in many formats uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data about the most frequently searched keywords and queries.
Types of Data
Characteristics of Data
• Data quality is crucial – it assesses whether information can serve its purpose in a
particular context (such as data analysis).
• So, to determine the quality of a given set of information, there are data quality
characteristics of which one should be aware.
• There are five traits namely:
• Accuracy
• Completeness
• Reliability
• Relevance
• Timeliness
Characteristics of Data
• Accuracy: This data quality characteristic means that information is correct. Accuracy is a crucial data
quality characteristic because inaccurate information can cause significant problems with severe consequences.
• Completeness: “Completeness” refers to how comprehensive the information is. When looking at data
completeness, think about whether all of the data you need is available. Ex- You might need a customer’s first
and last name, but the middle initial may be optional.
• Reliability: Reliability means that a piece of information doesn’t contradict another piece of information
in a different source or system. Ex.- if a patient’s birthday is January 1, 1970 in one system, yet it’s June 13,
1973 in another, the information is unreliable. Reliability is a vital data quality characteristic. When pieces of
information contradict themselves, you can’t trust the data and this could result in damages.
Characteristics of Data
• Relevance: When you’re looking at data quality characteristics, relevance comes into play because
there has to be a good reason as to why you’re collecting this information in the first place. You
must consider whether you really need this information, or whether you’re collecting it just for the
sake of it. If you’re gathering irrelevant information, you’re wasting time as well as money. Your
analyses won’t be as valuable.
• Timeliness: Timeliness, as the name implies, refers to how up to date information is. If it was gathered in the past hour, then it’s timely – unless new information has come in that renders the previous information useless. Timeliness is an important data quality characteristic: out-of-date information costs companies time and money. A small sketch of completeness and timeliness checks follows.
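The sketch below is not part of the original slides: it checks completeness as the fraction of non-missing values and flags stale records; the DataFrame, column names, and the one-year freshness threshold are hypothetical.

# Simple data-quality checks for completeness and timeliness using pandas.
import pandas as pd

customers = pd.DataFrame({
    "first_name": ["Asha", "Ravi", None],
    "last_name":  ["Mehta", None, "Khan"],
    "updated_at": pd.to_datetime(["2024-01-10", "2023-06-01", "2024-02-20"]),
})

# Completeness: fraction of non-missing values per column.
completeness = 1 - customers.isna().mean()
print("Completeness per column:")
print(completeness)

# Timeliness: flag records not refreshed within the last 365 days.
stale = customers["updated_at"] < (pd.Timestamp.now() - pd.Timedelta(days=365))
print("Stale records:")
print(customers[stale])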
Introduction to Big Data
• Big data is a term that describes large, hard-to-manage volumes of data – both
structured and unstructured – that inundate businesses on a day-to-day basis.
• It is the data that contains greater variety, arriving in increasing volumes and with
more velocity. This is also known as the three Vs.
• Big data is larger, more complex data sets, especially from new data sources.
These data sets are so voluminous that traditional data processing software just
can’t manage them.
• But it’s not just the type or amount of data that’s important, it’s what organizations
do with the data that matters. Big data can be analyzed for insights that improve
decisions and give confidence for making strategic business moves.
Some Facts & Figures
Insights
Sources: people, machines, and organizations. Ubiquitous computing means more people are carrying data-generating devices (mobile phones with Facebook, GPS, cameras, etc.).
Introduction to Big Data
• 3 ‘V’s of Big Data – Variety, Velocity, and Volume.
a) Variety: Variety of Big Data refers to structured, unstructured, and semi-structured data
that is gathered from multiple sources. While in the past, data could only be collected
from spreadsheets and databases, today data comes in an array of forms such as emails,
PDFs, photos, videos, audio, and much more. Variety is one of the important
characteristics of big data.
b) Velocity: Velocity essentially refers to the speed at which data is being created in real-
time. In a broader perspective, it comprises the rate of change, the linking of incoming data
sets at varying speeds, and activity bursts.
Introduction to Big Data
c) Volume: It indicates the huge ‘volumes’ of data being generated on a daily basis from various sources like social media platforms, business processes, machines, networks, human interactions, etc. Such large amounts of data are stored in data warehouses. This concludes the three core characteristics of big data.
d) Veracity: It refers to inconsistencies and uncertainty in data: the available data can sometimes get messy, and its quality and accuracy are difficult to control. Big data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources. Example: data in bulk could create confusion, whereas too little data could convey only half or incomplete information.
e) Value: The bulk of Data having no Value is of no good to the company, unless you turn it into something useful.
Data in itself is of no use or importance but it needs to be converted into something valuable to extract Information.
Benefits of Big Data
• Cost Savings: Some tools of Big Data like Hadoop and Cloud-Based Analytics can bring cost advantages to
business when large amounts of data are to be stored and these tools also help in identifying more efficient ways of
doing business.
• Time Reductions: The high speed of tools like Hadoop and in-memory analytics can easily identify new sources of data, which helps businesses analyze data immediately and make quick decisions based on what they learn.
• Understand the market conditions: By analyzing big data you can get a better understanding of current market
conditions. For example, by analyzing customers’ purchasing behaviors, a company can find out the products that
are sold the most and produce products according to this trend. By this, it can get ahead of its competitors.
• Control online reputation: Big data tools can do sentiment analysis, so you can get feedback about who is saying what about your company. If you want to monitor and improve the online presence of your business, big data tools can help with all of this.
Benefits of Big Data
• Using Big Data Analytics to Boost Customer Acquisition and Retention: The customer is the most
important asset any business depends on. If a business is slow to learn what customers are looking for, then it
is very easy to begin offering poor quality products. In the end, loss of clientele will result, and this creates an
adverse overall effect on business success. The use of big data allows businesses to observe various customer
related patterns and trends. Observing customer behavior is important to trigger loyalty.
• Using Big Data Analytics to Solve Advertisers’ Problems and Offer Marketing Insights: Big data analytics can help change all business operations. This includes the ability to match customer expectations, change the company’s product line and, of course, ensure that the marketing campaigns are powerful.
• Using Big Data Analytics as a Driver of Innovation and Product Development: Another huge advantage of big data analytics is its role in driving innovation and product development.
• This information is available quickly and efficiently so that companies can be agile in crafting
plans to maintain their competitive advantage.
• Technologies such as business intelligence (BI) tools and systems help organizations take the
unstructured and structured data from multiple sources.
• Users (typically employees) input queries into these tools to understand business operations and
performance.
Big Data Analytics
• Big data analytics is important because it helps companies leverage their data to identify
opportunities for improvement and optimization.
• Across different business segments, increasing efficiency leads to overall more intelligent
operations, higher profits, and satisfied customers.
• Big data analytics helps companies reduce costs and develop better, customer-centric products and
services.
• Data analytics helps provide insights that improve the way our society functions. In health care, big
data analytics not only keeps track of and analyzes individual records, but plays a critical role in
measuring COVID-19 outcomes on a global scale. It informs health ministries within each nation’s
government on how to proceed with vaccinations and devises solutions for mitigating pandemic
outbreaks in the future.
Big Data Analytics
Why?
To make the right decisions for your business to succeed, you need the right data. So, it’s
important to have a data analytics strategy in place.
Such plans can help organizations:
• boost revenue
• cut costs
• improve efficiencies
• enhance marketing efforts
• strengthen customer focus and customer service
• respond quickly and effectively to market events and industry trends
• reduce risk
• gain a competitive edge
Harnessing Big Data
• OLTP (Online Transaction Processing) - DBMS
• OLAP (Online Analytical Processing) - Data warehouse
• RTAP (Real Time Analytical Processing) - Big Data Architecture & Technology
Traditional Model
Traditional Data Model
• To integrate data across mixed application environments, you need to get data from one data environment
(source) to another data environment (destination). Extract, Transform and Load (ETL) technologies have
been used to accomplish this in traditional data warehouse environments.
• ETL tools combine three important functions required to get data from one data environment and put it into
another data environment.
• Extract: Read data from the source database.
• Transform: Convert the format of the extracted data so that it conforms to the requirements of the target
database. (Transformation is done by using rules or merging data with other data.)
• Load: Write data to the target database
• Data warehouses provide business users with a way to consolidate information across disparate sources to
analyze and report on data relevant to their specific business focus. ETL tools are used to transform the data
into the format required by the data warehouse. The transformation is actually done in an intermediate
location before the data is loaded into the data warehouse.
• Many software vendors, including Oracle, Microsoft, IBM, Informatica, Talend, and Pentaho, provide traditional ETL software tools. A minimal sketch of the ETL pattern follows.
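The sketch below uses only the Python standard library; the CSV file name, its columns, and the target table are hypothetical, and real ETL tools add scheduling, validation, and error handling on top of this basic pattern.

# Extract-Transform-Load in miniature: CSV source -> SQLite target.
import csv
import sqlite3

# Extract: read rows from the source system (a CSV export in this sketch).
with open("sales_source.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: convert the extracted data to the format the target requires
# (normalize text, cast amounts to numbers).
transformed = [
    (row["order_id"], row["region"].strip().upper(), float(row["amount"]))
    for row in rows
]

# Load: write the transformed rows into the target database.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", transformed)
conn.commit()
conn.close()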
Big Data Model
• Data Storage: Data is stored in distributed file stores that can hold large files in a variety of formats. Large numbers of different format-based big files can also be stored in a data lake. This consists of the data that is managed for batch operations and is saved in file stores.
• Batch Processing: Each chunk of data is split into different categories using long-running jobs, which filter, aggregate, and otherwise prepare the data for analysis. These jobs typically read from sources, process the data, and deliver the output to new files. Multiple approaches to batch processing are employed, including Hive jobs, U-SQL jobs, Sqoop or Pig jobs, and custom MapReduce jobs written in Java, Scala, or other languages such as Python.
• Real-Time Message Ingestion: Unlike batch processing, real-time message ingestion caters to data at the moment it is generated, in a sequential and uniform fashion. In the simplest case, a data store receives all incoming messages and drops them into a folder for later processing. If message-based processing is required, however, message ingestion stores such as Apache Kafka, Apache Flume, or Event Hubs from Azure must be used; these make the delivery process more reliable and provide other message-queuing semantics. A minimal Kafka sketch follows below.
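The sketch below uses the kafka-python client as one possible choice; the broker address, topic name, and message payload are hypothetical, and a running Kafka broker is assumed.

# Producer side: an application pushes events into a topic as they are generated.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b'{"user": "u1", "page": "/home"}')
producer.flush()

# Consumer side: the ingestion layer reads messages in arrival order and hands
# them to downstream batch or stream processing.
consumer = KafkaConsumer("clickstream",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)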
Big Data Model
• Stream Processing: Real-time message ingestion and stream processing are different: ingestion captures the incoming data, often through a publish-subscribe store, whereas stream processing consumes the ingested data as windows or streams, transforms it, and writes it to a sink. Tools include Apache Spark, Flink, Storm, etc. (see the streaming sketch after this list).
• Analytics-Based Datastore: To analyze already processed data, analytical tools use a data store based on HBase or another NoSQL data warehouse technology. A Hive database can provide metadata abstraction over the data in the store and supports interactive use. NoSQL databases such as HBase, or query engines such as Spark SQL, are also available.
• Reporting and Analysis: The generated insights must then be presented, which is accomplished by reporting and analysis tools that use embedded technology to produce useful graphs, analyses, and insights that are beneficial to the business. Examples include Cognos, Hyperion, and others.
• Orchestration: Big-data solutions involve repetitive data-processing tasks that are organized into workflow chains, which transform the source data, move data between sources and sinks, and load it into stores. Sqoop, Oozie, Azure Data Factory, and others are just a few examples.
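The streaming sketch referenced above uses Spark Structured Streaming and the classic word-count example; the socket host and port are assumptions, PySpark must be installed, and a text source such as nc -lk 9999 must be running for output to appear.

# Read a stream of text lines, split into words, keep running counts, write to a sink.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Source: a stream of text lines from a socket.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Transformation over the stream: split lines into words and count them.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Sink: write the running counts to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()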
Big Data Process
Big Data Layers
• Big data sources layer: The data available for analysis will vary in origin and format: the format may be structured, unstructured, or semi-structured; the speed of data arrival and delivery will vary according to the source; the data collection mode may be direct or through data providers, in batch mode or in real time; and the data source may be external or within the organization.
• Data Storage layer: This layer acquires data from the data sources, converts it, and stores it in a format that
is compatible with data analytics tools. Governance policies and compliance regulations primarily decide the
suitable storage format for different types of data.
• Data Query Layer: This is the layer of the data architecture where active analytic processing takes place. It is a field where interactive queries are necessary, traditionally dominated by SQL-expert developers. Before Hadoop, storage was insufficient, which made analytics a long process: a new data source first went through a lengthy ETL process to become ready for storage, and only then was the data put into a database or data warehouse. Data ingestion and data analytics became the two essential steps that, together with a data ingestion framework, solved the problem of computing over such large amounts of data.
Big Data Layers
• Processing Layer: In the previous layer, we gathered the data from different sources and made it available to the rest of the pipeline. In this layer, the data is ready and only needs to be routed to different destinations; the focus is on the specialized data-pipeline processing system.
• Analysis layer: It extracts the data from the data storage layer (or directly from the data source) to
derive insights from the data.
• Visualization layer: This layer receives the output provided by the analysis layer and presents it to the relevant consumers. The consumers of the output may be business processes, humans, visualization applications, or services.
Reporting Vs Analysis
• Reports and analytics help businesses improve operational efficiency and
productivity, but in different ways.
• Reporting explains what is happening, while analytics helps identify why it is happening.
• Reporting summarizes and organizes data in easily digestible ways while analytics
enables questioning and exploring that data further. It provides invaluable insights
into trends and helps create strategies to help improve operations, customer
satisfaction, growth, and other business metrics.
• Analytics enables business users to cull out insights from data, spot trends, and
help make better decisions. Next-generation analytics takes advantage of emerging
technologies like AI, NLP, and machine learning to offer predictive insights based
on historical and real-time data.
Reporting Vs Analysis
Reporting
Examples
• Take the population census, for example. This is a technical document that transmits basic
information on how many and what kind of people live in a certain country. It can be
displayed in the text, or in a visual format, such as a graph or chart. But it is static
information that can be used to assess current conditions.
• Examples:
• Marketing teams gather data on customer behavior and habits to form business strategies around them. A company like
Starbucks keeps track of its customer base through its mobile app. The mobile app provides insight into consumer spending
and buying behaviors, and the data is used in predictive analysis to orient future decisions.
• Another aspect that companies improve by using data analytics is customer experience. CX is the engagement and
interaction of customers with businesses. For example, McDonald’s stores customer data through their mobile app. These
analytical efforts help them automatically send out promotions, discounts, and other updates.
Different Types of Data Analytics
Descriptive (business intelligence and data mining): This surface-level analysis is aimed at analyzing past data
through data aggregation and data mining.
• Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It looks at past performance and understands it by mining historical data to understand the causes of success or failure in the past.
• Almost all management reporting such as sales, marketing, operations, and finance uses this type of analysis.
• The descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into
groups.
• Unlike a predictive model that focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
• Common examples of descriptive analytics are company reports that provide historical reviews (a small pandas sketch follows this list), such as:
• Data Queries
• Reports
• Descriptive Statistics
• Data dashboard
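The sketch referenced above is illustrative only; the table and column names are hypothetical. It computes summary statistics and a grouped historical review in pandas, in the spirit of a simple management report.

# Descriptive statistics and a simple report-style aggregation.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "revenue": [1200.0, 950.0, 1430.0, 780.0, 1010.0],
})

# Count, mean, standard deviation, min/max, and quartiles of revenue.
print(sales["revenue"].describe())

# Historical review grouped by region, as in a typical sales report.
print(sales.groupby("region")["revenue"].agg(["sum", "mean"]))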
Different Types of Data Analytics
Diagnostic: This kind of analysis explores the “why”. For instance, diagnostic analysis can help in understanding the
reason behind a sudden drop in customers for a company.
• In this analysis, we generally use historical data over other data to answer any question or for the solution of any problem.
We try to find any dependency and pattern in the historical data of the particular problem.
• For example, companies go for this analysis because it gives great insight into a problem, and they also keep detailed information at their disposal; otherwise, data collection may have to be repeated for every individual problem, which would be very time-consuming.
Prescriptive: This kind of analysis goes beyond predicting future outcomes by also suggesting actions that benefit from the predictions and showing the decision maker the implication of each decision option.
• Prescriptive analytics not only anticipates what will happen and when it will happen but also why it will happen. Further, it can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk, and illustrate the implication of each decision option.
• For example, prescriptive analytics can benefit healthcare strategic planning by using analytics to leverage operational and usage data combined with data on external factors such as economic data, population demographics, etc.
Different Types of Data Analytics
Cognitive analytics: It is analytics with human-like intelligence. This can include understanding the context and
meaning of a sentence, or recognizing certain objects in an image given large amounts of information. Cognitive analytics
often uses artificial intelligence algorithms and machine learning, allowing a cognitive application to improve over time.
Cognitive analytics reveals certain patterns and connections that simple analytics cannot.
Key Roles Of Successful Analytics Projects
Each of these key roles is crucial in developing a successful analytics project:
• Business User
• Project Sponsor
• Project Manager
• Business Intelligence Analyst
• Database Administrator (DBA)
• Data Engineer
• Data Scientist
Key Roles Of Successful Analytics Projects
Business User :
• The business user is the one who understands the main domain area of the project and also ultimately benefits from the results.
• This user advises and consults the team working on the project about the value of the results obtained and how the outputs will be used in operations.
• A business manager, line manager, or deep subject matter expert in the project domain usually fulfills this role.
Project Sponsor :
• The project sponsor is the one responsible for initiating the project. The project sponsor provides the actual requirements for the project and presents the basic business issue.
• He or she generally provides the funds and measures the degree of value from the final output of the team working on the project.
• This person introduces the prime concern and shapes the desired output.
Key Roles Of Successful Analytics Projects
Project Manager :
• This person ensures that the key milestones and the purpose of the project are met on time and to the expected quality.
Data Engineer :
• The data engineer brings deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data intake into the analytic sandbox.
• The data engineer works jointly with the data scientist to help shape the data in the correct form for analysis.
Key Roles Of Successful Analytics Projects
Data Scientist :
• The data scientist provides subject matter expertise for analytical techniques and data modelling, and applies the correct analytical techniques to the given business issues.
• He or she ensures the overall analytical objectives are met.
• Data scientists outline and apply analytical methods to the data available for the concerned project.
Data Analytics Lifecycle
Data Analytics Lifecycle
Phase 1: Discovery -
• The data science team learns the business domain and researches the issue.
• Create context and gain understanding.
• Learn about the data sources that are needed and accessible to the project.
• The team comes up with an initial hypothesis, which can be later confirmed with evidence.
Phase 6: Operationalize -
• The team distributes the benefits of the project to a wider audience. It sets up a pilot project that will
deploy the work in a controlled manner prior to expanding the project to the entire enterprise of users.
• This technique allows the team to gain insight into the performance and constraints related to the model
within a production setting at a small scale and then make necessary adjustments before full
deployment.
• The team produces the last reports, presentations, and codes.
• Open-source or free tools such as WEKA, SQL, and Octave can be used.