Exploratory Data Analysis (EDA) - Types and Tools
Last Updated :
02 Aug, 2024
Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore data, and possibly formulate hypotheses that might cause new data collection and experiments. EDA focuses more narrowly on checking assumptions required for model fitting and hypothesis testing. It also checks while handling missing values and making transformations of variables as needed.
EDA builds a robust understanding of the data, and issues associated with either the info or process. It's a scientific approach to getting the story of the data.
TYPES OF EXPLORATORY DATA ANALYSIS:
- Univariate Non-graphical
- Multivariate Non-graphical
- Univariate graphical
- Multivariate graphical
1. Univariate Non-graphical: this is the simplest form of data analysis as during this we use just one variable to research the info. The standard goal of univariate non-graphical EDA is to know the underlying sample distribution/ data and make observations about the population. Outlier detection is additionally part of the analysis. The characteristics of population distribution include:
- Central tendency: The central tendency or location of distribution has got to do with typical or middle values. The commonly useful measures of central tendency are statistics called mean, median, and sometimes mode during which the foremost common is mean. For skewed distribution or when there's concern about outliers, the median may be preferred.
- Spread: Spread is an indicator of what proportion distant from the middle we are to seek out the find the info values. the quality deviation and variance are two useful measures of spread. The variance is that the mean of the square of the individual deviations and therefore the variance is the root of the variance
- Skewness and kurtosis: Two more useful univariates descriptors are the skewness and kurtosis of the distribution. Skewness is that the measure of asymmetry and kurtosis may be a more subtle measure of peakedness compared to a normal distribution
2. Multivariate Non-graphical: Multivariate non-graphical EDA technique is usually used to show the connection between two or more variables within the sort of either cross-tabulation or statistics.
- For categorical data, an extension of tabulation called cross-tabulation is extremely useful. For 2 variables, cross-tabulation is preferred by making a two-way table with column headings that match the amount of one-variable and row headings that match the amount of the opposite two variables, then filling the counts with all subjects that share an equivalent pair of levels.
- For each categorical variable and one quantitative variable, we create statistics for quantitative variables separately for every level of the specific variable then compare the statistics across the amount of categorical variable.
- Comparing the means is an off-the-cuff version of ANOVA and comparing medians may be a robust version of one-way ANOVA.
3. Univariate graphical: Non-graphical methods are quantitative and objective, they are not able to give the complete picture of the data; therefore, graphical methods are used more as they involve a degree of subjective analysis, also are required. Common sorts of univariate graphics are:
- Histogram: The foremost basic graph is a histogram, which may be a barplot during which each bar represents the frequency (count) or proportion (count/total count) of cases for a variety of values. Histograms are one of the simplest ways to quickly learn a lot about your data, including central tendency, spread, modality, shape and outliers.
- Stem-and-leaf plots: An easy substitute for a histogram may be stem-and-leaf plots. It shows all data values and therefore the shape of the distribution.
- Boxplots: Another very useful univariate graphical technique is that the boxplot. Boxplots are excellent at presenting information about central tendency and show robust measures of location and spread also as providing information about symmetry and outliers, although they will be misleading about aspects like multimodality. One among the simplest uses of boxplots is within the sort of side-by-side boxplots.
- Quantile-normal plots: The ultimate univariate graphical EDA technique is that the most intricate. it's called the quantile-normal or QN plot or more generally the quantile-quantile or QQ plot. it's wont to see how well a specific sample follows a specific theoretical distribution. It allows detection of non-normality and diagnosis of skewness and kurtosis
4. Multivariate graphical: Multivariate graphical data uses graphics to display relationships between two or more sets of knowledge. The sole one used commonly may be a grouped barplot with each group representing one level of 1 of the variables and every bar within a gaggle representing the amount of the opposite variable.
Other common sorts of multivariate graphics are:
- Scatterplot: For 2 quantitative variables, the essential graphical EDA technique is that the scatterplot , sohas one variable on the x-axis and one on the y-axis and therefore the point for every case in your dataset.
- Run chart: It's a line graph of data plotted over time.
- Heat map: It's a graphical representation of data where values are depicted by color.
- Multivariate chart: It's a graphical representation of the relationships between factors and response.
- Bubble chart: It's a data visualization that displays multiple circles (bubbles) in two-dimensional plot.
In a nutshell: You ought to always perform appropriate EDA before further analysis of your data. Perform whatever steps are necessary to become more conversant in your data, check for obvious mistakes, learn about variable distributions, and study about relationships between variables. EDA is not an exact science- It is very important are!
TOOLS REQUIRED FOR EXPLORATORY DATA ANALYSIS:
Some of the most common tools used to create an EDA are:
1. R: An open-source programming language and free software environment for statistical computing and graphics supported by the R foundation for statistical computing. The R language is widely used among statisticians in developing statistical observations and data analysis.
2. Python: An interpreted, object-oriented programming language with dynamic semantics. Its high level, built-in data structures, combined with dynamic binding, make it very attractive for rapid application development, also as to be used as a scripting or glue language to attach existing components together. Python and EDA are often used together to spot missing values in the data set, which is vital so you'll decide the way to handle missing values for machine learning.
Apart from these functions described above, EDA can also:
- Perform k-means clustering: Perform k-means clustering: it's an unsupervised learning algorithm where the info points are assigned to clusters, also referred to as k-groups, k-means clustering is usually utilized in market segmentation, image compression, and pattern recognition
- EDA is often utilized in predictive models like linear regression, where it's wont to predict outcomes.
- It is also utilized in univariate, bivariate, and multivariate visualization for summary statistics, establishing relationships between each variable, and understanding how different fields within the data interact with one another.
Similar Reads
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is a important step in data science as it visualizing data to understand its main features, find patterns and discover how different parts of the data are connected. In this article, we will see more about Exploratory Data Analysis (EDA).Why Exploratory Data Analysis
8 min read
Steps for Mastering Exploratory Data Analysis | EDA Steps
Mastering exploratory data analysis (EDA) is crucial for understanding your data, identifying patterns, and generating insights that can inform further analysis or decision-making. Data is the lifeblood of cutting-edge groups, and the capability to extract insights from records has become a crucial
15+ min read
Exploratory Data Analysis (EDA) with NumPy, Pandas, Matplotlib and Seaborn
Exploratory Data Analysis (EDA) serves as the foundation of any data science project. It is an essential step where data scientists investigate datasets to understand their structure, identify patterns, and uncover insights. Data preparation involves several steps, including cleaning, transforming,
4 min read
Exploratory Data Analysis in Python | Set 1
This article provides a comprehensive guide to performing Exploratory Data Analysis (EDA) using Python focusing on the use of NumPy and Pandas for data manipulation and analysis.Step 1: Setting Up EnvironmentTo perform EDA in Python we need to import several libraries that provide powerful tools for
4 min read
How to Write Data Analysis Reports
Reports on data analysis are essential for communicating data-driven insights to decision-makers, stakeholders, and other pertinent parties. These reports provide an organized format for providing conclusions, analyses, and suggestions derived from data set analysis. In this guide, we will learn how
9 min read
Exploratory Data Analysis in R Programming
Exploratory Data Analysis or EDA is a statistical approach or technique for analyzing data sets to summarize their important and main characteristics generally by using some visual aids. The EDA approach can be used to gather knowledge about the following aspects of data. Main characteristics or fea
11 min read
Types of Statistical Data Analysis
Statistics data analysis is a class of analysis that includes different techniques and methods for collection, data analysis, interpretation and presentation of data. Knowing the approach to data analysis is one of the crucial aspects that allows drawing a meaningful conclusion. In this article, the
7 min read
Data Analysis in Research: Types & Methods
Data analysis is a crucial step in the research process, transforming raw data into meaningful insights that drive informed decisions and advance knowledge. This article explores the various types and methods of data analysis in research, providing a comprehensive guide for researchers across discip
7 min read
Types of Data Analysis Techniques
Data analysis techniques have significantly evolved, providing a comprehensive toolkit for understanding, interpreting, and predicting data patterns. These methods are crucial in extracting actionable insights from data, enabling organizations to make informed decisions. Data Analysis TechniquesThis
7 min read
Attributes and its Types in Data Analytics
In this article, we are going to discuss attributes and their various types in data analytics. We will also cover attribute types with the help of examples for better understanding. So let's discuss them one by one. What are Attributes?Attributes are qualities or characteristics that describe an obj
4 min read