EDA Mini Report
A MINI-PROJECT REPORT
Submitted by
SELVARUBA S (953621104038)
BACHELOR OF ENGINEERING
IN
RAJAPALAYAM
NOVEMBER 2024
BONAFIDE CERTIFICATE
Certified that this mini-project report "Incorporating Model Checking into Exploratory
Visual Analysis: Trends, Patterns, and Consistency in Data Visualization" is the
bonafide work of Selvaruba S (953621104038), who carried out the mini-project work under
my supervision.
SIGNATURE                                      SIGNATURE
Mrs. S. Vijaya Amala Devi, B.E., M.E.          Dr. K. Vijayalakshmi, M.E., Ph.D.
ABSTRACT
Exploratory Visual Analysis (EVA) is an essential technique in data visualization that allows
analysts to investigate datasets interactively, form hypotheses, and uncover insights.
However, current EVA systems often lack rigorous mechanisms to ensure the validity of
findings, leading to potential biases or misinterpretations. This paper presents EVM
(Exploratory Visual Model-checking), a novel framework that incorporates model checking
techniques into EVA to improve analytical reliability. By integrating formal verification
methods, EVM enables users to validate hypotheses directly within the visualization
environment, detecting inconsistencies or unsupported patterns in real time. EVM supports a
wide range of data types and visual analytics tasks, enhancing the accuracy and dependability
of insights derived from visual analysis. Case studies and experimental results demonstrate
that EVM can help analysts mitigate cognitive biases, avoid common logical pitfalls, and
refine hypotheses more effectively than traditional EVA methods. This paper highlights
EVM’s potential to bridge the gap between exploratory data analysis and rigorous model
verification, setting a foundation for future advancements in reliable visual analytics.
EVM enhances traditional EVA by integrating automated model checking to validate user
hypotheses and uncover logical inconsistencies within the data. The system leverages formal
verification techniques to support hypothesis testing in real time, automatically flagging
possible issues when exploratory insights contradict the underlying data model.
TABLE OF CONTENTS
1. INTRODUCTION
2. SYSTEM SPECIFICATION
2.1 Hardware Specification
3. PACKAGES
3.1 Seaborn
3.2 Pandas
3.3 Matplotlib
4. APPENDIX
4.1 Source Code
4.2 Screenshots
5. CONCLUSION
6. FUTURE WORK
7. REFERENCES
1. INTRODUCTION
This research aims to demonstrate how model checking techniques, often used in formal
verification of software systems, can be effectively applied to the realm of data visualization.
By introducing a structured approach to verifying the consistency and validity of visual
representations, the study helps ensure that the insights drawn from the data are reliable and
not artifacts of flawed analysis. This is particularly crucial when working with large, complex
datasets, where simple visual tools might miss underlying errors or inconsistencies.
The primary focus of this paper is to explore how these two fields—model checking and
exploratory visual analysis—can be integrated to improve the robustness of data
interpretation, enabling more confident decision-making in data-driven domains such as
healthcare, finance, and engineering.
This integration begins with identifying and validating trends in data, such as recurring
movements or sustained directional changes. Automated detection methods, like moving
averages and regression analysis, allow for a systematic approach to observing trends, while
model checking ensures that these detected trends are consistent over time and free from
unexpected deviations. Similarly, patterns within the data, such as cyclic or seasonal
behaviors, can be more rigorously validated through model checking, with specifications in
place to verify that these cycles align with anticipated intervals. In complex datasets, simple
visualizations might miss these nuances, but model checking helps to confirm that the
identified patterns are reliable and representative of actual data behavior rather than random
variations or artifacts.
1.1 MODULES
1. Data Preprocessing and Cleaning Module
This module prepares the dataset for analysis by handling missing values, outliers, and
inconsistent data entries. It includes steps like imputing missing values, filtering noise, and
standardizing formats, ensuring that the data meets quality standards before further analysis.
Effective preprocessing minimizes the risk of distorted insights in visual analysis due to data
anomalies.
Key Steps:
Tools:
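As an illustration, the following is a minimal preprocessing sketch in Pandas. The file name and column names (economic_indicators.csv, date, inflation) are taken from the appendix; the particular imputation and outlier-filtering choices shown here are assumptions, not a fixed pipeline.

import pandas as pd

# Load the raw dataset (file name as used in the appendix)
df = pd.read_csv("economic_indicators.csv")

# Standardize formats: normalize column names and parse dates
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Impute missing numeric values with the column median (one possible strategy)
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Filter obvious noise: drop rows whose inflation lies far outside the interquartile range
q1, q3 = df["inflation"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["inflation"].between(q1 - 3 * iqr, q3 + 3 * iqr)]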
2. Trend Detection and Validation Module
This module employs statistical methods such as moving averages, linear and polynomial
regression, and time-series decomposition to identify trends within the dataset. The trends are
then validated through model checking, which confirms that they align with expected
behaviors and are free from unexpected deviations. This module ensures that trends are not
artifacts but are reliable indicators of the data's direction.
Key Steps:
Tools:
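A minimal sketch of this idea, assuming the appendix dataset and its inflation column; the window size and tolerance are illustrative choices rather than fixed parameters of the system.

import numpy as np
import pandas as pd

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"])
series = df.groupby("date")["inflation"].mean().sort_index()

# Moving average smooths short-term noise so the underlying trend is visible
rolling = series.rolling(window=12, min_periods=1).mean()

# Linear regression gives the overall direction (slope) of the trend
x = np.arange(len(series))
slope, intercept = np.polyfit(x, series.values, deg=1)

# Simple consistency check: the fitted line should stay close to the moving average
tolerance = 2 * series.std()
fitted = intercept + slope * x
consistent = bool(np.all(np.abs(fitted - rolling.values) <= tolerance))
print(f"slope = {slope:.4f}, trend consistent with moving average: {consistent}")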
3. Pattern Detection and Validation Module
Focused on identifying repeating patterns, such as cyclic or seasonal behaviors, this module
uses Fourier analysis and seasonal decomposition techniques. Model checking within this
module verifies the consistency and expected intervals of these patterns, ensuring they
represent true cyclic behavior rather than random variations.
Key Steps:
Fourier Analysis: Identifying periodic patterns in the data by transforming the data
into the frequency domain.
Seasonal Decomposition: Breaking the time-series data into components that
represent trend, seasonal, and residual variations.
Cyclic Pattern Identification: Using statistical methods or machine learning
algorithms to identify cyclic patterns (e.g., daily, monthly, yearly).
Tools:
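A minimal sketch of these two techniques, assuming the appendix dataset; the seasonal period of 12 is an assumption and would normally be chosen per dataset.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"])
series = df.groupby("date")["inflation"].mean().sort_index()

# Fourier analysis: the dominant frequency suggests the length of a repeating cycle
centered = series.values - series.values.mean()
spectrum = np.abs(np.fft.rfft(centered))
freqs = np.fft.rfftfreq(len(centered), d=1)          # d=1 means one sample per observation
dominant_period = 1 / freqs[np.argmax(spectrum[1:]) + 1]
print(f"Dominant period is about {dominant_period:.1f} observations")

# Seasonal decomposition: split the series into trend, seasonal and residual components
result = seasonal_decompose(series, model="additive", period=12)
seasonal_strength = result.seasonal.std() / series.std()
print(f"Relative seasonal strength: {seasonal_strength:.2f}")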
4. Consistency Verification Module
Key Steps:
Tools:
5. Anomaly Detection Module
This module uses model checking to detect anomalies, such as outliers or unexpected shifts,
that could affect the reliability of trends and patterns. Anomalies are flagged and either
corrected or documented to maintain the dataset's integrity. This is especially crucial in large
datasets where subtle inconsistencies could significantly impact interpretation.
Key Steps:
Tools:
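A minimal anomaly-flagging sketch, assuming the appendix dataset; the z-score threshold and the jump criterion are illustrative assumptions.

import pandas as pd

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"])

# Flag global outliers whose z-score exceeds a chosen threshold
col = "inflation"
z_scores = (df[col] - df[col].mean()) / df[col].std()
df["is_outlier"] = z_scores.abs() > 3

# Flag sudden shifts between consecutive observations within each country
df = df.sort_values(["country_id", "date"])
df["jump"] = df.groupby("country_id")[col].diff().abs()
df["is_shift"] = df["jump"] > 3 * df["jump"].std()

anomalies = df[df["is_outlier"] | df["is_shift"]]
print(f"{len(anomalies)} potential anomalies flagged for review")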
6. Interactive Visualization and Validation Module
Integrating model checking with visualization tools, this module provides an interface for
users to visually inspect data while receiving automated validation feedback on identified
trends and patterns. By linking model checking algorithms to visual feedback, this module
enhances the user's ability to explore data while ensuring that each insight meets predefined
accuracy standards.
Key Steps:
User Interface: Allows users to interact dynamically with visualizations (e.g., zoom,
filter, adjust).
Real-time Feedback: Model checking algorithms run in the background, providing
validation and feedback on the data’s accuracy and trends.
Tools:
7. Reporting and Documentation Module
After completing the visual analysis, this module generates reports documenting verified
trends, patterns, and any anomalies. Model checking outcomes are logged to provide
transparency on data consistency and accuracy, helping stakeholders trust the findings and
supporting further data-driven decision-making.
Key Steps:
Tools:
8. Hypothesis Generation and Testing Module
This module assists analysts in formulating and testing hypotheses based on the visualized
data. Using statistical testing methods, such as chi-square tests for categorical data or t-tests
for continuous data, the module helps to verify or refute assumptions about trends or patterns
in the data. By incorporating model checking, the module ensures that the results of
hypothesis testing are consistent and align with underlying data characteristics, reducing the
likelihood of drawing incorrect conclusions from visual insights.
Key Steps:
Tools:
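A minimal hypothesis-testing sketch with SciPy, assuming the appendix dataset; the cut-off date and the definition of "high inflation" are illustrative assumptions.

import pandas as pd
from scipy import stats

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"])

# t-test on continuous data: does mean inflation differ between two periods?
before = df.loc[df["date"] < "2020-01-01", "inflation"]
after = df.loc[df["date"] >= "2020-01-01", "inflation"]
t_stat, p_value = stats.ttest_ind(before, after, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# chi-square test on categorical data: is "high inflation" independent of country?
df["high_inflation"] = df["inflation"] > df["inflation"].median()
contingency = pd.crosstab(df["country_id"], df["high_inflation"])
chi2, p, dof, _ = stats.chi2_contingency(contingency)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")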
9. Interactive Exploration and Feedback Module
This module allows users to interact dynamically with visualizations, modifying variables,
selecting data ranges, and adjusting visualization parameters to explore data from multiple
perspectives. The module incorporates user feedback loops, where analysts can mark areas of
interest or concern, prompting model checking to assess specific sections of the data more
closely. This interactive element promotes a hands-on approach to exploration while ensuring
insights remain validated and accurate through real-time model checking feedback.
Key Steps:
Dynamic Exploration: Users can interact with visualizations, adjust data parameters,
and zoom into specific data points.
Feedback Loop: Users provide feedback (e.g., marking anomalies or areas of
interest) that triggers further model checking on those specific areas.
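A minimal, interface-free sketch of this feedback loop, assuming the appendix dataset; the helper function validate_selection and the selected date range are hypothetical and stand in for a real interactive selection.

import pandas as pd

def validate_selection(df, start, end, column="inflation", threshold=2.0):
    """Re-run a simple consistency check on the range a user has marked or zoomed into."""
    window = df[(df["date"] >= start) & (df["date"] <= end)]
    if window.empty:
        return "No data in the selected range."
    z = (window[column] - df[column].mean()) / df[column].std()
    flagged = int((z.abs() > threshold).sum())
    return f"{flagged} points in the selection deviate strongly from the overall distribution."

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"])
print(validate_selection(df, "2020-01-01", "2020-12-31"))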
1.2 PROJECT OBJECTIVE AND SCOPE
The primary objective of this project is to bridge model checking—a technique traditionally
used in software verification—with exploratory data visualization to establish a more robust
framework for data analysis. Exploratory data analysis (EDA) often relies on visual tools to
uncover trends, patterns, and relationships within datasets. However, without formal
validation, these insights may be subject to human error, interpretation bias, or artifacts from
incomplete data. This project proposes to enhance the reliability of visualizations by
embedding model checking within the EDA process, thus providing an added layer of
verification to validate the accuracy and consistency of visual insights.
Through this approach, the project will focus on incorporating model checking methods into
visualizations such as scatter plots and heatmaps. These types of visualizations are commonly
used to represent multidimensional relationships and data distributions. For example, scatter
plots can reveal correlations or clusters, while heatmaps can illustrate concentration patterns.
By applying model checking techniques, these visualizations are examined rigorously to
confirm that any detected patterns—such as a trend line or clustering—are consistent with the
dataset’s expected statistical or structural properties. This mitigates the risk of misinterpreting
random variations as genuine insights, ultimately providing a more dependable analysis
framework.
The scope of this project extends to key domains where data accuracy is critical, including
healthcare, finance, and engineering. For instance, in healthcare, consistent data visualization
can aid in identifying patterns in patient demographics or treatment outcomes, leading to
more effective healthcare planning. In finance, accurate visualization is vital for identifying
market trends and making informed investment decisions, while in engineering, it can support
pattern recognition in performance metrics or quality control. By ensuring the reliability of
visual data insights, the project contributes to more confident decision-making across these
fields.
Furthermore, this integration of model checking and visual analysis will enable analysts to
seamlessly transition between exploring data intuitively and verifying it rigorously. Users can
interact with visualizations, such as adjusting parameters or filtering data, while receiving
real-time feedback on the validity of the visual patterns they observe. This combination aims
to elevate the standard of data-driven insights by aligning intuitive analysis with formal,
systematic validation, creating a powerful toolset that enhances both the interpretability and
trustworthiness of data visualization in complex datasets.
This project is dedicated to advancing the rigor and dependability of exploratory visual
analysis by integrating model checking methodologies. Exploratory visual analysis,
commonly used for initial insights into datasets, typically relies on visualization techniques to
uncover trends, correlations, and patterns. However, while these visual insights can be
compelling, they often lack formal validation, which can lead to incorrect interpretations or
over-reliance on apparent trends that may not hold under closer scrutiny. By incorporating
model checking into this process, the project seeks to apply systematic verification to ensure
the reliability of identified trends and patterns, providing more accurate and dependable
insights.
Specifically, the project will focus on visualizations that are widely used in data exploration,
including scatter plots, heatmaps, and time-series plots. Scatter plots are frequently employed
to show correlations and distribution patterns among variables, heatmaps can reveal
concentration patterns and clustering tendencies, and time-series plots illustrate trends and
changes over time. Each of these visualization types will be analyzed through model
checking to confirm that the displayed relationships align with statistical expectations and
established properties of the data, reducing the likelihood of spurious findings due to
randomness or sampling anomalies.
A key aspect of this approach involves building statistical models or using known data
properties as benchmarks for the visualizations. For instance, trends identified in a time-series
plot might be checked against a moving average model to verify consistency over time. In
scatter plots, detected clusters or correlations will be verified to ensure they match with
known relationships or are not influenced by outlier effects. Heatmaps will similarly be
validated to confirm that density patterns represent genuine underlying data features and are
not skewed by anomalies or biased data segments. This validation process ensures that the
insights derived are truly reflective of the data, rather than artifacts of incomplete or biased
analysis.
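As a concrete illustration of checking a time-series trend against a moving-average benchmark (file and column names taken from the appendix; the window size and band width are assumptions):

import pandas as pd

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"])
series = df.groupby("date")["inflation"].mean().sort_index()

# Benchmark model: a moving average of the series
benchmark = series.rolling(window=12, min_periods=1).mean()

# Model check: observed values should stay within a band around the benchmark
band = 2 * (series - benchmark).std()
violations = series[(series - benchmark).abs() > band]

if violations.empty:
    print("Trend is consistent with the moving-average benchmark.")
else:
    print(f"{len(violations)} observations fall outside the expected band:")
    print(violations.head())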
In addition to ensuring statistical alignment, the project will incorporate demographic and
contextual elements—such as data variability, range, and potential outliers—to provide a
more comprehensive validation framework. By factoring in these elements, model checking
can account for deviations within expected limits, distinguishing genuine insights from
anomalies. For example, in datasets with significant demographic diversity or high
variability, visualizations might show more fluctuations; model checking will help identify
whether these variations fall within expected bounds or indicate an underlying trend.
Non-functional Requirements
1. Usability
o Interface design should be intuitive, with clear navigation and user guidance,
including tooltips and onboarding tutorials.
2. Reliability
3. Performance
o Data preprocessing and validation should occur within minimal latency, even
for large datasets.
4. Scalability
5. Compatibility
o Compatible with Windows, macOS, and Linux for a wide range of users.
6. Data Security
o Data processing and analysis should follow strict data integrity protocols.
8. Maintainability
10. Extensibility
o The system should allow users to customize their visualization and validation
settings, such as specifying threshold parameters for trend and anomaly
detection, choosing color schemes, and defining the frequency of validation
checks. This personalization ensures that users can tailor the framework to best
suit their data needs and analytical preferences, enhancing usability across
different user profiles.
2. SYSTEM SPECIFICATION
OS : Windows, macOS, or Linux (any system with a modern web browser)
Language : Python
Development Environment : Google Colab (cloud-based notebook environment)
Cloud Storage/Backup:
Google Drive or other cloud storage solutions for file management and project
collaboration.
Web Browser:
A modern web browser (e.g., Chrome, Firefox) is essential for accessing Google
Colab, cloud storage, and other online tools.
3. PACKAGES
3.1 SEABORN
Seaborn is a Python data visualization library built on top of Matplotlib. It provides a
high-level interface for drawing attractive and informative statistical graphics, such as the
histograms and line plots used in this project.
3.2 PANDAS
Pandas allows us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
Pandas can give you answers about the data, such as the minimum, maximum, and average
values of a column, or whether there is a correlation between two or more columns.
Pandas is also able to delete rows that are not relevant, or that contain wrong values such as
empty or NULL values. This is called cleaning the data.
Import Pandas
Once Pandas is installed, import it in your applications by adding the import keyword:
import pandas
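Pandas is usually imported under the pd alias, as in the appendix code. A small cleaning example with illustrative data:

import pandas as pd   # pd is the conventional alias

# Illustrative data containing empty (NULL) values
df = pd.DataFrame({"product": ["A", "B", None, "D"],
                   "sales": [120, None, 150, 80]})

df = df.dropna()   # remove rows that contain empty/NULL values
print(df["sales"].min(), df["sales"].max(), df["sales"].mean())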
3.3 MATPLOTLIB
A Python matplotlib script is structured so that a few lines of code are all that is required in
most instances to generate a visual data plot.
The pyplot API has a convenient MATLAB-style stateful interface. In fact, matplotlib was
originally written as an open-source alternative to MATLAB. The OO API and its interface
are more customizable and powerful than pyplot, but considered more difficult to use. As a
result, the pyplot interface is more commonly used, and is referred to by default in this
report.
17
Understanding matplotlib's pyplot API is key to understanding how to work with plots.
Installing Matplotlib:
Matplotlib can be installed from the Python Package Index using pip:
pip install matplotlib
A bar plot or bar chart is a graph that represents categories of data with rectangular bars
whose lengths or heights are proportional to the values they represent. The bars
can be plotted horizontally or vertically. A bar chart describes the comparisons between
discrete categories. One axis of the plot represents the specific categories being
compared, while the other axis represents the measured values corresponding to those
categories.
The matplotlib API in Python provides the bar() function, which can be used in the MATLAB
style or through the object-oriented API. The syntax of the bar() function used with the
axes is as follows: plt.bar(x, height, width, bottom, align).
EXAMPLE:
import matplotlib.pyplot as plt

products = ['A', 'B', 'C', 'D']   # illustrative category labels
sales = [120, 95, 150, 80]        # illustrative values
plt.bar(products, sales)
plt.xlabel('Products')
plt.ylabel('Sales')
plt.title('Sales by Product')
plt.show()
Output:
19
A histogram represents the distribution of numerical data. The range of values is divided
into intervals (bins), and the histogram counts how many values fall into each interval.
Example: If you have a dataset with values ranging from 0 to 100, you might divide it
into bins of size 10, resulting in bins for values 0-10, 10-20, 20-30, and so on.
Matplotlib allows you to specify the number of bins or the specific bin edges when
creating the histogram.
EXAMPLE:
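A minimal sketch using illustrative random values between 0 and 100, with bin edges of size 10 as described above:

import numpy as np
import matplotlib.pyplot as plt

values = np.random.default_rng(0).integers(0, 101, size=200)   # illustrative data in 0-100
plt.hist(values, bins=range(0, 101, 10), edgecolor="black")    # bins: 0-10, 10-20, ..., 90-100
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram with bins of size 10")
plt.show()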
OUTPUT:
FIGURE 2: HISTOGRAM
4. APPENDIX
4.1 SOURCE CODE
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("economic_indicators.csv")
# Inflation Distribution
# (subplot layout assumed; additional subplot code from the original was not preserved)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(df['inflation'], kde=True, ax=axes[0], color="blue")
axes[0].set_title("Distribution of Inflation")
plt.tight_layout()
plt.show()
# GDP growth over time (plot call reconstructed; column name assumed from Figure 5)
plt.figure(figsize=(14, 8))
sns.lineplot(data=df, x="date", y="gdp_growth")
plt.xlabel("Date")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Daily mean inflation (plot call reconstructed from context)
daily_mean_inflation = df.groupby("date")["inflation"].mean().reset_index()
plt.figure(figsize=(12, 6))
sns.lineplot(data=daily_mean_inflation, x="date", y="inflation")
plt.xlabel("Date")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Rolling average of the unemployment rate per country
# (window size and plot call reconstructed; the original values were not preserved)
df['unemployment_rate_rolling_avg'] = df.groupby('country_id')['unemployment_rate'].transform(
    lambda s: s.rolling(window=7, min_periods=1).mean())
plt.figure(figsize=(14, 8))
sns.lineplot(data=df, x="date", y="unemployment_rate_rolling_avg")
plt.xlabel("Date")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
plt.figure(figsize=(14, 8))
sns.lineplot(data=df, x="Year", y="Consumer Confidence Index", marker="o",
color="brown")
plt.xlabel("Year")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
4.2 SCREENSHOTS
FIGURE 3: DATASET IMPORT
FIGURE 4: DISTRIBUTION
FIGURE 5: GDP GROWTH
FIGURE 7: UNEMPLOYMENT ROLLING AVERAGE
FIGURE 8: INFLATION ROLLING AVERAGE
FIGURE 10: DATASET (ECONOMIC_INDICATORS)
5. CONCLUSION
The efficacy of EVM has been demonstrated through case studies, which reveal its utility in
complex, high-stakes environments such as healthcare, finance, and engineering. In these
fields, where data-driven decisions can have substantial impacts, EVM’s ability to detect
inconsistencies and refine hypotheses proves invaluable. Analysts can be more confident that
their conclusions are not merely plausible but are verified against objective criteria,
minimizing the risk of costly misjudgments. This integration of formal verification into
exploratory analysis is particularly beneficial in today’s data-driven landscape, where
decision-makers require accurate and reliable insights to navigate complex challenges.
Looking forward, the future development of EVM will focus on optimizing its scalability to
handle large, real-time datasets without compromising performance. Additionally, further
integration with popular visualization tools is essential to make EVM widely accessible to
data analysts across various domains. By making EVM compatible with existing EVA tools
and scalable for large data environments, this framework has the potential to become a
cornerstone in data analysis workflows.
6. FUTURE WORK
1. Scalability and Performance Optimization: One of the main challenges for EVM is
ensuring its efficiency when working with large, complex datasets. Future work
should focus on optimizing the underlying model checking algorithms to handle big
data more effectively. This includes developing more efficient algorithms for real-
time validation, parallelization techniques, and approaches that reduce the
computational overhead of model checking, particularly in high-dimensional or
streaming data scenarios.
2. Integration with Existing Data Visualization Tools: For EVM to gain broader
adoption, it is essential to seamlessly integrate with popular visualization tools and
platforms, such as Tableau, Power BI, or D3.js. Future efforts could focus on
developing plug-ins or APIs that allow EVM to work with these existing tools without
requiring significant changes to current workflows. This would make it easier for
analysts to adopt EVM without needing to learn entirely new systems.
3. Extending Support for Diverse Data Types: While EVM is designed to handle a
wide range of data types, there is room to expand its capabilities. Future work should
focus on enhancing EVM's ability to support specialized data types, such as temporal,
spatial, or graph data, and integrate with emerging data formats. Improved support for
different data modalities will make EVM more versatile across various domains,
including healthcare, geospatial analysis, and social network analysis.
4. Improved User Interface and Experience: While EVM offers powerful verification
capabilities, the user interface (UI) could be enhanced to make the framework more
accessible to a broader audience. Future developments could focus on designing
intuitive, user-friendly interfaces that allow users—particularly those without formal
training in model checking—to easily interpret validation results and incorporate them
into their analysis process.
5. Adaptive Model Checking for Dynamic Data: Real-time data streams are
increasingly common in many fields, from financial markets to IoT systems. Future
work could explore how EVM can be adapted to handle dynamic, continuously
updating data in real time.
7. REFERENCES
WEBSITES:
https://arxiv.org/abs/
https://www.researchgate.net/publication/374941893_EVM_Incorporating_Model_Checking_into_Exploratory_Visual_Analysis
https://idl.uw.edu/papers/evm