A THEORETICAL
GUIDE OF
UNIVERSAL
DATA
ANALYTICS
ALGORITHM
PRATYUSH PURI
Let’s Discuss More…
Contents
• Introduction
• Chapter 1: Importing Data
• Chapter 2: Inspecting Data
• Chapter 3: Data Cleaning
• Chapter 4: Exploratory Data Analysis (EDA)
• Chapter 5: Data Visualization
• Chapter 6: Feature Engineering
• Chapter 7: Outlier Detection
• Chapter 8: Data Splitting
• Chapter 9: Model Selection
• Chapter 10: Insights & Reporting
• Chapter 11: Save Results
Chapter 1
Importing Data
a. First, open your Python environment (Jupyter Notebook, VS Code, or any IDE).
b. Import the essential libraries for data analysis:
i. Pandas (for data handling)
ii. NumPy (for numerical operations)
iii. Matplotlib and Seaborn (for visualization)
iv. If you encounter warnings, set up your environment to ignore them.
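Put together, the setup above might look like this minimal sketch (suppressing warnings is optional):
import pandas as pd                # data handling
import numpy as np                 # numerical operations
import matplotlib.pyplot as plt    # visualization
import seaborn as sns              # statistical visualization
import warnings
warnings.filterwarnings('ignore')  # hide non-critical warnings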
a. Understand the data source: CSV, Excel, JSON, SQL database, or any other format.
b. Keep the file path or database connection details ready.
a. CSV File:
i. This is the most common format; use pd.read_csv() to import.
ii. Write the file path correctly (use forward slashes or escaped backslashes on Windows).
iii. If the file does not have a header, use header=None.
iv. For large files, use the chunksize parameter to import data in chunks.
v. For encoding issues, use the encoding parameter (e.g., encoding='utf-8').
vi. To treat missing values specifically, use the na_values parameter.
b. Excel File:
i. Use pd.read_excel(), and you can specify the sheet name.
c. JSON File:
i. Import using pd.read_json().
d. SQL Database:
i. Create a connection using pyodbc or sqlalchemy, then use
pd.read_sql_query().
e. Other Formats (SAS, Stata, etc.):
i. Pandas provides functions like read_sas(), read_stata() for these formats.
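The import calls described above might look like the following sketch (file paths, sheet name, and the connection string are placeholders):
# CSV with explicit header, encoding, and missing-value markers
df = pd.read_csv('data/sales.csv', header=0, encoding='utf-8', na_values=['?', 'N/A'])
# Excel: pick a specific sheet
df_excel = pd.read_excel('data/sales.xlsx', sheet_name='2024')
# JSON
df_json = pd.read_json('data/sales.json')
# SQL database via SQLAlchemy
from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/sales.db')
df_sql = pd.read_sql_query('SELECT * FROM sales', engine)
# Other formats
df_sas = pd.read_sas('data/sales.sas7bdat')
df_stata = pd.read_stata('data/sales.dta')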
a. If the file is very large, import a sample using the nrows parameter.
b. If the data is compressed (zip/gz), you can import it directly (e.g.,
pd.read_csv('file.csv.gz')).
c. To select specific columns, use the usecols parameter.
d. To set a column as the index, use the index_col parameter.
e. If there are comments or unnecessary rows in the data, use the comment or
skiprows parameter.
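Combining these parameters in one call might look like this sketch (the file name, column names, and values are illustrative):
df = pd.read_csv(
    'data/big_file.csv.gz',            # compressed files can be read directly
    nrows=10_000,                      # sample only the first rows of a very large file
    usecols=['id', 'date', 'amount'],  # load only the columns you need
    index_col='id',                    # set a column as the index
    skiprows=2,                        # skip unwanted leading rows
    comment='#',                       # ignore comment lines
)
# For files too large for memory, read in chunks instead
for chunk in pd.read_csv('data/big_file.csv', chunksize=100_000):
    pass  # process each chunk here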
6. Documenting the Data Import Process
a. Comment the data import process in your notebook so other analysts can understand
where the data came from and how it was imported.
b. Mention the data source, version, and import date (to maintain data lineage).
Pro Tip:
Understand and use all available parameters during data import (header, index_col, usecols,
na_values, dtype, skiprows, nrows, encoding, etc.). This is what sets the best analysts apart from
the average.
Summary:
In Chapter 1, pay attention to every detail while importing data—file format, path, encoding, missing
values, columns, data types, and import parameters. Immediately verify after import that the data is
correct. Following all these steps will ensure your analysis is always professional and reliable.
Chapter 2
Inspecting Data
• Use df.shape to get the count of rows and columns. This tells you how big the data is.
• Use df.size to find the total number of elements (rows × columns).
• Use df.head(n) to view the top n rows (default 5). This helps you understand the structure
and starting values of the data.
• Use df.tail(n) to view the last n rows, so you can catch end values and possible data entry
issues.
• Use df.sample(n) to look at random rows, ensuring you don’t miss any patterns in the data.
• Use df.info() to get data types, non-null counts, and memory usage for each column.
• This helps you identify missing values and spot opportunities for memory optimization.
• Use df.describe() for numerical columns to get count, mean, std, min, max, and quartiles.
• Use df.describe(include='object') for a summary of categorical columns (unique, top,
freq).
• Use df.describe(include='all') for a summary of mixed data types.
• Use df.isnull().sum() to find how many missing values each column has.
• Use df.isnull().any() to see which columns contain missing values.
• For numeric columns, check the range (e.g., are negative values allowed?).
• For categorical columns, check for inconsistent entries (e.g., 'Male', 'male', 'MALE').
• For date columns, check if the format is consistent, possibly using regex.
• Check cross-column dependencies (e.g., the number of bedrooms should not exceed the total number of rooms).
• Check for duplicates: df.duplicated().sum().
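Taken together, the inspection steps above might look like this sketch (df is the DataFrame imported in Chapter 1; the 'gender' column is an assumed example):
print(df.shape, df.size)            # rows x columns, total elements
df.head(10)                         # first rows
df.tail(10)                         # last rows
df.sample(5)                        # random rows
df.info()                           # dtypes, non-null counts, memory usage
df.describe()                       # numeric summary
df.describe(include='object')       # categorical summary
print(df.isnull().sum())            # missing values per column
print(df.duplicated().sum())        # duplicate rows
print(df['gender'].value_counts())  # spot inconsistent categories such as 'Male' vs 'male'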
10. Visually Inspect the DataFrame
• Always focus on data types and missing values, as these can cause errors in analysis and
modeling.
• Use value_counts() on categorical data to spot rare categories or spelling mistakes.
• Check logical consistency (cross-column rules), which might be missed in normal inspection.
• Save the output of DataFrame info and describe in your notebook for future reference.
Summary:
In Chapter 2, inspect the data from every angle—structure, types, missing values, unique values,
logical consistency, and data quality. Doing all these checks will make your analysis professional,
reliable, and error-free, just like the best data analysts.
Chapter 3
Data Cleaning
• Always save a separate copy of the raw/original data. Never overwrite it, so you can easily
revert if needed.
• Remove columns/rows not needed for analysis, like IDs, irrelevant logs, or placeholder
columns.
• Use df.drop(columns=['col1', 'col2']) or df = df[df['col'] !=
'unwanted_value'].
• Detect out-of-range values, impossible entries (like negative age, future date).
• Standardize values, e.g., convert all 'Male', 'male', 'MALE' to 'male'.
• Convert date/time columns to a uniform format: pd.to_datetime(df['date_col'],
errors='coerce').
• Apply cross-column rules (e.g., start_date should not be after end_date, and the number of
bedrooms should not exceed the total number of rooms).
• Fix or flag logical errors.
• Treat special symbols (like '?', '--', 'N/A') as missing values using the na_values parameter
or .replace().
• Use df.info(), df.describe(), and visual checks to ensure that, after cleaning, the data is
correct and no unintended bias has been introduced.
• Always re-inspect the data after every cleaning step to avoid unintended consequences.
• Build automated cleaning pipelines to save time and reduce errors in large or repeatable
projects.
• After cleaning, check the data’s distribution, mean, median, std, and unique values again.
• Make cleaning functions reusable and well-documented for team sharing.
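A sketch of these cleaning steps, assuming illustrative column names such as 'gender', 'date_col', and 'age':
df_raw = df.copy()                                   # keep an untouched copy of the raw data
df = df.drop(columns=['row_id', 'log_text'])         # drop columns not needed (names are illustrative)
df['gender'] = df['gender'].str.strip().str.lower()  # standardize 'Male'/'MALE'/'male'
df['date_col'] = pd.to_datetime(df['date_col'], errors='coerce')
df = df.replace(['?', '--', 'N/A'], np.nan)          # treat special symbols as missing values
df.loc[df['age'] < 0, 'age'] = np.nan                # flag impossible entries (negative age) as missing
df = df.drop_duplicates()
df.info()                                            # re-inspect after cleaning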
Summary:
In Chapter 3, an expert data analyst performs every possible data cleaning activity—handling missing
values, duplicates, invalid entries, outliers, string/text issues, data types, logical consistency, and
documentation. Every step should be modular, repeatable, and well-documented. Don’t forget to
re-inspect the data after cleaning so your analysis is always trustworthy, accurate, and professional.
Chapter 4
Exploratory Data Analysis (EDA)
• Before starting analysis, review your business or research objectives again. This ensures the
analysis stays focused and relevant.
• Summarize the data using mean, median, mode, minimum, maximum, standard deviation,
percentiles, range, and count.
• For categorical columns, check value counts, frequency tables, and unique values.
• For numerical columns, examine distributions using histograms and boxplots.
• Note any outliers or anomalies.
• Create new features (e.g., extract month, day, year from dates; calculate text length or
sentiment).
• Scale or normalize numerical features using StandardScaler or MinMaxScaler.
• Encode categorical features using Label Encoding or One-Hot Encoding.
• Apply binning, bucketing, or discretization where appropriate.
• Divide data into relevant segments (such as age groups, locations, product categories).
• Analyze each segment separately to gain granular insights.
• Segmentation helps uncover hidden trends not visible in aggregated data.
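An illustrative EDA pass, using the libraries imported in Chapter 1 and assumed columns 'price', 'category', and a datetime column 'date':
df['price'].describe()                                  # mean, std, quartiles, range
df['category'].value_counts()                           # frequency table for a categorical column
sns.histplot(df['price'])                               # distribution
plt.show()
sns.boxplot(x=df['price'])                              # outliers at a glance
plt.show()
df['month'] = df['date'].dt.month                       # new temporal feature
df.groupby('category')['price'].agg(['mean', 'count'])  # segment-level view
from sklearn.preprocessing import StandardScaler
df['price_scaled'] = StandardScaler().fit_transform(df[['price']]).ravel()
df['price_bin'] = pd.cut(df['price'], bins=5)           # binning/discretization
df_encoded = pd.get_dummies(df, columns=['category'])   # one-hot encoding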
8. Detect Anomalies, Trends, and Seasonality
• Use live dashboards or streaming analytics tools (e.g., Apache Kafka, Spark Streaming) to
meet business needs.
• Record every finding, pattern, relationship, and anomaly in your notebook or report.
• Include visualizations, tables, and key metrics.
• Note limitations, data quality issues, and assumptions.
• Treat EDA as an iterative process: as new patterns emerge, re-inspect data, update
visualizations, and test new hypotheses.
Quick Checklist
• Always write insights alongside every visualization; just plotting graphs is not enough.
• Check statistical significance to ensure findings are reliable.
• Segment data thoroughly; valuable insights often lie in subgroups.
• Make EDA reproducible by maintaining clean code, comments, and outputs.
• Clearly mention limitations and data quality issues.
Summary:
In Chapter 4, an expert data analyst applies descriptive, inferential, statistical, and machine learning
analyses; visualizes every variable; detects relationships, patterns, and outliers; performs
segmentation; and documents all findings thoroughly. Keep EDA iterative and objective-driven so
insights are robust, actionable, and aligned with business or research goals.
Chapter 5
Data Visualization
• First, decide the purpose of the visualization: showing trends, explaining distributions,
making comparisons, highlighting correlations, or illustrating part-to-whole relationships.
• Understand your audience: are they technical or non-technical, business or research
focused?
• Univariate Analysis:
• Numerical: Histogram, boxplot, density plot.
• Categorical: Bar chart, pie chart, count plot.
• Bivariate/Multivariate Analysis:
• Numerical vs Numerical: Scatter plot, hexbin plot.
• Categorical vs Numerical: Boxplot, violin plot, swarm plot.
• Multiple variables: Pairplot, heatmap, correlation matrix.
• Time Series:
• Line plot, area chart, seasonal decomposition.
• Geographical Data:
• Map, choropleth map, symbol map.
• Part-to-Whole:
• Pie chart, donut chart, stacked bar chart.
• Ranking/Comparison:
• Bar chart, lollipop chart, dot plot.
• Network/Relationship:
• Network diagram, sankey diagram.
• Text Data:
• Word cloud, frequency bar chart.
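A few of these chart types with matplotlib/seaborn, as a sketch using assumed columns 'price', 'quantity', 'category', and 'date':
# run each plot in its own notebook cell (or call plt.show() after each)
sns.histplot(df['price'])                                   # univariate numerical
sns.countplot(x='category', data=df)                        # univariate categorical
sns.scatterplot(x='price', y='quantity', data=df)           # numerical vs numerical
sns.boxplot(x='category', y='price', data=df)               # categorical vs numerical
sns.heatmap(df.select_dtypes('number').corr(), annot=True)  # correlation matrix
df.set_index('date')['price'].plot()                        # time series line plot
plt.show()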
• Aggregate, filter, or transform data as needed (e.g., groupby, pivot, rolling averages).
• Treat outliers or missing values so the visualization is not misleading.
• Understand the scale and range of the variables you are plotting.
• Clarity:
• Always keep axis labels, titles, and legends clear and readable.
• Use accessible color palettes (colorblind-friendly, high contrast).
• Avoid unnecessary gridlines, ticks, and decorations.
• Consistency:
• Use the same color, scale, and units for the same variable across visuals.
• Annotation:
• Annotate important points, trends, or outliers.
• Sorting:
• Sort bar charts or rankings in a logical order.
• Interactivity:
• Add filters, slicers, and drill-downs in dashboards so users can explore data.
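A small sketch applying several of these practices to one chart (the 'month' and 'price' columns are assumed):
fig, ax = plt.subplots(figsize=(8, 4))
monthly = df.groupby('month')['price'].mean().sort_values()   # sort for a logical order
monthly.plot(kind='barh', ax=ax, color='steelblue')
ax.set_title('Average price by month')                        # clear title
ax.set_xlabel('Average price')                                # readable axis labels
ax.set_ylabel('Month')
ax.annotate('Highest average', xy=(monthly.max(), len(monthly) - 1))  # annotate the key point
plt.tight_layout()
plt.show()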
6. Combine Multiple Visualizations
• Create date, geography, or product hierarchies to allow users to move from high-level to
detailed views.
Summary:
In Chapter 5, an expert data analyst selects the right visualization technique, prepares data, follows
design best practices, builds interactive and multi-angle dashboards, documents each visualization,
and uses a story-driven approach. The goal of visualization is to convert complex data into simple,
clear, and actionable insights so that decision-making is fast and effective.
Chapter 6
Feature Engineering
• For numerical features: fill with mean, median, mode, interpolation, or a domain-specific
value.
• For categorical features: fill with mode, 'Unknown', or predictive imputation.
• Advanced: create a missing indicator feature (e.g., is_missing flag).
• Interaction Features: Product, ratio, or difference of two or more features (e.g., price ×
quantity = revenue).
• Polynomial Features: Square, cube, etc. of features (e.g., x, x², x³).
• Temporal Features: Extract year, month, day, weekday, or time-delta from dates.
• Aggregated Features: Use groupby to get mean, sum, count, min, max, std, etc. (e.g., total
purchases per customer).
• Text Features: Text length, word count, sentiment score, TF-IDF, embeddings.
• Domain-Specific Features: Create new features based on business logic or expert
knowledge.
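Sketches of these constructions, with 'price', 'quantity', 'date', 'customer_id', and 'comment' as assumed columns:
df['revenue'] = df['price'] * df['quantity']                      # interaction feature
df['price_sq'] = df['price'] ** 2                                 # polynomial feature
df['weekday'] = df['date'].dt.weekday                             # temporal feature
df['days_since'] = (pd.Timestamp.today() - df['date']).dt.days    # time delta in days
df['customer_total'] = df.groupby('customer_id')['revenue'].transform('sum')  # aggregated feature
df['comment_length'] = df['comment'].str.len()                    # simple text feature
df['price_missing'] = df['price'].isnull().astype(int)            # missing-value indicator flag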
7. Feature Transformation
• Test the impact of every new feature or selection on the model (cross-validation, A/B
testing).
• Use an iterative approach: add/remove features and evaluate model performance.
• The combination of creativity and domain knowledge is the most powerful tool in feature
engineering.
• Always test the impact of every new feature on the model to avoid overfitting.
• Use dimensionality reduction (PCA, autoencoders) to make high-dimensional data
manageable.
• After feature selection, check model interpretability and business explainability.
• Feature engineering is an iterative process—refine features as new patterns emerge.
Feature Engineering Techniques
Summary:
In Chapter 6, an expert data analyst transforms, creates, selects, and optimizes raw data—handling
missing values, outliers, scaling, encoding, interaction/polynomial/temporal features, feature
selection, and automation—all with documentation and benchmarking. Feature engineering is the
real secret to model accuracy, robustness, and explainability, so use creativity, logic, and domain
knowledge at every step.
Chapter 7
Outlier Detection
• First, decide why you are detecting outliers: data cleaning, anomaly detection, fraud
detection, or rare event analysis.
• Use business context and domain knowledge to correctly interpret unusual points.
• Mahalanobis Distance:
• Considers correlations between multiple variables.
• Multivariate Analysis (MVA):
• Analyze multiple columns together to detect outliers.
• Pairwise Scatterplots:
• Visually identify outliers in multivariate data.
• Isolation Forest:
• Isolates data points through random splits; easily isolated points are outliers.
• One-Class SVM:
• Identifies outliers by treating normal data as one class.
• Autoencoders (Deep Learning):
• Detect outliers in high-dimensional data using reconstruction error.
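Two of these approaches as sketches: the simple IQR rule and scikit-learn's Isolation Forest (the contamination value is an assumption to tune per dataset):
# IQR rule on a single numeric column
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_mask = (df['price'] < q1 - 1.5 * iqr) | (df['price'] > q3 + 1.5 * iqr)
print(iqr_mask.sum(), 'IQR outliers')
# Isolation Forest on all numeric columns
from sklearn.ensemble import IsolationForest
numeric = df.select_dtypes('number').dropna()
labels = IsolationForest(contamination=0.01, random_state=42).fit_predict(numeric)
outliers = numeric[labels == -1]    # -1 marks points flagged as outliers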
• Document which outlier detection technique was used, what threshold was set, how many
points were detected, and what action was taken.
• Note the impact of outlier removal/capping on the analysis.
• Outlier detection is not a one-time task; check again after each new feature or
transformation.
• Experiment with different techniques (statistical, clustering, ML-based) and choose the best
approach.
Outlier Detection Techniques
• Context is most important in outlier detection—sometimes rare but valid business cases may
look like outliers.
• Combine multiple techniques (visual + statistical + ML) for robust detection.
• Always check data distribution and model performance after handling outliers.
• Document and validate with business experts to avoid bias from outlier removal.
Summary:
In Chapter 7, an expert data analyst detects outliers from every angle—visualization, IQR, Z-score,
clustering, ML-based, domain logic—and handles each outlier according to context (remove, cap,
impute, flag, or business review). Every step should be transparent, iterative, and well-documented
to ensure the analysis is accurate, fair, and business-relevant.
Chapter 8
Data Splitting
• The main goal of data splitting is to evaluate your model in an unbiased way and prevent
overfitting.
• The training set teaches the model, the validation set is for hyperparameter tuning and
model selection, and the test set evaluates real-world model performance.
• Random Splitting:
• Randomly split the data (e.g., 70-80% training, 20-30% testing).
• Simple and effective when data is large and balanced.
• Stratified Splitting:
• For imbalanced datasets, use stratified splitting to maintain class proportions in each
split.
• Use the stratify parameter in scikit-learn.
• Time-Based Splitting:
• For time series data, use earlier data for training and later data for testing.
• Use TimeSeriesSplit or custom logic.
• K-Fold Cross-Validation:
• Split data into K equal folds; each fold is used once as a test set, the rest as training.
• Stratified K-Fold is best for imbalanced classes.
• Custom Splitting:
• Use business or domain-specific logic (e.g., recent data for testing, older data for
training).
• Common ratios:
• 70% train, 30% test
• 80% train, 20% test
• 60% train, 20% validation, 20% test
• For K-Fold: K = 5 or 10 is commonly used.
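Sketches of the main strategies with scikit-learn (the 'target' column and the 80/20 ratio are assumptions):
from sklearn.model_selection import train_test_split, StratifiedKFold, TimeSeriesSplit

X = df.drop(columns=['target'])     # features; 'target' is an assumed label column
y = df['target']

# random split with a fixed seed; stratify preserves class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# stratified K-fold cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    pass  # fit and evaluate per fold

# time-based splitting: training folds always precede test folds
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    pass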
6. Ensure Reproducibility
• For random splits, fix the random_state parameter to make results reproducible.
• Document the splitting process, code, parameters, and logic.
• Perform data cleaning, feature engineering, scaling, or encoding only after splitting (fit only
on training, then apply to test/validation).
• Make sure the target variable or future information does not accidentally leak into training or
test sets.
• Check the distribution of the target variable in each split (especially for stratified splits).
• Ensure splits are representative and not biased.
• Nested Cross-Validation:
• For hyperparameter tuning and unbiased evaluation.
• Group K-Fold:
• When data has groups (e.g., patients, users), ensure each group is only in one split.
• Leave-One-Out (LOO):
• Each observation serves as a test set once (for small datasets).
• Clearly mention split ratios, strategy, random state, and logic in your notebook/report.
• Report summary stats, class balance, and sample sizes after splitting.
• Always use stratified splitting for imbalanced datasets to avoid model bias.
• Never include future data in the training set for time series problems.
• Use K-Fold cross-validation to check model stability and robustness.
• After splitting, plot descriptive stats and target distribution for each split.
• Make your data splitting code modular for repeatability and auditability.
Data Splitting Checklist Table
Summary:
In Chapter 8, an expert data analyst carefully plans data splitting—choosing the right strategy, ratios,
ensuring reproducibility, preventing leakage, and documenting the process. They check the
distribution of each split, use advanced techniques if needed, and keep the process transparent. This
ensures model evaluation is fair, unbiased, and real-world ready—just like the best data analysts do.
Chapter 9
Model Selection
4. Model Training
• Train each shortlisted model on the training data.
• Define the loss function (e.g., cross-entropy, MSE) and select the optimization algorithm (e.g.,
gradient descent).
• Apply regularization (L1, L2, dropout) to prevent overfitting.
5. Hyperparameter Tuning
• Tune hyperparameters for each model (e.g., tree depth, learning rate, number of estimators,
regularization strength).
• Use grid search, random search, or Bayesian optimization to find the best combination.
• Perform tuning with cross-validation for unbiased results.
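A sketch of cross-validated tuning with GridSearchCV (the model, grid, and scoring metric are placeholders):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300], 'max_depth': [5, 10, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # cross-validation for unbiased tuning
    scoring='f1',         # choose the metric that matches the problem
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)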
7. Overfitting/Underfitting Analysis
• Select the most balanced model—one that performs best on validation and generalizes well.
• Consider model interpretability, business constraints, and deployment feasibility.
9. Final Evaluation on Test Set
• Use feature importance, SHAP values, or LIME to explain model predictions—especially for
critical domains (healthcare, finance, etc.).
Summary:
In Chapter 9, an expert data analyst systematically performs model selection, training, tuning,
evaluation, and documentation—tries multiple models, compares on best metrics, analyzes
overfitting/underfitting, checks explainability and deployment readiness, and ensures everything is
reproducible and transparent. This approach ensures your model is always accurate, robust, and
business-ready—just like the best data analysts do.
Chapter 10
Insights & Reporting
• Convert your findings from mere observations to actionable insights—focus not just on
“what happened,” but also on “why it happened” and “what should be done next.”
• Explain every major trend, pattern, or anomaly and highlight its business impact.
• Link insights to the business or project objectives.
• For every insight, provide clear, specific, and practical recommendations (e.g., “Streamline
the onboarding process to reduce customer churn”).
• Suggest both short-term quick wins and long-term strategic actions.
• Justify recommendations with data and analysis—avoid opinions, make them data-driven.
• Use written reports (PDF, Word), presentations (PowerPoint), dashboards (Tableau, Power
BI), or web-based reports—choose what works best for your audience.
• Also consider oral presentations, digital reports, and interactive dashboards.
• Ensure accessibility—visuals should be colorblind-friendly, fonts readable, and formats
universally accessible.
• Share results in a timely manner so they can actually impact business or project decisions.
• Make sure insights are not outdated and recommendations are relevant to the current
context.
8. Enable Feedback and Iteration
• Collect feedback from stakeholders, address queries, and refine the report/insights as
needed.
• Maintain an iterative approach so findings can be continuously improved.
• For every insight, answer “So What?”—what does this mean for the business or project?
• Highlight key trends in visuals (annotations, callouts)—just showing data is not enough.
• Prioritize recommendations—separate quick wins, high-impact, and strategic actions.
• Do not cherry-pick data or insights—share all relevant findings, whether positive or negative.
• Always include a “Next Steps” or “Action Plan” section at the end of every report or
presentation.
Summary:
In Chapter 10, an expert data analyst converts analysis results into actionable insights and
recommendations, links them to business objectives, maintains honest and transparent
communication, uses the best visualization and storytelling techniques, follows an audience-centric
structure, shares results in multiple timely mediums, collects feedback, and maintains thorough
documentation. This approach ensures insights are impactful, understandable, and decision-ready—
just like the best data analysts do.
Chapter 11
Save Results
• Decide what needs to be saved or backed up: cleaned datasets, processed features, trained
models, code scripts, reports, visualizations, logs, and documentation.
• The objective is reproducibility, audit trail, future reuse, and knowledge transfer.
• Save final cleaned datasets in standardized formats (CSV, Parquet, Excel, SQL, etc.).
• Use data versioning (file naming conventions, timestamps, or tools like DVC/Git LFS).
• Encrypt or apply access control to sensitive data.
• Push Jupyter notebooks, Python scripts, R scripts, or SQL queries to version control (Git).
• Save README, requirements.txt/environment.yml, and usage instructions with the code.
• Update code documentation and comments.
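A sketch of versioned saving (paths, file names, and the date-stamp convention are assumptions):
from datetime import datetime
import joblib

stamp = datetime.now().strftime('%Y%m%d')
df.to_parquet(f'outputs/clean_sales_{stamp}.parquet')       # cleaned dataset, versioned by date
df.to_csv(f'outputs/clean_sales_{stamp}.csv', index=False)
joblib.dump(model, f'outputs/model_{stamp}.joblib')         # 'model' is your trained estimator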
5. Store Reports, Visualizations, and Dashboards
• Save reports as PDF, PPT, HTML, or on dashboard platforms (Power BI, Tableau).
• Export visualizations as high-resolution images or interactive formats (Plotly, Tableau Public,
etc.).
• Store reports and dashboards on shared drives, SharePoint, or cloud storage.
• Save data dictionaries, methodology docs, decision logs, and process documentation in a
central repository.
• Maintain documentation with version and date.
• Share results, code, and documentation with the team (shared folders, GitHub, Confluence,
Notion, etc.).
• Prepare handover notes and conduct walkthrough sessions for new team members or stakeholders.
• Share FAQs, troubleshooting guides, and best practices.
• Attach process logs, code version, data version, and environment details to every saved
output.
• This ensures future analysis can be reproduced, troubleshot, or audited.
• Save feedback, lessons learned, and improvement notes in the archive for future projects.
• Keep documentation updated and review the archive periodically.
• Always tag every output, model, and code file with version and date.
• Use cloud storage, version control, and data cataloging tools for large teams.
• Keep a “README” or summary file in the archive so anyone can quickly understand what’s
there and how to use it.
• Never store sensitive data in unsecured locations—encryption and access control are a must.
• Periodically audit the archive—clean obsolete files, back up critical outputs.
Summary:
In Chapter 11, an expert data analyst systematically saves, archives, and shares every output—cleaned
data, models, code, reports, and documentation. Every file is versioned, documented, and secured;
knowledge transfer and reproducibility are ensured; and the foundation is set for compliance, audit
trail, and future learning. This approach makes your work sustainable, reusable, and a long-term
asset for the organization—just like the best data analysts do.