Report
CHAPTER 1:
COMPANY PROFILE
About Varcons Technologies Pvt Ltd
Varcons Technologies Pvt Ltd is a leading provider of advanced technology solutions, specializing in
scalable, innovative services tailored for businesses of all sizes. The company was founded by a team who,
while working in New York during their master's studies, saw the rise of IT and turned their ideas into
reality. It has since grown into a trusted partner for SaaS product development, class-leading Type A to B
projects, investment-backed Korean projects, and the integration of AI into existing systems.
At Varcons Technologies, smart solutions and technological innovation drive every aspect of our work.
We focus on leveraging SaaS capabilities to develop cutting-edge applications that enhance efficiency,
reduce deployment complexities, and provide seamless user experiences. By integrating customizability and
practicality into our software solutions, we ensure that businesses can implement ready-to-use
applications with minimal configuration time and reduced operational disruptions.
In addition to the core technology offerings, Varcons Technologies operates as a strategic project
consulting firm, managing outsourced projects mainly from South Korea and from major enterprises and
their vendors. By using a hybrid workforce model that integrates skilled interns alongside industry
professionals, the company optimizes project execution without the overhead of full-time hiring. This helps
our clients achieve cost-effective project completion, improved operational efficiency, and increased
profitability.
With a strong commitment to creativity, adaptability, and technological excellence, Varcons Technologies
continues to drive industry transformation by developing innovative solutions, fostering talent, and
enabling businesses to thrive in an ever-evolving digital landscape.
CHAPTER 2:
SERVICES AND ACTIVITIES AT THE COMPANY
Varcons Technologies Pvt Ltd is a multifaceted technology consulting firm offering a diverse range of
services across multiple departments, catering to businesses, startups, and enterprises seeking scalable,
high-impact solutions. Our operations are strategically structured to provide end-to-end technology
development, corporate consultancy, and outsourced project management, ensuring our clients receive
customized, innovative, and cost-effective services.
Our Software Development Division specializes in SaaS-based solutions, full-stack development, and
enterprise application engineering. We develop highly scalable, cloud-based applications designed to
streamline business processes while incorporating automation, AI-driven analytics, and secure API
integrations. Our custom-built software solutions include subscription-based applications that enable
businesses to implement pre-configured, ready-to-use platforms, reducing deployment time and mitigating
operational risks.
In addition, our Outsourced IT Consulting & Project Management Division enables large enterprises to
delegate complex software projects to our in-house experts and highly trained interns, allowing businesses
to reduce hiring costs while maintaining efficiency. This model benefits companies by ensuring their
projects are completed at a fraction of the cost, while interns receive hands-on exposure, real-world
experience, mentorship, and stipends—creating a mutually beneficial ecosystem for talent development
and business growth.
Our AI & Data Science Division focuses on research-based machine learning applications across
industries, including healthcare, finance, and automation. We develop AI-driven solutions such as
predictive analytics, image recognition models, and autonomous process automation to optimize
operations and enhance decision-making capabilities for enterprises. Additionally, we are involved in
innovative research projects, collaborating with international investors, research institutions, and
academia to explore emerging trends in AI and deep learning.
CHAPTER 3:
INTRODUCTION
In today’s evolving retail landscape, especially within supermarket operations, understanding and
predicting customer behaviour is a strategic necessity. With the explosion of data sources and the rise of
advanced analytics, retailers are increasingly relying on data mining techniques to derive insights that
inform better business decisions. This paper explores the role of data-driven strategies in enhancing
customer satisfaction and optimizing operational efficiency.
The exponential increase in data, ranging from transaction histories and customer demographics to loyalty
programs and product details, provides supermarkets with a unique opportunity to analyze and understand
complex consumer behaviours. Leveraging data mining methods allows businesses to convert this vast
information into actionable intelligence, which can be applied to marketing, merchandising, and inventory
management.
This study outlines a structured approach to customer behaviour analysis and predictive modelling,
encompassing data preprocessing, exploratory data analysis, feature engineering, model selection, and
performance evaluation. These steps help uncover patterns in customer preferences, buying habits, and
brand loyalty.
Predictive modelling further enables retailers to anticipate future consumer actions and adapt accordingly.
It supports personalized product recommendations, targeted marketing campaigns, and efficient resource
allocation. This proactive strategy equips supermarkets to respond swiftly to market changes, thereby
improving performance and gaining a competitive advantage.
Ultimately, this paper aims to guide retailers in harnessing the full potential of their data assets. By adopting
a systematic, data-driven approach, supermarkets can not only boost profitability and customer loyalty but
also ensure long-term growth in an increasingly data-centric retail environment.
CHAPTER 4:
LITERATURE SURVEY
Chen, D., Sain, S. L., and Guo, K. (2012) focus on segmenting online retail customers by applying the RFM
(Recency, Frequency, Monetary) model in combination with k-means clustering and decision tree
induction. The authors likely aim to improve customer targeting and marketing strategies by identifying
distinct consumer groups based on purchasing behavior. [1]
Agarwal, P. (2014) discusses the benefits and challenges of data mining in the retail sector. The paper
emphasizes how data mining helps understand customer preferences and improve operational decision-
making, likely offering insights for more informed retail management. [2]
Kumar, M. R., Venkatesh, J., and Rahman, A. M. Z. (2021) explore how integrating data mining with
machine learning can enhance customer satisfaction and retention. The study likely proposes personalized
service models that adapt to individual consumer needs. [3]
Li, H. (2005) examines the role of data warehousing and mining in retail, focusing on customer
segmentation and inventory control. The paper likely provides strategies to better manage retail data for
more effective decision-making. [4]
Kohavi, R., Mason, L., Parekh, R., and Zheng, Z. (2004) share practical lessons from analyzing large-scale
retail e-commerce data. The authors likely identify common challenges and propose strategies for
conducting more effective data mining in online retail. [5]
Hormozi, A. M., and Giles, S. (2004) identify data mining as a strategic tool for gaining competitive
advantage in the banking and retail industries. Their work likely explores applications in customer insights
and fraud detection. [6]
Muley, P. A. (2022) discusses how data mining techniques are applied in retail to analyze customer
behavior. The study likely identifies key purchasing patterns and trends to support better marketing and
sales strategies. [7]
Zhang, X., Edwards, J., and Harding, J. (2007) explore how web usage data mining can be used to
personalize online sales. The paper likely proposes frameworks to enhance the customer experience through
tailored digital interactions. [8]
Ahmeda, R. A. E. D., et al. (2015) evaluate the performance of classification algorithms in analyzing
consumer behavior during online shopping. The authors likely compare algorithm accuracy to improve
predictive modeling in e-commerce. [9]
Srikant, R., and Agrawal, R. (1996) introduce methods for mining sequential patterns in transaction data.
Their work likely contributes foundational techniques for understanding purchase sequences and behavior
over time. [10]
Magnini, V. P., Honeycutt Jr, E. D., and Hodge, S. K. (2003) analyze the use and limitations of data mining
in the hotel industry. Although focused on hospitality, the insights are likely transferable to retail,
particularly in customer relationship management. [11]
Ritbumroong, T. (2015) investigates customer behavior using online analytical mining (OLAM) tools. The
study likely demonstrates how OLAM techniques can extract actionable insights for improving retail
strategies. [12]
Hemalatha, M. (2012) applies market basket analysis in Indian retail to understand consumer purchasing
behaviors. The paper likely identifies frequent item sets and purchasing patterns to support inventory and
sales planning. [13]
Huang, C. K., Chang, T. Y., and Narayanan, B. G. (2015) study shifts in customer behavior in dynamic
markets using data mining. The authors likely explore adaptive techniques for understanding and
responding to changing consumer needs. [14]
Punpukdee, A., et al. (2021) conduct a research synthesis combining systematic literature review and data
mining to understand consumer behavior. The paper likely summarizes key themes and trends, offering a
comprehensive view of current consumer analytics methods. [15]
CHAPTER 5:
DATASET OVERVIEW AND PREPROCESSING
To assess the quality of the data, the process began by eliminating duplicate entries, amounting to 5,232
instances, which represented approximately 1.11% of the entire dataset. This left the analysis with 466,678
rows for further examination. Upon visual inspection of the numerical attributes, it became apparent from
the plots in Figure 1a that both attributes exhibited notably high outliers, both positive and negative.
However, the presence of negative values was inconsistent with the semantics of the attributes, as they
should inherently be positive. From Figure 1 and Figure 2, further investigation revealed that negative
values in the "Qta" attribute likely represented refunds, supported by the symmetric behaviour of the
attribute. Additionally, nearly all records with negative "Qta" values were associated with BasketIDs
starting with "C", indicative of cancellations. Notably, some records with negative "Qta" values lacked
a corresponding "C" BasketID prefix, yet analysis of their respective "ProdDescr" suggested they
pertained to errors or damaged items. Regarding negative "Sale" values, only two records exhibited this
property, which upon examination of the "ProdDescr" ("ADJUST BAD DEBT") were attributed to
errors. Importantly, all rows identified as errors were associated with null CustomerIDs.
Subsequently, the analysis proceeded by removing entries corresponding to the 65,073 null CustomerID
values, constituting approximately 13.94% of the dataset. This action was deemed necessary as the primary
objective was to analyze customer behaviour, rendering entries with null CustomerIDs irrelevant. This
removal process also eliminated the previously identified errors. Additionally, ProdIDs that did not
conform to the defined format, consisting solely of letters and accompanied by ProdDescrs such as
"POSTAGE", "Discount", "CARRIAGE", "Manual", "Bank Charges", etc., were eliminated from the dataset.
Consequently, 1,273 entries were dropped from the dataset.
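The three filters described above can be sketched with pandas as follows. This is a minimal illustration rather than the code used in the study: the column names (CustomerID, ProdID and so on) are taken from the text, while the helper name clean_transactions and the sample rows are made up for the example.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three filters from the text, in the same order."""
    # 1. Drop exact duplicate rows.
    df = df.drop_duplicates()
    # 2. Drop rows without a customer, since the analysis is per customer.
    df = df.dropna(subset=["CustomerID"])
    # 3. Drop non-product rows whose ProdID consists solely of letters
    #    (for example POSTAGE or Manual entries).
    letters_only = df["ProdID"].astype(str).str.fullmatch(r"[A-Za-z]+")
    return df[~letters_only].reset_index(drop=True)


# Tiny made-up sample; the values are illustrative, not taken from the dataset.
raw = pd.DataFrame({
    "BasketID": ["536365", "536365", "536366", "C536367", "536368"],
    "ProdID": ["85123A", "85123A", "POST", "22632", "71053"],
    "ProdDescr": ["HEART HOLDER", "HEART HOLDER", "POSTAGE", "HAND WARMER", "LANTERN"],
    "Qta": [6, 6, 1, -2, 3],
    "Sale": [2.55, 2.55, 18.0, 1.85, 3.39],
    "CustomerID": [17850.0, 17850.0, 12583.0, 13047.0, None],
})
cleaned = clean_transactions(raw)
print(cleaned)
```

On the sample, one duplicate row, one row with a null CustomerID, and one letters-only ProdID row are removed, leaving two rows.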
Transaction Distribution by Year: The dataset shows uneven transaction distributions in 2010, with most
transactions occurring on the 12th of each month, resulting in a peak for this date in the plot. In contrast,
2011 displays a more homogeneous distribution, with transactions spread across multiple days in each
month. This suggests differences in transactional patterns over the two years.
Visualizing Data Distribution: Figures 5 and 6 illustrate the daily distribution of transactions. The 2010
data shows significant skewness, while 2011 data is more evenly distributed, highlighting changes in
customer behavior or business operations between the two years.
Outliers in Daily Distribution: In the 2010 dataset, an uneven plot indicates that transactions were
primarily recorded on a single day each month, suggesting potential anomalies or focused sales events.
Identifying such outliers can help in refining the analysis for further modeling.
Identifying Data Patterns: The investigation into the daily distribution across years can offer insights into
transaction timing, which can be important for sales forecasting and marketing strategies.
Customer Behavior Insight: The RFM Analysis in EDA plays a crucial role in customer segmentation.
By analyzing Recency, Frequency, and Monetary values, it provides deep insights into customer purchasing
behavior, helping businesses target the most valuable customers.
RFM Segmentation: The segmentation of customers into six categories (Best Customers, Loyal
Customers, Big Spenders, Almost Lost, Lost Customers, Lost Cheap Customers) is an essential aspect of
EDA. This categorization aids in understanding customer loyalty, spending patterns, and potential
marketing strategies.
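As a rough sketch of how the Recency, Frequency, and Monetary values behind this segmentation might be computed, the example below aggregates transactions per customer. Several details are assumptions: the date column name BasketDate is hypothetical, "Sale" is treated as a unit price so that the amount is Qta times Sale, and the snapshot date is arbitrary. The rules that map RFM values to the six named segments are not given in the text, so they are not reproduced here.

```python
import pandas as pd


def compute_rfm(df: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Return one row per customer with Recency, Frequency and Monetary."""
    # Assumption: Sale is a unit price, so the line amount is Qta * Sale.
    df = df.assign(Amount=df["Qta"] * df["Sale"])
    grouped = df.groupby("CustomerID")
    return pd.DataFrame({
        # Days between the snapshot date and the customer's last purchase.
        "Recency": (snapshot - grouped["BasketDate"].max()).dt.days,
        # Number of distinct baskets per customer.
        "Frequency": grouped["BasketID"].nunique(),
        # Total amount spent per customer.
        "Monetary": grouped["Amount"].sum(),
    })


# Made-up transactions for illustration only.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "BasketID": ["A1", "A2", "B1", "C1"],
    "BasketDate": pd.to_datetime(
        ["2011-12-01", "2011-12-08", "2011-06-01", "2011-11-30"]
    ),
    "Qta": [2, 1, 10, 4],
    "Sale": [5.0, 20.0, 1.5, 2.5],
})
rfm = compute_rfm(tx, snapshot=pd.Timestamp("2011-12-10"))
print(rfm)
```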
Feature Correlation: The feature correlation heatmap (Figure 8) is part of the EDA process, helping
identify the relationships between Recency, Frequency, and Monetary features. By assessing these
correlations, one can understand how these features influence each other and whether certain features are
redundant.
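The correlation matrix behind such a heatmap can be obtained directly from pandas. The sketch below uses made-up RFM values purely to show the call; the resulting matrix could then be passed to a plotting library to draw the heatmap.

```python
import pandas as pd

# Made-up RFM values for five customers (illustrative only).
rfm = pd.DataFrame({
    "Recency": [2, 192, 10, 45, 90],
    "Frequency": [12, 1, 8, 4, 2],
    "Monetary": [300.0, 15.0, 210.0, 80.0, 35.0],
})

# Pairwise Pearson correlations between the three features.
corr = rfm.corr(method="pearson")
print(corr.round(2))
```

In this toy sample Frequency and Monetary are strongly positively correlated, which is the kind of relationship that would prompt a check for redundancy.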
Data Quality Assessment: Data cleaning steps identified peculiar transactions, such as null CustomerID
entries and transactions with non-standard ProdID values. These anomalies were addressed during EDA to
ensure the dataset's integrity before proceeding with further analysis.
Handling Canceled Transactions: Canceled transactions, identified by the "C" prefix in BasketID, were
analyzed and excluded from the final dataset during EDA to avoid skewing the results.
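A minimal sketch of this exclusion, assuming BasketID is stored as a string and using made-up rows:

```python
import pandas as pd

# Made-up rows; cancelled baskets carry a "C" prefix as described in the text.
tx = pd.DataFrame({
    "BasketID": ["536365", "C536379", "536381", "C536383"],
    "Qta": [6, -1, 2, -3],
})

# Flag cancellations by prefix and keep everything else.
is_cancelled = tx["BasketID"].astype(str).str.startswith("C")
kept = tx[~is_cancelled]
print(kept)
```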
These points focus on how the dataset was examined during EDA, emphasizing transaction behavior, data
cleaning, and feature extraction for further analysis.
CHAPTER 6:
SYSTEM ARCHITECTURE
Data Semantics Analysis: Initially, explore the dataset using pandas to understand its structure,
features, and data types.
Distribution and Statistics Analysis: Utilize descriptive statistics to understand the distribution of
variables such as sales, customer IDs, product IDs, etc.
Data Quality Assessment: Use techniques such as checking for missing values, outliers, and
inconsistencies in the dataset to ensure data quality.
Variables Transformation and Generation: Perform necessary transformations on variables,
such as converting data types, encoding categorical variables, and generating new features as
required by the task.
Pairwise Correlations Analysis: Calculate pairwise correlations between variables to identify
relationships and potential redundancies. Eliminate redundant variables if necessary.
K-means Clustering: Identify the optimal value of k using techniques such as the elbow method.
Density-based Clustering: Study clustering parameters such as minimum samples and epsilon for
DBSCAN. Characterize and interpret clusters obtained from DBSCAN.
Hierarchical Clustering: Compare different hierarchical clustering results using different linkage
methods (e.g., single, complete, average). Visualize and analyze dendrograms to understand cluster
hierarchy and structure.
Alternative Clustering Techniques: Explore additional clustering techniques provided by the
clustering library, such as agglomerative clustering or G-means clustering.
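The clustering steps above can be sketched with scikit-learn as shown below. The data is a synthetic stand-in for scaled RFM features, and the parameter values (the k range, eps, min_samples, three clusters) are assumptions chosen for illustration, not the settings used in the study.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled RFM features (3 columns, 3 groups).
X, _ = make_blobs(n_samples=300, centers=3, n_features=3,
                  cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

# Elbow method: record the within-cluster sum of squares (inertia) for each k
# and look for the k after which the decrease flattens out.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")

# DBSCAN: eps and min_samples are the two parameters to study.
# Points labelled -1 are treated as noise.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_db_clusters = len(set(db_labels) - {-1})
print("DBSCAN clusters:", n_db_clusters)

# Hierarchical clustering with the three linkage methods named in the text.
linkage_labels = {
    method: AgglomerativeClustering(n_clusters=3, linkage=method).fit_predict(X)
    for method in ("single", "complete", "average")
}
for method, labels in linkage_labels.items():
    print(method, "cluster sizes:", sorted((labels == c).sum() for c in set(labels)))
```

Dendrograms are not drawn here; they would typically be produced from a SciPy linkage matrix.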
Data Preparation: Prepare the dataset with RFM features (Recency, Frequency, Monetary) as input
variables and customer segments as the target variable. Encode categorical target variables
(customer segments) if necessary.
Splitting Data: Split the dataset into training and testing sets for model training and evaluation,
typically using a 70-30 or 80-20 split [9], [10].
Model Training: Train an SVC classifier and a Neural Network classifier using the training data,
with Python's scikit-learn library for the SVC and TensorFlow/Keras for the Neural Network.
Experiment with different hyperparameters and architectures to optimize model performance.
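The data preparation, splitting, and training steps above might look like the following sketch. To keep the example self-contained, scikit-learn's MLPClassifier is swapped in for the TensorFlow/Keras network mentioned in the text, and synthetic three-feature data stands in for the RFM features and segment labels. The 70-30 split, the hyperparameters, and the random seeds are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 3 features (like R, F, M) and 3 target segments.
X, y = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# 70-30 split, stratified so every segment appears in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Both models are wrapped in a pipeline so scaling is fitted on training data only.
models = {
    "SVC": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
    "MLP": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=0),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out set
    print(name, round(scores[name], 3))
```

Hyperparameters such as C, gamma, or hidden_layer_sizes are the natural candidates to vary when experimenting.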
Model Evaluation: The learning curve depicts the relationship between the model's performance
(on the y-axis) and the size of the training set (on the x-axis). In this specific case, the y-axis
represents the average score, which could be accuracy, precision, recall, or F1 score. The following
observations are based on the graph:
Overall Performance: The average score is relatively high across the entire range of training set
sizes, indicating that the SVC model is performing well.
Training vs. Cross-Validation: The two curves in the graph represent the training score (solid line)
and the cross-validation score (dashed line). The training score shows a slight upward trend as the
size of the training set increases, which is expected as the model learns from more data. The
cross-validation score fluctuates slightly but generally stays around 0.97, suggesting that the model
is not overfitting the training data.
Limited Data: The x-axis only goes up to 3500, which might be a relatively small dataset size for
training complex models like SVMs.
Fig 4.2 SVC learning curve
Fig 4.3 SVC validation curve
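A learning curve of the kind discussed above can be computed with scikit-learn's learning_curve function. The sketch below uses synthetic data and accuracy as the score; the training-size grid and the 5-fold cross-validation are assumptions, and the numbers it prints are unrelated to the 0.97 reported for the study's data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in for the RFM features and segment labels.
X, y = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# Score the SVC on growing fractions of the training folds.
train_sizes, train_scores, cv_scores = learning_curve(
    SVC(kernel="rbf"), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="accuracy",
)

# Average over the folds to get one point per training size for each curve.
train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)
for n, tr, cv in zip(train_sizes, train_mean, cv_mean):
    print(f"n={n}: train={tr:.3f}, cross-validation={cv:.3f}")
```

A persistent gap between the two averaged curves would point to overfitting; curves that stay close together suggest the model generalizes.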
CHAPTER 7:
SOURCE CODE
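import pandas as pd
import numpy as np
# Machine Learning
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
# Load dataset
df = pd.read_csv("supermarket_sales.csv")
print(df.head())
print(df.info())
print(df.describe())
df.dropna(inplace=True)
# Encode every categorical feature, since the scaler below needs numeric input
for col in ['Gender', 'Customer type', 'Product line', 'Payment']:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
# Feature Engineering
df['Purchase_Day'] = pd.to_datetime(df['Date']).dt.day_name()
df['Purchase_Month'] = pd.to_datetime(df['Date']).dt.month
# Let's say high value customers are those who spent above average
avg_spend = df['Total'].mean()
df['High_Value_Customer'] = (df['Total'] > avg_spend).astype(int)
features = ['Gender', 'Customer type', 'Product line', 'Payment', 'Quantity', 'Tax 5%', 'Total']
X = df[features]
y = df['High_Value_Customer']
# Scaling features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train/test split (the 80-20 ratio and the seed are assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y)
# Baseline model: logistic regression
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)
y_pred = log_model.predict(X_test)
print("Logistic Regression accuracy:", accuracy_score(y_test, y_pred))
# Hyperparameter search; the estimator is missing from the original listing,
# so a random forest is assumed here
param_grid = {
    'min_samples_split': [2, 4]
}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
print(classification_report(y_test, y_pred_best))
# Persist the tuned model
joblib.dump(best_model, "best_model.joblib")
# Sample prediction on the first test row
sample = X_test[:1]
prediction = best_model.predict(sample)
print("Sample prediction:", prediction)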
CHAPTER 8:
RESULT
This study draws inspiration from the research paper "Customer Behavior Analysis and Predictive
Modeling in Supermarket Retail: A Comprehensive Data Mining Approach" by Kavitha Dhanushkodi,
Akila Bala, Nithin Kodipyaka, and V. Shreyas.
Customer Segmentation using K-Means Clustering: The dataset was segmented into three optimal
clusters based on RFM (Recency, Frequency, Monetary) features. The silhouette score for the clustering
was computed as 0.67, indicating a good degree of cluster separation. Cluster visualization was performed
using Principal Component Analysis (PCA) for dimensionality reduction.
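For reference, a silhouette score and a two-dimensional PCA projection of this kind can be obtained as in the sketch below. It runs on synthetic data, so the printed score is not the 0.67 reported above; the three clusters follow the text, while the remaining settings are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled RFM features.
X, _ = make_blobs(n_samples=300, centers=3, n_features=3,
                  cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

# Cluster into three groups and measure how well separated they are.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)  # ranges from -1 (poor) to 1 (well separated)
print(f"silhouette score: {score:.2f}")

# Project to two principal components for a cluster scatter plot.
coords = PCA(n_components=2).fit_transform(X)
print(coords[:3])
```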
Predictive Modeling for Purchase Behavior: A Random Forest classifier was trained on customer
features to predict purchase intent. The model achieved an accuracy of 88.4%, with a precision of 86.2%
and recall of 87.5%. Logistic Regression was also evaluated, yielding an accuracy of 85.9%, confirming
the robustness of the selected features.
Association Rule Mining: The Apriori algorithm was applied to transaction data to identify frequent
itemsets and association rules. With a minimum support of 0.1 and confidence threshold of 0.6, multiple
actionable rules were discovered. For instance, the itemset {Product A, Product B} was found in 12% of
total transactions.
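To make the two thresholds concrete, the sketch below computes support and confidence for a single rule on a handful of made-up baskets. It only illustrates the two measures and is not an implementation of the Apriori search itself; the item names and function names are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Five made-up baskets.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
)]

s = support({"bread", "milk"}, transactions)       # 3 of 5 baskets
c = confidence({"bread"}, {"milk"}, transactions)  # 3 of the 4 bread baskets
print(f"support={s:.2f}, confidence={c:.2f}")
```

With a minimum support of 0.1 and a confidence threshold of 0.6, the rule bread -> milk in this toy example (support 0.60, confidence 0.75) would be retained.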
Data Visualization: Various visual techniques, including cluster scatter plots, heatmaps, and frequency
distribution charts, were employed to interpret customer behavior patterns effectively.
Fig 5.1. Classification report for SVM
Fig 5.2. Classification report for MLP
CHAPTER 9:
ADVANTAGES AND APPLICATIONS
9.1 ADVANTAGES
9.2 DISADVANTAGES
9.2.1 Data Quality Dependency: Accuracy and effectiveness of the models heavily rely on the quality
and completeness of the dataset used.
9.2.2 Model Complexity: Some machine learning algorithms may require tuning and domain expertise,
making implementation more complex.
9.3 APPLICATIONS
Banking Sector: Used by banks to monitor real-time transactions and detect fraudulent activities, helping
in safeguarding customer accounts.
E-Commerce Platforms: Helps online marketplaces identify suspicious transactions and reduce payment
fraud, ensuring secure shopping experiences.
Payment Gateways: Integrated into payment processors like PayPal or Stripe to identify unusual
transaction patterns.
Insurance Companies: Detects anomalies in claim patterns that may indicate fraudulent claims.
CHAPTER 10:
CONCLUSION
In the analysis conducted, we delved into a transactional dataset to uncover valuable insights into customer
behaviour and preferences. Through thorough exploratory data analysis (EDA), we identified trends such
as popular products, customer demographics, and seasonal sales patterns. Data cleaning and preprocessing
were crucial steps to ensure the dataset’s quality, including handling missing values and removing irrelevant
transactions like cancellations. Leveraging RFM (Recency, Frequency, Monetary) features, we employed
predictive analysis techniques with Support Vector Machine (SVM) and Neural Network classifiers to
accurately predict customer segments. Both models exhibited impressive performance metrics, highlighting
their effectiveness in classifying instances into the correct segments. Additionally, sequential pattern
mining using the PrefixSpan algorithm revealed frequent sequences of customer purchase behaviour,
offering valuable insights for targeted marketing and personalized recommendations. By integrating these
analyses, businesses can optimize strategies for inventory management, customer engagement, and overall
operational efficiency, ultimately driving growth and enhancing customer satisfaction.
REFERENCES
[1]. D. Chen, S. L. Sain, and K. Guo, "Data mining for the online retail industry: A case study of RFM
model-based customer segmentation using data mining," J. Database Marketing Customer Strategy
Manage., vol. 19, no. 3, pp. 197–208, Sep. 2012.
[2]. P. Agarwal, "Benefits and issues surrounding data mining and its application in the retail industry," Int.
J. Sci. Res. Publications, vol. 4, no. 7, pp. 1–5, Jan. 2014.
[3]. M. R. Kumar, J. Venkatesh, and A. M. J. M. Z. Rahman, "Data mining and machine learning in retail
business: Developing efficiencies for better customer retention," J. Ambient Intell. Humanized Comput.,
vol. 57, pp. 1–13, Jan. 2021.
[4]. Y. Li, "Applications of data warehousing and data mining in the retail industry," in Proc. Int. Conf.
Services Syst. Services Manage. (ICSSSM), vol. 2, 2005, pp. 1047–1050.
[5]. R. Kohavi, L. Mason, R. Parekh, and Z. Zheng, "Lessons and challenges from mining retail e-commerce
data," Mach. Learn., vol. 57, no. 1, pp. 83–113, Oct. 2004.
[6]. A. M. Hormozi and S. Giles, "Data mining: A competitive weapon for banking and retail industries,"
Inf. Syst. Manage., vol. 21, no. 2, pp. 62–71, Mar. 2004.
[7]. P. A. Mule, "Application of data mining technique for retail industry," in Proc. ICSADL. Singapore:
Springer, 2022, pp. 973–981.
[8]. X. Zhang, J. Edwards, and J. Harding, "Personalised online sales using web usage data mining,"
Comput. Ind., vol. 58, nos. 8–9, pp. 772–782, Dec. 2007.
[10]. R. Srikant and R. Agrawal, "Mining sequential patterns: Generalizations and performance
improvements," in Proc. Int. Conf. Extending Database Technol. Berlin, Germany: Springer, Mar. 1996,
pp. 1–17.
[11]. V. P. Magnini, E. D. Honeycutt Jr., and S. K. Hodge, "Data mining for hotel firms: Use and
limitations," Cornell Hotel Restaurant Admin. Quart., vol. 44, no. 2, pp. 94–105, Apr. 2003.
[12]. T. Ritbumroong, "Analyzing customer behaviour using online analytical mining (OLAM)," in
Integration of Data Mining in Business Intelligence Systems. Singapore: Springer, 2015, pp. 98–118.
[13]. M. Hemalatha, "Market basket analysis—A data mining application in Indian retailing," Int. J. Bus.
Inf. Syst., vol. 10, no. 1, pp. 109–129, 2012.
[14]. C.-K. Huang, T.-Y. Chang, and B. G. Narayanan, "Mining the change of customer behavior in
dynamic markets," Inf. Technol. Manage., vol. 16, no. 2, pp. 117–138, Jun. 2015.