RECOMMENDATION IN BOOK
PHASE 2 SUBMISSION
College code:8100
College Name: University College of Engineering, BIT
Campus, Anna University, Tiruchirappalli-620 024.
Technology: AI
Total number of students in a group:5
Students' details within the group:
1. Viththagi K - 810022205057
2. Sibani Selvi P - 810022205056
3. Arun J - 8100222053301
4. Ranjith M C - 810022205304
5. Gautham R A - 810022205303
Submitted by,
SIBANI SELVI P,
au810022205058
PHASE 2 DOCUMENT: DATA WRANGLING
AND ANALYSIS
Introduction:
Phase 2 of our project is dedicated to data wrangling and
analysis, critical steps in preparing the raw dataset for
building an AI tool that detects online fraud transactions. This
phase employs various data manipulation techniques in
Python to clean, transform, and explore the dataset.
Additionally, we assume a scenario where the tool warns
users about potentially fraudulent transactions at the moment
they are about to initiate them.
Objectives:
1. Cleanse the dataset by addressing inconsistencies, errors,
and missing values to ensure data integrity.
2. Explore the dataset's characteristics through exploratory
data analysis (EDA) to understand distributions and
correlations.
3. Engineer relevant features to enhance model performance
for accurate detection of fraudulent transactions.
Dataset Description:
A dataset for building an AI tool that detects online fraud
transactions typically includes information about both the
transactions themselves and the users' account details. The
fraud1.csv file contains the following feature variables:
1. step
2. type
3. amount
4. nameOrig
5. oldbalanceOrg
6. newbalanceOrig
7. nameDest
8. oldbalanceDest
9. newbalanceDest
10. isFraud
11. isFlaggedFraud
➢ Info: The info() method prints a concise summary of the
dataset, including column labels, datatypes, non-null counts,
and memory usage.
Code:
#Data Description
import pandas as pd
import numpy as np
data = pd.read_csv("/content/fraud1.csv")
data.head()       # first five rows
data.tail()       # last five rows
data.info()       # datatypes, non-null counts, memory usage
data.describe()   # summary statistics for the numeric columns
Output:
#head:
#tail:
#info:
#describe:
2. Handling missing values:
➢ Detecting missing values: Using isnull() and notnull() to locate
missing entries, and isnull().sum() to count them per column.
➢ Handling missing values: Removing incomplete rows with
dropna() or imputing missing entries with fillna(0).
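A minimal sketch of these operations on the data dataframe,
matching the output labels below:
Code:
#Handling Missing Values
data.isnull()                  # True where a value is missing
data.notnull()                 # True where a value is present
print(data.isnull().sum())     # missing-value count per column
data_dropped = data.dropna()   # drop rows with any missing value
data_filled = data.fillna(0)   # replace missing values with 0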
Output:
#isnull():
#notnull():
#isnull().sum():
#dropna():
#fillna(0):
3. Data validation:
➢ Data integrity check: Verifying data consistency and integrity to
eliminate errors (a balance-consistency sketch follows the output below).
➢ Data consistency verification: Ensuring data consistency across
different columns in a dataset.
Code:
#Data Validation
data["type"].unique()
data["oldbalanceOrg"].unique()
data["isFraud"].unique()
Output:
#type:
#oldbalanceOrg:
#isFraud:
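As a further integrity check, the balance arithmetic itself can be
verified; a minimal sketch, under the assumption (following the
dataset's PaySim-style conventions) that debit-type transactions
reduce the origin balance by the transaction amount:
#Integrity check (sketch): origin balance should fall by the amount for debits
debits = data[data['type'].isin(['TRANSFER', 'CASH_OUT'])]
diff = debits['oldbalanceOrg'] - debits['amount'] - debits['newbalanceOrig']
print("Rows violating balance arithmetic:", (diff.abs() > 0.01).sum())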
4. Data Reshaping:
➢ Reshaping rows and columns: Restructuring a dataset's rows and
columns to better suit analysis or visualization needs.
➢ Transposing data: Converting rows into columns and vice versa
as needed.
Code:
#Data Reshaping
df_stacked = data.stack()            # pivot columns into an inner row index
print(df_stacked.head(10))
df_unstacked = df_stacked.unstack()  # pivot the inner index back to columns
print(df_unstacked.head(5))
df_melt = data.melt(id_vars=['type', 'isFraud'])  # wide to long format
print(df_melt.head(10))
transposed_data = data.T             # swap rows and columns
print(transposed_data)
Output:
#stacked():
#unstacked():
#melt():
#transpose():
5. Data merging:
➢ Combining datasets: Joining two datasets on a shared key
column with pd.merge().
Code:
#data merging
data1 = pd.read_csv("/content/crd.csv")
# inner join on the shared "type" column keeps only rows present in both
merged_data = pd.merge(data, data1, on="type", how="inner")
print(merged_data)
Output:
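Because an inner join silently drops non-matching rows, it can help
to check how many rows matched on the key; a minimal sketch using
pandas' indicator flag (same data and data1 as above):
#Merge diagnostic (sketch): label where each row originated
check = pd.merge(data, data1, on="type", how="outer", indicator=True)
print(check['_merge'].value_counts())  # counts of 'both', 'left_only', 'right_only'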
6. Data aggregation:
➢ Grouping data: Grouping dataset rows based on specific
criteria.
➢ Aggregating data: Computing summary statistics for grouped
data.
Code:
#Data Aggregation
aggregated_df = data.groupby('type').agg({'amount': ['mean', 'sum']})
print(aggregated_df)
#data Groupby
mean_value = data.groupby('type')['amount'].mean()
sum_value = data.groupby('type')['amount'].sum()
print("Mean:", mean_value)
print("Sum:", sum_value)
Output:
#data aggregation:
#data groupby:
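Aggregation can also be grouped by the target label itself, which is
directly useful for this project; a minimal sketch comparing the amounts
of fraudulent and legitimate transactions:
#Fraud vs. non-fraud amount statistics (sketch)
fraud_stats = data.groupby('isFraud')['amount'].agg(['count', 'mean', 'max'])
print(fraud_stats)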
8. Exploratory Data Analysis (EDA):
➢ Univariate analysis: Examining the distribution of a single variable.
➢ Bivariate analysis: Examining the relationship between two variables.
➢ Multivariate analysis: Examining interactions among several
variables at once.
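A minimal sketch of the three levels of analysis, assuming matplotlib
and seaborn are available in the environment:
Code:
#Exploratory Data Analysis (sketch)
import matplotlib.pyplot as plt
import seaborn as sns
# Univariate: distribution of transaction amounts
sns.histplot(data['amount'], bins=50)
plt.show()
# Bivariate: amount vs. origin balance, colored by the fraud label
sns.scatterplot(x='oldbalanceOrg', y='amount', hue='isFraud', data=data)
plt.show()
# Multivariate: correlation heatmap of the numeric features
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True)
plt.show()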
Output:
#univariate analysis:
#bivariate analysis:
#multivariate analysis:
9. Feature Engineering:
➢ Creating user profiles: Aggregating user transaction data to
construct comprehensive profiles capturing preferences and
behaviors.
➢ Temporal analysis: Incorporating temporal features such as
time of day or day of week to capture temporal trends in user
behavior.
➢ Content embeddings: Generating embeddings for content
items to represent their characteristics and relationships (see
the sketch after the output below).
Code:
import pandas as pd
from gensim.models import Word2Vec
# Creating user profiles
user_profiles = data.groupby('type').agg({'amount': 'mean'})
print("User Profiles:")
print(user_profiles)
# Temporal analysis: in this dataset 'step' counts hours elapsed in the
# simulation, so the hour of day can be derived with modulo 24
data['hour_of_day'] = data['step'] % 24
print("\nTemporal Analysis (hour of day):")
print(data[['step', 'hour_of_day']].head())
Output:
#user profiles:
#temporal analysis:
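The Word2Vec import in the code above is left unused; as one possible
(assumed) way to realise the content-embeddings idea, each origin
account's sequence of transaction types can be treated as a sentence:
#Content embeddings (sketch): learn a vector per transaction type
sequences = data.groupby('nameOrig')['type'].apply(list).tolist()
model = Word2Vec(sentences=sequences, vector_size=16, window=3, min_count=1)
print(model.wv['TRANSFER'])   # learned embedding for the TRANSFER type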
Assumed Scenario:
As stated in the introduction, we assume a scenario in which the
tool warns a user about a potentially fraudulent transaction at the
moment the user is about to initiate it.
Conclusion:
In Phase 2 the fraud1.csv dataset was cleaned, validated, reshaped,
merged, and aggregated, explored through EDA, and enriched with
engineered features, preparing it for building the fraud detection
model in the next phase.
Dataset link: https://www.kaggle.com/datasets/jainilcoder/online-payment-fraud-detection