
PERSONALIZED CONTENT

RECOMMENDATION IN BOOK

PHASE 2 SUBMISSION

College code:8100
College Name: University College of Engineering, BIT
Campus, Anna University, Tiruchirappalli-620 024.
Technology: AI
Total number of students in a group:5
Student’s detail within the group:
1. Viththagi K - 810022205057
2. Sibani Selvi P - 810022205056
3. Arun J - 8100222053301
4. Ranjith M C - 810022205304
5. Gautham R A - 810022205303

Submitted by,
SIBANI SELVI P,
au810022205058
PHASE 2 DOCUMENT: DATA WRANGLING AND ANALYSIS

Introduction:
Phase 2 of our project is dedicated to data wrangling and analysis, the critical steps in preparing the raw dataset for building an AI tool that detects fraudulent online transactions. This phase employs various data manipulation techniques in Python to clean, transform, and explore the dataset. Additionally, we assume a scenario in which the project warns users about potentially fraudulent transactions just before they initiate them.
Objectives:
1. Cleanse the dataset by addressing inconsistencies, errors, and missing values to ensure data integrity.
2. Explore the dataset's characteristics through exploratory data analysis (EDA) to understand distributions and correlations.
3. Engineer relevant features to enhance model performance for accurate detection of fraudulent transactions.
Dataset Description:
A dataset for building an AI tool that detects fraudulent online transactions typically includes information about both the transactions themselves and the accounts involved. fraud1.csv contains the following feature variables:
1. step
2. type
3. amount
4. nameOrig
5. oldbalanceOrg
6. newbalanceOrig
7. nameDest
8. oldbalanceDest
9. newbalanceDest
10. isFraud
11. isFlaggedFraud
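
For reference, the loaded file can be checked against this feature list. A minimal sketch, assuming the same file path used in the code sections below:

import pandas as pd
data = pd.read_csv("/content/fraud1.csv")
# Compare the actual columns against the feature list above;
# an empty set means every expected feature is present.
expected = ["step", "type", "amount", "nameOrig", "oldbalanceOrg",
            "newbalanceOrig", "nameDest", "oldbalanceDest",
            "newbalanceDest", "isFraud", "isFlaggedFraud"]
print(set(expected) - set(data.columns))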

Data Wrangling Techniques:


1. Data Description

➢ Head: The head() function displays the top rows of a dataset.

➢ Tail: The tail() function displays the bottom rows of a dataset.

➢ Info: The info() method prints information about the dataset: column labels, data types, non-null counts, and memory usage.

➢ Describe: The describe() method calculates summary statistics such as percentiles, mean, and standard deviation of the numerical columns.

Code:
# Data Description
import pandas as pd
import numpy as np

data = pd.read_csv("/content/fraud1.csv")
print(data.head())      # first five rows
print(data.tail())      # last five rows
data.info()             # column labels, dtypes, non-null counts, memory usage
print(data.describe())  # summary statistics for numeric columns

Output:

#head:

#tail:

#info:
#describe:

2. Null Data Handling:

➢ Null data identification: Finding missing or empty values within the dataset.

➢ Null data imputation: Filling in missing values within the dataset (a median-imputation sketch follows this section's output).

➢ Null data removal: Eliminating rows or columns with missing values from the dataset.

Code:

# Null Data Handling
data.isnull()          # True where a value is missing
data.notnull()         # True where a value is present
data.isnull().sum()    # count of missing values per column
data.dropna()          # drop rows that contain missing values
data.fillna(0)         # replace missing values with 0

Output:

#isnull():

#notnull():
#isnull().sum():

#dropna():

#fillna(0):
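
fillna(0) replaces every missing value with a constant zero; for numeric columns a statistic such as the median is often a safer default. A minimal sketch of this alternative, reusing the data frame loaded above:

# Impute each numeric column with its own median instead of a constant.
numeric_cols = data.select_dtypes(include="number").columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].median())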
3. Data Validation:
➢ Data integrity check: Verifying data consistency and integrity to eliminate errors (a balance-consistency sketch follows this section's output).
➢ Data consistency verification: Ensuring data consistency across different columns in a dataset.

Code:

# Data Validation
data["type"].unique()           # distinct transaction types
data["oldbalanceOrg"].unique()  # distinct origin balances before the transaction
data["isFraud"].unique()        # label column should contain only 0 and 1

Output:

#type:

#oldbalanceOrg:
#isFraud:
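
Beyond listing unique values, an integrity check can test whether related columns agree with one another. A minimal sketch, assuming that a debit should reduce the origin balance by the transaction amount (an approximation; this dataset contains legitimate exceptions, so the rule is illustrative rather than exact):

# Rows where the origin balances are inconsistent with the amount moved.
inconsistent = data[
    (data["oldbalanceOrg"] - data["amount"] - data["newbalanceOrig"]).abs() > 0.01
]
print(len(inconsistent), "rows violate the balance rule")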

4. Data Reshaping:
➢ Reshaping rows and columns: Restructuring the data in a dataset to better suit analysis or visualization needs (a pivot-table sketch follows this section's output).
➢ Transposing data: Converting rows into columns and vice versa as needed.

Code:

# Data Reshaping
df_stacked = data.stack()            # pivot the columns into a row MultiIndex
print(df_stacked.head(10))
df_unstacked = df_stacked.unstack()  # invert the stack, back to wide form
print(df_unstacked.head(5))
df_melt = data.melt(id_vars=['type', 'isFraud'])  # wide to long format
print(df_melt.head(10))
transposed_data = data.T             # swap rows and columns
print(transposed_data)

Output:

#stacked():

#unstacked():
#melt():

#transpose():
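
pivot_table() is another common reshaping tool. A minimal sketch that cross-tabulates transaction types against the fraud label (illustrative; the column names are from the feature list above):

# Count of transactions per type, split by the isFraud label.
pivot = data.pivot_table(index="type", columns="isFraud",
                         values="amount", aggfunc="count", fill_value=0)
print(pivot)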
5. Data Merging:

➢ Combining datasets: Merging multiple datasets or data sources to enrich the information available for analysis.

➢ Joining data: Joining datasets based on common columns or keys.

Code:
# Data Merging
data1 = pd.read_csv("/content/crd.csv")
# Inner join keeps only rows whose "type" value appears in both files.
merged_data = pd.merge(data, data1, on="type", how="inner")
print(merged_data)

Output:
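
Note that "type" takes only a handful of values, so an inner join on it pairs every matching row from both files (a many-to-many join) and can multiply the row count sharply. In practice, joins are usually performed on a key that is unique in at least one table, such as an account or transaction identifier.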

6. Data Aggregation:
➢ Grouping data: Grouping dataset rows based on specific criteria.
➢ Aggregating data: Computing summary statistics for grouped data (a fraud-rate sketch follows this section's output).
Code:

# Data Aggregation
# Mean and total amount per transaction type, computed in one pass.
aggregated_df = data.groupby('type').agg({'amount': ['mean', 'sum']})
print(aggregated_df)
# Data Groupby
mean_value = data.groupby('type')['amount'].mean()
sum_value = data.groupby('type')['amount'].sum()

print("Mean:", mean_value)
print("Sum:", sum_value)

Output:
#data aggregation:

#data groupby:
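
Aggregation also offers a quick profile of the target variable itself: since isFraud is coded 0/1, its mean within each group is that group's fraud rate. A minimal sketch:

# Share of fraudulent transactions per transaction type.
fraud_rate = data.groupby('type')['isFraud'].mean()
print(fraud_rate.sort_values(ascending=False))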

Data Analysis Techniques:

7. Exploratory Data Analysis (EDA):

➢ Univariate Analysis: Analysing individual variables to understand their distributions and characteristics.
➢ Bivariate Analysis: Investigating relationships between pairs of variables to identify correlations and dependencies.
➢ Multivariate Analysis: Exploring interactions among multiple variables to uncover complex patterns and trends (a correlation-heatmap sketch follows this section's output).

Code:

# Data Analysis Techniques

# Univariate Analysis
import matplotlib.pyplot as plt
import seaborn as sns
sns.histplot(data['amount'].tail(15), bins=20)  # distribution of a single variable
plt.title("univariate analysis")
plt.show()
# Bivariate Analysis
x = data["amount"].head(10)
y = data["oldbalanceOrg"].head(10)
plt.scatter(x, y)  # relationship between two variables
plt.title("bivariate analysis")
plt.show()
# Multivariate Analysis
sns.pairplot(data.head(10))  # pairwise plots across the numeric columns
plt.title("multivariate analysis")
plt.show()

Output:

#univariate analysis:
#bivariate analysis:

#multivariate analysis:
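
A correlation heatmap is a compact complement to the pairplot for multivariate analysis. A minimal sketch, reusing the matplotlib and seaborn imports above:

# Pairwise correlations between the numeric columns.
corr = data.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("correlation heatmap")
plt.show()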
8. Feature Engineering:
➢ Creating User Profiles: Aggregating user interaction data to construct comprehensive user profiles capturing preferences and behaviors.
➢ Temporal Analysis: Incorporating temporal features such as time of day or day of week to capture temporal trends in user behavior.
➢ Content Embeddings: Generating embeddings for content items to represent their characteristics and relationships (a Word2Vec sketch follows this section's output).
Code:

import pandas as pd
# Creating user profiles: mean transaction amount per transaction type.
user_profiles = data.groupby('type').agg({'amount': 'mean'})
print("User Profiles:")
print(user_profiles)
# Temporal analysis: in this dataset each 'step' represents one hour of
# simulated time, so the hour of day can be derived from it directly,
# leaving the isFraud label untouched.
data['hour'] = data['step'] % 24
print("\nTemporal Analysis (hour of day):")
print(data[['step', 'hour']].head())

Output:

#user profiles:

#temporal analysis:
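
The Content Embeddings step has no code above. A minimal sketch using gensim's Word2Vec, treating each origin account's sequence of transaction types as a "sentence" (an assumed framing chosen here for illustration; any column of categorical tokens could be embedded the same way):

from gensim.models import Word2Vec
# One "sentence" per origin account: the sequence of its transaction types.
sentences = data.groupby("nameOrig")["type"].apply(list).tolist()
model = Word2Vec(sentences, vector_size=16, window=3, min_count=1, seed=42)
print(model.wv["TRANSFER"])  # embedding vector for the TRANSFER type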

Assumed Scenario:

➢ Scenario: The project aims to build an AI tool that raises user awareness of fraudulent online transactions.

➢ Objective: Enhance user engagement and satisfaction by detecting fraudulent transactions so that legitimate ones proceed smoothly (a simple rule-based alert sketch follows).

➢ Target Audience: Digital platform users who transact online.
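
To make the scenario concrete, a very simple rule-based alert could run before a transaction is confirmed. A minimal sketch under assumed thresholds; the function name and the 200,000 cutoff are hypothetical, chosen only to illustrate the idea:

def pre_transaction_alert(tx_type, amount, threshold=200_000):
    # Hypothetical rule: warn on large TRANSFER or CASH_OUT requests.
    if tx_type in ("TRANSFER", "CASH_OUT") and amount > threshold:
        return "Warning: this transaction matches a common fraud pattern."
    return "No alert."

print(pre_transaction_alert("TRANSFER", 250_000))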

Conclusion:

Phase 2 of the project focuses on data wrangling and analysis to prepare the dataset for building an AI tool that detects fraudulent online transactions. By employing Python-based data manipulation techniques under the assumed fraud-detection scenario, we aim to transform raw data into actionable insights for enhancing user experience and engagement on digital platforms.

Dataset link:
https://www.kaggle.com/datasets/jainilcoder/online-payment-fraud-detection
