
TOURIST ATTRACTIONS USING PANDAS AND KNN

A MINI PROJECT REPORT

By
Shaik.Sameena
(322103210198)

Under the esteemed guidance of


DR P. V. S. LAKSHMI JAGADAMBA
Professor & Head of the Department
Department of Computer Science and Engineering
GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING FOR WOMEN
[Approved by AICTE NEW DELHI, Affiliated to Andhra University]

[Accredited by National Board of Accreditation (NBA) for B.Tech. CSE, ECE & IT – Valid from 2019-22 and 2022-25]

[Accredited by National Assessment and Accreditation Council (NAAC) – Valid from 2022-27]


Kommadi, Madhurawada, Visakhapatnam – 530048

2024–2025

GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING FOR WOMEN

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the mini project report titled “Tourist Attractions” is a bonafide
work of a III B.Tech. student in the Department of Computer Science and Engineering, Gayatri
Vidya Parishad College of Engineering for Women, affiliated to Andhra University,
Visakhapatnam, during the academic year 2024-2025, Semester-1.

Dr P. V. S. Lakshmi Jagadamba
Professor & Head of the Department
Department of CSE

Shaik Sameena
322103210198

ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of any task would
be incomplete without the mention of people who made it possible and whose
constant guidance and encouragement crown all the efforts with success.
We feel elated to extend our sincere gratitude to Mr G Shankar Rao, Assistant
Professor, for his encouragement throughout the analysis phase of the project. His
annotations, suggestions and constructive criticism were key to the successful
completion of this work, and we thank him for providing all the required facilities.
We express our deep sense of gratitude to our mentor Dr. P. V. S. Lakshmi
Jagadamba, Professor and Head of the Department of Computer Science and
Engineering, for her guidance, for sharing her valuable opinions during the
development of the project, and for providing lab sessions and extra hours to
complete the project.
We would like to take this opportunity to express our profound sense of
gratitude to Vice Principal, Dr. G. Sudheer for allowing us to utilize the college
resources thereby facilitating the successful completion of our project.
We would like to take the opportunity to express our profound sense of
gratitude to the revered Principal, Dr. R. K. Goswami for all the help and
support towards the successful completion of our project.
We are also thankful to both teaching and non-teaching faculty of the
Department of Computer Science and Engineering for giving valuable
suggestions for our project.

Table of Contents

1. Introduction
2. Importing the Dataset and Plotting a Simple Line Plot
   2.1 Introduction
   2.2 Using read_csv
   2.3 Conversion into a Series object
   2.4 Conversion into a DataFrame object
   2.5 Box plot
3. Data Preprocessing
   3.1 Handling missing values
   3.2 Sorting
   3.3 Principal component analysis
4. K-Fold Cross Validation
5. Conclusion
6. References

ABSTRACT
The dataset contains key attributes such as the name, type, region,
locality, and geolocation of tourist attractions. Various data analysis
techniques, including line plots, scatter plots, histograms, box plots,
and principal component analysis (PCA), are applied to extract
meaningful insights. A key focus of the analysis is the distribution of
attractions across different regions, which is visualized through a
series of plots that highlight regional variations in the number and
type of tourist attractions.
Data cleaning and preprocessing steps are performed to ensure
accuracy and completeness, and appropriate visualizations are
generated to represent trends and distributions. The project also
uses PCA to reduce the dimensionality of the dataset, identifying the
most influential factors in determining the distribution of attractions.
By leveraging data visualization tools and techniques, this analysis
provides a deeper understanding of Russia's tourism landscape,
offering valuable insights for potential tourists, travel agencies, and
policymakers looking to enhance the tourism sector in Russia.

1. INTRODUCTION
TOURIST ATTRACTION DATASET

Description:
A brief overview or additional information about the tourist attraction,
including its historical significance, uniqueness, or importance.
The Tourist Attractions Dataset contains information about 100 popular tourist
destinations from around the world.
 This dataset includes essential details about each attraction, such as its
name, location, type, opening hours, ratings, and a brief description.
 It is a valuable resource for travelers, researchers, and professionals in
the tourism industry.
Columns and Their Descriptions:
1. ID
 Description: A unique numerical identifier for each tourist attraction. It is
used to distinguish each entry in the dataset.
2. Tourist Attraction Name
 Description: The official name of the tourist attraction or landmark.
3. Country
 Description: The country where the tourist attraction is located.
4. City
 Description: The city or region where the attraction is situated.
5. Latitude
 Description: The geographical latitude of the tourist attraction, which
indicates its position north or south of the Equator.
6. Longitude
 Description: The geographical longitude of the tourist attraction, which
indicates its position east or west of the Prime Meridian.

7. Type of Attraction
 Description: The category of the attraction. It could include:
o Natural Landmark
o Historical Site
o Landmark
o Religious Site
o Archaeological Site
o Museum
o Shopping Center
8. Opening Hours
 Description: The standard operating hours during which the tourist
attraction is accessible to the public. Some attractions may have different
hours based on the season or day of the week.
9. Rating
 Description: The average rating given by visitors based on online reviews
or surveys. The rating is usually presented as a numerical value (e.g.,
4.5/5).

Categories of Attractions in the Dataset:

1. Natural Landmarks:
o These are attractions that are naturally occurring and have
geological, historical, or environmental significance. They could be
mountains, rivers, waterfalls, or national parks.
2. Historical Sites:
o These attractions hold cultural or historical importance. They
include ancient ruins, preserved sites, and monuments.
3. Landmarks:
o Iconic man-made structures that have become symbols of the
cities or countries in which they are located.
4. Religious Sites:
o Locations that hold spiritual or religious significance, such as
temples, churches, and mosques, which attract both pilgrims and
tourists.
5. Archaeological Sites:
o Locations that contain remains of past civilizations, often
uncovered through excavation and preserved for their historical or
cultural value.
6. Museums and Galleries:
o Institutions that house and display art collections, historical
artifacts, scientific exhibits, or cultural objects.
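
A quick way to check how many attractions of each category the dataset actually contains is pandas' value_counts() (a minimal sketch, assuming the dataframe df loaded in the next section, where the category column is named 'type'):

# Count how many attractions fall into each category
print(df['type'].value_counts())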

2. IMPORTING THE DATASET AND PLOTTING A SIMPLE LINE PLOT

2.1 INTRODUCTION
Using the read_csv() function from the pandas package, you can import tabular
data from a CSV file into a pandas DataFrame by passing the file name as a
parameter (e.g. pd.read_csv("filename.csv")). Remember that pandas was given
the alias pd, so pd is used to call pandas functions.
Note: If the CSV file contains non-UTF-8 encoded characters, you may encounter
an encoding error. In that case, you can specify the encoding of the file using
the encoding argument.
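
For example (a minimal sketch, assuming the russian_tourist_attractions.csv file used later in this report; the cp1251 fallback is an illustrative choice for Russian-language data, not taken from the report):

import pandas as pd

# Default read assumes UTF-8; retry with an explicit encoding if decoding fails
try:
    df = pd.read_csv('russian_tourist_attractions.csv')
except UnicodeDecodeError:
    df = pd.read_csv('russian_tourist_attractions.csv', encoding='cp1251')

print(df.head())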

2.2 USING READ_CSV


Aim: To load the dataset into pandas and draw a simple line plot of the number of attractions per region
Source code:-

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset and remove stray whitespace from the column names
df = pd.read_csv('russian_tourist_attractions.csv')
df.columns = df.columns.str.strip()
print(df.columns)

# Count the number of attractions in each region
region_counts = df['region'].value_counts()

# Draw a simple line plot of the per-region counts
plt.figure(figsize=(10, 6))
region_counts.plot(kind='line', marker='o', color='blue', label='Attractions per Region')
plt.title("Number of Attractions in Each Region")
plt.xlabel("Region")
plt.ylabel("Number of Attractions")
plt.xticks(rotation=45, ha='right')  # Rotate x labels for better readability
plt.legend()
plt.grid(True)
plt.show()

Output: (Fig: simple line plot)

Index(['name', 'type', 'region', 'locality', 'geolocation'], dtype='object')

2.3 CONVERSION INTO SERIES OBJECT
To convert part of a dataset to a Series, we select a single column from it: a Series is a
one-dimensional data type, whereas a dataset (DataFrame) is two-dimensional, consisting of
many rows and columns.

Aim: To convert a part of the dataset into a Series.


Source Code:
df.columns = df.columns.str.strip()
# Extract the 'locality' column as a one-dimensional Series
series = pd.Series(df['locality'])
print(series)
Output: (Fig: Series object)

0       Blagoveshchensk
1       Ekaterinburg
2       Safonovka
3       Tomsk
4       Novosibirsk
             ...
5236    Khunzakh
5237    Derbent
5238    NaN
5239    Starocherkasskaya
5240    NaN
Name: locality, Length: 5241, dtype: object

2.4 CONVERSION INTO DATAFRAME OBJECT

The process of converting a dataset to a DataFrame depends on the format in which the
dataset is stored, for example a CSV file, an Excel file, a JSON file or a SQL database table.
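
For instance, each of these formats has a corresponding pandas reader that returns a DataFrame (a minimal sketch; the file names other than russian_tourist_attractions.csv are hypothetical):

import pandas as pd

# Each reader returns a pandas DataFrame
df_csv = pd.read_csv('russian_tourist_attractions.csv')      # CSV file
# df_xlsx = pd.read_excel('attractions.xlsx')                # Excel file (requires openpyxl)
# df_json = pd.read_json('attractions.json')                 # JSON file
# df_sql = pd.read_sql('SELECT * FROM attractions', con)     # SQL table, given a DB connection `con`

print(type(df_csv))   # <class 'pandas.core.frame.DataFrame'>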

Aim: To convert the dataset into a DataFrame.

Source Code :-

# Wrap the loaded data in a DataFrame object (df is already a DataFrame, so this creates a copy)
dataframe = pd.DataFrame(df)
print(dataframe)

Output:-
name type \
0 The "second" home shopping store "IJ Churin an... architecture
1 "Town of security officers" architecture
2 "Palace for the beloved" architecture
3 "The House with The Firebird" (manor Zhelyabov... architecture
4 "House with the ghosts" architecture
... ... ...
5236 Khunzakh fortress defenses
5237 Tsitadel 'Regional' defenses
5238 Citadel "Pillai" defenses
5239 Anna's fortress defenses
5240 Peter-Pavel's Fortress defenses

region locality \
0 Amur region Blagoveshchensk
1 Sverdlovsk region Ekaterinburg
2 Kursk region Safonovka
3 Tomsk region Tomsk
4 Novosibirsk region Novosibirsk
... ... ...
5236 The Republic of Dagestan Khunzakh
5237 The Republic of Dagestan Derbent
5238 Kaliningrad region NaN
5239 Rostov region Starocherkasskaya
5240 St. Petersburg NaN

2.5 BOX PLOT

To produce a box plot, we modify the plotting section of the code. A box plot shows the
distribution of a variable rather than individual counts. Since we are working with tourist
attractions per region, we can draw a box plot of the number of attractions per region by
grouping the data by region and counting the attractions in each group.

Aim: To visualize the distribution of the number of attractions per region with a box plot


Source Code:-

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset and clean the column names
df = pd.read_csv('russian_tourist_attractions.csv')
df.columns = df.columns.str.strip()
print(df.columns)

# Number of attractions in each region
region_counts = df.groupby('region').size()

# Box plot of the per-region counts
plt.figure(figsize=(10, 6))
region_counts.plot(kind='box', vert=False, color='blue', patch_artist=True)
plt.title("Distribution of Number of Attractions per Region")
plt.xlabel("Number of Attractions")
plt.ylabel("Region")
plt.grid(True)
plt.show()

Output: (Fig: box plot)

Index(['name', 'type', 'region', 'locality', 'geolocation'], dtype='object')

3. DATA PREPROCESSING

3.1 HANDLING MISSING VALUES
Missing values, often represented as NaN (Not a Number), can cause problems
during data processing and analysis. These gaps in the data can lead to incorrect
analysis and misleading conclusions.

Aim: To identify missing values in the dataset and decide how to handle them


Source Code :-
# Count the number of missing (NaN) values in each column
null_values = df.isnull().sum()
print("Null Values in Each Column:")
print(null_values)

Output:

Null Values in Each Column:
name             0
type             0
region           0
locality       842
geolocation      0
dtype: int64
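
Only the locality column has missing entries. The two usual remedies are dropping the incomplete rows or filling them with a placeholder; the following is a minimal sketch (the 'Unknown' placeholder is an illustrative choice, not taken from the report):

# Option 1: drop the rows whose 'locality' is missing
df_dropped = df.dropna(subset=['locality'])

# Option 2: fill the missing localities with a placeholder value
df_filled = df.fillna({'locality': 'Unknown'})

# 5241 rows before dropping; 5241 - 842 = 4399 rows afterwards
print(len(df), len(df_dropped))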

3.2 Sorting
To sort a dataset in Jupyter, first load the CSV with pandas.read_csv(). Then use
df.sort_values(by='column_name') to sort by a specific column, such as 'name'. Display
the result with display(sorted_df.head()) to see the sorted data in the notebook.

Aim: To sort the dataset by a chosen column


Source Code:-
import pandas as pd
import IPython.display as display

# Load the dataset and inspect it
df = pd.read_csv('russian_tourist_attractions.csv')
print("First 5 rows of the dataset:")
print(df.head())
print("Column names in the dataset:", df.columns)

# Clean the column names, then sort by the 'name' column if it exists
df.columns = df.columns.str.strip()
if 'name' in df.columns:
    sorted_df = df.sort_values(by='name', ascending=True)  # Change 'name' to your desired column
    print("First 5 rows of the sorted dataframe:")
    display.display(sorted_df.head())
else:
    print("Column 'name' does not exist. Available columns are:", df.columns)
Output:-
3.3 Principal component analysis
Principal Component Analysis (PCA) is a dimensionality reduction technique used to
transform features into a set of linearly uncorrelated components. These components
capture the maximum variance in the dataset.
Source Code:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd

# Select only the numeric columns for PCA
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns

if len(numeric_columns) > 0:
    # Separate features
    X = df[numeric_columns]

    # Standardize the data
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Apply PCA to retain 95% of the variance
    pca = PCA(n_components=0.95)
    X_pca = pca.fit_transform(X_scaled)
    explained_variance = pca.explained_variance_ratio_

    # Print the explained variance and number of components
    print("Explained Variance Ratio of Components:", explained_variance)
    print("Number of components selected:", pca.n_components_)

    # Plot the cumulative explained variance vs number of components
    plt.figure(figsize=(8, 6))
    plt.plot(range(1, len(explained_variance) + 1), explained_variance.cumsum(), marker='o')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.title('Explained Variance vs. Number of Components')
    plt.grid()
    plt.show()
else:
    print("No numeric columns available for PCA.")
Output:
Explained Variance Ratio of Components: [0.21369912 0.11971959 0.09238384 0.08994039
0.07685925 0.07479569 0.06741785 0.05907578 0.05584142 0.04849683 0.04017354
0.0332042 ]
Number of components selected: 12

4. K-Fold Cross Validation
To perform K-Fold Cross Validation with scikit-learn in Python, you can use the KFold class
from sklearn.model_selection. K-Fold cross-validation is a technique used to evaluate machine
learning models: it splits the dataset into 'k' subsets (folds), uses each fold in turn as the
test set while the remaining folds are used for training, and averages the model's performance
across the repetitions.
Customizing K-Fold Cross Validation:
Dataset: We assume a dataset is loaded into the df DataFrame. In the example below the target
variable is the 'type' column, which is dropped from the features to obtain X.
Standardization: The feature data is standardized using StandardScaler so that each feature
has a mean of 0 and a standard deviation of 1. This is important when working with models
that are sensitive to feature scaling.
Model: We use RandomForestClassifier as an example model. You can replace it with any other
model, such as SVC or LogisticRegression.

Source Code:
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier

# Small illustrative dataframe standing in for the tourist-attractions data
df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'region': ['North', 'South', 'East', 'West', 'North'],
    'locality': ['Urban', 'Suburban', 'Urban', 'Suburban', 'Urban'],
    'geolocation': ['X1', 'X2', 'X3', 'X4', 'X5'],
    'type': [1, 2, 1, 2, 1],
    'value': [10, 20, 30, 40, 50]  # Example of a numeric column
})

# Label-encode the categorical (string) columns
categorical_columns = ['name', 'region', 'locality', 'geolocation']
label_encoders = {}
for col in categorical_columns:
    if df[col].dtype == 'object':  # Only apply to columns that are categorical (strings)
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].astype(str))  # Convert to str if needed
        label_encoders[col] = le

# Fill any missing numeric values with the column mean
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())

# Separate features and target
target_column = 'type'  # Modify this to your actual target column name
X = df.drop(columns=target_column)
y = df[target_column]

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 5-fold cross validation with a random forest classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_scaled, y, cv=kf, scoring='accuracy')

# Output results
print(f"Cross-validation scores for each fold: {cv_scores}")
print(f"Mean accuracy across all folds: {cv_scores.mean():.3f}")
print(f"Standard deviation of the accuracy across folds: {cv_scores.std():.3f}")

Output:-

Cross-validation scores for each fold: [0. 1. 1. 1. 0.]

Mean accuracy across all folds: 0.600

Standard deviation of the accuracy across folds: 0.490

5. CONCLUSION
In this project, we focused on cleaning and organizing the russian_tourist_attractions.csv
dataset, which includes valuable information about tourist destinations in
Russia. The dataset contains various attributes such as name, type, region,
locality, and geolocation of each attraction. To ensure a high-quality dataset, we
first handled any missing values, either by filling them with appropriate
imputation methods or removing incomplete entries. This process was essential
to ensure the integrity of the data for analysis.
After cleaning the dataset, we applied sorting to the relevant columns (name,
type, region, locality, and geolocation) to create a more structured and organized
view of the data. Sorting helps identify trends, outliers, and relationships among
the attributes, making it easier to draw meaningful insights. Sorting by these
attributes also prepares the dataset for further analysis or modeling, as it
provides a cleaner and more accessible structure to work with.
Ultimately, this process demonstrated the importance of data preprocessing in
ensuring that datasets are ready for in-depth analysis or machine learning tasks.
By focusing on cleaning, handling missing data, and sorting, we were able to
create a refined dataset that can serve as the foundation for future analyses, such
as clustering, predictive modeling, or trend forecasting. This project highlights
the critical steps in transforming raw data into a more structured and actionable
form, enabling deeper insights into Russian tourist attractions and their
geographical distribution.

6. REFERENCES

 Kaggle for the dataset: https://www.kaggle.com

 https://www.geeksforgeeks.org/data-mining-techniques/

 https://www.javatpoint.com

