Tourist Attractions Dataset
By
Shaik.Sameena
(322103210198)
[Accredited by National Board of Accreditation (NBA) for B.Tech. CSE, ECE & IT – Valid from 2019-22 and 2022-25]
2024–2025
GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING FOR WOMEN
CERTIFICATE
This is to certify that the mini project report titled “Tourist Attractions” is a bonafide
work of a III B.Tech. student in the Department of Computer Science and Engineering, Gayatri
Vidya Parishad College of Engineering for Women, affiliated to Andhra University,
Visakhapatnam, during the academic year 2024-2025, Semester-1.
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of any task would
be incomplete without the mention of people who made it possible and whose
constant guidance and encouragement crown all the efforts with success.
We feel elated to extend our sincere gratitude to Mr G Shankar Rao, Assistant
Professor, for his encouragement throughout the analysis of the project. His
annotations, suggestions and criticism are the key behind the successful
completion of the thesis, and we thank him for providing us all the required facilities.
We express our deep sense of gratitude and thanks to our mentor Dr. P. V. S.
Lakshmi Jagadamba, Professor and Head of the Department of Computer
Science and Engineering, for her guidance, for sharing her valuable opinions
on the development of the project, and for providing lab sessions and extra
hours to complete the project.
We would like to take this opportunity to express our profound sense of
gratitude to the Vice Principal, Dr. G. Sudheer, for allowing us to utilize the
college resources, thereby facilitating the successful completion of our project.
We would also like to take this opportunity to express our profound sense of
gratitude to the revered Principal, Dr. R. K. Goswami, for all the help and
support towards the successful completion of our project.
We are also thankful to both teaching and non-teaching faculty of the
Department of Computer Science and Engineering for giving valuable
suggestions for our project.
Table of Contents
1. Introduction
2. Loading and Exploring the Dataset
   2.1 Introduction
   2.3 Conversion into Series Object
   2.4 Conversion into DataFrame Object
   2.5 Box Plot
3. Data Preprocessing
   3.1 Handling Missing Values
   3.2 Sorting
   3.3 Principal Component Analysis
4. K-Fold Cross Validation
5. Conclusion
6. References
ABSTRACT
The dataset contains key attributes such as the name, type, region,
locality, and geolocation of tourist attractions. Various data analysis
techniques, including line plots, scatter plots, histograms, box plots,
and principal component analysis (PCA), are applied to extract
meaningful insights. A key focus of the analysis is the distribution of
attractions across different regions, which is visualized through a
series of plots that highlight regional variations in the number and
type of tourist attractions.
Data cleaning and preprocessing steps are performed to ensure
accuracy and completeness, and appropriate visualizations are
generated to represent trends and distributions. The project also
uses PCA to reduce the dimensionality of the dataset, identifying the
most influential factors in determining the distribution of attractions.
By leveraging data visualization tools and techniques, this analysis
provides a deeper understanding of Russia's tourism landscape,
offering valuable insights for potential tourists, travel agencies, and
policymakers looking to enhance the tourism sector in Russia.
1. INTRODUCTION
TOURIST ATTRACTION DATASET
The Tourist Attractions Dataset contains information about 100 popular tourist
destinations from around the world. It includes essential details about each
attraction, such as its name, location, type, opening hours, ratings, and a brief
description, making it a valuable resource for travelers, researchers, and
professionals in the tourism industry.
Columns and Their Descriptions:
1. ID
Description: A unique numerical identifier for each tourist attraction. It is
used to distinguish each entry in the dataset.
2. Tourist Attraction Name
Description: The official name of the tourist attraction or landmark.
3. Country
Description: The country where the tourist attraction is located.
4. City
Description: The city or region where the attraction is situated.
5. Latitude
Description: The geographical latitude of the tourist attraction, which
indicates its position north or south of the Equator.
6. Longitude
Description: The geographical longitude of the tourist attraction, which
indicates its position east or west of the Prime Meridian.
7. Type of Attraction
Description: The category of the attraction. It could include:
o Natural Landmark
o Historical Site
o Landmark
o Religious Site
o Archaeological Site
o Museum
o Shopping Center
8. Opening Hours
Description: The standard operating hours during which the tourist
attraction is accessible to the public. Some attractions may have different
hours based on the season or day of the week.
9. Rating
Description: The average rating given by visitors based on online reviews
or surveys. The rating is usually presented as a numerical value (e.g.,
4.5/5).
10. Description
Description: A brief overview or additional information about the tourist attraction,
including its historical significance, uniqueness, or importance.
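To make the schema concrete, here is a minimal inspection sketch; the file name and the exact column labels ('Type of Attraction', 'Rating') are assumptions taken from the list above rather than verified against the CSV.
import pandas as pd

# Load the attractions file (file name assumed for illustration)
df = pd.read_csv('tourist_attractions.csv')

# Inspect the schema: column names, dtypes, and the first few rows
print(df.dtypes)
print(df.head())

# Distribution of the categories and ratings described above
print(df['Type of Attraction'].value_counts())
print(df['Rating'].describe())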
1. Natural Landmarks:
o These are attractions that are naturally occurring and have
geological, historical, or environmental significance. They could be
mountains, rivers, waterfalls, or national parks.
2. Historical Sites:
o These attractions hold cultural or historical importance. They
include ancient ruins, preserved sites, and monuments.
3. Landmarks:
o Iconic man-made structures that have become symbols of the
cities or countries in which they are located.
4. Religious Sites:
o Locations that hold spiritual or religious significance, such as
temples, churches, and mosques, which attract both pilgrims and
tourists.
5. Archaeological Sites:
o Locations that contain remains of past civilizations, often
uncovered through excavation and preserved for their historical or
cultural value.
6. Museums and Galleries:
o Institutions that house and display art collections, historical
artifacts, scientific exhibits, or cultural objects.
2.1 INTRODUCTION
Using the read_csv() function from the pandas package, you can
import tabular data from a CSV file into a pandas DataFrame by
passing the file name as a parameter (e.g.
pd.read_csv("filename.csv")). Remember that you gave pandas an
alias (pd), so you will use pd to call pandas functions.
Note: If the CSV file contains non-UTF-8 encoded characters, you may
encounter an encoding error. You can try specifying the encoding of the file
using the encoding argument.
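For example, a minimal sketch (the latin-1 value is only an illustrative guess at the file's encoding, not a confirmed property of this dataset):
import pandas as pd

# Pass an explicit encoding if the default UTF-8 decoding fails
df = pd.read_csv('russian_tourist_attractions.csv', encoding='latin-1')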
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset and tidy the column names
df = pd.read_csv('russian_tourist_attractions.csv')
df.columns = df.columns.str.strip()
print(df.columns)

# Count attractions per region and plot the counts as a line plot
region_counts = df.groupby('region').size()
plt.figure(figsize=(10, 6))
region_counts.plot(kind='line', label='Attractions per region')
plt.xlabel("Region")
plt.ylabel("Number of Attractions")
plt.legend()
plt.grid(True)
plt.show()
2.3 CONVERSION INTO SERIES OBJECT
To convert a dataset to a Series, we can take any single column from the dataset,
since a Series is a 1-dimensional datatype, whereas a dataset (DataFrame) is a
2-dimensional datatype consisting of many rows and columns.
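A minimal sketch, continuing with the df loaded in section 2.1 and assuming we select the locality column (which matches the output below):
# Selecting a single column yields a 1-dimensional pandas Series
locality_series = df['locality']
print(locality_series)
Output:-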
0 Blagoveshchensk
1 Ekaterinburg
2 Safonovka
3 Tomsk
4 Novosibirsk
...
5236 Khunzakh
5237 Derbent
5238 NaN
5239 Starocherkasskaya
5240 NaN
2.4 CONVERSION INTO DATAFRAME OBJECT
The process of converting a dataset to a DataFrame depends on the format in which the
dataset is stored, such as a CSV file, an Excel file, a JSON file or a SQL database table.
Source Code :-
# Wrap the data in a DataFrame (df read from a CSV is already one)
dataframe = pd.DataFrame(df)
print(dataframe)
Output:-
(The printed DataFrame shows the columns name, type, region, locality and geolocation, wrapped across multiple lines.)
2.5 BOX PLOT
To draw a box plot, you can modify the plotting section of your code. A box plot shows the
distribution of a variable rather than raw counts. Since you're working with tourist attractions per
region, you can draw a box plot of the number of attractions per region by grouping the data
by region and using the count of attractions in each group.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('russian_tourist_attractions.csv')
df.columns = df.columns.str.strip()
print(df.columns)

# Number of attractions in each region
region_counts = df.groupby('region').size()

# Horizontal box plot of the per-region counts
plt.figure(figsize=(10, 6))
plt.boxplot(region_counts, vert=False)
plt.xlabel("Number of Attractions")
plt.ylabel("Region")
plt.grid(True)
plt.show()
Output :-
Fig: Box plot of the number of attractions per region
3. DATA PREPROCESSING
3.1 HANDLING MISSING VALUES
Missing values, often represented as NaN (Not a Number), can cause problems
during data processing and analysis. These gaps in the data can lead to incorrect
analysis and misleading conclusions.
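A short sketch that produces the per-column counts shown below; the fillna() line illustrates one common way to handle the gaps and is an assumption, not necessarily the report's chosen method:
# Count missing values in each column
print(df.isnull().sum())

# One option: fill missing locality values with a placeholder
df['locality'] = df['locality'].fillna('Unknown')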
Output:
name 0
type 0
region 0
locality 842
geolocation 0
dtype: int64
3.2 Sorting
To sort a dataset in Jupyter, first load the CSV using pandas.read_csv(). Then use
df.sort_values(by='column_name') to sort by a specific column, such as 'arename'. Display
the result with display(sorted_df.head()) to see the sorted data in the notebook.
import pandas as pd
from IPython.display import display

df = pd.read_csv('russian_tourist_attractions.csv')
print(df.head())
df.columns = df.columns.str.strip()

if 'arename' in df.columns:
    # Sort by the 'arename' column and display the first sorted rows
    sorted_df = df.sort_values(by='arename')
    display(sorted_df.head())
else:
    print("Column 'arename' does not exist. Available columns are:", df.columns)
Output:-
3.3 Principal component analysis
Principal Component Analysis (PCA) is a dimensionality reduction technique used to
transform features into a set of linearly uncorrelated components. These components
capture the maximum variance in the dataset.
Source Code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# PCA operates on numeric data only
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
if len(numeric_columns) > 0:
    # Separate and standardize the features
    X = df[numeric_columns]
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # Keep enough components to explain 95% of the variance
    pca = PCA(n_components=0.95)
    X_pca = pca.fit_transform(X_scaled)
    explained_variance = pca.explained_variance_ratio_
    print("Explained Variance Ratio of Components:", explained_variance)
    print("Number of components selected:", pca.n_components_)
    # Plot cumulative explained variance against component count
    plt.figure(figsize=(8, 6))
    plt.plot(range(1, len(explained_variance) + 1), explained_variance.cumsum(), marker='o')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.grid()
    plt.show()
else:
    print("No numeric columns available for PCA.")
Output:
Explained Variance Ratio of Components: [0.21369912 0.11971959 0.09238384 0.08994039
0.07685925 0.07479569 0.06741785 0.05907578 0.05584142 0.04849683 0.04017354
0.0332042 ]
Number of components selected: 12
4. K-Fold Cross Validation
To perform K-Fold Cross Validation with scikit-learn in Python, you can use the KFold class
from sklearn.model_selection. K-Fold cross-validation is a technique used to evaluate machine
learning models. It involves splitting the dataset into 'k' subsets (folds), using each fold once
as a test set while the remaining folds are used for training. This process is repeated for each
fold, and the model's performance is averaged.
Customizing K-Fold Cross Validation:
Dataset: We assume that you have a dataset loaded into the df DataFrame. The target
variable is specified as 'target', and we drop this column to separate the feature columns X.
Standardization: The feature data is standardized using StandardScaler to ensure that the
features have a mean of 0 and a standard deviation of 1. This is important when working
with models that are sensitive to feature scaling.
Model: We use RandomForestClassifier as an example model. You can replace this with any
other model, such as SVC, LogisticRegression, etc.
Source Code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Assume the dataset is loaded into df and has a 'target' column
# ('your_dataset.csv' is a placeholder file name)
df = pd.read_csv('your_dataset.csv')
# Label-encode the categorical (object) columns
label_encoders = {}
for col in df.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le
# Fill missing numeric values with the column mean
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())
# Separate features from the target, then standardize
target_column = 'target'
X = df.drop(columns=target_column)
y = df[target_column]
X_scaled = StandardScaler().fit_transform(X)
# 5-fold cross-validation with a RandomForestClassifier
model = RandomForestClassifier(random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_scaled, y, cv=kf)
# 8. Output Results
print("Fold accuracies:", scores, "Mean:", scores.mean())
Output:-
5. CONCLUSION
In this project, we focused on cleaning and organizing the russian_tourist_attractions.csv
dataset, which includes valuable information about tourist destinations in
Russia. The dataset contains various attributes such as name, type, region,
locality, and geolocation of each attraction. To ensure a high-quality dataset, we
first handled any missing values, either by filling them with appropriate
imputation methods or removing incomplete entries. This process was essential
to ensure the integrity of the data for analysis.
After cleaning the dataset, we applied sorting to the relevant columns (name,
type, region, locality, and geolocation) to create a more structured and organized
view of the data. Sorting helps identify trends, outliers, and relationships among
the attributes, making it easier to draw meaningful insights. Sorting by these
attributes also prepares the dataset for further analysis or modeling, as it
provides a cleaner and more accessible structure to work with.
Ultimately, this process demonstrated the importance of data preprocessing in
ensuring that datasets are ready for in-depth analysis or machine learning tasks.
By focusing on cleaning, handling missing data, and sorting, we were able to
create a refined dataset that can serve as the foundation for future analyses, such
as clustering, predictive modeling, or trend forecasting. This project highlights
the critical steps in transforming raw data into a more structured and actionable
form, enabling deeper insights into Russian tourist attractions and their
geographical distribution.
6. REFERENCES
https://www.geeksforgeeks.org/data-mining-techniques/
https://www.javatpoint.com