Viraj_Project_Documentation
Project Report on
India Agriculture Crop
Production Analysis
Submitted to
UNIVERSITY OF MUMBAI
In the partial fulfillment of the degree
Of Masters of Computer Science
Project By:
Mr. Viraj Vasudev Pawasakar
Exam Seat No:
1183893
Under the Guidance of
Mrs. Rupali Agavekar
Navkokan Education Society’s
D.B.J College, Chiplun
(2023-2024)
Navkokan Education Society’s
CERTIFICATE
This is to certify that Mr. Viraj Vasudev Pawasakar of
MSc. Part-II (Semester IV) Computer Science has
successfully completed the Project in Machine Learning
and has submitted the same to my satisfaction during the
academic year 2023-24 towards partial fulfillment of
MSc. Part-II (Semester IV) Computer Science,
University of Mumbai.
Date:
Guide Signature:
INCHARGE
Department of Computer Science
Acknowledgement
2. Implementation Details
Project Overview
India is one of the largest agricultural producers in the world,
and understanding the dynamics of crop production is crucial
for ensuring food security and optimizing resource allocation.
This project leverages historical data on crop production to
derive meaningful insights.
Libraries and Frameworks Used
Streamlit:
Streamlit is a framework for creating web applications with
Python. It's used for building interactive and customizable
web-based interfaces for data analysis, machine learning,
and more.
Pandas:
Pandas is a powerful data manipulation and analysis library.
It provides data structures like DataFrames and Series, which
are essential for handling structured data.
NumPy:
NumPy is a fundamental package for numerical computing in
Python. It provides support for large, multi-dimensional
arrays and matrices, along with a collection of mathematical
functions to operate on these arrays.
Matplotlib:
Matplotlib is a comprehensive library for creating static,
animated, and interactive visualizations in Python. pyplot is a
module in Matplotlib that provides a MATLAB-like interface
for plotting.
Seaborn:
Seaborn is built on top of Matplotlib and provides a higher-
level interface for drawing attractive and informative
statistical graphics. It simplifies the process of creating
complex visualizations such as heatmaps, violin plots, and
more.
scikit-learn:
Scikit-learn is a versatile machine learning library for Python.
It includes various tools for supervised and unsupervised
learning, such as regression, classification, clustering, and
dimensionality reduction. LinearRegression is a model class
for fitting linear regression models, and train_test_split is a
function for splitting data into training and testing sets. The
mean_squared_error is a function that calculates the mean
squared error between predicted values and actual values,
commonly used to evaluate regression models.
Implementation Steps
5. Trend Analysis
Analyzes the trends in crop production to identify patterns and
seasonal variations.
7. Correlation Analysis
Examines the relationships between different variables to
understand their interdependencies.
8. Seasonal Analysis
Analyzes the seasonal patterns in crop production to understand
the impact of seasons.
9. Linear Regression
Applied to predict future crop production based on historical
data.
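The trend and regression steps listed here can be illustrated with a minimal sketch. The yearly totals are made-up numbers rather than values from the project dataset, the names `years`, `production`, and `forecast` are hypothetical, the three-year horizon is an assumption, and `numpy.polyfit` stands in for the scikit-learn model used in the application.

```python
import numpy as np

# Hypothetical yearly production totals (tonnes); not taken from the dataset.
years = np.array([2015, 2016, 2017, 2018, 2019])
production = np.array([100.0, 110.0, 120.0, 130.0, 140.0])

# Measure time from the first year so the fit is numerically well behaved.
t = years - years[0]

# Fit a straight line: production = slope * t + intercept.
slope, intercept = np.polyfit(t, production, deg=1)

# Extrapolate the fitted line over an assumed three-year horizon.
future_years = np.arange(years[-1] + 1, years[-1] + 4)
forecast = slope * (future_years - years[0]) + intercept

for year, value in zip(future_years, forecast):
    print(year, round(float(value), 1))
```

A straight-line extrapolation like this is only reasonable when the historical totals follow a roughly linear trend.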
3. Experimental Setup and Results
Microsoft Visual Studio Code:
CSV
CSV files are especially useful in data analysis and machine learning
projects where large datasets need to be processed efficiently. Their
straightforward structure allows for quick parsing and integration with
numerous data processing libraries in programming languages like
Python, R, and Java. For instance, in Python, libraries such as pandas
provide robust tools for reading, writing, and manipulating CSV data,
facilitating tasks like data cleaning, transformation, and visualization.
Furthermore, the simplicity of CSV files keeps overhead low and avoids most
compatibility issues, making them a practical choice for both small-scale
data operations and large-scale data workflows in various domains.
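As a brief illustration of reading and writing CSV data with pandas, the sketch below parses a small in-memory CSV string. The rows are hypothetical and only mimic the kind of columns used in this project; `io.StringIO` is used so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; the last row has a missing Production value.
csv_text = """State,Crop,Year,Area,Production
Maharashtra,Rice,2018-19,10.0,25.0
Maharashtra,Wheat,2018-19,5.0,8.0
Kerala,Rice,2019-20,4.0,
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)          # column types inferred while parsing
print(df.isnull().sum())  # count of missing values per column

# Write the table back to CSV text; index=False omits the row index column.
round_trip = df.to_csv(index=False)
```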
Methodology
Data Collection:
Gather data from reliable sources, including parameters like
crop type, year, area under cultivation, production, yield,
and weather conditions.
Obtain data in CSV format for easy storage and analysis.
Data Preprocessing:
Data Cleaning: Address missing values, remove
duplicates, and correct inconsistencies.
Data Transformation: Ensure correct data types and
create derived features as needed.
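The cleaning and transformation steps can be sketched as follows. The records are hypothetical, and dropping rows with missing values is only one possible way to handle them (an assumption made for this example; imputation is another option).

```python
import pandas as pd

# Hypothetical raw records: one duplicate row, one missing value,
# and numeric columns stored as text.
raw = pd.DataFrame({
    "State": ["Kerala", "Kerala", "Punjab", "Punjab"],
    "Crop": ["Rice", "Rice", "Wheat", "Rice"],
    "Area": ["4.0", "4.0", "10.0", "2.0"],
    "Production": ["8.0", "8.0", "45.0", None],
})

# Data cleaning: remove exact duplicates, then rows missing key values.
cleaned = raw.drop_duplicates()
cleaned = cleaned.dropna(subset=["Area", "Production"])

# Data transformation: enforce numeric types and add a derived feature.
cleaned = cleaned.astype({"Area": "float64", "Production": "float64"})
cleaned["Yield"] = cleaned["Production"] / cleaned["Area"]
print(cleaned)
```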
Correlation Analysis:
Calculate correlation coefficients to evaluate relationships
between variables like rainfall, temperature, and crop
yield.
Identify key factors significantly correlated with
crop production.
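A minimal sketch of this calculation with pandas is given below. The rainfall, temperature, and yield figures are invented purely to show the mechanics and are not results from the project.

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the calculation.
df = pd.DataFrame({
    "Rainfall": [800, 950, 1100, 1250, 1400],
    "Temperature": [31, 29, 30, 27, 28],
    "Yield": [2.0, 2.3, 2.7, 3.0, 3.4],
})

# Pearson correlation coefficient between every pair of numeric columns.
corr = df.corr(method="pearson")

# Rank the other variables by the strength of their correlation with Yield.
ranked = corr["Yield"].drop("Yield").abs().sort_values(ascending=False)
print(corr.round(2))
print(ranked)
```

A coefficient close to +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship. Correlation alone does not establish that one variable causes the other.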
Predictive Modeling:
Model Selection: Choose machine learning models (e.g.,
Linear Regression) for future crop production prediction.
Model Training: Split data into training and testing sets,
then train the models.
Model Evaluation: Use metrics such as Mean Squared
Error (MSE) to assess model accuracy.
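The three modeling steps can be sketched on synthetic data as shown below. The data-generating rule (production equal to 2.5 times area plus noise) is an assumption made only for this example, and `manual_mse` is a hypothetical name used to show what the metric computes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): production grows linearly with area plus noise.
rng = np.random.default_rng(0)
area = rng.uniform(1.0, 100.0, size=200)
production = 2.5 * area + rng.normal(0.0, 1.0, size=200)

X = area.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
y = production

# Model training: hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)

# Model evaluation: MSE is the mean of the squared prediction errors.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
manual_mse = float(np.mean((y_test - y_pred) ** 2))
print(f"MSE: {mse:.3f}, fitted slope: {model.coef_[0]:.3f}")
```

Because MSE is expressed in squared units of the target, it is most useful for comparing models on the same data rather than as an absolute score.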
Database Description
Year: The year in which the data was recorded (e.g., 2018-19,
2019-20).
Crop: The type of crop being analyzed (e.g., rice, wheat, maize).
Area: The area under cultivation, typically measured in hectares.
Production: The total production of the crop, usually measured
in tonnes.
Yield: The yield of the crop, calculated as production per unit
area (e.g., tonnes per hectare).
Geographical Location: Details about the location of cultivation,
recorded as state and district.
Field               Data type
State               object
District            object
Crop                object
Year                object
Season              object
Area                float64
Area Units          object
Production          float64
Production Units    object
Yield               float64
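To make the schema concrete, the sketch below builds two hypothetical rows with these fields, checks that Yield agrees with Production divided by Area, and converts the split-year label to a number. The row values and the `Start Year` column are assumptions for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows following the fields listed above.
df = pd.DataFrame({
    "State": ["Punjab", "Kerala"],
    "District": ["Ludhiana", "Palakkad"],
    "Crop": ["Wheat", "Rice"],
    "Year": ["2018-19", "2019-20"],
    "Season": ["Rabi", "Kharif"],
    "Area": [10.0, 4.0],
    "Area Units": ["Hectare", "Hectare"],
    "Production": [45.0, 8.0],
    "Production Units": ["Tonnes", "Tonnes"],
    "Yield": [4.5, 2.0],
})

# Yield should equal production per unit area (tonnes per hectare here).
consistent = np.isclose(df["Yield"], df["Production"] / df["Area"])

# Year is stored as text such as "2018-19"; keep the starting year as an integer.
df["Start Year"] = df["Year"].str.split("-").str[0].astype(int)
print(df.dtypes)
```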
4. Analysis of the results
Code:
import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def main():
    scroll_to_top()

@st.cache_data
def load_data():
    # Read the dataset once; Streamlit caches the result across reruns
    data = pd.read_csv('India Agriculture Crop Production.csv')
    return data

data = load_data()
        margin-left: -20rem;
    }
    .css-1d391kg[data-expanded="true"] {
        margin-left: 0;
    }
    </style>
    """,
    unsafe_allow_html=True,
)
# Sidebar content: each button selects the page to display
st.sidebar.title("Navigation")
if st.sidebar.button("Introduction"):
    st.session_state.page = "Introduction"
if st.sidebar.button("Analysis of Data"):
    st.session_state.page = "Analysis of Data"
if st.sidebar.button("Data Cleaning"):
    st.session_state.page = "Data Cleaning"
if st.sidebar.button("Visual Analysis"):
    st.session_state.page = "Visual Analysis"
if st.sidebar.button("Trend Analysis"):
    st.session_state.page = "Trend Analysis"
if st.sidebar.button("Correlation Analysis"):
    st.session_state.page = "Correlation Analysis"
if st.sidebar.button("Seasonal Analysis"):
    st.session_state.page = "Seasonal Analysis"
# Initialize session state variables if they don't exist
if 'show_crop_production_years' not in st.session_state:
    st.session_state.show_crop_production_years = False
if 'show_crop_production_state' not in st.session_state:
    st.session_state.show_crop_production_state = False
if 'show_area_cultivation_state' not in st.session_state:
    st.session_state.show_area_cultivation_state = False
if 'show_share_area_cultivation_year' not in st.session_state:
    st.session_state.show_share_area_cultivation_year = False
if 'show_production_state_year' not in st.session_state:
    st.session_state.show_production_state_year = False
if 'show_production_crop_year' not in st.session_state:
    st.session_state.show_production_crop_year = False
if 'show_selected_state_crop_production' not in st.session_state:
    st.session_state.show_selected_state_crop_production = False
if 'show_selected_crop_production_top_states' not in st.session_state:
    st.session_state.show_selected_crop_production_top_states = False
if 'show_total_production_rice_wheat' not in st.session_state:
    st.session_state.show_total_production_rice_wheat = False
if 'show_heat_map_average_yield_by_state_year' not in st.session_state:
    st.session_state.show_heat_map_average_yield_by_state_year = False
if 'show_total_production' not in st.session_state:
    st.session_state.show_total_production = False
if 'show_future_data_prediction' not in st.session_state:
    st.session_state.show_future_data_prediction = False
if 'show_seasonal_analysis' not in st.session_state:
    st.session_state.show_seasonal_analysis = False
if 'show_yield_prediction_model' not in st.session_state:
    st.session_state.show_yield_prediction_model = False
if st.session_state.page == "Introduction":
    st.title("India Agriculture Crop Production Analysis")
    st.write("""
## Welcome to the Introduction Tab
This project is focused on analyzing the agriculture crop
production in India. The aim of this analysis is to
provide insights into crop production trends, identify
high-performing crops and districts, and utilize various
data visualization and machine learning techniques to
understand and predict agricultural productivity.
- **Linear Regression**: Applied to predict future
crop production based on historical data.
- **Train-Test Split**: Used to validate the
performance of the predictive models.
- **Yield Prediction Model (Mean Squared Error)**:
Evaluates the accuracy of the yield prediction model using the
Mean Squared Error metric.
### Conclusion
This project provides a comprehensive analysis of
agricultural crop production in India, offering valuable
insights through data visualization and machine learning
techniques. We hope that this analysis will contribute
to a better understanding of India's agricultural
landscape and support efforts to improve crop production
efficiency and food security.
""")
**First Few Rows of the Dataset**: This displays the first
few rows of the dataset to give an overview of the data
structure and contents.
""")
    st.write(data.head())

    # Summary statistics
    st.write("### Summary Statistics")
    st.write("""
**Summary Statistics**: Provides basic descriptive
statistics such as mean, standard deviation, min, max, and
quartiles for each numeric column. This helps in
understanding the distribution and spread of the data.
""")
    st.write(data.describe())

    # Check for missing values
    st.write("### Missing Values")
    st.write("""
**Missing Values**: Lists the number of missing values in
each column. Identifying missing values is crucial as they need
to be handled before further analysis.
""")
    missing_values = data.isnull().sum()
    st.write(missing_values)
and 'Crop'. This allows for a detailed analysis of these
metrics across different states and crops.
""")
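    # Keep the column selection on the same statement as groupby();
    # on its own line it would be parsed as a separate list expression
    summary_by_state_crop = data_cleaned.groupby(
        ['State', 'Crop'])[['Area', 'Production', 'Yield']].describe()
    st.write(summary_by_state_crop)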
    if st.session_state.show_crop_production_years:
        @st.cache_resource
        def plot_crop_production_years():
            plt.figure(figsize=(12, 6))
            sns.lineplot(data=data, x='Year', y='Production')
            plt.title('Crop Production Over the Years')
            plt.xlabel('Year')
            plt.ylabel('Production')
            plt.xticks(rotation=90)
            st.pyplot(plt)
        plot_crop_production_years()

    if st.session_state.show_crop_production_state:
        @st.cache_resource
        def plot_crop_production_state():
            plt.figure(figsize=(12, 8))
            # estimator=sum plots the total production for each state
            sns.barplot(data=data, x='State', y='Production',
                        estimator=sum)
            plt.title('Crop Production by State')
            plt.xlabel('State')
            plt.ylabel('Total Production')
            plt.xticks(rotation=90)
            st.pyplot(plt)
        plot_crop_production_state()
    if st.session_state.show_area_cultivation_state:
        year = st.selectbox("Select Year", data['Year'].unique())

        @st.cache_resource
        def plot_area_cultivation_state(year):
            crop_df = pd.DataFrame(data)
            crop_df = crop_df[crop_df.Crop != 'Coconut']
            crop_df_Year = crop_df[crop_df.Year == year]
            grouped_state = crop_df_Year.groupby('State')
            area_by_state = grouped_state['Area'].sum().sort_values(ascending=False)
            plt.figure(figsize=(12, 4))
            # Divide by 1e6 (not 1e7) so the values match the "million hect" label
            plt.bar(area_by_state.index, area_by_state / 1e6)
            plt.title(f'Area under Cultivation by State {year} (million hect)')
            plt.ylabel('Area under Cultivation (million hect)')
            plt.xticks(rotation=90)
            st.pyplot(plt)
        plot_area_cultivation_state(year)
        st.session_state.show_share_area_cultivation_year = (
            not st.session_state.show_share_area_cultivation_year
        )
    if st.session_state.show_share_area_cultivation_year:
        year = st.selectbox("Select Year", data['Year'].unique(),
                            key="share_area_cultivation_year")

        @st.cache_resource
        def plot_share_area_cultivation_year(year):
            crop_df = pd.DataFrame(data)
            crop_df = crop_df[crop_df.Crop != 'Coconut']
            crop_df_Year = crop_df[crop_df.Year == year]
            grouped_state = crop_df_Year.groupby('State')
            area_by_state = grouped_state['Area'].sum().sort_values(ascending=False)
            # Top ten states, with all remaining states combined into "other"
            pie_break = [i for i in area_by_state.head(10)] + \
                [area_by_state.sum() - (area_by_state.head(10).sum())]
            pie_labels = [i for i in area_by_state.head(10).index] + ['other']
            plt.figure(figsize=(10, 6))
            plt.pie(pie_break, labels=pie_labels, autopct='%.2f%%')
            plt.title(f'Share of Area under Cultivation in Year {year}')
            st.pyplot(plt)
        plot_share_area_cultivation_year(year)
    if st.session_state.show_production_state_year:
        year = st.selectbox("Select Year", data['Year'].unique(),
                            key="production_state_year")

        @st.cache_resource
        def plot_production_state_year(year):
            crop_df = pd.DataFrame(data)
            crop_df = crop_df[crop_df.Crop != 'Coconut']
            crop_df_Year = crop_df[crop_df.Year == year]
            grouped_state = crop_df_Year.groupby('State')
            prod_by_state = grouped_state['Production'].sum().sort_values(ascending=False)
            plt.figure(figsize=(18, 4))
            # Divide by 1e6 (not 1e7) so the values match the "million tonnes" label
            plt.bar(prod_by_state.index, prod_by_state / 1e6)
            # Title unit changed from "hect" to "tonnes" to agree with the y-axis
            plt.title(f'Production by State in Year {year} (million tonnes)')
            plt.ylabel('Production (million tonnes)')
            plt.xticks(rotation=90)
            st.pyplot(plt)
        plot_production_state_year(year)
    if st.session_state.show_production_crop_year:
        year = st.selectbox("Select Year", data['Year'].unique(),
                            key="production_crop_year")

        @st.cache_resource
        def plot_production_crop_year(year):
            crop_df = pd.DataFrame(data)
            crop_df = crop_df[crop_df.Crop != 'Coconut']
            crop_df_Year = crop_df[crop_df.Year == year]
            grouped_crop = crop_df_Year.groupby('Crop')
            percent_crop = grouped_crop['Production'].sum().sort_values(ascending=False)
            plt.figure(figsize=(18, 4))
            # Scaled by 1e6 so the bars match the "million tonnes" label;
            # the extracted listing plotted the unscaled totals
            plt.bar(percent_crop.index, percent_crop / 1e6)
            # Title unit changed from "hect" to "tonnes" to agree with the y-axis
            plt.title(f'Production by Crop in Year {year} (million tonnes)')
            plt.ylabel('Production (million tonnes)')
            plt.xticks(rotation=90)
            st.pyplot(plt)
        plot_production_crop_year(year)
    if st.session_state.show_selected_state_crop_production:
        year = st.selectbox("Select Year", data['Year'].unique(),
                            key="selected_state_crop_year")
        crop = st.selectbox("Select Crop", data['Crop'].unique(),
                            key="selected_state_crop_crop")

        @st.cache_resource
        def plot_selected_state_crop_production(year, crop):
            crop_df = pd.DataFrame(data)
            crop_df = crop_df[crop_df.Crop != 'Coconut']
            crop_df_year = crop_df[crop_df.Year == year]
            selected_crop_df = crop_df_year[crop_df_year.Crop == crop]
            production_by_state = selected_crop_df.groupby('State')[
                'Production'].sum().sort_values(ascending=False)
            plt.figure(figsize=(15, 5))
            plt.bar(production_by_state.index, production_by_state / 1e6)
        st.session_state.show_selected_crop_production_top_states = (
            not st.session_state.show_selected_crop_production_top_states
        )
    if st.session_state.show_selected_crop_production_top_states:
        crop = st.selectbox("Select Crop", data['Crop'].unique(),
                            key="selected_crop_production_top_states_crop")

        @st.cache_resource
        def plot_selected_crop_production_top_states(crop):
            crop_df = pd.DataFrame(data)
            crop_df = crop_df[crop_df.Crop != 'Coconut']
            selected_crop_df = crop_df[crop_df['Crop'] == crop]
            # Keep only the ten states with the highest total production
            production_by_state = selected_crop_df.groupby('State')[
                'Production'].sum().sort_values(ascending=False).head(10)
            plt.figure(figsize=(15, 5))
            plt.bar(production_by_state.index, production_by_state / 1e6)
    if st.session_state.show_total_production_rice_wheat:
        @st.cache_resource
        def plot_total_production_rice_wheat():
            rw_years = data[data.Crop.isin(['Rice', 'Wheat'])][
                ['Year', 'Yield', 'Area', 'Production', 'State']]
            rw_years.drop(rw_years.index[rw_years.Year == '2020-21'],
                          inplace=True)
            rw_group = rw_years.groupby('Year')
            plt.figure(figsize=(14, 8))
            # Divide by 1e6 (not 1e7) so the values match the "million tonnes" label
            plt.plot(rw_group['Production'].sum() / 1e6)
            plt.title('Total Production of Rice & Wheat Over the Years')
            plt.xlabel('Year')
            plt.ylabel('Production (million tonnes)')
            plt.xticks(rotation=90)
            st.pyplot(plt)
        plot_total_production_rice_wheat()
    if st.session_state.show_heat_map_average_yield_by_state_year:
        @st.cache_resource
        def plot_heat_map_average_yield_by_state_year():
            rw_years = data[data.Crop.isin(['Rice', 'Wheat'])][
                ['Year', 'Yield', 'Area', 'Production', 'State']]
            rw_years.drop(rw_years.index[rw_years.Year == '2020-21'],
                          inplace=True)
            # Rows are states, columns are years, cells hold the mean yield
            heatmap_df = rw_years[['State', 'Year', 'Yield']].groupby(
                ['State', 'Year'])['Yield'].mean().unstack(level=-1)
        plot_heat_map_average_yield_by_state_year()
elif st.session_state.page == "Trend Analysis":
    st.title("Trend Analysis")
    st.write("Welcome to the Learning Data Analysis Through "
             "Other Analysis Algorithms tab.")
    if st.session_state.show_total_production:
        @st.cache_resource
        def plot_total_production():
            data.drop(data.index[data.Year == '2020-21'], inplace=True)
            production_trend = data.groupby('Year')['Production'].sum()
            plt.figure(figsize=(12, 6))
            plt.plot(production_trend.index, production_trend.values,
                     marker='o')
            plt.title('Total Crop Production in India (1997-2020)')
            plt.xlabel('Year')
            plt.ylabel('Total Production (Tonnes)')
            plt.grid(True)
            plt.xticks(rotation=90)
            st.pyplot(plt)
        plot_total_production()
values based on historical trends. If the data shows a
consistent linear trend, this model provides
a straightforward method for making future
projections.""")
    if st.button("Show Future Data Prediction Graph"):
        st.session_state.show_future_data_prediction = (
            not st.session_state.show_future_data_prediction
        )
    if st.session_state.show_future_data_prediction:
        @st.cache_resource
        def plot_future_data_prediction():
            data.drop(data.index[data.Year == '2020-21'],
                      inplace=True, errors='ignore')
            # Convert labels such as '2018-19' to the starting year (2018)
            data['Year'] = data['Year'].apply(lambda x: int(x.split('-')[0]))
            production_trend = data.groupby('Year')['Production'].sum()
            X = production_trend.index.values.reshape(-1, 1)
            y = production_trend.values
            model = LinearRegression()
            model.fit(X, y)
            # future_years and predictions are not defined in the listing as
            # extracted; a five-year forecast horizon is assumed here
            future_years = np.arange(X.max() + 1, X.max() + 6).reshape(-1, 1)
            predictions = model.predict(future_years)
            plt.figure(figsize=(12, 6))
            plt.plot(production_trend.index, production_trend.values,
                     marker='o', label='Actual Production')
            plt.plot(future_years, predictions,
                     marker='x', linestyle='--', color='red',
                     label='Predicted Production')
            plt.grid(True)
            plt.xticks(rotation=90)
            plt.legend()
            st.pyplot(plt)
        plot_future_data_prediction()
    if st.session_state.show_seasonal_analysis:
        @st.cache_resource
        def plot_seasonal_analysis():
            # Boxplot of production by season
            data.drop(data.index[data.Season == 'Whole Year'],
                      inplace=True)  # data.Season != 'Whole Season'
            plt.figure(figsize=(12, 6))
            sns.boxplot(x='Season', y='Production', data=data)
            plt.title('Production by Season')
            plt.xlabel('Season')
            plt.ylabel('Production (Tonnes)')
            plt.xticks(rotation=45)
            st.pyplot(plt)
        plot_seasonal_analysis()
    if st.session_state.show_yield_prediction_model:
        @st.cache_resource
        def plot_yield_prediction_model():
            data = pd.read_csv('India Agriculture Crop Production.csv')
            # Drop rows that lack any of the columns used by the model
            data = data.dropna(subset=['Area', 'Production', 'Yield'])
            X = data[['Area', 'Production']]
            y = data['Yield']
            # Hold out 20% of the rows to evaluate the fitted model
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.2, random_state=42)
            model = LinearRegression()
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            mse = mean_squared_error(y_test, y_pred)
            st.write(f'Mean Squared Error: {mse}')
Screenshots:
5. Conclusion
6. Future Enhancement
Remote Sensing and Satellite Imagery: Utilize remote sensing
technologies and satellite imagery to monitor crop health, soil moisture,
and other critical parameters in real-time, enabling more precise and
timely interventions.
References
https://www.youtube.com/
https://www.kaggle.com/
https://docs.streamlit.io/