AI-Data Science
Artificial Intelligence
Code-417
Data Science
Grade 10
• Only data that is available for public use should be collected.
• Personal datasets should only be used with the consent of the owner.
• One should never breach someone’s privacy to collect data.
• Data should only be taken from reliable sources, as data collected
from random sources can be wrong or unusable.
• Reliable sources of data ensure the authenticity of data which helps in
proper training of the AI model.
Data Visualisation
• While collecting data, it is possible that the data might come with some errors. Let
us first take a look at the types of issues we can face with data:
1. Erroneous Data: There are two ways in which the data can be erroneous:
   • Incorrect values
   • Invalid or Null values
2. Missing Data: In some datasets, some cells remain empty because their values
are missing.
3. Outliers: Data points that fall outside the usual range of values for a certain
element are referred to as outliers.
To understand this better, let us take an example of marks of students in a class. Let
us assume that a student was absent for the exams and hence got 0 marks. If these
marks are taken into account, the whole class’s average would go down. To prevent
this, the average is calculated over the marks of the remaining students, from highest
to lowest, keeping this particular result separate. This makes sure that the average
marks of the class reflect the students who actually took the exam.
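The effect of the absent student's 0 can be checked with a few lines of Python. This is a minimal sketch: the marks are made-up values, and it assumes that a mark of 0 always means "absent" in this small data set.

```python
# Hypothetical marks for a class of six; the 0 belongs to the absent student.
marks = [78, 85, 62, 90, 0, 71]


def average(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Average over every entry, including the absent student's 0.
mean_with_outlier = average(marks)

# Average over the students who actually sat the exam.
# Assumption: a mark of 0 means "absent" in this toy data set.
present_marks = [m for m in marks if m != 0]
mean_without_outlier = average(present_marks)

print("With the outlier:   ", round(mean_with_outlier, 2))
print("Without the outlier:", round(mean_without_outlier, 2))
```

Keeping the 0 pulls the class average down by almost 13 marks in this example, which is why the outlier is kept separate.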
Pandas
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and manipulating data.
Why Use Pandas?
• Analyze big data and make conclusions based on statistical theories.
• Make data readable and relevant.
What Can Pandas Do?
• Pandas can answer questions about the data, such as:
   • Is there a correlation between two or more columns?
   • What is the average value?
   • What is the maximum value?
   • What is the minimum value?
• Pandas can also delete rows that are not relevant or that contain wrong
values, such as empty or NULL values. This is called cleaning the data.
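A short sketch of these ideas in Pandas is given here. The column names and values are invented for illustration; `dropna`, `mean`, `max`, `min` and `corr` are standard Pandas methods.

```python
import pandas as pd

# Hypothetical data set: hours studied and marks for five students.
# None stands for an empty cell (a missing value).
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0],
    "marks": [35.0, 50.0, None, 70.0, 85.0],
})

# Cleaning the data: drop every row that contains an empty cell.
clean = df.dropna()

# Simple questions Pandas can answer about a column.
average_marks = clean["marks"].mean()
highest_marks = clean["marks"].max()
lowest_marks = clean["marks"].min()

# Correlation between two columns (a value between -1 and 1).
correlation = clean["hours"].corr(clean["marks"])

print(clean)
print("Average:", average_marks, "Max:", highest_marks, "Min:", lowest_marks)
print("Correlation between hours and marks:", round(correlation, 3))
```

A correlation close to 1 here would suggest that, in this made-up data, marks rise as the hours studied rise.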
Matplotlib
• Matplotlib is a Python 2D plotting library that we can use to produce
high-quality data visualizations.
• It has a module named pyplot, which makes plotting easy by
providing features to control line styles, font properties, axis
formatting, etc.
• It supports a very wide variety of graphs and plots, including
histograms, bar charts, scatter plots and error charts.
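As an illustration, a histogram can be drawn with pyplot in a few lines. This is only a sketch: the marks, the bin edges and the output file name `marks_histogram.png` are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display window is needed
import matplotlib.pyplot as plt

# Hypothetical marks of eight students.
marks = [35, 50, 62, 70, 71, 78, 85, 90]

fig, ax = plt.subplots()

# A histogram counts how many values fall into each range (bin).
counts, bin_edges, _ = ax.hist(
    marks, bins=[30, 50, 70, 90, 110], color="steelblue", edgecolor="black"
)

# pyplot lets us control titles and axis labels.
ax.set_title("Distribution of marks")
ax.set_xlabel("Marks")
ax.set_ylabel("Number of students")

# Save the chart to an image file (hypothetical file name).
fig.savefig("marks_histogram.png")
```

In an interactive session, `plt.show()` could be used instead of saving the figure to a file.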
Data Visualization
Pandas + NumPy + Matplotlib = Data Visualization
1. What is KNN?
2. How does it work?
3. What are the features of KNN?
Personality Prediction
Step 1:
Here is a map. Take a good look at it. In this map, each arrow
represents a quality. The qualities mentioned are:
Think for a minute and understand which of these qualities you have in
you. Now, take a chit and write your name on it. Place this chit at the
point on the map which best describes you. It can be placed anywhere on
the graph. Be honest about yourself and put it on the graph.
Step 2:
Take the quiz
https://tinyurl.com/discanimal
K-Nearest Neighbour Model (KNN)
• KNN is a simple, easy-to-implement supervised machine learning algorithm.
• It can be used to solve both classification and regression problems.
• The KNN algorithm assumes that similar things exist in close
proximity. In other words, similar things are near each other. As the
saying goes, “Birds of a feather flock together”.
Features of KNN
• The KNN prediction model relies on the surrounding points, or
neighbors, to determine the class or group of a new point.
• It uses the properties of the majority of the nearest points to decide
how to classify unknown points.
• Based on the concept that similar data points should be close to each
other.
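The idea of majority voting among the nearest points can be sketched in plain Python. This is a minimal illustration, not a full KNN implementation: the training points, the labels "A" and "B", the function name `knn_predict` and the choice of k = 3 are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical training data: each entry is ((x, y), label).
training_points = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"),
    ((7.0, 5.5), "B"),
    ((6.5, 7.0), "B"),
]


def knn_predict(training, query, k=3):
    """Classify `query` by a majority vote among its k nearest points."""
    # Sort the known points by their straight-line distance to the query.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    # Keep the labels of the k closest points.
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the neighbors is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


prediction_near_a = knn_predict(training_points, (2.0, 2.0))
prediction_near_b = knn_predict(training_points, (6.0, 5.0))

print(prediction_near_a)
print(prediction_near_b)
```

An odd value of k is a common choice for two classes because it avoids tied votes.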
Questions to Practice
1. What are the applications of Data science?
2. Define: Erroneous data, Outliers, Histogram.
3. What is KNN? How does it work?
4. What are the features of KNN?
5. Explain personality prediction.
6. What is the purpose of Pandas?
7. Why do we use Matplotlib in Python?