TEXT CLASSIFICATION USING NAIVE BAYES
BINARY
CLASSIFICATION:
Involves only two categories,
such as positive/negative.
It is the simplest form of text classification, where the model decides
between two possible outcomes.
MULTI-CLASS CLASSIFICATION:
Text is classified into one category out of more than two possible
classes (e.g., classifying news articles as sports, politics, tech).
Each text input gets only one label, even if others may seem
relevant.
MULTI-LABEL CLASSIFICATION:
A single text can be assigned multiple labels (e.g., a tweet labeled as
sports and health).
It reflects real-world scenarios where content may belong to more
than one category simultaneously (see the sketch below).
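As a minimal sketch of how these label structures are typically encoded (assuming scikit-learn; the labels below are made up for illustration):

from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Multi-class: every document receives exactly one label
multi_class_labels = ['sports', 'politics', 'tech']
print(LabelEncoder().fit_transform(multi_class_labels))        # one integer class id per document

# Multi-label: a document may carry several labels at once
multi_label_sets = [['sports'], ['sports', 'health'], ['health']]
print(MultiLabelBinarizer().fit_transform(multi_label_sets))   # one binary indicator row per document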
NAIVE BAYES
• The Naïve Bayes classifier belongs to the family of probabilistic classifiers: it computes the probability of each predictive feature of the data belonging to each class in order to predict a probability distribution over all classes, as well as the most likely class the data sample is associated with.
• Bayes: it maps the probabilities of observing the input features given each class to a probability distribution over the classes, based on Bayes' theorem.
• Naïve: it simplifies the probability computation by assuming that the predictive features are mutually independent, as sketched below.
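A minimal sketch of a Naïve Bayes text classifier in scikit-learn (the toy corpus, labels, and variable names below are assumptions for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus with spam (1) / ham (0) labels
docs = ['win a free prize now', 'meeting agenda attached',
        'free lottery winner today', 'project status update']
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()                 # term-frequency (tf) features
X = vectorizer.fit_transform(docs)
clf = MultinomialNB(alpha=1.0).fit(X, labels)  # alpha is the Laplace smoothing term

# Probability distribution over both classes for a new message
print(clf.predict_proba(vectorizer.transform(['free prize meeting'])))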
BAYES THEOREM
• Let A and B denote two events. An event can be that it will rain tomorrow, that two kings are drawn from a deck of cards, or that a person has cancer.
• P(A|B), the probability that A occurs given that B is true, can be computed by:
• P(A|B) = P(B|A)P(A) / P(B)
• where P(B|A) is the probability of observing B given that A occurs, and P(A) and P(B) are the probabilities that A occurs and that B occurs, respectively. For example:
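Suppose (numbers assumed purely for illustration) that 20% of emails are spam, so P(A) = 0.2; the word "free" appears in 60% of spam emails, so P(B|A) = 0.6; and "free" appears in 15% of all emails, so P(B) = 0.15. Then the probability that an email containing "free" is spam is P(A|B) = 0.6 × 0.2 / 0.15 = 0.8.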
PERFORMANCE EVALUATION
Limitation of plain term-frequency (tf) features: they do not account for how widely terms appear across the entire collection. Common words (like "the", "get", "make") may appear frequently in many documents, which reduces their usefulness for classification. tf-idf addresses this by weighting the term frequency with the inverse document frequency, down-weighting terms that occur in most documents.
We can test the effectiveness of tf-idf on our existing spam email detection model by simply replacing the tf feature extractor, CountVectorizer, with the tf-idf feature extractor, TfidfVectorizer, from scikit-learn.
The best averaged 10-fold AUC of 0.9943 is achieved, which outperforms the 0.9856 obtained with tf features.
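A minimal sketch of this swap (the preprocessed texts cleaned_emails and their labels are assumed to come from the earlier spam example; the Naïve Bayes settings are assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Assumed inputs: cleaned_emails (list of preprocessed texts) and labels (0/1 spam flags)
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=8000)
X = tfidf_vectorizer.fit_transform(cleaned_emails)

# 10-fold cross-validated AUC with tf-idf features in place of raw term counts
auc_scores = cross_val_score(MultinomialNB(alpha=1.0), X, labels, cv=10, scoring='roc_auc')
print(f'Averaged 10-fold AUC: {auc_scores.mean():.4f}')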
Support Vector Machine (SVM)
•SVM is a powerful classifier often used for text data classification
tasks.
•In classification, SVM finds an optimal hyperplane that separates data
points from different classes.
•A hyperplane is a decision boundary in an n-dimensional feature
space:
• In 2D, it’s a line.
• In 3D, it’s a surface.
• In n dimensions, it’s an (n-1)-dimensional plane.
•The goal is to find the hyperplane that maximizes the margin — the
distance between the hyperplane and the nearest data points from
each class.
•These nearest points to the hyperplane are called support vectors.
•Support vectors are critical because they define the position and
orientation of the hyperplane.
The Mechanics of SVM
•There can be infinite possible hyperplanes that separate data points from different
classes.
•The task is to find the optimal separating hyperplane — the one that correctly divides
the data and maximizes the margin.
•Scenario 1: Identifying the Separating Hyperplane
• A valid hyperplane must successfully separate data points based on their labels.
• In an example with hyperplanes A, B, and C:
• Only Hyperplane C correctly separates the classes.
• Hyperplanes A and B fail to segregate them properly.
•Mathematical Definition:
• In 2D, a line (hyperplane) is defined by:
• A slope vector w (a 2D vector)
• An intercept b
• In n-dimensional space, a hyperplane is similarly defined by:
• An n-dimensional vector w
• An intercept b
• Any point x lying on the hyperplane satisfies the equation wx + b = 0.
• A hyperplane is a separating hyperplane if, for any data point x from one class, it satisfies wx + b > 0, while for any data point x from the other class, it satisfies wx + b < 0.
There can be countless possible solutions for w and b. So, next we will learn how to identify the best hyperplane among the possible separating hyperplanes.
Scenario 2 — Determining the Optimal Hyperplane
•Among many separating hyperplanes, the optimal hyperplane is the one that maximizes the margin between classes.
•Margin = the sum of:
• The distance from the nearest data point on the positive side to the hyperplane.
• The distance from the nearest data point on the negative side to the hyperplane.
•These nearest points from each class define two additional hyperplanes:
• Positive hyperplane: passes through the closest positive class point(s).
• Negative hyperplane: passes through the closest negative class point(s).
•The perpendicular distance between the positive and negative hyperplanes is called the margin.
•A decision hyperplane is optimal when this margin is maximized.
•The points that lie exactly on the margin boundaries (on the positive or negative hyperplane) are called support vectors.
•Support vectors are the critical data points that influence the position and orientation of the optimal hyperplane.
The value of wx + b for a data point can be portrayed as the distance from the data point to the decision hyperplane, and also interpreted as the confidence of the prediction: the higher the absolute value, the further the point is from the decision boundary, and the more certain the prediction.
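A small toy sketch (data assumed for illustration) of the quantities just described: the hyperplane parameters w and b, the support vectors, and the value of wx + b as prediction confidence:

import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2D data (assumed for illustration)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel='linear', C=1.0).fit(X, y)

print(svm.coef_, svm.intercept_)    # w and b of the decision hyperplane wx + b = 0
print(svm.support_vectors_)         # the support vectors that define the margin
print(svm.decision_function(X))     # wx + b per point: larger magnitude = higher confidence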
Although we cannot wait to implement the SVM algorithm, let's take a step back and look
at a frequent scenario where data points are not perfectly linearly separable.
Scenario 3 - handling outliers
To deal with a set of observations containing outliers that make it impossible to linearly segregate the entire dataset, we allow misclassification of such outliers and try to minimize the error introduced. The trade-off between a wide margin and few misclassifications is controlled by the penalty parameter C.
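A minimal sketch (toy data with one outlier, assumed for illustration) of how C influences the fitted hyperplane:

import numpy as np
from sklearn.svm import SVC

# Toy 2D data; the last point is a class-1 outlier sitting inside the class-0 cluster
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [6.0, 6.0], [7.0, 7.0], [6.0, 7.0], [1.5, 1.5]])
y = np.array([0, 0, 0, 1, 1, 1, 1])

soft = SVC(kernel='linear', C=0.1).fit(X, y)     # small C: tolerate the outlier, keep a wide margin
hard = SVC(kernel='linear', C=1000.0).fit(X, y)  # large C: penalize misclassification heavily

# The margin width equals 2 / ||w||, so compare the norms of the learned weight vectors
print(np.linalg.norm(soft.coef_), np.linalg.norm(hard.coef_))
print(soft.predict([[1.5, 1.5]]), hard.predict([[1.5, 1.5]]))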
Next, extract tf-idf features using the TfidfVectorizer extractor that we just acquired:
>>> tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True,
...                                    max_df=0.5, stop_words='english', max_features=8000)
>>> term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)
>>> term_docs_test = tfidf_vectorizer.transform(cleaned_test)
Now we can apply our SVM algorithm with the features ready. Initialize an SVC model with the kernel parameter set to linear (we will explain this shortly) and the penalty C set to the default value 1, and fit it on the training set:
>>> from sklearn.svm import SVC
>>> svm = SVC(kernel='linear', C=1.0, random_state=42)
>>> svm.fit(term_docs_train, label_train)
Then predict on the testing set with the trained model and obtain the prediction accuracy directly:
>>> accuracy = svm.score(term_docs_test, label_test)
>>> print('The accuracy on testing set is: {0:.1f}%'.format(accuracy*100))
The accuracy on testing set is: 96.4%
Our first SVM model works very well, achieving 96.4% accuracy.
Scenario 4 - dealing with more than two classes
SVM and many other classifiers can be generalized to the multiple class case by two
common approaches, one-vs-rest (also called one-vs-all) and one-vs-one.
In scikit-learn, classifiers handle multiclass cases internally, and we do not need to explicitly write any additional code to enable it. We can see how simple it is in the following example of classifying five topics: comp.graphics, sci.space, alt.atheism, talk.religion.misc, and rec.sport.hockey:
>>> categories = [
... 'alt.atheism',
... 'talk.religion.misc',
... 'comp.graphics',
... 'sci.space',
... 'rec.sport.hockey'
... ]
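A minimal sketch of the rest of this multiclass example (it reuses the categories list above; the fetch_20newsgroups loading step and the tf-idf/SVC settings mirror the earlier binary case, with text cleaning omitted here for brevity):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

data_train = fetch_20newsgroups(subset='train', categories=categories, random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=categories, random_state=42)

tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                   stop_words='english', max_features=8000)
term_docs_train = tfidf_vectorizer.fit_transform(data_train.data)
term_docs_test = tfidf_vectorizer.transform(data_test.data)

# SVC handles the five classes internally (one-vs-one under the hood)
svm = SVC(kernel='linear', C=1.0, random_state=42)
svm.fit(term_docs_train, data_train.target)
print(f'Accuracy: {svm.score(term_docs_test, data_test.target) * 100:.1f}%')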
Not bad! And we could, as usual, tweak the values of the parameters kernel='linear' and C=1.0 specified in our SVC model. As discussed, the parameter C controls the strictness of separation, and it can be tuned to achieve the best trade-off between bias and variance.
The kernels of SVM
Scenario 5 - solving linearly non-separable problems
The hyperplanes we have looked at so far are linear, for example, a line in a two-dimensional feature space, or a flat surface in a three-dimensional one.
However, in frequently seen scenarios like the following one, we are not able to find any linear hyperplane that separates the two classes. The solution is to map the data into a higher-dimensional space with a kernel function; the most popular choice is the RBF (radial basis function, or Gaussian) kernel. Again, its kernel coefficient γ can be fine-tuned via cross-validation to obtain the best performance, as sketched below. Some other common kernel functions include the polynomial kernel and the sigmoid kernel.
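A minimal sketch of tuning the RBF kernel (the toy dataset and parameter grid are assumptions for illustration):

from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data that is not linearly separable: two concentric circles
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=42)

# Tune C and the RBF coefficient gamma jointly via cross-validation
parameters = {'C': [0.1, 1, 10], 'gamma': [0.1, 1, 10]}
grid_search = GridSearchCV(SVC(kernel='rbf'), parameters, cv=5)
grid_search.fit(X, y)

print(grid_search.best_params_, grid_search.best_score_)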
Choosing between the linear and RBF kernel
The rule of thumb, of course, is linear separability. However, this is most of the time very difficult to determine, unless you have sufficient prior knowledge or the features are of low dimension (1 to 3).
Case 1: both the number of features and the number of instances are large (more than 10^4 or 10^5). Since the dimension of the feature space is already high enough, the additional features resulting from the RBF transformation will not provide any performance improvement, but will increase the computational expense. Some examples from the UCI Machine Learning Repository are of this type:
• URL Reputation Data Set: https://archive.ics.uci.edu/ml/datasets/URL+Reputation (number of instances: 2,396,130; number of features: 3,231,961) for malicious URL detection based on lexical and host information
• YouTube Multiview Video Games Data Set: https://archive.ics.uci.edu/ml/datasets/YouTube+Multiview+Video+Games+Dataset (number of instances: 120,000; number of features: 1,000,000) for topic classification
Case 2: the number of features is noticeably large compared to the number of training samples. Apart from the reasons stated in Case 1, the RBF kernel is significantly more prone to overfitting. Such a scenario occurs in, for example:
• Dorothea Data Set: https://archive.ics.uci.edu/ml/datasets/Dorothea (number of instances: 1,950; number of features: 100,000) for drug discovery, classifying chemical compounds as active or inactive based on structural molecular features
• Arcene Data Set: https://archive.ics.uci.edu/ml/datasets/Arcene (number of instances: 900; number of features: 10,000), a mass-spectrometry dataset for cancer detection
Case 3: the number of instances is significantly large compared to the number of features. For a dataset of low dimension, the RBF kernel will, in general, boost performance by mapping it to a higher-dimensional space. However, due to its training complexity, it usually becomes inefficient on a training set with more than 10^6 or 10^7 samples. Some exemplar datasets include:
• Heterogeneity Activity Recognition Data Set: https://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition (number of instances: 43,930,257; number of features: 16) for human activity recognition
• HIGGS Data Set: https://archive.ics.uci.edu/ml/datasets/HIGGS (number of instances: 11,000,000; number of features: 28) for distinguishing between a signal process producing Higgs bosons and a background process
News topic classification with support vector machine
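The snippet below assumes a GridSearchCV object named grid_search has already been fitted on the training tf-idf features; a minimal sketch of how it might be set up (the parameter grid and cross-validation settings are assumptions):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumed inputs: term_docs_train and label_train from the tf-idf pipeline above
parameters = {'C': (0.1, 1, 10, 100)}
svc_libsvm = SVC(kernel='linear')
grid_search = GridSearchCV(svc_libsvm, parameters, n_jobs=-1, cv=3)
grid_search.fit(term_docs_train, label_train)
print(grid_search.best_params_, grid_search.best_score_)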
svc_libsvm_best = grid_search.best_estimator_
accuracy = svc_libsvm_best.score(term_docs_test, label_test)
print(f'The accuracy on testing set is: {accuracy * 100:.1f}%')
Stock Price Prediction using Linear Regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Load the historical stock data
df = pd.read_csv("NSE-TATAGLOBAL.csv")
df.columns = df.columns.str.strip()   # strip stray whitespace from column names
print(df.columns)

# Parse dates and index the frame chronologically
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.set_index('Date', inplace=True)
df.dropna(inplace=True)
plt.figure(figsize=(14, 6))
plt.plot(df['Close'], color='blue')
plt.title('TATAGLOBAL Closing Price Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Price (INR)')
plt.grid(True)
plt.show()
# Engineer predictive features from the raw prices
df['Open-Close'] = df['Open'] - df['Close']
df['High-Low'] = df['High'] - df['Low']
df['7day MA'] = df['Close'].rolling(window=7).mean()
df['14day MA'] = df['Close'].rolling(window=14).mean()
df.dropna(inplace=True)
df['Target'] = df['Close'].shift(-1)   # target: next day's closing price
df.dropna(inplace=True)
# Feature columns (assumed to be the engineered columns above)
features = ['Open-Close', 'High-Low', '7day MA', '14day MA']
X = df[features]
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print("Training Set:", X_train.shape)
print("Test Set:", X_test.shape)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R2 Score: {r2:.2f}")
print(f"RMSE: {rmse:.2f}")
OUTPUT-
R2 Score: 0.99
RMSE: 0.94
plt.figure(figsize=(14, 6))
plt.plot(y_test.index, y_test, label='Actual', color='blue')
plt.plot(y_test.index, y_pred, label='Predicted', color='orange')
plt.title("Actual vs Predicted Stock Prices")
plt.xlabel("Date")
plt.ylabel("Price (INR)")
plt.legend()
plt.grid(True)
plt.show()
next_day_input = X.tail(1)
next_day_prediction = model.predict(next_day_input)
print(f"Predicted next day's closing price: ₹{next_day_prediction[0]:.2f}")
OUTPUT-
Predicted next day's closing price: ₹155.44
Applications of Stock Prediction using Linear Regression