WEEK-10
AIM: Program to implement the naïve Bayesian classifier for a sample training data set
stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.
DESCRIPTION: The Naïve Bayes algorithm is a supervised learning algorithm based on
Bayes' theorem and used for solving classification problems. It is mainly used in text
classification, which involves high-dimensional training datasets. The Naïve Bayes classifier
is one of the simplest and most effective classification algorithms; it helps in building fast
machine learning models that can make quick predictions. It is a probabilistic classifier,
which means it predicts on the basis of the probability of an object. Popular applications of
the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Naïve: It is called naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features.
Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
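The role of Bayes' theorem can be illustrated with a small numeric sketch before running the full program. The probabilities below are invented values for a hypothetical spam filter, used only to show how a posterior is computed:

```python
# P(spam | word) = P(word | spam) * P(spam) / P(word)
# Hypothetical probabilities for illustration only:
p_spam = 0.2         # prior: 20% of all mail is spam
p_word_spam = 0.6    # the word "free" appears in 60% of spam
p_word_ham = 0.05    # the word "free" appears in 5% of non-spam

# total probability of seeing the word at all (law of total probability)
p_word = p_word_spam * p_spam + p_word_ham * (1 - p_spam)

posterior = p_word_spam * p_spam / p_word
print(round(posterior, 4))  # 0.75 -> a mail containing "free" is spam with probability 0.75
```

The "naïve" step is that, for a whole document, the per-word likelihoods are simply multiplied together, treating every word as independent of the others.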
Program:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from wordcloud import WordCloud
import nltk
import re
import string
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
stop_words = stopwords.words('english')
df = pd.read_csv('/content/IMDB Dataset.csv')
df.head()
21131A4436 1
df.info()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
ax1.imshow(WordCloud().generate(df[df['sentiment']=='positive']['review'].to_string()))
ax1.set_title('Positive Reviews')
ax2.imshow(WordCloud().generate(df[df['sentiment']=='negative']['review'].to_string()))
ax2.set_title('Negative Reviews')
df.rename(columns={'review':'text'}, inplace=True)
df
def cleaning(text):
    # converting to lowercase, removing URL links, special characters, punctuation...
    text = text.lower()                                # converting to lowercase
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # removing URL links
    text = re.sub(r'\b\d+\b', '', text)                # removing numbers
    text = re.sub(r'<.*?>+', '', text)                 # removing HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub('\n', '', text)
    text = re.sub('[’“”…]', '', text)
    # removing emoji:
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    return text
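A condensed, self-contained version of the same cleaning steps shows their effect on a sample string (the sample text and the helper name `clean` are invented for illustration):

```python
import re
import string

def clean(text):
    # mirror the main steps of cleaning(): lowercase, drop URLs, tags, numbers, punctuation
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URL links
    text = re.sub(r'<.*?>+', '', text)                  # HTML tags
    text = re.sub(r'\b\d+\b', '', text)                 # standalone numbers
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    return ' '.join(text.split())                       # collapse leftover whitespace

print(clean("Visit https://example.com <b>NOW</b> for 50 free movies!!!"))
# visit now for free movies
```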
dt = df['text'].apply(cleaning)
dt = pd.DataFrame(dt)
dt['sentiment'] = df['sentiment']
dt
dt['no_sw'] = dt['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
from collections import Counter
cnt = Counter(word for text in dt['no_sw'] for word in text.split())  # word frequencies across the corpus
FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])
def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])
dt["wo_stopfreq"] = dt["no_sw"].apply(lambda text: remove_freqwords(text))
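How `most_common` picks out the frequent words can be seen on a toy corpus (the three sentences below are made up for illustration):

```python
from collections import Counter

docs = ["the movie was good", "the plot was thin", "the acting was good"]

# count every word across the corpus, just as cnt does for the reviews
cnt = Counter(word for d in docs for word in d.split())

print(cnt.most_common(3))  # the three most frequent (word, count) pairs
print(cnt['good'])         # 2
```

Dropping the top-10 words from the real corpus removes very common, low-information terms that survive the stopword list.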
dt.head()
nltk.download('wordnet')
wordnet_lem = WordNetLemmatizer()
dt['wo_stopfreq_lem'] = dt['wo_stopfreq'].apply(wordnet_lem.lemmatize)
dt
# create the cleaned data for the train-test split:
nb = dt.drop(columns=['text', 'no_sw', 'wo_stopfreq'])
nb.columns = ['sentiment', 'review']
nb.sentiment = [0 if each == "negative" else 1 for each in nb.sentiment]
nb
tokenized_review = nb['review'].apply(lambda x: x.split())
tokenized_review.head(5)
0 [reviewers, mentioned, watching, oz, episode, ...
1 [wonderful, production, filming, technique, un...
2 [wonderful, spend, hot, summer, weekend, sitti...
3 [basically, family, boy, jake, thinks, zombie,...
4 [petter, matteis, love, money, visually, stunn...
Name: review, dtype: object
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english', ngram_range=(1,1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(nb['review'])
from sklearn.model_selection import train_test_split
X=text_counts
y=nb['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,random_state=30)
from sklearn.naive_bayes import ComplementNB
from [Link] import classification_report, confusion_matrix
CNB = ComplementNB()
CNB.fit(X_train, y_train)
from sklearn import metrics
predicted = CNB.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, y_test)
print('ComplementNB model accuracy is',str('{:04.2f}'.format(accuracy_score*100))+'%')
print('------------------------------------------------')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, predicted)))
print('------------------------------------------------')
print('Classification Report:')
print(classification_report(y_test, predicted))
ComplementNB model accuracy is 86.22%
------------------------------------------------
Confusion Matrix:
0 1
0 4327 650
1 728 4295
------------------------------------------------
Classification Report:
precision recall f1-score support
0 0.86 0.87 0.86 4977
1 0.87 0.86 0.86 5023
accuracy 0.86 10000
macro avg 0.86 0.86 0.86 10000
weighted avg 0.86 0.86 0.86 10000
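The reported accuracy follows directly from the confusion matrix above: correct predictions sit on the diagonal.

```python
# values taken from the ComplementNB confusion matrix above
tn, fp = 4327, 650
fn, tp = 728, 4295

accuracy = (tn + tp) / (tn + fp + fn + tp)
print('{:.2f}%'.format(accuracy * 100))  # 86.22%
```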
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, y_train)
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, y_test)
print('MultinomialNB model accuracy is',str('{:04.2f}'.format(accuracy_score*100))+'%')
print('------------------------------------------------')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, predicted)))
print('------------------------------------------------')
print('Classification Report:')
print(classification_report(y_test, predicted))
MultinomialNB model accuracy is 86.21%
------------------------------------------------
Confusion Matrix:
0 1
0 4327 650
1 729 4294
------------------------------------------------
Classification Report:
precision recall f1-score support
0 0.86 0.87 0.86 4977
1 0.87 0.85 0.86 5023
accuracy 0.86 10000
macro avg 0.86 0.86 0.86 10000
weighted avg 0.86 0.86 0.86 10000
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB()
BNB.fit(X_train, y_train)
predicted = BNB.predict(X_test)
accuracy_score_bnb = metrics.accuracy_score(predicted,y_test)
print('BernoulliNB model accuracy = ' + str('{:4.2f}'.format(accuracy_score_bnb*100))+'%')
print('------------------------------------------------')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, predicted)))
print('------------------------------------------------')
print('Classification Report:')
print(classification_report(y_test, predicted))
BernoulliNB model accuracy = 83.75%
------------------------------------------------
Confusion Matrix:
0 1
0 4403 574
1 1051 3972
------------------------------------------------
Classification Report:
precision recall f1-score support
0 0.81 0.88 0.84 4977
1 0.87 0.79 0.83 5023
accuracy 0.84 10000
macro avg 0.84 0.84 0.84 10000
weighted avg 0.84 0.84 0.84 10000