
Naïve Bayes Classifier Implementation

The document discusses implementing a Naive Bayesian classifier on a sample training dataset to classify movie reviews as positive or negative sentiment. It preprocesses the text data, builds CountVectorizer and tokenization, trains ComplementNB, MultinomialNB and BernoulliNB models and evaluates their accuracy on held-out test data, finding the ComplementNB and MultinomialNB models achieve the highest accuracy around 86%.

WEEK-10

AIM: Program to implement the naïve Bayesian classifier for a sample training data set
stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.
DESCRIPTION: The Naïve Bayes algorithm is a supervised learning algorithm based on
Bayes' theorem and used for solving classification problems. It is mainly used in text
classification, which involves high-dimensional training data. The Naïve Bayes classifier is
one of the simplest and most effective classification algorithms, and it helps in building fast
machine learning models that can make quick predictions. It is a probabilistic classifier,
which means it predicts on the basis of the probability of an object. Popular applications of
the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
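The decision rule behind the classifier can be illustrated with a small self-contained sketch. The spam-filter words and probabilities below are invented for illustration only; they are not taken from the program that follows.

```python
# Naive Bayes decision rule: P(class | words) is proportional to
# P(class) * product of P(word | class), assuming word independence.

priors = {"spam": 0.4, "ham": 0.6}                 # P(class), made-up numbers
likelihoods = {                                    # P(word | class), made-up numbers
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham":  {"free": 0.05, "meeting": 0.20},
}

def posterior_scores(words):
    """Unnormalized posterior score for each class."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for w in words:
            score *= likelihoods[cls].get(w, 1e-6)  # tiny floor for unseen words
        scores[cls] = score
    return scores

scores = posterior_scores(["free"])
print(max(scores, key=scores.get))  # spam: 0.4*0.30 = 0.12 beats ham: 0.6*0.05 = 0.03
```

The scikit-learn estimators used later (MultinomialNB, ComplementNB, BernoulliNB) apply this same rule, differing only in how the per-class word likelihoods are estimated.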

Program:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from wordcloud import WordCloud
import nltk
import re
import string
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
stop_words = set(stopwords.words('english'))

df = pd.read_csv('/content/IMDB Dataset.csv')
df.head()

21131A4436 1
df.info()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
ax1.imshow(WordCloud().generate(df[df['sentiment'] == 'positive']['review'].str.cat()))
ax1.set_title('Positive Reviews')
ax2.imshow(WordCloud().generate(df[df['sentiment'] == 'negative']['review'].str.cat()))
ax2.set_title('Negative Reviews')

df.rename(columns={'review': 'text'}, inplace=True)


df

def cleaning(text):
    # converting to lowercase, removing URL links, special characters, punctuation
    text = text.lower()                                              # converting to lowercase
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                # removing URL links
    text = re.sub(r"\b\d+\b", "", text)                              # removing numbers
    text = re.sub(r'<.*?>+', '', text)                               # removing HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub('\n', '', text)
    text = re.sub('[’“”…]', '', text)

    # removing emoji:
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    return text
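The substitution patterns can be checked in isolation on a toy review. The sample string below is invented, and the patterns are restated so the snippet runs on its own:

```python
import re
import string

sample = "Loved it!!! 10/10 <br /> see https://example.com"
text = sample.lower()
text = re.sub(r'https?://\S+|www\.\S+', '', text)                # strip URLs
text = re.sub(r'<.*?>+', '', text)                               # strip HTML tags
text = re.sub(r'\b\d+\b', '', text)                              # strip numbers
text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # strip punctuation
print(text.split())  # ['loved', 'it', 'see']
```

Note that the HTML-tag pattern matters for this dataset: IMDB reviews contain literal `<br />` tags.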

dt = df['text'].apply(cleaning)
dt = pd.DataFrame(dt)
dt['sentiment'] = df['sentiment']
dt
dt['no_sw'] = dt['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

# the word counter `cnt` is missing from the original listing; a standard
# frequency count over the stopword-free text is assumed here
from collections import Counter
cnt = Counter(word for text in dt['no_sw'] for word in text.split())
FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])


def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

dt["wo_stopfreq"] = dt["no_sw"].apply(lambda text: remove_freqwords(text))
dt.head()

nltk.download('wordnet')
wordnet_lem = WordNetLemmatizer()
# note: lemmatize() receives each whole review string as one token here;
# applying it word-by-word after tokenization is the more typical usage
dt['wo_stopfreq_lem'] = dt['wo_stopfreq'].apply(wordnet_lem.lemmatize)
dt
# create the cleaned data for the train-test split:
nb = dt.drop(columns=['text', 'no_sw', 'wo_stopfreq'])
nb.columns = ['sentiment', 'review']
nb.sentiment = [0 if each == "negative" else 1 for each in nb.sentiment]
nb
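The list comprehension used above to encode the sentiment labels can be verified on a few toy labels:

```python
# map "negative" -> 0 and anything else ("positive") -> 1
labels = ["negative", "positive", "negative", "positive"]
encoded = [0 if each == "negative" else 1 for each in labels]
print(encoded)  # [0, 1, 0, 1]
```

An equivalent pandas idiom would be `nb['sentiment'].map({'negative': 0, 'positive': 1})`, which additionally surfaces unexpected label values as NaN.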

tokenized_review = nb['review'].apply(lambda x: x.split())
tokenized_review.head(5)
0 [reviewers, mentioned, watching, oz, episode, ...
1 [wonderful, production, filming, technique, un...

2 [wonderful, spend, hot, summer, weekend, sitti...
3 [basically, family, boy, jake, thinks, zombie,...
4 [petter, matteis, love, money, visually, stunn...
Name: review, dtype: object

from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(nb['review'])
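To see what the fitted vectorizer produces, here is a tiny sketch on two invented documents with a default CountVectorizer (no stop-word list, so every word survives):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good good movie", "bad movie"]
cv_demo = CountVectorizer()
X_demo = cv_demo.fit_transform(docs)     # sparse document-term count matrix
print(sorted(cv_demo.vocabulary_))       # ['bad', 'good', 'movie']
print(X_demo.toarray())                  # [[0 2 1]
                                         #  [1 0 1]]
```

Each row is a document and each column a vocabulary word, which is exactly the count representation the multinomial-family Naive Bayes models below expect.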

from sklearn.model_selection import train_test_split


X=text_counts
y=nb['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,random_state=30)

from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

CNB = ComplementNB()
CNB.fit(X_train, y_train)
predicted = CNB.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, y_test)
print('ComplementNB model accuracy is', str('{:04.2f}'.format(accuracy_score*100)) + '%')
print('------------------------------------------------')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, predicted)))
print('------------------------------------------------')
print('Classification Report:')
print(classification_report(y_test, predicted))

ComplementNB model accuracy is 86.22%


------------------------------------------------
Confusion Matrix:
0 1
0 4327 650
1 728 4295
------------------------------------------------
Classification Report:
precision recall f1-score support
0 0.86 0.87 0.86 4977
1 0.87 0.86 0.86 5023
accuracy 0.86 10000
macro avg 0.86 0.86 0.86 10000
weighted avg 0.86 0.86 0.86 10000
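In the confusion matrix above, rows are true labels (0 = negative, 1 = positive) and columns are predicted labels. A tiny self-contained check of that layout, using toy labels:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
cm = pd.DataFrame(confusion_matrix(y_true, y_pred))
print(cm)
# row 0: one true negative, one false positive  -> [1, 1]
# row 1: zero false negatives, two true positives -> [0, 2]
```

Reading the ComplementNB matrix the same way, the off-diagonal cells (650 false positives, 728 false negatives) account for the roughly 14% error rate.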

from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB()
MNB.fit(X_train, y_train)
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, y_test)
print('MultinomialNB model accuracy is', str('{:04.2f}'.format(accuracy_score*100)) + '%')
print('------------------------------------------------')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, predicted)))
print('------------------------------------------------')
print('Classification Report:')
print(classification_report(y_test, predicted))

MultinomialNB model accuracy is 86.21%


------------------------------------------------
Confusion Matrix:
0 1
0 4327 650
1 729 4294
------------------------------------------------
Classification Report:
precision recall f1-score support
0 0.86 0.87 0.86 4977
1 0.87 0.85 0.86 5023
accuracy 0.86 10000
macro avg 0.86 0.86 0.86 10000
weighted avg 0.86 0.86 0.86 10000

from sklearn.naive_bayes import BernoulliNB

BNB = BernoulliNB()
BNB.fit(X_train, y_train)
predicted = BNB.predict(X_test)
accuracy_score_bnb = metrics.accuracy_score(predicted, y_test)
print('BernoulliNB model accuracy = ' + str('{:4.2f}'.format(accuracy_score_bnb*100)) + '%')
print('------------------------------------------------')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, predicted)))
print('------------------------------------------------')
print('Classification Report:')
print(classification_report(y_test, predicted))

BernoulliNB model accuracy = 83.75%


------------------------------------------------
Confusion Matrix:
0 1
0 4403 574
1 1051 3972
------------------------------------------------
Classification Report:
precision recall f1-score support
0 0.81 0.88 0.84 4977
1 0.87 0.79 0.83 5023
accuracy 0.84 10000
macro avg 0.84 0.84 0.84 10000
weighted avg 0.84 0.84 0.84 10000
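The three evaluation blocks above repeat the same fit/predict/score pattern, so they could be condensed into a loop. A hedged, self-contained sketch on a small synthetic count matrix follows (random data, so the printed accuracies will not match the IMDB figures above):

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 20))   # synthetic "word count" matrix
y = (X[:, 0] > 1).astype(int)            # synthetic label tied to the first feature
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=30)

results = {}
for model in (ComplementNB(), MultinomialNB(), BernoulliNB()):
    model.fit(X_tr, y_tr)
    results[type(model).__name__] = accuracy_score(y_te, model.predict(X_te))

for name, acc in results.items():
    print(f'{name} model accuracy is {acc*100:04.2f}%')
```

On the real IMDB counts, ComplementNB (86.22%) and MultinomialNB (86.21%) are essentially tied, while BernoulliNB (83.75%) trails because it discards the count information and keeps only word presence/absence.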
