PROJECT REPORT
Introduction To Data Science (CSL-487)
BS(CS)-6(A)
Project Title: TOXIC COMMENT ANALYSIS
Group Members
Name Enrollment
1. Huzaifa Muzaffar 02-134191-066
2. Ali Asghar 02-134191-121
3. Anas Shakeel 02-134191-047
Submitted to:
MISS SOOMAL FATIMA
BAHRIA UNIVERSITY KARACHI CAMPUS
Department of Computer Science
TABLE OF CONTENTS
1. ABSTRACT
2. INTRODUCTION
a. PROJECT AIMS & OBJECTIVES
b. BACKGROUND OF PROJECT
3. SYSTEM ANALYSIS
a. SOFTWARE REQUIREMENT SPECIFICATION
b. SOFTWARE & HARDWARE REQUIREMENT
c. OPERATION ENVIRONMENT
4. SYSTEM IMPLEMENTATION
5. SCREENSHOT
6. SYSTEM TESTING
7. CONCLUSION AND FUTURE SCOPE
8. REFERENCES
ABSTRACT
Online forums and social media platforms have provided individuals with the means to put
forward their thoughts and freely express their opinion on various issues and incidents. In some
cases, these online comments contain explicit language which may hurt the readers. Comments
containing explicit language can be classified into several categories such as Toxic, Severe Toxic,
Obscene, Threat, Insult, and Identity Hate. The threat of abuse and harassment means that many
people stop expressing themselves and give up on seeking different opinions.
To protect users from being exposed to offensive language on online forums or social media sites,
companies have started flagging comments and blocking users who are found guilty of using
unpleasant language. Several Machine Learning models have been developed and deployed to
filter out the unruly language and protect internet users from becoming victims of online
harassment and cyberbullying.
INTRODUCTION
Our main area of focus is the study of negative online behaviors such as toxic comments.
PROJECT AIMS AND OBJECTIVES
The aims and objectives are as follows:
To learn how data preprocessing works.
To analyze the toxic comments.
To figure out the levels of toxicity in comments.
To figure out the best model for further analyses.
BACKGROUND OF PROJECT
According to the Oxford Language Dictionary, a comment is a verbal or written remark expressing an opinion
or reaction. Whenever a comment is made, it may harass someone or it may be very
motivating for someone. The
SYSTEM IMPLEMENTATION
The Toxic Comment Classification Challenge is a competition organized by Jigsaw/Conversation AI and
hosted on Kaggle. The data set for building the classification model was acquired from the
competition site and included both the training set and the test set. The workflow below
describes the entire process from data pre-processing to model testing.
Data Exploration, Data Pre-processing, and Feature Engineering
Step 1: Checking for missing values.
First and foremost, after importing the training and test data into the pandas dataframe, I
decided to check for missing values in the downloaded data. Using the “isnull” function on both
the training and test data, I discovered that there were no missing records and therefore, I
moved on to the next step of my project.
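The missing-value check described above can be sketched roughly as follows. This is a minimal illustration, not the original notebook code: the small dataframe stands in for the Kaggle training data, and its rows are invented; only the column names `comment_text` and `toxic` follow the competition files.

```python
import pandas as pd

# Tiny stand-in for the training data (rows are invented for illustration).
train = pd.DataFrame({
    "comment_text": ["You are great", "This is awful", "Thanks for the edit"],
    "toxic": [0, 1, 0],
})

# isnull() marks every missing cell as True; sum() counts them per column.
missing_per_column = train.isnull().sum()
print(missing_per_column)

# True only if at least one cell anywhere in the dataframe is missing.
has_missing = bool(train.isnull().values.any())
print("Any missing values:", has_missing)
```

The same two calls can be applied to the test dataframe; a count of zero in every column corresponds to the "no missing records" finding.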
Step 2: Text Normalization.
As I was now certain that there are no missing records in my data, I decided to start with data
pre-processing. Firstly, I decided to normalize the text data since comments from online forums
usually contain inconsistent language, use of special characters in place of letters (e.g.
@rgument), as well as the use of numbers to represent letters (e.g. n0t). To tackle such
inconsistencies in data, I decided to use Regex. The text normalization steps that I performed are
listed below:
Removing Characters in between Text.
Removing Repeated Characters.
Converting data to lower-case.
Removing Punctuation.
Removing unnecessary white spaces in between words.
Removing “\n”.
Removing Non-English characters.
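The steps listed above could be combined into a single function along the following lines. This is only a sketch under stated assumptions: the exact patterns used in the project are not shown in the report, the symbol-to-letter map `LEET_MAP` is a hypothetical illustration of handling cases such as @rgument and n0t, and "non-English characters" is approximated here as "non-ASCII characters".

```python
import re
import string

# Hypothetical map of symbols that stand in for letters (assumption).
LEET_MAP = {"@": "a", "0": "o", "$": "s"}

def normalize_text(text: str) -> str:
    """Apply a simplified version of the normalization steps."""
    text = text.lower()                                   # convert to lower-case
    # Replace a mapped symbol when it touches a letter on either side.
    text = re.sub(r"(?<=[a-z])[@0$]|[@0$](?=[a-z])",
                  lambda m: LEET_MAP[m.group()], text)
    text = text.replace("\n", " ")                        # remove "\n"
    text = re.sub(r"(.)\1{2,}", r"\1", text)              # collapse runs of 3+ repeats
    text = text.encode("ascii", "ignore").decode()        # drop non-ASCII characters
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()              # collapse extra white space
    return text

print(normalize_text("Sooooo   @rgument\nis n0t COOL!!!"))
# so argument is not cool
```

The order of the steps matters in this sketch: symbols are mapped to letters before punctuation is stripped, otherwise "@rgument" would lose its first character instead of being repaired.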
Step 4: Stopwords Removal.
Stopwords Removal, as we all know, is one of the most critical steps in text pre-processing for
use-cases that involve text classification. Removing stopwords ensures that more focus is on
those words that define the meaning of the text.
i. To remove stopwords from my data, I took the help of the “spacy” library. Spacy has a list of
common stopwords, “STOP_WORDS” that can be used to remove stopwords from any textual
data.
ii. Although the list provided by spacy’s library is quite extensive, I decided to search for
additional stopwords that might be unique to my dataset.
iii. Firstly, I decided to add single-letter and two-letter words to the list of stopwords. While
reading through random comments in my dataset, I came across instances where single-letter or
two-letter words existed without any context, (e.g. Wow such a lovely pillow w!! or He is such a
happy guy bb.) To make sure that such instances of single-letter or two-letter words do not affect
the performance of my deep learning model, I added them to the list of stopwords. However, I
made sure that words like me, am, and as, and letters like I and a, were not added to the list of
stopwords.
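The extension of the stopword list described above might look like the sketch below. To keep it dependency-free, a small hand-written set `base_stopwords` stands in for spaCy's list (in the project this would be `STOP_WORDS` imported from `spacy.lang.en.stop_words`); the helper `remove_stopwords` and the choice to generate every one- and two-letter combination are assumptions made for illustration.

```python
import string
from itertools import product

# Small stand-in for spaCy's STOP_WORDS set (assumption, not the real list).
base_stopwords = {"the", "is", "such", "and", "of"}

# Short words that should NOT be added to the extra stopwords.
keep = {"me", "am", "as", "i", "a"}

letters = string.ascii_lowercase
single_letter = set(letters)
two_letter = {a + b for a, b in product(letters, repeat=2)}

# All one- and two-letter strings, minus the ones we want to keep.
extra_stopwords = (single_letter | two_letter) - keep
stopwords = base_stopwords | extra_stopwords

def remove_stopwords(text: str) -> str:
    """Drop every whitespace-separated token that is in the stopword set."""
    return " ".join(word for word in text.split() if word not in stopwords)

print(remove_stopwords("he is such a happy guy bb"))
# a happy guy
```

Note that treating every two-letter string as a stopword also removes ordinary words such as "he" or "no", so the keep list may need to be longer in practice.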
PROJECT DESIGN
Libraries are an essential part of Python:
The above figure shows the information of the data. The output shows that
the data is clean and there are no null values.
The above figure shows the comments. We can see that the comments are not
clean; thus, we need to clean the data.
Next, we will see how many columns there are and what the percentage of
toxicity in the comments is.
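A rough sketch of this check is given below. It assumes the six label columns of the Kaggle data (`toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`), each holding 0 or 1; the four sample rows are invented and the resulting percentages are not the project's actual figures.

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Invented sample rows; 1 means the comment carries that label.
df = pd.DataFrame({
    "comment_text":  ["c1", "c2", "c3", "c4"],
    "toxic":         [1, 0, 1, 0],
    "severe_toxic":  [0, 0, 1, 0],
    "obscene":       [1, 0, 0, 0],
    "threat":        [0, 0, 0, 0],
    "insult":        [1, 0, 1, 0],
    "identity_hate": [0, 0, 0, 0],
})

print("Number of columns:", df.shape[1])

# The mean of a 0/1 column is the share of comments with that label.
percent_per_label = df[LABELS].mean() * 100
print(percent_per_label)

# Share of comments that carry none of the six labels.
clean_percent = (df[LABELS].sum(axis=1) == 0).mean() * 100
print("Clean comments (%):", clean_percent)
```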
Graphs are visual representations of data. The figure below shows the data
visually:
The following figure shows the early stages of data preprocessing: numbers,
letters and punctuation have been removed.
We separate our dataset into 6 sections. Each section consists of the comment plus 1 category.
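One way to produce the six sections is sketched below. It is an illustration only: the dictionary `per_category` and the invented sample rows are assumptions, while the column names follow the Kaggle data.

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Invented sample rows standing in for the cleaned training data.
df = pd.DataFrame({
    "comment_text":  ["first comment", "second comment"],
    "toxic":         [1, 0],
    "severe_toxic":  [0, 0],
    "obscene":       [1, 0],
    "threat":        [0, 0],
    "insult":        [1, 0],
    "identity_hate": [0, 0],
})

# One two-column dataframe per category: the comment text plus that label.
per_category = {label: df[["comment_text", label]].copy() for label in LABELS}

print(per_category["toxic"])
```

Keeping the sections in a dictionary keyed by category name makes it straightforward to loop over them later, for example to train one model per category.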
Import relevant packages for Modeling
Transpose the combined F1 dataframe to make it suitable for presentation on a
graph
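The transposition step can be illustrated as follows. The layout of the combined F1 dataframe is an assumption (one row per model, one column per category), the name `OtherModel` is a placeholder, and the scores are made-up numbers, not results from this project.

```python
import pandas as pd

# Placeholder F1 scores: rows are models, columns are categories (assumed layout).
f1_all = pd.DataFrame(
    {"toxic": [0.80, 0.85], "insult": [0.70, 0.75]},
    index=["RandomForest", "OtherModel"],
)

# After transposing, each row is a category and each column is a model,
# so every model becomes one series when the dataframe is plotted.
f1_transposed = f1_all.T
print(f1_transposed)
```

With matplotlib installed, a call such as `f1_transposed.plot(kind="bar")` would then draw one group of bars per category.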
Pickling trained RandomForest models for all categories
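A minimal sketch of this step is shown below, assuming scikit-learn's `RandomForestClassifier` and Python's standard `pickle` module. The toy numeric features stand in for the vectorized comments, only two of the six categories are used to keep the example short, and the models are serialized in memory rather than written to files.

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

LABELS = ["toxic", "insult"]  # two of the six categories, for brevity

# Toy features and labels (invented); real input would be vectorized comments.
X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y_by_label = {"toxic": [1, 0, 1, 0], "insult": [0, 0, 1, 0]}

# Train one RandomForest model per category.
models = {}
for label in LABELS:
    clf = RandomForestClassifier(n_estimators=10, random_state=0)
    clf.fit(X, y_by_label[label])
    models[label] = clf

# Serialize each trained model to bytes and load it back.
pickled = {label: pickle.dumps(model) for label, model in models.items()}
restored = {label: pickle.loads(blob) for label, blob in pickled.items()}

# The restored models should predict exactly like the originals.
for label in LABELS:
    print(label, (restored[label].predict(X) == models[label].predict(X)).all())
```

To store the models on disk instead, `pickle.dump(model, file)` with a file opened in `"wb"` mode can be used, one file per category.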
SYSTEM TESTING
The objective was for our program to stay free from all sorts of bugs and errors. Judging by the
Web Page, the flow of the program was smooth and user friendly. The program was tested with
random user inputs and selections, which showed that it behaves consistently. The program was
handed over to different people, both from the group and outside it, to check its robustness in
random hands. Criticism was welcomed. The program ran smoothly and without errors.
Step-by-step implementation was performed, and attention was paid to each part of the program.
The flow and the presentation of the program were kept in mind. The errors were removed, and the
flow of the program was as per expectations. The program may be difficult to use for anyone who
finds the GUI difficult to understand.
CONCLUSION & FUTURE SCOPE
While working on the “Restaurant Web Page” project, we made progress by solving several
problems. Finding a solution to each problem was our biggest challenge and the most important
lesson of the experience. We hope that these problems and solutions will help us in the future.
We came to know how different types of properties can be implemented and programmed in a Web
Page. We also learned how a good program is written, how to convert real-life problems into
solutions, and how to understand and fully write a program architecture.
This project has a wide scope, as the number of restaurants is increasing rather than decreasing,
and every food center needs a management system to rely on. We hope that our project will be
updated with time as we move on to reach new heights in our lives.