# 🔹 Lab 1: Introduction to NLP
# 🎯 Objective: Learn basic text processing
# Step 1: Import libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Download required resources (only first time)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Step 2: Example text
text = "I am learning Natural Language Processing in class."
print("Original Text:")
print(text)
[nltk_data] Downloading package punkt to C:\Users\Sujay
[nltk_data] Kumar\AppData\Roaming\nltk_data...
[nltk_data] Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to C:\Users\Sujay
[nltk_data] Kumar\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to C:\Users\Sujay
[nltk_data] Kumar\AppData\Roaming\nltk_data...
Original Text:
I am learning Natural Language Processing in class.
Tokenization (Breaking text into words)
In simple words:
word_tokenize(text) → cuts the sentence into small pieces (words/punctuation).
tokens = ... → saves those words into a variable called tokens.
print("After Tokenization:") → shows a label so students know what’s coming.
print(tokens) → shows the actual list of words.
👉 Example: Input → "I am learning NLP in class." Output → ['I', 'am', 'learning', 'NLP', 'in', 'class', '.']
# Step 3: Tokenization (Breaking text into words)
tokens = word_tokenize(text)  # Takes the sentence in 'text' and splits it into individual words (tokens)
print("After Tokenization:")  # Prints a heading so the output looks clear
print(tokens)                 # Prints the list of tokens (words) after splitting
After Tokenization:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'in', 'class', '.']
Removing Stopwords (like "am", "in", "the")
In simple words: We are cleaning the sentence by throwing away unnecessary words (like "is", "the", "in") that don't add much meaning.
👉 Example: Input Tokens → ['I', 'am', 'learning', 'NLP', 'in', 'class'] Output → ['learning', 'NLP', 'class'] (note: 'I' is also removed, because the check lowercases each word and "i" is a stopword)
# Step 4: Removing Stopwords (like "am", "in", "the")
stop_words = set(stopwords.words('english'))
# This line loads a list of common English words (like "am", "is", "the") from NLTK
# and stores them inside a set called stop_words.
# A "set" is used because it's faster to check membership.
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# This line creates a new list.
# It goes through each word in tokens one by one,
# converts the word into lowercase (word.lower()),
# and keeps it ONLY if it is NOT in stop_words.
print("After Removing Stopwords:")
# Prints a heading so the output looks clear.
print(filtered_tokens)
# Prints the list of words after removing stopwords.
After Removing Stopwords:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']
Stemming (cutting words to their root form)
In simple words: Stemming is like chopping words down to their base. It may not always be a
real word but gives the core root.
👉 Example: Input → ['learning', 'NLP', 'classes'] Output → ['learn', 'nlp', 'class']
# Step 5: Stemming (cutting words to their root form)
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Creates a "stemmer" object using the Porter algorithm.
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
# For each word in filtered_tokens,
# apply stemmer.stem(word), which cuts the word down to its root form.
# Example: "learning" → "learn", "classes" → "class"
print("After Stemming:")
print(stemmed_tokens)
# Prints the list of words after stemming.
After Stemming:
['learn', 'natur', 'languag', 'process', 'class', '.']
Lemmatization (finding correct base word using grammar rules)
In simple words: Lemmatization is smarter than stemming because it uses grammar + a dictionary to find the correct base word.
👉 Example: Input → ['better', 'running', 'classes'] Output (with the correct part-of-speech tags) → ['good', 'run', 'class']
# Step 6: Lemmatization (finding correct base word using grammar rules)
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Creates a "lemmatizer" object that uses the WordNet dictionary.
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
# For each word in filtered_tokens,
# apply lemmatizer.lemmatize(word), which returns the meaningful base form.
# By default every word is treated as a noun, so "learning" stays "learning";
# passing a part-of-speech tag (e.g. pos='v') gives "learn".
print("After Lemmatization:")
print(lemmatized_tokens)
# Prints the list of words after lemmatization.
After Lemmatization:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']
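Notice that "learning" survives unchanged above, because lemmatize() assumes every word is a noun unless told otherwise. The dictionary-lookup idea behind lemmatization can be illustrated with a toy sketch (the lemma table below is invented for illustration; it is not WordNet):

```python
# A tiny, hand-made lemma dictionary (illustration only, not WordNet)
lemma_table = {
    "better": "good",    # irregular comparative → base adjective
    "running": "run",    # verb form → base verb
    "classes": "class",  # plural noun → singular
}

def toy_lemmatize(word):
    # Look the word up; fall back to the word itself if it's unknown
    return lemma_table.get(word.lower(), word)

words = ["better", "running", "classes", "NLP"]
print([toy_lemmatize(w) for w in words])  # unknown words pass through unchanged
```

A real lemmatizer works the same way in spirit, but its "table" is a full dictionary (WordNet) plus grammar rules, which is why it needs the part of speech to pick the right entry.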
Installing Other Necessary Packages
📊 Core Data Handling & Analysis
NumPy – For numerical computing, arrays, and mathematical operations.
Pandas – For data manipulation, cleaning, and analysis (DataFrames).
📈 Data Visualization
Matplotlib – For basic 2D plotting and charts.
Seaborn – Built on Matplotlib, makes statistical graphics more attractive.
Plotly – For interactive, dynamic, and web-based visualizations.
🤖 Machine Learning
Scikit-learn – For machine learning models (classification, regression, clustering, etc.).
XGBoost – For gradient boosting and high-performance ML models.
🧠 Deep Learning
TensorFlow – Popular deep learning framework from Google.
PyTorch – Deep learning framework widely used in research and industry.
🌐 NLP & Data Preprocessing
NLTK / SpaCy – For natural language processing (tokenization, text cleaning, etc.).
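All of the packages above can be installed from the command line with pip (assuming the standard PyPI package names; PyTorch's package is `torch`):

```shell
pip install numpy pandas matplotlib seaborn plotly scikit-learn xgboost tensorflow torch nltk spacy
```

In a Jupyter notebook, the same command can be run in a cell by prefixing it with `!`.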
# 📊 Core Data Handling
import numpy as np
import pandas as pd
# 📈 Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px # For interactive plots
# 🤖 Machine Learning
import sklearn
import xgboost as xgb
# 🧠 Deep Learning
import tensorflow as tf
# 🌐 NLP & Text Processing
import nltk
Chunking: Dividing Data into Chunks
Example 1: Chunking Text Data
Sometimes in NLP (Natural Language Processing), you break text into chunks of words for
processing.
text = "Data Science is an exciting field to learn"
# Split into words
words = text.split()
words
['Data', 'Science', 'is', 'an', 'exciting', 'field', 'to', 'learn']
text = "Data Science is an exciting field to learn"
# Split into words
words = text.split()
# Chunk into size 2 (pairs of words)
def chunk_words(words, size):
    for i in range(0, len(words), size):
        yield words[i:i+size]
chunks = list(chunk_words(words, 2))
print("Words:", words)
print("Chunks:", chunks)
Words: ['Data', 'Science', 'is', 'an', 'exciting', 'field', 'to', 'learn']
Chunks: [['Data', 'Science'], ['is', 'an'], ['exciting', 'field'], ['to', 'learn']]
Meaning of each line used in the above example
text = "Data Science is an exciting field to learn"  # The sentence we want to chunk
words = text.split()                                 # Split the sentence into a list of words
def chunk_words(words, size):                        # Generator that yields chunks of the given size
    for i in range(0, len(words), size):             # Step through the list 'size' words at a time
        yield words[i:i+size]                        # Slice out the next chunk
chunks = list(chunk_words(words, 2))                 # Collect all chunks of size 2 into a list
print("Words:", words)                               # Show the full word list
print("Chunks:", chunks)                             # Show the list of chunks
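For a sentence this small, the generator can also be written as a one-line list comprehension; this is a stylistic alternative that produces the same chunks:

```python
words = "Data Science is an exciting field to learn".split()

# Slice the list in steps of 2; each slice is one chunk
chunks = [words[i:i+2] for i in range(0, len(words), 2)]
print(chunks)
```

The generator version is still preferable for very large inputs, since it produces chunks one at a time instead of building the whole list in memory.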
1. Bag of Words (BoW) Model
👉 A Bag of Words model is a simple way of converting text into numbers so that a computer can
understand it.
It ignores grammar and order of words, only keeps word counts.
Think of it like making a shopping list of words.
Example:
Sentence 1: "I love Data Science" Sentence 2: "Data Science is fun"
BoW Vocabulary = {I, love, Data, Science, is, fun}
Now represent each sentence as numbers (counts):
S1 → [1,1,1,1,0,0]
S2 → [0,0,1,1,1,1]
📌 In short: BoW = text → numbers, based on how many times each word appears.
Step 1: Collect all unique words (Vocabulary)
We take all words from both sentences and make a list of unique words.
Sentence 1: "I love Data Science" Sentence 2: "Data Science is fun"
Unique words = {I, love, Data, Science, is, fun} 👉 This is called the vocabulary.
Step 2: Count how many times each word appears in a sentence
We create a table with all vocabulary words as columns:
Sentence                   I  love  Data  Science  is  fun
S1: "I love Data Science"  1  1     1     1        0   0
S2: "Data Science is fun"  0  0     1     1        1   1
Step 3: Represent each sentence as a vector (row of numbers)
S1 → [1, 1, 1, 1, 0, 0] (because S1 has 1 "I", 1 "love", 1 "Data", 1 "Science", 0 "is", 0 "fun")
S2 → [0, 0, 1, 1, 1, 1] (because S2 has 0 "I", 0 "love", 1 "Data", 1 "Science", 1 "is", 1 "fun")
📌 Key idea:
Each sentence becomes a list of numbers (vector).
The position of each number corresponds to a word in the vocabulary.
The value tells how many times that word appears.
# Sentences
sentences = ["I love Data Science", "Data Science is fun"]
# Define vocabulary (fixed order)
vocab = ["I", "love", "Data", "Science", "is", "fun"]
# Function to create BoW vector
def create_bow(sentence, vocab):
    words = sentence.split() # split sentence into words
    return [words.count(word) for word in vocab]
# Generate BoW for each sentence
s1 = create_bow(sentences[0], vocab)
s2 = create_bow(sentences[1], vocab)
print("S1 →", s1)
print("S2 →", s2)
S1 → [1, 1, 1, 1, 0, 0]
S2 → [0, 0, 1, 1, 1, 1]
print("S1 →", s1)
print("S2 →", s2)
# Prints the final Bag of Words vectors.
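One consequence of using a fixed vocabulary is that any word not in it is simply ignored. A small sketch of this limitation, reusing the same create_bow function (the third sentence is invented for illustration):

```python
vocab = ["I", "love", "Data", "Science", "is", "fun"]

def create_bow(sentence, vocab):
    words = sentence.split()  # split sentence into words
    return [words.count(word) for word in vocab]

# "really" and "cool" are not in the vocabulary, so they contribute nothing
s3 = create_bow("Data Science is really cool", vocab)
print("S3 →", s3)
```

This is why real BoW pipelines build the vocabulary from the whole training corpus first; words seen only at prediction time still get dropped.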