NLP Basics: Tokenization, Stemming, Lemmatization

# 🔹 Lab 1: Introduction to NLP

# 🎯 Objective: Learn basic text processing

# Step 1: Import libraries


import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required resources (only first time)


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Step 2: Example text


text = "I am learning Natural Language Processing in class."

print("Original Text:")
print(text)

[nltk_data] Downloading package punkt to C:\Users\Sujay
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to C:\Users\Sujay
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to C:\Users\Sujay
[nltk_data]     Kumar\AppData\Roaming\nltk_data...

Original Text:
I am learning Natural Language Processing in class.


Tokenization (Breaking text into words)


In simple words:

word_tokenize(text) → cuts the sentence into small pieces (words/punctuation).

tokens = ... → saves those words into a variable called tokens.


print("After Tokenization:") → shows a label so students know what’s coming.

print(tokens) → shows the actual list of words.

👉 Example: Input → "I am learning NLP in class." Output → ['I', 'am', 'learning', 'NLP', 'in', 'class', '.']

# Step 3: Tokenization (Breaking text into words)


tokens = word_tokenize(text)   # takes the sentence in 'text' and splits it into individual words (tokens)
print("After Tokenization:")   # prints a heading so the output looks clear
print(tokens)                  # prints the list of tokens (words) after splitting

After Tokenization:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'in', 'class', '.']

Removing Stopwords (like "am", "in", "the")


In simple words: We are cleaning the sentence by throwing away unnecessary words (like “is”,
“the”, “in”) that don’t add much meaning.

👉 Example: Input Tokens → ['I', 'am', 'learning', 'NLP', 'in', 'class'] Output → ['I', 'learning', 'NLP', 'class']

# Step 4: Removing Stopwords (like "am", "in", "the")

stop_words = set(stopwords.words('english'))
# Loads a list of common English words (like "am", "is", "the") from NLTK
# and stores them inside a set called stop_words.
# A set is used because membership checks are faster.

filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# Creates a new list: it goes through each word in tokens one by one,
# converts the word to lowercase (word.lower()),
# and keeps it ONLY if it is NOT in stop_words.

print("After Removing Stopwords:")
# Prints a heading so the output looks clear.

print(filtered_tokens)
# Prints the list of words after removing stopwords.
After Removing Stopwords:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']

Stemming (cutting words to their root form)


In simple words: Stemming is like chopping words down to their base. It may not always be a
real word but gives the core root.

👉 Example: Input → ['learning', 'NLP', 'classes'] Output → ['learn', 'nlp', 'class']

# Step 5: Stemming (cutting words to their root form)


from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Creates a "stemmer" object using the Porter algorithm.

stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
# For each word in filtered_tokens,
# apply stemmer.stem(word), which cuts the word down to its root form.
# Example: "learning" → "learn", "classes" → "class"

print("After Stemming:")
print(stemmed_tokens)
# Prints the list of words after stemming.

After Stemming:
['learn', 'natur', 'languag', 'process', 'class', '.']

# Lemmatization (finding correct base word using grammar rules)

In simple words: Lemmatization is smarter than stemming because it uses grammar + a dictionary to find the correct base word.

👉 Example: Input → ['better', 'running', 'classes'] Output (when the right part of speech is given) → ['good', 'run', 'class']

# Step 6: Lemmatization (finding correct base word using grammar rules)

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Creates a "lemmatizer" object that uses the WordNet dictionary.

lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
# For each word in filtered_tokens,
# apply lemmatizer.lemmatize(word), which returns the dictionary base form.
# Example: "better" → "good" (with pos="a"), "running" → "run" (with pos="v")

print("After Lemmatization:")
print(lemmatized_tokens)
# Prints the list of words after lemmatization.

After Lemmatization:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']


Installing Other Necessary Packages


📊 Core Data Handling & Analysis

NumPy – For numerical computing, arrays, and mathematical operations.

Pandas – For data manipulation, cleaning, and analysis (DataFrames).

📈 Data Visualization

Matplotlib – For basic 2D plotting and charts.

Seaborn – Built on Matplotlib, makes statistical graphics more attractive.

Plotly – For interactive, dynamic, and web-based visualizations.

🤖 Machine Learning

Scikit-learn – For machine learning models (classification, regression, clustering, etc.).

XGBoost – For gradient boosting and high-performance ML models.

🧠 Deep Learning

TensorFlow – Popular deep learning framework from Google.

PyTorch – Deep learning framework widely used in research and industry.

🌐 NLP & Data Preprocessing

NLTK / SpaCy – For natural language processing (tokenization, text cleaning, etc.).
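All of the packages above can be installed with pip in one command (a typical setup line, assuming Python and pip are already installed; note the PyPI names are lowercase and PyTorch's package is called `torch`):

```shell
pip install numpy pandas matplotlib seaborn plotly scikit-learn xgboost tensorflow torch nltk spacy
```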

# 📊 Core Data Handling


import numpy as np
import pandas as pd

# 📈 Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px  # For interactive plots

# 🤖 Machine Learning
import sklearn
import xgboost as xgb

# 🧠 Deep Learning
import tensorflow as tf

# 🌐 NLP & Text Processing


import nltk

Chunking: Dividing Data into Chunks


Example 1: Chunking Text Data

Sometimes in NLP (Natural Language Processing), you break text into chunks of words for
processing.

text = "Data Science is an exciting field to learn"

# Split into words


words = text.split()
words

['Data', 'Science', 'is', 'an', 'exciting', 'field', 'to', 'learn']

text = "Data Science is an exciting field to learn"

# Split into words


words = text.split()

# Chunk into size 2 (pairs of words)


def chunk_words(words, size):
    for i in range(0, len(words), size):
        yield words[i:i+size]

chunks = list(chunk_words(words, 2))

print("Words:", words)
print("Chunks:", chunks)

Words: ['Data', 'Science', 'is', 'an', 'exciting', 'field', 'to', 'learn']
Chunks: [['Data', 'Science'], ['is', 'an'], ['exciting', 'field'], ['to', 'learn']]
Meaning of each statement used in the above example

text = "Data Science is an exciting field to learn"
# The sentence we want to split into chunks.

words = text.split()
# Splits the sentence into a list of words.

def chunk_words(words, size):
    for i in range(0, len(words), size):
        yield words[i:i+size]
# A generator function: it walks the word list in steps of 'size'
# and yields one slice (chunk) at a time.

chunks = list(chunk_words(words, 2))
# Runs the generator and collects all chunks of size 2 into a list.

print("Words:", words)
print("Chunks:", chunks)
# Print the full word list and the list of chunks.
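As a quick check that the helper handles leftovers, the same function with size 3 simply produces a shorter final chunk (8 words → 3 + 3 + 2):

```python
text = "Data Science is an exciting field to learn"
words = text.split()

def chunk_words(words, size):
    # Walk the list in steps of 'size', yielding one slice per step.
    for i in range(0, len(words), size):
        yield words[i:i+size]

print(list(chunk_words(words, 3)))
# → [['Data', 'Science', 'is'], ['an', 'exciting', 'field'], ['to', 'learn']]
```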

1. Bag of Words (BoW) Model


👉 A Bag of Words model is a simple way of converting text into numbers so that a computer can
understand it.

It ignores grammar and order of words, only keeps word counts.

Think of it like making a shopping list of words.

Example:
Sentence 1: "I love Data Science" Sentence 2: "Data Science is fun"

BoW Vocabulary = {I, love, Data, Science, is, fun}

Now represent each sentence as numbers (counts):

S1 → [1,1,1,1,0,0]

S2 → [0,0,1,1,1,1]

📌 In short: BoW = text → numbers, based on how many times each word appears.

Step 1: Collect all unique words (Vocabulary)

We take all words from both sentences and make a list of unique words.

Sentence 1: "I love Data Science" Sentence 2: "Data Science is fun"

Unique words = {I, love, Data, Science, is, fun} 👉 This is called the vocabulary.
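Step 1 can be written as a few lines of Python; this sketch keeps the first occurrence of each word, so the vocabulary comes out in the same sentence order used above:

```python
s1 = "I love Data Science"
s2 = "Data Science is fun"

vocab = []
for word in (s1 + " " + s2).split():
    if word not in vocab:   # keep only the first occurrence of each word
        vocab.append(word)

print(vocab)  # → ['I', 'love', 'Data', 'Science', 'is', 'fun']
```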

Step 2: Count how many times each word appears in a sentence

We create a table with all vocabulary words as columns:

| Sentence                  | I | love | Data | Science | is | fun |
|---------------------------|---|------|------|---------|----|-----|
| S1: "I love Data Science" | 1 | 1    | 1    | 1       | 0  | 0   |
| S2: "Data Science is fun" | 0 | 0    | 1    | 1       | 1  | 1   |

Step 3: Represent each sentence as a vector (row of numbers)

S1 → [1, 1, 1, 1, 0, 0] (because S1 has 1 "I", 1 "love", 1 "Data", 1 "Science", 0 "is", 0 "fun")

S2 → [0, 0, 1, 1, 1, 1] (because S2 has 0 "I", 0 "love", 1 "Data", 1 "Science", 1 "is", 1 "fun")

📌 Key idea:

Each sentence becomes a list of numbers (vector).

The position of each number corresponds to a word in the vocabulary.

The value tells how many times that word appears.

# Sentences
sentences = ["I love Data Science", "Data Science is fun"]

# Define vocabulary (fixed order)


vocab = ["I", "love", "Data", "Science", "is", "fun"]

# Function to create BoW vector


def create_bow(sentence, vocab):
    words = sentence.split()  # split sentence into words
    return [words.count(word) for word in vocab]  # count each vocabulary word

# Generate BoW for each sentence


s1 = create_bow(sentences[0], vocab)
s2 = create_bow(sentences[1], vocab)
print("S1 →", s1)
print("S2 →", s2)

S1 → [1, 1, 1, 1, 0, 0]
S2 → [0, 0, 1, 1, 1, 1]
