NLP Basics: Tokenization, Stemming, Lemmatization

# 🔹 Lab 1: Introduction to NLP

# 🎯 Objective: Learn basic text processing

# Step 1: Import libraries


import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required resources (only first time)


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Step 2: Example text


text = "I am learning Natural Language Processing in class."

print("Original Text:")
print(text)

[nltk_data] Downloading package punkt to C:\Users\Sujay
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to C:\Users\Sujay
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to C:\Users\Sujay
[nltk_data]     Kumar\AppData\Roaming\nltk_data...

Original Text:
I am learning Natural Language Processing in class.


Tokenization (Breaking text into words)


In simple words:

word_tokenize(text) → cuts the sentence into small pieces (words/punctuation).

tokens = ... → saves those words into a variable called tokens.


print("After Tokenization:") → shows a label so students know what’s coming.

print(tokens) → shows the actual list of words.

👉 Example: Input → "I am learning NLP in class." Output → ['I', 'am', 'learning', 'NLP', 'in', 'class', '.']

# Step 3: Tokenization (Breaking text into words)


tokens = word_tokenize(text)   # takes the sentence in 'text' and splits it into individual words (tokens)
print("After Tokenization:")   # prints a heading so the output looks clear
print(tokens)                  # prints the list of tokens (words) after splitting

After Tokenization:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'in', 'class', '.']

Removing Stopwords (like "am", "in", "the")


In simple words: We are cleaning the sentence by throwing away unnecessary words (like “is”,
“the”, “in”) that don’t add much meaning.

👉 Example: Input Tokens → ['I', 'am', 'learning', 'NLP', 'in', 'class'] Output → ['I', 'learning', 'NLP', 'class']

# Step 4: Removing Stopwords (like "am", "in", "the")

stop_words = set(stopwords.words('english'))
# Loads a list of common English words (like "am", "is", "the") from NLTK
# and stores them inside a set called stop_words.
# A set is used because membership checks are faster.

filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# Creates a new list: it goes through each word in tokens one by one,
# converts the word to lowercase (word.lower()),
# and keeps it ONLY if it is NOT in stop_words.

print("After Removing Stopwords:")
# Prints a heading so the output looks clear.

print(filtered_tokens)
# Prints the list of words after removing stopwords.
After Removing Stopwords:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']

Stemming (cutting words to their root form)


In simple words: Stemming is like chopping words down to their base. It may not always be a
real word but gives the core root.

👉 Example: Input → ['learning', 'NLP', 'classes'] Output → ['learn', 'nlp', 'class']

# Step 5: Stemming (cutting words to their root form)


from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Creates a "stemmer" object using the Porter algorithm.

stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
# For each word in filtered_tokens,
# apply stemmer.stem(word), which cuts the word down to its root form.
# Example: "learning" → "learn", "classes" → "class"

print("After Stemming:")
print(stemmed_tokens)
# Prints the list of words after stemming.

After Stemming:
['learn', 'natur', 'languag', 'process', 'class', '.']

# Lemmatization (finding correct base word using grammar rules)

In simple words: Lemmatization is smarter than stemming because it uses grammar + a dictionary to find the correct base word.

👉 Example: Input → ['better', 'running', 'classes'] Output (when the right part of speech is given) → ['good', 'run', 'class']

# Step 6: Lemmatization (finding correct base word using grammar rules)

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Creates a "lemmatizer" object that uses the WordNet dictionary.

lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
# For each word in filtered_tokens,
# apply lemmatizer.lemmatize(word), which returns the dictionary base form.
# Example: "better" → "good" (with pos="a"), "running" → "run" (with pos="v")

print("After Lemmatization:")
print(lemmatized_tokens)
# Prints the list of words after lemmatization.

After Lemmatization:
['learning', 'Natural', 'Language', 'Processing', 'class', '.']


Installing Other Necessary Packages


📊 Core Data Handling & Analysis

NumPy – For numerical computing, arrays, and mathematical operations.

Pandas – For data manipulation, cleaning, and analysis (DataFrames).

📈 Data Visualization

Matplotlib – For basic 2D plotting and charts.

Seaborn – Built on Matplotlib, makes statistical graphics more attractive.

Plotly – For interactive, dynamic, and web-based visualizations.

🤖 Machine Learning

Scikit-learn – For machine learning models (classification, regression, clustering, etc.).

XGBoost – For gradient boosting and high-performance ML models.

🧠 Deep Learning

TensorFlow – Popular deep learning framework from Google.

PyTorch – Deep learning framework widely used in research and industry.

🌐 NLP & Data Preprocessing

NLTK / SpaCy – For natural language processing (tokenization, text cleaning, etc.).
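All of the packages above can be installed with pip in one command (a typical setup line, assuming Python and pip are already installed; note the PyPI names are lowercase and PyTorch's package is called `torch`):

```shell
pip install numpy pandas matplotlib seaborn plotly scikit-learn xgboost tensorflow torch nltk spacy
```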

# 📊 Core Data Handling


import numpy as np
import pandas as pd

# 📈 Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px  # For interactive plots

# 🤖 Machine Learning
import sklearn
import xgboost as xgb

# 🧠 Deep Learning
import tensorflow as tf

# 🌐 NLP & Text Processing


import nltk

Chunking: Dividing Data into Chunks


Example 1: Chunking Text Data

Sometimes in NLP (Natural Language Processing), you break text into chunks of words for
processing.

text = "Data Science is an exciting field to learn"

# Split into words


words = text.split()
words

['Data', 'Science', 'is', 'an', 'exciting', 'field', 'to', 'learn']

text = "Data Science is an exciting field to learn"

# Split into words


words = text.split()

# Chunk into size 2 (pairs of words)


def chunk_words(words, size):
    for i in range(0, len(words), size):
        yield words[i:i+size]

chunks = list(chunk_words(words, 2))

print("Words:", words)
print("Chunks:", chunks)

Words: ['Data', 'Science', 'is', 'an', 'exciting', 'field', 'to', 'learn']
Chunks: [['Data', 'Science'], ['is', 'an'], ['exciting', 'field'], ['to', 'learn']]
Meaning of each statement used in the above example

text = "Data Science is an exciting field to learn"
# The sentence we want to split into chunks.

words = text.split()
# Splits the sentence into a list of words.

def chunk_words(words, size):
    for i in range(0, len(words), size):
        yield words[i:i+size]
# A generator function: it walks the word list in steps of 'size'
# and yields one slice (chunk) at a time.

chunks = list(chunk_words(words, 2))
# Runs the generator and collects all chunks of size 2 into a list.

print("Words:", words)
print("Chunks:", chunks)
# Print the full word list and the list of chunks.
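As a quick check that the helper handles leftovers, the same function with size 3 simply produces a shorter final chunk (8 words → 3 + 3 + 2):

```python
text = "Data Science is an exciting field to learn"
words = text.split()

def chunk_words(words, size):
    # Walk the list in steps of 'size', yielding one slice per step.
    for i in range(0, len(words), size):
        yield words[i:i+size]

print(list(chunk_words(words, 3)))
# → [['Data', 'Science', 'is'], ['an', 'exciting', 'field'], ['to', 'learn']]
```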

1. Bag of Words (BoW) Model


👉 A Bag of Words model is a simple way of converting text into numbers so that a computer can
understand it.

It ignores grammar and order of words, only keeps word counts.

Think of it like making a shopping list of words.

Example:
Sentence 1: "I love Data Science" Sentence 2: "Data Science is fun"

BoW Vocabulary = {I, love, Data, Science, is, fun}

Now represent each sentence as numbers (counts):

S1 → [1,1,1,1,0,0]

S2 → [0,0,1,1,1,1]

📌 In short: BoW = text → numbers, based on how many times each word appears.

Step 1: Collect all unique words (Vocabulary)

We take all words from both sentences and make a list of unique words.

Sentence 1: "I love Data Science" Sentence 2: "Data Science is fun"

Unique words = {I, love, Data, Science, is, fun} 👉 This is called the vocabulary.
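Step 1 can be written as a few lines of Python; this sketch keeps the first occurrence of each word, so the vocabulary comes out in the same sentence order used above:

```python
s1 = "I love Data Science"
s2 = "Data Science is fun"

vocab = []
for word in (s1 + " " + s2).split():
    if word not in vocab:   # keep only the first occurrence of each word
        vocab.append(word)

print(vocab)  # → ['I', 'love', 'Data', 'Science', 'is', 'fun']
```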

Step 2: Count how many times each word appears in a sentence

We create a table with all vocabulary words as columns:

| Sentence                  | I | love | Data | Science | is | fun |
|---------------------------|---|------|------|---------|----|-----|
| S1: "I love Data Science" | 1 | 1    | 1    | 1       | 0  | 0   |
| S2: "Data Science is fun" | 0 | 0    | 1    | 1       | 1  | 1   |

Step 3: Represent each sentence as a vector (row of numbers)

S1 → [1, 1, 1, 1, 0, 0] (because S1 has 1 "I", 1 "love", 1 "Data", 1 "Science", 0 "is", 0 "fun")

S2 → [0, 0, 1, 1, 1, 1] (because S2 has 0 "I", 0 "love", 1 "Data", 1 "Science", 1 "is", 1 "fun")

📌 Key idea:

Each sentence becomes a list of numbers (vector).

The position of each number corresponds to a word in the vocabulary.

The value tells how many times that word appears.

# Sentences
sentences = ["I love Data Science", "Data Science is fun"]

# Define vocabulary (fixed order)


vocab = ["I", "love", "Data", "Science", "is", "fun"]

# Function to create BoW vector


def create_bow(sentence, vocab):
    words = sentence.split()  # split sentence into words
    return [words.count(word) for word in vocab]  # count each vocabulary word

# Generate BoW for each sentence


s1 = create_bow(sentences[0], vocab)
s2 = create_bow(sentences[1], vocab)
print("S1 →", s1)
print("S2 →", s2)

S1 → [1, 1, 1, 1, 0, 0]
S2 → [0, 0, 1, 1, 1, 1]
