Natural Language Pre-processing
Prepared by: Syed Afroz Ali
Learn NLP Step by Step: [Link]
Follow me on LinkedIn for more Content: [Link]
Natural Language Processing (NLP)
Natural Language Processing (NLP) involves a series of pre-processing steps that transform raw text data into a format suitable for analysis or machine learning models. These steps improve the quality of the data and make it easier for algorithms to understand and process the text. Below are the key pre-processing steps used in NLP.
Common NLP pre-processing steps applied before feeding data into a model:
01. Lowercasing
02. Tokenization
03. Removing Punctuation
04. Removing Stopwords
05. Stemming
06. Lemmatization
07. Removing Numbers
08. Removing Extra Spaces
09. Handling Contractions
10. Removing Special Characters
11. Part-of-Speech (POS) Tagging
12. Named Entity Recognition (NER)
13. Vectorization
14. Handling Missing Data
15. Normalization
16. Spelling Correction
17. Handling Emojis and Emoticons
18. Removing HTML Tags
19. Handling URLs
20. Handling Mentions and Hashtags
21. Sentence Segmentation
22. Handling Abbreviations
23. Language Detection
24. Text Encoding
25. Handling Whitespace Tokens
26. Handling Dates and Times
27. Text Augmentation
28. Handling Negations
29. Dependency Parsing
30. Handling Rare Words
31. Text Chunking
32. Handling Synonyms
33. Text Normalization for Social Media
1. Lowercasing
Purpose: Converts all text to lowercase to ensure
uniformity.
Why: Reduces the vocabulary size and avoids treating
the same word in different cases as different tokens
(e.g., "Apple" vs. "apple").
1. Lowercasing
text = "Hello World! This is NLP."
text = text.lower()
print(text)
2. Tokenization
Purpose: Splits text into individual words, phrases,
or sentences (tokens).
Why: Breaks down text into manageable units for
further processing.
2. Tokenization
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
text = "Hello World! This is NLP."
tokens = word_tokenize(text)
print(tokens)
3. Removing Punctuation
Purpose: Removes punctuation marks like commas,
periods, exclamation marks, etc.
Why: Punctuation often doesn’t contribute to the
meaning in many NLP tasks and can add noise.
3. Removing Punctuation
import string
text = "Hello, World! This is NLP."
text = text.translate(str.maketrans('', '', string.punctuation))
print(text)
4. Removing Stopwords
Purpose: Removes common words like "the," "is,"
"and," which don’t carry significant meaning.
Why: Reduces noise and focuses on meaningful
words.
4. Removing Stopwords
import nltk
nltk.download('stopwords')  # Download the 'stopwords' dataset
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = ["this", "is", "a", "sample", "sentence"]
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
5. Stemming
Purpose: Reduces words to their root form by
chopping off suffixes (e.g., "running" → "run").
Why: Simplifies words to their base form, reducing
vocabulary size.
5. Stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "runner", "ran"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
6. Lemmatization
Purpose: Converts words to their base or dictionary
form (e.g., "better" → "good").
Why: More accurate than stemming as it uses
vocabulary and morphological analysis.
6. Lemmatization
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "runner", "ran", "flies", "flying"]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
print(lemmatized_words)
7. Removing Numbers
Purpose: Removes numeric values from the text.
Why: Numbers may not be relevant in certain NLP
tasks like sentiment analysis.
7. Removing Numbers
import re
text = "There are 3 apples and 5 oranges."
text = re.sub(r'\d+', '', text)
print(text)
8. Removing Extra Spaces
Purpose: Eliminates multiple spaces, tabs, or newlines.
Why: Ensures clean and consistent text formatting.
8. Removing Extra Spaces
text = "  This is   a sentence.  "
text = ' '.join(text.split())
print(text)
9. Handling Contractions
Purpose: Expands contractions (e.g., "can't" →
"cannot").
Why: Standardizes text for better processing.
9. Handling Contractions
!pip install contractions
from contractions import fix
text = "I can't do this."
text = fix(text)
print(text)
10. Removing Special Characters
Purpose: Removes non-alphanumeric characters like
@, #, $, etc.
Why: Reduces noise and irrelevant symbols.
10. Removing Special Characters
import re
text = "This is a #sample text with @special characters!"
text = re.sub(r'[^\w\s]', '', text)
print(text)
11. Part-of-Speech (POS) Tagging
Purpose: Assigns grammatical tags to words (e.g.,
noun, verb, adjective).
Why: Helps in understanding the syntactic structure
of sentences.
11. Part-of-Speech (POS) Tagging
Example:
Let's take the sentence: "The cat sat on the mat."
A POS tagger would label each word as follows:
The/DT (Determiner)
cat/NN (Noun)
sat/VBD (Verb, past tense)
on/IN (Preposition)
the/DT (Determiner)
mat/NN (Noun)
Note: Part-of-Speech tagging is a crucial technique in NLP that assigns
grammatical labels to words, providing valuable information for various language
processing tasks.
11. Part-of-Speech (POS) Tagging
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger_eng')  # Download the required resource
tokens = word_tokenize("This is a sample sentence.")
pos_tags = pos_tag(tokens)
print(pos_tags)
12. Named Entity Recognition (NER)
Purpose: Identifies and classifies entities like names,
dates, locations, etc.
Why: Useful for tasks like information extraction.
12. Named Entity Recognition (NER)
"Elon Musk, CEO of Tesla, announced on Twitter that he will visit Berlin, Germany on July 15th, 2024 to discuss the new Gigafactory."
An NER system would identify:
Elon Musk - PERSON
Tesla - ORGANIZATION
Twitter - ORGANIZATION
Berlin, Germany - LOCATION
July 15th, 2024 - DATE
Gigafactory - ORGANIZATION
12. Named Entity Recognition (NER)
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
# Download the required resources
nltk.download('words')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker_tab')  # Needed on newer NLTK versions
tokens = word_tokenize("John works at Google in New York.")
pos_tags = pos_tag(tokens)
ner_tags = ne_chunk(pos_tags)
print(ner_tags)
# (S (PERSON John/NNP) works/VBZ at/IN (ORGANIZATION Google/NNP)
#    in/IN (GPE New/NNP York/NNP) ./.)
13. Vectorization
Purpose: Converts text into numerical vectors (e.g.,
Bag of Words, TF-IDF, Word Embeddings).
Why: Machine learning models require numerical
input.
Vectorization in NLP
In Natural Language Processing (NLP),
vectorization is the process of converting text
data into numerical representations (vectors) that
machine learning models can understand and
process. Since computers primarily work with
numbers, text data needs to be transformed into
a numerical format before it can be used for
tasks like sentiment analysis, text classification,
or machine translation.
Why is Vectorization Important?
Machine Learning Compatibility: Most machine learning
algorithms require numerical input. Text data, in its raw
form, is not directly compatible.
Feature Extraction: Vectorization helps extract
meaningful features from text, capturing semantic
relationships and patterns.
Dimensionality Reduction: It can help reduce the
dimensionality of text data, making it more manageable
for computation and analysis.
1. Bag of Words (BoW):
Concept: Represents a document as an unordered set of
words, disregarding grammar and word order but keeping
track of word frequencies.
Process:
Vocabulary Creation: A vocabulary of all unique words across all
documents is created.
Vector Representation: Each document is represented as a
vector where each element corresponds to a word in the
vocabulary. The value of each element indicates the frequency
of that word in the document.
Example:
Documents:
Document 1: "The cat sat on the mat."
Document 2: "The dog chased the cat."
Vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "chased"]
Vectors:
Document 1: [2, 1, 1, 1, 1, 0, 0]
Document 2: [2, 1, 0, 0, 0, 1, 1]
Limitations:
Ignores word order and context.
Doesn't capture semantic meaning.
High dimensionality for large vocabularies.
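The bag-of-words construction above can be reproduced in a few lines of plain Python (the `bag_of_words` helper is illustrative only; in practice a library such as scikit-learn's CountVectorizer does this job):

```python
def bag_of_words(docs):
    # Tokenize: lowercase and strip trailing punctuation
    tokenized = [[w.strip(".!,").lower() for w in d.split()] for d in docs]
    # Build the vocabulary in first-seen order
    vocab = []
    for doc in tokenized:
        for w in doc:
            if w not in vocab:
                vocab.append(w)
    # One frequency vector per document
    vectors = [[doc.count(w) for w in vocab] for doc in tokenized]
    return vocab, vectors

vocab, vectors = bag_of_words(
    ["The cat sat on the mat.", "The dog chased the cat."]
)
print(vocab)    # ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'chased']
print(vectors)  # [[2, 1, 1, 1, 1, 0, 0], [2, 1, 0, 0, 0, 1, 1]]
```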
2. Term Frequency-Inverse Document Frequency (TF-IDF):
Concept: Improves upon BoW by weighing words based
on their importance in a document relative to the entire
corpus.
Process:
Term Frequency (TF): Measures how frequently a term
appears in a document.
Inverse Document Frequency (IDF): Measures how rare or
common a term is across the entire corpus. Rare terms have
higher IDF values.
TF-IDF Score: Calculated as TF * IDF.
Example: (Using the same documents as above)
Let's say "the" appears in many documents,
while "chased" appears in only a few. TF-IDF
would assign a lower weight to "the" and a
higher weight to "chased" in Document 2.
Advantages:
Gives more importance to relevant terms.
Reduces the impact of common words.
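The TF * IDF computation can be sketched with the same two documents (raw term frequency and an unsmoothed log IDF are used here; libraries such as scikit-learn apply slightly different smoothing):

```python
import math

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "chased", "the", "cat"]]

def tf(term, doc):
    # Fraction of the document's tokens that are this term
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Rarer terms (fewer containing documents) get higher IDF
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" appears in every document, so its IDF (and TF-IDF) is 0;
# "chased" appears in only one document, so it gets a positive weight.
print(tfidf("the", docs[1], docs))     # 0.0
print(tfidf("chased", docs[1], docs))
```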
3. Word Embeddings (Word2Vec, GloVe):
Concept: Represents each word as a dense, low-dimensional vector learned from large corpora, so that words used in similar contexts receive similar vectors.
Example:
The vectors for "king" and "queen" end up close together, capturing semantic similarity that BoW and TF-IDF miss.
Advantages:
Captures semantic relationships between words.
Produces compact, dense representations.
4. Document Embeddings (Doc2Vec, Sentence-BERT):
Concept: Extends word embeddings to represent entire
documents or sentences as vectors.
Process: Similar to word embeddings but trained to learn
representations for larger chunks of text.
Example:
Doc2Vec can create vectors for entire paragraphs, capturing the
overall meaning and context.
Advantages:
Captures document-level semantics.
Useful for tasks like document similarity and clustering.
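Doc2Vec and Sentence-BERT require trained models, but the core idea of "one vector per document" can be illustrated with the simplest baseline: averaging word vectors. The tiny 2-dimensional vectors below are made up purely for illustration (real embeddings have hundreds of dimensions and are learned from data):

```python
# Hypothetical 2-d word vectors, invented for this sketch
word_vectors = {
    "cat": [0.9, 0.1], "dog": [0.8, 0.2],
    "sat": [0.1, 0.9], "chased": [0.2, 0.8],
}

def doc_vector(tokens):
    # Average the vectors of known words; out-of-vocabulary words are skipped
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]

# "the" has no vector, so this averages the "cat" and "sat" vectors
print(doc_vector(["the", "cat", "sat"]))
```

Trained document embedders follow the same shape (text in, fixed-length vector out) but learn far richer representations.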
Choosing the Right Vectorization Technique:
The best vectorization technique depends on the specific NLP task
and dataset.
Simple Tasks: BoW or TF-IDF might be sufficient for tasks like
spam detection or basic text classification.
Complex Tasks: Word embeddings and document embeddings
are generally preferred for tasks involving semantic
understanding, such as sentiment analysis, question answering,
and machine translation.
In summary, vectorization is a fundamental step in NLP that
allows us to convert text data into a numerical format suitable
for machine learning models, enabling computers to understand
and process human language.
14. Handling Missing Data
Purpose: Fills or removes missing or incomplete text
data.
Why: Ensures the dataset is complete and consistent.
14. Handling Missing Data
import pandas as pd
data = {"text": ["Hello", None, "World"]}
df = pd.DataFrame(data)
df["text"] = df["text"].fillna("My Dear")  # Fill missing values
print(df)
15. Normalization
Purpose: Standardizes text representations (e.g., stripping accents or converting all dates to a single format).
Why: Ensures consistency in the dataset.
15. Normalization
import unicodedata
text = "Café, résumé, naïve"
text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
print(text)
16. Spelling Correction
Purpose: Corrects spelling errors in the text.
Why: Improves the quality of the text for analysis.
16. Spelling Correction
from textblob import TextBlob
text = "I made a many mistakes in Artificial intellengence"
blob = TextBlob(text)
corrected_text = blob.correct()
print(corrected_text)
17. Handling Emojis and Emoticons
Purpose: Converts emojis and emoticons into text or
removes them.
Why: Emojis can carry sentiment or meaning that
needs to be captured.
17. Handling Emojis and Emoticons
!pip install emoji
import emoji
text = "I love Python! 😆"
# Convert emojis to text
print(emoji.demojize(text))
# Remove emojis
print(emoji.replace_emoji(text, replace=""))
18. Removing HTML Tags
Purpose: Removes HTML tags from web scraped
text.
Why: HTML tags are irrelevant for most NLP tasks.
18. Removing HTML Tags
from bs4 import BeautifulSoup
text = "<p>This is a <b>sample</b> text.</p>"
soup = BeautifulSoup(text, "html.parser")
clean_text = soup.get_text()
print(clean_text)
19. Handling URLs
Purpose: Removes or replaces URLs in the text.
Why: URLs are often irrelevant for text analysis.
19. Handling URLs
import re
text = "Visit my website at https://www.example.com"  # illustrative URL
text = re.sub(r'http\S+|www\S+', '', text, flags=re.MULTILINE)
print(text)
20. Handling Mentions and Hashtags
Purpose: Processes or removes social media mentions
(@user) and hashtags (#topic).
Why: Useful for social media text analysis.
20. Handling Mentions and Hashtags
import re
text = "Hey @user, check out #NLP!"
text = re.sub(r'@\w+|#\w+', '', text)
print(text)
21. Sentence Segmentation
Purpose: Splits text into individual sentences.
Why: Important for tasks like machine translation or
summarization.
21. Sentence Segmentation
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt_tab')  # Download the 'punkt_tab' dataset before using sent_tokenize
text = "This is the first sentence. This is the second sentence."
sentences = sent_tokenize(text)
print(sentences)
22. Handling Abbreviations
Purpose: Expands abbreviations (e.g., "ASAP" → "as
soon as possible").
Why: Ensures clarity and consistency.
22. Handling Abbreviations
!pip install contractions
import contractions
text = "I'll be there ASAP."
expanded_text = contractions.fix(text)  # handles "I'll" -> "I will"
# Domain abbreviations such as "ASAP" usually need a custom mapping:
abbreviations = {"ASAP": "as soon as possible"}
for abbr, full in abbreviations.items():
    expanded_text = expanded_text.replace(abbr, full)
print(expanded_text)
23. Language Detection
Purpose: Identifies the language of the text.
Why: Ensures the correct NLP model is applied.
23. Language Detection
!pip install langdetect
from langdetect import detect
text = "Ceci est un texte en français."
language = detect(text)
print(language)
24. Text Encoding
Purpose: Converts text into a specific encoding
format (e.g., UTF-8).
Why: Ensures compatibility with NLP tools and
models.
24. Text Encoding
text = "Café"
encoded = text.encode('utf-8')  # str -> UTF-8 bytes
text = encoded.decode('utf-8')  # bytes -> str
print(text)
25. Handling Whitespace Tokens
Purpose: Removes or processes tokens that are just
spaces or empty strings.
Why: Ensures clean and meaningful tokens.
25. Handling Whitespace Tokens
tokens = ["This", " ", "is", " ", "a", " ", "sample", " "]
tokens = [token for token in tokens if token.strip()]
print(tokens)
26. Handling Dates and Times
Purpose: Standardizes or extracts date and time
formats.
Why: Useful for time-sensitive analysis.
26. Handling Dates and Times
import dateutil.parser as dparser
text = "The event is on 2023-10-15."
date = dparser.parse(text, fuzzy=True)
print(date)
27. Text Augmentation
Purpose: Generates additional training data by
modifying existing text (e.g., synonym replacement).
Why: Improves model robustness and performance.
27. Text Augmentation
!pip install nlpaug
import nltk
from nlpaug.augmenter.word import SynonymAug
# Download the resources the augmenter needs
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
aug = SynonymAug(aug_src='wordnet')
text = "This is a sample text."
augmented_text = aug.augment(text)
print(augmented_text)
28. Handling Negations
Purpose: Identifies and processes negations (e.g.,
"not good").
Why: Important for sentiment analysis and
understanding context.
28. Handling Negations
from nltk import word_tokenize
text = "This is not good."
tokens = word_tokenize(text)
for i, token in enumerate(tokens):
    if token == "not" and i + 1 < len(tokens):
        tokens[i + 1] = "not_" + tokens[i + 1]
print(tokens)
29. Dependency Parsing
Purpose: Analyzes the grammatical structure of a
sentence.
Why: Helps in understanding relationships between
words.
29. Dependency Parsing
!python -m spacy download en_core_web_sm  # Download the model if not already installed
import spacy
nlp = spacy.load("en_core_web_sm")
text = "This is a sample sentence."
doc = nlp(text)
for token in doc:
    print(token.text, token.dep_, token.head.text)
30. Handling Rare Words
Purpose: Replaces or removes rare words that occur
infrequently.
Why: Reduces noise and improves model efficiency.
30. Handling Rare Words
from collections import Counter
tokens = ["this", "is", "a", "rare", "word", "word"]
word_counts = Counter(tokens)
rare_words = {word for word, count in word_counts.items() if count < 2}
tokens = [token if token not in rare_words else "<UNK>" for token in tokens]
print(tokens)
31. Text Chunking
Purpose: Groups words into "chunks" based on POS
tags (e.g., noun phrases).
Why: Useful for information extraction.
31. Text Chunking
from nltk import pos_tag, word_tokenize
from nltk.chunk import RegexpParser
text = "This is a sample sentence."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(grammar)
tree = chunk_parser.parse(pos_tags)
print(tree)
32. Handling Synonyms
Purpose: Replaces words with their synonyms.
Why: Helps in text augmentation and reducing
redundancy.
32. Handling Synonyms
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet
word = "happy"
synonyms = wordnet.synsets(word)
print([syn.lemmas()[0].name() for syn in synonyms])
33. Text Normalization for Social Media
Purpose: Processes informal text (e.g., "u" → "you",
"gr8" → "great").
Why: Social media text often contains informal
language and slang.
33. Text Normalization for Social Media
import re
text = "I loooove this!"
# Collapse repeated characters ("loooove" -> "love"); note this naive rule
# also shortens legitimate doubles, e.g. "good" -> "god"
text = re.sub(r'(.)\1+', r'\1', text)
print(text)
NLP Preprocessing
These pre-processing steps are crucial for
cleaning, standardizing, and transforming raw
text into a format suitable for NLP models. The
specific steps used depend on the task (e.g.,
sentiment analysis, machine translation) and the
nature of the text (e.g., formal documents,
social media posts).
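As a closing sketch, several of the steps above can be chained into one cleaning function using only the standard library (the selection and order of steps here are just one reasonable choice and should be tuned per task):

```python
import re
import string

def preprocess(text):
    text = text.lower()                         # 1. lowercasing
    text = re.sub(r'http\S+|www\S+', '', text)  # 19. remove URLs
    text = re.sub(r'@\w+|#\w+', '', text)       # 20. mentions and hashtags
    text = re.sub(r'\d+', '', text)             # 7. remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # 3. punctuation
    text = ' '.join(text.split())               # 8. extra spaces
    return text

print(preprocess("Hey @user, 3 tips for #NLP!  Visit https://example.com"))
# -> "hey tips for visit"
```

Note that order matters: URLs and mentions are stripped before punctuation removal, which would otherwise break their patterns.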