
Natural Language Processing

with Python
A Comprehensive Cheat Sheet for NLP Tasks and Techniques

🔤 🚀 🤗 📊 🧠
NLTK spaCy Transformers Gensim sklearn

Text Preprocessing

▶ Cleaning & Tokenization NLTK

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    return [w for w in tokens if w not in stop_words]

▶ Stemming vs Lemmatization

# Stemming with NLTK
from nltk.stem import PorterStemmer
porter = PorterStemmer()
porter.stem("running")  # "run"

# Lemmatization with spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I am running in the park")
[token.lemma_ for token in doc]
# ['I', 'be', 'run', 'in', 'the', 'park']

Preprocessing Tips



Always lowercase text for consistency
Remove stopwords for topic modeling, but keep them for sentiment analysis
Use lemmatization over stemming when meaning preservation is important
Consider domain-specific preprocessing (e.g., hashtags for social media)
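The domain-specific tip above can be sketched with plain regular expressions. The helper name and patterns below are illustrative, not from the original; they handle the common social-media cases (URLs, @mentions, hashtags):

```python
import re

def clean_tweet(text):
    """Illustrative social-media preprocessing: strip URLs and @mentions,
    keep the word inside a hashtag (dropping only the '#')."""
    text = re.sub(r'https?://\S+', '', text)  # remove URLs
    text = re.sub(r'@\w+', '', text)          # remove @mentions
    text = re.sub(r'#(\w+)', r'\1', text)     # '#NLP' -> 'NLP'
    return ' '.join(text.split())             # collapse leftover whitespace

clean_tweet("Loving #NLP! thanks @spacy_io https://example.com")
# 'Loving NLP! thanks'
```

Whether to keep hashtag words or drop them entirely depends on the task; for sentiment they often carry the signal.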

Feature Extraction

▶ Bag of Words sklearn

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
"Natural language processing.",
"I love learning about NLP."
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# Get feature names & document vectors
vectorizer.get_feature_names_out()
X.toarray()

▶ TF-IDF sklearn

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

▶ Word Embeddings Gensim

from gensim.models import Word2Vec

sentences = [["natural", "language"], ["machine", "learning"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Get vector & similar words


vector = model.wv['natural']
similar = model.wv.most_similar('natural', topn=5)

When to Use Each Feature Type

BoW/TF-IDF: Text classification, document clustering


Word Embeddings: Semantic tasks, text similarity, transfer learning



Contextual Embeddings: Advanced tasks requiring context understanding
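The trade-off above shows up directly in code: BoW/TF-IDF vectors ignore word order, so two reorderings of the same words are indistinguishable. A minimal sklearn sketch (the toy corpus is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "natural language processing",
    "processing natural language",  # same words, different order
    "deep learning models",
]
X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X)

sim[0, 1]  # 1.0 -- reordered duplicates look identical
sim[0, 2]  # 0.0 -- no shared terms
```

This is exactly the gap that word and contextual embeddings close for semantic tasks.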

Text Classification

▶ Basic Pipeline sklearn

from sklearn.feature_extraction.text import TfidfVectorizer


from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

X_train = ["I love this product", "This is terrible"]


y_train = ["positive", "negative"]

text_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

text_clf.fit(X_train, y_train)
text_clf.predict(["This was awesome"]) # ['positive']

▶ Using Transformers Transformers

from transformers import pipeline

classifier = pipeline('sentiment-analysis')

result = classifier("I've been waiting for this movie!")


# [{'label': 'POSITIVE', 'score': 0.9998}]

Classification Task    Recommended Approach
Sentiment Analysis     VADER (rule-based) or fine-tuned BERT
Topic Classification   TF-IDF + SVM or DistilBERT
Intent Recognition     Fine-tuned RoBERTa
Spam Detection         TF-IDF + Naive Bayes

Named Entity Recognition

▶ spaCy NER spaCy



import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Apple ORG
# U.K. GPE
# $1 billion MONEY

▶ Transformers NER Transformers

from transformers import pipeline

ner = pipeline("ner")

text = "My name is Sarah and I work at Google in London"


ner_results = ner(text)
# [{'entity': 'I-PER', 'score': 0.99, 'word': 'Sarah'}, ...]

Common Entity Types

PER/PERSON: People names


ORG: Organizations, companies
LOC/GPE: Locations, geopolitical entities
DATE/TIME: Temporal expressions
MONEY: Monetary values

Sentiment Analysis

▶ TextBlob TextBlob

from textblob import TextBlob

text = "The movie was absolutely amazing!"


blob = TextBlob(text)

# Polarity: -1 (negative) to 1 (positive)


print(blob.sentiment.polarity) # 0.8

# Subjectivity: 0 (objective) to 1 (subjective)


print(blob.sentiment.subjectivity) # 0.75

▶ VADER Sentiment NLTK



from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

text = "The movie was absolutely amazing!"


scores = sia.polarity_scores(text)

print(scores)
# {'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8316}

Sentiment Analysis Tips

Rule-based approaches work well for straightforward text


Consider using domain-specific models for specialized content
ML/DL approaches handle context, sarcasm, and negation better
Use BERT variants for state-of-the-art performance

Topic Modeling

▶ LDA with Gensim Gensim

from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    "Machine learning is a subset of AI",
    "NLP is used for text analysis"
]

# Tokenize
tokenized_docs = [doc.lower().split() for doc in docs]

# Create dictionary & corpus


dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Train LDA model


lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    passes=10
)

# Print topics
topics = lda_model.print_topics()
for topic in topics:
    print(topic)



Topic Modeling Approaches

LDA: Classic probabilistic approach


NMF: Non-negative Matrix Factorization
BERTopic: Leverages BERT embeddings
Top2Vec: Document embeddings + clustering

Advanced Techniques

▶ Text Summarization Transformers

from transformers import pipeline

summarizer = pipeline("summarization")

long_text = """NLP is a field of AI that focuses on..."""

summary = summarizer(long_text, max_length=100, min_length=30)


print(summary[0]['summary_text'])

▶ Translation Transformers

from transformers import pipeline

translator = pipeline("translation_en_to_fr")

translation = translator("Hello, how are you?")


print(translation[0]['translation_text'])

▶ Question Answering Transformers

from transformers import pipeline

qa = pipeline("question-answering")

context = "Python is a programming language created by..."


question = "Who created Python?"

result = qa(question=question, context=context)


print(result['answer'])

Task             Beginner Approach       Advanced Approach
Summarization    Extractive (TextRank)   Abstractive (T5, BART)
Translation      Pre-trained pipeline    Custom Seq2Seq models
Q&A              Rule-based systems      Fine-tuned BERT/T5
Text Generation  Markov Chains           GPT models

NLP Project Evaluation & Tips

Evaluation Metrics by Task

Classification: Accuracy, F1-score, Precision, Recall


NER: F1-score, Precision, Recall (by entity type)
Summarization: ROUGE-N, ROUGE-L, BLEU
Translation: BLEU, METEOR, TER
Generation: Perplexity, human evaluation
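For classification, each metric above is a single sklearn call. This toy example (labels are illustrative) also shows why accuracy alone can hide mistakes on one class:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = ["pos", "neg", "pos", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "pos", "neg"]

accuracy_score(y_true, y_pred)                    # 0.8
precision_score(y_true, y_pred, pos_label="pos")  # 1.0 -- no false positives
recall_score(y_true, y_pred, pos_label="pos")     # ~0.67 -- one 'pos' missed
f1_score(y_true, y_pred, pos_label="pos")         # 0.8
```

Here accuracy and precision look strong while recall reveals the missed positive; reporting all four avoids that blind spot.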

Best Practices for NLP Projects

Start simple: Try basic models before complex ones


Clean your data thoroughly: Good preprocessing is crucial
Consider context: Many NLP problems need contextual understanding
Leverage pre-trained models: Often outperform models trained from scratch
Handle class imbalance: Use oversampling or adjusted weights
Use cross-validation: Especially for small datasets
Evaluate properly: Choose appropriate metrics for your task
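The cross-validation and class-imbalance tips combine naturally in a single sklearn pipeline. The tiny dataset and parameter choices below are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

texts = [
    "great product", "loved it", "excellent quality", "works perfectly",
    "awful experience", "broke immediately", "terrible support", "very poor",
] * 3  # repeated only so every CV fold contains both classes
labels = (["positive"] * 4 + ["negative"] * 4) * 3

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(class_weight="balanced")),  # reweight classes
])
scores = cross_val_score(clf, texts, labels, cv=3, scoring="f1_macro")
scores.mean()
```

Putting the vectorizer inside the pipeline matters: it is refit on each training fold, so no term statistics leak from the held-out fold into training.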

Natural Language Processing with Python Cheat Sheet • 2025 Edition


Common questions


VADER sentiment analysis is preferred in scenarios involving straightforward text datasets that include social media or general public sentiment data because it is a rule-based system specifically tuned for understanding the nuances of social media text. Unlike complex ML/DL approaches that require extensive training and data, VADER provides immediate, easily interpretable sentiment scores without the need for training, making it efficient for real-time applications.

Transformer-based approaches for text summarization, such as T5 and BART, provide more sophisticated abstractive summaries by generating new sentences that capture the core ideas of the text, rather than just extracting key phrases or sentences like traditional methods. This allows for a more coherent and concise representation of the original document. These models leverage attention mechanisms to weigh the importance of different input segments, creating summaries that are contextually aware and retain the original document's meaning more effectively.

Bag of Words and TF-IDF approaches represent text as a set of independent words, ignoring the order and context in which words appear. This leads to a lack of semantic understanding, making them insufficient for tasks like semantic similarity or text inference where the meaning of a sentence as a whole is important. The context-free nature of these models results in losing the nuanced meaning necessary for high-level comprehension tasks.

NER implemented with spaCy relies on a pre-trained statistical model with a fixed pipeline that handles a standard set of entity types and is optimized for speed. In contrast, Transformers use deep learning architectures that employ attention mechanisms for context understanding and can be fine-tuned for specific tasks or domains, handling more complex input and potentially providing more accurate and comprehensive NER results at a higher computational cost.

Cross-validation can be effectively applied in NLP projects by partitioning the dataset into multiple subsets, where each subset in turn serves as a validation set while the remaining data forms the training set. This technique provides a robust measure of a model's predictive power and variance across different data samples. It is particularly useful for small datasets, as it makes maximal use of limited data and provides a less biased evaluation. By averaging performance across all iterations, cross-validation helps ensure that the model generalizes to unseen data rather than overfitting to one particular split.

A novice NLP practitioner should consider starting with simple models in scenarios where the dataset is relatively small or when the task can be effectively solved using rule-based or basic statistical methods. This approach allows the practitioner to establish a performance baseline, understand the data and its limitations better, and iteratively build complexity into the solution. Starting simple also helps develop an understanding of foundational NLP concepts before diving into computationally expensive and conceptually complex Transformer models. It is a strategic way to use resources efficiently and avoid overfitting.

When preparing text data for topic modeling, it is generally recommended to lowercase the text for consistency, remove stopwords to avoid common but non-informative words, and consider domain-specific preprocessing such as handling hashtags in social media data. Lemmatization is preferred over stemming to preserve word meanings, and tokens specific to the domain being modeled should be retained to improve the quality of topic extraction.

For text classification tasks, metrics such as accuracy, F1-score, precision, and recall are crucial. Using multiple metrics is essential because relying on a single evaluation metric may not give a complete picture of the model's performance. For instance, high accuracy can be achieved on an imbalanced dataset simply by predicting the majority class, while precision and recall reveal the shortcomings of such predictions. The F1-score, which balances precision and recall, provides a more nuanced picture of effectiveness, especially when class imbalance is a concern.

Lemmatization is often preferred over stemming when the preservation of word meaning is important. Unlike stemming, which merely cuts off word suffixes to produce a base form, lemmatization considers the context and reduces words to their root form, or lemma, ensuring the base form is a valid word. This helps maintain the semantic meaning that is crucial for tasks like topic modeling and information retrieval.

Contextual embeddings, unlike traditional word embeddings such as Word2Vec or GloVe, capture the meaning of words within the context in which they appear, so a word can have different embeddings depending on its surrounding words. They are particularly useful for advanced tasks requiring a deep understanding of context, such as named entity recognition, sentiment analysis, and sarcasm detection, where the same word can change meaning based on its context.
