Understanding Natural Language Processing

Natural Language Processing (NLP) is a field that enables computers to understand and interact with human languages through various components like Natural Language Understanding (NLU) and Natural Language Generation (NLG). The document outlines the history of NLP, key challenges such as ambiguity and language differences, and the importance of grammar and syntax in processing languages. It also highlights the unique challenges of NLP for Indian languages and mentions Python libraries that can be used for text processing in these languages.

Natural Language Processing

1. Natural Language Processing – Introduction

➢ Humans communicate through some form of language, either by text or speech.


➢ To make interaction between computers and humans possible, computers need to
understand the natural languages used by humans.
➢ Natural language processing is all about making computers learn, understand, analyze,
manipulate and interpret natural (human) languages.
➢ NLP stands for Natural Language Processing, a field at the intersection of Computer
Science, Linguistics (human languages), and Artificial Intelligence.
➢ Processing of natural language is required when you want an intelligent system like a
robot to perform as per your instructions, when you want to hear a decision from a
dialogue-based clinical expert system, and so on.
➢ The ability of machines to interpret human language is now at the core of many
applications that we use every day - chatbots, Email classification and spam filters,
search engines, grammar checkers, voice assistants, and social language translators.
➢ The input and output of an NLP system can be Speech or Written Text.

Definition:
Natural language processing (NLP) is a field of computer science, artificial intelligence,
and linguistics concerned with the interactions between computers and human (natural)
languages. Specifically, it is the process of a computer extracting meaningful information
from natural language input and/or producing natural language output.

Components of NLP
There are two components of NLP, as given below −

Natural Language Understanding (NLU)


Understanding involves the following tasks −
● Mapping the given input in natural language into useful representations.
● Analyzing different aspects of the language.
Natural Language Generation (NLG)
It is the process of producing meaningful phrases and sentences in the form of natural
language from some internal representation.
It involves −
● Text planning − retrieving the relevant content from the knowledge base.
● Sentence planning − choosing the required words, forming meaningful
phrases, and setting the tone of the sentence.
● Text realization − mapping the sentence plan into sentence structure.
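The three NLG steps above can be sketched as a toy Python pipeline. This is only an illustration: the knowledge base, the plan format, and the sentence template are invented for this sketch, not a real NLG system.

```python
# A toy NLG pipeline mirroring the three steps above. The knowledge base,
# plan format, and template are invented for illustration.
def text_planning(knowledge_base, topic):
    # Text planning: retrieve the relevant content from the knowledge base.
    return knowledge_base[topic]

def sentence_planning(facts):
    # Sentence planning: choose words and form a simple subject-verb-object plan.
    return {"subject": facts["entity"], "verb": "is", "object": facts["property"]}

def text_realization(plan):
    # Text realization: map the sentence plan into a surface sentence.
    return f"{plan['subject']} {plan['verb']} {plan['object']}."

kb = {"weather": {"entity": "Today", "property": "sunny"}}
plan = sentence_planning(text_planning(kb, "weather"))
print(text_realization(plan))  # Today is sunny.
```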
History of NLP
First Phase (Machine Translation Phase) - Late 1940s to late 1960s
● The work done in this phase focused mainly on machine translation (MT). This
phase was a period of enthusiasm and optimism

Let us now see all that the first phase had in it −


● The research on NLP started in early 1950s after Booth & Richens’ investigation
and Weaver’s memorandum on machine translation in 1949.
● 1954 was the year when a limited experiment on automatic translation from
Russian to English was demonstrated in the Georgetown-IBM experiment.
● In the same year, the publication of the journal MT (Machine Translation)
started.
● The first international conference on Machine Translation (MT) was held in 1952
and the second in 1956.
● In 1961, the work presented at the Teddington International Conference on Machine
Translation of Languages and Applied Language Analysis was the high point of
this phase.

Second Phase (AI Influenced Phase) – Late 1960s to late 1970s


In this phase, the work done was majorly related to world knowledge and on its role in
the construction and manipulation of meaning representations. That is why, this phase
is also called AI-flavored phase.

This phase included the following −


● In early 1961, work began on the problems of addressing and constructing data
or knowledge bases. This work was influenced by AI.
● In the same year, the BASEBALL question-answering system was developed. The
input to this system was restricted, and the language processing involved was a
simple one.
● A much more advanced system was described in Minsky (1968). Compared to the
BASEBALL question-answering system, it recognized and provided for the need
for inference on the knowledge base in interpreting and responding to language
input.

Third Phase (Grammatico-logical Phase) – Late 1970s to late 1980s


This phase can be described as the grammatico-logical phase. Due to the failure of
practical system building in the last phase, the researchers moved towards the use of
logic for knowledge representation and reasoning in AI.

The third phase had the following in it −


● The grammatico-logical approach, towards the end of the decade, helped us with
powerful general-purpose sentence processors like SRI’s Core Language Engine
and Discourse Representation Theory, which offered a means of tackling more
extended discourse.
● In this phase we got some practical resources & tools like parsers, e.g. Alvey
Natural Language Tools along with more operational and commercial systems,
e.g. for database query.
● The work on the lexicon in the 1980s also pointed in the direction of the
grammatico-logical approach.

Fourth Phase (Lexical & Corpus Phase) – The 1990s


We can describe this as the lexical & corpus phase. It featured a lexicalized approach
to grammar that appeared in the late 1980s and became an increasing influence. There
was a revolution in natural language processing in this decade with the introduction of
machine learning algorithms for language processing.

Steps in Natural Language Processing

1) Tokenization
2) Stemming
3) Lemmatization
4) POS tagging
5) Named Entity Recognition
6) Chunking
Tokenization
● Cutting a big sentence into small tokens.
● Example: "Welcome to Last moment tuitions" will be divided into the tokens
[welcome] [to] [last] [moment] [tuitions]
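The splitting step can be sketched with the standard library. This is a toy regex rule only; production systems use trained tokenizers such as NLTK's `word_tokenize` or spaCy's.

```python
import re

def tokenize(text):
    # Lowercase the text and pull out runs of letters/digits as word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Welcome to Last moment tuitions"))
# ['welcome', 'to', 'last', 'moment', 'tuitions']
```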

Stemming
● Normalizes words into their base or root forms.
● waits, waited, waiting -> wait (in this example the root word is "wait",
so in stemming we cut off the remaining part to find the root word)
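The cut-off-the-suffix idea can be sketched as follows. Real stemmers implement algorithms such as Porter's (available as NLTK's `PorterStemmer`); the suffix list and length check here are deliberate simplifications.

```python
def stem(word):
    # Toy suffix-stripping stemmer: drop a known suffix if enough of the
    # word remains. Real systems use e.g. the Porter algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["waits", "waited", "waiting"]])
# ['wait', 'wait', 'wait']
```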
Lemmatization
● Grouping together the different inflected forms of a word is called lemmatization.
● It is similar to stemming, as it maps several words onto one common root.
● The output of lemmatization is a proper word.
● Example: gone, going and went -> go
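Because lemmatization must output a proper word, it typically relies on a vocabulary lookup rather than suffix chopping. A dictionary-based sketch follows; the tiny LEMMAS table is invented for illustration, whereas a real lemmatizer such as NLTK's `WordNetLemmatizer` uses a full vocabulary plus morphological rules.

```python
# Minimal dictionary-based lemmatizer sketch; the lemma table is invented.
LEMMAS = {"gone": "go", "going": "go", "went": "go", "better": "good"}

def lemmatize(word):
    # Look the word up; fall back to the lowercased word itself.
    return LEMMAS.get(word.lower(), word.lower())

print([lemmatize(w) for w in ["Gone", "going", "went"]])
# ['go', 'go', 'go']
```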

POS Tags
● POS stands for parts-of-speech tags.
● A POS tag indicates how a word functions in meaning as well as grammatically
within a sentence.

● A problem with POS tagging is that the same word can have different roles: in
"google something on the internet", we know Google is a proper noun, but here
it is used as a verb.
● To solve this problem we use named entity recognition.
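The "google as a verb" problem can be illustrated with a toy tagger that combines a lexicon lookup with one context rule. Real taggers such as `nltk.pos_tag` are statistically trained; this lexicon and rule are invented for illustration only.

```python
# Toy POS tagger: dictionary lookup plus one context rule.
LEXICON = {"i": "PRON", "will": "VERB", "google": "PROPN", "it": "PRON",
           "on": "ADP", "the": "DET", "internet": "NOUN"}

def tag(tokens):
    tags = []
    for i, tok in enumerate(tokens):
        t = LEXICON.get(tok.lower(), "NOUN")
        # Context rule: a word right after the modal "will" acts as a verb,
        # so "google" in "I will google it" is tagged VERB, not PROPN.
        if i > 0 and tokens[i - 1].lower() == "will":
            t = "VERB"
        tags.append((tok, t))
    return tags

print(tag(["I", "will", "google", "it"]))
# [('I', 'PRON'), ('will', 'VERB'), ('google', 'VERB'), ('it', 'PRON')]
```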
Named Entity Recognition
● It is the process of identifying named entities such as persons, organizations,
locations, quantities, etc.

● In a sentence like "Apple CEO Tim Cook spoke at the Flint Center in Cupertino",
Apple is an organization, Tim Cook is a person, and the Flint Center in Cupertino
is a location. With NER we get exact information, e.g. that Apple is an
organization, not a fruit.
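A gazetteer (list-lookup) sketch of NER, using the entities from this example. Real NER systems such as spaCy's use trained sequence models; the gazetteer here is invented for illustration.

```python
# Gazetteer-based NER sketch: look up known names in the text.
GAZETTEER = {
    "apple": "ORGANIZATION",
    "tim cook": "PERSON",
    "flint center": "LOCATION",
    "cupertino": "LOCATION",
}

def find_entities(text):
    # Return every gazetteer entry that occurs in the lowercased text.
    text = text.lower()
    return [(name, label) for name, label in GAZETTEER.items() if name in text]

ents = find_entities("Apple CEO Tim Cook spoke at the Flint Center in Cupertino")
print(sorted(ents))
```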

Chunking
● Picking individual pieces of information and grouping them into bigger pieces.

● This helps in extracting insights and meaningful information from the text.
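A toy noun-phrase chunker that groups a determiner/adjective/noun run into one chunk; the tag set and grouping rule are simplified for illustration (NLTK's `RegexpParser` does this over real POS tags).

```python
# Chunking sketch: merge consecutive DET/ADJ/NOUN tokens into one
# noun-phrase chunk. Tags are assumed to come from a POS tagger.
def chunk_np(tagged):
    chunks, current = [], []
    for word, tag in tagged:
        if tag in ("DET", "ADJ", "NOUN"):
            current.append(word)
        else:
            if current:
                chunks.append(" ".join(current))
                current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tagged = [("the", "DET"), ("little", "ADJ"), ("dog", "NOUN"),
          ("barked", "VERB"), ("at", "ADP"), ("the", "DET"), ("cat", "NOUN")]
print(chunk_np(tagged))  # ['the little dog', 'the cat']
```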
Knowledge Required in NLP

A natural language understanding system must have knowledge about what the words
mean, how words combine to form sentences, how word meanings combine to form
sentence meanings and so on. The different forms of knowledge required for natural
language understanding are given below.

● Phonetic And Phonological Knowledge


● Morphological Knowledge
● Syntactic Knowledge
● Semantic Knowledge
● Pragmatic Knowledge
● Discourse Knowledge
● World Knowledge

Phonetic And Phonological Knowledge


1. Phonetics is the study of language at the level of sounds, while phonology is the
study of the combination of sounds into organized units of speech.
2. Phonetic and phonological knowledge is essential for speech-based systems, as
it deals with how words are related to the sounds that realize them.

Morphological Knowledge
1. Morphology concerns word formation.
2. It is a study of the patterns of formation of words by the combination of sounds
into minimal distinctive units of meaning called morphemes
3. Morphological knowledge concerns how words are constructed from morphemes.

SYNTACTIC KNOWLEDGE
1. Syntax is the level at which we study how words combine to form phrases,
phrases combine to form clauses and clauses join to make sentences.
2. Syntactic analysis concerns sentence formation.
3. It deals with how words can be put together to form correct sentences
Semantic Knowledge
1. It concerns the meanings of the words and sentences.
2. Defining the meaning of a sentence is very difficult due to the ambiguities involved.

Pragmatic Knowledge
1. Pragmatics is the extension of the meaning or semantics.
2. Pragmatics deals with the contextual aspects of meaning in particular situations
3. It concerns how sentences are used in different situations

Discourse Knowledge
1. Discourse concerns connected sentences. It is a study of chunks of language which are
bigger than a single sentence.
2. Discourse knowledge concerns inter-sentential links, that is, how the immediately
preceding sentences affect the interpretation of the next sentence.
3. Discourse knowledge is important for interpreting pronouns and temporal aspects of
the information conveyed.

World Knowledge
1. World knowledge is nothing but the everyday knowledge that all speakers share
about the world.
2. It includes the general knowledge about the structure of the world and what each
language user must know about the other user’s beliefs and goals.
3. This is essential to make the language understanding much better.
Challenges in NLP
1. Ambiguity in Natural Language Processing

● Ambiguity, as used in natural language processing, refers to the capability of
being understood in more than one way.
● Natural language is very ambiguous.

NLP has the following types of ambiguities −

Lexical Ambiguity
The ambiguity of a single word is called lexical ambiguity. For example, the
word silver can be treated as a noun, an adjective, or a verb.
● She won two silver medals
● She made a silver speech
● His worries have silvered his hair

Syntactic Ambiguity
● This kind of ambiguity occurs when a sentence is parsed in different ways.
● For example, the sentence “The man saw the girl with the telescope”. It is
ambiguous whether the man saw the girl carrying a telescope or he saw her
through his telescope.

Semantic Ambiguity
● This kind of ambiguity occurs when the meaning of a sentence can be
misinterpreted even after the syntax and the meanings of the individual words
have been resolved.
● In other words, semantic ambiguity happens when a sentence contains an
ambiguous word or phrase.
Example 1
● “Seema loves her mother and Shreya does too.”
● Here there are two meanings: “Seema loves her mother and Shreya loves her own
mother” and “Seema loves her mother and Shreya also loves Seema’s mother”.
Example 2
“The car hit the pole while it was moving” has semantic ambiguity because the
interpretations can be “The car, while moving, hit the pole” and “The car hit the pole
while the pole was moving”.

Anaphoric Ambiguity
● This kind of ambiguity arises due to the use of anaphoric entities in discourse.
● Anaphora: the use of an expression (often a pronoun) that refers back to an
entity mentioned earlier in the discourse.
● Example of anaphora: in “my mother liked the house very much but she
couldn’t buy it”, we do not repeat “mother”; we replace it with “she”, so here
“she” is an anaphor.
● Anaphoric ambiguity arises when such a reference can point to more than one
entity.
● For example: “The horse ran up the hill. It was very steep. It soon got tired.” Here,
the anaphoric reference of “it” in the two situations can be either the horse or the
hill, which causes anaphoric ambiguity.

Pragmatic ambiguity
● It occurs when the context gives a sentence multiple interpretations, or when the
sentence is not specific.
● For example, the sentence “I like you too” can have multiple interpretations: I
like you (just like you like me), or I like you (just like someone else does).

2) Language Differences: If we speak English and we are thinking of reaching an


international and/or multicultural audience, we shall need to provide support for multiple
languages.

Different languages have not only vastly different sets of vocabulary, but also different types
of phrasing, different modes of inflection and different cultural expectations. We shall need
to spend time retraining our NLP system for each new language.

3) Training Data: NLP is all about analysing language to better understand it. To
become fluent in a language, one must spend years reading, listening to, and using it
constantly.
The abilities of an NLP system depend on the training data provided to it.
If questionable data is fed to the system, it is going to learn the wrong things, or learn in
an inefficient way.

4) Phrasing Ambiguities: Sometimes, it is hard even for another human being to parse
out what someone means when they say something ambiguous. There may not be a clear,
concise meaning to be found in a strict analysis of their words.
In order to resolve this, an NLP system must be able to seek context that can help it
understand the phrasing. It may also need to ask the user for clarity.

5) Misspelling: Misspellings are a simple problem for human beings, but for a machine,
misspellings can be harder to identify.
One should use an NLP tool with capabilities to recognise common misspellings of words,
and move beyond them.

6) Words with Multiple Meanings: Most languages have words that can have
multiple meanings, depending on the context. For example, a user who asks "how are
you" as a greeting has a totally different goal from a user who asks the same words as a
literal question about wellbeing.

Good NLP tools should be able to differentiate between these phrases with the
help of context.

Language and Grammar

Grammar in NLP is a set of rules for constructing sentences in a language, used to


understand and analyze the structure of sentences in text data. This includes identifying
parts of speech such as nouns, verbs, and adjectives, determining the subject and predicate
of a sentence, and identifying the relationships between words and phrases.

Grammar is defined as the rules for forming well-structured sentences. Grammar also
plays an essential role in describing the syntactic structure of well-formed programs, just
as it denotes the syntactical rules used for conversation in natural languages.

Each natural language has an underlying structure usually referred to under Syntax. The
fundamental idea of syntax is that words group together to form the constituents like
groups of words or phrases which behave as a single unit. These constituents can combine
to form bigger constituents and, eventually, sentences.

• Syntax describes the regularity and productivity of a language, making explicit the
structure of sentences. The goal of syntactic analysis, or parsing, is to detect whether a
sentence is correct and to provide its syntactic structure.

Syntax also refers to the way words are arranged together. Let us see some basic ideas
related to syntax:

• Constituency: Groups of words may behave as a single unit or phrase, called a


constituent - for example, a noun phrase.
• Grammatical relations: These are the formalization of ideas from traditional
grammar. Examples include subjects and objects.
• Subcategorization and dependency relations: These are the relations between
words and phrases - for example, a verb followed by an infinitive verb.
• Regular languages and parts of speech: Regular languages can describe the way
words are arranged together, but cannot easily capture constituency, grammatical
relations, or subcategorization and dependency relations.
• Syntactic categories and their common denotations in NLP: np - noun
phrase, vp - verb phrase, s - sentence, det - determiner (article), n - noun, tv -
transitive verb (takes an object), iv - intransitive verb, prep - preposition, pp -
prepositional phrase, adj - adjective
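The constituency idea over the categories just listed (s, np, vp, det, n, tv, iv) can be sketched as a tiny recursive-descent recognizer; the grammar rules and the lexicon are invented for illustration and cover only a handful of words.

```python
# Toy grammar: s -> np vp ; np -> det n ; vp -> iv | tv np.
# Lexicon maps words to the categories listed above.
LEXICON = {"the": "det", "a": "det", "dog": "n", "cat": "n",
           "sleeps": "iv", "chases": "tv"}

def parse_np(tokens):
    # np -> det n ; return number of tokens consumed (0 = no parse).
    if len(tokens) >= 2 and LEXICON.get(tokens[0]) == "det" \
            and LEXICON.get(tokens[1]) == "n":
        return 2
    return 0

def parse_vp(tokens):
    # vp -> iv | tv np
    if tokens and LEXICON.get(tokens[0]) == "iv":
        return 1
    if tokens and LEXICON.get(tokens[0]) == "tv":
        used = parse_np(tokens[1:])
        if used:
            return 1 + used
    return 0

def is_sentence(tokens):
    # s -> np vp, consuming all tokens.
    np_len = parse_np(tokens)
    if not np_len:
        return False
    vp_len = parse_vp(tokens[np_len:])
    return vp_len > 0 and np_len + vp_len == len(tokens)

print(is_sentence("the dog chases a cat".split()))  # True
print(is_sentence("dog the sleeps".split()))        # False
```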
Processing Indian Languages

Natural Language Processing (NLP) for Indian languages presents unique challenges
due to the diverse linguistic features, scripts, and grammatical structures of
these languages. India has 22 official languages and over 1600 dialects, making NLP
for Indian languages a complex but essential task.

• iNLTK - Hindi, Punjabi, Sanskrit, Gujarati, Kannada, Malayalam, Nepali, Odia,


Marathi, Bengali, Tamil, Urdu
• Indic NLP Library - Assamese, Sindhi, Sinhala, Sanskrit, Konkani, Kannada,
Telugu, and others
• StanfordNLP - Many of the above languages

Text Processing for Indian Languages using Python

There are a handful of Python libraries we can use to perform text processing and build
NLP applications for Indian languages.

1. iNLTK (Natural Language Toolkit for Indic Languages)

As the name suggests, the iNLTK library is the Indian language equivalent of
the popular NLTK Python package. This library is built with the goal of
providing features that an NLP application developer will need.

iNLTK currently supports 12 languages.


2. Indic NLP Library

I find the Indic NLP Library quite useful for performing advanced text processing tasks
for Indian languages. While iNLTK is targeted at developers working with vernacular
languages, this library is aimed at researchers working in this area.

According to the official documentation, Indic NLP provides the following set of
functionalities:

• Text Normalization – The process of converting text into a standard form by removing
inconsistencies like punctuation, casing, and special characters.

• Script Information – The identification of the writing system of a language, including character
set and script rules.

• Tokenization – The process of breaking text into meaningful units such as words, phrases, or
sentences.

• Word Segmentation – The task of identifying word boundaries in a text, especially in languages
where words are not separated by spaces.

• Script Conversion – The transformation of text from one script to another while maintaining the
original meaning.

• Romanization – The representation of words from non-Latin scripts using the Latin alphabet.

• Indicization – The adaptation of foreign words to Indian languages by modifying them based on
phonetic and grammatical rules.

• Transliteration – The conversion of text from one script to another while preserving its
pronunciation.

• Translation – The process of converting text from one language to another while maintaining its
meaning
As you can see, the Indic NLP Library supports a few more languages than iNLTK,
including Konkani, Sindhi, Telugu, etc.

3) StanfordNLP is an NLP library from Stanford's research group on natural
language processing.

The most striking feature of this library is that it supports around 53 human
languages for text processing!

StanfordNLP comes with built-in processors to perform five basic NLP tasks:

• Tokenization – The process of splitting text into individual words or sentences for
further analysis.

• Multi-Word Token Expansion – The identification and expansion of multi-word
expressions into their meaningful components.

• Lemmatization – The process of reducing a word to its base or dictionary form while
preserving its meaning.

• Parts of Speech (POS) Tagging – The classification of words in a sentence based on
their grammatical role, such as noun, verb, or adjective.

• Dependency Parsing – The analysis of grammatical relationships between words in a
sentence to determine their syntactic structure.
Applications of Natural Language Processing

● Machine Translation
● Sentiment Analysis
● Automatic Summarization
● Question Answering
● Speech Recognition

Machine Translation

● Machine translation (MT), the process of translating one source language or text
into another language, is one of the most important applications of NLP.
● There are different types of machine translation systems. Let us see what the
different types are.
● Bilingual MT systems produce translations between two particular languages.
● Multilingual MT systems produce translations between any pair of languages.
They may be either unidirectional or bi-directional in nature.
● Example : Google translator

Sentiment Analysis

● Another important application of natural language processing (NLP) is
sentiment analysis.
● As the name suggests, sentiment analysis is used to identify the sentiments
expressed in a collection of posts.
● It is also used to identify the sentiment where the emotions are not expressed
explicitly.
● Companies are using sentiment analysis, an application of natural language
processing (NLP) to identify the opinion and sentiment of their customers
online.
● It will help companies to understand what their customers think about the
products and services.
● Companies can judge their overall reputation from customer posts with the help
of sentiment analysis.
● In this way, we can say that beyond determining simple polarity, sentiment
analysis understands sentiments in context to help us better understand what is
behind the expressed opinion.

Automatic Summarization
● In this digital era, the most valuable thing is data, or you can say, information.
● However, do we really get the useful and required amount of information? The
answer is ‘NO’, because information is overloaded and our access to knowledge
and information far exceeds our capacity to understand it.
● We are in serious need of automatic text summarization because the flood of
information over the internet is not going to stop.
● Text summarization may be defined as the technique of creating a short, accurate
summary of longer text documents.
● Automatic text summarization will help us obtain relevant information in less
time. Natural language processing (NLP) plays an important role in developing
automatic text summarization.

Question-answering
● Another main application of natural language processing (NLP) is
question-answering. Search engines put the information of the world at our
fingertips, but they are still lacking when it comes to answering the questions
posted by human beings in their natural language.
● Big tech companies like Google are also working in this direction.
● Question-answering is a Computer Science discipline within the fields of AI and
NLP.
● It focuses on building systems that automatically answer questions posted by
human beings in their natural language.
● A computer system that understands natural language has the capability to
translate the sentences written by humans into an internal representation so
that valid answers can be generated by the system.
● The exact answers can be generated by doing syntax and semantic analysis of the
questions. Lexical gap, ambiguity and multilingualism are some of the challenges
for NLP in building a good question answering system.
Speech recognition

● Speech recognition enables computers to recognize and transform spoken
language into text (dictation) and, if programmed, act upon that recognition,
e.g. in the case of assistants like Google Assistant, Cortana, or Apple’s Siri.
● Speech recognition is simply the ability of software to recognise speech.
● Anything that a person says, in a language of their choice, must be recognised by
the software.
● Speech recognition technology can be used to perform an action based on the
instructions defined by the human.
● Humans need to train the speech recognition system by storing speech patterns
and vocabulary of their language into the system.
● By doing so, they can essentially train the system to understand them when they
speak.
● Speech recognition and Natural Language processing are usually used together in
Automatic Speech Recognition engines, Voice Assistants and Speech analytics
tools.

Language Modeling

● The goal of a language model is to assign a probability to a sentence.


● There are many applications of language modeling, like machine translation,
spell correction, speech recognition, summarization, question answering,
sentiment analysis, etc. Each of these tasks requires the use of a language model.
● Language models are required to represent text in a form that is understandable
from the machine's point of view.
● A language model helps to choose the right word that fits in a sentence and
makes correct meaning out of it.

Some Applications of Language Models


Machine translation: when translating a sentence about height, a model would state
that P(tall man) > P(large man), as ‘large’ might also refer to weight or general
appearance and is thus not as probable as ‘tall’.

Spelling correction: correcting the sentence “Put you name into form”, so that
P(name into form) > P(name into from).

Speech recognition: “Call my nurse”:
P(Call my nurse.) ≫ P(coal miners)
Another example: “I have no idea.”:
P(no idea.) ≫ P(No eye deer.)

Summarization, question answering, sentiment analysis, etc.

Working of Language Model


Part 1: Defining Language Models
The goal of probabilistic language modelling is to calculate the probability of a sentence
as a sequence of words:

P(W) = P(w1, w2, w3, ..., wn)

and can be used to find the probability of the next word in the sequence:

P(wn | w1, w2, ..., wn-1)

A model that computes either of these is called a Language Model.

Initial Method for Calculating Probabilities


Definition: Conditional Probability
Let A and B be two events with P(B) ≠ 0; the conditional probability of A given B is:

P(A | B) = P(A, B) / P(B)

Applying this repeatedly (the chain rule) expands the probability of a sentence:

P(w1, w2, ..., wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) ... P(wn | w1, ..., wn-1)

This is a lot to calculate; could we not simply estimate it by counting and dividing, as in
the following formula:

P(wn | w1, ..., wn-1) = count(w1, ..., wn-1, wn) / count(w1, ..., wn-1)

In practice there are far too many possible word histories for these counts to be
reliable, which motivates the Markov assumption.
Methods using the Markov Assumption


Definition: Markov Property
● A stochastic process has the Markov property if the conditional probability
distribution of future states of the process (conditional on both past and present
states) depends only upon the present state, not on the sequence of events that
preceded it. A process with this property is called a Markov process.
● In other words, the probability of the next word can be estimated given only the
previous k words: P(wn | w1, ..., wn-1) ≈ P(wn | wn-k, ..., wn-1).
N-gram Models
From the Markov assumption, we can formally define N-gram models, where k = n-1, as
the following:

P(wi | w1, ..., wi-1) ≈ P(wi | wi-n+1, ..., wi-1)
N-gram can be defined as the contiguous sequence of n items from a given sample of
text or speech. The items can be letters, words, or base pairs according to the
application. The N-grams typically are collected from a text or speech corpus (A long
text dataset).
For instance, N-grams can be unigrams like (“This”, “article”, “is”, “on”, “NLP”) or
bigrams (“This article”, “article is”, “is on”, “on NLP”).
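Extracting n-grams from a token list is a one-line sliding window; this sketch reproduces the unigram/bigram examples above.

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This article is on NLP".split()
print(ngrams(tokens, 1))  # unigrams, as 1-tuples
print(ngrams(tokens, 2))
# [('This', 'article'), ('article', 'is'), ('is', 'on'), ('on', 'NLP')]
```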

Unigram Model (n=1): A unigram is a single word from a sequence. It represents
individual words without considering the context of neighboring words.
• Example sentence: "Natural Language Processing is amazing"
• Unigrams: ["Natural", "Language", "Processing", "is", "amazing"]

Bigram Model (n=2):

A bigram is a sequence of two consecutive words from a given text. It helps
capture some context by considering word pairs.

Paninian Framework

It was first thought that machine translation (MT) from one language to another would
be an easy matter once one had compiled dictionaries and obtained a mathematical
representation of the grammars of the languages in question. It was believed that actual
translation would proceed by replacing the words of the source language with their
target-language equivalents, and then rearranging and modifying these new words
according to the grammar of the target language. But it was soon found that this task
was not easy, as a word can have several equivalents, and the correct one can be decided
only by context. The mathematical representation of the grammar of a natural language
was also given up as an intractable problem.

Panini’s grammar, the Astadhyayi (Eight Chapters), deals ostensibly with the Sanskrit
language; however, it represents the framework for a universal grammar that may (and
possibly does) apply to any language. The book consists of a little over 4000 rules and
aphorisms. Panini’s grammar attempts to completely describe Sanskrit, the spoken
language of its time. The grammar begins with meta-rules, or rules about rules. To
facilitate his description, Panini establishes a special language, or meta-language. This is
followed by several sections on how to generate words and sentences starting from
roots, as well as rules for the transformation of structure. The last part of the grammar
is a one-directional string of rules, where a given rule in the sequence ignores all the
rules that follow. Panini also uses recursion by allowing elements of earlier rules to
recur in later rules. This anticipates, in form, the rewrite rules of modern formal
grammars.

A "Paninian framework" in NLP refers to a linguistic approach based on the grammar
system developed by the ancient Indian grammarian Panini. It utilizes the concept
of "karakas" (semantic roles) to analyze sentence structure and meaning, often focusing
on the relationships between words within a sentence rather than just their
grammatical categories, which makes it particularly useful for languages with complex
morphology like Sanskrit.

Common questions

Powered by AI

NLP for Indian languages faces challenges due to linguistic diversity, multiple scripts, and distinct grammatical structures. With 22 official languages and over 1600 dialects, developing NLP applications requires careful consideration of language-specific nuances. Libraries such as iNLTK and Indic NLP Library have been developed to address these challenges, providing tools for text normalization, script conversion, tokenization, and translation across various Indian languages. These libraries support language processing by offering features like script information and transliteration, aiding researchers and developers in handling the complex linguistic landscape of India .

NLP applications have evolved significantly across different phases, from early machine translation efforts to sophisticated AI-driven models today. Initial phases focused on rule-based systems and direct translation tasks, but as NLP progressed through the grammatico-logical and lexical phases, the incorporation of machine learning and statistical models allowed for handling more context and variability in language. Contemporary applications such as chatbots, speech recognition systems, and automated content summarization reflect this evolution by leveraging NLP techniques for improved user interaction and efficiencies, driven by advancements in data processing and computational power .

The first phase of NLP, known as the Machine Translation Phase (late 1940s to late 1960s), focused primarily on direct translation tasks, such as the Georgetown-IBM experiment in 1954, which involved translating Russian to English. This phase was characterized by enthusiasm but faced challenges, as translating required more than just word replacement; context and grammatical complexity were often neglected. The second phase, or the AI-Influenced Phase (late 1960s to late 1970s), integrated artificial intelligence concepts to address these shortcomings, emphasizing world knowledge and the construction of meaning representations. This phase saw developments like the BASEBALL question-answering system and Minsky's more advanced system, which recognized the need for inference on the knowledge base.

Syntax provides the framework for analyzing the arrangement and relationship of words in a sentence, which is crucial for constructing meaningful and grammatically correct sentences in natural language processing. Its components include constituents such as noun and verb phrases that behave as single units, grammatical relations such as subject and object, and subcategorization, which specifies the arrangement of words and phrases a given word expects. This syntactic knowledge ensures that sentences generated by NLP systems are structurally sound and grammatically well formed, aiding effective communication and understanding.
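The ideas above can be made concrete with a toy context-free grammar. The sketch below (an illustrative stand-in, not any particular parser library) encodes constituents (NP, VP) as units and a simple form of subcategorization: the verb "chased" expects an NP object, so a sentence missing the object is rejected.

```python
# Toy CFG recognizer: S -> NP VP, NP -> Det N, VP -> V NP.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"]],
    "N":   [["dog"], ["cat"]],
    "V":   [["chased"]],
}

def parse(symbol, words, pos):
    """Try to expand `symbol` at index `pos`; return the new index or None."""
    for production in GRAMMAR.get(symbol, []):
        i, ok = pos, True
        for part in production:
            if part in GRAMMAR:                    # non-terminal: recurse
                nxt = parse(part, words, i)
                if nxt is None:
                    ok = False
                    break
                i = nxt
            elif i < len(words) and words[i] == part:  # terminal: match word
                i += 1
            else:
                ok = False
                break
        if ok:
            return i
    return None

def is_grammatical(sentence):
    words = sentence.lower().split()
    return parse("S", words, 0) == len(words)

print(is_grammatical("the dog chased the cat"))   # True
print(is_grammatical("chased the dog the cat"))   # False
```

The first sentence parses because each constituent closes over a contiguous span; the second fails because no NP can start the sentence.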

The 1960s saw crucial breakthroughs in AI that significantly influenced NLP, primarily through integrating world knowledge for constructing meaning representations. This era, referred to as the AI-flavored phase, saw the development of systems like the BASEBALL question-answering system and Minsky's advanced system, which highlighted the need for inference on a knowledge base. These systems demonstrated AI's role in enhancing language processing capabilities by providing more sophisticated context understanding and response generation, which were pivotal in advancing NLP techniques.

Named Entity Recognition (NER) extends beyond basic Parts of Speech (POS) tagging by identifying and categorizing words according to specific categories such as person names, organizations, locations, and other significant entities. While POS tagging determines the grammatical category of words (e.g., noun, verb), NER provides contextual information, like distinguishing between 'Apple' as a company or a fruit. This differentiation is crucial in applications that require understanding the precise context and meaning of words, such as customer feedback analysis, information extraction, and automated assistants.
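The 'Apple' example can be sketched with a deliberately simple gazetteer-plus-context rule. Real NER systems are statistical or neural and learn these distinctions from data; the word lists and the context rule below are invented for illustration only.

```python
# Toy NER sketch: a gazetteer of known entities plus one context rule
# that disambiguates "apple" (ORG vs. ordinary noun / fruit).

GAZETTEER = {"apple": "ORG", "google": "ORG", "india": "LOC", "panini": "PERSON"}
ORG_CONTEXT = {"announced", "acquired", "hired", "shares"}   # business cues

def tag_entities(sentence):
    tokens = sentence.split()
    tagged = []
    for tok in tokens:
        word = tok.strip(".,").lower()
        label = "O"                       # O = outside any named entity
        if word in GAZETTEER:
            label = GAZETTEER[word]
            if word == "apple":
                # "apple" counts as ORG only near business-flavored words.
                window = {t.strip(".,").lower() for t in tokens}
                if not window & ORG_CONTEXT:
                    label = "O"           # probably the fruit
        tagged.append((tok, label))
    return tagged

print(tag_entities("Apple announced new shares"))
print(tag_entities("I ate an apple"))
```

A POS tagger would label "apple" as a noun in both sentences; only the entity tagger separates the two readings.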

The Markov property aids in constructing N-gram models by assuming that the probability of a word depends only on a finite number of preceding words, which simplifies the modeling of language sequences. N-gram models apply this property by creating contiguous sequences of 'n' items (words or tokens) from a text corpus. These models are advantageous because they require less computational complexity while still capturing enough context to predict the likelihood of subsequent words, making them useful in applications like text prediction and autocomplete features.
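A minimal bigram (n = 2) model makes the Markov assumption concrete: the next word is predicted from the single preceding word. The toy corpus below is invented for illustration; a real model would be trained on a large corpus and use smoothing for unseen pairs.

```python
# Bigram model sketch: count word pairs, then estimate P(next | prev)
# by maximum likelihood under the first-order Markov assumption.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat ran on the road .".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def p_next(prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

def predict(prev):
    """Most likely next word after `prev` (autocomplete-style)."""
    return bigram_counts[prev].most_common(1)[0][0]

print(p_next("the", "cat"))   # "the" is followed by "cat" in 2 of 4 cases -> 0.5
print(predict("the"))         # "cat"
```

Extending to trigrams only means conditioning on the previous two words, at the cost of many more parameters to estimate.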

Natural Language Processing (NLP) consists of two main components: Natural Language Understanding (NLU) and Natural Language Generation (NLG). NLU involves mapping the given input in natural language into useful representations and analyzing different aspects of the language by interpreting its structure and meaning. In contrast, NLG is concerned with producing meaningful phrases and sentences in natural language from some internal representation, which involves text planning, choosing the required words, forming meaningful phrases, and mapping sentence plans into sentence structures.
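The NLG steps named above (text planning, word choice, sentence realization) can be caricatured as a tiny template pipeline. The internal representation, lexicon, and function names below are all invented for this sketch; real NLG systems are vastly more elaborate.

```python
# Toy NLG pipeline: internal representation -> plan -> lexical choice -> sentence.

def plan(internal):
    """Text planning: decide which facts to express and in what order."""
    return [("agent", internal["agent"]),
            ("action", internal["action"]),
            ("object", internal["object"])]

# Lexical choice: map abstract concepts to surface words.
LEXICON = {"purchase": "bought", "vehicle": "a car"}

def realize(plan_items):
    """Sentence realization: map the sentence plan into a surface string."""
    slots = dict(plan_items)
    verb = LEXICON.get(slots["action"], slots["action"])
    obj = LEXICON.get(slots["object"], slots["object"])
    return f"{slots['agent'].capitalize()} {verb} {obj}."

internal = {"agent": "ravi", "action": "purchase", "object": "vehicle"}
print(realize(plan(internal)))   # Ravi bought a car.
```

NLU would run roughly the reverse direction: from the surface sentence back to a structured representation like `internal`.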

The introduction of machine learning algorithms in the 1990s revolutionized the field of natural language processing by enabling more flexible and scalable approaches to understanding language patterns. These algorithms facilitated the lexical and corpus phase of NLP by allowing systems to learn from large datasets of text, improving their ability to process and generate human language. Machine learning models, supported by the increasing availability of computational resources, allowed for better handling of ambiguity and variability in language, making NLP applications like sentiment analysis, machine translation, and predictive text much more accurate and efficient.

The grammatico-logical phase (late 1970s to late 1980s) marked a shift from previous NLP approaches primarily through the adoption of logic-based methods for knowledge representation and reasoning. This was a response to the failure of practical system building in the previous, AI-influenced phase. Researchers in this phase used logic to develop powerful sentence processors like SRI's Core Language Engine and tools for handling more complex discourse, illustrating a movement toward more generalized, formalized methods of understanding and generating language. These innovations marked a turning point from context-driven translation efforts to systems capable of engaging with extended discourse and complex grammatical constructions.
