Understanding Computer Vision Basics

Computer Vision is a subfield of AI that enables machines to interpret and understand visual data, drawing parallels to human vision. The field has evolved from early analog systems to complex tasks like object detection, facial recognition, and image segmentation, significantly aided by advancements in deep learning and datasets like ImageNet. Applications of computer vision span various industries, including augmented reality in e-commerce, education, gaming, travel, and healthcare, enhancing user experiences and operational efficiencies.


What is computer vision?

Computer Vision is a subfield of Deep Learning and Artificial Intelligence where humans
teach computers to see and interpret the world around them.
While humans and animals learn to see naturally from a very young age, teaching
machines to perceive and interpret their surroundings through vision remains a largely
unsolved problem.
Our limited understanding of human vision, combined with the infinitely varying scenery of our
dynamic world, is what makes machine vision complex at its core.

Human Vision System vs. Computer Vision System

A brief history of computer vision


Like all great things in the world of technology, computer vision started with a cat.
Two neurophysiologists, Hubel and Wiesel, placed a cat in a restricting harness and an
electrode in its visual cortex. The scientists showed the cat a series of images through a
projector, hoping the brain cells in its visual cortex would start firing.
When the images produced no response, the eureka moment came as a projector slide was
removed and a single horizontal line of light appeared on the wall:
neurons fired, emitting a crackling electrical noise.
The scientists had just realized that the early layers of the visual cortex respond to simple
shapes, like lines and curves, much like the early layers of a deep neural network.
They then used an oscilloscope to generate such lines and observe the brain’s reaction.
This experiment marks the beginning of our understanding of the interconnection between
computer vision and the human brain, which will be helpful for our understanding of artificial
neural networks.
However, before cat brains entered the scene, analog computer vision had already begun
in the 1950s at universities pioneering artificial intelligence.

Computer vision vs. human vision


The notion that machine vision must be derived from animal vision was predominant as
early as 1959, when the neurophysiologists mentioned above set out to understand cat
vision.
Since then, the history of computer vision has been dotted with milestones, driven by the rapid
development of image capture and scanning instruments alongside state-of-the-art
image processing algorithms.
The 1960s saw the emergence of AI as an academic field of study, followed by the
development of the first robust Optical Character Recognition (OCR) system in 1974.
By the 2000s, the focus of computer vision had shifted to much more complex topics,
including:
• Object identification
• Facial recognition
• Image segmentation
• Image classification
And more.
All of them have achieved commendable accuracy over the years.
The year 2010 saw the birth of the ImageNet dataset, with millions of labeled images freely
available for research. This led to the AlexNet architecture two years later,
one of the biggest breakthroughs in computer vision, cited over 82K times.

Image Processing as a Part of Computer Vision


Digital Image Processing, or Image Processing for short, is a subset of Computer Vision. It
deals with enhancing and understanding images through various algorithms.
More than just a subset, Image Processing forms the precursor of modern-day computer
vision, having driven the development of numerous rule-based and optimization-based
algorithms that brought machine vision to where it is today.
Image Processing may be defined as performing a set of operations on an image
to analyze and manipulate its contents or the underlying image data.
Now that you know the theory behind computer vision, let’s talk about its practical side.

How does computer vision work?


At its most basic level, computer vision works in three steps: acquiring an image, processing it, and understanding what it contains.

However,
while the three steps outlining the basics of computer vision seem easy, processing and
understanding an image via machine vision is quite difficult. Here’s why:
an image consists of many pixels, a pixel being the smallest unit into which the
image can be divided.
Computers process images in the form of an array of pixels, where each pixel has a set of
values, representing the presence and intensity of the three primary colors: red, green, and
blue.
All pixels come together to form a digital image.
The digital image, thus, becomes a matrix, and Computer Vision becomes a study of
matrices. While the simplest computer vision algorithms use linear algebra to manipulate
these matrices, complex applications involve operations like convolutions with learnable
kernels and downsampling via pooling.
Below is an example of how a computer “sees” a small image.

The values represent pixel intensities at particular coordinates in the image, with 255
representing pure white and 0 representing pure black.
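For instance, a tiny grayscale image can be stored as a NumPy matrix of intensity values; the numbers below are made up for illustration, not taken from a real image:

```python
import numpy as np

# A hypothetical 4x4 grayscale image: each entry is a pixel
# intensity, 0 = pure black, 255 = pure white.
image = np.array([
    [  0,  64, 128, 255],
    [ 32,  96, 160, 224],
    [ 64, 128, 192, 255],
    [  0,   0, 255, 255],
], dtype=np.uint8)

print(image.shape)   # dimensions of the matrix: (4, 4)
print(image[0, 3])   # the top-right pixel: 255 (white)

# A color image adds a third axis: one 0-255 value per
# red, green, and blue channel at every pixel.
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[..., 0] = 255    # a fully red 4x4 image
```

A real photograph works the same way, just with millions of such entries per channel.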
For larger images, the matrices are correspondingly larger.
While it is easy for us to grasp an image by looking at it, a peek at the raw pixel values
shows that the matrix alone gives us no obvious information about what the image depicts!
Therefore, the computer has to perform complex calculations on these matrices and
formulate relationships with neighboring pixel elements just to say that this image
represents a person’s face.
Developing algorithms to recognize complex patterns in images makes you appreciate
how remarkable our brains are to excel at pattern recognition so naturally.
Some operations commonly used in computer vision based on a Deep Learning
perspective include:
1. Convolution: Convolution in computer vision is an operation in which a learnable
kernel is “convolved” with the image. In other words, the kernel is slid across the
image pixel by pixel, and at every position an element-wise multiplication is
performed between the kernel and the underlying pixel group, and the results are summed.
2. Pooling: Pooling is an operation used to reduce the dimensions of an image by
operating at the pixel level. A pooling kernel slides across the image, and
only one value from each corresponding pixel group is kept for further processing,
thus reducing the image size (e.g., Max Pooling, Average Pooling).
3. Non-Linear Activations: Non-Linear activations introduce non-linearity to the
neural network, thereby allowing the stacking of multiple convolutions and pooling
blocks to increase model depth.
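The three operations above can be sketched in a few lines of NumPy; the 6x6 input and the averaging kernel below are made up for demonstration (real networks learn the kernel values during training):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution: slide the kernel over the image
    and sum the element-wise products at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(image, size=2):
    """Max pooling: keep only the largest value of each size x size block."""
    h, w = image.shape[0] // size, image.shape[1] // size
    return image[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

def relu(x):
    """A common non-linear activation: negatives become zero."""
    return np.maximum(x, 0)

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 input
kernel = np.ones((3, 3)) / 9.0                     # 3x3 averaging kernel
features = relu(max_pool(conv2d(image, kernel)))
print(features.shape)   # (2, 2): convolution gives 4x4, pooling halves it
```

Stacking many such convolution, pooling, and activation blocks is exactly what gives a deep convolutional network its depth.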

9 common computer vision tasks


In essence, computer vision tasks are about making computers understand digital images
as well as visual data from the real world. This can involve extracting, processing, and
analyzing information from such inputs to make decisions.
The evolution of machine vision saw the large-scale formalization of difficult problems into
popular solvable problem statements.
Division of topics into well-formed groups with proper nomenclature helped researchers
around the globe identify problems and work on them efficiently.
The most popular computer vision tasks that we regularly find in AI jargon include:
Image classification
Image classification is one of the most studied topics ever since the ImageNet dataset was
released in 2010.
Being the most popular computer vision task taken up by both beginners and experts,
image classification as a problem statement is quite simple.
Given a group of images, the task is to classify them into a set of predefined classes using
solely a set of sample images that have already been classified.
As opposed to complex topics like object detection and image segmentation, which have
to localize (or give positions for) the features they detect, image classification deals with
processing the entire image as a whole and assigning a specific label to it.
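The classification setup above can be illustrated with a toy nearest-neighbor classifier over raw pixel values; real systems learn features instead of comparing pixels, and all the data below is synthetic:

```python
import numpy as np

def nearest_neighbor_classify(train_images, train_labels, test_image):
    """Toy classifier: assign the label of the training image whose
    raw pixels are closest (Euclidean distance) to the test image."""
    distances = np.linalg.norm(
        train_images.reshape(len(train_images), -1)
        - test_image.reshape(1, -1), axis=1)
    return train_labels[np.argmin(distances)]

# Synthetic 8x8 "images": class 0 is dark, class 1 is bright.
rng = np.random.default_rng(0)
train = np.concatenate([rng.integers(0, 50, (10, 8, 8)),
                        rng.integers(200, 256, (10, 8, 8))]).astype(float)
labels = np.array([0]*10 + [1]*10)

bright_test = np.full((8, 8), 230.0)
print(nearest_neighbor_classify(train, labels, bright_test))  # 1
```

Note how the whole image maps to a single label, with no localization involved.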

Object detection
Object detection, as the name suggests, refers to detection and localization of objects
using bounding boxes.
Object detection looks for class-specific details in an image or a video and identifies them
whenever they appear. These classes can be cars, animals, humans, or anything on which
the detection model has been trained.
Earlier methods of object detection used Haar features, SIFT, and HOG features to
detect features in an image and classify them with classical machine learning
approaches.
Besides being time-consuming and often inaccurate, this process places severe
limits on the number of objects that can be detected.
Deep learning models like YOLO, R-CNN, and SSD, which use millions of parameters to
break through these limitations, are therefore popularly employed for this task.
Object detection is often accompanied by Object Recognition, also known as Object
Classification.
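The bounding boxes these detectors produce are typically compared and scored with Intersection over Union (IoU); a minimal sketch with made-up box coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two bounding boxes given as
    (x1, y1, x2, y2) corner coordinates."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two overlapping 10x10 boxes shifted by 5 pixels:
# intersection 25, union 175.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 0.142857...
```

A predicted box is usually counted as a correct detection only when its IoU with the ground-truth box exceeds a threshold such as 0.5.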

Image segmentation
Image segmentation is the division of an image into subparts or sub-objects to demonstrate
that the machine can discern an object from the background and/or another object in the
same image.
A “segment” of an image represents a particular class of object that the neural network has
identified in an image, represented by a pixel mask that can be used to extract it.
This popular domain of Computer Vision has been studied widely both with the use of
traditional image processing algorithms like watershed algorithms, clustering-based
segmentation and with the use of popular modern-day deep learning architectures like
PSPNet, FPN, U-Net, SegNet, etc.
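For intuition, the simplest possible segmentation, plain intensity thresholding, already yields the kind of pixel mask described above (a toy sketch, not one of the listed architectures):

```python
import numpy as np

def threshold_segment(image, cutoff=128):
    """Binary segmentation mask: True where the pixel is brighter
    than the cutoff (foreground), False elsewhere (background)."""
    return image > cutoff

# Toy image: dark background on the left, a bright "object" on the right.
image = np.array([
    [ 10,  20, 200, 210],
    [ 15,  25, 220, 230],
    [ 12,  18, 205, 215],
], dtype=np.uint8)

mask = threshold_segment(image)
print(mask.astype(int))   # right half is 1 (object), left half is 0
print(int(mask.sum()))    # 6 foreground pixels
```

Modern architectures predict such a mask per class, but the output format, one label per pixel, is the same.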

Face and person recognition


Facial Recognition is a subpart of object detection where the primary object being detected
is the human face.
While similar to object detection as a task, where features are detected and localized, facial
recognition performs not only detection, but also recognition of the detected face.
Facial recognition systems search for common features and landmarks like eyes, lips, or a
nose, and classify a face using these features and the positioning of these landmarks.
Traditional image processing methods for facial recognition include Haar Cascades,
which are easily accessible via the OpenCV library. More robust deep learning
based methods can be found in papers like FaceNet.

Edge detection
Edge detection is the task of detecting boundaries in objects.
It is algorithmically performed with the help of mathematical methods that help detect sharp
changes or discontinuities in the brightness of the image. Often used as a data pre-
processing step for many tasks, edge detection is primarily done by traditional image
processing-based algorithms like Canny Edge detection and by convolutions with specially
designed edge detection filters.
Furthermore, edges in an image carry important information about its contents,
which is why deep learning methods effectively perform edge detection internally,
capturing low-level features with the help of learnable kernels.
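As a sketch of convolution with a hand-designed edge filter, the Sobel horizontal-gradient kernel below responds strongly at a vertical boundary in a toy image:

```python
import numpy as np

# Sobel kernel that responds to horizontal changes in brightness
# (i.e., vertical edges).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def convolve_valid(image, kernel):
    """Valid-mode convolution by sliding the kernel over the image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Toy image: dark left half, bright right half -> one vertical edge.
image = np.zeros((5, 6))
image[:, 3:] = 255.0

edges = np.abs(convolve_valid(image, sobel_x))
print(edges)   # large responses only in the columns spanning the edge
```

Canny edge detection builds on exactly such gradient filters, adding smoothing, non-maximum suppression, and hysteresis thresholding on top.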

Image restoration
Image Restoration refers to the restoration or reconstruction of faded and old
hard-copy images that were captured or stored improperly, leading to a loss of
image quality.

Typical image restoration processes involve the reduction of additive noise via
mathematical tools, while at times, reconstruction requires major changes, leading to further
analysis and the use of image inpainting.
In image inpainting, damaged parts of an image are filled in with the help of generative models
that estimate what the image is trying to convey. Often the restoration process
is followed by colorization, which colors the subject of the picture (if black and white)
in the most realistic manner possible.

Feature matching
Features in computer vision are regions of an image that tell us the most about a particular
object in the image.
While edges are strong indicators of object detail and therefore important features, much
more localized and sharp details, like corners, also serve as features. Feature matching
relates the features of a region in one image to those of a similar region in another image.
The applications of feature matching are found in computer vision tasks like object
identification and camera calibration. The task of feature matching is generally performed
in the following order:
1. Detection of features: Detection of regions of interest is generally performed by
Image Processing algorithms like Harris Corner Detection, SIFT, and SURF.
2. Formation of local descriptors: After features are detected, the region
surrounding each keypoint is captured and the local descriptors of these regions of
interest are obtained. A local descriptor is the representation of a point’s local
neighborhood and thus can be helpful for feature matching.
3. Feature matching: The features and their local descriptors are matched in the
corresponding images to complete the feature matching step.
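The three steps above can be sketched as brute-force nearest-neighbor matching of descriptor vectors; the descriptors below are random stand-ins for real SIFT or SURF output:

```python
import numpy as np

def match_features(desc_a, desc_b):
    """Brute-force matching: for each descriptor in image A, find the
    index of the closest descriptor (Euclidean distance) in image B."""
    matches = []
    for i, d in enumerate(desc_a):
        distances = np.linalg.norm(desc_b - d, axis=1)
        matches.append((i, int(np.argmin(distances))))
    return matches

# Stand-in local descriptors (e.g., 128-dim SIFT vectors) for two images.
rng = np.random.default_rng(1)
desc_a = rng.normal(size=(5, 128))
# Image B contains the same keypoints, slightly perturbed and shuffled.
perm = [3, 0, 4, 1, 2]
desc_b = desc_a[perm] + rng.normal(scale=0.01, size=(5, 128))

print(match_features(desc_a, desc_b))
# Each keypoint in A matches its shuffled counterpart in B:
# [(0, 1), (1, 3), (2, 4), (3, 0), (4, 2)]
```

Production systems refine this with ratio tests and geometric verification, but the core idea, comparing descriptor vectors across images, is the same.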

Scene reconstruction
One of the most complex problems of computer vision, scene reconstruction is the digital
3D reconstruction of an object from one or more photographs.
Most algorithms in scene reconstruction roughly work by forming a point cloud at the
surface of the object and reconstructing a mesh from this point cloud.

Video motion analysis


Video motion analysis is a task in machine vision that refers to the study of moving objects
or animals and the trajectory of their bodies.
Motion analysis as a whole is a combination of many subtasks, particularly object detection,
tracking, segmentation, and pose estimation.
While human motion analysis is used in areas like sports, medicine, intelligent video
analytics, and physical therapy, motion analysis is also used in other areas like
manufacturing and to count and track microorganisms like bacteria and viruses.

Computer vision technology challenges


One of the biggest challenges in machine vision is our lack of understanding of how the
human brain and the human visual system work.
We develop an enhanced and complex sense of vision at a very young age,
yet we are unable to explain the process by which we understand what we see.
Furthermore, day-to-day tasks like crossing the street at a zebra crossing, pointing
at something in the sky, or checking the time on a clock require us to know enough
about the objects around us to understand our surroundings.
Such aspects are quite different from simple vision, but are largely inseparable from it. The
simulation of human vision via algorithms and mathematical representation thus
requires the identification of an object in an image and an understanding of its presence
and its behaviour.

Computer Vision Application in Augmented reality


Augmented reality (AR) is a method of providing an experience of the natural
surroundings with a computer-generated augmentation appropriate to the surroundings.
With the help of computer vision, AR can be virtually limitless, with augmentations providing
translations of written text and applying filters to objects in the world we see, directly when
we see them.

Computer Vision and Augmented Reality for E-Commerce


Recently, AR has made it possible for retailers to showcase their products in
real-time. IKEA was one of the first to roll out an AR application that enabled buyers
to visualize products within their homes. Today, more and more retailers are
utilizing AI software to elevate the shopping experience and ease purchasing
decisions for their clients.

The same goes for online clothing stores. The technology enables shoppers to
virtually try on clothes and find their perfect fit. The number of online stores
unveiling virtual fitting rooms via an app is growing. So far, computer vision-based
augmented reality has proven efficient in providing a better customer experience,
improving brand perception, and boosting sales.

AI Augmented Reality For Education


When it comes to education, AR and VR are total game-changers. They have
the potential to improve the learning process and better motivate and engage students.
More importantly, the technologies have proven successful in real-time training.
When theory fails to aid recall, AI-powered virtual reality comes to the rescue: students
have a range of VR tutorials and modeling sessions where they can
gain hands-on experience and polish their techniques.

Machine Learning-Based VR for Gaming

Virtual reality is evolving at breakneck speed. And some of the latest VR
breakthroughs haven’t been possible without machine learning and computer vision.

According to Grand View Research, the global virtual reality in gaming market is
expected to reach USD 45.09 billion by 2025. Lately, virtual reality powered by
computer vision has given a brand new twist to the video game industry. There are a
number of benefits VR has to offer the gaming business:

• significant increase in sales
• enhanced user experience
• improved player retention

AI-Powered Augmented Reality for Travel and Tourism


The technology has been actively utilized by travel agencies and hotels to improve
their overall brand reputation and boost revenue. Within the hospitality and tourism
industry, augmented reality fueled by computer vision acts as a powerful tool to bring
more interaction into hotels and resorts and nudge travelers toward impulse bookings.

On top of that, some travel agencies develop AR apps that offer breathtaking
immersive tours. The ultimate goal of these apps is to take a potential traveler on an
interactive tour somewhere sunny and give them as much information as possible about the
destination. For sophisticated travelers, there is the opportunity to get a unique view of
a sight or a resort from a drone. Thanks to AR drone image processing, tourism
agencies are now reaping the benefits of drone technology and promoting tourist
destinations.

VR and AR for Healthcare

These days, AR is making a significant contribution to the healthcare industry.

The innovation empowers healthcare professionals to provide better diagnoses and
make surgery safer. Using AI coupled with computer vision and AR, surgeons can now
place surgical incisions more precisely and prevent tissue damage.

Moreover, computer vision-based virtual reality is the next big thing for mental
health and psychotherapy. It is utilized to treat patients with post-traumatic stress
disorder (PTSD), depression, anxiety, and other mental health issues.
