
Big Data: Nature and Challenges

This document contains questions and notes about big data and intelligent data analysis. It begins with three parts for a question bank (Parts A, B, and C) that cover topics such as the 5 V's of big data, challenges of conventional systems, intelligent data analysis stages and processes. The notes section defines big data and its sources, provides examples of big data in healthcare, and describes intelligent data analysis and the nature of data.



UNIT 1 – INTRODUCTION TO BIG DATA


QUESTION BANK
PART A (2 Marks each)
1. What do you mean by Big Data? Imp
2. What are the different sources of big data?
3. Explain different applications of Big Data. Imp
4. Write down the nature of data. Imp
5. What are the different stages in Intelligent Data Analysis?
6. Distinguish between Analysis and Reporting. Imp
7. What do you mean by statistical distributions?
8. What do you mean by re-sampling?
9. Define prediction error.

PART B (5 Marks each)


10. Explain the 5 V's of Big Data.
11. Explain the challenges of conventional systems.
12. Explain Intelligent Data Analysis. Imp
13. Explain the analytical process in detail.

PART C (15 Marks each)


14. Explain in detail the nature of data.
15. Explain Intelligent Data Analysis. Imp
16. Explain Modern Data Analytical Tools. Imp

NOTES
INTRODUCTION TO BIG DATA PLATFORM
▪ Big Data is a high-volume, high-velocity and/or high-variety information asset that
requires new forms of processing for enhanced decision making, insight discovery
and process optimization.
▪ A collection of data sets so large or complex that traditional data processing
applications are inadequate.
▪ Data of a very large size, typically to the extent that its manipulation and
management present significant logistical challenges.
▪ A single mobile phone user generates about 40 exabytes of data every month.
▪ This massive amount of data is termed Big Data.
▪ Any data can be classified as Big Data using the concept of the 5 V's:
1. Volume
2. Velocity
3. Variety

Prepared by NITHIN SEBASTIAN



4. Veracity
5. Value
▪ Consider the following example from the healthcare industry.
➢ Volume:
o High data volumes impose distinct data storage and processing
demands, as well as additional data preparation, curation, and
management processes.
o Hospitals and clinics across the world generate massive volumes of
data.
o An estimated 2,314 exabytes of data are collected annually.
➢ Velocity:
o In Big Data environments, data can arrive at fast speeds, and
enormous datasets can accumulate within very short periods of time.
o These data include patient records and test results.
o All this data is generated at a very high speed, which constitutes the
velocity of big data.
➢ Variety:
o Data variety refers to the multiple formats and types of data that need
to be supported by Big Data solutions.
o Data variety brings challenges for enterprises in terms of data
integration, transformation, processing, and storage.
o It refers to the various data types:
Structured data: Excel Records.
Semi-structured data: Log files
Un-structured data: X-ray Images.
➢ Veracity:
o Veracity refers to the quality or fidelity of data.
o Data that enters Big Data environments needs to be
assessed for quality, which can lead to data processing
activities to resolve invalid data and remove noise.
o The accuracy and trustworthiness of the generated data is
termed veracity.
o Noise is data that cannot be converted into information and thus has
no value, whereas signals have value and lead to meaningful
information.
➢ Value:
o Value is defined as the usefulness of data for an enterprise.
o Analysing all this data will benefit the medical sector through:
Faster disease detection
Better treatment
Reduced cost
o These benefits are known as the value of big data.
▪ To store and process big data, various frameworks are used.


Cassandra
Hadoop
Spark
▪ Big data is analysed for numerous applications in games like:
HALO 3
CALL OF DUTY
▪ Designers analyse users’ data to understand at which stages users pause, restart, or
quit playing.
▪ This insight can help them rework the game and improve the user experience.
▪ Big data also helped with disaster management during a hurricane in the USA in 2012,
and necessary measures were taken.

TYPES/SOURCES OF BIG DATA


▪ The following classification is suggested by IBM and the Big Data task team:
• Social networks and web data: such as Facebook, Twitter, e-mails, blogs, and
YouTube.
• Transactions data and Business Processes data: such as credit card
transactions, flight bookings, etc. and public agencies data such as medical
records, insurance business data, etc.
• Customer master data: such as data for facial recognition and for the name,
date of birth, marriage anniversary, gender, location and income category.
• Machine-generated data: such as machine-to-machine or Internet of Things
(IoT) data, and data from sensors, trackers, web logs and computer
system logs. Computer-generated data, such as the data produced by
programs processing data repositories (a database or file), is also
considered machine-generated data.
• Human-generated data: such as biometrics data, human-machine interaction
data, e-mail records with a mail server and MySql database of student grades.
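As a small illustration of machine-generated data, the following Python sketch parses a hypothetical web-server access log line; the log format, field names, and values are assumptions for illustration only:

```python
import re

# A hypothetical Apache-style access log line (machine-generated data)
log_line = '192.168.1.10 - - [12/Mar/2023:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 1024'

# Regex capturing the IP, timestamp, request, status code, and response size
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]+)" (?P<status>\d{3}) (?P<size>\d+)'
)

match = pattern.match(log_line)
record = match.groupdict()
print(record["ip"], record["status"], record["size"])
```

Such logs accumulate automatically, in huge volumes, without any human action, which is what distinguishes machine-generated data from human-generated data.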

CHALLENGES OF CONVENTIONAL SYSTEMS


▪ Big data is a huge amount of data; hence conventional systems cannot store, manage
and analyse it within a reasonable time interval.
1. Data is collected in large quantities and it is not possible to process
everything.
2. Data must be meaningful and collected in real time. Meaningful data is
data without irregularities, mistakes and inconsistencies; real-time data
means the data cannot be old or outdated.
3. Data is collected from multiple sources (text, audio, video, etc.) and must be
categorized. This is an important aspect of data collection. Doing this
manually is impractical for humans: it is extremely challenging, time
consuming, and requires a lot of manpower.


4. Collect correct data by eliminating flaws, mistakes, incompleteness,
inconsistency and irregularity from the data, because wrong data will
produce problems in the future.
5. Comparison of data using multiple tools. The collected data must be
represented in the form of graphs, charts, statistics, etc.
6. The data must be accessible to the respective person.
7. Lack of knowledgeable professionals who can handle and deal with diverse
and large data.
▪ To overcome these challenges and drawbacks, intelligent data analysis is used.
▪ Valuable information can be gathered with the help of machines, thus reducing
processing time, cost and errors.

INTELLIGENT DATA ANALYSIS (IDA)


▪ Intelligent Data Analysis (IDA) discloses hidden facts that are not known previously
and provides potentially important information or facts from large quantities of data.
▪ It also helps in making a decision.
▪ IDA helps to obtain useful information, necessary data and interesting models from a
lot of data available online in order to make the right choices.
▪ The main goal of intelligent data analysis is to obtain knowledge.
▪ Data analysis is a process that combines extracting data from a data set,
analysing, classifying, organizing, reasoning and so on.
▪ Intelligent data analysis is the combination of advanced statistical techniques,
human intuition and serious computing power to address real world data intensive
problems.
▪ Analysis is a scientific process to discover the meaningful patterns and structures
hidden within the mountain of available data and transform this data into
information for better decisions.
▪ IDA is one of the major topics in Artificial Intelligence (AI) and information
science.
▪ In general, IDA includes three stages:
1. Preparation of data
2. Data mining
3. Data validation and explanation

NATURE OF DATA
▪ Data are raw facts that have not been processed to explain their meaning.
▪ Data are stored in a database, and a database management system (DBMS) manages
the data, i.e., it stores, updates and retrieves data from the database.
▪ There are 3 types of data:
1. Structured data
2. Semi-structured data


3. Unstructured data
STRUCTURED DATA
▪ Stored in tabular format.
▪ i.e., in the form of rows and columns.
▪ Structured data is clearly defined and stored in a pre-defined data model.
▪ Ex: Excel files, SQL databases.
▪ Data are stored in rows and columns that are related to each other.
▪ Hence we get a proper view and understanding of the data.
▪ Structured data are stored in relational databases.


UN- STRUCTURED DATA

▪ No predefined structure.
▪ No data model
▪ Data is irregular and ambiguous.
▪ Ex: text, numbers, images, audio, video, messages, social media posts, etc.
▪ It is difficult to extract information from such data.
▪ 80- 90% of data are unstructured data.
▪ Real-life example: content on Facebook, Instagram and YouTube is unstructured data.
▪ It is a complex task to analyse such data; hence Artificial Intelligence is used.
▪ Ex: face recognition by Google.
▪ Previously, only structured data was used extensively. But with the help of Artificial
Intelligence, unstructured data is now commonly used.
▪ So, unstructured data is the most useful kind of data, and it provides a lot of
information.
SEMI-STRUCTURED DATA
▪ It falls between structured and unstructured data.
▪ It is a combination of both.
▪ Ex: Emails, XML, WWW.
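The three kinds of data can be contrasted in a short Python sketch; the sample records below are hypothetical:

```python
import csv
import io
import json

# Structured: tabular rows and columns (CSV mimicking an Excel/SQL table)
table = io.StringIO("id,name,grade\n1,Asha,A\n2,Ravi,B\n")
rows = list(csv.DictReader(table))

# Semi-structured: JSON has structure (keys) but no rigid, fixed schema
email = json.loads('{"from": "a@example.com", "subject": "Hi", "tags": ["intro"]}')

# Unstructured: free text with no predefined data model at all
post = "Loved the trip to Delhi! Photos coming soon..."

print(rows[0]["name"], email["subject"], len(post.split()))
```

Note how the structured rows share identical columns, the JSON email carries its own field names, and the free-text post can only be processed as raw words.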

ANALYTIC PROCESS
▪ The steps involved in data analytic process are:


1. Collecting data
2. Cleaning data
3. Manipulating data
4. Analysing data
5. Visualizing data
▪ Ex: Travel industry:
Collecting Data
✓ If a person is travelling to Delhi, he uses an aviation website and provides
basic details like destination, date of travel and price. He selects a flight
within his budget and confirms the payment. These data are collected by the
travel company.
✓ Similarly, many people do the same thing, which generates a lot of data.
✓ These data are stored on the company’s web servers in tabular format, making
it easier for an analyst to analyse the data.
Cleaning data
✓ If there are missing values or unstructured entries in the tabular data,
replacing the missing values or deleting those rows is called cleaning the data.
✓ Now the data is clean and ready for analysis.
Manipulating Data
✓ The analyst manipulates the data to create required features and variables.
✓ Ex: the analyst adds new columns, like return date.
Analysing Data
✓ Data are analysed using logical methods and analytical techniques.
✓ Once it’s ready, advanced analytics processes can turn big data into big insights.
✓ Some of these big data analysis methods include:
● Data mining sorts through large datasets to identify patterns and relationships by
identifying anomalies and creating data clusters.
● Predictive analytics uses an organization’s historical data to make predictions
about the future, identifying upcoming risks and opportunities.
● Deep learning imitates human learning patterns by using artificial intelligence and
machine learning to layer algorithms and find patterns in the most complex and
abstract data.

Visualizing Data
✓ Present the data after the analysis is complete.
✓ Visualization means showing the analysed data in a visual or graphical format
for easy interpretation.
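The five steps above can be sketched end to end on the travel example; the booking records, column names, and thresholds below are hypothetical, and only the Python standard library is used:

```python
from statistics import mean

# 1. Collecting: hypothetical booking records gathered by a travel website
#    (None marks a missing price, to be handled in cleaning)
bookings = [
    {"destination": "Delhi", "price": 120, "date": "2023-03-01"},
    {"destination": "Delhi", "price": None, "date": "2023-03-02"},
    {"destination": "Goa", "price": 250, "date": "2023-03-05"},
]

# 2. Cleaning: drop rows with missing values
clean = [b for b in bookings if b["price"] is not None]

# 3. Manipulating: derive a new feature (a price band column)
for b in clean:
    b["band"] = "budget" if b["price"] < 200 else "premium"

# 4. Analysing: compute the average ticket price
avg_price = mean(b["price"] for b in clean)

# 5. Visualizing: a crude text "bar chart" of prices
for b in clean:
    print(f'{b["destination"]:6} {"#" * (b["price"] // 50)}')
print("average:", avg_price)
```

In practice each step would use dedicated tooling (databases for collection, a data-frame library for manipulation, a charting tool for visualization), but the pipeline shape is the same.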

DATA ANALYSIS TOOLS


▪ Data analysis tools provide the analysed results visually.
▪ They are:


Hadoop:
✓ It is an open-source framework that efficiently stores and processes big
datasets on clusters of commodity hardware.
✓ This framework is free and can handle large amounts of structured and
unstructured data, making it a valuable mainstay for any big data operation.
Microsoft Excel:
✓ Developed by Microsoft.
✓ It is a spreadsheet program used to create grids of numbers, text and
various formulas.
✓ Easy to use and a widely used tool.
✓ Excel works with other Office software, i.e., Excel spreadsheets can be
easily added to Word documents and PowerPoint presentations.
✓ The biggest benefit of Excel is its ability to organize large amounts of
data into orderly, logical spreadsheets and charts.
RapidMiner:
✓ It is a data science software platform which helps with data
presentation and analysis.
✓ It is an integrated environment for:
• Data preparation
• Analysis
• Machine learning
• Deep learning
✓ It is widely used in every business and commercial sector.
✓ It has data exploration features such as:
• Graphs
• Descriptive statistics
• Visualization which allows users to get valuable insights.
✓ It has more than 1500 operators for data transformation and analysis
tasks.
Talend:
✓ It is an open-source software platform which offers data integration
and management.
✓ It specializes in big data integration.
✓ It is also available in open-source and premium versions.
✓ It is one of the best tools for cloud computing and big data
integration.
KNIME:
✓ It is a free and open-source data analysis tool to create data science
applications and build machine learning models.
✓ It is an analysing, reporting and integration platform.
✓ KNIME has been used in pharmaceutical research and customer data
analysis, business intelligence, text mining & financial data analysis.
✓ It provides interactive graphical user interface to create visual
workflows.


▪ Other tools are:


SAS (Statistical Analysis System)
R and Python
Apache Spark
Power BI
Tableau

ANALYSIS VS REPORTING
REPORTING
▪ It is the process of organizing data in the form of graphs and charts.
▪ Reporting is used to provide facts, which can be used to draw conclusions, avoid
problems or create plans.
▪ Reporting presents the actual data to end-users, after collecting, sorting and
summarizing it to make it easy to understand.
▪ Reporting offers no judgment or insight.
▪ It focuses on what is happening.
▪ High-level overview of data.
ANALYSING
▪ It is the process of exploring data in order to extract a meaningful insight.
▪ Analytics offers pre-analysed conclusions that a company can use to solve problems
and improve its performance.
▪ Analytics doesn't present the data but instead draws information from the available
data and uses it to generate insights, forecasts and recommended actions.
▪ It focuses on why something is happening within an organization.
▪ Interpret data at a deeper level.

STATISTICAL CONCEPTS
▪ Statistics is a branch of applied mathematics where we collect, organize, analyse
and interpret numerical facts.
▪ Statistical methods are the concepts, models and formulas of mathematics used in
the statistical analysis of data.
▪ It is the science of collecting, exploring and presenting large amounts of data to
identify patterns and trends.
▪ It is also called quantitative analysis.
SAMPLING DISTRIBUTIONS
• In statistics, a population is the entire pool from which a statistical sample is
drawn.
• A population may refer to an entire group of people, objects, events, hospital
visits, or measurements.
• A population can thus be said to be an aggregate observation of subjects
grouped together by a common feature.


• A lot of data drawn and used by academicians, statisticians, researchers,
marketers, analysts, etc. are actually samples, not populations.
• A sample is a subset of a population.
• A sampling distribution is a probability distribution of a statistic obtained
from a larger number of samples drawn from a specific population.
• The sampling distribution of a given population is the distribution of
frequencies of a range of different outcomes that could possibly occur for a
statistic of a population.
• A sampling distribution summarizes the values of a statistic obtained from a
large number of samples drawn repeatedly from a specific group.
• In statistics, probability is used to calculate the likely occurrence of a
phenomenon.
• This is done by collecting samples from populations.
• A lot of data collected over time is used to calculate the probabilities of
an event.
• This data is collected with utmost precision.
• Sampling distribution involves more than one statistical value of a sample.
• The primary purpose of Sampling Distribution is to establish representative
results of small samples of a comparatively larger population.
• The significance of a sampling distribution:
✓ It provides accuracy.
✓ It provides consistency.
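A sampling distribution can be simulated in a few lines of Python: draw many samples from a synthetic population and look at how the sample means are distributed. The population parameters and sample sizes below are assumptions chosen for illustration:

```python
import random
from statistics import mean, stdev

random.seed(42)

# A hypothetical population of 10,000 values
population = [random.gauss(mu=50, sigma=10) for _ in range(10_000)]

# Draw many samples and record each sample mean
sample_means = [
    mean(random.sample(population, 30))  # sample size n = 30
    for _ in range(1_000)
]

# The sample means cluster around the population mean, with a spread of
# roughly sigma / sqrt(n) (the standard error)
print("population mean:", round(mean(population), 1))
print("mean of sample means:", round(mean(sample_means), 1))
print("std of sample means:", round(stdev(sample_means), 2))
```

The simulation shows why small samples can still represent a much larger population: the distribution of the sample mean is centred on the population mean and far narrower than the population itself.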
RE-SAMPLING

• The problem with the sampling process is that we only have a single estimate
of the population parameter, with little idea of the variability or uncertainty in
the estimate.
• One way to address this is by estimating the population parameter multiple
times from our data sample. This is called resampling.
• Re-sampling is the method of creating or drawing repeated samples from
the original sample.
• Resampling involves selecting randomized cases, with replacement, from
the original data sample, in such a manner that each sample drawn has the
same number of cases as the original data sample.
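A minimal bootstrap-resampling sketch in Python, assuming a small hypothetical sample; `random.choices` draws with replacement, matching the description above:

```python
import random
from statistics import mean

random.seed(0)

# A single observed sample -- our only estimate of the population
sample = [12, 15, 9, 14, 11, 16, 10, 13, 12, 14]

# Bootstrap: redraw samples of the same size, with replacement
boot_means = [
    mean(random.choices(sample, k=len(sample)))
    for _ in range(2_000)
]

# The spread of bootstrap means estimates the uncertainty in our estimate
boot_means.sort()
low, high = boot_means[50], boot_means[1950]   # rough 95% interval
print("observed mean:", mean(sample))
print("95% bootstrap interval:", round(low, 1), "-", round(high, 1))
```

This directly addresses the problem stated above: from one sample we obtain not just a single estimate of the mean but a whole distribution of estimates, and hence a measure of its variability.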

STATISTICAL INFERENCE

• Statistical inference is the process of analysing results and drawing
conclusions from data.
• It is also called inferential statistics.
• Statistical inference is a method of making decisions about the parameters of
a population, based on random sampling.
• It helps to assess the relationship between the dependent and independent
variables.

Prepared by NITHIN SEBASTIAN


10

• The different types of statistical inference are:
✓ Pearson Correlation
✓ Bi-variate regression
✓ Multi-variate regression
✓ Chi-square statistics and contingency table
✓ ANOVA or T-test
• The statistical inference has a wide range of application in different fields,
such as:
✓ Business Analysis
✓ Artificial Intelligence
✓ Financial Analysis
✓ Fraud Detection
✓ Machine Learning etc.
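As a small worked example of one technique from the list, the following Python sketch computes the Pearson correlation coefficient directly from its formula; the study-hours data are hypothetical:

```python
from math import sqrt
from statistics import mean

# Hypothetical data: hours studied vs exam score
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 70, 74, 80]

mx, my = mean(hours), mean(score)

# r = cov(x, y) / (std_x * std_y), computed from deviation sums
cov = sum((x - mx) * (y - my) for x, y in zip(hours, score))
sx = sqrt(sum((x - mx) ** 2 for x in hours))
sy = sqrt(sum((y - my) ** 2 for y in score))

r = cov / (sx * sy)
print("Pearson r:", round(r, 3))
```

A value of r close to +1 indicates a strong positive relationship between the dependent and independent variables, which is exactly the kind of assessment statistical inference supports.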
PREDICTION ERROR

• Predictive analytical processes use new and historical data to forecast activity,
behaviour, and trends.
• A prediction error is the failure of an expected event to occur.
• When a prediction fails, analysts can examine the predictions and failures
and decide on methods to overcome such errors in the future.
• Applying that kind of knowledge can inform decisions and improve the
quality of future predictions.
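Prediction error is commonly quantified with metrics such as mean absolute error (MAE) and mean squared error (MSE); a minimal sketch, assuming hypothetical forecast data:

```python
from statistics import mean

# Hypothetical forecasts vs actual outcomes
actual    = [100, 120, 130, 110, 150]
predicted = [ 98, 125, 128, 115, 140]

errors = [a - p for a, p in zip(actual, predicted)]

mae = mean(abs(e) for e in errors)   # mean absolute error
mse = mean(e ** 2 for e in errors)   # mean squared error (penalizes large misses)

print("errors:", errors)
print("MAE:", mae, "MSE:", mse)
```

Tracking such metrics over time is one concrete way to examine failures and improve the quality of future predictions.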
