Unit I

The document is a question bank for the Computer Science and Engineering department at V.S.B. Engineering College, focusing on Data Science for the academic year 2024-2025. It includes definitions, concepts, and comparisons related to data science, data mining, data warehousing, and various data processing techniques. The document serves as a study guide for students preparing for exams in this field.

V.S.B. ENGINEERING COLLEGE, KARUR


Department of Computer Science and Engineering (AIML)
Academic Year: 2024-2025 (Odd Semester)
UNIT 1 - Question Bank

1. Define Data science. (Apr/May 24) [RE]


Data science is the domain of study that deals with vast volumes of data using modern
tools and techniques to find unseen patterns, derive meaningful information, and make
business decisions. Data science uses complex machine learning algorithms to build
predictive models.

2. What are the benefits and uses of data science? (Apr/May 2023) [RE]
Data science is used in almost every sector:
• Commercial companies
• Governmental organizations
• Non-governmental organizations (NGOs)
• Universities

3. List the facets of data. (Apr/May 24) [RE]


• Structured
• Unstructured
• Natural language
• Machine-generated data
• Graph-based or network data
• Audio, video, and images
• Streaming data

4. What is Research goal? [RE]


A research goal is a precise and measurable aim that is clearly connected to the purposes,
workflows, and decision-making processes of the business.

5. Define Project charter. [RE]


A project charter is a formal short document that states a project exists and provides
project managers with written authority to begin work.
• Project goal
• Project participants
• Stakeholders
• Requirements and constraints
• Implementation milestones
• Deliverables
• Create an implementation plan

6. How to Retrieve Data in data science? [RE]


Data retrieval typically requires writing and executing data retrieval or extraction
commands or queries on a database. Based on the query provided, the database looks for and
retrieves the data requested. Applications and software generally use various queries to retrieve
data in different formats.
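Data can come from two kinds of sources:
• Internal data (held within the organization)
• External data (obtained from outside the organization)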
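
The query-based retrieval described above can be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite database; the table name `sales` and its rows are hypothetical and only stand in for an internal data source.

```python
import sqlite3

# Hypothetical in-memory database standing in for an internal data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("North", 120.0), ("South", 80.5), ("North", 60.0)],
)

# The query states what is wanted; the database looks up and returns the rows.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(rows)
```

The same pattern applies to other databases; only the connection step and SQL dialect would differ.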

7. List the Data preparation Steps. [RE]


• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Data discretization

8. What is Data Exploration? [RE]


Data exploration refers to the initial step in data analysis in which data analysts use data
visualization and statistical techniques to describe dataset characterizations, such as size,
quantity, and accuracy, in order to better understand the nature of the data.
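
A first exploratory pass often amounts to summarizing each column. The sketch below uses only the Python standard library; the dataset `records` and the helper `describe` are hypothetical names introduced for illustration.

```python
from statistics import mean

# Hypothetical toy dataset; None marks a value that was not recorded.
records = [
    {"age": 25, "income": 32000},
    {"age": 41, "income": None},
    {"age": 18, "income": 15000},
    {"age": 56, "income": 61000},
]

def describe(rows):
    """Return per-column count, missing count, min, max, and mean."""
    summary = {}
    for column in rows[0]:
        present = [r[column] for r in rows if r[column] is not None]
        summary[column] = {
            "count": len(present),
            "missing": len(rows) - len(present),
            "min": min(present),
            "max": max(present),
            "mean": mean(present),
        }
    return summary

summary = describe(records)
print(len(records), "rows")
for column, stats in summary.items():
    print(column, stats)
```

In practice a library such as Pandas offers comparable summaries; the standard-library version is shown only to make each step explicit.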

9. List the Data exploration tools [RE]


• Matplotlib
• Scikit-learn
• Plotly
• Seaborn
• Pandas
• D3
• Bokeh
• Altair
• Yellowbrick
• Tableau

10. Difference between Data and Information. [UN]


Data:
• Data refers to raw and unprocessed facts, figures, symbols, or values that have little or no
context or meaning on their own.
• It is typically represented as numbers, text, images, audio, video, or any other form that
can be processed by computers or observed by humans.
• Data lacks interpretation and significance until it is processed and organized in a
meaningful way.
Example: The numbers 25, 32, 18, 41, and 56 are data. They are just individual values without
any context.
Information:
• Information is the result of processing and organizing data to give it meaning and
context.
• It involves analyzing and interpreting data to derive meaningful insights, conclusions, or
knowledge.
• Information is more structured and has value to the recipient, as it provides answers to
specific questions or helps in making informed decisions.

11. What is meant by missing value? [RE]


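A missing value is a value that was not recorded for a variable in an observation.
Missing data is classified by the reason it is missing:
• Missing Completely at Random (MCAR): the missingness is unrelated to any observed or
unobserved data.
• Missing at Random (MAR): the missingness can be explained by other observed data.
• Missing Not at Random (MNAR): the missingness depends on the unobserved data itself, so
there is a structure or pattern in the missing data that other observed data cannot explain.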
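
Detecting and filling missing values can be sketched as below. Mean imputation is only one possible treatment, chosen here as an assumption for illustration; it is usually considered reasonable mainly when data is MCAR. The list `readings` is hypothetical.

```python
from statistics import mean

# Hypothetical readings; None marks a missing value.
readings = [4.0, None, 6.0, 8.0, None]

# Detect: positions where the value is missing.
missing_positions = [i for i, v in enumerate(readings) if v is None]

# Impute: replace each missing value with the mean of the observed values.
observed_mean = mean(v for v in readings if v is not None)
imputed = [observed_mean if v is None else v for v in readings]

print(missing_positions)
print(imputed)
```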

12. Summarize Outliers. [UN]


An outlier is an observation that lies an abnormal distance from other values in a random
sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus
process) to decide what will be considered abnormal.
Outliers can be identified with scatter plots and box plots.
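
A box plot conventionally flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that convention numerically; the 1.5 multiplier, the function name `iqr_outliers`, and the sample data are assumptions for illustration, and the analyst remains free to choose a different rule.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical sample with one clearly abnormal observation.
sample = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(sample)
print(outliers)
```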

13. Show the key points in building the model. [RE]


• Selection of the model
• Execution of the model
• Diagnosis of the model
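
The three steps can be sketched with a very small example. The choice of a straight-line model fitted by ordinary least squares, and the data (hours studied against exam score), are assumptions made only to illustrate the sequence.

```python
from statistics import mean

# Hypothetical data: hours studied (x) and exam score (y).
x = [1, 2, 3, 4, 5, 6]
y = [52, 55, 61, 64, 70, 73]

# 1. Selection: a straight line y = a + b * x is chosen as the model.
# 2. Execution: fit a and b by ordinary least squares.
x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

# 3. Diagnosis: inspect the residuals (actual minus predicted).
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
largest_error = max(abs(r) for r in residuals)

print(a, b, largest_error)
```

If the residuals were large or showed a clear pattern, the diagnosis step would send the analyst back to model selection.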

14. State Link and brush. [RE]


In databases, brushing and linking is the connection of two or more views of the same
data, such that a change to the representation in one view affects the representation in the other.
Brushing and linking is also an important technique in interactive visual analysis, a method for
performing visual exploration and analysis of large, structured data sets.

15. Define Machine Generated data. [RE]


Machine-generated data is information automatically generated by a computer
process, application, or other mechanism without the active intervention of a human.
Examples:
• Web server logs
• Call detail records
• Financial instrument trades
• Network event logs

16. What is meant by Data mining? [RE]


Data mining is a process used by companies to turn raw data into useful information. By
using software to look for patterns in large batches of data, businesses can learn more about their
customers to develop more effective marketing strategies, increase sales and decrease costs. Data
mining depends on effective data collection, warehousing, and computer processing.

17. Define Data warehousing. [RE]

Data warehousing is the process of constructing and using a data warehouse. A data
warehouse is constructed by integrating data from multiple heterogeneous sources that support
analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing
involves data cleaning, data integration, and data consolidation.

18. Difference between Data mining and Data science. [UN]

Data Mining:
• Data mining is a phase of extracting useful data, patterns, and trends from large
databases.
• The main objective of data mining is to discover properties of existing information that
were previously unknown and to find statistical rules or patterns from those data to solve
complex computing problems.
• In data mining, the identified trends and patterns are used by organizations to formulate
operations, marketing, and financial strategies to fuel business growth.
• Data mining centers on discovering records from several sources and transforming the
data into a useful tool. It can be used across industries.
Data Science:
• Data science defines the process of obtaining valuable insights from structured and
unstructured records by using several tools and methods.
• The main objective of data science is to use certain specialized computational methods to
find meaningful and useful data within a dataset to support important decisions.
• Data science is scientific research that paves the way for project-, program-, or
portfolio-centric analysis.
• Data science makes data-focused products for organizations and drives decisions through
the aid of records. It can be used across industries.

19. List out the Basic Statistics of data description. [RE]


• Mean
• Mode
• Median
• Variance
• Correlation
• Standard deviation
• Regression
• Normal distribution
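
Several of these statistics can be computed directly with the Python standard library. The sketch below covers variance, standard deviation, and Pearson correlation; the two data lists and the helper `pearson` are hypothetical, and the population (not sample) formulas are an assumption made for simplicity.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance

# Hypothetical paired observations.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
hours = [1, 2, 2, 3, 3, 4, 5, 6]

variance = pvariance(scores)  # population variance
std_dev = pstdev(scores)      # population standard deviation

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / sqrt(
        sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys)
    )

r = pearson(hours, scores)
print(variance, std_dev, r)
```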

20. Differentiate between Data science and Data warehouse. [UN]


Purpose:
• Data science: The purpose of data science is not fixed. Sometimes organizations have a
future use-case in mind. Its general uses include data discovery, user profiling, and machine
learning.
• Data warehouse: The data warehouse has data that has already been designed for some
use-case. Its uses include Business Intelligence, Visualizations, and Batch Reporting.
Users:
• Data science: Data Scientists use data lakes to find out the patterns and useful information
that can help businesses.
• Data warehouse: Business Analysts use data warehouses to create visualizations and
reports.
Pricing:
• Data science: It is comparatively low-cost storage as we do not give much attention to
storing in the structured format.
• Data warehouse: Storing data is a bit costlier and also a time-consuming process.

21. Outline data binning. [UN]


Data binning, also called discrete binning or bucketing, is a data pre-processing technique
used to reduce the effects of minor observation errors. It is a form of quantization. The original
data values are divided into small intervals known as bins, and then they are replaced by a
general value calculated for that bin. This has a smoothing effect on the input data and may also
reduce the chances of overfitting in the case of small datasets.
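
As a minimal sketch of the idea, the function below splits the value range into equal-width bins and replaces each value with the mean of its bin. Equal-width bins and the bin mean as the "general value" are assumptions; equal-frequency bins or bin medians are other common choices. The names `bin_by_mean` and `prices` are hypothetical.

```python
from statistics import mean

def bin_by_mean(values, n_bins):
    """Replace each value by the mean of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so there is nothing to smooth.
        return list(values)

    def index(v):
        # The maximum value is placed in the last bin.
        return min(int((v - lo) / width), n_bins - 1)

    bins = {}
    for v in values:
        bins.setdefault(index(v), []).append(v)
    return [mean(bins[index(v)]) for v in values]

# Hypothetical data split into three bins of width 10.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
smoothed = bin_by_mean(prices, 3)
print(smoothed)
```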

22. Define Data science and Big data (Nov/Dec 2022) [RE]
Big data refers to large data sets that are analyzed to understand data trends, an activity
also referred to as data mining. Data science uses machine learning algorithms and statistical
methods to generate information from big data that can be applied to enhance business
processes.

23. List an overview of common errors in retrieving data and which cleansing solutions to
be employed (Nov/Dec 2022) [RE]
• Failing to analyze the data system
• Focusing on a small subset of data
• Overlooking duplicate data
• Cleaning without backing up data
• Communication breakdowns
• Benefits of getting and cleaning data

24. What do you understand by the term mining? (Apr/May 2023) [RE]
Data mining is the process of sorting through large data sets to identify patterns and
relationships that can help solve business problems through data analysis. Data mining
techniques and tools enable enterprises to predict future trends and make more-informed
business decisions.

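25. Calculate the mean, median and mode for the following: 2, 5, 1, 8, 9, 2, 7, 6. (Nov/Dec 2023) [RE]
Sum = 40 and n = 8, so Mean = 40 / 8 = 5
Sorted values: 1, 2, 2, 5, 6, 7, 8, 9, so Median = (5 + 6) / 2 = 5.5
Mode = 2 (the only value that occurs twice)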
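
The answer can be checked with Python's standard `statistics` module; the variable name `data` is introduced only for this check.

```python
from statistics import mean, median, mode

data = [2, 5, 1, 8, 9, 2, 7, 6]

print(mean(data))    # 5
print(median(data))  # 5.5
print(mode(data))    # 2
```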

26. Recall data reduction. [RE]
Data reduction is a method of reducing the size of original data so that it may be
represented in a much smaller space. While reducing data, data reduction techniques preserve
data integrity.

27. Compare Data warehouse, Data mart and Data lake. [UN]

Purpose:
• Data Warehouse: Centralized repository for consolidated data from various sources for BI,
reporting, and strategic decision-making.
• Data Mart: Smaller, focused subset of a data warehouse, catering to the specific needs of a
department or user group.
• Data Lake: Vast, flexible storage system for raw and unstructured data, allowing for more
exploratory data analysis and big data processing.
Data Structure:
• Data Warehouse: Structured data, organized into predefined schemas and data models.
• Data Mart: Structured data, similar to a data warehouse, but focused on specific business
areas or user groups.
• Data Lake: Contains both structured and unstructured data, including raw data and data in
its native format.
Data Processing:
• Data Warehouse: Involves Extract, Transform, Load (ETL) processes to transform and
cleanse data before storage.
• Data Mart: May also undergo ETL processes but generally simpler than data warehouses.
• Data Lake: Often uses ELT (Extract, Load, Transform) processes, allowing for faster data
ingestion and more flexible analysis.
Data Usage:
• Data Warehouse: Business intelligence, data analysis, and historical reporting for strategic
decision-making.
• Data Mart: Analytical purposes, providing faster and targeted access to relevant data for
specific business areas.
• Data Lake: Big data processing, data science, machine learning, and advanced analytics
with a focus on exploration.
Scalability:
• Data Warehouse: Designed for scalability, but can be more challenging to scale compared
to data lakes.
• Data Mart: Relatively easier to scale than data warehouses due to their smaller size and
focused nature.
• Data Lake: Highly scalable, capable of handling massive volumes of both structured and
unstructured data without significant structural changes.
Schema:
• Data Warehouse: Predefined schemas and data models are used.
• Data Mart: Predefined or customized schemas, often derived from the data warehouse.
• Data Lake: Not necessarily predefined, offering more flexibility in data exploration and
analysis.
Data Quality:
• Data Warehouse: High emphasis on data quality, as data is transformed and cleansed
before storage.
• Data Mart: Relies on data quality maintained in the data warehouse or during ETL
processes.
• Data Lake: Data quality may vary since raw data is stored, but it can be improved during
the analysis process.
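
The difference between ETL and ELT is only the order of the steps, which the sketch below tries to make concrete. The raw lines, the `transform` helper, and the `warehouse` and `lake` lists are hypothetical stand-ins, not real storage systems.

```python
# Hypothetical raw records as they arrive from a source system.
raw = [" Alice ,30", "BOB,41", " carol,27 "]

def transform(line):
    """Cleanse one raw line into a structured record."""
    name, age = line.split(",")
    return {"name": name.strip().title(), "age": int(age)}

# ETL (typical for a warehouse): transform first, then load the clean rows.
warehouse = [transform(line) for line in raw]

# ELT (typical for a lake): load raw data as-is, transform at analysis time.
lake = list(raw)
ages_from_lake = [transform(line)["age"] for line in lake]

print(warehouse)
print(ages_from_lake)
```

Loading without transforming is what makes ingestion into a lake faster, at the cost of deferring the data-quality work to the analysis step.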

28. List the Common evaluation metrics used to measure the performance of models. [RE]
Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE),
Coefficient of Determination (R²), Confusion Matrix.
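
These metrics can be computed by hand in a few lines. The sketch below assumes small hypothetical lists of actual and predicted values for the regression metrics, and binary labels (1 = positive, 0 = negative) for the confusion matrix.

```python
from collections import Counter
from math import sqrt

# Hypothetical regression outputs.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]
n = len(actual)

mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
rmse = sqrt(mse)
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
mean_actual = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

# Hypothetical binary classification outputs.
labels_true = [1, 0, 1, 1, 0, 0]
labels_pred = [1, 0, 0, 1, 1, 0]

# Confusion matrix counts keyed by (true label, predicted label).
counts = Counter(zip(labels_true, labels_pred))
tp, fn = counts[(1, 1)], counts[(1, 0)]
fp, tn = counts[(0, 1)], counts[(0, 0)]

print(mse, rmse, mae, r2)
print([[tp, fn], [fp, tn]])
```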

29. Show example for any 3 types of Visualization methods used for data exploration. [UN]
• Bar chart
• Line plot
• Distribution graph

30. Define Exploratory Data Analysis and its types (Nov/Dec 2023) [UN]
Exploratory Data Analysis (EDA) refers to the critical process of investigating data in order
to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of
summary statistics and graphical representations.
Types:
• Simple graph
• Combined graph
• Link and brush
• Non-graphical representation
