Unit 2 Final

The document outlines the foundations of data in AI, emphasizing the importance of data types, ethics, and utility in building applications. It categorizes data into structured, unstructured, and semi-structured types, and discusses ethical considerations such as privacy and fairness. Additionally, it highlights the role of big data in enhancing AI model performance and the significance of various data modalities and formats.

Uploaded by

40272.8vdib
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
202 views21 pages

Unit 2 Final

The document outlines the foundations of data in AI, emphasizing the importance of data types, ethics, and utility in building applications. It categorizes data into structured, unstructured, and semi-structured types, and discusses ethical considerations such as privacy and fairness. Additionally, it highlights the role of big data in enhancing AI model performance and the significance of various data modalities and formats.

Uploaded by

40272.8vdib
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

UNIT – 2

1Q. Foundations of Data - Types, Ethics and Utility in Building Applications using AI

Foundations of data consist of understanding data types, ethical handling of data, and using data effectively to
train AI models.

Structured, unstructured, and semi-structured data help in building AI systems.

Ethical principles like privacy, fairness, consent, and transparency ensure responsible AI.

Utility of data lies in training models, making predictions, personalization, automation, and enabling intelligent
decision-making in modern applications.

AI applications depend completely on data. The quality, type, and ethical use of data decide how good and safe
an AI system will be.

I. Data Types

Data used in AI can be broadly divided as follows:

1. Structured Data

Organized in rows/columns (databases). Easy to query and process for AI models (e.g., tabular datasets for
regression).

2. Unstructured Data

No fixed format (images, audio, text, video). Requires preprocessing (feature extraction, embeddings) for AI
use.

3. Semi-Structured Data

Not fully structured but contains tags or markers. Mix of formats (JSON, HTML, XML files). Needs parsing to
extract usable features for AI pipelines.

4. Time-Series Data: Sequential data points (e.g., sensor logs), used in forecasting or anomaly detection models.

5. Data Based on Origin

 Primary Data: Collected first-hand (surveys, experiments, sensors).


 Secondary Data: Taken from existing sources (web data, reports).

6. Data Based on Dynamics

 Static Data: Does not change (historical records).


 Dynamic Data: Continuously changing (real-time sensor data, stock prices).

7. Data for ML/AI Tasks


 Training Data – teaches the model
 Validation Data – tunes the model
 Testing Data – evaluates performance
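As a sketch, the three splits can be produced from a single dataset; the 70/15/15 ratio and the `split_dataset` helper below are illustrative choices, not a fixed rule:

```python
import random

def split_dataset(samples, train=0.7, val=0.15, seed=42):
    """Shuffle and split samples into training, validation, and test sets."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    data = samples[:]
    rng.shuffle(data)
    n_train = int(len(samples) * train)
    n_val = int(len(samples) * val)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

train_set, val_set, test_set = split_dataset(list(range(100)))
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

Libraries such as scikit-learn provide equivalent helpers, but the principle is the same: the model must never see test data during training.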

II. Data Ethics in AI

Ethics ensures that AI systems are fair, safe, transparent, and responsible.

1. Privacy

 Protecting personal data of users


 Following laws like GDPR, Data Protection Acts
 Using encryption, anonymization

2. Bias and Fairness

Unrepresentative datasets can lead to unfair AI outcomes, and data can contain human or societal biases. Mitigation involves diverse sampling and bias audits (using bias detection tools).
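A minimal bias audit can be as simple as measuring how each group is represented in a dataset; the `audit_group_balance` helper and the `gender` field below are hypothetical examples:

```python
from collections import Counter

def audit_group_balance(records, group_key):
    """Return each group's share of the dataset, to spot unrepresentative sampling."""
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

dataset = [{"gender": "F"}, {"gender": "M"}, {"gender": "M"}, {"gender": "M"}]
print(audit_group_balance(dataset, "gender"))  # M is over-represented at 75%
```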

3. Transparency

AI decisions must be explainable. Documenting data sources, processing steps, and model behavior for
accountability. Users should know how data is collected and used.

4. Governance

Implementing policies for data collection, storage, usage, and compliance with regulations (e.g., GDPR)

5. Consent

Users should give permission before their data is collected. Data must be used only for the stated purpose.

6. Security

Protecting data from hacking, leaks, misuse

7. Accountability

Organizations must be responsible for AI decisions. Proper audit trails, documentation.

III. Utility of Data in Building AI Applications

Data is the core ingredient that enables AI systems to function.

1. Training AI Models

High-quality, diverse datasets improve model accuracy and generalization.

AI learns patterns, relationships, trends from data.

Better data → Better model accuracy

2. Feature engineering:

Transforming raw data into meaningful inputs boosts model performance.
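For example, a raw timestamp is rarely useful to a model directly, but features derived from it can be; the transaction record and field names below are made up for illustration:

```python
from datetime import datetime

def extract_features(transaction):
    """Turn a raw transaction record into numeric inputs a model can learn from."""
    ts = datetime.fromisoformat(transaction["timestamp"])
    return {
        "amount": transaction["amount"],
        "hour_of_day": ts.hour,                # time-of-day pattern
        "is_weekend": int(ts.weekday() >= 5),  # weekday/weekend flag as 0/1
    }

raw = {"timestamp": "2024-06-15T14:30:00", "amount": 250.0}
print(extract_features(raw))
```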

3. Evaluation:

Validation data helps measure AI model effectiveness and avoid overfitting.

4. Deployment:

Real-world data pipelines feed live inputs to AI systems for inference.

5. Personalization

Applications recommend content/products based on user data


Examples: YouTube, Amazon, Netflix.

6. Automation

AI systems automate tasks using repetitive patterns in data.


Examples: Chatbots, image recognition, robotic process automation.

7. Prediction & Forecasting

AI can predict future outcomes from historical data.


Examples: Weather prediction, sales forecasting, stock analysis.

8. Real-Time Decision Making

Live data from sensors, IoT devices used for instant decisions.
Examples: Smart cars, smart homes, surveillance.

9. Enhancing User Experience

Voice assistants, smart appliances, health monitoring — all depend on continuous data input.

2Q. Importance of data in building AI applications: Data as the fuel for AI, role of big data in training AI models.

Data is the foundation of every Artificial Intelligence system. Without data, AI cannot learn, make decisions, or
generate useful outputs. The success of any AI model depends mainly on the quality, quantity, and relevance
of the data it is trained on.

1. Data as the Fuel for AI

Just like vehicles need fuel to run, AI systems need data to function.

- AI algorithms learn patterns from data; without good data, models can’t produce useful outputs.

- Quality data drives model accuracy, robustness, and generalization to new situations.

- Data feeds every stage of AI: training (learning parameters), validation (tuning), and inference (making
predictions).

- The “fuel” analogy means richer, relevant data powers better AI performance, just like fuel powers an
engine.

- More data = better learning


Large and clean datasets help AI models achieve higher accuracy.

- Data helps AI make decisions


Models use previous examples to predict future outcomes.

- AI does not understand anything without data


Without input data, an AI model cannot perform tasks like classification, translation, detection, or prediction.

- Continuous data improves performance


New data allows AI models to update, improve, and adapt to changing environments.

2. Role of Big Data in Training AI Models

Big Data refers to extremely large, complex datasets generated from various sources like social media, sensors,
IoT devices, transactions, GPS, etc.

- Volume: Large datasets provide more examples for models to learn complex relationships, especially deep
neural nets that thrive on massive samples.

- Variety: Diverse data types (text, images, sensor streams) enable building multimodal AI systems and
improve robustness.

- Velocity: Fast-moving data streams allow real-time model updates and adaptation to changing
environments.

- Big data requires scalable storage (data lakes, distributed file systems) and compute (GPUs, TPUs, clusters)
to process and train models efficiently.

- Techniques like data parallelism and distributed training split big datasets across multiple processors to
handle large-scale model training.

Why Big Data is important for AI?

1. Improves Accuracy

 AI models require thousands to millions of samples.
 Big data provides enough examples for the model to learn correctly.

2. Enables Deep Learning

 Deep learning (neural networks) requires massive datasets to identify complex patterns.
 Without big data, deep learning models underperform.

3. Reduces Bias

 More diverse data reduces unfairness in AI decisions.


 Example: Face recognition improves when trained on diverse image datasets.

4. Supports Real-Time AI

 Applications like fraud detection, traffic management, and smart devices depend on continuous data
streams.

5. Helps AI Handle Complex Problems

 Big data captures hidden patterns and correlations.


 Example: Disease prediction using millions of medical records.

6. Improves Personalization

 AI systems like YouTube, Netflix, and Amazon analyze big data to personalize recommendations.

3 Q. Conceptual Foundations of Data(DIK Hierarchy) Data vs. Information vs. Knowledge.

To understand AI and data science, it is important to differentiate between Data, Information, and Knowledge.
These three form a hierarchy called the DIK Hierarchy (Data → Information → Knowledge).

1. Data:

Data is raw, unorganized facts that have no meaning by themselves: unprocessed facts and figures without context. Data is the basic building block, but on its own it doesn't convey meaning. It can be numbers, symbols, images, audio, or text. It is not useful until processed.

Example: Numbers "20, 25, 30".

2. Information

Information is processed, organized, or structured data that now has meaning: data with context. When data is processed, it becomes information. Information is more useful than raw data and helps in understanding a situation.

Example: "The temperatures for the past three days were 20°C, 25°C, and 30°C".

3. Knowledge

Knowledge is interpreted information that helps in making decisions or predictions: information with insights, understanding, and application. Knowledge is the actionable takeaway from information. It helps in decision-making and is used by AI and humans to solve problems.

Example: "The temperature is rising over the past few days, indicating a heatwave".
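The temperature example can be traced through all three levels in a few lines of code (the heatwave rule here is a deliberately simple illustration):

```python
def analyze_temperatures(readings):
    # Data: raw numbers with no context, e.g. [20, 25, 30]
    # Information: the same readings given context (which day each belongs to)
    info = [f"Day {i + 1}: {t}°C" for i, t in enumerate(readings)]
    # Knowledge: an actionable interpretation of the trend
    rising = all(later > earlier for earlier, later in zip(readings, readings[1:]))
    insight = "Temperature is rising, possible heatwave" if rising else "No clear warming trend"
    return info, insight

info, insight = analyze_temperatures([20, 25, 30])
print(insight)  # Temperature is rising, possible heatwave
```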

4Q. Structure of Data: Structured, Semi-Structured, and Unstructured Data

Structure of Data

Data used in AI and data science can be classified based on how well it is organized. The three major categories
are:

1. Structured Data
2. Semi-Structured Data
3. Unstructured Data

This classification helps in choosing the right storage systems, processing methods, and AI/ML algorithms.

1. Structured Data

Structured data is highly organized and stored in a fixed format, usually in rows and columns (tables).

Characteristics

 Follows a strict schema (fixed fields)


 Easy to store, search, and analyze.
 Easily handled by SQL-based databases. Examples: relational databases (Oracle, MySQL), Excel sheets.
 Its advantages are fast processing, high accuracy, and easy querying using SQL.
 Its disadvantages are limited flexibility, cannot store complex data like images/videos.

2. Semi-Structured Data

Semi-structured data does not follow a strict table format, but contains tags, labels, or markers to maintain some organization. It is partially organized data.

Characteristics

 No fixed schema like SQL tables


 Contains metadata (tags, key–value pairs)
 More flexible than structured data. Examples: JSON, XML, HTML.

 Its advantages are Flexible structure, Easy to extend and Good for web and mobile applications

 Its disadvantages are that it is harder to analyze than structured data and requires extra processing to convert to tables.

3. Unstructured Data

Unstructured data has no predefined format or organization. It is the most common type of data used in AI, in applications like speech recognition, image classification, and chatbot training. It is free-form data (no structure).

Characteristics

 No fixed schema
 Difficult to store and process
 Requires AI/ML/CV/NLP techniques to analyze. Examples: text documents, images and videos, audio recordings, chat messages, social media posts, PDFs.
 Its advantages are captures real-world information, useful for AI tasks like vision, language processing.
 Its disadvantages are hard to process, it requires large storage and computing power.

5Q. Modalities of Data: Text, Image, Audio, Video, Tabular, Time-Series, and Spatial Data.

Modalities of Data

In AI and Machine Learning, data modalities refer to the different forms or types of data inputs that an AI
system can process.
Each modality carries unique characteristics and requires specialized processing techniques.

The major modalities are:

1. Text Data

Text data consists of written or typed words, sentences, and documents.

Examples: Emails, Chat messages, News articles, Social media posts, Documents (PDF, Word).

Text data is used in NLP, chatbots, sentiment analysis, and machine translation.

2. Image Data

Image data includes pictures or visuals captured by cameras or sensors.

Examples: Photos, Medical images (X-rays, MRI), Satellite images, Handwritten digits.

Image data is used in computer vision, face recognition, object detection and medical diagnosis.

3. Audio Data

Audio data represents sound in the form of wave signals.

Examples: Voice recordings, Music, Speech commands, Environmental sounds.

Audio data is used in speech recognition, voice assistants (Siri, Alexa), speaker identification, and sound classification.

4. Video Data

Video data is a combination of images + audio presented over time.

Examples: CCTV footage, Movies, Recorded lectures, YouTube videos.

Video data is used in action recognition, video analytics, surveillance systems and self-driving vehicles.

5. Tabular Data

Tabular data is structured data arranged in rows and columns (tables).

Examples: Excel sheets, Banking records, Employee details, Sales reports.

Tabular data is used in statistical analysis, predictive modeling, and business intelligence.

6. Time-Series Data

Time-series data is collected over time at regular intervals.

Examples: Stock prices, Temperature readings, Heartbeat signals, Website traffic logs.

Time-series data is used in forecasting, anomaly detection, and IoT applications.
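A common first step with time-series data is smoothing it with a moving average; points that sit far from the smoothed curve are anomaly candidates. The window size of 3 below is an arbitrary illustrative choice:

```python
def moving_average(series, window=3):
    """Average each point with its predecessors over a sliding window."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

readings = [20, 21, 20, 22, 45, 21]  # a sensor log with one suspicious spike
smoothed = moving_average(readings)
print(smoothed)                      # the spike at 45 pulls the later averages up
```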

7. Spatial Data

Spatial data represents location-based information on Earth.

Examples: GPS coordinates, Maps, Geographic boundary data, Satellite imagery.

Spatial data is used in GIS systems, climate modeling, navigation (Google Maps), and urban planning.

Each modality requires specific processing techniques and is used in different AI applications.

6Q. Formats of Data: Text Formats (CSV, JSON, XML), Image Formats (JPEG, GIF, PNG),
Audio/Video (MP3, WAV, MP4, AVI).

Formats of Data

Data is stored in different file formats depending on the type of content—text, images, audio, or video. Each
format uses a specific structure for encoding and storing information.

The main categories are:

1. Text Formats (CSV, JSON, XML)


2. Image Formats (JPEG, GIF, PNG)
3. Audio Formats (MP3, WAV)
4. Video Formats (MP4, AVI)

1. Text Formats

Text formats store data in human-readable or machine-readable textual form. They are widely used in data
science, databases, and APIs.

A. CSV (Comma-Separated Values)

 Stores data in rows and columns, separated by commas


 Very simple and lightweight
 Easy to load into Excel, Python, or SQL

Used for:
Tables, datasets, spreadsheets

Example:
Name,Age,Mark
Riya,20,85

B. JSON (JavaScript Object Notation)

 Stores data in key–value pairs


 Human-readable and machine-friendly
 Widely used in APIs, web applications, NoSQL databases

Used for:
API responses, configuration files, NoSQL documents

Example:
{"name": "Riya", "age": 20}

C. XML (eXtensible Markup Language)

 Structured using tags


 Similar to HTML
 Flexible for hierarchical data

Used for:
Web services, configuration files, document exchange

Example:

<student>
<name>Riya</name>
<age>20</age>
</student>
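All three text formats can be read with the Python standard library alone; the snippet below parses the same small student record in each format:

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# CSV: rows and columns separated by commas
csv_text = "name,age,mark\nRiya,20,85\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"])        # Riya

# JSON: key-value pairs
record = json.loads('{"name": "Riya", "age": 20}')
print(record["age"])          # 20

# XML: hierarchical data structured with tags
root = ET.fromstring("<student><name>Riya</name><age>20</age></student>")
print(root.find("age").text)  # 20
```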

2. Image Formats

Image formats differ based on compression, size, and quality.

A. JPEG (Joint Photographic Experts Group)

 Lossy compression (some quality lost)


 Small file size
 Best for photos and real-world images

B. GIF (Graphics Interchange Format)

 Supports animation
 Limited to 256 colors
 Used for simple graphics, memes

C. PNG (Portable Network Graphics)

 Lossless compression (no quality loss)


 Supports transparency
 Good for logos, icons, graphics

3. Audio Formats

A. MP3

 Compressed audio
 Smaller file size
 Most common format for music and speech

B. WAV

 Uncompressed, high-quality audio


 Larger file size
 Used in studios, recording, high-quality sound processing

4. Video Formats

A. MP4

 Most widely used video format


 High quality + good compression
 Supports video, audio, subtitles, images
 Used for streaming (YouTube, mobile)

B. AVI (Audio Video Interleave)

 Older format
 Larger file size
 Less compression
 Used for raw or high-quality video editing

These formats help in efficient storage, processing, and transmission of data.

7Q. Data Repositories:

Data Repositories

A data repository is a central place where data is stored, managed, and maintained.
It helps organizations keep data secure, organized, and easily accessible for analysis or future use.

Types of Data Repositories

1. Data Warehouse

 Stores large volumes of structured data.


 Designed for reporting and analysis.
 Example: Amazon Redshift, Google BigQuery.

2. Data Lake

 Stores raw data in any format (structured, semi-structured, unstructured).


 Used for big data analytics and AI/ML.
 Example: Hadoop HDFS, AWS S3.

3. Data Mart

 A smaller part of a data warehouse.


 Designed for a specific department (e.g., sales, HR).

4. Database

 Stores data in organized tables.


 Supports fast queries, daily operations.
 Example: MySQL, PostgreSQL, Oracle.

5. Data Hub

 A central point for data integration and sharing.


 Connects data from multiple systems.

6. Metadata Repository

 Stores data about data (metadata).


 Example: Column names, data types, creation date.

8Q. Definition of public Datasets; Definition of private Datasets

Public Datasets – Definition

Public datasets are freely available data collections that anyone can access, download, and use without
restrictions.
They are usually provided by: Governments, Research institutions, Universities, Open-source communities

Features of Public Datasets: Open access, Free or low cost, Useful for research, AI/ML training, and
education, Usually well-documented.

Examples: Kaggle Datasets, Government open-data portals

Private Datasets – Definition

Private datasets are data collections that are restricted and not available to the public.
Access is limited to: An organization, Authorized users. Members of a specific team or project

They usually contain confidential, sensitive, or proprietary information.

Features of Private Datasets: Restricted access, Require permissions or licenses, Often used for internal
business operations. May contain personal data (customer details, financial data, etc.)

Examples: Company sales records, Employee databases, Customer transaction data, Medical or health records.

Key differences:

- Accessibility: Public datasets are openly available, while private datasets are restricted.

- Sensitivity: Public datasets typically contain non-sensitive information, while private datasets often contain
sensitive or confidential data.

- Usage: Public datasets are often used for research, analysis, or open innovation, while private datasets are used
for internal decision-making, customer service, or business operations.

9Q. Importance of Public Datasets

Public datasets play a major role in research, learning, innovation, and building AI/ML systems. Their
importance includes:

1. Support for Research and Innovation

Public datasets help researchers test theories, build models, and compare results without needing to collect new data. They speed up scientific progress.

2. Help in Learning and Education

Students and beginners can practice data analysis, machine learning, AI, and statistics using free datasets. They make data education accessible to everyone.

3. Enable Transparency and Open Science


Public datasets allow researchers to verify, repeat, and improve scientific studies.
This increases trust and transparency in research.

4. Reduce Cost and Time

Collecting data is expensive and time-consuming.


Public datasets save time and reduce the cost of data collection.

5. Economic growth

Fostering data-driven startups, businesses, and economic development.

6. Encourage Collaboration

Open datasets allow global researchers, students, and developers to work together and share ideas easily.

7. Promote Innovation in AI/ML

Public datasets provide raw material for:

 AI model training
 Data mining projects
 Predictive analytics
 Real-world application development

Many breakthrough AI models were trained using public datasets.

8. Support Government Transparency

Government public datasets help citizens understand:

 Public spending
 Health trends
 Environment and climate data
 Crime statistics

Examples: 1. Health and medical research datasets (COVID-19 data).

2. Environmental datasets (climate, weather, and satellite data).

10Q. Popular Public Dataset Repositories

Public dataset repositories are platforms where large collections of datasets are stored, shared, and accessed for
AI, machine learning, data science, research, and education.
Below are some widely used repositories:

1. Kaggle Datasets

 One of the most popular platforms for data science and machine learning.
 Offers thousands of free datasets across categories like health, finance, images, text, climate, and
more.
 Also provides notebooks, competitions, and community support.

2. Hugging Face Datasets

 A large repository mainly focused on NLP (text), computer vision, and audio datasets.
 Provides ready-to-use datasets for training AI models.
 Integrated with libraries like Transformers and Datasets.

3. UCI Machine Learning Repository

 One of the oldest and most trusted dataset repositories.


 Contains classic datasets used for teaching, research, and algorithm benchmarking.
 Popular datasets include Iris, Wine, and Adult datasets.

4. Google Dataset Search

 A search engine that helps users discover datasets available on the internet.
 Works like “Google Search for datasets.”
 Includes datasets from universities, governments, research labs, and open-data portals.

5. Microsoft Research Open Data

 A collection of free datasets shared by Microsoft Research.


 Focuses on AI, computer vision, biology, and social sciences.

6. Data.gov (USA) / data.gov.in (India)

 Government open-data portals.


 Provide datasets related to population, health, transport, education, economy, and national statistics.

11Q. Dataset licensing and usage rights.

Dataset licensing defines how a dataset can be used, shared, modified, or distributed.
Licenses protect the rights of creators and inform users about what is allowed and what is not.

Why Dataset Licensing Is Important?

 Ensures legal and ethical data use


 Protects privacy and intellectual property
 Prevents misuse of sensitive data
 Guides researchers, students, and developers on allowed actions

Common Types of Dataset Licenses

1. Open Licenses
These licenses allow users to access, use, and share data freely with some conditions.

a) CC0 (Public Domain License)

 No restrictions
 Anyone can use the data for any purpose
 No attribution required

b) Creative Commons (CC) Licenses

These include:

 CC-BY → Use allowed, but must give credit


 CC-BY-SA → Share with same license
 CC-BY-NC → Non-commercial use only
 CC-BY-ND → No modifications allowed

Commonly used for public datasets.

2. Open Data Commons (ODC) Licenses

Designed specifically for databases:

 ODC-BY → Attribution required


 ODC-ODbL → Share-alike requirement
 PDDL → Public domain data license

3. Proprietary / Restricted Licenses

These apply to private or sensitive datasets.

Examples:

 “For internal use only”


 “Non-commercial research use only”
 “No redistribution allowed”

Often used for: Company data, Medical records, and Financial data.

4. Government Open Data Licenses

Governments release datasets with open licenses such as:

 India: Government Open Data License – India (GODL)


 US: Public Domain (US Government data)

These typically allow free use with attribution.

Usage Rights:
Always check the dataset page, download section, or terms footer for the exact license, as public availability
does not imply unrestricted use.

1. Access Rights

Who can access the dataset?

 Public users, Registered users, Authorized internal team

2. Usage Rights

What can the user do?

 View and download, Analyze and modify, Use for commercial or research purposes

3. Redistribution Rights

Can the user share the dataset?

 Allowed → Many open datasets, Restricted → Company/private datasets

4. Modification Rights

Can the dataset be changed?

 Allowed under CC-BY, Not allowed under CC-BY-ND

Key Points to Remember

 Licenses define how datasets can be used legally.


 Open licenses (CC, ODC) allow free usage with some conditions.
 Restricted licenses limit use, sharing, and modification.
 Always check dataset license before using it in projects or commercial products.

12Q. Privacy Concerns Related to Data Usage

When data is collected, stored, and used—especially personal or sensitive data—there are several important
privacy risks. These concerns arise in government systems, companies, apps, websites, and AI/ML
applications.

Below are the major privacy issues:

1. Unauthorized Access

If data is not protected properly, unauthorized people may gain access. Example: Hackers accessing customer data from a company database.

2. Data Misuse

Data collected for one purpose may be used for another purpose without user consent. Example: A mobile app using your contact list for marketing.

3. Data Breaches

Large-scale leaks of personal information due to poor security.


Example: Password leaks, credit card leaks, health data leaks.

4. Lack of User Consent

Sometimes data is collected without informing users clearly or without asking for permission.
This violates privacy rights.

5. Tracking and Surveillance

Websites, apps, and smart devices often track:

 Location, Browsing history, Personal behavior

This creates a feeling of being monitored and can be misused.

6. Re-identification Risk

Even if data is “anonymized,” it can sometimes be combined with other data sources to identify individuals
again. This is a major concern in AI and big data.

7. Sharing Data with Third Parties

Companies may share or sell user data to advertisers or other businesses, often without clear user awareness.

8. Bias and Discrimination

If sensitive data (race, gender, health information, etc.) is used improperly, AI models may produce biased or
unfair results.

9. Inadequate Data Protection

Weak security measures—like poor encryption or weak passwords—can lead to data theft or misuse.

10. Loss of Control Over Personal Data

Once data is uploaded online, stored in cloud servers, or shared with companies, users lose control over:

 How long it is stored, Who uses it, What it is used for

Sensitive areas like healthcare, finance, and biometrics require extra care.

13Q. Regulations governing data usage - GDPR, HIPAA (Overview)

Data protection laws ensure that organizations collect, store, and use data in a legal, safe, and ethical manner.
Two important regulations are GDPR and HIPAA.

1. GDPR (General Data Protection Regulation)

Region: European Union (EU)


Purpose: Protects the privacy and personal data of individuals in the EU.

Key Features of GDPR

 Consent Required: Organizations must take clear and explicit permission from users before collecting
data.
 Right to Access: Users can request a copy of their personal data.
 Right to Erasure: Users can ask to delete their data (“Right to be Forgotten”).
 Data Minimization: Only collect data that is absolutely necessary.
 Data Breach Notification: Companies must report data breaches within 72 hours.
 Strong Penalties: Violations can lead to heavy fines (up to 4% of global revenue).

What GDPR Protects

 Name, email, phone number


 Address, location
 Photos, IP addresses
 Financial and health data
 Online identifiers

2. HIPAA (Health Insurance Portability and Accountability Act)

Region: United States


Purpose: Protects medical and health information (PHI – Protected Health Information).

Key Features of HIPAA

 Privacy Rule: Defines what health data must be protected (e.g., medical records, treatments, diagnoses).
 Security Rule: Requires organizations to use safeguards (encryption, passwords, training) to protect
digital health data.
 Breach Notification Rule: If health data is leaked, organizations must inform affected individuals.
 Minimum Necessary Rule: Only the minimum required health information should be shared.

Who Must Follow HIPAA?

 Hospitals, clinics
 Doctors, nurses
 Insurance companies
 Healthcare service providers
 Third-party vendors handling medical data
Simple Difference
Regulation | Region | Focus | Protects
GDPR | European Union | Personal data privacy | All personal information
HIPAA | United States | Health/medical data | Medical records & patient information

14 Q. Ethical use of data

Ethical use of data means collecting, storing, processing, and sharing data in a fair, responsible, and
respectful manner.
The goal is to protect individuals, maintain trust, and ensure that data is used for good purposes without
causing harm.

Principles of Ethical Data Use

1. Transparency

Organizations must clearly inform users:

 What data is being collected, Why it is being collected, How it will be used

No hidden data practices.

2. Informed Consent

Data should only be collected when users give permission knowingly.


Consent must be: Clear, Voluntary, Easy to withdraw

3. Privacy and Confidentiality

Personal and sensitive data must be protected.


Data should be: Encrypted, Access-controlled, Stored securely

No unauthorized access or sharing.

4. Data Minimization

Only collect the minimum amount of data needed for the purpose. Do not gather unnecessary or unrelated information.
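Data minimization can even be enforced in code by whitelisting the fields a task is allowed to keep; the patient record and field names below are hypothetical:

```python
ALLOWED_FIELDS = {"age", "diagnosis"}  # only what the stated purpose needs

def minimize(record, allowed=ALLOWED_FIELDS):
    """Drop every field that is not required for the stated purpose."""
    return {key: value for key, value in record.items() if key in allowed}

patient = {"name": "Riya", "phone": "9999", "age": 20, "diagnosis": "flu"}
print(minimize(patient))  # {'age': 20, 'diagnosis': 'flu'}
```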

5. Fairness and Non-Discrimination

Data should not be used in ways that:

 Create bias, Discriminate against individuals or groups, Promote unfair decisions

Especially important in AI and machine learning.

6. Accuracy

Data must be correct, complete, and up to date. Incorrect data can lead to wrong conclusions or harmful decisions.

7. Accountability

Organizations and data users must take responsibility for:

 How data is handled, Mistakes or misuse, Security failures

Ethical guidelines and audits should be followed.

8. Purpose Limitation

Data collected for one purpose must not be used for another purpose without user permission.
Example:
A fitness app collecting health data should not sell it to advertisers.

9. Avoiding Harm

Data must not be used in ways that:

 Hurt individuals, Invade privacy, Cause emotional, financial, or physical harm

10. Respect for User Rights

Users must have the right to:

 Access their data, Correct errors, Delete their data, Control how it is used

15 Q. Responsible AI data practices.

Responsible AI data practices ensure that the data used to train, test, and deploy AI systems is handled in an
ethical, fair, transparent, and safe manner.
They help prevent harm, bias, privacy violations, and misuse of AI.

1. Data Quality and Accuracy

 Data used for AI must be correct, complete, and reliable.


 Poor-quality data leads to wrong predictions and bad decisions.

2. Fairness and Bias Reduction

 AI models should not discriminate based on gender, race, age, religion, etc.
 Datasets must be:
o Balanced, Representative, Regularly checked for bias

3. Privacy Protection

 Personal data must be collected legally and handled securely.


 Use techniques like:
o Data anonymization, Encryption, Differential privacy
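One such technique, pseudonymization, replaces a direct identifier with a salted hash. This is weaker than full anonymization (the same person still gets the same pseudonym, so records remain linkable), and the salt and record below are purely illustrative:

```python
import hashlib

def pseudonymize(identifier, salt="project-specific-salt"):
    """Replace a direct identifier with a truncated salted SHA-256 hash."""
    return hashlib.sha256((salt + identifier).encode()).hexdigest()[:12]

record = {"user": "riya@example.com", "diagnosis": "flu"}
record["user"] = pseudonymize(record["user"])
print(record)  # the email is gone, but the record is still usable for analysis
```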

4. Transparency and Explainability

 Users should know what data is used and how AI makes decisions.
 AI systems should be explainable, not “black boxes.”

5. Informed Consent

 Collect data only when users agree.


 Explain:
o Why data is needed, How it will be used, How long it will be stored

6. Data Minimization

 Collect only the data that is necessary for the AI task.


 Avoid collecting extra or irrelevant personal data.

7. Security and Safeguards

 Protect data from breaches, leaks, or unauthorized access.


 Use strong security practices: firewalls, access control, monitoring

8. Accountability

 Organizations must take responsibility for:


o How data is collected, How AI systems behave, How decisions impact people

Audits and documentation should be maintained.

9. Human Oversight

 AI should not operate without supervision.


o Humans must monitor decisions, especially in: Healthcare, Finance, Law enforcement.

10. Respect for User Rights

Users should have the right to: Access their data, Correct errors, Delete their data, and Opt out of AI-based
decisions.
