UNIT – 2
1Q. Foundations of Data - Types, Ethics and Utility in Building Applications using AI
Foundations of data consist of understanding data types, ethical handling of data, and using data effectively to
train AI models.
Structured, unstructured, and semi-structured data help in building AI systems.
Ethical principles like privacy, fairness, consent, and transparency ensure responsible AI.
Utility of data lies in training models, making predictions, personalization, automation, and enabling intelligent
decision-making in modern applications.
AI applications depend completely on data. The quality, type, and ethical use of data decide how good and safe
an AI system will be.
I. Data Types
Data used in AI can be broadly divided as follows:
1. Structured Data
Organized in rows/columns (databases). Easy to query and process for AI models (e.g., tabular datasets for
regression).
2. Unstructured Data
No fixed format (images, audio, text, video). Requires preprocessing (feature extraction, embeddings) for AI
use.
3. Semi-Structured Data
Not fully structured but contains tags or markers. Mix of formats (JSON, HTML, XML files). Needs parsing to
extract usable features for AI pipelines.
4. Time-Series Data: Sequential data points (e.g., sensor logs). Used in forecasting and anomaly-detection models.
5. Data Based on Origin
Primary Data: Collected first-hand (surveys, experiments, sensors).
Secondary Data: Taken from existing sources (web data, reports).
6. Data Based on Dynamics
Static Data: Does not change (historical records).
Dynamic Data: Continuously changing (real-time sensor data, stock prices).
7. Data for ML/AI Tasks
Training Data – teaches the model
Validation Data – tunes the model
Testing Data – evaluates performance
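The three-way split above can be sketched in a few lines of Python (the 70/15/15 ratio and the fixed seed are illustrative conventions, not fixed rules):

```python
import random

def split_dataset(samples, train=0.7, val=0.15, seed=42):
    """Shuffle and split samples into training, validation, and test sets."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],                 # teaches the model
            shuffled[n_train:n_train + n_val],  # tunes the model
            shuffled[n_train + n_val:])         # evaluates performance

data = list(range(100))
train_set, val_set, test_set = split_dataset(data)
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

Shuffling before splitting matters: without it, any ordering in the raw data (e.g., by class or by date) leaks into the split.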
II. Data Ethics in AI
Ethics ensures that AI systems are fair, safe, transparent, and responsible.
1. Privacy
Protecting personal data of users
Following laws like GDPR, Data Protection Acts
Using encryption, anonymization
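A common anonymization step is replacing direct identifiers with salted hashes. A minimal sketch using only the standard library (the salt value and field names are illustrative; strictly speaking this is pseudonymization, not full anonymization):

```python
import hashlib

SALT = b"example-secret-salt"  # in practice: a random secret kept out of the dataset

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with an irreversible salted hash."""
    return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()

record = {"name": "Riya", "email": "riya@example.com", "age": 20}
safe_record = {
    "user_id": pseudonymize(record["email"]),  # stable pseudonym, raw email dropped
    "age": record["age"],                      # non-identifying field kept
}
print(sorted(safe_record))  # ['age', 'user_id']
```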
2. Bias and Fairness
Unrepresentative datasets can lead to unfair AI outcomes, and data can contain human or societal biases. Mitigation involves diverse sampling and bias audits (bias-detection tools).
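A simple bias audit can compare outcome rates across groups. This sketch uses made-up toy decisions, and the 0.2 gap threshold is an assumption for illustration:

```python
from collections import defaultdict

# Toy model decisions: (group, approved) pairs -- illustrative data only
decisions = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),
    ("B", 0), ("B", 0), ("B", 1), ("B", 0),
]

totals = defaultdict(int)
positives = defaultdict(int)
for group, outcome in decisions:
    totals[group] += 1
    positives[group] += outcome

rates = {g: positives[g] / totals[g] for g in totals}
print(rates)  # {'A': 0.75, 'B': 0.25}

# Flag a large gap in positive rates (the 0.2 threshold is an assumption)
if max(rates.values()) - min(rates.values()) > 0.2:
    print("Potential bias: review sampling and audit the data")
```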
3. Transparency
AI decisions must be explainable. Documenting data sources, processing steps, and model behavior for
accountability. Users should know how data is collected and used.
4. Governance
Implementing policies for data collection, storage, usage, and compliance with regulations (e.g., GDPR)
5. Consent
Users should give permission before their data is collected. Data must be used only for the stated purpose.
6. Security
Protecting data from hacking, leaks, misuse
7. Accountability
Organizations must be responsible for AI decisions. Proper audit trails, documentation.
III. Utility of Data in Building AI Applications
Data is the core ingredient that enables AI systems to function.
1. Training AI Models
High-quality, diverse datasets improve model accuracy and generalization.
AI learns patterns, relationships, trends from data.
Better data → Better model accuracy
2. Feature engineering:
Transforming raw data into meaningful inputs boosts model performance.
3. Evaluation:
Validation data helps measure AI model effectiveness and avoid overfitting.
4. Deployment:
Real-world data pipelines feed live inputs to AI systems for inference.
5. Personalization
Applications recommend content/products based on user data
Examples: YouTube, Amazon, Netflix.
6. Automation
AI systems automate tasks using repetitive patterns in data.
Examples: Chatbots, image recognition, robotic process automation.
7. Prediction & Forecasting
AI can predict future outcomes from historical data.
Examples: Weather prediction, sales forecasting, stock analysis.
8. Real-Time Decision Making
Live data from sensors, IoT devices used for instant decisions.
Examples: Smart cars, smart homes, surveillance.
9. Enhancing User Experience
Voice assistants, smart appliances, health monitoring — all depend on continuous data input.
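The feature-engineering step (point 2 above) can be sketched as a simple min-max normalization, which rescales raw values into a 0-1 range that many models handle better (the scaling choice and sample values are illustrative):

```python
def min_max_scale(values):
    """Rescale raw numeric features to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # guard against constant features
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

raw_ages = [20, 25, 30, 45, 60]
print(min_max_scale(raw_ages))  # [0.0, 0.125, 0.25, 0.625, 1.0]
```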
2Q. Utility of data in building AI applications: Data as the fuel for AI, role of big data in training AI models.
Data is the foundation of every Artificial Intelligence system. Without data, AI cannot learn, make decisions, or
generate useful outputs. The success of any AI model depends mainly on the quality, quantity, and relevance
of the data it is trained on.
1. Data as the Fuel for AI
Just like vehicles need fuel to run, AI systems need data to function.
- AI algorithms learn patterns from data; without good data, models can’t produce useful outputs.
- Quality data drives model accuracy, robustness, and generalization to new situations.
- Data feeds every stage of AI: training (learning parameters), validation (tuning), and inference (making
predictions).
- The “fuel” analogy means richer, relevant data powers better AI performance, just like fuel powers an
engine.
- More data = better learning
Large and clean datasets help AI models achieve higher accuracy.
- Data helps AI make decisions
Models use previous examples to predict future outcomes.
- AI does not understand anything without data
Without input data, an AI model cannot perform tasks like classification, translation, detection, or prediction.
- Continuous data improves performance
New data allows AI models to update, improve, and adapt to changing environments.
2. Role of Big Data in Training AI Models
Big Data refers to extremely large, complex datasets generated from various sources like social media, sensors,
IoT devices, transactions, GPS, etc.
- Volume: Large datasets provide more examples for models to learn complex relationships, especially deep
neural nets that thrive on massive samples.
- Variety: Diverse data types (text, images, sensor streams) enable building multimodal AI systems and
improve robustness.
- Velocity: Fast-moving data streams allow real-time model updates and adaptation to changing
environments.
- Big data requires scalable storage (data lakes, distributed file systems) and compute (GPUs, TPUs, clusters)
to process and train models efficiently.
- Techniques like data parallelism and distributed training split big datasets across multiple processors to
handle large-scale model training.
Why is Big Data important for AI?
1. Improves Accuracy
AI models require thousands to millions of samples.
Big data provides enough examples for the model to learn correctly.
2. Enables Deep Learning
Deep learning (neural networks) requires massive datasets to identify complex patterns.
Without big data, deep learning models underperform.
3. Reduces Bias
More diverse data reduces unfairness in AI decisions.
Example: Face recognition improves when trained on diverse image datasets.
4. Supports Real-Time AI
Applications like fraud detection, traffic management, and smart devices depend on continuous data
streams.
5. Helps AI Handle Complex Problems
Big data captures hidden patterns and correlations.
Example: Disease prediction using millions of medical records.
6. Improves Personalization
AI systems like YouTube, Netflix, and Amazon analyze big data to personalize recommendations.
3Q. Conceptual Foundations of Data (DIK Hierarchy): Data vs. Information vs. Knowledge.
To understand AI and data science, it is important to differentiate between Data, Information, and Knowledge.
These three form a hierarchy called the DIK Hierarchy (Data → Information → Knowledge).
1. Data:
Data is raw, unorganized facts that have no meaning by themselves. Data is unprocessed facts and figures
without context. Data is the basic building block, but on its own, it doesn't convey meaning. It can be numbers,
symbols, images, audio and text. It is not useful until processed.
Example: Numbers "20, 25, 30".
2. Information
Information is processed, organized, or structured data that now has meaning. Data with context and
meaning. When data is processed, it becomes information. Information is more useful than raw data and helps
in understanding a situation.
Example: "The temperatures for the past three days were 20°C, 25°C, and 30°C".
3. Knowledge
Knowledge is interpreted information that helps in making decisions or predictions. Information with
insights, understanding, and application. Knowledge is the actionable takeaway from information. It helps in
decision-making and used by AI and humans to solve problems.
Example: "The temperature is rising over the past few days, indicating a heatwave".
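The temperature example can be traced through all three levels of the hierarchy in a few lines (the "rising" rule used to derive knowledge is an assumption for illustration):

```python
# Data: raw, unorganized facts
readings = [20, 25, 30]

# Information: the same data with context and meaning
information = [f"Day {i + 1}: {t} degrees C" for i, t in enumerate(readings)]

# Knowledge: interpreted information that supports a decision
rising = all(b > a for a, b in zip(readings, readings[1:]))
knowledge = "Temperatures are rising; possible heatwave" if rising else "No clear trend"

print(information)
print(knowledge)
```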
4Q. Structure of Data: Structured, Semi-Structured, and Unstructured Data
Structure of Data
Data used in AI and data science can be classified based on how well it is organized. The three major categories
are:
1. Structured Data
2. Semi-Structured Data
3. Unstructured Data
This classification helps in choosing the right storage systems, processing methods, and AI/ML algorithms.
1. Structured Data
Structured data is highly organized and stored in a fixed format, usually in rows and columns (tables).
Characteristics
Follows a strict schema (fixed fields)
Easy to store, search, and analyze.
Easily handled by SQL-based databases, e.g., relational databases (Oracle, MySQL) and Excel sheets.
Its advantages are fast processing, high accuracy, and easy querying using SQL.
Its disadvantages are limited flexibility and the inability to store complex data like images or videos.
2. Semi-Structured Data
Semi-structured data does not follow strict table format, but contains tags, labels, or markers to maintain
some organization. Its partially organized data.
Characteristics
No fixed schema like SQL tables
Contains metadata (tags, key–value pairs)
More flexible than structured data, e.g., JSON, XML, HTML.
Its advantages are a flexible structure, easy extensibility, and suitability for web and mobile applications.
Its disadvantages are that it is harder to analyze than structured data and requires extra processing to convert to tables.
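The "extra processing" needed to turn semi-structured records into a table can be sketched with the standard json module (field names and records are illustrative):

```python
import json

# Semi-structured input: key-value pairs with no fixed schema (toy records)
raw = '[{"name": "Riya", "age": 20}, {"name": "Arun", "age": 22, "city": "Pune"}]'

records = json.loads(raw)
columns = sorted({key for rec in records for key in rec})      # union of all keys
rows = [[rec.get(col) for col in columns] for rec in records]  # missing -> None

print(columns)  # ['age', 'city', 'name']
print(rows)     # [[20, None, 'Riya'], [22, 'Pune', 'Arun']]
```

Note how the second record has an extra "city" key: semi-structured data tolerates this, while a fixed table schema must fill the gap (here with None).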
3. Unstructured Data
Unstructured data has no predefined format or organization. It is the most common type of data used in AI, in applications like speech recognition, image classification, and chatbot training. It is free-form data (no structure).
Characteristics
No fixed schema
Difficult to store and process
Requires AI/ML/CV/NLP techniques to analyze, e.g., text documents, images and videos, audio recordings, chat messages, social media posts, PDFs.
Its advantages are that it captures real-world information and is useful for AI tasks like vision and language processing.
Its disadvantages are that it is hard to process and requires large storage and computing power.
5Q. Modalities of Data: Text, Image, Audio, Video, Tabular, Time-Series, and Spatial Data.
Modalities of Data
In AI and Machine Learning, data modalities refer to the different forms or types of data inputs that an AI
system can process.
Each modality carries unique characteristics and requires specialized processing techniques.
The major modalities are:
1. Text Data
Text data consists of written or typed words, sentences, and documents.
Examples: Emails, Chat messages, News articles, Social media posts, Documents (PDF, Word).
Text data is used in NLP, chatbots, sentiment analysis, and machine translation.
2. Image Data
Image data includes pictures or visuals captured by cameras or sensors.
Examples: Photos, Medical images (X-rays, MRI), Satellite images, Handwritten digits.
Image data is used in computer vision, face recognition, object detection and medical diagnosis.
3. Audio Data
Audio data represents sound in the form of wave signals.
Examples: Voice recordings, Music, Speech commands, Environmental sounds.
Audio data is used in speech recognition, voice assistants (Siri, Alexa), speaker identification, and sound classification.
4. Video Data
Video data is a combination of images + audio presented over time.
Examples: CCTV footage, Movies, Recorded lectures, YouTube videos.
Video data is used in action recognition, video analytics, surveillance systems and self-driving vehicles.
5. Tabular Data
Tabular data is structured data arranged in rows and columns (tables).
Examples: Excel sheets, Banking records, Employee details, Sales reports.
Tabular data is used in statistical analysis, predictive modeling, and business intelligence.
6. Time-Series Data
Time-series data is collected over time at regular intervals.
Examples: Stock prices, Temperature readings, Heartbeat signals, Website traffic logs.
Time-series data is used in forecasting, anomaly detection, and IoT applications.
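A minimal time-series sketch: a moving average that smooths readings taken at regular intervals (the window size of 3 and the sample readings are illustrative):

```python
def moving_average(series, window=3):
    """Average each consecutive window of readings to smooth short-term noise."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

temperatures = [20, 22, 21, 25, 30, 28]   # hourly sensor readings (toy data)
smoothed = moving_average(temperatures)
print([round(v, 2) for v in smoothed])  # [21.0, 22.67, 25.33, 27.67]
```

The same sliding-window idea underlies many forecasting and anomaly-detection baselines: a reading far from the local average is a candidate anomaly.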
7. Spatial Data
Spatial data represents location-based information on Earth.
Examples: GPS coordinates, Maps, Geographic boundary data, Satellite imagery.
Spatial data is used in GIS systems, climate modeling, navigation (Google Maps), and urban planning.
Each modality requires specific processing techniques and is used in different AI applications.
6Q. Formats of Data: Text Formats (CSV, JSON, XML), Image Formats (JPEG, GIF, PNG),
Audio/Video (MP3, WAV, MP4, AVI).
Formats of Data
Data is stored in different file formats depending on the type of content—text, images, audio, or video. Each
format uses a specific structure for encoding and storing information.
The main categories are:
1. Text Formats (CSV, JSON, XML)
2. Image Formats (JPEG, GIF, PNG)
3. Audio Formats (MP3, WAV)
4. Video Formats (MP4, AVI)
1. Text Formats
Text formats store data in human-readable or machine-readable textual form. They are widely used in data
science, databases, and APIs.
A. CSV (Comma-Separated Values)
Stores data in rows and columns, separated by commas
Very simple and lightweight
Easy to load into Excel, Python, or SQL
Used for:
Tables, datasets, spreadsheets
Example:
Name, Age, Marks
Riya, 20, 85
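A CSV like this can be read with the standard library's csv module (the rows here are illustrative):

```python
import csv
import io

# Minimal CSV text: a header line followed by data rows (toy data)
text = "Name,Age,Marks\nRiya,20,85\nArun,22,90\n"

reader = csv.DictReader(io.StringIO(text))  # first line supplies the column names
rows = list(reader)
print(rows[0]["Name"], rows[0]["Marks"])  # Riya 85
```

Note that csv reads every field as a string; converting "20" to an integer is a separate step.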
B. JSON (JavaScript Object Notation)
Stores data in key–value pairs
Human-readable and machine-friendly
Widely used in APIs, web applications, NoSQL databases
Used for:
API responses, configuration files, NoSQL documents
Example:
{"name": "Riya", "age": 20}
C. XML (eXtensible Markup Language)
Structured using tags
Similar to HTML
Flexible for hierarchical data
Used for:
Web services, configuration files, document exchange
Example:
<student>
<name>Riya</name>
<age>20</age>
</student>
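The student example above can be parsed with the standard library's ElementTree module:

```python
import xml.etree.ElementTree as ET

xml_text = "<student><name>Riya</name><age>20</age></student>"

root = ET.fromstring(xml_text)                       # parse the tagged structure
student = {child.tag: child.text for child in root}  # tags become keys
print(student)  # {'name': 'Riya', 'age': '20'}
```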
2. Image Formats
Image formats differ based on compression, size, and quality.
A. JPEG (Joint Photographic Experts Group)
Lossy compression (some quality lost)
Small file size
Best for photos and real-world images
B. GIF (Graphics Interchange Format)
Supports animation
Limited to 256 colors
Used for simple graphics, memes
C. PNG (Portable Network Graphics)
Lossless compression (no quality loss)
Supports transparency
Good for logos, icons, graphics
3. Audio Formats
A. MP3
Compressed audio
Smaller file size
Most common format for music and speech
B. WAV
Uncompressed, high-quality audio
Larger file size
Used in studios, recording, high-quality sound processing
4. Video Formats
A. MP4
Most widely used video format
High quality + good compression
Supports video, audio, subtitles, images
Used for streaming (YouTube, mobile)
B. AVI (Audio Video Interleave)
Older format
Larger file size
Less compression
Used for raw or high-quality video editing
These formats help in efficient storage, processing, and transmission of data.
7Q. Data Repositories:
Data Repositories
A data repository is a central place where data is stored, managed, and maintained.
It helps organizations keep data secure, organized, and easily accessible for analysis or future use.
Types of Data Repositories
1. Data Warehouse
Stores large volumes of structured data.
Designed for reporting and analysis.
Example: Amazon Redshift, Google BigQuery.
2. Data Lake
Stores raw data in any format (structured, semi-structured, unstructured).
Used for big data analytics and AI/ML.
Example: Hadoop HDFS, AWS S3.
3. Data Mart
A smaller part of a data warehouse.
Designed for a specific department (e.g., sales, HR).
4. Database
Stores data in organized tables.
Supports fast queries, daily operations.
Example: MySQL, PostgreSQL, Oracle.
5. Data Hub
A central point for data integration and sharing.
Connects data from multiple systems.
6. Metadata Repository
Stores data about data (metadata).
Example: Column names, data types, creation date.
8Q. Definition of public Datasets; Definition of private Datasets
Public Datasets – Definition
Public datasets are freely available data collections that anyone can access, download, and use without
restrictions.
They are usually provided by: Governments, Research institutions, Universities, Open-source communities
Features of Public Datasets: Open access, Free or low cost, Useful for research, AI/ML training, and
education, Usually well-documented.
Examples: Kaggle Datasets, Government open-data portals
Private Datasets – Definition
Private datasets are data collections that are restricted and not available to the public.
Access is limited to: An organization, Authorized users. Members of a specific team or project
They usually contain confidential, sensitive, or proprietary information.
Features of Private Datasets: Restricted access, Require permissions or licenses, Often used for internal
business operations. May contain personal data (customer details, financial data, etc.)
Examples: Company sales records, Employee databases, Customer transaction data, Medical or health records.
Key differences:
- Accessibility: Public datasets are openly available, while private datasets are restricted.
- Sensitivity: Public datasets typically contain non-sensitive information, while private datasets often contain
sensitive or confidential data.
- Usage: Public datasets are often used for research, analysis, or open innovation, while private datasets are used
for internal decision-making, customer service, or business operations.
9Q. Importance of Public Datasets
Public datasets play a major role in research, learning, innovation, and building AI/ML systems. Their
importance includes:
1. Support for Research and Innovation
Public datasets help researchers test theories, build models, and compare results without needing to collect new data. They speed up scientific progress.
2. Help in Learning and Education
Students and beginners can practice data analysis, machine learning, AI, and statistics using free datasets. They make data education accessible to everyone.
3. Enable Transparency and Open Science
Public datasets allow researchers to verify, repeat, and improve scientific studies.
This increases trust and transparency in research.
4. Reduce Cost and Time
Collecting data is expensive and time-consuming.
Public datasets save time and reduce the cost of data collection.
5. Economic growth
Fostering data-driven startups, businesses, and economic development.
6. Encourage Collaboration
Open datasets allow global researchers, students, and developers to work together and share ideas easily.
7. Promote Innovation in AI/ML
Public datasets provide raw material for:
AI model training
Data mining projects
Predictive analytics
Real-world application development
Many breakthrough AI models were trained using public datasets.
8. Support Government Transparency
Government public datasets help citizens understand:
Public spending
Health trends
Environment and climate data
Crime statistics
Examples: 1. Health and medical research datasets (COVID-19 data).
2. Environmental datasets (climate, weather, and satellite data).
10Q. Popular Public Dataset Repositories
Public dataset repositories are platforms where large collections of datasets are stored, shared, and accessed for
AI, machine learning, data science, research, and education.
Below are some widely used repositories:
1. Kaggle Datasets
One of the most popular platforms for data science and machine learning.
Offers thousands of free datasets across categories like health, finance, images, text, climate, and
more.
Also provides notebooks, competitions, and community support.
2. Hugging Face Datasets
A large repository mainly focused on NLP (text), computer vision, and audio datasets.
Provides ready-to-use datasets for training AI models.
Integrated with libraries like Transformers and Datasets.
3. UCI Machine Learning Repository
One of the oldest and most trusted dataset repositories.
Contains classic datasets used for teaching, research, and algorithm benchmarking.
Popular datasets include Iris, Wine, and Adult datasets.
4. Google Dataset Search
A search engine that helps users discover datasets available on the internet.
Works like “Google Search for datasets.”
Includes datasets from universities, governments, research labs, and open-data portals.
5. Microsoft Research Open Data
A collection of free datasets shared by Microsoft Research.
Focuses on AI, computer vision, biology, and social sciences.
6. data.gov (USA) / data.gov.in (India)
Government open-data portals.
Provide datasets related to population, health, transport, education, economy, and national statistics.
11Q. Dataset licensing and usage rights.
Dataset licensing defines how a dataset can be used, shared, modified, or distributed.
Licenses protect the rights of creators and inform users about what is allowed and what is not.
Why Dataset Licensing Is Important?
Ensures legal and ethical data use
Protects privacy and intellectual property
Prevents misuse of sensitive data
Guides researchers, students, and developers on allowed actions
Common Types of Dataset Licenses
1. Open Licenses
These licenses allow users to access, use, and share data freely with some conditions.
a) CC0 (Public Domain License)
No restrictions
Anyone can use the data for any purpose
No attribution required
b) Creative Commons (CC) Licenses
These include:
CC-BY → Use allowed, but must give credit
CC-BY-SA → Share with same license
CC-BY-NC → Non-commercial use only
CC-BY-ND → No modifications allowed
Commonly used for public datasets.
2. Open Data Commons (ODC) Licenses
Designed specifically for databases:
ODC-BY → Attribution required
ODC-ODbL → Share-alike requirement
PDDL → Public domain data license
3. Proprietary / Restricted Licenses
These apply to private or sensitive datasets.
Examples:
“For internal use only”
“Non-commercial research use only”
“No redistribution allowed”
Often used for: Company data, Medical records, and Financial data.
4. Government Open Data Licenses
Governments release datasets with open licenses such as:
India: Government Open Data License – India (GODL)
US: Public Domain (US Government data)
These typically allow free use with attribution.
Usage Rights:
Always check the dataset page, download section, or terms footer for the exact license, as public availability
does not imply unrestricted use.
1. Access Rights
Who can access the dataset?
Public users, Registered users, Authorized internal team
2. Usage Rights
What can the user do?
View and download, Analyze and modify, Use for commercial or research purposes
3. Redistribution Rights
Can the user share the dataset?
Allowed → Many open datasets, Restricted → Company/private datasets
4. Modification Rights
Can the dataset be changed?
Allowed under CC-BY, Not allowed under CC-BY-ND
Key Points to Remember
Licenses define how datasets can be used legally.
Open licenses (CC, ODC) allow free usage with some conditions.
Restricted licenses limit use, sharing, and modification.
Always check dataset license before using it in projects or commercial products.
12Q. Privacy Concerns Related to Data Usage
When data is collected, stored, and used—especially personal or sensitive data—there are several important
privacy risks. These concerns arise in government systems, companies, apps, websites, and AI/ML
applications.
Below are the major privacy issues:
1. Unauthorized Access
If data is not protected properly, unauthorized people may gain access. Example: Hackers accessing customer data from a company database.
2. Data Misuse
Data collected for one purpose may be used for another purpose without user consent. Example: A mobile app using your contact list for marketing.
3. Data Breaches
Large-scale leaks of personal information due to poor security.
Example: Password leaks, credit card leaks, health data leaks.
4. Lack of User Consent
Sometimes data is collected without informing users clearly or without asking for permission.
This violates privacy rights.
5. Tracking and Surveillance
Websites, apps, and smart devices often track:
Location, Browsing history, Personal behavior
This creates a feeling of being monitored and can be misused.
6. Re-identification Risk
Even if data is “anonymized,” it can sometimes be combined with other data sources to identify individuals
again. This is a major concern in AI and big data.
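Re-identification risk is often measured with k-anonymity: every combination of quasi-identifiers (age range, postal prefix, etc.) should appear at least k times in the released data. A sketch with toy records (the fields and the k=2 requirement are assumptions for illustration):

```python
from collections import Counter

# Toy "anonymized" records: (age_range, zip_prefix) are quasi-identifiers
records = [
    ("20-29", "5600"), ("20-29", "5600"),
    ("30-39", "5601"), ("30-39", "5601"),
    ("40-49", "5602"),                      # unique combination -> risky
]

counts = Counter(records)
k = min(counts.values())
print("k-anonymity level:", k)  # 1: the unique record could be re-identified

risky = [combo for combo, n in counts.items() if n < 2]
print("Combinations below k=2:", risky)
```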
7. Sharing Data with Third Parties
Companies may share or sell user data to advertisers or other businesses, often without clear user awareness.
8. Bias and Discrimination
If sensitive data (race, gender, health information, etc.) is used improperly, AI models may produce biased or
unfair results.
9. Inadequate Data Protection
Weak security measures—like poor encryption or weak passwords—can lead to data theft or misuse.
10. Loss of Control Over Personal Data
Once data is uploaded online, stored in cloud servers, or shared with companies, users lose control over:
How long it is stored, Who uses it, What it is used for
Sensitive areas like healthcare, finance, and biometrics require extra care.
13Q. Regulations governing data usage - GDPR, HIPAA (Overview)
Data protection laws ensure that organizations collect, store, and use data in a legal, safe, and ethical manner.
Two important regulations are GDPR and HIPAA.
1. GDPR (General Data Protection Regulation)
Region: European Union (EU)
Purpose: Protects the privacy and personal data of individuals in the EU.
Key Features of GDPR
Consent Required: Organizations must take clear and explicit permission from users before collecting
data.
Right to Access: Users can request a copy of their personal data.
Right to Erasure: Users can ask to delete their data (“Right to be Forgotten”).
Data Minimization: Only collect data that is absolutely necessary.
Data Breach Notification: Companies must report data breaches within 72 hours.
Strong Penalties: Violations can lead to heavy fines (up to 4% of global revenue).
What GDPR Protects
Name, email, phone number
Address, location
Photos, IP addresses
Financial and health data
Online identifiers
2. HIPAA (Health Insurance Portability and Accountability Act)
Region: United States
Purpose: Protects medical and health information (PHI – Protected Health Information).
Key Features of HIPAA
Privacy Rule: Defines what health data must be protected (e.g., medical records, treatments, diagnoses).
Security Rule: Requires organizations to use safeguards (encryption, passwords, training) to protect
digital health data.
Breach Notification Rule: If health data is leaked, organizations must inform affected individuals.
Minimum Necessary Rule: Only the minimum required health information should be shared.
Who Must Follow HIPAA?
Hospitals, clinics
Doctors, nurses
Insurance companies
Healthcare service providers
Third-party vendors handling medical data
Simple Difference
Regulation: GDPR | Region: European Union | Focus: Personal data privacy | Protects: All personal information
Regulation: HIPAA | Region: United States | Focus: Health/medical data | Protects: Medical records & patient information
14 Q. Ethical use of data
Ethical use of data means collecting, storing, processing, and sharing data in a fair, responsible, and
respectful manner.
The goal is to protect individuals, maintain trust, and ensure that data is used for good purposes without
causing harm.
Principles of Ethical Data Use
1. Transparency
Organizations must clearly inform users:
What data is being collected, Why it is being collected, How it will be used
No hidden data practices.
2. Informed Consent
Data should only be collected when users give permission knowingly.
Consent must be: Clear, Voluntary, Easy to withdraw
3. Privacy and Confidentiality
Personal and sensitive data must be protected.
Data should be: Encrypted, Access-controlled, Stored securely
No unauthorized access or sharing.
4. Data Minimization
Only collect the minimum amount of data needed for the purpose. Do not gather unnecessary or unrelated information.
5. Fairness and Non-Discrimination
Data should not be used in ways that:
Create bias, Discriminate against individuals or groups, Promote unfair decisions
Especially important in AI and machine learning.
6. Accuracy
Data must be correct, complete, and up to date. Inaccurate data can lead to wrong conclusions or harmful decisions.
7. Accountability
Organizations and data users must take responsibility for:
How data is handled, Mistakes or misuse, Security failures
Ethical guidelines and audits should be followed.
8. Purpose Limitation
Data collected for one purpose must not be used for another purpose without user permission.
Example:
A fitness app collecting health data should not sell it to advertisers.
9. Avoiding Harm
Data must not be used in ways that:
Hurt individuals, Invade privacy, Cause emotional, financial, or physical harm
10. Respect for User Rights
Users must have the right to:
Access their data, Correct errors, Delete their data, Control how it is used
15 Q. Responsible AI data practices.
Responsible AI data practices ensure that the data used to train, test, and deploy AI systems is handled in an
ethical, fair, transparent, and safe manner.
They help prevent harm, bias, privacy violations, and misuse of AI.
1. Data Quality and Accuracy
Data used for AI must be correct, complete, and reliable.
Poor-quality data leads to wrong predictions and bad decisions.
2. Fairness and Bias Reduction
AI models should not discriminate based on gender, race, age, religion, etc.
Datasets must be:
o Balanced, Representative, Regularly checked for bias
3. Privacy Protection
Personal data must be collected legally and handled securely.
Use techniques like:
o Data anonymization, Encryption, Differential privacy
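Differential privacy is commonly implemented by adding Laplace noise to aggregate statistics before release. A minimal sketch using only the standard library (the epsilon, sensitivity, and seed values are illustrative):

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution (inverse-CDF method)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon=0.5, sensitivity=1.0, seed=7):
    """Release a count with Laplace noise calibrated to epsilon and sensitivity."""
    rng = random.Random(seed)
    return true_count + laplace_noise(sensitivity / epsilon, rng)

print(round(private_count(1000), 2))  # a noisy value near 1000
```

Smaller epsilon means more noise and stronger privacy; the scale of the noise is sensitivity/epsilon, so the released count is useful in aggregate but hides any one individual's contribution.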
4. Transparency and Explainability
Users should know what data is used and how AI makes decisions.
AI systems should be explainable, not “black boxes.”
5. Informed Consent
Collect data only when users agree.
Explain:
o Why data is needed, How it will be used, How long it will be stored
6. Data Minimization
Collect only the data that is necessary for the AI task.
Avoid collecting extra or irrelevant personal data.
7. Security and Safeguards
Protect data from breaches, leaks, or unauthorized access.
o Use strong security practices: Firewalls, Access control, Monitoring
8. Accountability
Organizations must take responsibility for:
o How data is collected, How AI systems behave, How decisions impact people
Audits and documentation should be maintained.
9. Human Oversight
AI should not operate without supervision.
o Humans must monitor decisions, especially in: Healthcare, Finance, Law enforcement.
10. Respect for User Rights
Users should have the right to: Access their data, Correct errors, Delete their data, and Opt out of AI-based
decisions.