
PG Diploma in Data Science Curriculum

The document outlines the curriculum for a Post Graduate Diploma in Data Science program spread over 4 semesters. The first two semesters cover the basics of statistics, data structures, algorithms, R and Python programming, data warehousing, data mining and big data. Semester 3 includes courses on NoSQL databases, data visualization, and machine learning with R and Python. Semester 4 covers emerging trends such as deep learning, AI, and business intelligence, along with a capstone project. Students are required to complete submissions and a final project to demonstrate their learning.

Uploaded by

yash borkar

POST GRADUATE DIPLOMA IN DATA SCIENCE (PGDDS)

PROGRAMME CURRICULUM

Semester I

Basics of Statistics
1. Basics of Statistics
2. Data Collection and Measurement
3. Data Presentation
4. Data Processing and Analysis
5. Measures of Central Tendency (Mean, Median and Mode)
6. Measures of Dispersion
7. Correlation

Introduction to Data Science
1. Basics of Data
2. Basics of Data Science
3. Big Data, Datafication & its Impact on Data Science
4. Data Science Pipeline, EDA & Data Preparation
5. Data Scientist Toolbox, Applications & Case Studies

Data Structures and Algorithms
1. Programming Fundamentals
2. Control Flow
3. Arrays and Pointers
4. Functions
6. Stacks and Queues
7. Linked Lists
8. Trees
9. Searching Algorithms
10. Sorting Algorithms
11. Graphs

Introduction to R Programming
1. Introduction to R
2. Data Types and Data Structures
3. Loops and Functions in R
4. Mathematics in R
5. Graphs
6. String Manipulation and Input/Output
7. Object Oriented Programming – I
8. Object Oriented Programming – II
9. Debugging and Condition Handling
10. Introduction to Parallel Computing in R

Semester II

Big Data with Data Warehousing and Data Mining
1. Fundamentals of Data Warehouse
2. Architecture of Data Warehouse
3. Dimensional Modelling
4. ETL and OLAP
5. Introduction to Data Mining
6. Data Mining Techniques
7. Applications of Data Mining
8. Introduction to Big Data
9. Hadoop Ecosystem
10. Querying Big Data with Hive

Advanced Statistics
1. Sampling and Sampling Techniques
2. Probability
3. Normal Distribution
4. Linear Regression
5. Multiple Linear Regression
6. Random Variables

Python Programming
1. Introduction to Python
2. Variables, Expressions and Statements
3. Control Structures; Data Structures – Arrays, Linked Lists and Queues
4. Functions
5. Conditionals, Recursion and Iteration
6. Strings
7. Lists and Tuples
8. Dictionaries
9. Object Oriented Programming
11. Files and Error Handling
12. Testing, Debugging and Profiling
13. Handling Data with Python
14. Python Graphical User Interface Development

Submission I
In Semester II, students are required to complete a submission as per the guidelines given by SCDL.


Semester III

NoSQL Databases
1. Introduction to NoSQL
2. Basics of NoSQL
3. Replication and Sharding
4. Key-Value Databases
5. Document Databases
6. Column-Oriented Databases
7. Graph Databases
8. Advanced NoSQL

Ethical and Legal Issues in Data Science
1. What Are Ethics?
2. Some Ethical Concerns of Data Science
3. History, Concept of Informed Consent
4. Data Ownership
5. Privacy, Anonymity, Data Validity
6. Algorithmic Fairness
7. Societal Consequences
8. Code of Ethics

Data Visualisation
1. Introduction to Data Visualisation
2. Visualisation of Numerical Data
3. Visualisation of Non-numerical Data
4. Common Visualisation Idioms
5. Visualisation of Spatial Data, Networks and Trees
6. Data Reduction
7. Introduction to Tableau
8. Data Visualisation with SPSS

Machine Learning with R and Python
1. Basics of Machine Learning
2. Supervised Machine Learning
3. Unsupervised Learning
4. Regression Algorithms
5. Clustering Models
6. R Markdown, knitr, RPubs
7. ggplot2
8. Computation with Python – NumPy, SciPy
9. Pandas
10. Aggregating and Analysing Data with dplyr
11. Data Visualisation in Python – Matplotlib
12. Introduction to scikit-learn
13. Web Scraping in Python – Beautiful Soup
14. Introduction to (Py)Spark

Semester IV

Emerging Trends in Data Science
1. Big Data
2. Apache Spark and Scala
3. Deep Learning
4. Artificial Intelligence
5. Business Intelligence
6. Natural Language Processing
7. Data Analytics
8. Web Analytics
9. Case Study

Submission II
In Semester IV, students are required to complete a submission as per the guidelines given by SCDL.

Project
Students should choose a technical or techno-business topic of their interest and develop the project based on the provided guidelines.

Common questions


Deep learning is distinguished from traditional machine learning by its capability to automatically learn representations from raw data through layered neural networks. Unlike traditional machine learning, which often requires manual feature extraction, deep learning models, such as convolutional neural networks (CNNs), can learn from raw data like images and text without extensive preprocessing. This ability makes deep learning particularly useful in data science for tasks requiring high-dimensional data processing, such as image recognition and natural language processing, where traditional methods might falter.
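The idea of "learning representations from raw data" can be sketched with a toy two-layer network that learns XOR directly from raw 0/1 inputs, with no hand-engineered features. This is only an illustration (real deep networks stack many such layers); the network sizes and learning rate below are arbitrary choices, not part of the curriculum.

```python
import numpy as np

# Toy sketch: a tiny two-layer network learns XOR from raw inputs.
# The hidden layer is the "learned representation" described above.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0.0, 1.0, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 1.0, (8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1 + b1)   # learned intermediate representation
    return h, sigmoid(h @ W2 + b2)

init_loss = float(((forward(X)[1] - y) ** 2).mean())

lr = 1.0
for _ in range(5000):
    h, out = forward(X)
    grad_out = (out - y) * out * (1 - out)    # backprop: output layer
    grad_h = (grad_out @ W2.T) * h * (1 - h)  # backprop: hidden layer
    W2 -= lr * h.T @ grad_out; b2 -= lr * grad_out.sum(axis=0)
    W1 -= lr * X.T @ grad_h;   b1 -= lr * grad_h.sum(axis=0)

final_loss = float(((forward(X)[1] - y) ** 2).mean())
print(init_loss, "->", final_loss)
```

No single feature of the raw input predicts XOR on its own, which is exactly why a manual-feature approach struggles here while the hidden layer can adapt.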

SQL query integration with big data tools like Hive facilitates advanced analytics by providing a familiar, powerful querying language that can handle complex queries across large datasets stored in a distributed system. Hive translates SQL-like queries into MapReduce jobs executed across Hadoop clusters, enabling efficient processing of big data. This integration allows data scientists to leverage their existing SQL skills to perform complex analytics tasks, such as statistical analysis and data transformations, without handling the intricacies of distributed computing.
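The kind of SQL-style analytics described above can be sketched locally. Since a short example cannot run a Hadoop cluster, Python's built-in sqlite3 stands in here; on Hive, an almost identical HiveQL query would be compiled into MapReduce jobs over data in HDFS. The `sales` table and its rows are made up for illustration.

```python
import sqlite3

# Local stand-in for the SQL-like querying Hive provides at cluster scale.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 250.0), ("west", 80.0), ("west", 120.0)],
)

# Aggregate revenue per region -- the kind of query a data scientist
# could express in HiveQL against a large distributed table.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY REGION ORDER BY total DESC".replace("REGION", "region")
).fetchall()
print(rows)  # [('east', 350.0), ('west', 200.0)]
conn.close()
```

The point of the integration is that this query text barely changes between a laptop database and a Hadoop cluster; only the execution engine underneath does.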

Python facilitates data handling and manipulation through its comprehensive libraries such as Pandas and NumPy. Pandas provides data structures like DataFrames that are designed for quick data manipulation tasks, allowing easy data cleanup and transformation, which are essential steps in data analysis. NumPy provides powerful multi-dimensional array objects and a suite of functions for performing operations like statistical calculations and linear algebra. These libraries together make Python a powerful tool for data analysis by enabling efficient handling of complex data structures.
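A minimal sketch of that Pandas/NumPy workflow: a DataFrame for cleanup and transformation, and NumPy for the array math. The student/score data is invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value to clean.
df = pd.DataFrame({
    "student": ["a", "b", "c", "d"],
    "score": [72.0, np.nan, 88.0, 64.0],
})

# Cleanup: fill the missing score with the column mean (Pandas).
df["score"] = df["score"].fillna(df["score"].mean())

# Transformation: standardise scores into z-scores (NumPy).
scores = df["score"].to_numpy()
df["z"] = (scores - scores.mean()) / scores.std()

print(df)
```

The same two-step pattern (clean in Pandas, compute in NumPy) scales from toy frames like this one to real datasets.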

The societal consequences of data science practices include privacy invasions, algorithmic bias, and the erosion of anonymity, leading to unequal societal impacts. These can be mitigated by implementing robust ethical guidelines and regulations that prioritize privacy, transparency, and fairness. Data scientists must engage in responsible data governance, inclusive algorithm design, and maintain transparency in model decision-making processes to ensure societal benefits are distributed fairly without exacerbating existing inequalities.

Algorithmic fairness is critically important in data science due to its impact on societal values and the equitable distribution of benefits and risks of data-driven decisions. As data science increasingly affects areas like healthcare, finance, and criminal justice, ensuring algorithms do not inadvertently perpetuate bias or discrimination is essential. Ethical concerns arise when biased data or flawed model objectives lead to unfair outcomes, which can exacerbate existing social inequalities. Addressing algorithmic fairness involves designing and deploying models that are transparent, inclusive, and continuously assessed for biased behavior.

Measures of central tendency, which include mean, median, and mode, are central to data processing and analysis as they provide a summary statistic that represents the center point of a dataset. In data science, these measures are utilized to derive insights about the data's distribution and the typical case within a data set. Utilizing these alongside data processing techniques allows data scientists to understand patterns and outliers, making these measures fundamental in exploratory data analysis (EDA).
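The three measures named above can be illustrated with Python's standard library; the response-time numbers are invented for the example, with one outlier included to show why the measures differ.

```python
from statistics import mean, median, mode

# Hypothetical response times, with 40 as a deliberate outlier.
response_times = [12, 15, 15, 18, 21, 15, 40]

print(mean(response_times))    # arithmetic mean, pulled up by the outlier
print(median(response_times))  # middle value, robust to the outlier
print(mode(response_times))    # most frequent value
```

Comparing the three on the same data is a quick EDA habit: a mean well above the median, as here, already hints at outliers or right-skew.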

Data visualization techniques enhance decision-making in business intelligence by transforming complex data sets into clear, visual formats such as graphs, charts, and maps. This simplification allows decision-makers to easily discern patterns, trends, and outliers, which are crucial for strategic planning and operational efficiency. By using tools like Tableau or SPSS for visual representation, businesses can quickly interpret data-driven insights and make informed decisions to gain competitive advantages.
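In Python the same idea can be sketched with Matplotlib (covered later in the curriculum); the quarterly revenue figures below are invented, and the chart is the kind of summary a BI tool such as Tableau would present interactively.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. on a server
import matplotlib.pyplot as plt

# Hypothetical quarterly revenue for a bar-chart summary.
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 128, 160]

fig, ax = plt.subplots()
ax.bar(quarters, revenue, color="steelblue")
ax.set_title("Revenue by quarter")
ax.set_ylabel("Revenue")
fig.savefig("revenue_by_quarter.png")
```

A decision-maker reads the Q4 jump off the chart at a glance, which is precisely the simplification the paragraph describes.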

The Hadoop ecosystem plays a crucial role in managing big data by providing a scalable and flexible framework for storing and processing vast amounts of data across distributed systems. Its distributed file system enables high throughput access to application data, while its ecosystem components, such as MapReduce and HDFS, support data mining techniques by allowing the processing of large datasets efficiently. For example, querying big data with Hive integrates seamlessly with these processes by providing a SQL-like interface to handle data mining tasks.
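The MapReduce pattern that Hadoop executes at cluster scale can be simulated in a few lines of plain Python: map each record to key/value pairs, shuffle (group) by key, then reduce each group. This is a single-machine sketch of the idea, not Hadoop's actual API; the two sample documents are invented.

```python
from collections import defaultdict
from itertools import chain

# Toy single-machine simulation of the MapReduce pattern.
documents = [
    "big data needs distributed processing",
    "hadoop stores big data in hdfs",
]

def map_phase(doc):
    # Map: emit (word, 1) for every word in a record.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

mapped = list(chain.from_iterable(map_phase(d) for d in documents))
counts = reduce_phase(shuffle(mapped))
print(counts["big"], counts["data"])  # 2 2
```

On a real cluster the map and reduce functions run in parallel on the nodes holding the data in HDFS; the program structure, however, is the same.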

The data science pipeline integrates exploratory data analysis (EDA) by positioning it as a preliminary step that informs the subsequent phases of modeling and evaluation. EDA is critical in a data science pipeline because it involves initial data processing and transformation, which helps in understanding data patterns and formulating hypotheses that guide model building. This integration is significant as it ensures models are built on a solid understanding of data, reducing the risk of biased or inaccurate models.
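A minimal sketch of EDA feeding the rest of a pipeline: summary statistics screen the raw data first, and only the vetted rows reach the modelling stage. The sensor-style readings and the two-standard-deviation rule are illustrative assumptions, not a prescribed method.

```python
from statistics import mean, stdev

# Hypothetical raw readings, with one suspicious value planted.
raw = [4.1, 3.9, 4.3, 25.0, 4.0, 3.8]

# EDA step: summary statistics expose the outlier before any model sees it.
m, s = mean(raw), stdev(raw)

# Preparation step informed by the EDA: drop points far from the mean.
clean = [x for x in raw if abs(x - m) <= 2 * s]

print(len(raw), "->", len(clean))
```

Had the 25.0 reading gone straight into a model, it would have skewed the fit; catching it in the EDA step is exactly the risk reduction the paragraph describes.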

The advantages of using object-oriented programming (OOP) in R for data science include modularity, reusability, and scalability of code. OOP allows data scientists to organize code into objects that encapsulate data and associated operations, facilitating easier maintenance and modification. This approach also supports code reusability across different data analysis tasks, improving efficiency and reducing redundancy. Furthermore, OOP in R enhances scalability by allowing data scientists to build complex data structures tailored for specific tasks, promoting efficient data handling and analysis.
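The encapsulation argument can be sketched in Python (used here to keep all examples in one language); the same pattern applies to R's class systems. The `Dataset` class below is a hypothetical name for illustration, not part of the curriculum.

```python
class Dataset:
    """Encapsulates values together with the operations on them."""

    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        return sum(self.values) / len(self.values)

    def scaled(self, factor):
        # Returning a new object keeps transformations composable
        # and reusable across analysis tasks.
        return Dataset(v * factor for v in self.values)

d = Dataset([2, 4, 6])
print(d.mean())             # 4.0
print(d.scaled(10).mean())  # 40.0
```

Because the data and its operations live in one object, callers never touch the raw list directly, which is the maintainability and reuse benefit the paragraph claims.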
