Python for Data Analysis - Complete Notes
1. Introduction to Python for Data Analysis
Python is a high-level, general-purpose programming language that is well suited to data analysis because of
its readable syntax and its ecosystem of libraries. It supports tasks including data cleaning, transformation,
statistical modeling, and visualization.
2. NumPy - Numerical Python
NumPy provides efficient array structures and mathematical functions.
Key Features:
- ndarray: Multidimensional array object
- Broadcasting: Arithmetic operations on arrays of different shapes
- Mathematical functions: mean, std, dot, etc.
Example:
import numpy as np
arr = np.array([[1, 2], [3, 4]])
print(np.mean(arr)) # Output: 2.5
print(arr.shape) # Output: (2, 2)
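Broadcasting and dot from the feature list above can be sketched in the same way. This is a minimal illustration; the array values are made up for the example.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])
row = np.array([10, 20])

# Shapes (2, 2) and (2,) are compatible: the vector is applied to every row.
shifted = matrix + row

# dot gives the matrix product; np.std uses the population formula (ddof=0) by default.
product = np.dot(matrix, matrix)
spread = np.std(matrix)

print(shifted)
print(product)
print(spread)
```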
3. Pandas - Data Manipulation and Analysis
Pandas introduces two main data structures:
- Series: 1D labeled array
- DataFrame: 2D labeled data structure
Key Operations:
- Reading data: pd.read_csv(), pd.read_excel()
- Inspecting data: df.head(), df.info()
- Filtering: df[df['Age'] > 25]
- Sorting: df.sort_values(by='Salary')
Example:
import pandas as pd
df = pd.DataFrame({'Name': ['A', 'B'], 'Age': [22, 28]})
print(df[df['Age'] > 25])
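The inspecting, filtering, and sorting operations listed above can be combined in one short sketch. The rows and the Salary values here are hypothetical sample data, not taken from a real dataset.

```python
import pandas as pd

# Hypothetical sample data; column names follow the operations listed in the notes.
df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Age': [22, 28, 35],
    'Salary': [50000, 42000, 61000],
})

print(df.head())                          # first rows (up to five)

older = df[df['Age'] > 25]                # boolean mask keeps rows where the condition holds
by_salary = df.sort_values(by='Salary')   # ascending order by default

print(older['Name'].tolist())
print(by_salary['Name'].tolist())
```

Both filtering and sort_values return new DataFrames, so df itself is left unchanged.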
4. Data Cleaning in Pandas
- Handling Missing Data:
df.isnull().sum()
df.dropna(), df.fillna(value)
- Renaming Columns:
df.rename(columns={'old': 'new'})
- Changing Data Types:
df['col'] = df['col'].astype('int')
Example:
df['Age'] = df['Age'].fillna(df['Age'].mean())
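A minimal end-to-end sketch of the cleaning steps above, assuming a small hypothetical frame with one missing age:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing age, ages stored as floats, lowercase column name.
df = pd.DataFrame({'name': ['A', 'B', 'C'], 'Age': [22.0, np.nan, 28.0]})

missing_per_column = df.isnull().sum()   # number of NaN values in each column
dropped = df.dropna()                    # new frame without the rows that contain NaN

# Fill the gap with the column mean, then convert the column to integers.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Age'] = df['Age'].astype('int')

# rename returns a new DataFrame, so assign the result to keep the change.
df = df.rename(columns={'name': 'Name'})
print(df)
```

Calling astype('int') before the missing value is filled would raise an error, which is why the fill comes first.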
5. Grouping and Aggregation
- Grouping: df.groupby('Department')['Salary'].mean()
- Aggregation: df.agg({'Age': ['mean', 'max'], 'Salary': 'sum'})
- Pivot Tables:
df.pivot_table(index='Dept', values='Salary', aggfunc='mean')
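The three operations above can be tried on a small hypothetical frame. The department names and figures are invented for illustration, and the sketch applies agg per group rather than to the whole frame.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT'],
    'Age': [30, 40, 25, 35],
    'Salary': [100, 200, 300, 500],
})

# One statistic for one column, per group.
mean_salary = df.groupby('Department')['Salary'].mean()

# Different statistics per column; the result has two-level column labels.
summary = df.groupby('Department').agg({'Age': ['mean', 'max'], 'Salary': 'sum'})

# The same per-department mean, expressed as a pivot table.
pivot = df.pivot_table(index='Department', values='Salary', aggfunc='mean')

print(mean_salary)
print(summary)
print(pivot)
```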
6. Matplotlib - Basic Visualization
Matplotlib is used to create static, animated, and interactive plots.
Example:
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [10, 20, 30]
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot')
plt.show()
7. Seaborn - Statistical Visualization
Seaborn is built on top of Matplotlib and is used for statistical graphics.
Example:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
tips = sns.load_dataset('tips')  # example dataset, fetched online on first use
sns.boxplot(x='day', y='total_bill', data=tips)
plt.show()
8. Time Series Analysis with Pandas
Time series data is indexed by timestamps. Pandas supports time-based indexing and resampling.
Example:
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
monthly_avg = df['sales'].resample('M').mean()  # monthly mean; pandas 2.2+ prefers the alias 'ME'
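A self-contained version of the same idea, with hypothetical sales figures. This sketch uses the month-start alias 'MS' instead of 'M', on the assumption that it behaves the same across pandas versions; each month is then labeled by its first day rather than its last.

```python
import pandas as pd

# Hypothetical sales records; dates start out as plain strings.
df = pd.DataFrame({
    'date': ['2024-01-05', '2024-01-20', '2024-02-10', '2024-02-25'],
    'sales': [100, 200, 300, 500],
})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# Time-based indexing: a partial date string selects every row in that month.
january = df.loc['2024-01']

# Resample into calendar months and average the values in each month.
monthly_avg = df['sales'].resample('MS').mean()

print(january)
print(monthly_avg)
```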
9. Statistics with Pandas and NumPy
- Descriptive Stats: df.describe()
- Correlation: df.corr()
- Value Counts: df['Category'].value_counts()
- Standard Deviation: df['Salary'].std()
NumPy Examples:
np.mean(data), np.median(data), np.std(data)
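The calls above can be compared side by side on a small hypothetical frame. One detail worth knowing: pandas std() computes the sample standard deviation (ddof=1) by default, while np.std computes the population standard deviation (ddof=0), so the two differ on the same data.

```python
import numpy as np
import pandas as pd

# Hypothetical data; Salary is exactly 10 * Age, so the correlation is 1.
df = pd.DataFrame({
    'Category': ['x', 'y', 'x', 'x'],
    'Age': [20, 30, 40, 50],
    'Salary': [200, 300, 400, 500],
})

summary = df.describe()                       # count, mean, std, min, quartiles, max
corr = df[['Age', 'Salary']].corr()           # Pearson correlation matrix
counts = df['Category'].value_counts()        # frequency of each category

pandas_std = df['Salary'].std()               # sample std, ddof=1
numpy_std = np.std(df['Salary'].to_numpy())   # population std, ddof=0

print(summary)
print(corr)
print(counts)
print(pandas_std, numpy_std)
```

Passing ddof=1 to np.std, or ddof=0 to the pandas method, makes the two results agree.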
10. Plotly - Interactive Visualization
Plotly is a graphing library for interactive charts.
Example:
import plotly.express as px
df = px.data.gapminder().query("year == 2007")
fig = px.scatter(df, x="gdpPercap", y="lifeExp", size="pop", color="continent")
fig.show()
11. Scikit-learn - Machine Learning Library
Scikit-learn provides simple tools for predictive data analysis.
Steps:
- Load dataset
- Split data: train_test_split()
- Train model: model.fit()
- Predict: model.predict()
Example:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
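The example above assumes X_train, y_train, and X_test already exist. A runnable sketch of all four steps follows, using synthetic data (an assumption for illustration: y = 3x + 2 with no noise) in place of a real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load dataset: synthetic here. X must be 2D (n_samples, n_features).
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0

# Split data: hold out 25% for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train model, then predict on the held-out rows.
model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)

print(model.coef_[0], model.intercept_)   # fitted slope and intercept
print(model.score(X_test, y_test))        # R^2 on the test set
```

Because the data is noise-free, the fitted line recovers the slope and intercept almost exactly; real data will not fit this cleanly.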
12. Summary & Tips for Interviews
- Master Pandas and NumPy first
- Practice real datasets (Kaggle, UCI, etc.)
- Know how to visualize and clean data
- Understand ML workflow: EDA -> Preprocessing -> Model
- Practice SQL + Python-based case studies