Exploratory Data Analysis (EDA) in Python
Core Techniques Every Data Professional Should Know
Pooja Pawar
EDA Setup
Create virtual env → isolate project
$ python -m venv eda_env
Activate env → start workspace
$ source eda_env/bin/activate
Upgrade pip → avoid install errors
$ python -m pip install --upgrade pip
Install core libraries → EDA essentials
$ pip install pandas numpy
$ pip install matplotlib seaborn
Install stats tools → deeper analysis
$ pip install scipy
Install profiling → automated EDA
$ pip install ydata-profiling
Launch Jupyter → interactive analysis
$ jupyter notebook
Load Data & First Look
Read CSV file → load dataset into a DataFrame
pd.read_csv("data.csv")
Read Excel file → import spreadsheet data into Python
pd.read_excel("data.xlsx")
Head → preview first few rows of the dataset
df.head()
Tail → inspect last rows to check data completeness
df.tail()
Shape → check number of rows and columns
df.shape
Columns → list all feature names in the dataset
df.columns
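The first-look calls above can be run end to end on a tiny inline dataset (the CSV text and column names here are hypothetical stand-ins for a real file):

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real file on disk.
csv_text = """order_id,city,amount
1,Pune,120.5
2,Mumbai,89.0
3,Pune,240.0
"""

# pd.read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))        # preview first rows
print(df.shape)          # (rows, columns) -> (3, 3)
print(list(df.columns))  # feature names
```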
Data Types & Columns
Dtypes → check column data types
df.dtypes
Convert to datetime → fix dates
pd.to_datetime(df["date"])
Convert to numeric → clean numbers
pd.to_numeric(df["amount"], errors="coerce")
Astype category → optimize memory
df["city"].astype("category")
Rename columns → consistency
df.rename(columns={"Order Date": "order_date"})
Sort values → inspect extremes
df.sort_values("amount")
Unique values → detect IDs
df.nunique()
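Put together, a minimal type-cleaning pass might look like this (column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical raw extract with stringly-typed columns.
df = pd.DataFrame({
    "Order Date": ["2024-01-05", "2024-01-06"],
    "amount": ["100", "oops"],       # one unparseable value
    "city": ["Pune", "Mumbai"],
})

df = df.rename(columns={"Order Date": "order_date"})         # consistent names
df["order_date"] = pd.to_datetime(df["order_date"])          # strings -> timestamps
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # "oops" -> NaN
df["city"] = df["city"].astype("category")                   # lower memory

print(df.dtypes)
```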
Missing Values Handling
Is null → missing check
df.isnull().any()
Null count → column-wise
df.isnull().sum()
Null percentage → severity
df.isnull().mean()*100
Drop null rows → strict cleaning
df.dropna()
Fill with value → simple impute
df.fillna(0)
Fill with median → numeric fix
df["amount"].fillna(df["amount"].median())
Forward fill → time series
df.ffill()
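A quick sketch of the null checks and imputations above on a toy series (values invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [10.0, np.nan, 30.0, np.nan]})

null_pct = df["amount"].isnull().mean() * 100             # share of missing values
median_fill = df["amount"].fillna(df["amount"].median())  # numeric impute
forward_fill = df["amount"].ffill()                       # carry last seen value

print(null_pct)               # 50.0
print(median_fill.tolist())   # [10.0, 20.0, 30.0, 20.0]
print(forward_fill.tolist())  # [10.0, 10.0, 30.0, 30.0]
```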
Duplicates & Quality Checks
Find duplicates → detect repeats
df.duplicated()
Count duplicates → data quality
df.duplicated().sum()
Drop duplicates → clean data
df.drop_duplicates()
Subset duplicates → key-based
df.duplicated(subset=["id","date"])
Memory usage → dataset size
df.memory_usage(deep=True)
Sample rows → random check
df.sample(5)
Value counts → category spread
df["status"].value_counts()
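The duplicate checks above, sketched on an invented three-row frame with one exact repeat:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "status": ["ok", "ok", "late"],
})

n_dupes = df.duplicated().sum()   # exact repeated rows (beyond the first copy)
clean = df.drop_duplicates()      # keep one copy of each row
counts = df["status"].value_counts()

print(n_dupes)       # 1
print(len(clean))    # 2
print(counts["ok"])  # 2
```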
Descriptive Statistics
Mean → average value
df["amount"].mean()
Median → central value
df["amount"].median()
Std dev → variation
df["amount"].std()
Quantiles → distribution cut
df["amount"].quantile([0.25,0.5,0.75])
Skew → distribution shape
df["amount"].skew()
Kurtosis → tail heaviness
df["amount"].kurt()
Mode → most frequent
df["status"].mode()
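On a small invented sample, the statistics above behave like this (note how one large value pulls the mean above the median):

```python
import pandas as pd

amounts = pd.Series([10, 20, 20, 30, 100], name="amount")

mean = amounts.mean()      # 36.0 -- dragged up by the outlier
median = amounts.median()  # 20.0 -- robust central value
quartiles = amounts.quantile([0.25, 0.5, 0.75])
skew = amounts.skew()      # positive: long right tail

print(mean, median)
print(skew > 0)
```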
GroupBy Analysis
Group mean → segment average
df.groupby("city")["amount"].mean()
Group sum → totals
df.groupby("city")["amount"].sum()
Multiple agg → deeper insight
df.groupby("city")["amount"].agg(["mean","median","count"])
Pivot table → summary view
pd.pivot_table(df, values="amount", index="city")
Crosstab → category vs category
pd.crosstab(df["city"], df["status"])
Rank within group → comparison
df.groupby("city")["amount"].rank()
Top N per group → leaders
df.sort_values("amount").groupby("city").tail(3)
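The groupby patterns above, run on a four-row invented frame (sorting then taking `tail(1)` per group yields the top row per city):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Pune", "Mumbai", "Mumbai"],
    "amount": [100, 200, 50, 150],
})

city_mean = df.groupby("city")["amount"].mean()
summary = df.groupby("city")["amount"].agg(["mean", "sum", "count"])
top1 = df.sort_values("amount").groupby("city").tail(1)  # biggest sale per city

print(city_mean["Pune"])             # 150.0
print(summary.loc["Mumbai", "sum"])  # 200
print(sorted(top1["amount"]))        # [150, 200]
```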
Correlation & Relationships
Correlation matrix → relationships
df.select_dtypes("number").corr()
Target correlation → drivers
df.select_dtypes("number").corr()["amount"]
Covariance → joint variation
df.select_dtypes("number").cov()
Scatter plot → relation view
plt.scatter(df["x"], df["y"])
Pairplot → multi-feature view
sns.pairplot(df)
Heatmap → correlation visual
sns.heatmap(df.select_dtypes("number").corr())
Line fit → trend check
np.polyfit(x, y, 1)
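Correlation and a straight-line fit, sketched on fabricated data where the target is exactly linear in one feature:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "amount": [2, 4, 6, 8, 10],  # amount = 2 * x, so correlation is ~1.0
    "noise": [5, 1, 4, 2, 3],
})

corr = df.select_dtypes("number").corr()
target_corr = corr["amount"]  # correlation of every column with the target
slope, intercept = np.polyfit(df["x"], df["amount"], 1)

print(target_corr["x"])    # ~1.0
print(round(slope, 6))     # ~2.0
```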
Visual EDA
Histogram → distribution
df["amount"].hist()
Boxplot → outliers
df.boxplot(column="amount")
Bar plot → category counts
df["city"].value_counts().plot.bar()
Line plot → trends
df.plot(x="date", y="amount")
Countplot → frequency
sns.countplot(x="status", data=df)
Violin plot → density
sns.violinplot(x="status", y="amount", data=df)
Save plot → reuse
plt.savefig("plot.png")
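A headless sketch of the plotting flow, assuming the non-interactive Agg backend and an in-memory buffer instead of a file on disk:

```python
import io

import matplotlib
matplotlib.use("Agg")           # render without a display (e.g. on a server)
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"amount": [10, 20, 20, 30, 100]})

df["amount"].hist()             # distribution of a numeric column
buf = io.BytesIO()
plt.savefig(buf, format="png")  # plt.savefig("plot.png") writes to disk instead
plt.close()

print(buf.getbuffer().nbytes > 0)  # PNG bytes were produced
```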
Automated EDA & Export
Profile report → full EDA
from ydata_profiling import ProfileReport
ProfileReport(df).to_file("report.html")
Minimal profile → large data
ProfileReport(df, minimal=True)
Sweetviz → instant report
sv.analyze(df).show_html()
Missingno matrix → null pattern
msno.matrix(df)
Export CSV → cleaned data
df.to_csv("clean.csv", index=False)
Export Parquet → analytics-ready
df.to_parquet("clean.parquet")
Save stats → share insights
df.describe().to_csv("stats.csv")
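The export steps above, written to a temporary directory so the sketch is self-contained (filenames are placeholders):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Mumbai"], "amount": [100.0, 50.0]})

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "clean.csv")
stats_path = os.path.join(out_dir, "stats.csv")

df.to_csv(csv_path, index=False)    # cleaned data, no index column
df.describe().to_csv(stats_path)    # shareable summary statistics

round_trip = pd.read_csv(csv_path)  # verify the file reads back
print(round_trip.shape)             # (2, 2)
```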
Follow for more data analytics content
Pooja Pawar