EDA in Python

The document provides a comprehensive guide on Exploratory Data Analysis (EDA) in Python, detailing essential setup steps, data loading techniques, and methods for handling missing values and duplicates. It also covers descriptive statistics, group analysis, correlation, and various visualization techniques. Additionally, automated EDA tools and export options for cleaned data are discussed to enhance the analysis process.


Exploratory Data Analysis (EDA) in Python
Core Techniques Every Data Professional Should Know

Pooja Pawar
EDA Setup
Create virtual env → isolate project
$ python -m venv eda_env

Activate env → start workspace


$ source eda_env/bin/activate

Upgrade pip → avoid install errors


$ python -m pip install --upgrade pip

Install core libraries → EDA essentials


$ pip install pandas numpy
$ pip install matplotlib seaborn

Install stats tools → deeper analysis


$ pip install scipy

Install profiling → automated EDA


$ pip install ydata-profiling

Launch Jupyter → interactive analysis


$ jupyter notebook
Load Data & First Look
Read CSV file → load dataset into a DataFrame
pd.read_csv("data.csv")

Read Excel file → import spreadsheet data into Python
pd.read_excel("data.xlsx")

Head → preview first few rows of the dataset
df.head()

Tail → inspect last rows to check data completeness
df.tail()

Shape → check number of rows and columns
df.shape

Columns → list all feature names in the dataset
df.columns
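The first-look calls above can be chained end to end. A minimal runnable sketch, using a tiny in-memory DataFrame in place of a real CSV file (column names are illustrative):

```python
import pandas as pd

# Stand-in for pd.read_csv("data.csv") so the sketch is self-contained
df = pd.DataFrame({
    "city": ["Pune", "Mumbai", "Pune", "Delhi"],
    "amount": [120.0, 340.5, 89.9, 210.0],
})

first_rows = df.head(2)           # preview the first rows
last_rows = df.tail(2)            # inspect the last rows
n_rows, n_cols = df.shape         # 4 rows, 2 columns
feature_names = list(df.columns)  # ["city", "amount"]
```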
Data Types & Columns
Dtypes → check column data types
df.dtypes

Convert to datetime → fix dates


pd.to_datetime(df["date"])

Convert to numeric → clean numbers


pd.to_numeric(df["amount"], errors="coerce")

Astype category → optimize memory


df["city"].astype("category")

Rename columns → consistency


df.rename(columns={"Order Date": "order_date"})

Sort values → inspect extremes


df.sort_values("amount")

Unique values → detect IDs


df["id"].nunique()

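Putting the type fixes together; a small sketch with deliberately messy input (the "Order Date" and "oops" values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Order Date": ["2024-01-05", "2024-01-06"],
    "amount": ["100", "oops"],
    "city": ["Pune", "Mumbai"],
})

df["Order Date"] = pd.to_datetime(df["Order Date"])          # strings -> timestamps
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # "oops" -> NaN
df["city"] = df["city"].astype("category")                   # smaller memory footprint
df = df.rename(columns={"Order Date": "order_date"})         # consistent snake_case
n_unique_cities = df["city"].nunique()                       # distinct categories
```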
Missing Values Handling
Is null → missing check
df.isnull().any()

Null count → column-wise


df.isnull().sum()

Null percentage → severity


df.isnull().mean() * 100

Drop null rows → strict cleaning


df.dropna()

Fill with value → simple impute


df.fillna(0)

Fill with median → numeric fix


df["amount"].fillna(df["amount"].median())

Forward fill → time series


df.ffill()
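The missing-value checks and fixes above, in one runnable sketch (the "amount" column and its values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [100.0, np.nan, 250.0, np.nan]})

has_missing = df.isnull().any()          # per-column True/False flag
null_counts = df.isnull().sum()          # 2 nulls in "amount"
null_pct = df.isnull().mean() * 100      # 50.0 percent missing
filled = df["amount"].fillna(df["amount"].median())  # median of [100, 250] = 175
```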
Duplicates & Quality Checks
Find duplicates → detect repeats
df.duplicated()

Count duplicates → data quality


df.duplicated().sum()

Drop duplicates → clean data


df.drop_duplicates()

Subset duplicates → key-based


df.duplicated(subset=["id", "date"])

Memory usage → dataset size


df.memory_usage(deep=True)

Sample rows → random check


df.sample(5)

Value counts → category spread


df["status"].value_counts()
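A quick sketch of the duplicate checks, with one exact repeat and a key-based repeat built in (the "id"/"date"/"status" columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "status": ["ok", "ok", "fail", "ok"],
})

dup_count = df.duplicated().sum()                     # 1 exact repeat (rows 0 and 1)
deduped = df.drop_duplicates()                        # 3 rows remain
key_dup_count = df.duplicated(subset=["id", "date"]).sum()  # repeats on the key only
status_spread = df["status"].value_counts()           # category spread
```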
Descriptive Statistics
Mean → average value
df["amount"].mean()

Median → central value


df["amount"].median()

Std dev → variation


df["amount"].std()

Quantiles → distribution cut


df["amount"].quantile([0.25,0.5,0.75])

Skew → distribution shape


df["amount"].skew()

Kurtosis → tail heaviness


df["amount"].kurt()

Mode → most frequent


df["status"].mode()
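The descriptive statistics above on a tiny illustrative series (the 100 is an intentional outlier so skew is visible):

```python
import pandas as pd

amounts = pd.Series([10.0, 20.0, 20.0, 30.0, 100.0])

mean_val = amounts.mean()                        # 36.0, pulled up by the outlier
median_val = amounts.median()                    # 20.0, robust central value
std_val = amounts.std()                          # sample standard deviation
quartiles = amounts.quantile([0.25, 0.5, 0.75])  # distribution cut points
skew_val = amounts.skew()                        # > 0: long right tail
mode_val = amounts.mode()[0]                     # 20.0, most frequent value
```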
GroupBy Analysis
Group mean → segment average
df.groupby("city")["amount"].mean()

Group sum → totals


df.groupby("city")["amount"].sum()

Multiple agg → deeper insight


df.groupby("city")["amount"].agg(["mean", "median", "count"])

Pivot table → summary view


pd.pivot_table(df, values="amount", index="city")

Crosstab → category vs category


pd.crosstab(df["city"], df["status"])

Rank within group → comparison


df.groupby("city")["amount"].rank()

Top N per group → leaders


df.sort_values("amount").groupby("city").tail(3)
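The groupby patterns above in one runnable sketch (two cities, two orders each; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Pune", "Mumbai", "Mumbai"],
    "amount": [100.0, 300.0, 50.0, 150.0],
})

group_mean = df.groupby("city")["amount"].mean()   # Mumbai 100, Pune 200
group_sum = df.groupby("city")["amount"].sum()     # segment totals
summary = df.groupby("city")["amount"].agg(["mean", "median", "count"])
pivot = pd.pivot_table(df, values="amount", index="city")   # mean by default
top_per_city = df.sort_values("amount").groupby("city").tail(1)  # leader per city
```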
Correlation & Relationships
Correlation matrix → relationships
df.select_dtypes("number").corr()

Target correlation → drivers


df.select_dtypes("number").corr()["amount"]

Covariance → joint variation


df.select_dtypes("number").cov()

Scatter plot → relation view


plt.scatter(df["x"], df["y"])

Pairplot → multi-feature view


sns.pairplot(df)

Heatmap → correlation visual


sns.heatmap(df.select_dtypes("number").corr())

Line fit → trend check


np.polyfit(x, y, 1)
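Correlation and line fitting on a tiny illustrative frame where "amount" is exactly 2·x, so the expected correlation is 1 and the fitted slope is 2:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "amount": [2.0, 4.0, 6.0, 8.0],  # exactly 2 * x
})

corr = df.select_dtypes("number").corr()   # full correlation matrix
target_corr = corr["amount"]               # each feature vs. the target
slope, intercept = np.polyfit(df["x"], df["amount"], 1)  # fitted line ~ 2x + 0
```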
Visual EDA
Histogram → distribution
df["amount"].hist()

Boxplot → outliers
[Link](column="amount")

Bar plot → category counts


df["city"].value_counts().plot(kind="bar")

Line plot → trends


df.plot(x="date", y="amount")

Countplot → frequency
[Link](x="status", data=df)

Violin plot → density


[Link](x="status", y="amount", data=df)

Save plot → reuse


plt.savefig("plot.png")
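A headless sketch of the plotting calls: the Agg backend renders without a display, and "plot.png" is a placeholder filename:

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "amount": [10, 20, 20, 30, 100],
    "city": ["Pune", "Pune", "Mumbai", "Delhi", "Pune"],
})

hist_ax = df["amount"].hist()                        # distribution of amounts
plt.figure()                                         # fresh figure for the bar plot
bar_ax = df["city"].value_counts().plot(kind="bar")  # category counts
plt.savefig("plot.png")                              # save for reuse
plt.close("all")
saved = os.path.exists("plot.png")
```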
Automated EDA & Export
Profile report → full EDA
from ydata_profiling import ProfileReport
ProfileReport(df).to_file("report.html")

Minimal profile → large data


ProfileReport(df, minimal=True)

Sweetviz → instant report


sv.analyze(df).show_html()

Missingno matrix → null pattern


msno.matrix(df)

Export CSV → cleaned data


df.to_csv("cleaned.csv", index=False)

Export Parquet → analytics-ready


df.to_parquet("cleaned.parquet")

Save stats → share insights


df.describe().to_csv("summary_stats.csv")
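A sketch of the export step, verified with a CSV round trip; the filenames are placeholders:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Mumbai"], "amount": [100.0, 250.0]})

df.to_csv("cleaned.csv", index=False)      # cleaned data, no index column
df.describe().to_csv("summary_stats.csv")  # share the numeric summary

reloaded = pd.read_csv("cleaned.csv")
round_trip_ok = reloaded.equals(df)        # CSV round trip preserved the data
```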
Follow for more data analytics content
