Project WeRateDogs – Report
NWG Data Engineering Nanodegree – Wrangle and Analyze Data
By Gaurav Sachan
Introduction
This report provides an overview of the data wrangling process undertaken to prepare the dataset for analysis within the project.
The dataset used in this project is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs, a Twitter
account that rates people’s dogs with humorous comments about the dogs. This wrangle report documents the primary objectives:
gathering, assessing, and cleaning the data to ensure its usability and reliability. Each step is documented for
future reference, aiding reproducibility.
Data Gathering
The first step was to gather the data from multiple origins/sources and in varying formats, including:
CSV file – WeRateDogs Twitter archive. Udacity provided this CSV file, and I downloaded it directly from the project website.
TSV file – Tweet Image Predictions. Udacity’s servers host this file, and I downloaded it programmatically using the
Requests Python library.
TXT file – Tweet JSON. This file contains the retweet count and favourite count information missing from the Twitter
archive. In the absence of a Twitter developer account, I used the Requests Python library to download the tweet JSON file (in TXT
format) programmatically from the Udacity portal.
The key Python libraries used to process the data are pandas and requests.
Challenges encountered during gathering:
Unable to obtain developer access to the Twitter API, I had to gather the TXT file containing the tweet JSON using a
workaround: sourcing it directly from the project portal.
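The gathering steps above can be sketched roughly as follows. This is an illustrative sketch, not the notebook code itself: the URLs and file names are placeholders, and the tweet JSON is assumed to be line-delimited with id, retweet_count, and favorite_count fields, as described above.

```python
import json

import pandas as pd
import requests


def download_file(url, path):
    """Download a hosted file programmatically and save it locally."""
    response = requests.get(url)
    response.raise_for_status()
    with open(path, "wb") as f:
        f.write(response.content)


def load_tweet_json(path):
    """Parse line-delimited tweet JSON, keeping the counts missing
    from the Twitter archive."""
    records = []
    with open(path) as f:
        for line in f:
            tweet = json.loads(line)
            records.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return pd.DataFrame(records)


# Usage sketch (URLs and file names are placeholders):
# download_file("https://example.com/image-predictions.tsv", "image-predictions.tsv")
# predictions = pd.read_csv("image-predictions.tsv", sep="\t")
# counts = load_tweet_json("tweet-json.txt")
```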
Data Assessment
After gathering the data from various sources, I assessed the data both visually and programmatically to identify data quality
and tidiness issues using the following guideline:
“Quality relates to Content while Tidiness relates to Structure of data”
Since the datasets are not massive, I could open the files in MS Excel, scroll through the data, and spot visible issues. Code
in a Jupyter Notebook helped to view specific portions and summaries of the data, for example using pandas methods
including, but not limited to, info(), head(), sample(), value_counts(), duplicated(), query(), and describe().
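As an illustration of these programmatic checks, the sketch below runs a few of the named pandas methods. The frame here is a tiny stand-in using the archive’s column names; the notebook runs the same checks on the full archive loaded from CSV.

```python
import pandas as pd

# Tiny stand-in with the archive's column names (illustrative values only)
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "rating_numerator": [13, 1776, 12, 10],
    "rating_denominator": [10, 10, 170, 10],
})

# info() shows dtypes and non-null counts at a glance
archive.info()

# value_counts() surfaces unusual denominators (10 should dominate)
denominator_counts = archive["rating_denominator"].value_counts()

# duplicated() checks for repeated tweets
duplicate_tweets = archive.duplicated(subset="tweet_id").sum()

# query() pulls the suspicious rows for a closer look
odd_rows = archive.query("rating_denominator != 10")
```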
Following are the observations noted while assessing the data; they help define the scope of issues to fix later in the process.
Quality Issues Summary
Twitter Archive
- Four ID columns (in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id) are float; they should be int.
- timestamp is str; it should be datetime. It also contains +0000, which is odd.
- doggo, floofer, pupper, and puppo are all variants of a dog stage; they could be combined under one column, say "dog_stage".
- Unusual values in rating_denominator (e.g., 170, 150, 130); the denominator would be expected to be almost always 10.
- Unusual values in rating_numerator (e.g., 1776, 960, 666, 204, 165) in comparison to the denominator.
- The source information is not easily comprehensible.
- Only original ratings with pictures are needed; retweet and reply entries are not required.

Image Predictions
- There are non-dog entries as well, e.g., banana, notebook, pool_table, theatre_curtain.
- Capitalization in the p1, p2, and p3 columns is not consistent; it could be either all lower case or InitCap.
- The grain of the data is tweet_id, so it could very well sit with the Twitter archive.

Tweet JSON
- The grain of the data is tweet_id; this data could very well sit with the twitter_archive_enhanced dataset.
Data Cleaning
There are many issues with data downloaded from social media, and fixing all of them completely would take significant time.
To the best of my abilities, I wrote code to clean up each of the issues documented in the previous section. The programmatic data
cleaning process consists of three steps:
Define – Code – Test
Important points to note:
Copies of the original data were made before cleaning (to avoid accidental impact to the original datasets)
Issues were cleaned in a logical order, because some issues disappear as others are fixed
o For example, some of the abnormal rating numerators and denominators discovered earlier disappeared after
fixing the issue that many entries in the image predictions were not for dogs
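A minimal sketch of the Define–Code–Test pattern, applied to a copy of the data, could look like this. The frame below is an illustrative stand-in for the archive, not the real dataset.

```python
import pandas as pd

# Illustrative stand-in for a slice of the archive
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 3,
    "retweeted_status_id": [None, 555.0, None],
})

# Define: only original tweets are needed, and timestamp should be datetime.
# Code: work on a copy so the original stays untouched.
archive_clean = archive.copy()
archive_clean = archive_clean[archive_clean["retweeted_status_id"].isnull()]
archive_clean = archive_clean.drop(columns=["retweeted_status_id"])
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])

# Test: verify the fix worked and the original is intact
assert len(archive_clean) == 2
assert str(archive_clean["timestamp"].dtype).startswith("datetime64")
assert len(archive) == 3
```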
To ensure quality, the pre-processing steps undertaken are articulated below:
Handling non-comprehensible data: The source information was not easily comprehensible and had to be converted to
simpler, understandable values.
Correcting inconsistencies: The rating numerators and denominators were inconsistent and had to be fixed. Abnormal
ratings were filtered, and the tweet text was read through to find the correct ratings.
Standardizing data types: Numerical and categorical variables were converted to appropriate formats. Four ID columns
were typed as float while they should have been int; this datatype issue was resolved after removing the retweet and
reply columns.
Removing irrelevant information: Columns that were unnecessary for the analysis were dropped. Entries that were not
dogs were removed, and some of the abnormal ratings were automatically gone. The project required only original
tweets; hence, any retweet and reply information had to be removed.
Manual steps: Programmatic tools executed most of the data cleaning; for some issues, manual methods were used, for
example, correcting the abnormal rating numerators and denominators.
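One concrete tidiness fix worth illustrating is collapsing the four stage columns (doggo, floofer, pupper, puppo) into a single dog_stage column. The sketch below is an assumption-laden illustration: it uses a tiny stand-in frame and assumes absent stages are encoded as the literal string "None", as in the raw archive.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in; absent stages are the literal string "None"
archive_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Blank out the "None" markers, then concatenate what remains per row;
# a row with no stage becomes an empty string, replaced by NaN
stages = archive_clean[stage_cols].replace("None", "")
archive_clean["dog_stage"] = (
    stages.apply(lambda row: "".join(row), axis=1).replace("", np.nan)
)
archive_clean = archive_clean.drop(columns=stage_cols)
```

A row tagged with two stages would concatenate both (e.g., "doggopupper"), which is one way such multi-stage rows can be surfaced for manual review.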
After fixing all the issues, I reassessed the dataset and then stored the clean data in a master CSV file.
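Since all three cleaned tables share tweet_id as their grain, producing the master file amounts to joining them on that key and saving the result. The frames and the output file name below are illustrative placeholders, not the project's actual data.

```python
import pandas as pd

# Minimal stand-ins for the three cleaned tables (shared grain: tweet_id)
archive_clean = pd.DataFrame({"tweet_id": [1, 2], "rating_numerator": [13, 12]})
predictions_clean = pd.DataFrame({"tweet_id": [1, 2], "p1": ["labrador", "pug"]})
counts_clean = pd.DataFrame({"tweet_id": [1, 2], "retweet_count": [5, 3]})

# Join all three on tweet_id into one master table
master = (
    archive_clean
    .merge(predictions_clean, on="tweet_id", how="inner")
    .merge(counts_clean, on="tweet_id", how="inner")
)

# Persist the clean data (file name is a placeholder)
master.to_csv("twitter_archive_master.csv", index=False)
```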
Data Analysis Preparation
Following cleaning, the dataset was:
Transformed into a structured format suitable for modeling or visualization
Checked for anomalies
Normalized if required
Reproducibility Considerations
To facilitate future reproduction of this work:
Documented each step in a Jupyter Notebook with markdown explanations
Ensured that all dataset sources were properly referenced for easy access
Conclusion
This report summarizes the data wrangling workflow, outlining how the data was sourced, cleaned, and prepared for analysis.
The documented process will assist future users in reproducing the dataset transformation accurately.