Project WeRateDogs – Report
NWG Data Engineering Nanodegree – Wrangle and Analyze Data
By Gaurav Sachan
Introduction
This report provides an overview of the data wrangling process undertaken to prepare the dataset for analysis within the project.
The dataset used in this project is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs, a Twitter
account that rates people’s dogs with humorous comments about the dogs. This wrangle report documents the primary objectives:
gathering, assessing, and cleaning the data to ensure its usability and reliability. Each step is documented for
future reference, aiding reproducibility.
Data Gathering
The first step was to gather the data from multiple origins/sources and in varying formats, including:
CSV file – WeRateDogs Twitter archive. Udacity provided this CSV file, and I downloaded it directly from the project website.
TSV file – Tweet Image Predictions. Udacity’s servers host this file, and I downloaded it programmatically using the
Requests Python library.
TXT file – Tweet JSON. This file contains the retweet count and favourite count information missing from the Twitter
archive. In the absence of a Twitter developer account, I used the Requests Python library to download the tweet JSON file (in TXT
format) programmatically from the Udacity portal.
The key Python libraries used to process the data are pandas and requests.
Challenges encountered during gathering:
Unable to obtain developer access to the Twitter API, I had to gather the TXT file containing the tweet JSON using a
workaround: sourcing it directly from the project portal.
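The gathering steps above can be sketched roughly as follows. This is an illustrative sketch, not the notebook code itself: the URLs and file names are placeholders, and the tweet JSON is assumed to be line-delimited with id, retweet_count, and favorite_count fields, as described above.

```python
import json

import pandas as pd
import requests


def download_file(url, path):
    """Download a hosted file programmatically and save it locally."""
    response = requests.get(url)
    response.raise_for_status()
    with open(path, "wb") as f:
        f.write(response.content)


def load_tweet_json(path):
    """Parse line-delimited tweet JSON, keeping the counts missing
    from the Twitter archive."""
    records = []
    with open(path) as f:
        for line in f:
            tweet = json.loads(line)
            records.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return pd.DataFrame(records)


# Usage sketch (URLs and file names are placeholders):
# download_file("https://example.com/image-predictions.tsv", "image-predictions.tsv")
# predictions = pd.read_csv("image-predictions.tsv", sep="\t")
# counts = load_tweet_json("tweet-json.txt")
```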
Data Assessment
After gathering the data from various sources, I assessed the data both visually and programmatically to identify data quality
and tidiness issues using the following guideline:
“Quality relates to Content while Tidiness relates to Structure of data”
Since the datasets are not massive, I could open the files in MS Excel, scroll through the data, and spot visible issues. Code
in a Jupyter Notebook helped to view specific portions and summaries of the data, for example using pandas methods
including, but not limited to, info(), head(), sample(), value_counts(), duplicated(), query(), and describe().
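As an illustration of these programmatic checks, the sketch below runs a few of the named pandas methods. The frame here is a tiny stand-in using the archive’s column names; the notebook runs the same checks on the full archive loaded from CSV.

```python
import pandas as pd

# Tiny stand-in with the archive's column names (illustrative values only)
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "rating_numerator": [13, 1776, 12, 10],
    "rating_denominator": [10, 10, 170, 10],
})

# info() shows dtypes and non-null counts at a glance
archive.info()

# value_counts() surfaces unusual denominators (10 should dominate)
denominator_counts = archive["rating_denominator"].value_counts()

# duplicated() checks for repeated tweets
duplicate_tweets = archive.duplicated(subset="tweet_id").sum()

# query() pulls the suspicious rows for a closer look
odd_rows = archive.query("rating_denominator != 10")
```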
Following are the observations noted while assessing the data; they help define the scope of issues to fix later in the process.
Quality Issues Summary
Twitter Archive
- Four ID columns (in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id) are float; they should be int.
- timestamp is str; it should be datetime. It also contains +0000, which is odd.
- doggo, floofer, pupper, and puppo are all variants of a dog stage; they could be combined under one column, say "dog_stage".
- Unusual values in rating_denominator (e.g., 170, 150, 130); the denominator would be expected to be almost always 10.
- Unusual values in rating_numerator (e.g., 1776, 960, 666, 204, 165) in comparison to the denominator.
- The source information is not easily comprehensible.
- Only original ratings with pictures are needed; retweet and reply entries are not required.

Image Predictions
- There are non-dog entries as well, e.g., banana, notebook, pool_table, theatre_curtain.
- Capitalization in the p1, p2, and p3 columns is not consistent; it could be either all lower case or InitCap.
- The grain of the data is tweet_id, so it could very well sit with the Twitter archive.

Tweet JSON
- The grain of the data is tweet_id; this data could very well sit with the twitter_archive_enhanced dataset.
Data Cleaning
There are many issues with data downloaded from social media, and fixing all of them completely would take significant time.
To the best of my abilities, I wrote code to clean up each of the issues documented in the previous section. The programmatic data
cleaning process consists of three steps:
Define – Code – Test
Important points to note:
Copies of the original data were made before cleaning (to avoid accidental impact to the original datasets)
Issues were cleaned in a logical order, because some issues disappear as others are fixed
o For example, some of the abnormal rating numerators and denominators discovered earlier disappeared after
fixing the issue that many entries in the image predictions were not for dogs
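A minimal sketch of the Define–Code–Test pattern, applied to a copy of the data, could look like this. The frame below is an illustrative stand-in for the archive, not the real dataset.

```python
import pandas as pd

# Illustrative stand-in for a slice of the archive
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 3,
    "retweeted_status_id": [None, 555.0, None],
})

# Define: only original tweets are needed, and timestamp should be datetime.
# Code: work on a copy so the original stays untouched.
archive_clean = archive.copy()
archive_clean = archive_clean[archive_clean["retweeted_status_id"].isnull()]
archive_clean = archive_clean.drop(columns=["retweeted_status_id"])
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])

# Test: verify the fix worked and the original is intact
assert len(archive_clean) == 2
assert str(archive_clean["timestamp"].dtype).startswith("datetime64")
assert len(archive) == 3
```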
To ensure quality, the pre-processing steps undertaken are articulated below:
Handling non-comprehensible data: The source information was not easily comprehensible and had to be converted to
simpler, understandable values.
Correcting inconsistencies: The rating numerators and denominators were inconsistent and had to be fixed. Abnormal
ratings were filtered, and the tweet text was read through to find the correct ratings.
Standardizing data types: Numerical and categorical variables were converted to appropriate formats. Four ID columns
were typed as float while they should have been int; this datatype issue was resolved after removing the retweet and
reply columns.
Removing irrelevant information: Columns that were unnecessary for the analysis were dropped. Entries that were not
dogs were removed, and some of the abnormal ratings were automatically gone. The project required only original
tweets; hence, any retweet and reply information had to be removed.
Manual steps: Programmatic tools executed most of the data cleaning; for some issues, manual methods were used, for
example, correcting the abnormal rating numerators and denominators.
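One concrete tidiness fix worth illustrating is collapsing the four stage columns (doggo, floofer, pupper, puppo) into a single dog_stage column. The sketch below is an assumption-laden illustration: it uses a tiny stand-in frame and assumes absent stages are encoded as the literal string "None", as in the raw archive.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in; absent stages are the literal string "None"
archive_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Blank out the "None" markers, then concatenate what remains per row;
# a row with no stage becomes an empty string, replaced by NaN
stages = archive_clean[stage_cols].replace("None", "")
archive_clean["dog_stage"] = (
    stages.apply(lambda row: "".join(row), axis=1).replace("", np.nan)
)
archive_clean = archive_clean.drop(columns=stage_cols)
```

A row tagged with two stages would concatenate both (e.g., "doggopupper"), which is one way such multi-stage rows can be surfaced for manual review.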
After fixing all the issues, I reassessed the dataset and then stored the clean data in a master CSV file.
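Since all three cleaned tables share tweet_id as their grain, producing the master file amounts to joining them on that key and saving the result. The frames and the output file name below are illustrative placeholders, not the project's actual data.

```python
import pandas as pd

# Minimal stand-ins for the three cleaned tables (shared grain: tweet_id)
archive_clean = pd.DataFrame({"tweet_id": [1, 2], "rating_numerator": [13, 12]})
predictions_clean = pd.DataFrame({"tweet_id": [1, 2], "p1": ["labrador", "pug"]})
counts_clean = pd.DataFrame({"tweet_id": [1, 2], "retweet_count": [5, 3]})

# Join all three on tweet_id into one master table
master = (
    archive_clean
    .merge(predictions_clean, on="tweet_id", how="inner")
    .merge(counts_clean, on="tweet_id", how="inner")
)

# Persist the clean data (file name is a placeholder)
master.to_csv("twitter_archive_master.csv", index=False)
```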
Data Analysis Preparation
Following cleaning, the dataset was:
Transformed into a structured format suitable for modeling or visualization
Checked for anomalies
Normalized if required
Reproducibility Considerations
To facilitate future reproduction of this work:
Documented each step in a Jupyter Notebook with markdown explanations
Ensured that all dataset sources were properly referenced for easy access
Conclusion
This report summarizes the data wrangling workflow, outlining how the data was sourced, cleaned, and prepared for analysis.
The documented process will assist future users in reproducing the dataset transformation accurately.