📁 DATA STRUCTURE OVERVIEW
Main Folder: propsight_datacollection
This is your master data collection folder containing 7 main directories,
each serving a specific purpose in your data pipeline.
Top-Level Layout
propsight_datacollection/
├── parsed_output/      # 11 date folders of cleaned, structured data
├── input_files/
├── images/
├── html_output/
├── execution_colabs/
├── TX/
└── UT/
DETAILED BREAKDOWN OF EACH FOLDER
1) parsed_output/
Purpose: Date-organized, cleaned, and structured outputs from each scraping session, ready for analysis.
Top-level structure:
parsed_output/
├── 2025-10-12/ # 74 items (one file per ZIP)
│ ├── redfin_77037_2025-[Link] (14 KB)
│ ├── redfin_77038_2025-[Link] (7 KB)
│ ├── redfin_77032_2025-[Link] (7 KB)
│ └── ... (more zip code files)
├── 2025-10-05/
├── 2025-09-28/
└── ... (more dates)
File types & sizes:
JSON: 14–18 KB (structured, programmatic use)
Excel (.xlsx): 5–7 KB (quick analysis, spreadsheets)
What’s inside each file: one ZIP code per file (e.g., 77037, 77038,
77032).
Sample columns (from Excel):
ID
Address
City, State, ZIP Code
Price # e.g., $224,786; $80,000; $144,950
Beds # 2, 3, 4, ...
Baths # 1.5, 2, 2.5, ...
Square Footage # 1,920; 1,000; 1,010; ...
Lot Size
Status # e.g., For Sale ("ABOUT THIS HOME" values look like stray page text captured during parsing)
Property URL # Redfin link
Image URL # CDN link
Observed ranges (sample):
Prices: $80,000 → $450,000+
Beds: 2–4
Baths: 1–3.5
Area: ~1,000 → 2,842+ sq ft
Links: Redfin property pages + CDN image URLs
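Given the naming convention above, each file's ZIP code and session date can be recovered from the filename itself. A minimal sketch in Python; note the listing truncates filenames after "2025-", so the full YYYY-MM-DD date format is an assumption:

```python
import re

# Assumes redfin_<ZIP>_<YYYY-MM-DD>... filenames; the exact suffix is
# truncated in the listing above, so only the prefix fields are pinned down.
FILENAME_RE = re.compile(r"redfin_(?P<zip>\d{5})_(?P<date>\d{4}-\d{2}-\d{2})")

def parse_filename(name):
    """Extract (zip_code, scrape_date) from a parsed_output filename,
    or None when the name doesn't follow the convention."""
    m = FILENAME_RE.search(name)
    return (m.group("zip"), m.group("date")) if m else None
```

This also returns None for `redfin_fail_`-prefixed names, since "fail" breaks the five-digit ZIP slot.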
2) input_files/ (1 file)
Purpose: Parameters/reference lists for scraping runs.
Contents:
input_files/
└── harris_county_zipcodes.xlsx (2 KB)
Columns:
Rank
Zip Code
Neighborhood
Median Price per Sq. Ft. # e.g., $498, $363, $355, ...
Sample entries:
1. 77005 — West University — $498/sq ft
2. 77024 — Memorial — $363/sq ft
3. 77027 — River Oaks — $355/sq ft
4. 77008 — The Heights — $312/sq ft
…
Use: Which ZIPs to target, neighborhood names, expected price ranges
(validation), and priority (rank).
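One way to turn this sheet into a prioritized target list, sketched with pandas (column names follow the sample above; rename if the actual spreadsheet differs):

```python
import pandas as pd

def top_zipcodes(df: pd.DataFrame, n: int = 5) -> list:
    """Return the n highest-priority ZIP codes from the reference sheet,
    ordered by the 'Rank' column."""
    return (
        df.sort_values("Rank")
          .head(n)["Zip Code"]
          .astype(str)
          .tolist()
    )

# Typical usage (reading .xlsx requires openpyxl):
# zips = top_zipcodes(pd.read_excel("input_files/harris_county_zipcodes.xlsx"))
```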
3) images/
Purpose: Photo assets by collection date and ZIP.
Structure:
images/
├── 2025-10-12/
│ ├── 77068/
│ │ ├── [Link] (76 KB)
│ │ ├── [Link] (41 KB)
│ │ ├── [Link] (39 KB)
│ │ └── ... (more images)
│ ├── 77032/
│ └── ...
├── 2025-10-05/
└── ... (more dates)
Notes:
JPGs: ~27–77 KB each
Filenames: hash-based (unique)
Typical content: exterior/interior property photos (often from Redfin
CDN)
Used for image-based search features
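A quick inventory of this tree is possible with a short sketch, assuming the images/<date>/<zip>/*.jpg layout shown above:

```python
from pathlib import Path

def image_inventory(images_root):
    """Count JPGs per (date, zip_code) folder in the images/<date>/<zip>/ layout."""
    counts = {}
    for jpg in Path(images_root).glob("*/*/*.jpg"):
        key = (jpg.parent.parent.name, jpg.parent.name)   # (date, zip_code)
        counts[key] = counts.get(key, 0) + 1
    return counts
```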
4) html_output/
Purpose: Raw HTML snapshots saved during scraping (for
re-parsing/debugging).
Structure:
html_output/
├── 2025-10-12/
│ ├── redfin_77034_2025-[Link] (908 KB)
│ ├── redfin_77035_2025-[Link] (1,202 KB)
│ ├── redfin_fail_77028_2025-[Link] (180 KB) # failed/empty page
│ └── ... (more HTML files)
└── ... (more dates)
Notes:
Success pages: large (≈0.9–1.2 MB)
Failure pages: smaller (≈180 KB), prefixed with redfin_fail_
Why keep HTML: full-fidelity backup, re-parse without re-scraping,
targeted debugging
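Because failures are flagged in the filename, snapshots can be triaged without opening the large files. A sketch; the .html extension is an assumption, since the listing truncates it:

```python
from pathlib import Path

def split_html_snapshots(html_root):
    """Partition saved snapshots into (successes, failures).

    Failure pages carry a 'redfin_fail_' filename prefix, so a name
    check is enough; no need to read the ~1 MB files.
    """
    ok, failed = [], []
    for page in sorted(Path(html_root).glob("*/*.html")):
        (failed if page.name.startswith("redfin_fail_") else ok).append(page)
    return ok, failed
```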
5) execution_colabs/
Purpose: Artifacts from each Google Colab scraping session.
Structure:
execution_colabs/
├── 6855838179985639936/
│ ├── google-chrome-stable_current_amd64.deb (117 MB)
│ ├── [Link] (5.4 MB)
│ ├── content (Jupyter notebook, 2.9 MB)
│ └── ...
├── 8721370211984343040/
└── ... (more execution IDs)
Notes:
One folder per run (unique execution ID)
The bundled Chrome .deb indicates headless Chrome automation, most likely driven by Selenium
Intermediate HTML and notebook artifacts let you trace each run's exact timing and outputs
6) TX/ (Texas — Primary Geographical Index)
Purpose: Complete datasets organized by City → ZIP → Property ID.
Structure (sample):
TX/
├── Houston/
│ ├── 77055/
│ │ ├── 138101113G117459/
│ │ │ ├── [Link] (1,139 KB)
│ │ │ ├── [Link] (1 KB)
│ │ │ ├── [Link] (1 KB)
│ │ │ └── images/ # property-specific images
│ │ ├── 5741926959192973/
│ │ │ ├── [Link]
│ │ │ ├── [Link]
│ │ │ ├── [Link]
│ │ │ ├── [Link]
│ │ │ └── images/
│ │ └── ... (more properties)
│ ├── 77032/
│ └── ... (more ZIPs)
├── Bunker Hill Village/
├── Hedwig Village/
├── West University Place/
└── ... (more cities)
Notes:
Per-property folders (unique IDs) with dated snapshots: html, xlsx,
txt, and an images/ subfolder
Supports fine-grained lineage (property-level change history across
dates)
This is your MOST IMPORTANT folder with:
Level 1: Cities/Areas
Houston (main city)
Bunker Hill Village
Hedwig Village
West University Place
Southside Place
South Houston
Piney Point Village
North Houston
Hunters Creek Village
Jacinto City
Level 2: Zip Codes
77055, 77032, 77068, 77037, 77038, etc.
Level 3: Property IDs
Long unique numbers (138101113G117459, 5741926959192973)
These are Redfin property identifiers
Level 4: Historical Data for Each Property
HTML files (1.1 MB): Full listing page
Excel files (1 KB): Structured property data
Text files (1 KB): Basic property info
images folder: Property photos
What's POWERFUL about this structure:
Historical tracking: Multiple dates per property
- Example: Property 5741926959192973 has data from both 2025-09-07 AND 2025-08-20
- You can track price changes, status changes, new photos
Complete data: Every property has 4 data types (HTML, Excel, text, images)
Easy analysis: Can aggregate by city, ZIP code, or individual property
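The four-level hierarchy above can be walked programmatically. A sketch that indexes snapshot files per property; folder names in the test are illustrative, and the images/ subfolders are excluded because only files are collected:

```python
from pathlib import Path

def property_index(tx_root):
    """Map (city, zip_code, property_id) -> sorted snapshot filenames,
    mirroring the TX/<City>/<ZIP>/<PropertyID>/ layout."""
    index = {}
    for prop_dir in Path(tx_root).glob("*/*/*"):
        if not prop_dir.is_dir():
            continue
        city = prop_dir.parent.parent.name
        zip_code = prop_dir.parent.name
        files = sorted(p.name for p in prop_dir.iterdir() if p.is_file())
        index[(city, zip_code, prop_dir.name)] = files
    return index
```

The same index supports aggregation at any level: group keys by city, by ZIP, or inspect one property's dated snapshots.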
7) UT/ (likely Utah or a university area)
Similar structure to the TX folder, organized by location.
📊 WHAT THIS DATA COLLECTION ENABLES
Current Capabilities:
1. Geographic Analysis
   - Compare neighborhoods by price
   - Map property density by ZIP code
   - Track which areas have the most listings
2. Price Analysis
   - Price per square foot by neighborhood
   - Price ranges per ZIP code
   - Historical price changes for individual properties
3. Property Features
   - Distribution of bedrooms/bathrooms
   - Square footage ranges
   - Lot sizes
4. Image-Based Search
   - Visual similarity matching
   - Property style classification
   - Neighborhood aesthetic analysis
5. Historical Tracking
   - How long properties stay on the market
   - Price changes over time
   - Status changes (For Sale → Sold)
🎯 DATA QUALITY ASSESSMENT
Strengths:
✅ Well-organized hierarchy (State → City → Zip → Property)
✅ Multiple data formats (JSON, Excel, HTML, images)
✅ Historical tracking (multiple dates per property)
✅ Complete metadata (URLs, images, all property fields)
✅ Error tracking (redfin_fail files)
Observations:
⚠️ Some properties have "N/A" for Address, City, State
⚠️ Some fields show "—" for beds/baths
⚠️ Mix of different date ranges (suggests ongoing collection)
💡 RECOMMENDED DATA PROCESSING STEPS
Based on what you have, here's what you should do:
Step 1: Data Consolidation
# Combine all parsed_output files into one master dataset
# Result: Single CSV/database with ~10,000+ properties
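A minimal consolidation sketch, assuming each JSON file holds a list of property dicts (adjust if the real files wrap records differently):

```python
import csv
import json
from pathlib import Path

def consolidate(parsed_root, out_csv):
    """Merge every parsed_output/<date>/*.json file into one master CSV,
    tagging each record with its session date. Returns the row count."""
    rows = []
    for jf in sorted(Path(parsed_root).glob("*/*.json")):
        data = json.loads(jf.read_text(encoding="utf-8"))
        for rec in (data if isinstance(data, list) else [data]):
            rec["scrape_date"] = jf.parent.name
            rows.append(rec)
    fields = sorted({k for r in rows for k in r})   # union of all columns
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```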
Step 2: Data Cleaning
# Fix N/A addresses using Property IDs to look up in TX folder
# Convert price strings to numbers
# Handle missing beds/baths
# Standardize zip codes
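The price and ZIP normalizations might look like this; the "N/A" and "—" sentinels come from the observations noted earlier:

```python
import re

def clean_price(raw):
    """'$224,786' -> 224786.0; None for 'N/A', '—', or empty values."""
    if raw in (None, "", "N/A", "—"):
        return None
    digits = re.sub(r"[^\d.]", "", str(raw))
    return float(digits) if digits else None

def clean_zip(raw):
    """Normalize ZIPs to five-digit strings (e.g. 77037.0 -> '77037')."""
    m = re.search(r"\d{5}", str(raw))
    return m.group(0) if m else None
```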
Step 3: Geocoding
# Use addresses to get lat/long
# Already have zip codes which helps
# harris_county_zipcodes gives you neighborhood context
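Before calling any geocoding service, the parsed columns need assembling into one-line query strings. A sketch; the separate Address/City/State/ZIP column names are assumptions, since the sample listing may combine them:

```python
def full_address(row):
    """Build a one-line geocodable address from the parsed columns.

    Returns None when any component is missing or 'N/A', so those rows
    can be repaired from the TX folder before geocoding.
    """
    parts = [row.get(k) for k in ("Address", "City", "State", "ZIP Code")]
    if any(p in (None, "", "N/A") for p in parts):
        return None
    return "{}, {}, {} {}".format(*parts)
```

The resulting string can then be handed to whichever geocoder you choose (or joined against ZIP-level centroids as a fallback).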
Step 4: Image Processing
# Link images to property IDs
# Organize by property_id → multiple images
# Upload to Cloud Storage
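Linking images to property IDs falls out of the TX folder layout directly; a sketch (folder names in the test are illustrative):

```python
from pathlib import Path

def images_by_property(tx_root):
    """Map property_id -> sorted image paths found under
    TX/<City>/<ZIP>/<PropertyID>/images/."""
    out = {}
    for img_dir in Path(tx_root).glob("*/*/*/images"):
        out[img_dir.parent.name] = sorted(
            str(p) for p in img_dir.iterdir() if p.is_file()
        )
    return out
```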
Step 5: Upload to BigQuery
# Create main properties table from parsed_output
# Create images table linking to properties
# Create historical_data table from TX folder multi-date entries
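Illustrative BigQuery DDL for the three tables; every dataset, table, and column name here is an assumption sketched from the fields above, not an existing schema:

```sql
-- Hypothetical schema sketch; rename to match your actual dataset.
CREATE TABLE propsight.properties (
  property_id   STRING NOT NULL,   -- Redfin identifier from the TX folder
  address       STRING,
  city          STRING,
  state         STRING,
  zip_code      STRING,
  price         NUMERIC,
  beds          FLOAT64,
  baths         FLOAT64,
  sqft          INT64,
  status        STRING,
  property_url  STRING,
  scrape_date   DATE
);

CREATE TABLE propsight.images (
  property_id   STRING NOT NULL,   -- joins to properties.property_id
  image_uri     STRING             -- Cloud Storage path after upload
);

CREATE TABLE propsight.historical_data (
  property_id   STRING NOT NULL,
  snapshot_date DATE,              -- one row per dated TX-folder snapshot
  price         NUMERIC,
  status        STRING
);
```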
📈 DATA VOLUME ESTIMATE
Based on what I see:
~10-20 zip codes being tracked
~50-100 properties per zip code
~1,000-2,000 total properties
~5-10 images per property
~5,000-20,000 total images
Historical data spanning 2-3 months