📁 DATA STRUCTURE OVERVIEW
Main Folder: propsight_datacollection
This is your master data collection folder containing 7 main directories,
each serving a specific purpose in your data pipeline.
Top-Level Layout
propsight_datacollection/
├── parsed_output/      # 11 date folders of cleaned, structured data
├── input_files/
├── images/
├── html_output/
├── execution_colabs/
├── TX/
└── UT/
DETAILED BREAKDOWN OF EACH FOLDER
1) parsed_output/
Purpose: Date-organized, cleaned, and structured outputs from each scraping session, ready for analysis.
Top-level structure:
parsed_output/
├── 2025-10-12/ # 74 items (one file per ZIP)
│ ├── redfin_77037_2025-[Link] (14 KB)
│ ├── redfin_77038_2025-[Link] (7 KB)
│ ├── redfin_77032_2025-[Link] (7 KB)
│ └── ... (more zip code files)
├── 2025-10-05/
├── 2025-09-28/
└── ... (more dates)
File types & sizes:
JSON: 14–18 KB (structured, programmatic use)
Excel (.xlsx): 5–7 KB (quick analysis, spreadsheets)
What’s inside each file: one ZIP code per file (e.g., 77037, 77038,
77032).
Sample columns (from Excel):
ID
Address
City, State, ZIP Code
Price # e.g., $224,786; $80,000; $144,950
Beds # 2, 3, 4, ...
Baths # 1.5, 2, 2.5, ...
Square Footage # 1,920; 1,000; 1,010; ...
Lot Size
Status # e.g., For Sale ("ABOUT THIS HOME" values look like stray page text captured during parsing)
Property URL # Redfin link
Image URL # CDN link
Observed ranges (sample):
Prices: $80,000 → $450,000+
Beds: 2–4
Baths: 1–3.5
Area: ~1,000 → 2,842+ sq ft
Links: Redfin property pages + CDN image URLs
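Given the naming convention above, each file's ZIP code and session date can be recovered from the filename itself. A minimal sketch in Python; note the listing truncates filenames after "2025-", so the full YYYY-MM-DD date format is an assumption:

```python
import re

# Assumes redfin_<ZIP>_<YYYY-MM-DD>... filenames; the exact suffix is
# truncated in the listing above, so only the prefix fields are pinned down.
FILENAME_RE = re.compile(r"redfin_(?P<zip>\d{5})_(?P<date>\d{4}-\d{2}-\d{2})")

def parse_filename(name):
    """Extract (zip_code, scrape_date) from a parsed_output filename,
    or None when the name doesn't follow the convention."""
    m = FILENAME_RE.search(name)
    return (m.group("zip"), m.group("date")) if m else None
```

This also returns None for `redfin_fail_`-prefixed names, since "fail" breaks the five-digit ZIP slot.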
2) input_files/ (1 file)
Purpose: Parameters/reference lists for scraping runs.
Contents:
input_files/
└── harris_county_zipcodes.xlsx (2 KB)
Columns:
Rank
Zip Code
Neighborhood
Median Price per Sq. Ft. # e.g., $498, $363, $355, ...
Sample entries:
1. 77005 — West University — $498/sq ft
2. 77024 — Memorial — $363/sq ft
3. 77027 — River Oaks — $355/sq ft
4. 77008 — The Heights — $312/sq ft
…
Use: Which ZIPs to target, neighborhood names, expected price ranges
(validation), and priority (rank).
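One way to turn this sheet into a prioritized target list, sketched with pandas (column names follow the sample above; rename if the actual spreadsheet differs):

```python
import pandas as pd

def top_zipcodes(df: pd.DataFrame, n: int = 5) -> list:
    """Return the n highest-priority ZIP codes from the reference sheet,
    ordered by the 'Rank' column."""
    return (
        df.sort_values("Rank")
          .head(n)["Zip Code"]
          .astype(str)
          .tolist()
    )

# Typical usage (reading .xlsx requires openpyxl):
# zips = top_zipcodes(pd.read_excel("input_files/harris_county_zipcodes.xlsx"))
```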
3) images/
Purpose: Photo assets by collection date and ZIP.
Structure:
images/
├── 2025-10-12/
│ ├── 77068/
│ │ ├── [Link] (76 KB)
│ │ ├── [Link] (41 KB)
│ │ ├── [Link] (39 KB)
│ │ └── ... (more images)
│ ├── 77032/
│ └── ...
├── 2025-10-05/
└── ... (more dates)
Notes:
JPGs: ~27–77 KB each
Filenames: hash-based (unique)
Typical content: exterior/interior property photos (often from Redfin
CDN)
Used for image-based search features
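A quick inventory of this tree is possible with a short sketch, assuming the images/<date>/<zip>/*.jpg layout shown above:

```python
from pathlib import Path

def image_inventory(images_root):
    """Count JPGs per (date, zip_code) folder in the images/<date>/<zip>/ layout."""
    counts = {}
    for jpg in Path(images_root).glob("*/*/*.jpg"):
        key = (jpg.parent.parent.name, jpg.parent.name)   # (date, zip_code)
        counts[key] = counts.get(key, 0) + 1
    return counts
```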
4) html_output/
Purpose: Raw HTML snapshots saved during scraping (for
re-parsing/debugging).
Structure:
html_output/
├── 2025-10-12/
│ ├── redfin_77034_2025-[Link] (908 KB)
│ ├── redfin_77035_2025-[Link] (1,202 KB)
│ ├── redfin_fail_77028_2025-[Link] (180 KB) # failed/empty page
│ └── ... (more HTML files)
└── ... (more dates)
Notes:
Success pages: large (≈0.9–1.2 MB)
Failure pages: smaller (≈180 KB), prefixed with redfin_fail_
Why keep HTML: full-fidelity backup, re-parse without re-scraping,
targeted debugging
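Because failures are flagged in the filename, snapshots can be triaged without opening the large files. A sketch; the .html extension is an assumption, since the listing truncates it:

```python
from pathlib import Path

def split_html_snapshots(html_root):
    """Partition saved snapshots into (successes, failures).

    Failure pages carry a 'redfin_fail_' filename prefix, so a name
    check is enough; no need to read the ~1 MB files.
    """
    ok, failed = [], []
    for page in sorted(Path(html_root).glob("*/*.html")):
        (failed if page.name.startswith("redfin_fail_") else ok).append(page)
    return ok, failed
```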
5) execution_colabs/
Purpose: Artifacts from each Google Colab scraping session.
Structure:
execution_colabs/
├── 6855838179985639936/
│ ├── google-chrome-stable_current_amd64.deb (117 MB)
│ ├── [Link] (5.4 MB)
│ ├── content (Jupyter notebook, 2.9 MB)
│ └── ...
├── 8721370211984343040/
└── ... (more execution IDs)
Notes:
One folder per run (unique execution ID)
The bundled Chrome .deb indicates headless Chrome automation, most likely driven by Selenium
Intermediate HTML and notebook artifacts let you trace each run's exact timing and outputs
6) TX/ (Texas — Primary Geographical Index)
Purpose: Complete datasets organized by City → ZIP → Property ID.
Structure (sample):
TX/
├── Houston/
│ ├── 77055/
│ │ ├── 138101113G117459/
│ │ │ ├── [Link] (1,139 KB)
│ │ │ ├── [Link] (1 KB)
│ │ │ ├── [Link] (1 KB)
│ │ │ └── images/ # property-specific images
│ │ ├── 5741926959192973/
│ │ │ ├── [Link]
│ │ │ ├── [Link]
│ │ │ ├── [Link]
│ │ │ ├── [Link]
│ │ │ └── images/
│ │ └── ... (more properties)
│ ├── 77032/
│ └── ... (more ZIPs)
├── Bunker Hill Village/
├── Hedwig Village/
├── West University Place/
└── ... (more cities)
Notes:
Per-property folders (unique IDs) with dated snapshots: html, xlsx,
txt, and an images/ subfolder
Supports fine-grained lineage (property-level change history across
dates)
This is your MOST IMPORTANT folder with:
Level 1: Cities/Areas
Houston (main city)
Bunker Hill Village
Hedwig Village
West University Place
Southside Place
South Houston
Piney Point Village
North Houston
Hunters Creek Village
Jacinto City
Level 2: Zip Codes
77055, 77032, 77068, 77037, 77038, etc.
Level 3: Property IDs
Long unique numbers (138101113G117459, 5741926959192973)
These are Redfin property identifiers
Level 4: Historical Data for Each Property
HTML files (1.1 MB): Full listing page
Excel files (1 KB): Structured property data
Text files (1 KB): Basic property info
images folder: Property photos
What's POWERFUL about this structure:
Historical tracking: Multiple dates per property
- Example: Property 5741926959192973 has data from both 2025-09-07 AND 2025-08-20
- You can track price changes, status changes, new photos
Complete data: Every property has 4 data types (HTML, Excel, text, images)
Easy analysis: Can aggregate by city, ZIP code, or individual property
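The four-level hierarchy above can be walked programmatically. A sketch that indexes snapshot files per property; folder names in the test are illustrative, and the images/ subfolders are excluded because only files are collected:

```python
from pathlib import Path

def property_index(tx_root):
    """Map (city, zip_code, property_id) -> sorted snapshot filenames,
    mirroring the TX/<City>/<ZIP>/<PropertyID>/ layout."""
    index = {}
    for prop_dir in Path(tx_root).glob("*/*/*"):
        if not prop_dir.is_dir():
            continue
        city = prop_dir.parent.parent.name
        zip_code = prop_dir.parent.name
        files = sorted(p.name for p in prop_dir.iterdir() if p.is_file())
        index[(city, zip_code, prop_dir.name)] = files
    return index
```

The same index supports aggregation at any level: group keys by city, by ZIP, or inspect one property's dated snapshots.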
7) UT/ (likely Utah or a university area)
Similar structure to the TX folder, organized by location.
📊 WHAT THIS DATA COLLECTION ENABLES
Current Capabilities:
1. Geographic Analysis
   - Compare neighborhoods by price
   - Map property density by ZIP code
   - Track which areas have the most listings
2. Price Analysis
   - Price per square foot by neighborhood
   - Price ranges per ZIP code
   - Historical price changes for individual properties
3. Property Features
   - Distribution of bedrooms/bathrooms
   - Square footage ranges
   - Lot sizes
4. Image-Based Search
   - Visual similarity matching
   - Property style classification
   - Neighborhood aesthetic analysis
5. Historical Tracking
   - How long properties stay on the market
   - Price changes over time
   - Status changes (For Sale → Sold)
🎯 DATA QUALITY ASSESSMENT
Strengths:
✅ Well-organized hierarchy (State → City → Zip → Property)
✅ Multiple data formats (JSON, Excel, HTML, images)
✅ Historical tracking (multiple dates per property)
✅ Complete metadata (URLs, images, all property fields)
✅ Error tracking (redfin_fail files)
Observations:
⚠️ Some properties have "N/A" for Address, City, State
⚠️ Some fields show "—" for beds/baths
⚠️ Mix of different date ranges (suggests ongoing collection)
💡 RECOMMENDED DATA PROCESSING STEPS
Based on what you have, here's what you should do:
Step 1: Data Consolidation
# Combine all parsed_output files into one master dataset
# Result: Single CSV/database with ~10,000+ properties
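A minimal consolidation sketch, assuming each JSON file holds a list of property dicts (adjust if the real files wrap records differently):

```python
import csv
import json
from pathlib import Path

def consolidate(parsed_root, out_csv):
    """Merge every parsed_output/<date>/*.json file into one master CSV,
    tagging each record with its session date. Returns the row count."""
    rows = []
    for jf in sorted(Path(parsed_root).glob("*/*.json")):
        data = json.loads(jf.read_text(encoding="utf-8"))
        for rec in (data if isinstance(data, list) else [data]):
            rec["scrape_date"] = jf.parent.name
            rows.append(rec)
    fields = sorted({k for r in rows for k in r})   # union of all columns
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```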
Step 2: Data Cleaning
# Fix N/A addresses using Property IDs to look up in TX folder
# Convert price strings to numbers
# Handle missing beds/baths
# Standardize zip codes
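The price and ZIP normalizations might look like this; the "N/A" and "—" sentinels come from the observations noted earlier:

```python
import re

def clean_price(raw):
    """'$224,786' -> 224786.0; None for 'N/A', '—', or empty values."""
    if raw in (None, "", "N/A", "—"):
        return None
    digits = re.sub(r"[^\d.]", "", str(raw))
    return float(digits) if digits else None

def clean_zip(raw):
    """Normalize ZIPs to five-digit strings (e.g. 77037.0 -> '77037')."""
    m = re.search(r"\d{5}", str(raw))
    return m.group(0) if m else None
```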
Step 3: Geocoding
# Use addresses to get lat/long
# Already have zip codes which helps
# harris_county_zipcodes gives you neighborhood context
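Before calling any geocoding service, the parsed columns need assembling into one-line query strings. A sketch; the separate Address/City/State/ZIP column names are assumptions, since the sample listing may combine them:

```python
def full_address(row):
    """Build a one-line geocodable address from the parsed columns.

    Returns None when any component is missing or 'N/A', so those rows
    can be repaired from the TX folder before geocoding.
    """
    parts = [row.get(k) for k in ("Address", "City", "State", "ZIP Code")]
    if any(p in (None, "", "N/A") for p in parts):
        return None
    return "{}, {}, {} {}".format(*parts)
```

The resulting string can then be handed to whichever geocoder you choose (or joined against ZIP-level centroids as a fallback).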
Step 4: Image Processing
# Link images to property IDs
# Organize by property_id → multiple images
# Upload to Cloud Storage
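Linking images to property IDs falls out of the TX folder layout directly; a sketch (folder names in the test are illustrative):

```python
from pathlib import Path

def images_by_property(tx_root):
    """Map property_id -> sorted image paths found under
    TX/<City>/<ZIP>/<PropertyID>/images/."""
    out = {}
    for img_dir in Path(tx_root).glob("*/*/*/images"):
        out[img_dir.parent.name] = sorted(
            str(p) for p in img_dir.iterdir() if p.is_file()
        )
    return out
```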
Step 5: Upload to BigQuery
# Create main properties table from parsed_output
# Create images table linking to properties
# Create historical_data table from TX folder multi-date entries
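Illustrative BigQuery DDL for the three tables; every dataset, table, and column name here is an assumption sketched from the fields above, not an existing schema:

```sql
-- Hypothetical schema sketch; rename to match your actual dataset.
CREATE TABLE propsight.properties (
  property_id   STRING NOT NULL,   -- Redfin identifier from the TX folder
  address       STRING,
  city          STRING,
  state         STRING,
  zip_code      STRING,
  price         NUMERIC,
  beds          FLOAT64,
  baths         FLOAT64,
  sqft          INT64,
  status        STRING,
  property_url  STRING,
  scrape_date   DATE
);

CREATE TABLE propsight.images (
  property_id   STRING NOT NULL,   -- joins to properties.property_id
  image_uri     STRING             -- Cloud Storage path after upload
);

CREATE TABLE propsight.historical_data (
  property_id   STRING NOT NULL,
  snapshot_date DATE,              -- one row per dated TX-folder snapshot
  price         NUMERIC,
  status        STRING
);
```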
📈 DATA VOLUME ESTIMATE
Based on what I see:
~10-20 zip codes being tracked
~50-100 properties per zip code
~1,000-2,000 total properties
~5-10 images per property
~5,000-20,000 total images
Historical data spanning 2-3 months