Data Structure for Real Estate Analysis

Uploaded by jeffSyed

📁 DATA STRUCTURE OVERVIEW

Main Folder: propsight_datacollection

This is your master data collection folder containing 7 main directories, each serving a specific purpose in your data pipeline.

DETAILED BREAKDOWN OF EACH FOLDER

Top-Level Layout

propsight_datacollection/
├── parsed_output/
├── input_files/
├── images/
├── html_output/
├── execution_colabs/
├── TX/
└── UT/

1) parsed_output/ (11 date folders)

Purpose: Cleaned, structured data ready for analysis, organized by scraping-session date.

Top-level structure:

parsed_output/
├── 2025-10-12/                           # 74 items (one file per ZIP)
│   ├── redfin_77037_2025-[Link] (14 KB)
│   ├── redfin_77038_2025-[Link] (7 KB)
│   ├── redfin_77032_2025-[Link] (7 KB)
│   └── ... (more ZIP code files)
├── 2025-10-05/
├── 2025-09-28/
└── ... (more dates)

File types & sizes:

• JSON: 14–18 KB (structured, programmatic use)
• Excel (.xlsx): 5–7 KB (quick analysis, spreadsheets)

What's inside each file: one ZIP code per file (e.g., 77037, 77038, 77032).

Sample columns (from Excel):

ID
Address
City, State, ZIP Code
Price            # e.g., $224,786; $80,000; $144,950
Beds             # 2, 3, 4, ...
Baths            # 1.5, 2, 2.5, ...
Square Footage   # 1,920; 1,000; 1,010; ...
Lot Size
Status           # For Sale; occasionally stray page text such as "ABOUT THIS HOME"
Property URL     # Redfin link
Image URL        # CDN link

Observed ranges (sample):

• Prices: $80,000 → $450,000+
• Beds: 2–4
• Baths: 1–3.5
• Area: ~1,000 → 2,842+ sq ft
• Links: Redfin property pages + CDN image URLs
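Because each parsed_output file encodes its ZIP code and scrape date in the filename, both can be recovered without opening the file. A minimal sketch, assuming a redfin_<zip>_<YYYY-MM-DD>.<ext> pattern with a redfin_fail_ prefix on failed pages (the exact extensions are truncated above, so the pattern is an assumption):

```python
import re

# Assumed pattern: redfin_<5-digit ZIP>_<ISO date>.<ext>;
# failed pages carry a redfin_fail_ prefix (seen in html_output/).
FILENAME_RE = re.compile(
    r"^redfin_(?P<fail>fail_)?(?P<zip>\d{5})_(?P<date>\d{4}-\d{2}-\d{2})\.\w+$"
)

def parse_scrape_filename(name):
    """Return (zip_code, date, is_fail), or None if the name doesn't match."""
    m = FILENAME_RE.match(name)
    if not m:
        return None
    return (m.group("zip"), m.group("date"), bool(m.group("fail")))
```

For example, parse_scrape_filename("redfin_77037_2025-10-12.json") yields ("77037", "2025-10-12", False), which is enough to route each file to the right date/ZIP bucket during consolidation.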

2) input_files/ (1 file)

Purpose: Parameters/reference lists for scraping runs.

Contents:

input_files/
└── harris_county_zipcodes.xlsx (2 KB)

Columns:

Rank
Zip Code
Neighborhood
Median Price per Sq. Ft.   # e.g., $498, $363, $355, ...

Sample entries:

1. 77005 — West University — $498/sq ft

2. 77024 — Memorial — $363/sq ft

3. 77027 — River Oaks — $355/sq ft

4. 77008 — The Heights — $312/sq ft


Use: which ZIPs to target, neighborhood names, expected price ranges (for validation), and priority (rank).
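The median $/sq ft column makes a simple sanity check possible: compare each scraped listing against its neighborhood median and flag large deviations. A sketch with an illustrative 3x tolerance (the threshold and the (id, price, sqft) row shape are assumptions, not from the source data):

```python
def flag_price_outliers(listings, median_ppsf, tolerance=3.0):
    """Flag listings whose price per sq ft deviates from the neighborhood
    median by more than `tolerance`x in either direction.

    listings: iterable of (listing_id, price, sqft) tuples;
    median_ppsf: the $/sq ft value from harris_county_zipcodes.xlsx.
    """
    flagged = []
    for listing_id, price, sqft in listings:
        if not price or not sqft:
            continue  # skip rows with missing data ("N/A" / "—")
        ppsf = price / sqft
        if ppsf > median_ppsf * tolerance or ppsf < median_ppsf / tolerance:
            flagged.append((listing_id, round(ppsf, 2)))
    return flagged
```

Listings that trip the check are likely parse errors (e.g., a lot-size value landing in the price column) rather than real properties.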

3) images/

Purpose: Photo assets by collection date and ZIP.

Structure:

images/
├── 2025-10-12/
│   ├── 77068/
│   │   ├── [Link] (76 KB)
│   │   ├── [Link] (41 KB)
│   │   ├── [Link] (39 KB)
│   │   └── ... (more images)
│   ├── 77032/
│   └── ...
├── 2025-10-05/
└── ... (more dates)

Notes:

• JPGs: ~27–77 KB each
• Filenames: hash-based (unique)
• Typical content: exterior/interior property photos (often from the Redfin CDN)
• Used for image-based search features

4) html_output/

Purpose: Raw HTML snapshots saved during scraping (for re-parsing/debugging).

Structure:

html_output/
├── 2025-10-12/
│   ├── redfin_77034_2025-[Link] (908 KB)
│   ├── redfin_77035_2025-[Link] (1,202 KB)
│   ├── redfin_fail_77028_2025-[Link] (180 KB)   # failed/empty page
│   └── ... (more HTML files)
└── ... (more dates)

Notes:

• Success pages: large (≈0.9–1.2 MB)
• Failure pages: smaller (≈180 KB), prefixed with redfin_fail_
• Why keep HTML: full-fidelity backup, re-parsing without re-scraping, targeted debugging
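The redfin_fail_ prefix makes it cheap to split a date folder into pages that parsed and pages that need re-scraping. A minimal sketch over a list of filenames:

```python
def classify_html_snapshots(filenames):
    """Split an html_output/<date>/ folder into successful and failed
    scrapes, using the redfin_fail_ prefix noted above."""
    ok, failed = [], []
    for name in filenames:
        (failed if name.startswith("redfin_fail_") else ok).append(name)
    return {"ok": ok, "failed": failed}
```

The "failed" list doubles as a retry queue for the next scraping session.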

5) execution_colabs/

Purpose: Artifacts from each Google Colab scraping session.

Structure:

execution_colabs/
├── 6855838179985639936/
│   ├── google-chrome-stable_current_amd64.deb (117 MB)
│   ├── [Link] (5.4 MB)
│   ├── content (Jupyter notebook, 2.9 MB)
│   └── ...
├── 8721370211984343040/
└── ... (more execution IDs)

Notes:

• One folder per run (unique execution ID)
• Shows the runs used headless Chrome, most likely driven by Selenium
• Contains intermediate HTML/notebook artifacts that let you trace each run's exact timing and outputs

6) TX/ (Texas — Primary Geographical Index)

Purpose: Complete datasets organized by City → ZIP → Property ID.

Structure (sample):

TX/
├── Houston/
│   ├── 77055/
│   │   ├── 138101113G117459/
│   │   │   ├── [Link] (1,139 KB)
│   │   │   ├── [Link] (1 KB)
│   │   │   ├── [Link] (1 KB)
│   │   │   └── images/   # property-specific images
│   │   ├── 5741926959192973/
│   │   │   ├── [Link]
│   │   │   ├── [Link]
│   │   │   ├── [Link]
│   │   │   ├── [Link]
│   │   │   └── images/
│   │   └── ... (more properties)
│   ├── 77032/
│   └── ... (more ZIPs)
├── Bunker Hill Village/
├── Hedwig Village/
├── West University Place/
└── ... (more cities)

Notes:

• Per-property folders (unique IDs) hold dated snapshots: html, xlsx, txt, and an images/ subfolder
• Supports fine-grained lineage (property-level change history across dates)

This is your MOST IMPORTANT folder, with:

Level 1: Cities/Areas

• Houston (main city)
• Bunker Hill Village
• Hedwig Village
• West University Place
• Southside Place
• South Houston
• Piney Point Village
• North Houston
• Hunters Creek Village
• Jacinto City

Level 2: Zip Codes

• 77055, 77032, 77068, 77037, 77038, etc.

Level 3: Property IDs

• Long unique identifiers (138101113G117459, 5741926959192973)
• These are Redfin property identifiers

Level 4: Historical Data for Each Property

• HTML files (1.1 MB): full listing page
• Excel files (1 KB): structured property data
• Text files (1 KB): basic property info
• images folder: property photos

What's POWERFUL about this structure:

• Historical tracking: multiple dates per property
  o Example: property 5741926959192973 has data from both 2025-09-07 AND 2025-08-20
  o You can track price changes, status changes, and new photos
• Complete data: every property has 4 data types (HTML, Excel, text, images)
• Easy analysis: you can aggregate by city, ZIP code, or individual property
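Because the layout is City → ZIP → Property ID, a property's snapshot history can be assembled just by grouping file paths. A sketch, assuming snapshot filenames embed an ISO date as in the 2025-09-07 / 2025-08-20 example (the actual filenames are truncated above, so this is an assumption):

```python
import re
from collections import defaultdict

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def build_history_index(paths):
    """Map property_id -> sorted list of snapshot dates, from paths shaped
    like TX/<City>/<ZIP>/<property_id>/<snapshot file>."""
    index = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        if len(parts) < 5:  # need at least TX/City/ZIP/property_id/file
            continue
        prop_id, filename = parts[3], parts[-1]
        m = DATE_RE.search(filename)
        if m:
            index[prop_id].add(m.group())
    return {pid: sorted(dates) for pid, dates in index.items()}
```

The resulting index is the backbone for the historical-tracking analyses below (price changes, time on market, status changes).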

7) UT/ (likely Utah, or a university area)

Purpose: Similar structure to the TX folder, organized by location.

📊 WHAT THIS DATA COLLECTION ENABLES

Current Capabilities:

1. Geographic Analysis

o Compare neighborhoods by price

o Map property density by zip code

o Track which areas have most listings

2. Price Analysis

o Price per square foot by neighborhood

o Price ranges per zip code

o Historical price changes for individual properties

3. Property Features

o Distribution of bedrooms/bathrooms

o Square footage ranges

o Lot sizes

4. Image-Based Search

o Visual similarity matching

o Property style classification

o Neighborhood aesthetic analysis

5. Historical Tracking

o How long properties stay on market

o Price changes over time

o Status changes (For Sale → Sold)

🎯 DATA QUALITY ASSESSMENT

Strengths:

✅ Well-organized hierarchy (State → City → Zip → Property)
✅ Multiple data formats (JSON, Excel, HTML, images)
✅ Historical tracking (multiple dates per property)
✅ Complete metadata (URLs, images, all property fields)
✅ Error tracking (redfin_fail files)

Observations:

⚠️ Some properties have "N/A" for Address, City, State
⚠️ Some fields show "—" for beds/baths
⚠️ Mix of different date ranges (suggests ongoing collection)

💡 RECOMMENDED DATA PROCESSING STEPS

Based on what you have, here's what you should do:

Step 1: Data Consolidation

# Combine all parsed_output files into one master dataset
# Result: a single CSV/database covering every scraped property (~1,000–2,000; see estimate below)
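Step 1 can be sketched with the standard library alone: merge the rows from every parsed file into one CSV, tagging each row with its source file and the scrape date taken from the parsed_output/<date>/ folder. The rows_by_file shape is an assumption for illustration (in practice it would come from reading the JSON/Excel files):

```python
import csv
import os

def consolidate(rows_by_file, out_csv):
    """Merge rows parsed from many parsed_output files into one master CSV.

    rows_by_file: {file_path: [row dicts with identical keys]}; the scrape
    date is taken from the parent folder (parsed_output/<date>/...).
    Returns the number of rows written.
    """
    fieldnames, all_rows = None, []
    for path, rows in rows_by_file.items():
        scrape_date = os.path.normpath(path).split(os.sep)[-2]  # date folder
        for row in rows:
            tagged = dict(row, source_file=path, scrape_date=scrape_date)
            all_rows.append(tagged)
            fieldnames = fieldnames or list(tagged)  # assumes uniform keys
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(all_rows)
    return len(all_rows)
```

Keeping source_file and scrape_date on every row preserves lineage, so any odd value in the master dataset can be traced back to the exact file it came from.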

Step 2: Data Cleaning

# Fix N/A addresses using Property IDs to look up in TX folder

# Convert price strings to numbers

# Handle missing beds/baths

# Standardize zip codes
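The price conversion in Step 2 is a one-liner worth pinning down, since the data mixes "$224,786"-style strings with the "N/A" and "—" placeholders noted in the quality assessment. A minimal sketch:

```python
import re

def clean_price(raw):
    """Convert scraped price strings ("$224,786") to integers; map the
    placeholders seen in the data ("N/A", "—", empty) to None.
    Assumes whole-dollar prices (no cents)."""
    if raw is None or str(raw).strip() in {"", "N/A", "—", "-"}:
        return None
    digits = re.sub(r"[^\d]", "", str(raw))
    return int(digits) if digits else None
```

The same pattern works for Square Footage ("1,920") and, with a float cast, for Baths ("1.5").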

Step 3: Geocoding

# Use addresses to get lat/long
# Already have zip codes, which helps
# harris_county_zipcodes gives you neighborhood context

Step 4: Image Processing

# Link images to property IDs

# Organize by property_id → multiple images

# Upload to Cloud Storage
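The property_id → images mapping in Step 4 falls straight out of the TX folder layout, because each property folder has its own images/ subfolder. A sketch over path strings:

```python
from collections import defaultdict

def link_images(image_paths):
    """Group photos by property ID, using the images/ subfolder under each
    TX/<City>/<ZIP>/<property_id>/ folder.
    Returns {property_id: [image paths]}."""
    by_property = defaultdict(list)
    for path in image_paths:
        parts = path.split("/")
        if "images" in parts:
            i = parts.index("images")
            if i >= 1:
                by_property[parts[i - 1]].append(path)  # folder above images/
    return dict(by_property)
```

The resulting mapping is what you would persist alongside the Cloud Storage upload, so each stored image stays joined to its property record.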

Step 5: Upload to BigQuery

# Create main properties table from parsed_output

# Create images table linking to properties

# Create historical_data table from TX folder multi-date entries

📈 DATA VOLUME ESTIMATE

Based on what I see:

• ~10–20 ZIP codes being tracked
• ~50–100 properties per ZIP code
• ~1,000–2,000 total properties
• ~5–10 images per property
• ~5,000–20,000 total images
• Historical data spanning 2–3 months
