Property Scraper

A Python-based web scraping system for property listings with intelligent extraction and LLM-powered filtering.

Features

  • Web Scraping: Automated scraping with throttling and deduplication
  • Hybrid Extraction: NuExtract for structured fields + LLM for semantic analysis
  • Intelligent Filtering: LLM-powered scoring and ranking based on your criteria
  • n8n Integration: CLI commands ready for automation workflows

Setup

1. Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

2. Install Dependencies

pip install -r requirements.txt

3. Configure Environment

Copy the example environment file and edit it:

cp .env.example .env

Edit .env and add:

  • Your Anthropic API key
  • Target website URL
  • Your evaluation criteria (budget, preferred regions, etc.)
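A hypothetical .env might look like the fragment below. The variable names here are illustrative assumptions; check .env.example for the actual keys the project reads.

```shell
# Illustrative only — see .env.example for the real variable names
ANTHROPIC_API_KEY=sk-ant-...
TARGET_URL=https://example.com/listings
MAX_BUDGET=150000
PREFERRED_REGIONS=Riga,Jurmala
```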

4. Initialize Database

python main.py init-db

This creates the SQLite database with all necessary tables.

Project Structure

ss-scraper/
├── src/
│   ├── config.py              # Configuration management
│   ├── database/
│   │   ├── models.py          # SQLAlchemy models
│   │   └── db.py              # Database connection
│   ├── scraper/               # Web scraping (Phase 2)
│   ├── extraction/            # Data extraction (Phase 3)
│   ├── llm/                   # LLM evaluation (Phase 4)
│   └── utils/                 # Utility functions
├── data/                      # SQLite database
├── logs/                      # Application logs
├── .env                       # Environment variables (not in git)
├── main.py                    # CLI entry point
└── requirements.txt           # Python dependencies

Database Schema

scraped_pages

Tracks all scraped listing pages to avoid re-processing.

listings

Extracted property information including:

  • Structured fields: title, price, area, rooms, coordinates, etc.
  • LLM features: interesting_features, potential_issues, summary
  • Image URLs

llm_evaluations

LLM analysis with scores, reasoning, and match criteria.
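As a rough sketch of how these three tables relate, the SQL below mirrors the descriptions above. The column names are assumptions for illustration; the actual schema is defined by the SQLAlchemy models in src/database/models.py.

```python
import sqlite3

# Illustrative schema only — the real models live in src/database/models.py.
SCHEMA = """
CREATE TABLE scraped_pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE,            -- uniqueness is what prevents re-processing
    scraped_at TEXT
);
CREATE TABLE listings (
    id INTEGER PRIMARY KEY,
    page_id INTEGER REFERENCES scraped_pages(id),
    title TEXT, price REAL, area REAL, rooms INTEGER,
    latitude REAL, longitude REAL,
    interesting_features TEXT,  -- filled in by the LLM
    potential_issues TEXT,
    summary TEXT,
    image_urls TEXT             -- e.g. a JSON-encoded list
);
CREATE TABLE llm_evaluations (
    id INTEGER PRIMARY KEY,
    listing_id INTEGER REFERENCES listings(id),
    score INTEGER,              -- 0-100; list-interesting defaults to >= 60
    reasoning TEXT,
    match_criteria TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['listings', 'llm_evaluations', 'scraped_pages']
```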

CLI Commands

# Initialize database
python main.py init-db

# Run individual phases
python main.py scrape              # Scrape listings from enabled sources
python main.py scrape ss.lv        # Scrape specific source
python main.py extract             # Extract data from scraped pages
python main.py extract ss.lv 10    # Extract from specific source with limit
python main.py evaluate            # Evaluate extracted listings with LLM
python main.py evaluate 5 --force  # Re-evaluate with limit

# Run complete pipeline
python main.py full-pipeline       # Scrape → Extract → Evaluate
python main.py full-pipeline ss.lv # Run pipeline for specific source

# View results
python main.py list-interesting            # Show interesting listings (score >= 60)
python main.py list-interesting --min-score 70  # Filter by minimum score
python main.py stats                       # Show statistics

# Web viewer
python viewer.py                   # Launch Flask app at http://localhost:5000
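The command surface above could be wired up with a dispatcher along these lines. This is a minimal sketch using argparse; the real main.py may use a different CLI library, and the handler wiring is omitted.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the CLI surface shown above (subcommand names mirror the docs)."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("init-db")

    scrape = sub.add_parser("scrape")
    scrape.add_argument("source", nargs="?")  # e.g. ss.lv; all sources if omitted

    extract = sub.add_parser("extract")
    extract.add_argument("source", nargs="?")
    extract.add_argument("limit", nargs="?", type=int)

    evaluate = sub.add_parser("evaluate")
    evaluate.add_argument("limit", nargs="?", type=int)
    evaluate.add_argument("--force", action="store_true")  # re-evaluate

    interesting = sub.add_parser("list-interesting")
    interesting.add_argument("--min-score", type=int, default=60)

    sub.add_parser("stats")

    pipeline = sub.add_parser("full-pipeline")
    pipeline.add_argument("source", nargs="?")
    return parser

args = build_parser().parse_args(["list-interesting", "--min-score", "70"])
print(args.command, args.min_score)  # list-interesting 70
```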

Development Status

  • Phase 1: Foundation Setup (complete)
  • Phase 2: Web Scraper (complete)
  • Phase 3: Data Extraction (complete)
  • Phase 4: LLM Evaluation (complete)
  • Phase 5: CLI & Integration (complete)

All core functionality is implemented. See IMPLEMENTATION_PLAN.md for architecture details.

Running Tests

# Activate venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=html