Property Scraper

A Python-based web scraping system for property listings with intelligent extraction and LLM-powered filtering.

Features

  • Web Scraping: Automated scraping with throttling and deduplication
  • Hybrid Extraction: NuExtract for structured fields + an LLM for semantic analysis (see the sketch after this list)
  • Intelligent Filtering: LLM-powered scoring and ranking based on your criteria
  • n8n Integration: CLI commands ready for automation workflows
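
The hybrid extraction is easiest to picture as a pipeline: each stage receives a listing record and returns it with more fields filled in, so stages can be added or reordered independently. Below is a minimal sketch of that pattern; the class names are illustrative, not the project's actual API.

from typing import Protocol

class Extractor(Protocol):
    # One pipeline stage: take a partially filled listing, return it enriched.
    def extract(self, listing: dict) -> dict: ...

class StructuredExtractor:
    # Stage 1: structured fields (price, rooms, area); NuExtract would run here.
    def extract(self, listing: dict) -> dict:
        listing.setdefault("price", None)
        return listing

class SemanticExtractor:
    # Stage 2: semantic fields (features, issues, summary); the LLM call would run here.
    def extract(self, listing: dict) -> dict:
        listing.setdefault("summary", "")
        return listing

def run_pipeline(listing: dict, stages: list[Extractor]) -> dict:
    for stage in stages:
        listing = stage.extract(listing)
    return listing

listing = run_pipeline({"url": "https://example.com/listing/1"},
                       [StructuredExtractor(), SemanticExtractor()])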

Setup

1. Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

2. Install Dependencies

pip install -r requirements.txt

3. Configure Environment

Copy the example environment file and edit it:

cp .env.example .env

Edit .env and add:

  • Your Anthropic API key
  • Target website URL
  • Your evaluation criteria (budget, preferred regions, etc.)
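
For example, a minimal .env might look like the lines below. The variable names here are illustrative; check .env.example for the keys the project actually reads.

# Illustrative .env; the real keys live in .env.example
ANTHROPIC_API_KEY=sk-ant-your-key-here
TARGET_URL=https://www.example.com/real-estate
MAX_BUDGET_EUR=250000
PREFERRED_REGIONS=Riga,Jurmala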

4. Initialize Database

python main.py init-db

This creates the SQLite database with all necessary tables.
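
Under the hood this amounts to creating every table registered on the SQLAlchemy declarative base, roughly as sketched below; the database file name and path are assumptions based on the project layout.

import os
from sqlalchemy import create_engine
from sqlalchemy.orm import DeclarativeBase

class Base(DeclarativeBase):
    # Stand-in for the declarative base in src/database/models.py;
    # every model registered on it becomes a table.
    pass

os.makedirs("data", exist_ok=True)                   # the database lives in data/
engine = create_engine("sqlite:///data/scraper.db")  # file name is an assumption
Base.metadata.create_all(engine)                     # idempotent: creates only missing tables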

Project Structure

ss-scraper/
├── src/
│   ├── config.py              # Configuration management
│   ├── database/
│   │   ├── models.py          # SQLAlchemy models
│   │   └── db.py              # Database connection
│   ├── scraper/               # Web scraping (Phase 2)
│   ├── extraction/            # Data extraction (Phase 3)
│   ├── llm/                   # LLM evaluation (Phase 4)
│   └── utils/                 # Utility functions
├── data/                      # SQLite database
├── logs/                      # Application logs
├── .env                       # Environment variables (not in git)
├── main.py                    # CLI entry point
└── requirements.txt           # Python dependencies
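
src/config.py centralizes configuration. One common pattern for this, sketched below on the assumption that pydantic-settings is used (the repo may do it differently), is a settings class that reads .env automatically:

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Field names are illustrative; see .env.example for the real keys.
    anthropic_api_key: str = ""
    target_url: str = ""
    database_url: str = "sqlite:///data/scraper.db"

    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

settings = Settings()  # values are pulled from the environment and .env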

Database Schema

scraped_pages

Tracks every scraped listing page so the same URL is not processed twice.

listings

Extracted property information, including:

  • Structured fields: title, price, area, rooms, coordinates, etc.
  • LLM features: interesting_features, potential_issues, summary
  • Image URLs
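
As a sketch, the listings model might look roughly like this in SQLAlchemy 2.0 style; the column names mirror the bullets above, but the real model in src/database/models.py is authoritative.

from sqlalchemy import Float, Integer, String, Text
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Listing(Base):
    __tablename__ = "listings"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    # Structured fields
    title: Mapped[str] = mapped_column(String(255))
    price: Mapped[float | None] = mapped_column(Float)
    area_m2: Mapped[float | None] = mapped_column(Float)
    rooms: Mapped[int | None] = mapped_column(Integer)
    latitude: Mapped[float | None] = mapped_column(Float)
    longitude: Mapped[float | None] = mapped_column(Float)
    # LLM-derived fields
    interesting_features: Mapped[str | None] = mapped_column(Text)
    potential_issues: Mapped[str | None] = mapped_column(Text)
    summary: Mapped[str | None] = mapped_column(Text)
    # Image URLs, e.g. stored as a JSON-encoded list
    image_urls: Mapped[str | None] = mapped_column(Text)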

llm_evaluations

LLM analysis with scores, reasoning, and match criteria.

CLI Commands

# Initialize database
python main.py init-db

# Phase 2+ (not yet implemented)
python main.py scrape              # Run scraper
python main.py extract             # Extract data
python main.py evaluate            # Evaluate with LLM
python main.py full-pipeline       # Run everything
python main.py list-interesting    # Show high-scoring listings
python main.py stats               # Show statistics
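
main.py is the single entry point for all of these. A skeleton of how the commands could be wired up with click is sketched below; the option names and bodies are illustrative, and the real implementations will differ.

import click

@click.group()
def cli():
    """Property scraper CLI."""

@cli.command("init-db")
def init_db():
    """Create the SQLite database and tables."""
    click.echo("Initializing database...")

@cli.command()
@click.option("--source", default=None, help="Restrict extraction to one source site.")
@click.option("--limit", default=None, type=int, help="Maximum number of pages to process.")
def extract(source, limit):
    """Run the extraction pipeline on unprocessed pages."""
    click.echo(f"Extracting (source={source}, limit={limit})")

if __name__ == "__main__":
    cli()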

Development Status

  • Phase 1: Foundation Setup (complete)
  • Phase 2: Web Scraper (pending)
  • Phase 3: Data Extraction (pending)
  • Phase 4: LLM Filtering (pending)
  • Phase 5: CLI & Integration (pending)

Next Steps

See IMPLEMENTATION_PLAN.md for the detailed implementation roadmap.