# Property Scraper

A Python-based web scraping system for property listings with intelligent extraction and LLM-powered filtering.
## Features

- **Web Scraping**: Automated scraping with throttling and deduplication
- **Hybrid Extraction**: NuExtract for structured fields + LLM for semantic analysis
- **Intelligent Filtering**: LLM-powered scoring and ranking based on your criteria
- **n8n Integration**: CLI commands ready for automation workflows
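The throttling and deduplication behavior can be sketched roughly as follows. This is a simplified illustration, not the actual `src/scraper` code; the class and parameter names are invented:

```python
import hashlib
import time


class PoliteFetcher:
    """Throttle requests and skip pages whose content was already seen."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval  # seconds between requests
        self._last_request = 0.0
        self._seen_hashes: set[str] = set()

    def throttle(self) -> None:
        # Sleep just long enough to respect the minimum interval.
        wait = self.min_interval - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()

    def is_duplicate(self, page_html: str) -> bool:
        # Hash the page body; identical content is processed only once.
        digest = hashlib.sha256(page_html.encode("utf-8")).hexdigest()
        if digest in self._seen_hashes:
            return True
        self._seen_hashes.add(digest)
        return False
```

In the real pipeline, seen hashes would persist in the database (see the `scraped_pages` table below) rather than in memory.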
## Setup

### 1. Create Virtual Environment

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

### 2. Install Dependencies

```bash
pip install -r requirements.txt
```
### 3. Configure Environment

Copy the example environment file and edit it:

```bash
cp .env.example .env
```

Edit `.env` and add:

- Your Anthropic API key
- The target website URL
- Your evaluation criteria (budget, preferred regions, etc.)
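A filled-in `.env` might look something like the following. The variable names here are illustrative, not confirmed; check `.env.example` for the actual keys:

```ini
# Hypothetical example -- see .env.example for the real variable names
ANTHROPIC_API_KEY=sk-ant-...
TARGET_URL=https://example.com/listings
MAX_BUDGET=150000
PREFERRED_REGIONS=Riga,Jurmala
```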
### 4. Initialize Database

```bash
python main.py init-db
```

This creates the SQLite database with all necessary tables.
## Project Structure

```
ss-scraper/
├── src/
│   ├── config.py         # Configuration management
│   ├── database/
│   │   ├── models.py     # SQLAlchemy models
│   │   └── db.py         # Database connection
│   ├── scraper/          # Web scraping (Phase 2)
│   ├── extraction/       # Data extraction (Phase 3)
│   ├── llm/              # LLM evaluation (Phase 4)
│   └── utils/            # Utility functions
├── data/                 # SQLite database
├── logs/                 # Application logs
├── .env                  # Environment variables (not in git)
├── main.py               # CLI entry point
└── requirements.txt      # Python dependencies
```
## Database Schema

### scraped_pages

Tracks all scraped listing pages to avoid re-processing.

### listings

Extracted property information, including:

- Structured fields: title, price, area, rooms, coordinates, etc.
- LLM features: interesting_features, potential_issues, summary
- Image URLs

### llm_evaluations

LLM analysis with scores, reasoning, and match criteria.
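In outline, the `listings` table maps to a SQLAlchemy model along these lines, and `init-db` boils down to creating all tables on the configured engine. This is a simplified sketch; the real models live in `src/database/models.py` and carry more columns:

```python
from sqlalchemy import Column, Float, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Listing(Base):
    """Simplified stand-in for the extracted-listing model."""

    __tablename__ = "listings"

    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    price = Column(Float)
    area = Column(Float)      # square metres
    rooms = Column(Integer)
    latitude = Column(Float)
    longitude = Column(Float)
    summary = Column(Text)    # LLM-generated summary


# In-memory engine for illustration; the app points at the database in data/.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
```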
## CLI Commands

```bash
# Initialize database
python main.py init-db

# Run individual phases
python main.py scrape                           # Scrape listings from enabled sources
python main.py scrape ss.lv                     # Scrape a specific source
python main.py extract                          # Extract data from scraped pages
python main.py extract ss.lv 10                 # Extract from a specific source with a limit
python main.py evaluate                         # Evaluate extracted listings with LLM
python main.py evaluate 5 --force               # Re-evaluate with a limit

# Run the complete pipeline
python main.py full-pipeline                    # Scrape → Extract → Evaluate
python main.py full-pipeline ss.lv              # Run the pipeline for a specific source

# View results
python main.py list-interesting                 # Show interesting listings (score >= 60)
python main.py list-interesting --min-score 70  # Filter by minimum score
python main.py stats                            # Show statistics

# Web viewer
python viewer.py                                # Launch Flask app at http://localhost:5000
```
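A subcommand layout like the one above can be wired up with `argparse` subparsers. A minimal sketch of the dispatch, illustrative only (`main.py` is the actual entry point and may be structured differently):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring a few of the CLI commands above."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("init-db")

    scrape = sub.add_parser("scrape")
    scrape.add_argument("source", nargs="?")  # e.g. ss.lv; default: all enabled sources

    interesting = sub.add_parser("list-interesting")
    interesting.add_argument("--min-score", type=int, default=60)

    return parser
```

Each subcommand would then dispatch on `args.command` to the corresponding phase.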
## Development Status
- ✅ Phase 1: Foundation Setup (complete)
- ✅ Phase 2: Web Scraper (complete)
- ✅ Phase 3: Data Extraction (complete)
- ✅ Phase 4: LLM Evaluation (complete)
- ✅ Phase 5: CLI & Integration (complete)
All core functionality is implemented. See IMPLEMENTATION_PLAN.md for architecture details.
## Running Tests

```bash
# Activate venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=html
```
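New tests dropped into `tests/` just need to follow the standard pytest conventions: files named `test_*.py` containing `test_`-prefixed functions with plain `assert` statements. A hypothetical example (not from the actual suite; `normalize_price` is a toy stand-in):

```python
# tests/test_example.py -- hypothetical file, not part of the repository


def normalize_price(raw: str) -> float:
    """Toy helper standing in for a real extraction function."""
    return float(raw.replace("€", "").replace(",", "").strip())


def test_normalize_price():
    assert normalize_price("€1,200.50") == 1200.50
```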