# Property Scraper

A Python-based web scraping system for property listings with intelligent extraction and LLM-powered filtering.

## Features
- Web Scraping: Automated scraping with throttling and deduplication
- Hybrid Extraction: NuExtract for structured fields + LLM for semantic analysis (see the sketch after this list)
- Intelligent Filtering: LLM-powered scoring and ranking based on your criteria
- n8n Integration: CLI commands ready for automation workflows
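
To make the hybrid approach concrete, here is a rough sketch of the two-stage enrichment flow. All class and function names below are illustrative rather than the project's actual API; only the field names mirror the listing fields described under Database Schema.

```python
# Illustrative sketch only -- these names are hypothetical, not the project's API.
from dataclasses import dataclass, field


@dataclass
class ListingData:
    # Structured fields (NuExtract pass)
    title: str | None = None
    price: float | None = None
    area: float | None = None
    rooms: int | None = None
    # Semantic fields (LLM pass)
    interesting_features: list[str] = field(default_factory=list)
    potential_issues: list[str] = field(default_factory=list)
    summary: str | None = None


def structured_pass(html: str, data: ListingData) -> ListingData:
    """Stand-in for NuExtract: parse title, price, area, rooms from the page."""
    return data


def semantic_pass(html: str, data: ListingData) -> ListingData:
    """Stand-in for the LLM step: add features, issues, and a summary."""
    return data


def extract(html: str) -> ListingData:
    data = ListingData()
    data = structured_pass(html, data)  # cheap, deterministic fields first
    data = semantic_pass(html, data)    # LLM enrichment on top
    return data
```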
## Setup

### 1. Create Virtual Environment

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

### 2. Install Dependencies

```bash
pip install -r requirements.txt
```

### 3. Configure Environment

Copy the example environment file and edit it:

```bash
cp .env.example .env
```
Edit `.env` and add (an illustrative example follows this list):
- Your Anthropic API key
- Target website URL
- Your evaluation criteria (budget, preferred regions, etc.)
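
For orientation, a filled-in `.env` might look roughly like this; the variable names here are only illustrative, so check `.env.example` for the keys the code actually reads:

```
ANTHROPIC_API_KEY=sk-ant-...
TARGET_URL=https://example.com/listings
MAX_BUDGET=250000
PREFERRED_REGIONS=region-1,region-2
```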
### 4. Initialize Database

```bash
python main.py init-db
```
This creates the SQLite database with all necessary tables.
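
Under the hood, `init-db` amounts to the standard SQLAlchemy bootstrap. A minimal sketch, assuming the models module exposes a declarative `Base` and the database lives under `data/` (the file name below is a guess):

```python
# Minimal sketch of the init-db step -- in the project, Base would come from
# src/database/models.py; the database file name here is a guess.
from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()


def init_db(url: str = "sqlite:///data/scraper.db") -> None:
    engine = create_engine(url)
    # With the project's models registered on Base, this creates
    # scraped_pages, listings, and llm_evaluations.
    Base.metadata.create_all(engine)
```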
## Project Structure

```
ss-scraper/
├── src/
│   ├── config.py          # Configuration management
│   ├── database/
│   │   ├── models.py      # SQLAlchemy models
│   │   └── db.py          # Database connection
│   ├── scraper/           # Web scraping (Phase 2)
│   ├── extraction/        # Data extraction (Phase 3)
│   ├── llm/               # LLM evaluation (Phase 4)
│   └── utils/             # Utility functions
├── data/                  # SQLite database
├── logs/                  # Application logs
├── .env                   # Environment variables (not in git)
├── main.py                # CLI entry point
└── requirements.txt       # Python dependencies
```
## Database Schema

### scraped_pages

Tracks all scraped listing pages to avoid re-processing.

### listings

Extracted property information including:

- Structured fields: title, price, area, rooms, coordinates, etc.
- LLM features: interesting_features, potential_issues, summary
- Image URLs

### llm_evaluations

LLM analysis with scores, reasoning, and match criteria.
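
As a rough illustration of how the `listings` model could be declared with SQLAlchemy, using the fields listed above (the actual definitions live in `src/database/models.py` and may differ):

```python
# Rough sketch of a listings model -- column names follow the fields above,
# but the project's actual models.py may differ.
from sqlalchemy import Column, Float, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Listing(Base):
    __tablename__ = "listings"

    id = Column(Integer, primary_key=True)
    # Structured fields
    title = Column(String)
    price = Column(Float)
    area = Column(Float)
    rooms = Column(Integer)
    latitude = Column(Float)
    longitude = Column(Float)
    # LLM-derived features
    interesting_features = Column(Text)
    potential_issues = Column(Text)
    summary = Column(Text)
    # Image URLs, e.g. stored as a JSON string
    image_urls = Column(Text)
```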
## CLI Commands

```bash
# Initialize database
python main.py init-db

# Phase 2+ (not yet implemented)
python main.py scrape            # Run scraper
python main.py extract           # Extract data
python main.py evaluate         # Evaluate with LLM
python main.py full-pipeline     # Run everything
python main.py list-interesting  # Show high-scoring listings
python main.py stats             # Show statistics
```
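
The README doesn't pin down a CLI framework, so purely as an illustration, here is how the subcommand layout could be wired with `argparse` from the standard library (the real `main.py` may use Click, Typer, or something else):

```python
# Illustrative argparse skeleton -- the actual main.py may be structured
# differently; only init-db is implemented in Phase 1.
import argparse


def init_db() -> None:
    """Placeholder for the database bootstrap (see Setup, step 4)."""
    print("Database initialized")


def main() -> None:
    parser = argparse.ArgumentParser(description="Property scraper CLI")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("init-db", help="Create the SQLite database and tables")
    sub.add_parser("scrape", help="Run the scraper (Phase 2)")
    sub.add_parser("extract", help="Extract data from scraped pages (Phase 3)")
    sub.add_parser("evaluate", help="Evaluate listings with the LLM (Phase 4)")
    sub.add_parser("full-pipeline", help="Run everything")
    sub.add_parser("list-interesting", help="Show high-scoring listings")
    sub.add_parser("stats", help="Show statistics")

    args = parser.parse_args()
    if args.command == "init-db":
        init_db()
    else:
        parser.error(f"'{args.command}' is not implemented yet")


if __name__ == "__main__":
    main()
```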
## Development Status
- ✅ Phase 1: Foundation Setup (complete)
- ⏳ Phase 2: Web Scraper (pending)
- ⏳ Phase 3: Data Extraction (pending)
- ⏳ Phase 4: LLM Filtering (pending)
- ⏳ Phase 5: CLI & Integration (pending)
## Next Steps

See IMPLEMENTATION_PLAN.md for the detailed implementation roadmap.