Property Scraper

A Python-based web scraping system for property listings with intelligent extraction and LLM-powered filtering.

Features

  • Web Scraping: Automated scraping with throttling and deduplication
  • Hybrid Extraction: NuExtract for structured fields + an LLM for semantic analysis (see the sketch after this list)
  • Intelligent Filtering: LLM-powered scoring and ranking based on your criteria
  • n8n Integration: CLI commands ready for automation workflows
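
The hybrid extraction is easiest to picture as a pipeline: each stage receives a listing record and returns it with more fields filled in, so stages can be added or reordered independently. Below is a minimal sketch of that pattern; the class names are illustrative, not the project's actual API.

from typing import Protocol

class Extractor(Protocol):
    # One pipeline stage: take a partially filled listing, return it enriched.
    def extract(self, listing: dict) -> dict: ...

class StructuredExtractor:
    # Stage 1: structured fields (price, rooms, area); NuExtract would run here.
    def extract(self, listing: dict) -> dict:
        listing.setdefault("price", None)
        return listing

class SemanticExtractor:
    # Stage 2: semantic fields (features, issues, summary); the LLM call would run here.
    def extract(self, listing: dict) -> dict:
        listing.setdefault("summary", "")
        return listing

def run_pipeline(listing: dict, stages: list[Extractor]) -> dict:
    for stage in stages:
        listing = stage.extract(listing)
    return listing

listing = run_pipeline({"url": "https://example.com/listing/1"},
                       [StructuredExtractor(), SemanticExtractor()])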

Setup

1. Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

2. Install Dependencies

pip install -r requirements.txt

3. Configure Environment

Copy the example environment file and edit it:

cp .env.example .env

Edit .env and add:

  • Your Anthropic API key
  • Target website URL
  • Your evaluation criteria (budget, preferred regions, etc.)
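
For example, a minimal .env might look like the lines below. The variable names here are illustrative; check .env.example for the keys the project actually reads.

# Illustrative .env; the real keys live in .env.example
ANTHROPIC_API_KEY=sk-ant-your-key-here
TARGET_URL=https://www.example.com/real-estate
MAX_BUDGET_EUR=250000
PREFERRED_REGIONS=Riga,Jurmala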

4. Initialize Database

python main.py init-db

This creates the SQLite database with all necessary tables.
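
Under the hood this amounts to creating every table registered on the SQLAlchemy declarative base, roughly as sketched below; the database file name and path are assumptions based on the project layout.

import os
from sqlalchemy import create_engine
from sqlalchemy.orm import DeclarativeBase

class Base(DeclarativeBase):
    # Stand-in for the declarative base in src/database/models.py;
    # every model registered on it becomes a table.
    pass

os.makedirs("data", exist_ok=True)                   # the database lives in data/
engine = create_engine("sqlite:///data/scraper.db")  # file name is an assumption
Base.metadata.create_all(engine)                     # idempotent: creates only missing tables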

Project Structure

ss-scraper/
├── src/
│   ├── config.py              # Configuration management
│   ├── database/
│   │   ├── models.py          # SQLAlchemy models
│   │   └── db.py              # Database connection
│   ├── scraper/               # Web scraping (Phase 2)
│   ├── extraction/            # Data extraction (Phase 3)
│   ├── llm/                   # LLM evaluation (Phase 4)
│   └── utils/                 # Utility functions
├── data/                      # SQLite database
├── logs/                      # Application logs
├── .env                       # Environment variables (not in git)
├── main.py                    # CLI entry point
└── requirements.txt           # Python dependencies
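
src/config.py centralizes configuration. One common pattern for this, sketched below on the assumption that pydantic-settings is used (the repo may do it differently), is a settings class that reads .env automatically:

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Field names are illustrative; see .env.example for the real keys.
    anthropic_api_key: str = ""
    target_url: str = ""
    database_url: str = "sqlite:///data/scraper.db"

    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

settings = Settings()  # values are pulled from the environment and .env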

Database Schema

scraped_pages

Tracks every scraped listing page so the same URL is not processed twice.

listings

Extracted property information, including:

  • Structured fields: title, price, area, rooms, coordinates, etc.
  • LLM features: interesting_features, potential_issues, summary
  • Image URLs
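
As a sketch, the listings model might look roughly like this in SQLAlchemy 2.0 style; the column names mirror the bullets above, but the real model in src/database/models.py is authoritative.

from sqlalchemy import Float, Integer, String, Text
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Listing(Base):
    __tablename__ = "listings"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    # Structured fields
    title: Mapped[str] = mapped_column(String(255))
    price: Mapped[float | None] = mapped_column(Float)
    area_m2: Mapped[float | None] = mapped_column(Float)
    rooms: Mapped[int | None] = mapped_column(Integer)
    latitude: Mapped[float | None] = mapped_column(Float)
    longitude: Mapped[float | None] = mapped_column(Float)
    # LLM-derived fields
    interesting_features: Mapped[str | None] = mapped_column(Text)
    potential_issues: Mapped[str | None] = mapped_column(Text)
    summary: Mapped[str | None] = mapped_column(Text)
    # Image URLs, e.g. stored as a JSON-encoded list
    image_urls: Mapped[str | None] = mapped_column(Text)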

llm_evaluations

LLM analysis with scores, reasoning, and match criteria.

CLI Commands

# Initialize database
python main.py init-db

# Phase 2+ (not yet implemented)
python main.py scrape              # Run scraper
python main.py extract             # Extract data
python main.py evaluate            # Evaluate with LLM
python main.py full-pipeline       # Run everything
python main.py list-interesting    # Show high-scoring listings
python main.py stats               # Show statistics
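
main.py is the single entry point for all of these. A skeleton of how the commands could be wired up with click is sketched below; the option names and bodies are illustrative, and the real implementations will differ.

import click

@click.group()
def cli():
    """Property scraper CLI."""

@cli.command("init-db")
def init_db():
    """Create the SQLite database and tables."""
    click.echo("Initializing database...")

@cli.command()
@click.option("--source", default=None, help="Restrict extraction to one source site.")
@click.option("--limit", default=None, type=int, help="Maximum number of pages to process.")
def extract(source, limit):
    """Run the extraction pipeline on unprocessed pages."""
    click.echo(f"Extracting (source={source}, limit={limit})")

if __name__ == "__main__":
    cli()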

Development Status

  • Phase 1: Foundation Setup (complete)
  • Phase 2: Web Scraper (pending)
  • Phase 3: Data Extraction (pending)
  • Phase 4: LLM Filtering (pending)
  • Phase 5: CLI & Integration (pending)

Next Steps

See IMPLEMENTATION_PLAN.md for the detailed implementation roadmap.