#Scrapy: The Workhorse for Enterprise Applications

For large-scale, production-ready crawling pipelines, Scrapy remains the industry standard:

from datetime import datetime  # needed for the crawled_at timestamp below

import scrapy

class CompetitorSpider(scrapy.Spider):
    """Crawl competitor pricing pages and product catalogs."""
    name = "competitor_monitor"
    
    start_urls = [
        "https://competitor-a.com/pricing",
        "https://competitor-b.com/courses",
    ]
    
    def parse(self, response):
        for plan in response.css(".pricing-plan"):
            yield {
                "source": response.url,
                "plan_name": plan.css("h3::text").get(),
                "price": plan.css(".price::text").get(),
                "features": plan.css("li::text").getall(),
                "crawled_at": datetime.utcnow().isoformat(),
            }

#Enterprise Implementation: An IT Case Study

Let's walk through how a mid-sized to large IT company (think of a SaaS platform with thousands of customers operating in several markets) could implement Open Crawling across its entire engineering organization.

[Figure: IT Company Crawling Architecture]

#Use Case 1: AI-Assisted Knowledge Base Generation

Problem: Building and maintaining a comprehensive knowledge base across multiple product domains requires an enormous content-creation effort.

Solution: Crawl publicly available technical documentation, industry news, and product-related content to feed an AI content pipeline.

# scripts/crawl_knowledge_base.py
import asyncio
import json
from pathlib import Path

from crawl4ai import AsyncWebCrawler

SOURCES = {
    "industry_news": [
        "https://techcrunch.com/",
        "https://www.theverge.com/",
        "https://arstechnica.com/",
    ],
    "documentation": [
        "https://docs.example.com/",
        "https://developer.mozilla.org/",
    ],
}

OUTPUT_PATH = Path("data/knowledge_materials.json")

async def crawl_knowledge_sources():
    async with AsyncWebCrawler() as crawler:
        materials = []

        for category, urls in SOURCES.items():
            for url in urls:
                result = await crawler.arun(
                    url=url,
                    word_count_threshold=50,
                    exclude_external_links=True,
                )

                # Skip pages that failed to load or produced no markdown
                if not result.success or not result.markdown:
                    print(f"Skipping {url}: crawl failed or returned no content")
                    continue

                materials.append({
                    "category": category,
                    "source_url": url,
                    "content": result.markdown,
                    "word_count": len(result.markdown.split()),
                })

        # Save for downstream AI processing (create data/ if it is missing)
        OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
        OUTPUT_PATH.write_text(
            json.dumps(materials, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )

        print(f"Crawled {len(materials)} sources for knowledge base")

if __name__ == "__main__":
    asyncio.run(crawl_knowledge_sources())

Then feed this data into your LLM to generate structured knowledge articles:

// services/knowledgeGenerator.ts
import { GoogleGenerativeAI } from "@google/generative-ai";

interface KnowledgeArticle {
  title: string;
  category: string;
  summary: string;
  key_points: string[];
  related_topics: string[];
}

async function generateArticle(crawledContent: string): Promise<KnowledgeArticle> {
  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
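  const model = genAI.getGenerativeModel({
    model: "gemini-3-pro-preview",
    // Request raw JSON so JSON.parse below is not tripped up by Markdown code fences
    generationConfig: { responseMimeType: "application/json" },
  });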

  const prompt = `Based on the following content, create a knowledge base article.
  
  Content:
  ${crawledContent}
  
  Generate a JSON article with:
  - title: article title
  - category: topic category
  - summary: a concise 2-3 sentence summary
  - key_points: 5-8 key takeaways
  - related_topics: related topics for cross-referencing`;

  const result = await model.generateContent(prompt);
  return JSON.parse(result.response.text());
}

#Use Case 2: Competitive Intelligence Dashboard

Problem: The SaaS market moves fast. Price changes, new features, and marketing strategies need to be tracked in real time.

Solution: Schedule crawl jobs that monitor competitor websites and feed the data into an internal dashboard.

# scripts/competitor_monitor.py
import scrapy
from scrapy.crawler import CrawlerProcess
from datetime import datetime

class PricingMonitor(scrapy.Spider):
    name = "pricing_monitor"
    
    # Competitor pricing pages (anonymized)
    start_urls = [
        "https://competitor-a.com/en/pricing",
        "https://competitor-b.com/plans",
        "https://competitor-c.com/subscription",
    ]
    
    custom_settings = {
        "ROBOTSTXT_OBEY": True,          # Always respect robots.txt
        "DOWNLOAD_DELAY": 2,              # Be a polite crawler
        "CONCURRENT_REQUESTS": 1,         # One request at a time across all domains
        "USER_AGENT": "CompanyBot/1.0 (Research purposes)",
    }
    
    def parse(self, response):
        yield {
            "competitor": response.url.split("/")[2],
            "plans": self.extract_plans(response),
            "timestamp": datetime.utcnow().isoformat(),
        }
    
    def extract_plans(self, response):
        plans = []
        for plan in response.css("[class*='plan'], [class*='pricing']"):
            plans.append({
                "name": plan.css("h2::text, h3::text").get(""),
                "price": plan.css("[class*='price']::text").get(""),
                "period": plan.css("[class*='period']::text").get("monthly"),
                "features": plan.css("li::text").getall(),
            })
        return plans

# Run with: python scripts/competitor_monitor.py
if __name__ == "__main__":
    process = CrawlerProcess(settings={
        "FEEDS": {
            f"data/pricing_{datetime.now():%Y%m%d}.json": {
                "format": "json",
                "encoding": "utf-8",
            }
        }
    })
    process.crawl(PricingMonitor)
    process.start()

Schedule this with cron for automated monitoring:

# Run pricing monitor every Monday at 9 AM
0 9 * * 1 cd /app && python scripts/competitor_monitor.py
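
Weekly snapshots only become useful once they are compared with each other. The following is a minimal sketch of such a comparison, assuming the JSON feed layout produced by the spider above (a list of items with `competitor` and `plans`). The helper names `index_plans`, `diff_prices`, and `load_snapshot` are hypothetical and not part of Scrapy.

```python
import json
from pathlib import Path


def index_plans(snapshot):
    """Map (competitor, plan name) -> price string for one snapshot."""
    prices = {}
    for entry in snapshot:
        for plan in entry.get("plans", []):
            prices[(entry["competitor"], plan["name"])] = plan["price"]
    return prices


def diff_prices(previous, current):
    """Return the plans whose price differs between two snapshots."""
    old, new = index_plans(previous), index_plans(current)
    changes = []
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes.append({
                "competitor": key[0],
                "plan": key[1],
                "before": before,  # None means the plan is new
                "after": after,    # None means the plan was removed
            })
    return changes


def load_snapshot(path):
    """Load one JSON feed file written by the spider."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

A dashboard job could call `diff_prices(load_snapshot(last_week), load_snapshot(this_week))` and raise an alert whenever the returned list is not empty.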

#Use Case 3: Content Gap Analysis

Problem: You need to understand which keywords and topics competitors rank for in order to identify content opportunities.

Solution: Crawl competitors' blog posts and product pages to analyze where their content focuses.

# scripts/content_gap_analysis.py
from crawl4ai import AsyncWebCrawler
from collections import Counter
import re

async def analyze_competitor_content():
    async with AsyncWebCrawler() as crawler:
        # Crawl competitor's blog/resource pages
        result = await crawler.arun(
            url="https://competitor-blog.com/resources",
            word_count_threshold=20,
        )
        
        # Extract key topics using simple NLP
        words = re.findall(r'\b[a-zA-Z]{4,}\b', result.markdown.lower())
        
        # Filter for industry-relevant terms
        industry_terms = [
            "saas", "cloud", "infrastructure", "analytics",
            "automation", "integration", "platform", "enterprise",
            "security", "deployment", "scalable", "monitoring",
            "dashboard", "workflow", "productivity", "collaboration",
        ]
        
        relevant = [w for w in words if w in industry_terms]
        topic_frequency = Counter(relevant).most_common(20)
        
        print("Top content themes from competitor:")
        for topic, count in topic_frequency:
            print(f"  {topic}: {count} mentions")
        
        return topic_frequency
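
The script above only counts what the competitor writes about. A gap appears once those counts are compared with your own content. The sketch below shows one simple way to do that, assuming you have run the same term count over your own pages. The function name `find_content_gaps` and the `min_mentions` threshold are illustrative assumptions.

```python
from collections import Counter


def find_content_gaps(competitor_counts, own_counts, min_mentions=3):
    """Return topics the competitor mentions at least `min_mentions` times
    and more often than we do, as (topic, theirs, ours) tuples."""
    gaps = []
    for topic, theirs in competitor_counts.items():
        ours = own_counts.get(topic, 0)
        if theirs >= min_mentions and theirs > ours:
            gaps.append((topic, theirs, ours))
    # Largest difference first; ties are broken alphabetically by topic
    return sorted(gaps, key=lambda g: (g[2] - g[1], g[0]))


if __name__ == "__main__":
    competitor = Counter({"security": 12, "automation": 8, "cloud": 2})
    own = Counter({"security": 10})
    for topic, theirs, ours in find_content_gaps(competitor, own):
        print(f"{topic}: competitor {theirs}, us {ours}")
```

Because `analyze_competitor_content()` returns a list of `(topic, count)` pairs, wrap its result in `dict(...)` before passing it in.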

#Use Case 4: Content Localization for Multiple Markets

Problem: The platform operates in more than 10 countries, but producing market-relevant content for each region is expensive.

Solution: Crawl local news, trends, and industry content from the target markets to guide localization and regional strategy.

# scripts/localize_content.py
from crawl4ai import AsyncWebCrawler

# Target market content sources
MARKET_SOURCES = {
    "ja": {
        "name": "Japan",
        "sources": [
            "https://www3.nhk.or.jp/news/easy/",
            "https://www.sora-edu.com/",
        ]
    },
    "ko": {
        "name": "South Korea",
        "sources": [
            "https://en.yna.co.kr/",
        ]
    },
    "pt": {
        "name": "Brazil",
        "sources": [
            "https://agenciabrasil.ebc.com.br/en",
        ]
    },
}

async def crawl_market_content(lang: str):
    market = MARKET_SOURCES[lang]
    async with AsyncWebCrawler() as crawler:
        all_content = []
        for url in market["sources"]:
            result = await crawler.arun(url=url)
            all_content.append({
                "market": market["name"],
                "lang": lang,
                "url": url,
                "content": result.markdown,
            })
        return all_content

#Production Architecture

For a production deployment, the following architecture is recommended:

┌──────────────────────────────────────────────────────────┐
│                    Scheduled Jobs (Cron)                  │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────────┐  │
│  │ Pricing  │  │ Content Gap  │  │ Knowledge Base    │  │
│  │ Monitor  │  │ Analysis     │  │ Crawler           │  │
│  └────┬─────┘  └──────┬───────┘  └────────┬──────────┘  │
│       │               │                    │              │
├───────┼───────────────┼────────────────────┼──────────────┤
│       ▼               ▼                    ▼              │
│  ┌─────────────────────────────────────────────────┐     │
│  │              Message Queue (Redis/SQS)           │     │
│  └──────────────────────┬──────────────────────────┘     │
│                          │                                │
│  ┌──────────────────────▼──────────────────────────┐     │
│  │           Data Processing Pipeline               │     │
│  │  ┌──────────┐  ┌───────────┐  ┌──────────────┐  │     │
│  │  │ Clean &  │→ │ Classify  │→ │ Store in DB  │  │     │
│  │  │ Validate │  │ & Tag     │  │ / S3         │  │     │
│  │  └──────────┘  └───────────┘  └──────────────┘  │     │
│  └──────────────────────┬──────────────────────────┘     │
│                          │                                │
│  ┌──────────────────────▼──────────────────────────┐     │
│  │              Downstream Consumers                │     │
│  │  ┌──────────┐  ┌───────────┐  ┌──────────────┐  │     │
│  │  │ AI Know- │  │ Dashboard │  │ SEO Content  │  │     │
│  │  │ ledgeBase│  │ & Alerts  │  │ Pipeline     │  │     │
│  │  └──────────┘  └───────────┘  └──────────────┘  │     │
│  └─────────────────────────────────────────────────┘     │
└──────────────────────────────────────────────────────────┘
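
The middle layer of the diagram (Clean & Validate, Classify & Tag, Store) can be sketched as three small functions. This is an illustration under stated assumptions, not a reference implementation: the `CrawlRecord` type, the keyword-based `TAG_RULES`, the 20-character minimum, and the in-memory list standing in for the database or S3 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlRecord:
    url: str
    content: str
    tags: list = field(default_factory=list)


def clean_and_validate(raw):
    """Normalize whitespace; reject records without a valid URL or usable content."""
    url = (raw.get("url") or "").strip()
    content = " ".join((raw.get("content") or "").split())
    if not url.startswith(("http://", "https://")) or len(content) < 20:
        return None
    return CrawlRecord(url=url, content=content)


# Hypothetical keyword rules; a real classifier would replace this table.
TAG_RULES = {
    "pricing": ("price", "plan", "subscription"),
    "security": ("security", "vulnerability"),
}


def classify_and_tag(record):
    """Attach every tag whose keywords occur in the record's content."""
    text = record.content.lower()
    record.tags = sorted(
        tag for tag, words in TAG_RULES.items() if any(w in text for w in words)
    )
    return record


def process(messages, store):
    """Run each queued message through the stages; return how many were stored."""
    stored = 0
    for raw in messages:
        record = clean_and_validate(raw)
        if record is None:
            continue  # drop invalid input instead of storing it
        store.append(classify_and_tag(record))
        stored += 1
    return stored
```

In the real pipeline, `messages` would be read from the Redis or SQS queue and `store.append` replaced by a database write.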

#Key Infrastructure Components

# docker-compose.yml (crawling infrastructure)
version: "3.8"

services:
  crawler:
    build: ./crawler
    environment:
      - REDIS_URL=redis://redis:6379
      - CRAWL_CONCURRENCY=5
      - RESPECT_ROBOTS_TXT=true
    volumes:
      - crawl-data:/data
    deploy:
      resources:
        limits:
          memory: 2G

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  processor:
    build: ./processor
    environment:
      - REDIS_URL=redis://redis:6379
      - DB_URL=postgresql://crawler:pass@db:5432/crawl
      - GEMINI_API_KEY=${GEMINI_API_KEY}
    depends_on:
      - redis
      - db   # the processor writes to Postgres via DB_URL

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: crawl
      POSTGRES_USER: crawler
      POSTGRES_PASSWORD: pass
    volumes:
      - pg-data:/var/lib/postgresql/data

volumes:
  crawl-data:
  pg-data:

#Ethical Crawling: The Rules of the Game

⚠️ Important: Open Crawling comes with significant ethical and legal responsibilities. Always follow these principles.

#Mandatory Rules

Always respect robots.txt: this is non-negotiable.
Set reasonable rate limits: wait at least 1-2 seconds between requests.
Identify your bot: use a descriptive User-Agent string.
Honor noindex and nofollow directives.
Do not crawl personal data without a legal basis (GDPR/CCPA).
Cache aggressively: do not re-crawl anything that has not changed.

# Example: Ethical crawling configuration
CRAWL_CONFIG = {
    "robotstxt_obey": True,
    "download_delay": 2.0,              # 2 seconds between requests
    "concurrent_requests": 1,           # 1 request at a time per domain
    "concurrent_requests_per_ip": 1,
    "user_agent": "YourCompanyBot/1.0 (+https://yourcompany.com/bot)",
    "httpcache_enabled": True,          # Cache responses
    "httpcache_expiration_secs": 86400, # Cache for 24 hours
}
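
Frameworks such as Scrapy enforce robots.txt for you, but a custom crawler has to check it explicitly. The sketch below uses Python's standard `urllib.robotparser` for that. The helper names and the bare `YourCompanyBot` agent token are assumptions for illustration. The robots.txt text is passed in as a string here, so fetching it is left to the caller.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "YourCompanyBot"  # hypothetical token, matching the config above


def build_parser(robots_txt):
    """Parse the text of a robots.txt file that the caller has already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs this user agent is permitted to fetch."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def effective_delay(parser, default_delay=2.0, user_agent=USER_AGENT):
    """Use the site's Crawl-delay when it is stricter than our own default."""
    requested = parser.crawl_delay(user_agent)
    if requested is None:
        return default_delay
    return max(default_delay, float(requested))
```

Taking the larger of the two delays means a site can slow the crawler down but never speed it up beyond your own politeness limit.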

#Privacy Considerations for IT Companies

Concern	Best Practice
Customer data	Never crawl the platform's user data
Employee profiles	Only public information shared with consent
Competitor pricing	Public pages only; respect the terms of service (ToS)
Third-party content	Check copyright and licensing terms
Personal data	Remove PII (personally identifiable information) during processing
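
Removing PII during processing can start with simple pattern-based redaction. The following is a rough sketch that masks e-mail addresses and phone-like digit sequences. The patterns and the `redact_pii` name are illustrative assumptions; regular expressions like these will miss some PII and occasionally flag harmless numbers, so treat them as a first filter rather than a compliance guarantee.

```python
import re

# Simple patterns for illustration; real-world PII detection needs more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text):
    """Replace e-mail addresses and phone-like sequences with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


if __name__ == "__main__":
    print(redact_pii("Contact jane.doe@example.com or +49 30 1234567 for details."))
```

Running the redaction before crawled text is stored keeps raw personal data out of the database and out of any downstream LLM prompt.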

#Getting Started: A 30-Minute Setup

Here is how to set up a basic Open Crawling pipeline in 30 minutes:

#Step 1: Install Crawl4AI

pip install crawl4ai
playwright install  # For JavaScript-rendered pages

#Step 2: Create Your First Crawler

# my_first_crawler.py
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            word_count_threshold=10,
            bypass_cache=True,
        )
        
        print(f"Status: {result.success}")
        print(f"Title: {result.metadata.get('title', 'N/A')}")
        print(f"Content length: {len(result.markdown)} chars")
        print(f"\nFirst 500 chars:\n{result.markdown[:500]}")

asyncio.run(main())

#Step 3: Schedule and Automate

# Add to crontab
# Run daily at 6 AM
0 6 * * * cd /app && python my_first_crawler.py >> /var/log/crawler.log 2>&1