Skip to main content Alex Collie's blog

Search Engine Update 1

As, I mentioned in my last post . I’ve been busy building my search engine lately, and it’s been a lot of fun diving deep into Kubernetes and learning how to set everything up from scratch.

Overview of the System

Diagram of the full search engine

  • Spider

    • Fetches websites and stores them in a local database.
    • Uses bi-directional gRPC to stream data to the Conductor.
    • Retrieves URLs to explore from SQS.
  • Conductor

    • Receives crawled pages from the Spider.
    • Updates the main table of seen pages.
    • Queues new pages to explore in SQS.
  • Cartographer

    • Pulls pages from the seen pages table.
    • Builds an adjacency list and assigns a prominence score to each page.
    • Stores the results in the index table.
  • Searcher

    • Uses the index table to return the most relevant pages for a search query.

What’s Been Done

  • Spider

    • Migrated to a relational database (SQLite).
    • Now exposes a gRPC endpoint consumed by the Conductor.
  • Observability

    • Set up Grafana and Prometheus for system monitoring.
    • Integrated OpenTelemetry for distributed tracing.
  • RPC (Spider ↔ Conductor)

    • Implemented gRPC streaming to allow real-time, bidirectional communication.
  • Kubernetes

    • Cluster configured to pull container images from AWS ECR.

Trade-Offs

  • Database

    • Using Postgres for the main system and SQLite for the Spider.
    • SQLite is lightweight, requires no additional services, and is ideal for isolated Spider instances.
    • Postgres offers scalability and flexibility for future features like word embeddings and graph search.
  • Spider ↔ Conductor

    • The Conductor is the bottleneck in this architecture.
    • To mitigate this, I implemented bidirectional streaming over gRPC. The Conductor can send back “can’t keep up” messages when overwhelmed.
    • Streaming enables incremental, real-time submission of crawled pages.
  • AWS SQS

    • Chosen due to familiarity and reliability.
    • Enables horizontal scaling by allowing multiple Spiders to fetch tasks independently.
  • Prometheus

    • Opted for a /metrics endpoint over a sidecar to reduce RAM usage and simplify deployment.

Future

At the next update, I plan to have:

  • Public Grafana dashboard
  • Implement
    • Conductor
    • Cartographer