Choosing a gRPC Communication Strategy
Overview of the problem
One of the key parts of the design of my search engine is the ability for the spiders to send the pages they have explored to a central point. A more conventional design might use a push/pull system, in which the conductor polls each spider in turn and requests the pages it has seen using pagination. This has several problems.
- Pagination across spiders is hard: for page offsets to stay valid between requests, each spider's list of explored pages may need to be immutable.
- If the conductor is slow to process messages, the spiders keep exploring in the meantime, creating an ever-growing backlog.
Proposed solution: Bidirectional communication
To avoid relying on a push/pull system, I instead use a bidirectional, stream-based model, which allows pages to be sent as they are explored or in batches. This is better because fewer distinct gRPC requests need to be made. It also allows the conductor to signal the spiders to back off if it can't keep up.
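As a rough illustration of the conductor's side of such a stream, here is a minimal Python sketch. It uses no gRPC library; it only mirrors the general shape of a streaming handler, which consumes an iterator of incoming messages and yields replies. The names PageBatch, FlowSignal, conductor_stream, high_water and drain_per_batch are all hypothetical, and the "processing" is simulated by dropping a fixed number of pages per batch.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class PageBatch:
    """Hypothetical message a spider streams to the conductor."""
    spider_id: str
    urls: List[str]


@dataclass
class FlowSignal:
    """Hypothetical message the conductor streams back to a spider."""
    back_off: bool
    pending: int


def conductor_stream(
    batches: Iterable[PageBatch],
    high_water: int = 4,
    drain_per_batch: int = 1,
) -> Iterator[FlowSignal]:
    """Consume batches as they arrive and reply to each with a flow signal."""
    pending: List[str] = []
    for batch in batches:
        pending.extend(batch.urls)
        # Assumption: stand-in for real work; the conductor "processes"
        # a fixed number of pages each time a batch arrives.
        del pending[:drain_per_batch]
        # Ask the spider to back off once the backlog passes the threshold.
        yield FlowSignal(back_off=len(pending) > high_water, pending=len(pending))
```

With three batches of three pages each and the default settings, the backlog grows faster than it drains, so the third reply would ask the spider to back off.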
With the back off
Trade-offs
Streaming adds complexity, as it can require more concurrency. My desired implementation requires two threads, so that they do not block each other, with a shared channel between them for back off communication. It might have been possible to do this with a push/pull system; however, streaming allows for more throughput because no polling is required, and all communication is handled in real time.
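The two-thread arrangement could look something like the following Python sketch. It is not the actual implementation: it assumes one thread sends pages and the other reads the conductor's signals, and it uses a queue.Queue as the shared channel. The function run_spider and its parameters are hypothetical, the stream itself is replaced by plain Python iterables, and the sketch assumes the conductor always follows a pause with a resume.

```python
import queue
import threading
from typing import Iterable, List


def run_spider(pages: Iterable[str], conductor_signals: Iterable[bool]) -> List[str]:
    """Send pages on one thread while another relays back off signals to it."""
    # Shared channel between the two threads: True = pause, False = resume.
    backoff: "queue.Queue[bool]" = queue.Queue()
    sent: List[str] = []  # stands in for the outbound half of the stream

    def receiver() -> None:
        # Stands in for reading the inbound half of the stream.
        for pause in conductor_signals:
            backoff.put(pause)

    def sender() -> None:
        paused = False
        for page in pages:
            # Pick up any signals that arrived since the last send.
            while True:
                try:
                    paused = backoff.get_nowait()
                except queue.Empty:
                    break
            # While paused, block on the channel until a resume arrives.
            while paused:
                paused = backoff.get()
            sent.append(page)

    threads = [threading.Thread(target=receiver), threading.Thread(target=sender)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent
```

Because the sender only ever blocks on the channel, and never on the inbound stream, a slow or silent conductor does not stop pages from being sent unless it has explicitly asked for a pause.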