Redis For HA And Load Balancing Of A Service

12 November 2016

Redis For HA And Load Balancing Of A Service - Part-1

It’s been a long time since I last wrote a blog post, mostly because of a busy schedule and my new life after university. A lot has happened since then, and I plan to write separate blog posts about it.

I have been working with the DevOps team at BrowserStack for a few months now. In this post I’ll describe an interesting problem I have been working on recently.

Problem

The task is to write a highly available, multi-AZ service, deployed independently in each region, that aggregates data at fixed time intervals and pushes the result to a database cluster.

Let’s first look at the given architecture and try to understand it (figure: Event Data Service).

In the above architecture we have:

Multiple services and products, all in the same region (for example, us-east).
All services communicate with the database servers via DNS, which provides round-robin load balancing.
Machine-1 and Machine-2 have exactly the same configuration and services (a UDP broadcast relayer and InfluxDB) and sit in two different availability zones, for example us-east-1a and us-east-1b.
A request from any of the products (productA, productB, etc.) that hits our DNS can be directed to either Machine-1 or Machine-2.
The UDP relayer makes sure that both databases contain the same data, i.e. a request that arrives on Machine-1 is automatically sent to Machine-2 and vice versa.
Each product sends the database cluster an event message whenever a user event happens on it.
Each message contains 4 parameters (event type, product, username, timestamp).

Event message examples:

event_type=http-5xx,product=productA value=testusername1 1478507534415525888
event_type=os_error,product=productA value=testusername2 1478507534415525888
event_type=browser_error,product=productB value=testusername3 1478507534415525888
event_type=browser_error,product=productB value=testusername4 1478507534415525888
event_type=http-5xx,product=productA value=testusername1 1478507534415525890

Note: The timestamps of the above events are in nanoseconds.
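To make the message layout concrete, here is a minimal parsing sketch. The `Event` type and the `parse_event` helper are names I made up for illustration, and the sketch assumes every message has exactly the three space-separated parts shown in the examples above.

```python
from typing import NamedTuple


class Event(NamedTuple):
    event_type: str
    product: str
    username: str
    timestamp_ns: int


def parse_event(line: str) -> Event:
    """Parse 'event_type=X,product=Y value=USER TIMESTAMP' into an Event."""
    # Assumption: no spaces inside tag values or usernames.
    tags_part, value_part, ts_part = line.split()
    tags = dict(pair.split("=", 1) for pair in tags_part.split(","))
    username = value_part.split("=", 1)[1]
    return Event(tags["event_type"], tags["product"], username, int(ts_part))


def minute_bucket(event: Event) -> int:
    """Whole minutes since the Unix epoch, derived from the nanosecond timestamp."""
    return event.timestamp_ns // (60 * 10**9)


event = parse_event(
    "event_type=http-5xx,product=productA value=testusername1 1478507534415525888"
)
print(event)
print(minute_bucket(event))
```

Two events belong to the same one-minute window when `minute_bucket` returns the same number for both.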

Consider the following use case:

For instrumentation purposes, we need to count, per minute, the total number of events of a particular type (e.g. http-5xx, browser_error) that happened for a particular product (e.g. productA, productB), and also get the same count restricted to unique users.

For this use case, the 5 example messages above yield the following output:

unique_user_event,event_type=http-5xx,product=productA value=1
cumulative_user_event,event_type=http-5xx,product=productA value=2
unique_user_event,event_type=os_error,product=productA value=1
cumulative_user_event,event_type=os_error,product=productA value=1
unique_user_event,event_type=browser_error,product=productB value=2
cumulative_user_event,event_type=browser_error,product=productB value=2

Understanding the above output

The cumulative_user_event tag describes events which happened within one minute for a particular product, i.e. if an event happened for testusername1 two times within a minute, it is counted two times.

The unique_user_event tag, on the other hand, describes events which happened within one minute for a particular product considering unique users only, i.e. if an event happened for testusername1 two times within a minute, it is counted only once.

The first output message says: Within a duration of 1 minute the product productA gave http-5xx to 1 user only.
The second output message says: Within a duration of 1 minute the product productA gave http-5xx 2 times.
The third output message says: Within a duration of 1 minute the product productA gave os_error to 1 user only.
The fourth output message says: Within a duration of 1 minute the product productA gave os_error 1 time only.
The fifth output message says: Within a duration of 1 minute the product productB gave browser_error to 2 users.
The sixth output message says: Within a duration of 1 minute the product productB gave browser_error 2 times.
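The mapping from the five input messages to the six output messages can be checked with a short sketch. The `aggregate` function is a hypothetical helper, and the input is reduced to (event type, product, username) tuples on the assumption that all five messages fall inside the same minute.

```python
from collections import defaultdict

# The five example messages; timestamps are omitted because all of them
# are assumed to fall inside the same one-minute window.
events = [
    ("http-5xx", "productA", "testusername1"),
    ("os_error", "productA", "testusername2"),
    ("browser_error", "productB", "testusername3"),
    ("browser_error", "productB", "testusername4"),
    ("http-5xx", "productA", "testusername1"),
]


def aggregate(events):
    """Return the output lines for one minute of events."""
    cumulative = defaultdict(int)  # (event_type, product) -> number of events
    users = defaultdict(set)       # (event_type, product) -> distinct usernames
    for event_type, product, username in events:
        key = (event_type, product)
        cumulative[key] += 1
        users[key].add(username)

    lines = []
    for key, count in cumulative.items():
        tags = f"event_type={key[0]},product={key[1]}"
        lines.append(f"unique_user_event,{tags} value={len(users[key])}")
        lines.append(f"cumulative_user_event,{tags} value={count}")
    return lines


for line in aggregate(events):
    print(line)
```

The repeated http-5xx event for testusername1 raises the cumulative count to 2 while the unique count stays at 1, which is exactly the difference between the two kinds of output.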

Service Requirements

Let’s call the example messages above the input, the aggregated messages the output, and the service to be implemented the Event Data Service. It should be pretty clear by now that we need to write a service which takes the input from multiple products and sends the output to the database cluster.

Easy and unpromising solution

We use a simple hash map with the following keys and values:

cumulative_user_event,<product_type>,<event_type> : (int)(count_of_events)
unique_user_event,<product_type>,<event_type> : (set)(usernames)

Whenever a new message arrives, we increment the counter for cumulative_user_event and add the username to the set corresponding to unique_user_event. In parallel, we run a loop every minute that sends the aggregated output to the database cluster (cumulative_user_event with the counter and unique_user_event with the cardinality of the set) and then clears the hash map.

Event Data Service Easy Solution
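A minimal single-process sketch of this easy solution could look like the following. The class name `EventAggregator`, its methods, and the `send` callback are assumptions made for illustration, not the actual service code. For readability the sketch keeps the counters and the sets in two maps instead of one, and it guards them with a lock because messages and the one-minute loop run in parallel.

```python
import threading
from collections import defaultdict


class EventAggregator:
    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)  # (product, event_type) -> event count
        self._users = defaultdict(set)   # (product, event_type) -> usernames

    def record(self, product, event_type, username):
        """Called for every incoming event message."""
        with self._lock:
            key = (product, event_type)
            self._counts[key] += 1
            self._users[key].add(username)

    def flush(self):
        """Return the aggregated output lines and clear both maps."""
        with self._lock:
            counts, users = self._counts, self._users
            self._counts = defaultdict(int)
            self._users = defaultdict(set)
        lines = []
        for (product, event_type), count in counts.items():
            tags = f"event_type={event_type},product={product}"
            unique = len(users[(product, event_type)])
            lines.append(f"unique_user_event,{tags} value={unique}")
            lines.append(f"cumulative_user_event,{tags} value={count}")
        return lines

    def run(self, send, stop, interval=60.0):
        """Call send(lines) every `interval` seconds until `stop` is set."""
        while not stop.wait(interval):
            send(self.flush())


aggregator = EventAggregator()
aggregator.record("productA", "http-5xx", "testusername1")
aggregator.record("productA", "http-5xx", "testusername1")
print(aggregator.flush())
```

In a real deployment `run` would be started in a background thread with a `threading.Event` as `stop` and a `send` function that writes to the database cluster. All state lives in the memory of one process, which is why this approach is tied to a single machine.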

Problems with above solution

The above solution works perfectly fine when deployed as a service on a single machine, but deploying it on a single node makes it a single point of failure for our complete system, i.e. even though we have HA for our database, whenever the node running the Event Data Service crashes, all our database servers stop receiving data.

If you look carefully, with this solution we could simply deploy the service on either Machine-1 or Machine-2 and point our DNS only to the machine running it. But that disables our HA: when the server the DNS points to crashes, our whole system goes down with it.

Deploying Event Data Service on multiple machines

This is the actual solution to our problem and the main reason for this blog post, i.e. we want to deploy our Event Data Service on both Machine-1 and Machine-2.

Challenges and Brainstorming

I’ll describe how to achieve this in my next post, i.e. Part-2 of this series, but I’ll leave some interesting hints for the reader to brainstorm on:

When the Event Data Service is deployed on both Machine-1 and Machine-2, DNS points to both machines and does regular round-robin load balancing:

Which machine gets the event messages from the products?
When does that machine get the message?
Which machine processes and pushes the data, and at what intervals?
Does the one-minute loop finish at the same time on both machines?
What is the title of this post? :)
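To see why these questions matter, here is a tiny simulation of one possible failure mode. It is only a brainstorming aid under stated assumptions, not the solution from Part-2: it assumes that round robin hands each message to exactly one instance, that each instance keeps its own in-memory set, and that the per-instance counts are simply added up afterwards.

```python
from itertools import cycle

# Usernames of four events of the same type for the same product,
# all within one minute (made-up sample data).
usernames = ["testusername1", "testusername1", "testusername2", "testusername3"]

# Assumption: strict round robin, each message reaches exactly one machine.
instances = {"Machine-1": set(), "Machine-2": set()}
for machine, username in zip(cycle(instances), usernames):
    instances[machine].add(username)

per_instance = {machine: len(users) for machine, users in instances.items()}
true_unique = len(set(usernames))
summed = sum(per_instance.values())

print(per_instance)  # unique users as seen by each machine
print(true_unique)   # unique users across all messages
print(summed)        # what we get by adding the per-machine counts
```

Here testusername1 lands on both machines, so adding the two local unique counts gives 4 even though only 3 distinct users were affected. Cumulative counts can be added safely, but unique counts cannot, unless the two instances share state somewhere.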

Until then, keep thinking and stay tuned. Feel free to discuss possible solutions with me in the meantime, or to ask questions if the problem statement wasn’t clear.