Sun. Sep 25th, 2022

Context

We are going to discuss a web application with REST API’s and how Guava cache was used to filter invalid requests

A few aspects to note regarding the traffic experienced

Number of hits per second: 1,37,155
Number of unique keys per day: 94,178
Memory per record: 2 MB

Number of servers: 4
Number of instances: 16
    i.e. each server runs 4 instances of Apache Tomcat

This was before everyone went to the cloud and on premise hosting was still a common occurrence.

Problem Statement

We receive traffic from external sources which can be easily exploited by bots and invalid data requests.
We want to filter the bots (specific conditions based on application logic) and invalid data
Identify and block bad data dynamically across the entire cluster of application servers

To help associate with the problem, consider an example of blocking traffic from an IP Address if we receive more than N requests from it.

Solution

Maintain a central list of blacklisted data (existing Application Data Store: Aerospike)
Each web application instance will do the following:
1. Retrieve list of blacklisted data from central location
2. Block incoming data which is already blacklisted
3. Identify if incoming data is to be blacklisted

Approach 1

Retrieve blacklist on application start up

Periodically sync up the blacklisted data with central server

Approach 2

Filters check if incoming data is already identified as blacklisted

Problem Statement

Live detection of suspicious traffic

Approach

Problem: If same key maps to multiple values
We had to keep track of the keys that come in via the API’s, Blacklist a key if it generates more than N different values

The huge volume of keys made it difficult to keep track of all keys forever.

We decided to introduce time windows, Each window was a cache having a TTL

Level 1 Cache

The smallest window was TTL = 10 m
If any key has more than N (e.g. 1) value, move it to next cache
This cache would have maximum number of elements
This cache will hold around 1.5 GB of data based on our data analysis and estimates

Level 2 Cache

          The next cache will hold data for 1 hour
          If any key has more than N (e.g. 1) value, move it to next window

Level 3 Cache

          The next cache will hold data for 1 day
          If any key has more than N (e.g. 1) value, move it to Aerospike

Hence 3 LRU caches, TTL: 10 mins, 1 hour, 1 day

Each cache
Data -> Key: Object, Value: Set<Long>
Has a thread which will invoke cleanUp periodically
Retention Threshold: Number of values recorded per key after which it will be moved to another cache
Blacklist Threshold: Number of values recorded per key after which it will be moved to permanent storage (marked as blacklisted)
Maximum Size: 500000 (number of elements in cache)
Concurrency Level: 200

Parameter/CacheL1L2L3
Expire after write10 mins1 hour1 day
EvictionTTL or Cache maximum size TTL or Cache maximum size TTL or Cache maximum size
ID crossed retention limitMove to L2Move to L3Move to Aerospike
ID crossed blacklist limit Move to L2Move to AerospikeMove to Aerospike

Logic flow

Logic flow

Preparations / Ground work

Investigated application logs to find out 
a. total hits per minute
b. unique keys per day
c. memory requirement per key
d. memory requirement for each cache

Additional Challenges

Handling synchronous requests for an non-existent key and different value
Solution is to use Get with callable and concurrencyLevels

Cache Eviction listener
Keys are evicted from cache for various reasons

To ensure you handle only real eviction scenario’s, wrap the logic within the wasEvicted check.

Eviction

The moment the key is removed from a cache and moved to next level, it needs to surely removed from lower level cache
Else the concurrent or next requests will cause delayed blacklisting of that key

Invalidating cache element

Cache clean up

Ensuring the blacklisted data fits into Aerospike
Total Record size in Aerospike
“Every record value size is limited to the write-block-size, which is typically set in the configuration to 128KB for Flash devices,  but can be increased up to 1MB (which is the default if not specified in the configuration file).”
          
To ensure that we are safe and can store maximum data, analysis was carried to see how many data points will fit within 1MB limit from Aerospike.
As an additional safeguard the data being stored into Aerospike was compressed to support additional data storage
A warning log statement was also added to help trace if the issue occurs in production.

Environment changes
a. Increase Tomcat memory allocation from 128 MB to 16 GB
b. Increase Aerospike record size to the max of 1 MB
c. Set the concurrencyLevel based on the threads in Tomcat.

Monitoring

If JVM heap size falls below a threshold raise an alert
Added counters to track the following:
Size of Cache(s): Number of elements in Cache
Number of elements removed from Cache(s)
Number of elements added to Cache(s)

Performance

Benchmarking Tests: Fires addToCache logic 1M times in different scenario’s. The total time / number of hits gives the time per request.

Concurrency Tests: Run multiple threads with different scenario’s to check the impact of concurrency on the time per request

Further Improvements

  1. The blacklisting logic works on each application server, hence the bad data can get through multiple servers till one of servers cross the threshold and block the request. A central cache would have helped ensure that the blocking happens exactly as per thresholds mentioned instead of at worst case: threshold * number of servers 
  2. Once a server blacklists a Key, this server itself and others do not immediately start blocking the IDs. The application loads the data from data store every N minutes. This lag can be solved by using a queue: the moment an id is blacklisted add it to a queue, all application servers will be consumers of this topic and will update their blacklisted IDs immediately
  3. We are adding counters into logs. Alerting/Monitoring/Variance based on the counts will help to tweak the parameters if required.

References

Guava cache wiki

Details of caching concepts

More about Guava Cache

API Architecture

https://prashant-prasannakumaran.co.in/welcome-architecture-best-practices/rest-apis/
More insights into API designing

Leave a Reply

Your email address will not be published. Required fields are marked *