We are going to discuss a web application exposing REST APIs, and how Guava caches were used to filter out invalid requests.
A few aspects to note regarding the traffic experienced:
Number of servers: 4
Number of instances: 16 (i.e. each server runs 4 instances of Apache Tomcat)
This was before everyone went to the cloud, and on-premises hosting was still common.
We receive traffic from external sources, which can easily be abused by bots and invalid data requests.
We want to filter out the bots (specific conditions based on application logic) and the invalid data.
The goal is to identify and block bad data dynamically across the entire cluster of application servers.
To make the problem concrete, consider the example of blocking traffic from an IP address once we receive more than N requests from it.
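As a minimal illustration of that example (class and method names here are hypothetical, not from the original system):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: block an IP once it has sent more than N requests.
public class IpThrottle {
    private final int maxRequests;
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    public IpThrottle(int maxRequests) {
        this.maxRequests = maxRequests;
    }

    // Returns true once this IP has exceeded the allowed request count.
    public boolean shouldBlock(String ip) {
        int seen = counts.computeIfAbsent(ip, k -> new AtomicInteger()).incrementAndGet();
        return seen > maxRequests;
    }
}
```

A real deployment would also need to expire these counts over time, which is exactly what the time-windowed caches described below provide.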
Maintain a central list of blacklisted data in the existing application data store (Aerospike)
Each web application instance will do the following:
1. Retrieve list of blacklisted data from central location
2. Block incoming data which is already blacklisted
3. Identify if incoming data is to be blacklisted
Retrieve the blacklist on application startup
Periodically sync the blacklisted data with the central server
Filters check whether incoming data is already blacklisted
Live detection of suspicious traffic
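The per-instance steps above can be sketched as follows (a minimal sketch; the central store is abstracted as a Supplier, whereas the real system read from Aerospike):

```java
import java.util.Set;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of the per-instance filter: load the blacklist at startup,
// then refresh it periodically from the central store.
public class BlacklistFilter {
    private volatile Set<String> blacklist;

    public BlacklistFilter(Supplier<Set<String>> centralStore,
                           ScheduledExecutorService scheduler,
                           long refreshMinutes) {
        this.blacklist = centralStore.get();   // 1. retrieve on startup
        scheduler.scheduleAtFixedRate(
                () -> { blacklist = centralStore.get(); },   // periodic sync
                refreshMinutes, refreshMinutes, TimeUnit.MINUTES);
    }

    // 2. the filter check: is this key already known to be bad?
    public boolean isBlacklisted(String key) {
        return blacklist.contains(key);
    }
}
```

Step 3, identifying new bad keys, is what the multi-level caches in the next section handle.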
Problem: the same key maps to multiple values
We had to keep track of the keys coming in via the APIs, and blacklist a key if it generated more than N different values.
The huge volume of keys made it impractical to track every key forever.
We decided to introduce time windows; each window is a cache with a TTL.
Level 1 Cache
The smallest window has a TTL of 10 minutes.
If any key accumulates more than N (e.g. 1) distinct values, move it to the next cache.
This cache holds the largest number of elements.
Based on our data analysis and estimates, this cache will hold around 1.5 GB of data.
Level 2 Cache
This cache holds data for 1 hour.
If any key accumulates more than N (e.g. 1) distinct values, move it to the next window.
Level 3 Cache
This cache holds data for 1 day.
If any key accumulates more than N (e.g. 1) distinct values, move it to Aerospike.
Hence three LRU caches, with TTLs of 10 minutes, 1 hour, and 1 day.
Data -> Key: Object, Value: Set<Long>
A background thread invokes cleanUp() on each cache periodically.
Retention Threshold: number of distinct values recorded per key after which the key is moved to the next cache
Blacklist Threshold: number of distinct values recorded per key after which the key is moved to permanent storage (marked as blacklisted)
Maximum Size: 500,000 (number of elements per cache)
Concurrency Level: 200
| | L1 Cache | L2 Cache | L3 Cache |
|---|---|---|---|
| Expire after write | 10 mins | 1 hour | 1 day |
| Eviction | TTL or cache maximum size | TTL or cache maximum size | TTL or cache maximum size |
| ID crossed retention limit | Move to L2 | Move to L3 | Move to Aerospike |
| ID crossed blacklist limit | Move to L2 | Move to Aerospike | Move to Aerospike |
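The retention check above can be sketched for a single level with plain Java collections (names are illustrative; the real implementation used Guava caches holding Set&lt;Long&gt; values, as described above):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of one cache level's bookkeeping: each key maps to the set of
// distinct values seen for it; crossing the threshold promotes the key.
public class LevelTracker {
    public enum Action { KEEP, PROMOTE }

    private final int retentionThreshold;
    private final Map<Object, Set<Long>> entries = new HashMap<>();

    public LevelTracker(int retentionThreshold) {
        this.retentionThreshold = retentionThreshold;
    }

    // Record a (key, value) pair; PROMOTE means "move this key to the
    // next level" (or to Aerospike, at level 3).
    public Action record(Object key, long value) {
        Set<Long> values = entries.computeIfAbsent(key, k -> new HashSet<>());
        values.add(value);
        if (values.size() > retentionThreshold) {
            entries.remove(key);   // the key leaves this level when promoted
            return Action.PROMOTE;
        }
        return Action.KEEP;
    }
}
```

Note that duplicate values do not count: only distinct values per key push a key toward promotion.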
Preparations / Ground work
Investigated application logs to find out
a. total hits per minute
b. unique keys per day
c. memory requirement per key
d. memory requirement for each cache
Cache Eviction listener
Keys are evicted from a cache for various reasons.
To ensure you handle only real eviction scenarios, wrap the logic in a wasEvicted() check.
The moment a key is removed from a cache and moved to the next level, it must also be removed from the lower-level cache.
Otherwise, concurrent or subsequent requests will cause delayed blacklisting of that key.
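The wasEvicted() check can be demonstrated as follows (a minimal sketch assuming Guava is on the classpath; the tiny maximumSize and concurrencyLevel(1) exist only to force a deterministic eviction in the demo):

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// wasEvicted() is true only for automatic removals (size limit, TTL
// expiry, GC collection), not for explicit invalidate() calls, so
// promotion logic wrapped in this check never fires when we deliberately
// remove an already-promoted key from the lower-level cache.
public class EvictionListenerSketch {
    public static List<String> evictedKeys() {
        List<String> evicted = new ArrayList<>();
        RemovalListener<String, String> listener = n -> {
            if (n.wasEvicted()) {            // real eviction only
                evicted.add(n.getKey());
            }
        };
        Cache<String, String> level1 = CacheBuilder.newBuilder()
                .maximumSize(2)                          // 500,000 in production
                .expireAfterWrite(10, TimeUnit.MINUTES)  // L1 TTL
                .concurrencyLevel(1)                     // deterministic LRU for the demo
                .removalListener(listener)
                .build();
        level1.put("a", "1");
        level1.put("b", "2");
        level1.put("c", "3");     // exceeds maximumSize -> "a" is evicted (LRU)
        level1.invalidate("c");   // explicit removal -> wasEvicted() is false
        return evicted;
    }
}
```

When a key is promoted, the promotion code should also call invalidate() on the lower-level cache; that explicit removal deliberately does not re-trigger the listener.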
Ensuring the blacklisted data fits into Aerospike
Total Record size in Aerospike
“Every record value size is limited to the write-block-size, which is typically set in the configuration to 128KB for Flash devices, but can be increased up to 1MB (which is the default if not specified in the configuration file).”
To be safe and store the maximum amount of data, we analysed how many data points would fit within Aerospike's 1 MB limit.
As an additional safeguard, the data stored in Aerospike was compressed to make room for more entries.
A warning log statement was also added to help trace the issue if it occurs in production.
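The compression safeguard can be sketched as follows (a minimal sketch using GZIP from the JDK; the compression scheme actually used is not specified in the original):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of compressing the blacklist payload before writing it to
// Aerospike, so more entries fit under the write-block-size record limit.
public class RecordCompressor {
    public static byte[] compress(String payload) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(payload.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // This is where a warning could be logged if bytes.size()
        // approaches the 1 MB record limit, mirroring the warning above.
        return bytes.toByteArray();
    }

    public static String decompress(byte[] data) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```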
a. Increased the Tomcat memory allocation from 128 MB to 16 GB
b. Increased the Aerospike record size to the maximum of 1 MB
c. Set the concurrencyLevel based on the number of Tomcat threads
If the free JVM heap falls below a threshold, raise an alert.
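That check can be sketched as follows (the threshold rule is illustrative; readings come from the JDK's MemoryMXBean):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Sketch of the heap alert: fire when the free heap fraction drops
// below a configured threshold.
public class HeapAlert {
    // Pure rule, separated out so it is easy to test.
    public static boolean shouldAlert(long usedBytes, long maxBytes, double freeFractionThreshold) {
        double freeFraction = 1.0 - (double) usedBytes / maxBytes;
        return freeFraction < freeFractionThreshold;
    }

    // Wires the rule to the live heap reading.
    public static boolean checkCurrentHeap(double freeFractionThreshold) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return shouldAlert(heap.getUsed(), heap.getMax(), freeFractionThreshold);
    }
}
```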
Added counters to track the following:
Size of Cache(s): Number of elements in Cache
Number of elements removed from Cache(s)
Number of elements added to Cache(s)
Benchmarking Tests: fire the addToCache logic 1M times under different scenarios. The total time / number of hits gives the time per request.
Concurrency Tests: run multiple threads under different scenarios to check the impact of concurrency on the time per request.
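The time-per-request calculation can be sketched as (the addToCache logic is passed in as a Runnable for illustration):

```java
// Sketch of the benchmark harness: fire the cache-insert logic N times
// and report the average time per call in nanoseconds.
public class CacheBenchmark {
    public static double nanosPerCall(Runnable addToCache, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            addToCache.run();
        }
        // total time / number of hits = time per request
        return (System.nanoTime() - start) / (double) iterations;
    }
}
```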
- The blacklisting logic runs independently on each application server, so bad data can get through multiple servers until one of them crosses the threshold and blocks the request. A central cache would ensure blocking happens exactly at the configured threshold, instead of at the worst case of threshold × number of servers.
- Once one server blacklists a key, neither it nor the others start blocking it immediately: each application only reloads from the data store every N minutes. This lag could be removed with a queue: the moment an ID is blacklisted, add it to a queue; all application servers consume this topic and update their blacklisted IDs immediately.
- We are adding counters to the logs. Alerting, monitoring, and variance analysis based on these counts will help tweak the parameters if required.
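The queue idea can be sketched in-process with a BlockingQueue (in production the queue would be a shared topic on a message broker, and each instance would run a consumer loop):

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// In-process sketch: the moment a key is blacklisted it is published,
// and the consumer applies it to the local blacklist immediately.
public class BlacklistPropagation {
    private final BlockingQueue<String> topic = new LinkedBlockingQueue<>();
    private final Set<String> localBlacklist = ConcurrentHashMap.newKeySet();

    public void publish(String key) {
        topic.add(key);
    }

    // One consumer poll step; a real consumer would loop on take().
    public void drainOnce() {
        String key;
        while ((key = topic.poll()) != null) {
            localBlacklist.add(key);
        }
    }

    public boolean isBlacklisted(String key) {
        return localBlacklist.contains(key);
    }
}
```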