Sun. Sep 25th, 2022

Let’s discuss how to build robust ETL and batch (scheduled) jobs by walking through a range of topics. ETL and batch jobs are an essential ingredient of an enterprise solution.

Several aspects need to be carefully thought through, designed, and coded when implementing such functionality.

Read Optimization

Read and write in repeated, fixed-size iterations so that the application's memory usage stays bounded and consistent instead of growing with the size of the source data

For example, run iterations of read/transform/write:

  • Read 100 records
  • Transform
  • Write 100 or fewer records
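
The iteration above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `read_in_chunks`, `run_job`, and the `transform`/`write` callables are hypothetical names, and the source is assumed to be any iterable of records.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional

Record = dict


def read_in_chunks(source: Iterable[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Yield lists of at most chunk_size records so only one chunk is held in memory."""
    iterator = iter(source)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def run_job(
    source: Iterable[Record],
    transform: Callable[[Record], Optional[Record]],
    write: Callable[[List[Record]], None],
    chunk_size: int = 100,
) -> int:
    """Read, transform, and write one chunk at a time; return the number of records written."""
    written = 0
    for chunk in read_in_chunks(source, chunk_size):
        # Assumption: transform returns None to drop a record,
        # which is why a write can contain fewer records than were read.
        transformed = [t for t in map(transform, chunk) if t is not None]
        write(transformed)
        written += len(transformed)
    return written
```

Because the source is consumed lazily, memory usage depends on `chunk_size` rather than on the total number of records.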

Write Optimization

Use bulk mode when saving data into databases

Advantage(s)

  • DB receives a few large writes instead of a sudden burst of many small requests
  • Job will execute faster

Disadvantage(s)

  • Memory footprint is higher since the data has to be collected before it is written
  • A large chunk of data is lost if a write fails
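
As a rough sketch of this trade-off, the example below uses Python's built-in `sqlite3` module as a stand-in for a real database. The `customers` table and the `bulk_insert` helper are assumptions made for illustration only.

```python
import sqlite3
from typing import Iterable, Tuple


def bulk_insert(conn: sqlite3.Connection, rows: Iterable[Tuple[int, str]]) -> None:
    """Insert all rows in one transaction; the whole batch is rolled back on failure."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany("INSERT INTO customers (id, name) VALUES (?, ?)", rows)


# Hypothetical in-memory table used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
bulk_insert(conn, [(1, "Ada"), (2, "Grace"), (3, "Edsger")])
```

If one row in a later batch violates a constraint, every row in that batch is rolled back, which is the "large chunk of data lost" disadvantage listed above. A smaller batch size limits how much is lost per failure.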

Error Handling

Give error handling clear thought and direction. The more time you spend on it during the design phase ironing out the details, the more robust your job will be.

Decide whether the application, or the current loop, can continue running after an error

Ensure logs are precise. The application sifts through a lot of data, and if something goes wrong we want to know exactly which record had issues, to reduce debugging time.

Introduce retries for API and DB calls so that intermittent network issues resolve automatically
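
One possible shape for such a retry is sketched below, with exponential backoff between attempts. The helper name `with_retry`, its defaults, and the choice of which exceptions count as transient are all assumptions for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay_seconds: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each time: base, 2 * base, 4 * base, ...
            sleep(base_delay_seconds * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Only the exception types listed in `retry_on` are retried. Anything else, such as a validation error, propagates immediately, since retrying it would not help.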

Would the next execution of the job be able to continue from the same point and attempt to move forward?

  • If the source data supports a timestamp/bookmark from which we can request data, then it is possible to restart from the last completed timestamp/bookmark
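
A bookmark of this kind could be persisted as sketched below. The JSON file, the `updated_at` field, and the `fetch_after` callable are hypothetical, and the sketch assumes records arrive sorted by a unique, increasing timestamp.

```python
import json
from pathlib import Path
from typing import Callable, List, Optional

Record = dict


def load_bookmark(path: Path) -> Optional[str]:
    """Return the last completed timestamp, or None on the first run."""
    if not path.exists():
        return None
    return json.loads(path.read_text())["last_completed"]


def save_bookmark(path: Path, value: str) -> None:
    """Write to a temporary file first, then rename, to avoid a half-written bookmark."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"last_completed": value}))
    tmp.replace(path)


def run_incremental(
    bookmark_path: Path,
    fetch_after: Callable[[Optional[str]], List[Record]],
    write_batch: Callable[[List[Record]], None],
) -> int:
    """Process batches newer than the bookmark, advancing it after each successful write."""
    processed = 0
    bookmark = load_bookmark(bookmark_path)
    while True:
        batch = fetch_after(bookmark)
        if not batch:
            return processed
        write_batch(batch)
        # The bookmark moves only after the write succeeded, so a crash
        # before this point means the batch is fetched again on the next run.
        bookmark = batch[-1]["updated_at"]
        save_bookmark(bookmark_path, bookmark)
        processed += len(batch)
```

Since a batch may be re-fetched after a crash between the write and the bookmark update, the destination write should ideally be idempotent.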

Consider skipping records that breach the error threshold so that the job can move ahead instead of getting stuck on one problematic record.
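
A minimal sketch of such a threshold follows, assuming the threshold is a maximum number of attempts per record and that each record carries an `id` field. `transform_with_skip` and `BatchResult` are hypothetical names.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

logger = logging.getLogger("etl.job")


@dataclass
class BatchResult:
    transformed: List[dict] = field(default_factory=list)
    skipped_ids: List[object] = field(default_factory=list)


def transform_with_skip(
    records: Iterable[dict],
    transform: Callable[[dict], dict],
    max_attempts: int = 3,
) -> BatchResult:
    """Transform each record, skipping any record that fails max_attempts times."""
    result = BatchResult()
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                result.transformed.append(transform(record))
                break
            except Exception as exc:
                # Log the exact record so the failure is easy to trace later.
                logger.warning(
                    "record id=%s attempt %d/%d failed: %s",
                    record.get("id"), attempt, max_attempts, exc,
                )
        else:
            # The inner loop never hit break: the threshold was breached.
            result.skipped_ids.append(record.get("id"))
    return result
```

Collecting the skipped ids gives the job something to report or reprocess later, rather than dropping problem records silently.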

Parallel Processing

Evaluate whether the job can run some of its work in parallel to reduce execution time

e.g.

  • Processing the batches
  • Records within a batch
  • Steps within each record
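
As one illustration of the second option (records within a batch), the sketch below uses a thread pool from Python's standard library. `transform_batch_parallel` is a hypothetical helper, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


def transform_batch_parallel(
    records: Sequence[In],
    transform: Callable[[In], Out],
    max_workers: int = 4,
) -> List[Out]:
    """Transform the records of one batch concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # and re-raises the first exception when its result is reached.
        return list(pool.map(transform, records))
```

Threads suit transforms that wait on I/O, such as API or DB calls. For CPU-heavy transforms in CPython, a process pool is usually the better fit.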

Execution

We can keep the code for multiple jobs in a single repo if they share common dependencies, e.g. the source and destination are the same for a set of jobs

However, ensure that during execution each job runs as an independent process (ideally a Kubernetes CronJob)

Why is this useful?

  • Reduced resource utilization (application is running only at scheduled time)
  • New container for each execution
  • Dedicated execution space — No side effects from other jobs
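
One way to keep several jobs in one repo while still running each as its own process is a small command-line entry point that runs exactly one named job per invocation. The job names and functions below are hypothetical placeholders.

```python
import argparse
from typing import Callable, Dict, List, Optional


def sync_customers() -> int:
    """Hypothetical job; returns a process exit code (0 means success)."""
    return 0


def export_orders() -> int:
    """Hypothetical job; returns a process exit code (0 means success)."""
    return 0


# Registry of jobs that live in the same repo and share its dependencies.
JOBS: Dict[str, Callable[[], int]] = {
    "sync-customers": sync_customers,
    "export-orders": export_orders,
}


def main(argv: Optional[List[str]] = None) -> int:
    """Run exactly one job, selected by name, and return its exit code."""
    parser = argparse.ArgumentParser(description="Run a single batch job")
    parser.add_argument("job", choices=sorted(JOBS))
    args = parser.parse_args(argv)
    return JOBS[args.job]()

# A real entry point would call sys.exit(main()) under an
# `if __name__ == "__main__":` guard, so the scheduler sees the exit code.
```

Each schedule entry would then invoke the same image or script with a different job name, so every execution gets its own process and container.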

Bringing it all together

do {
    get batch data from source [e.g. DB]: stop job in case of failure

    for each record {
        transform: decide whether to stop or continue in case of failure
        collect for batch write
    }

    write batch data to destination: stop job in case of failure
} while (there are more records in source [e.g. DB])
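
The pseudocode above could be translated into runnable Python roughly as follows. This is a sketch under stated assumptions: `read_batch`, `transform`, and `write_batch` are hypothetical callables, `read_batch` returns an empty list once the source is exhausted, and `JobAborted` is an illustrative exception type.

```python
from typing import Callable, List


class JobAborted(Exception):
    """Raised when a failure should stop the whole job."""


def run_batch_job(
    read_batch: Callable[[], List[dict]],
    transform: Callable[[dict], dict],
    write_batch: Callable[[List[dict]], None],
    stop_on_transform_error: bool = False,
) -> int:
    """Loop over read/transform/write until the source is empty; return records written."""
    written = 0
    while True:
        try:
            batch = read_batch()
        except Exception as exc:
            raise JobAborted("reading from source failed") from exc
        if not batch:
            return written  # no more records in the source

        to_write = []
        for record in batch:
            try:
                to_write.append(transform(record))
            except Exception as exc:
                if stop_on_transform_error:
                    raise JobAborted(f"transform failed for {record!r}") from exc
                # Otherwise skip this record and continue with the next one.

        try:
            write_batch(to_write)
        except Exception as exc:
            raise JobAborted("writing to destination failed") from exc
        written += len(to_write)
```

Read and write failures always stop the job here, while the transform step exposes the stop-or-continue decision as a flag, mirroring the choices in the pseudocode.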

Avoid re-inventing the wheel

Use frameworks like Spring Batch if they fit your use case. They are robust, manage the boilerplate, and let you focus on the business logic

