I assume that you are familiar with how Spark runs a job, the basics of distributed systems, your cluster's current utilisation, job SLAs, resource details, etc.
There are mainly two ways to optimise our jobs:
There is a lot we can do in Spark while writing a job, and I may not be aware of all of it, but based on my experience I will try to cover some important standards that everyone should follow for better resource utilisation and execution:
A. Caching: Suppose we are reading data from MySQL through the Spark JDBC connector as…
I am a technology enthusiast trying to solve data puzzles.