Learning Spark

January 24, 2019

I’m learning Apache Spark for a work project, and of the “big data” systems I’ve looked at, it’s my favorite so far.

I was pretty disappointed with how little help it gives the developer in making parallel computing transparent (vs. HPC tools & languages), but it has its charms.

In any event, the biggest barrier for me so far has been getting my head around the MapReduce programming model. My first professional experience working with data was SQL/RDBMS-oriented, and a few years back I started working with key-value stores, but I never really embraced MapReduce, so now I’m paying for that.
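
To make the shift concrete, here’s a minimal sketch in plain Python (not Spark itself) of how something I’d normally write as a SQL `GROUP BY` with a `SUM` looks when it’s broken into map, shuffle, and reduce steps. The sample data and the function names are made up for illustration; a real framework handles the shuffle step for you and spreads the work across machines.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample records: (customer, order_total)
orders = [("alice", 30), ("bob", 20), ("alice", 50), ("carol", 10), ("bob", 5)]

def map_phase(records):
    """Emit one (key, value) pair per input record."""
    return [(customer, total) for customer, total in records]

def shuffle_phase(pairs):
    """Group values by key, the job a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Fold each key's values down to a single result (here, a sum)."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

# Roughly: SELECT customer, SUM(order_total) FROM orders GROUP BY customer
totals = reduce_phase(shuffle_phase(map_phase(orders)))
print(totals)
```

The SQL version says what result I want; the MapReduce version makes me spell out the key each record is filed under and how values sharing a key get combined, which is the part I’m still getting used to.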

It’s frustrating to be held back from solving a problem you could solve quickly with what you already know, just because you have to learn something new first. On the flip side, I love learning new things, so I’m trying really hard to focus on adopting the philosophy of Spark & MapReduce rather than simply forcing the models I’m comfortable with onto this new environment (having been on the receiving end of that kind of force, I can appreciate why it’s a bad idea).

Another exciting aspect of this is that it gives me a very practical application for the RAIN project. Spark is probably not a perfect fit for a machine like RAIN, since Spark is really oriented toward data-bound processing and RAIN is more constrained in storage and memory than a traditional server farm, but if nothing else it will give me a more realistic workload to experiment with than benchmarks.

(It also gives me an excuse to wrench on RAIN during “business hours.”)