Developing Spark jobs seems easy enough on the surface, and for the most part it really is. The provided APIs are well designed and feature-rich, and if you are familiar with Scala collections or Java streams, you will be done with your implementation in no time. The hard part comes when you run your jobs on a cluster under full load, because not all jobs are created equal in terms of performance. Unfortunately, to implement your jobs in an optimal way, you have to know quite a bit about Spark and its internals.
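To illustrate that familiar feel, here is a word-count-style chain written in plain Scala collections; Spark's RDD API offers near-identical operators (`flatMap`, `filter`, `map`, plus `reduceByKey` in place of the `groupBy`/`map` pair below), so logic like this ports over almost verbatim. The object and method names are just illustrative:

```scala
// Plain Scala collections: count words longer than three characters.
// Spark's RDD API mirrors this chain almost operator-for-operator,
// which is why the initial implementation of a job feels so easy.
object WordCount {
  def countLongWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))             // tokenize each line
      .filter(_.length > 3)                 // drop short words
      .groupBy(identity)                    // roughly reduceByKey in Spark
      .map { case (w, ws) => w -> ws.size } // word -> occurrence count

  def main(args: Array[String]): Unit = {
    val result = countLongWords(Seq("spark jobs seem easy", "spark is easy"))
    println(result) // Map(spark -> 2, jobs -> 1, seem -> 1, easy -> 2)
  }
}
```

The resemblance is exactly what makes the performance pitfalls surprising: code that looks like an innocent collections pipeline can shuffle terabytes across the cluster.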
In this article I will talk about the most common performance problems that you can run into when developing Spark applications and how to avoid or mitigate them.
It has been ten months since I started this blog. I had high hopes for it and wanted to share all the great info about the awesome technologies I was working with, but it ended up being much more time-intensive than I anticipated. Nevertheless, I still find myself wanting to share my knowledge with the rest of the community, so here we go for attempt number two.
Read further if you are interested in what I was up to this year.
If you work with Big Data, you have probably heard of Apache Spark, the popular engine for large-scale distributed data processing. And if you have followed the project’s development, you know that its original RDD model was superseded by the much faster DataFrame model. Unfortunately, that performance gain made the model much more unwieldy, because it newly required an explicit data schema. This was improved upon by the currently used Dataset model, which provides automatic schema inference based on language types; however, the core logic remained largely the same. Because of that, extending the model is not as easy a task as one might think (especially if you want to do it properly).
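As a sketch of what that schema inference looks like in practice: with the Dataset API, Spark derives the schema from an ordinary case class through an implicitly materialized `Encoder`, so no manual `StructType` is needed. The `LogEntry` type below is a hypothetical example of mine, and the snippet assumes Spark is on the classpath (it is not runnable standalone):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; Spark infers the schema from its fields.
case class LogEntry(timestamp: Long, level: String, message: String)

object SchemaInferenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-inference-sketch")
      .getOrCreate()
    import spark.implicits._ // Encoder derivation for case classes

    val ds = Seq(LogEntry(1L, "INFO", "started")).toDS()
    ds.printSchema() // timestamp: long, level: string, message: string
    spark.stop()
  }
}
```

With the older DataFrame model you would have had to spell out the equivalent `StructType(Seq(StructField("timestamp", LongType), ...))` by hand.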
In this article, I will demonstrate how to create a custom data source that uses Parboiled2 to read custom time-ordered logs efficiently.
In this post, I explain how I ended up with my present blogging solution and provide a simple guide to anyone who is interested in using the Jekyll/GitHub Pages combination to publish a site of their own.
As a Scala engineer, I have always prized simplicity in the projects I work on. I do not like frontend development, as I find it tedious and needlessly time-consuming, so I wanted to avoid it. However, I also see myself as a “hacker”, so I wanted to be able to freely tweak and customize my solution. That is why I did not choose the insanely popular WordPress platform (even though I was tempted by it more than once while setting things up). All the tools had to, of course, be free and open source, so that I could migrate should the need arise. And lastly, I wanted to be able to integrate my solution with GitHub and share it there.
Hello, everyone, and welcome to my first blog.
I am a software engineer currently working at Seznam.cz on the Sklik advertising platform. My job is to design and implement processes that consume the massive amounts of data that Seznam.cz collects via its fulltext search solution as well as its partner network.
I also recently started contributing to the Apache Spark project during my free time.