Latest posts

Ramandeep Singh Nanda

Reactively Streaming CSV using RxJava

RxJava is an extremely useful streaming framework (here is an example application, RT_UBER_NYC_TAXI, that uses it to process RESTful calls to both Uber and Lyft in parallel). In this post, however, I will cover how you can reactively stream and process a CSV file.

Firstly, you can create a Flowable of …

Ramandeep Singh Nanda

Spark Scaling to large datasets

In this post, I will share a few quick tips for scaling your Spark applications to larger datasets without requiring large executor memory.

  • Increase shuffle partitions: The default number of shuffle partitions is 200; for larger datasets, you are better off with a larger number of shuffle partitions. This helps in many ways …

Ramandeep Singh Nanda

Removing Projection Column Ambiguity in Spark

Column ambiguity is quite common when you join two tables. It becomes an unnecessary hassle when you want to select all the columns from both tables while discarding the duplicate columns. The problem is especially difficult to handle if you have wide tables, where you would want …

Ramandeep Singh Nanda

Efficient Spark Dataframe Transforms

If you are working with Spark, you will most likely have to write transforms on dataframes. A dataframe exposes the obvious method df.withColumn(col_name,col_expression) for adding a column with a specified expression. Now, since dataframes are immutable, we are getting a newly …

Ramandeep Singh Nanda

Writing Generic UDFs in Spark

Apache Spark offers the ability to write Generic UDFs. However, for an idiomatic implementation, there are a couple of things that one needs to keep in mind.

  1. You should return a subtype of Option, because Spark automatically treats None as null and is able to extract the value from Some …