Now

Currently building a segmentation and ad-tech platform for my employer. Most of my work is around how we do data engineering, of course. As usual, the vast majority of performance improvement on a Hadoop cluster comes down to how we store our data. I am continuing to experiment with different approaches for storing the data that underlies our application. Right now we are using compressed ORC files that are partitioned, bucketed, and sorted. Doing that, along with gathering column statistics, makes everything we do faster. My current experiment is to see how Parquet performance compares to ORC, and how different partition sizes impact performance and resource utilization.
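For reference, that storage layout can be sketched as Hive DDL. The table name, columns, bucket count, and partition value below are hypothetical, but the pattern — compressed ORC that is partitioned, bucketed, and sorted, plus column statistics — is the one described above:

```sql
-- Hypothetical table illustrating the layout: compressed ORC,
-- partitioned by date, bucketed and sorted on the join key.
CREATE TABLE segments (
  user_id    BIGINT,
  segment_id INT,
  score      DOUBLE
)
PARTITIONED BY (ds STRING)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 64 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Gather column statistics so the optimizer can prune
-- partitions and pick better join strategies.
ANALYZE TABLE segments PARTITION (ds = '2018-01-01')
COMPUTE STATISTICS FOR COLUMNS;
```

Bucketing and sorting on the join key lets Hive use bucket map joins and skip sorting at query time, which is where much of the win comes from.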


Upcoming experiments will include adding a query layer on top of our Hadoop datasets — something like Apache Drill, Presto, or perhaps Dremio.