PNEC 2019

Automation of Analytics & Data Pipelines #digitize #analytics #machinelearning #automation #bigdata (Room Salon A-D)

In the last few years, container technology has opened the door to application collaboration platforms. These platforms have matured to the point that they show great promise as Data Science collaboration platforms. Fundamentally, Data Scientists face challenges very similar to those seen in IT and Software Development processes. For Data Scientists, it is crucial to be able to share results and collaborate quickly and efficiently with colleagues and/or users. We would like to share our knowledge of one particular platform, Red Hat OpenShift, which we have found useful for Data Science collaboration at ExxonMobil. We will discuss the desired platform requirements: an interactive environment; the ability to store and share code and data without others needing to set up a new environment on their laptops and/or PCs; support for security and access needs; and, when needed, the ability to “burst” the environment by increasing RAM, CPU and/or storage. From our journey, we would like to share how one can distinguish between an “Enterprise-ready” and a “non-Enterprise-ready” platform. With respect to “Enterprise-ready” platforms, we will give an overview of the Cloud Native Computing Foundation (CNCF) and its role in defining Kubernetes, which is the backbone of container orchestration. We will also highlight the roadmap for Kubernetes, including support for the specialized performance computing requirements demanded by Data Science workloads, along with Artificial Intelligence, Machine Learning and Neural Network workloads and what has traditionally been considered typical HPC compute. With an understanding of what an “Enterprise-ready” platform is, we can then describe the path for Data Science users to leverage a computing platform that supports sharing of expensive resources such as GPUs, FPGAs and InfiniBand, enforces quotas and thresholds, and integrates with CI/CD.
In particular, CI/CD integration is giving rise to the notion of ScienceOps workflows, which automate Data Science analysis and move work from Data Science development into production Data Science applications.