Big data frameworks and the accessibility of cloud computing have democratized data science. Data processing frameworks have evolved to the point where exceedingly large data sources can be consumed, processed, and modeled. Coupled with cloud-based solutions, the resulting processing times allow the data scientist to focus on core data science problem solving rather than on wrangling systems, platforms, and engineering concerns.

At the core of the modern data science tech stack is distributed computing: the use of a network of machines that acts as a single machine to complete a task. These distributed systems typically use a driver/worker architecture, in which one machine is the driver and coordinates the worker machines, which execute tasks and report their results back to the driver. Distributed solutions provide an astounding amount of power but an equal amount of complexity, which can distract the data scientist.
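
The driver/worker pattern can be illustrated with a minimal Python sketch. This is an analogy on a single machine, not a real cluster: threads from the standard library stand in for worker machines, and the function names (`run_driver`, `worker_task`, `split_into_partitions`) are hypothetical names chosen for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(partition):
    """Work done by one worker: sum the squares of its partition."""
    return sum(x * x for x in partition)


def split_into_partitions(data, num_partitions):
    """Driver-side step: split the data into roughly equal chunks."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_driver(data, num_workers=4):
    """Driver: hand out partitions, collect partial results, combine them."""
    partitions = split_into_partitions(data, num_workers)
    # Threads stand in for worker machines in this single-machine sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(worker_task, partitions))
    # The workers "report back"; the driver combines their answers.
    return sum(partial_results)
```

Calling `run_driver(list(range(10)))` splits the ten numbers across four workers and returns the combined sum of squares. In a real distributed system the same coordination happens across a network, which is where the additional complexity comes from.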

As distributed computing evolved, data processing methodologies emerged to capitalize on the added capacity. One such processing framework is MapReduce, which was designed for processing large amounts of data in parallel in a reliable and fault-tolerant manner.

MapReduce works in two main phases: the Map phase, where input data is split and mapped into key-value pairs, and the Reduce phase, where the pairs are shuffled (grouped by key) and then reduced. While MapReduce was a huge step forward for distributed computing, it has several disadvantages:

1) limited to batch processing

2) I/O-bound to disk, which results in slow compute times

3) designed to work specifically with Hadoop, which limits its use cases
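
The two MapReduce phases described above can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the idea only, not Hadoop's API; the function names are hypothetical, and a real MapReduce job would run the map and reduce steps in parallel across many machines and write intermediate results to disk.

```python
from collections import defaultdict


def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    pairs = []
    for doc in documents:
        for word in doc.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the phases in order: map, then shuffle, then reduce."""
    return reduce_phase(shuffle_phase(map_phase(documents)))
```

For example, `word_count(["spark is fast", "spark is unified"])` counts "spark" and "is" twice each and the remaining words once.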

The shortcomings of MapReduce spurred advancements that matured into Apache Spark, an open-source unified computing engine that is up to 100 times faster than Hadoop MapReduce [1]. On top of this core processing engine, Spark has libraries for SQL, graph computation, stream processing, and machine learning, which have contributed to Spark becoming one of the most popular tools for big data. Spark also supports a multitude of languages, including Python, R, Java, Scala, and SQL.

While the speed increase provided by Spark was a game changer on its own, a major key to its success is the “unified” component. Unlike MapReduce, which was designed to work with one specific kind of storage, Spark is designed to support a wide variety of persistent storage systems: cloud storage such as Azure Storage or Amazon S3, distributed file systems, key-value stores, and message buses. Prior to Spark, people were forced to combine different systems, libraries, and APIs to complete big data tasks. With Spark, those tasks can be handled on a single computing engine through a consistent set of APIs, which keeps project codebases efficient.

While Spark made tremendous strides in improving ease of use, the configuration, deployment, and management of compute clusters is still shrouded in a layer of complexity that grows as the clusters do.

Cloud resources are well suited to distributed computing because of the overhead and infrastructure that distributed computing requires. Using cloud solutions instead of on-premises hardware allows for much greater flexibility and scalability. Cloud solutions allow businesses to be agile and react quickly to change without the commitments that accompany traditional on-premises hardware. Cloud solutions can be scaled up or down at a moment’s notice, whereas changes to physical infrastructure are more complicated and time-consuming. Physical equipment also requires maintenance and upgrades, burdens that shift to the cloud provider with cloud computing. In short, cloud solutions position businesses to leverage distributed computing.

Databricks brings these pieces together in a unified analytics platform that supports machine learning from development through deployment. It is built on Apache Spark and embeds the cluster management tools. With Databricks, one can:

  • Configure, deploy, and manage clusters without having to invest in IT infrastructure
  • Connect to a variety of node types, including CPU- and GPU-enabled nodes of various sizes and configurations
  • Utilize a managed environment including a managed version of MLflow, one of the fastest growing ML lifecycle management tools
  • Build and deploy on Microsoft Azure or Amazon Web Services
  • Integrate with other resources. Databricks lets you create mount points on the Databricks File System (DBFS) so that you can easily access data from blob storage, a data lake, or even an Amazon S3 bucket.
  • Integrate with Azure Data Factory, allowing businesses to leverage their existing Databricks service for things such as ETL in their data pipelines.
  • Facilitate version control with a revision history built into Databricks notebooks, along with support for linking to an existing Git repository hosted on GitHub, Bitbucket, or Azure Repos.

All aspects of the data analytics space continue to evolve at a tremendous pace, and the shortcomings of data processing and distributed computing have been clear pain points in the data science workflow. Databricks and its suite of functionality have removed a great deal of distraction from our projects. At DecisivEdge, we have developed a robust and scalable machine learning development platform based on Databricks that allows the data scientist to focus on data science. As a result, we are delivering projects faster and providing clients with higher quality deliverables.