Chances are, if you’ve worked with data in your career, you have used commercially licensed data analytics software such as SAS, SPSS, or MATLAB. For the last 40 years, the large analytics software vendors dominated most industries, and for good reason. Commercial analytics software is typically delivered pre-packaged with hundreds of analytic functions, process flow management software, and an easy-to-use GUI, which together made for seamless analysis of large data sets. These products sat uncontested in the field of statistical analysis until growing data volume, data frequency, and data formatting complexity exposed their shortcomings. To address these issues, open source communities have developed programming languages (e.g., Python and R) and analytic packages for these languages, such as pandas for Python and dplyr for R. In short, Open Source solutions are gaining prominence as their functionality matures and addresses the ever-changing requirements of machine learning.
What is Open Source?
Strictly speaking, open source simply means that the source code is available to all users to be accessed and/or modified freely. This is usually made possible through an open source license, such as the GPL (GNU General Public License). Anyone who distributes a modified version of a GPL-licensed program has to make the modified source code available under the same license. Open source software is therefore typically distributed at no cost, because it is open to access and modification by any interested party. Python, R, and Spark are all released under different open source licenses. Adopting Open Source software tools involves a vastly different relationship than using commercially licensed software. By interacting with the Open Source community, the user base has the opportunity to directly influence features and functionality. In fact, organizations and individuals are encouraged to participate directly in the evolution and growth of Open Source tools.
Why Should I Transition?
There are three primary benefits of transitioning to an open source analytics platform: (1) reduction in cost, (2) ability to innovate, and (3) availability of resources. Introducing such a drastic change to your analytics team is difficult, and the transition period can be challenging. Even so, organizations that have invested in the migration tell us that they wish they had initiated the effort sooner. The benefits extend beyond the purely monetary: the ability to dynamically scale your analytics, drive innovation (as opposed to responding to the marketplace), and source creative and effective talent provides organizational benefits that will yield second-order impacts for years to come.
The evolution of Open Source software is interesting: the marketplace has provided viable and powerful analytical alternatives to commercially licensed contenders.
The recent Burtch Works software survey highlights the growing preference for Python and R over traditional tools like SAS during the past six years.
Fig. 1. Language Preference for Data Processing by Industry Experience. Graph from Burtch Works LLC, 2019 SAS, R, or Python Survey Update: Which Tool do Data Scientists & Analytics Pros Prefer?, 2019. Web.
Eliminating the need to purchase hardware and software licenses, and to employ staff to manage the implementation, has a direct bottom-line impact, while Cloud computing resources and Open Source software replace the functionality.
Because open source languages are maintained and contributed to by their communities, they integrate machine learning and artificial intelligence capabilities on a continuous basis. So, not only are Open Source projects innovating, they are providing tools for businesses to innovate: tools to create new products, tools to evolve customer experiences, and tools to provide solutions to complex problems.
And finally, the talent pools gravitating to data science and machine learning are increasingly adopting R and Python as their tools of choice. To compete and innovate, it is imperative that your infrastructure match the demands of that talent pool. The trend continues to gain momentum and suggests that new analytical teams will be well versed in open source and will look to continue developing their skill sets.
The chart below, produced by the Burtch Works annual SAS, R, and Python survey, shows language preference in 2019 by industry experience, clearly indicating new industry entrants are arriving with a strong understanding of Open Source.
Fig. 2. Language Preference for Data Processing by Industry Experience. Graph from Burtch Works LLC, 2019 SAS, R, or Python Survey Update: Which Tool do Data Scientists & Analytics Pros Prefer?, 2019. Web.
It’s Time to Make the Move
Admittedly, transitioning to the Cloud is no small task; the decision requires commitment and careful planning on the part of leadership. Legacy code needs to be refactored, and data scientists need to be retrained, which is a formidable task for career-long professionals with decades of experience with commercially licensed software. However, the transition is a necessary one, and the sooner it is undertaken, the sooner the benefits can be realized. In a recent InfoQ interview, Christine Doig, a data scientist with Continuum Analytics, said: “The Python data science ecosystem has grown tremendously in the past 2-4 years. It has a great community, mature open source libraries for scientific and array computing that are the base for algorithm development. Many users are moving from commercial tools to languages like SAS or Matlab, or R, because of their general purpose and extensive libraries” [3]. It’s typical to see staff become invigorated once the motivation for such a transition is understood.
DecisivEdge has undertaken a transition to an open source environment. As a result, we are experiencing faster model development cycle times and increased productivity among team members while leveraging newly developed technologies to bring business solutions to market.
Keep an eye out for our next blog post, in which we will discuss the benefits of basing machine learning development and deployment on cloud-based infrastructure.