By now, the value of data analytics hardly needs justification. In recent years, it has established itself as a game changer, shaking up industries that have been around for decades. As analytical techniques grow more sophisticated, organisations are pushing aggressively to apply data analytics to problems in new and creative ways, from fraud prediction to personalisation of the consumer experience.
While success stories are well publicised, success remains elusive for the majority of organisations. Many companies have been exploring the use of data analytics in their businesses, but such initiatives are still largely limited to experimentation and remain stuck at the proof-of-concept (POC) stage.
Indeed, for the majority of organisations, the challenge is moving beyond the initial promise of data analytics to operationalising their analytics efforts at production scale.
Performance is key when it comes to production-grade analytics
For companies looking to scale up their data analytics efforts, the process can be overwhelming, akin to entering uncharted territory. First of all, priorities will likely shift. At the POC stage, predictive accuracy is probably the most important criterion for success. Companies create a sandbox environment in which data scientists can test and experiment with their analytics models and algorithms.
But when businesses release an actual product with the algorithms embedded, they come face to face with mature digital users and their expectations of a high-quality user experience. How well the algorithm performs, in terms of speed as well as accuracy, will ultimately decide whether widespread adoption occurs.
Google’s search engine is a useful analogy for why user experience becomes a critical factor in a product’s popularity. Search engines are essentially complex analytics and machine learning models. When we use Google’s search engine to find information, we expect accuracy, which is driven by algorithms. But another factor we often take for granted as users is the speed at which the results are delivered to us. We wouldn’t want to wait five minutes for a search result.
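One way to make the speed requirement concrete is to time the model against a response-time budget before release. The following is a minimal sketch in Python, not a prescribed method: the `predict` function, the sample inputs and the 200 ms budget are all hypothetical placeholders for a real model and a real service-level target.

```python
import statistics
import time


def predict(features):
    """Hypothetical stand-in for a trained model's scoring function."""
    return sum(features) / len(features)


def latency_percentile(func, inputs, percentile=95):
    """Time func on each input and return the given latency percentile in ms."""
    timings_ms = []
    for item in inputs:
        start = time.perf_counter()
        func(item)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns the 99 cut points for percentiles 1 to 99
    cut_points = statistics.quantiles(timings_ms, n=100)
    return cut_points[percentile - 1]


LATENCY_BUDGET_MS = 200.0  # assumed budget, not a figure from this article

sample_inputs = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(1, 201)]
p95_ms = latency_percentile(predict, sample_inputs)
within_budget = p95_ms <= LATENCY_BUDGET_MS
print(f"p95 latency: {p95_ms:.3f} ms, within budget: {within_budget}")
```

Reporting a high percentile rather than the average reflects what the slowest users experience, which is usually what shapes their impression of the product.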
Data lake as a solution for data integration
Most POCs are currently carried out within a small, controlled scope and limited to a few departments in an organisation. At this stage, data is often extracted manually offline from various sources and mashed together for analysis, from which insights are derived.
The intention is often to expand the project to the entire company should the POC be successful. When that happens, the volume of input data could increase dramatically, and the integration of data from various sources becomes a challenge. As such, the fundamental way in which the data is handled will need to change. Companies may need to look into building a data repository before they begin their analysis.
With the increasing volume, variety and velocity of data, many companies are looking to set up a data lake: a central repository capable of holding all forms of data (structured, semi-structured and unstructured) taken from multiple sources. The data lake is designed to retain all attributes of the data, and to support projects in which the scope of the data or its eventual use is not yet fully clear. In many industries, data lakes are increasingly being adopted as a way to tackle the issue of data integration, gain more visibility and put an end to data silos.
Yet the nature of the data lake means that, if not properly maintained, it can easily turn into a data swamp filled with poor-quality data. Insights drawn from such data are as unreliable as trying to see through a murky pool.
Robust data governance is needed to maintain quality data
One of the surest ways to prevent a data lake from turning into a data swamp is to establish robust data governance policies and procedures that ensure data integrity and quality, and to use tools to enforce those policies. Without quality data, no amount of sophisticated analytical technique will yield the desired results. It will just be “garbage in, garbage out”.
Unfortunately, data governance tends not to be the focus of data analytics projects at the POC stage. However, when a data analytics initiative moves to production scale, more users are involved and the stakes become higher. Any mistake made due to incomplete or erroneous data will have greater implications than before.
Data governance is therefore critical in its function as the gatekeeper, ensuring that the integrity of the data is maintained and the risk of mishandling it is mitigated.
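To illustrate what enforcing a governance policy with tools can look like, here is a minimal Python sketch of a data quality gate that checks records before they are admitted for analysis. The field names, the allowed currencies and the rules themselves are hypothetical; real policies would come from the organisation's own governance framework and would typically be enforced by dedicated tooling.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    total: int
    rejected: int
    reasons: dict


REQUIRED_FIELDS = ("customer_id", "amount", "currency")  # hypothetical schema
ALLOWED_CURRENCIES = {"SGD", "USD", "EUR"}               # hypothetical rule


def validate(record):
    """Return a list of rule violations for one record (empty if clean)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing:{field}")
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        problems.append("negative:amount")
    currency = record.get("currency")
    if currency not in (None, "") and currency not in ALLOWED_CURRENCIES:
        problems.append("unknown:currency")
    return problems


def quality_gate(records):
    """Split records into accepted rows and a report on the rejected ones."""
    accepted, reasons, rejected = [], {}, 0
    for record in records:
        problems = validate(record)
        if problems:
            rejected += 1
            for problem in problems:
                reasons[problem] = reasons.get(problem, 0) + 1
        else:
            accepted.append(record)
    return accepted, QualityReport(len(records), rejected, reasons)


records = [
    {"customer_id": "C1", "amount": 120.0, "currency": "SGD"},
    {"customer_id": "", "amount": 35.5, "currency": "USD"},
    {"customer_id": "C3", "amount": -10.0, "currency": "XXX"},
]
accepted, report = quality_gate(records)
print(len(accepted), report.rejected, report.reasons)
```

Counting rejections by reason, rather than silently dropping bad rows, gives data owners something to act on at the source system.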
Going beyond POC requires more than just technical skillsets
As part of a larger industry trend, one of the biggest barriers organisations face is attracting and retaining the right talent.
Data science is a relatively niche field at the moment, and professionals working in it tend to come from IT, mathematics or statistics backgrounds. However, the data scientist of the future will need skillsets that span various disciplines, including the humanities, social science and business. This skills gap is a major hurdle for organisations that want to move their POCs into production.
For instance, at the POC stage, for expediency and cost efficiency, most data scientists tend to favour writing advanced analytics algorithms in open-source languages such as R or Python. The results are rendered into simple data visualisation formats for ease of communication to business stakeholders. But to deploy these analytics algorithms on a sustained basis, the algorithms would need to be embedded in an application that business users can readily interact with. This can potentially take the form of a web-based or custom application. In designing such applications, there are various components to take note of. For example, user-centric design principles should be followed, which requires design thinking and app development skillsets in addition to data science knowledge.
Ultimately, successfully moving data analytics efforts out of their ‘nursery’ (that is, the POC) and into production will require professionals with the right mix of business savvy and technical skillsets.
Proof-of-concept to Production in five key steps:
1. Deploy model in an APPLICATION
Data science models need to be embedded in an application that business users can readily interact with. Performance is also an important factor—the code used to construct the algorithm has to be optimised to deliver results within an acceptable length of time.
2. Estimate total production DATA VOLUME
Most POC projects are carried out using only a subset of the total production data. Proper sizing is important in determining the appropriate infrastructure architecture that needs to be put in place. Is a Big Data architecture such as Hadoop needed, or will a traditional data warehouse suffice?
3. Implement robust DATA GOVERNANCE
In a production environment, there will be an increase in data volume and greater reliance on the model output from more stakeholders. To prevent a ‘garbage-in, garbage-out’ situation from happening, data quality and data integrity will be paramount.
4. Implement process AUTOMATION
During the POC, data tend to be extracted from source systems offline. They are then transformed before any data analytics can be carried out. These manual processes can be repetitive and time-consuming. When scaling up data analytics efforts, such Extract-Transform-Load (ETL) processes should be automated, with minimal human intervention.
5. Perform ongoing model MANAGEMENT and MAINTENANCE
For the algorithm to remain effective and relevant, it needs to be recalibrated and retrained periodically with new data. Hence, a systematic and structured way of managing these models on an ongoing basis becomes critical.
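Steps 4 and 5 above can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions rather than a production design: the CSV layout, the in-memory list standing in for a warehouse, the 30-day retraining interval and the 0.80 accuracy floor are all hypothetical, and in practice the functions would be triggered by a scheduler or workflow tool instead of being called by hand.

```python
import csv
import io
from datetime import date, timedelta

RETRAIN_INTERVAL = timedelta(days=30)  # assumed schedule
ACCURACY_FLOOR = 0.80                  # assumed threshold


def extract(source):
    """Extract: read rows from a CSV source (any file-like object)."""
    return list(csv.DictReader(source))


def transform(rows):
    """Transform: drop rows without an amount and cast amounts to float."""
    cleaned = []
    for row in rows:
        if row["amount"].strip():
            cleaned.append({"id": row["id"], "amount": float(row["amount"])})
    return cleaned


def load(rows, store):
    """Load: append rows to a target store (a list stands in for a database)."""
    store.extend(rows)
    return len(rows)


def run_etl(source, store):
    """Run the three ETL stages end to end without manual hand-offs."""
    return load(transform(extract(source)), store)


def needs_retraining(last_trained, today, recent_accuracy):
    """Flag a model that is either stale or performing below the floor."""
    stale = today - last_trained >= RETRAIN_INTERVAL
    degraded = recent_accuracy < ACCURACY_FLOOR
    return stale or degraded


raw = io.StringIO("id,amount\n1,10.5\n2,\n3,7\n")
warehouse = []
loaded = run_etl(raw, warehouse)
retrain = needs_retraining(date(2024, 1, 1), date(2024, 1, 20), recent_accuracy=0.76)
print(loaded, retrain)
```

Here the model is flagged for retraining because its recent accuracy has dropped below the assumed floor, even though the scheduled interval has not yet elapsed. Combining a time-based and a performance-based trigger is one simple way to make model maintenance systematic.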