Coding Data Analytics into your Organisation’s DNA
Singapore, June 24, 2019
The power of data analytics hardly needs introduction. In recent years, it has established itself as a gamechanger, shaking up industries that have been around for decades.
Algorithms are getting increasingly sophisticated, with organisations embarking on an aggressive push towards applying data analytics to solve problems in new and creative ways. By transforming raw datasets into useful information and drawing insights from the data, healthcare researchers have accurately diagnosed diseases such as cancer, regulators have detected fraud, and music streaming platforms such as Spotify have created personalised playlists for users everywhere.
“Companies have been exploring the use of data analytics for their businesses, but many such initiatives are still largely limited to proofs of concept (POCs) at the moment,” said Mr Loo Soo Kiat, Director of Advanced Analytics and Big Data, NCS.
The greater challenge lies ahead. In a world where analytics can upend entire industries, the question now is how companies can integrate new capabilities into their operations and strategies, research firm McKinsey noted in a report. “Collectively, there is a sense that the ecosystem is moving beyond the initial hype, and that the next big wave will be focusing on ‘productionising’ their data analytics efforts,” Mr Loo concurred.
Performance is key when it comes to production-grade analytics
For companies looking to scale up their data analytics efforts, the process can be overwhelming and akin to entering into uncharted territory. First of all, the priorities will likely shift. At the POC stage, predictive accuracy is probably the most important criterion for success. Companies create a sandbox environment in which project data scientists can test and experiment with their analytics models and algorithms.
But when businesses put out an actual product with the embedded algorithms, they are coming face to face with mature digital users and their expectations for a high-quality user experience. At this juncture, performance—not pristine data sets, interesting patterns, or killer algorithms—is ultimately the point.
Mr Loo gave the analogy of Google’s search engine to explain why user experience becomes a critical factor in its popularity. “Search engines are essentially complex analytics machine learning models. When we use Google’s search engine to find information, we expect accuracy, which is driven by algorithms. But another factor we often take for granted as a user is the speed at which the results are delivered to us. You wouldn’t want to wait five minutes for a search result.”
At the end of the day, no matter how good an application is or how accurate the results generated are, one cannot omit the need for performance speed, he emphasised.
Data lake as a solution for data integration
Most POCs are currently completed in small and controlled scopes and limited to a few departments in an organisation. At this stage, data is often extracted manually offline from various sources and mashed together for analysis, from which insights are derived.
The intention is often to expand the project to the entire company should the POC be successful. When that happens, the volume of input data could exponentially increase, and the integration of data from various sources becomes a challenge. As such, the fundamental way in which the data is handled will need to change. Companies may need to look into building a data repository before they begin their analysis.
With increasing volume, variety and velocity of data, Mr Loo advised companies to look into setting up a data lake—a central repository capable of holding all different forms of data, structured, unstructured and semi-structured data, taken from multiple sources.
The data lake is designed to retain all attributes of the data, and to facilitate projects in which there isn’t complete clarity on what the scope of data or its use will be. In many industries, data lakes are increasingly being adopted as a way to tackle the issue of data integration, gain more visibility and put an end to data silos.
“However, the nature of the data lake means that if you are not careful, it can easily turn into a data swamp,” he cautioned. A data lake makes it easy for data scientists to mine the data and derive actionable insights. In contrast, a data swamp is filled with poor quality data, which when relied upon, could result in insights akin to being in a murky pool with poor visibility.
Robust data governance is needed to maintain quality data
The truth is that data in itself is not magic, and without a clear objective, it carries little value. There is a need to think through and identify with clarity the purpose of the data being collected and put to use, noted a white paper published by Weber Shandwick.
To optimise the potential of the data, a robust data governance strategy is needed to prevent a “rubbish in, rubbish out” situation from happening. Defined as a ‘strategic business programme’, data governance mitigates the business risk of poor data practices and quality, while also determining the financial benefit that data brings to organisations. Michele Goetz, Principal Analyst at Forrester, highlighted that data governance doesn’t just concern the IT department. “It just happens to orient toward data performance, the same way marketing and sales is about customers and product sales, and accounting orients towards billing and collections,” she explained.
Mr Loo added that data governance tends to not be a focus at the POC stage. But when a data analytics initiative moves into production-scale, more users are involved and the stakes become higher. Any mistake made due to incomplete or erroneous data will have greater implications than before. “Data governance is therefore critical in its function as the gatekeeper to ensure that the integrity of data is maintained, and the risk of mishandling of the data is mitigated.”
Going beyond POC requires more than just technical skillsets
As part of a larger industry trend, one of the biggest barriers organisations face is attracting and retaining the right talent.
Data science is a relatively niche field at the moment, and professionals working in it tend to come from IT, mathematics or statistics backgrounds. However, data scientists of the future will require skillsets across various disciplines, including humanities, social science and business. This skills gap is a major hurdle for organisations that want to move their POCs into production.
Some technical domains are going to converge. Mr Loo explained that at the POC stage, for expediency and cost efficiency, most data scientists tend to favour writing advanced analytics algorithms in open source languages such as R or Python. The results are rendered into simple data visualisation formats for ease of communication to business stakeholders.
“But if you want to deploy these analytics algorithms on a sustained basis, they need to be embedded in an application that business users can readily interact with,” he said. This can potentially take the form of a web-based or custom application. In designing such applications, there are various components to take note of. For example, user-centric design principles should be followed, and this would require design thinking and app development skillsets, in addition to data science knowledge.
On the other hand, McKinsey also highlighted that ‘business translators’ will come to play an important role in the equation. By combining data savviness with industry and functional expertise, they will be able to communicate the outcome of data-driven insights in a way that business users can understand.
Moving data analytics efforts out of their ‘nursery’ (that is, the POC) and into production can be daunting due to the drastic mindset changes that need to happen. Mr Loo’s advice for companies embarking on the journey is to always start with the purpose, not the data. “Establish clarity on your goals and values, and ask yourself how data can serve them, not lead them,” he said in conclusion.
Proof-of-concept to Production in five key steps:
Deploy model in an APPLICATION
Data science models need to be embedded in an application that business users can readily interact with. Performance is also an important factor—the code used to construct the algorithm has to be optimised to deliver results within an acceptable length of time.
Estimate total production DATA VOLUME
Most PoC projects are carried out using only a subset of the total production data. Proper sizing is important in determining the appropriate infrastructure architecture that needs to be put in place. Is a Big Data architecture such as Hadoop needed, or will a traditional data warehouse suffice?
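As a rough illustration of the sizing exercise, the sketch below scales a POC sample up to a projected production volume. The function name, the growth rate and the replication factor are all assumptions made for illustration, not figures from NCS; a replication factor of 3 is simply the common default for Hadoop's file system.

```python
def estimate_production_volume_gb(sample_gb, sample_fraction,
                                  annual_growth_rate, years,
                                  replication_factor=1):
    """Back-of-envelope storage estimate (hypothetical helper).

    sample_gb          -- size of the POC dataset in gigabytes
    sample_fraction    -- share of production data the POC covered (0-1]
    annual_growth_rate -- assumed yearly growth, e.g. 0.5 for 50%
    years              -- planning horizon in years
    replication_factor -- copies kept by the storage layer (assumption)
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in the range (0, 1]")
    if years < 0:
        raise ValueError("years must not be negative")

    # Scale the sample up to today's full production volume.
    current_gb = sample_gb / sample_fraction
    # Compound the assumed growth over the planning horizon.
    projected_gb = current_gb * (1 + annual_growth_rate) ** years
    # Account for redundant copies held by the storage layer.
    return projected_gb * replication_factor


if __name__ == "__main__":
    # A 50 GB POC covering a quarter of the data, 50% yearly growth,
    # two-year horizon, three replicas.
    print(estimate_production_volume_gb(50, 0.25, 0.5, 2, 3))
```

An estimate in the low hundreds of gigabytes may sit comfortably in a traditional data warehouse, while one that runs to many terabytes strengthens the case for a Big Data architecture.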
Implement robust DATA GOVERNANCE
In a production environment, there will be an increase in data volume and greater reliance on the model output from more stakeholders. To prevent a ‘garbage-in, garbage-out’ situation from happening, data quality and data integrity will be paramount.
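One way to picture a data quality gate is a check that runs before any record reaches the model. The sketch below is a minimal example under stated assumptions: the field names (customer_id, amount, date) and the rules are hypothetical, and a real governance programme would cover far more, including ownership, lineage and access control.

```python
# Hypothetical required fields for an incoming sales record.
REQUIRED_FIELDS = ("customer_id", "amount", "date")


def validate_record(record):
    """Return a list of problems found in one record (empty if clean)."""
    problems = []
    for name in REQUIRED_FIELDS:
        if record.get(name) in (None, ""):
            problems.append(f"missing {name}")

    amount = record.get("amount")
    if amount not in (None, ""):
        try:
            if float(amount) < 0:
                problems.append("negative amount")
        except (TypeError, ValueError):
            problems.append("amount is not numeric")
    return problems


def quality_gate(records):
    """Split records into clean ones and rejected ones with reasons."""
    clean, rejected = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(record)
    return clean, rejected
```

Keeping the rejected records together with their reasons, rather than silently dropping them, gives data owners something concrete to fix at the source.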
Implement process AUTOMATION
During PoC, data tend to be extracted from source systems offline. They are then transformed before any data analytics can be carried out. These manual processes can be repetitive and time-consuming. When scaling up data analytics efforts, such Extract-Transform-Load (ETL) processes should be automated with minimal human intervention.
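To make the idea concrete, here is a minimal sketch of an automated pipeline with the three ETL stages as separate functions, using only Python's standard library. The CSV layout, the sales table and the function names are assumptions for illustration; in production the pipeline would typically be triggered by a scheduler and would read from the actual source systems.

```python
import csv
import io
import sqlite3


def extract(csv_text):
    """Extract: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: normalise customer IDs and convert amounts to numbers."""
    return [
        (row["customer_id"].strip().upper(), round(float(row["amount"]), 2))
        for row in rows
    ]


def load(conn, rows):
    """Load: write the cleaned rows into a (hypothetical) sales table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


def run_pipeline(csv_text, conn):
    """Run all three stages and return the number of rows loaded."""
    rows = transform(extract(csv_text))
    load(conn, rows)
    return len(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    loaded = run_pipeline("customer_id,amount\n c1 ,10.5\nc2,4\n", connection)
    print(loaded, "rows loaded")
```

Because each stage is a plain function, the same code can be called by a scheduled job instead of being run by hand.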
Perform ongoing model MANAGEMENT and MAINTENANCE
For the algorithm to remain effective and relevant, it needs to be re-calibrated and retrained periodically with new data. Hence, a systematic and structured way of managing these models on an on-going basis will become critical.
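A structured approach to model maintenance can start with an explicit rule for when to retrain. The sketch below flags a model when it is older than a set age or when its recent accuracy has dropped too far below the accuracy recorded at deployment. The 90-day and five-percentage-point thresholds are illustrative assumptions, not recommendations from the article.

```python
from datetime import date, timedelta


def needs_retraining(last_trained, today, recent_accuracy, baseline_accuracy,
                     max_age_days=90, max_accuracy_drop=0.05):
    """Decide whether a model is due for retraining (hypothetical rule).

    last_trained      -- date the model was last trained
    today             -- date of the check
    recent_accuracy   -- accuracy measured on recent production data
    baseline_accuracy -- accuracy measured when the model was deployed
    """
    too_old = (today - last_trained) > timedelta(days=max_age_days)
    degraded = (baseline_accuracy - recent_accuracy) > max_accuracy_drop
    return too_old or degraded


if __name__ == "__main__":
    # Trained in January, checked in late June: flagged on age alone.
    print(needs_retraining(date(2019, 1, 1), date(2019, 6, 24), 0.90, 0.90))
```

Running such a check on a schedule and logging each decision gives the team an audit trail of when and why each model was refreshed.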