Becoming a Data Driven Organization
Having a well-focused strategy and pursuing it through milestones, delivered by projects that return Measurable Organizational Value (MOV), is a fundamental principle of any organization.
Top management may have more than one way of achieving the MOV, and these options are usually documented in the business case.
The different ways of achieving the MOV can be treated as alternatives, and these should be compared using financial models such as payback period analysis, Net Present Value (NPV) calculation, weighted scoring models and risk analysis models to understand which alternative returns the most capital, which one emphasizes higher margins or faster growth, and which capabilities are needed to ensure strong performance. Today, organizational strategic planning is increasingly supported by the emergence of Data Science.
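The financial comparison of alternatives mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a full business-case model: the two alternatives, their cash flows and the 10% discount rate are all assumed values chosen for the example.

```python
from typing import Optional, Sequence


def npv(rate: float, cash_flows: Sequence[float]) -> float:
    """Net Present Value; cash_flows[0] is the (negative) outlay at year 0."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))


def payback_period(cash_flows: Sequence[float]) -> Optional[int]:
    """First year in which the cumulative cash flow is no longer negative."""
    cumulative = 0.0
    for year, cf in enumerate(cash_flows):
        cumulative += cf
        if cumulative >= 0:
            return year
    return None  # the investment is never recovered


# Hypothetical alternatives: initial cost followed by yearly net benefits.
alternatives = {
    "A": [-1000, 500, 500, 500],
    "B": [-1000, 100, 400, 1200],
}
DISCOUNT_RATE = 0.10  # assumed cost of capital

for name, flows in alternatives.items():
    print(name, round(npv(DISCOUNT_RATE, flows), 2), payback_period(flows))
```

With these assumed figures, alternative A recovers its cost sooner while alternative B has the higher NPV, which is exactly the kind of trade-off the business case has to weigh.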
Data Science can be described as a set of interdisciplinary techniques that put data (often referred to as ‘Big Data’) to work to extract insights, predictions, or knowledge that help drive the business. The whole process of processing Big Data using techniques, tools and practices is known as Business Intelligence (BI). Today BI is applied in a wide variety of areas, such as online shopping (Amazon), social networks (Facebook) and professional networks (LinkedIn).
To become a data-driven organization, we need to understand the nature of data and the techniques that are applied to it to deduce or extract vital information.
Data can exist at different levels of structure throughout a data ecosystem (Big Data source), and to decide on a level of structure for a data set, we need to understand the costs and benefits of adding structure. The cost is twofold.
The people costs: Because the data is unstructured, a set of advanced data engineering techniques is required to map the extracted data patterns to business requirements, which demands expensive analysis effort.
The time costs: Unlike traditional data processing, where we collect requirements and define a schema accordingly, unstructured data requires more time for preparing test data, evaluating it, and agreeing on whether it should be used in the decision-making process.
Working with a fully structured data system ties an organization to a predefined business path and fails to address fluctuations in the supply and demand of business users. It can leave gaps between what the business supplies and what users need. On the other hand, relying only on unstructured data may hurt the overall performance of the system, because it adds overhead and does not provide the standardized metrics or business constraints that structured data does. A better solution is a hybrid model that uses both kinds of data (structured and unstructured), which can improve data agility and performance at the same time. For instance, all data managed under a traditional Relational Database Management System (RDBMS) is considered structured, while unstructured data falls into two major categories: human-generated and machine-generated.
Sources such as text files, emails, social media posts, websites and mobile data are considered human-generated, while satellite imagery, scientific data, digital surveillance and sensor data are considered machine-generated.
Before applying any Data Science solution to an organization, it is worth thinking about the problem and whether Data Science is really needed to resolve it. Problems that can be solved with a handful of rules do not need an expensive Data Science solution. When implementing Data Science, it is recommended to start with a supervised solution rather than an unsupervised predictive model. Data Science may not suit cases that require highly precise results with no error permitted. For example, when a Machine Learning application (ML: statistical techniques that give computer systems the ability to learn from data) reads an amount from a bill or an invoice incorrectly (for instance, missing one digit of the currency value), it may cause inconsistencies in finance records.
The following scenario of a recruitment agency demonstrates the applicability of Data Science in day-to-day business.
The main goal of a recruitment agency is to find appropriate positions for those who submit CVs. In the general flow, organizations communicate their vacancies to the agency, which finds appropriate candidates by cross-checking position requirements against submitted CVs and shortlisting. With the involvement of Data Science, the same scenario can be described differently.
When there is a pool of CVs to be shortlisted against the requirements of a position, this can be achieved through ML. What needs to be done is to read through all the CVs, extract the relevant information (such as technical qualifications and experience) and cross-match it with the requirements for the position. As this is categorization based on already known features (position requirements), it can be identified as supervised learning. Reading through the CVs is text processing, which works on unstructured data. Structured data, such as the demographic information of a candidate (gender, age, etc.) supplied when registering with the agency, can also be used in processing.
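The cross-matching step described above can be sketched as follows. Note that this is a plain keyword-overlap score, not a trained supervised model: in a real system a classifier trained on previously labeled CVs would replace the scoring step. The CV texts, candidate names, skill list and the 0.5 threshold are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def extract_skills(cv_text: str, known_skills: Set[str]) -> Set[str]:
    """Return the known skills that appear as whole tokens in the CV text."""
    tokens = set(re.findall(r"[a-z0-9+#]+", cv_text.lower()))
    return known_skills & tokens


def shortlist(
    cvs: Dict[str, str], required: Set[str], min_match: float
) -> List[Tuple[str, float]]:
    """Rank candidates by the share of required skills found in their CV."""
    if not required:
        raise ValueError("at least one required skill is needed")
    ranked = []
    for candidate, text in cvs.items():
        score = len(extract_skills(text, required)) / len(required)
        if score >= min_match:
            ranked.append((candidate, score))
    # Highest share of matched requirements first.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical data, for illustration only.
cvs = {
    "cand_1": "Five years of Java and SQL development, some Python scripting.",
    "cand_2": "Graphic designer with Photoshop experience.",
    "cand_3": "Python developer working with SQL and Spark.",
}
required_skills = {"python", "sql", "spark"}

print(shortlist(cvs, required_skills, min_match=0.5))
```

Here the free-text CV is the unstructured input; structured registration fields could be added as further filters on the ranked list.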
The agency can also get an idea of the most in-demand areas and technologies by analyzing the information in CVs about the technologies candidates practice most. This can be useful information for organizations looking to fill vacancies. As no categories are predefined and the grouping is done at processing time (clustering), this can be identified as unsupervised learning.
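As a rough illustration of this kind of analysis, the sketch below counts how often each technology appears across a set of CVs. A frequency count is a much simpler stand-in for clustering (a clustering algorithm would additionally group similar CVs without predefined labels), and the skill sets shown are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def technology_demand(
    cv_skill_sets: Iterable[Set[str]], top_n: int
) -> List[Tuple[str, int]]:
    """Count in how many CVs each technology appears and return the top_n."""
    counts: Counter = Counter()
    for skills in cv_skill_sets:
        counts.update(skills)  # each CV contributes at most once per skill
    return counts.most_common(top_n)


# Hypothetical skill sets already extracted from CVs.
skill_sets = [
    {"python", "sql"},
    {"python", "spark"},
    {"java", "sql"},
    {"python"},
]

print(technology_demand(skill_sets, top_n=2))
```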
The organization should select technologies and tools considering their capability to identify future business opportunities (scalability of the business), performance, data security and financial feasibility. When a technology is not scalable, it limits the ability to recognize new paths for the core business, which can cause the business to shrink and fail to survive disruptive technology changes. An organization should avoid sticking to a technology that is not powerful enough to address changing business requirements, so technology selection should consider future requirements as well as immediate ones. When replacing a technology, consider whether the replacement brings major improvements or technological advancement over what was there before.
Keep in mind that any new technology is an expense to the organization, and consider whether the returns (the capabilities provided by the technology) are sufficient. Data security is also an important concern given the sensitive nature of data, whether in healthcare, finance or any other field where sensitive data needs to be processed and privacy needs to be maintained.
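One way to make such a selection explicit is a weighted scoring model over the criteria discussed here. The sketch below is only an illustration: the weights, the 1 to 5 rating scale and the two placeholder options are assumptions, not ratings of real products.

```python
from typing import Dict

# Assumed weights (summing to 1) for the selection criteria.
WEIGHTS: Dict[str, float] = {
    "scalability": 0.3,
    "performance": 0.3,
    "data_security": 0.2,
    "financial_feasibility": 0.2,
}


def weighted_score(
    ratings: Dict[str, float], weights: Dict[str, float] = WEIGHTS
) -> float:
    """Sum of weight * rating over all criteria; ratings use a 1-5 scale."""
    return sum(weights[criterion] * ratings[criterion] for criterion in weights)


# Hypothetical ratings for two placeholder options, not real products.
options = {
    "option_x": {"scalability": 4, "performance": 3,
                 "data_security": 5, "financial_feasibility": 2},
    "option_y": {"scalability": 3, "performance": 4,
                 "data_security": 3, "financial_feasibility": 5},
}

best = max(options, key=lambda name: weighted_score(options[name]))
print(best, round(weighted_score(options[best]), 2))
```

Changing the weights changes the winner, so the weights themselves should be agreed with the stakeholders before any option is rated.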
Following is a list of technologies which are frequently used in the field of Data Science.
A pool of products falls under these technologies and is available from several vendors such as Alteryx, IBM, KNIME, Microsoft, Oracle, RapidMiner, SAP, and SAS. The products and solutions they offer have their own benefits and limitations, and a proper investigation is needed before selecting a vendor product for the organization.
Open source technology integration support: Alteryx Designer, Microsoft R, SAS Enterprise Miner, Oracle’s ORAAH and KNIME’s Analytics Platform support integration with R and other open source extensions such as Python and Apache Spark, which helps to extend their functionality.
Addressing data diversity: Structured and unstructured data should be managed using a NoSQL database, cloud-based data solutions or big data platforms like Hadoop. Vendor products should have features to handle data diversity in data import, export, and connectivity. Products such as Microsoft R, Oracle Advanced Analytics and RapidMiner have built-in support for this.
Scalability: Different solutions should be provided depending on the size of the data set. Organizations with a limited data set do not need advanced solutions right from the beginning. Vendors like RapidMiner, KNIME, Microsoft R Open and Alteryx Designer provide solutions that can run on desktop systems and do not require additional server components. As requirements grow and data management becomes more advanced, different technologies should be available to support it.
Performance: In advanced data management, performance is a key factor, and most vendor products incorporate the data parallelism features of Hadoop. For example, IBM SPSS supports multithreaded analytics, SAP’s Expert Analytics supports in-memory execution for data mining, Microsoft R Server supports parallelization through the ScaleR module, and SAS Enterprise Miner provides performance-enhanced scoring algorithms that can execute in a Hadoop environment. As for integration with Apache Spark, products such as SPSS, KNIME, Oracle, RapidMiner, and SAP provide high-performance data processing as data volumes scale.
Data collaboration: Vendor software supports sharing data, analyses, models, and applications among stakeholder groups. Data security is a major concern in this context, as there may be sensitive information that only privileged groups should be authorized to share, view or process.
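The idea of restricting share, view and process rights to privileged groups can be sketched as a simple permission lookup. The group names and the permission table below are hypothetical and do not reflect the access model of any particular vendor product.

```python
from typing import Dict, Set

# Hypothetical mapping of stakeholder groups to permitted actions.
PERMISSIONS: Dict[str, Set[str]] = {
    "data_owner": {"view", "process", "share"},
    "data_scientist": {"view", "process"},
    "business_analyst": {"view"},
}


def is_allowed(group: str, action: str) -> bool:
    """Return True only if the group is explicitly granted the action."""
    return action in PERMISSIONS.get(group, set())


print(is_allowed("data_scientist", "share"))  # not granted in the table above
print(is_allowed("data_owner", "share"))
```

Unknown groups are denied by default, which is the safer choice when sensitive data is involved.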
Incorporating a Data Science technology is not a straightforward task, as various aspects need to be considered. The problem to be addressed should be worth resolving through a Data Science technology. Aspects such as scalability, security and performance should be considered in the vendor product selection process. Always derive results using both structured and unstructured data, and keep the solution simple enough to apply. That means that instead of going straight for the advanced ML algorithms used in unsupervised learning, try supervised learning first and make sure the framework is stable enough to continue.
Tech Lead