Despite what the name suggests, big data is not just about volume. It is more about the complexity of datasets that overwhelms traditional processing techniques.
The term was coined by Doug Laney in a research report by META Group (now part of Gartner) in 2001. Laney is currently vice president and distinguished analyst with Gartner’s chief data officer research team. Gartner updated its definition in 2012 to: “Big Data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimisation.”
Volume. EMC said 4.4 zettabytes (4.4 trillion gigabytes) of data existed in 2013. It comes from sensors, devices, video/audio, networks, log files, transactional applications, the web and social media – much of it generated in real time and on a very large scale. IBM estimated 2.5 million terabytes of data – the equivalent of 10 million Blu-ray discs – are created daily.
The figure is expected to double every year to reach 44 zettabytes by 2020. Most of this, however, is streaming music and video, which is not needed for analytics.
Velocity. This refers to how quickly data is created, moved or accessed. Big data applications require rapid access to data, while data for other uses, such as archives, need not be retrieved as quickly.
Variety. As different companies have pursued their own paths for storage and computing, data comes in different, incompatible formats. Data can be kept in non-aligned data structures, use different semantics and, increasingly, exist in unstructured formats.
IBM added another “V” for Veracity. Not all data can be trusted; some may be incomplete or poorly architected; some may be biased; some may be inaccurate, like fake news; some may be compromised, if not secured; some may be irrelevant or just “noise.” Hence, data scientists are needed to clean up data before it can be used.
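The clean-up work described above can be sketched in a few lines of code. The records, field names and rules below are purely illustrative, not any vendor's actual pipeline; they show the kind of veracity problems – inconsistent formatting, non-numeric "noise", missing fields and duplicates – a data scientist routinely filters out before analysis.

```python
# Hypothetical raw records with typical veracity problems:
# inconsistent casing, a non-numeric value, a missing field, a duplicate.
raw = [
    {"customer": "Alice ", "spend": "120"},
    {"customer": "alice", "spend": "120"},    # duplicate of the row above
    {"customer": "Bob", "spend": "eighty"},   # not a number
    {"customer": None, "spend": "45"},        # missing name
]

def clean(records):
    seen, out = set(), []
    for r in records:
        name, spend = r["customer"], r["spend"]
        if name is None:
            continue                      # drop incomplete rows
        name = name.strip().title()       # normalise formatting
        try:
            amount = float(spend)         # reject non-numeric "noise"
        except ValueError:
            continue
        key = (name, amount)
        if key in seen:                   # drop duplicates
            continue
        seen.add(key)
        out.append({"customer": name, "spend": amount})
    return out

print(clean(raw))  # → [{'customer': 'Alice', 'spend': 120.0}]
```

Real pipelines apply the same idea at scale with tools such as Spark or pandas, but the logic – validate, normalise, deduplicate – is the same.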
Other “Vs” that have appeared in discussions on BDA include:
Viability. Is the data relevant for what is needed? Can the system differentiate between data used for measuring and those used for making predictions?
Volatility. Does the data change often and will it remain accurate and relevant? How will this affect storage requirements and costs?
Vulnerability. Is the data secure?
Visualisation. Can the data be presented visually to the user for ease of understanding?
Value. Can the data give a meaningful return on the investment?
To ensure that data meets these criteria and remains useful, companies need to establish standard operating procedures or protocols for capturing, recording and sharing data.
BDA has been increasingly embraced by retailers, the financial services and insurance industries, healthcare organisations, manufacturers, energy companies and other mainstream enterprises.
One of the key benefits of BDA is its predictive capability. Using statistical analysis, deep data mining, machine learning and predictive modelling, BDA systems can forecast outcomes for a given scenario and anticipate how customers will respond.
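A minimal sketch of the predictive-modelling idea: a 1-nearest-neighbour classifier that predicts whether a customer will respond to an offer by finding the most similar customer in past data and copying that outcome. The history data and features (visits per month, average spend) are invented for illustration; production systems use far richer features and models.

```python
# Hypothetical training history: ((visits_per_month, avg_spend), responded?)
history = [
    ((2, 10.0), "no"),
    ((3, 15.0), "no"),
    ((8, 90.0), "yes"),
    ((9, 120.0), "yes"),
]

def predict(customer):
    """Predict the outcome of the nearest past customer (1-NN)."""
    def dist2(a, b):
        # squared Euclidean distance between two feature vectors
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    _, outcome = min(history, key=lambda h: dist2(h[0], customer))
    return outcome

print(predict((7, 85.0)))  # → yes (closest to the (8, 90.0) responder)
print(predict((2, 12.0)))  # → no
```

Nearest-neighbour is only one of many techniques; the point is that the model generalises from past behaviour to predict future responses.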
Bluewolf published a State of Salesforce report which said 75% of companies that invested in analytics saw revenue gains; 81% of Salesforce customers reported the use of predictive analytics as the most important initiative for their sales strategies.
In the healthcare industry, BDA has been used to support clinical decisions and recommend therapy to patients as well as risk intervention; detect revenue leakage; monitor ICU patients’ vital signs; identify a diabetic population and manage population health, among others.
In education and training, BDA helps companies identify manoeuvres, procedures or exercises that students have difficulty with, enabling facilitators to modify their training programs accordingly.
In the financial services sector, BDA has helped institutions detect fraud faster by triggering alerts when unusual behaviour appears in credit card transactions.
In an article published in Channel NewsAsia, Paul Cobban, chief data and transformation officer of Singapore’s DBS Bank, said the bank used machine learning to predict when a relationship manager is going to resign and potentially take the bank’s clients with her.
The bank could also predict when ATMs would fail and which branch would likely have the next operational issue. It could predict queues at branches and ATMs, and even the sales performance of job candidates. Machine learning also helped detect rogue traders and fraudsters in procurement and trade, and the bank applies it to video files to monitor IT personnel who have access to production systems.
Cobban said DBS Bank was an early adopter of IBM Watson and used it to analyse research material and make investment recommendations with good success. He added that when DBS launched Digibank, its mobile-only bank in India, a chatbot was used to answer customer questions in their language. “It was able to respond to 80% of customer queries on its own,” he said.
Cobban shared a few lessons DBS learned on its BDA journey.
- Always start with the question in mind. Be clear about the problem you are trying to solve. Simply wallowing around in the data does not work.
- Start with your own data, supplement with social and IoT on a need basis.
- Don’t work alone. DBS worked with IBM, Kasisto and A*STAR, the research arm of the Singapore government. DBS started with limited capability and talent, but each partner helped accelerate the learning curve and yield results.
- Design for data – design products with data in mind. Ask, “what data should I produce from this product or service that will enhance customer offerings or drive efficiencies?”