Let me clarify something right from the start: we do not currently do big data. While our average project collects and processes significant amounts of data, and while we do implement BI and analytics, it would be a stretch to pretend that we are doing big data or data science.
Fig. 1 Big Data is a lot of Data!
I am writing this article to explain what big data actually is, how it gets collected and processed, what kinds of resources and technologies you would need to implement it, and what kinds of projects, initiatives, and budgets big data is really for. I will also touch on aspects of data science, since data science often goes hand in hand with big data.
This article does not aim to be comprehensive; however, if you are really interested in big data, it will give you a starting point.
According to Wikipedia, "Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy."
Examples of applications that involve big data include projects under way in various government agencies (including those under the Obama Administration's Big Data Initiative), applications implemented by large consumer product or social media companies (trying to understand consumer behavior in order to increase sales), by car manufacturers (who collect a lot of technical data from their deployed fleets of vehicles in order to predict component failures and replace parts just in time), by national insurance companies, by telcos, and by professional sports teams and franchises. Companies like eBay, Google, Facebook, IBM, and SAP have the budgets and practical reasons to implement big data. GM does big data. And of course the government does big data.
The objective of the systems handling these data is generally to collect and process them, to correlate them and find patterns in large amounts of unstructured data, and to design predictive algorithms or software that, with a certain degree of accuracy, can predict the behavior of certain systems or processes.
For example, at CERN (the place where the web was born, and which is actually a nuclear physics lab hosting the world's most powerful particle accelerator), scientists run experiments that try to explain some of the secrets of our universe. Their big data playground looks like this: about 60,000 processors distributed across thousands of computers in more than 100 data centers around the globe, collecting and processing some 30 petabytes of data.
Fig. 2 Supercomputer
Processing big data and coming up with relevant, useful results also involves specialized technologies, with names like Hadoop (used at Yahoo), MapReduce, Hive, Pig, and MongoDB. Predictive analytics are written in packages like MATLAB, Mathematica, RapidMiner, or Pervasive. Machine learning mechanisms are written in tools like R, dlib, or ELKI. Your average MySQL database with some back-end PHP code will not be able to handle big data successfully. While you can implement some traditional scripting on the data collection and data communication side of things, you will need specialized tools and complicated formulas to dig into the data you have collected.
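To give a flavor of the MapReduce idea mentioned above, here is a toy sketch in plain Python. This is not Hadoop code; real frameworks distribute the map and reduce phases across many machines, but the shape of the computation is the same.

```python
from collections import defaultdict

# Toy MapReduce-style word count. Real frameworks (Hadoop, Spark)
# run map and reduce in parallel across a cluster; this sketch
# just mimics the two phases on a single machine.

def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data is a lot of data", "data science loves data"]
print(reduce_phase(map_phase(docs)))
# {'big': 1, 'data': 4, 'is': 1, 'a': 1, ... 'science': 1, 'loves': 1}
```

The point of splitting the work this way is that the map phase has no shared state, so it can be spread across thousands of machines, and only the much smaller reduce step has to bring results together.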
On the business side, last year we had an interesting discussion with the CTO of the Miami Dolphins at the Dolphins' stadium. They are currently running a big data initiative that collects and digs into data from fans in order to improve the fan experience at the stadium and increase ticket sales, and they were struggling with a very simple notion: "The big question is ... what question to ask the system?" And that is exactly it: what question(s) to ask, and how to design the system so that it actually delivers a benefit in increased sales. So before collecting large amounts of data and designing anything, think about your questions and your objectives. Or maybe the answer will just come out ... it will jump at you ... as you start digging into those quadrillions of bits and pieces ...
To do data science you have to know statistics. Concepts to start with are listed below (a small worked example follows Fig. 3):
- parameter estimation
- confidence intervals
- p-value
Fig. 3 Graphical representation of p-value. Authors: Repapetilto @ Wikipedia and Chen-Pan Liao @ Wikipedia
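To make these concepts a bit more concrete, here is a minimal sketch in Python, using the widely available scipy library and made-up sample data. It estimates a mean, builds a 95% confidence interval around it, and computes a p-value with a one-sample t-test.

```python
import numpy as np
from scipy import stats

# Made-up sample: daily page views for a small site.
sample = np.array([112, 98, 105, 120, 99, 130, 101, 94, 108, 117])

# Parameter estimation: the sample mean estimates the true mean.
mean_estimate = sample.mean()

# 95% confidence interval for the mean, using the t distribution.
ci_low, ci_high = stats.t.interval(
    0.95, df=len(sample) - 1,
    loc=mean_estimate, scale=stats.sem(sample))

# p-value: test the null hypothesis that the true mean is 100.
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

print(f"mean={mean_estimate:.1f}, "
      f"95% CI=({ci_low:.1f}, {ci_high:.1f}), p={p_value:.3f}")
```

If the p-value is small (conventionally below 0.05), the data would be unlikely under the null hypothesis, which is exactly the shaded-tail idea shown in Fig. 3.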
To do data science you have to be very familiar with cost functions:
- log-loss
- DCG
- NDCG
Here is a fairly intuitive (even if not that straightforward) explanation of log-loss, or logarithmic loss, according to data scientist Alice Zheng: "Log-loss measures the accuracy of a classifier. It is used when the model outputs a probability for each class, rather than just the most likely class. Log-loss is a “soft” measurement of accuracy that incorporates the idea of probabilistic confidence. It is intimately tied to information theory: log-loss is the cross entropy between the distribution of the true labels and the predictions. Intuitively speaking, entropy measures the unpredictability of something. Cross entropy incorporates the entropy of the true distribution plus the extra unpredictability when one assumes a different distribution than the true distribution. So log-loss is an information-theoretic measure to gauge the “extra noise” that comes from using a predictor as opposed to the true labels. By minimizing the cross entropy, one maximizes the accuracy of the classifier."
And another explanation by software engineer Artem Onuchin:
"Log-loss can be useful when your goal is not only to say whether an object belongs to class A or class B, but also to provide the probability (say, the object belongs to class A with probability 30%). A good example of a case where log-loss is useful is predicting CTR, or click probability, in online advertising: Google uses log-loss as its CTR prediction metric."
To do data science you also have to be competent in machine learning and understand concepts like:
- classification
- regression
- ranking
- overfitting
- convex optimization
- trees
"Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been over fit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data." [quote from Wikipedia]
To do data science you have to be conversant in some of the following technologies and tools:
- R
- Python
- Mathematica
- Weka
- Kaggle
Fig. 4 Data Types in R by Assoc. Prof. Roger Peng from the Johns Hopkins School of Public Health
To do data science you also have to have a good understanding of the complexity of algorithms and of things like the following (a small PCA sketch follows this list):
- eigenvectors
- singular values
- PCA
- LDA
- Gibbs sampling
- bottlenecks
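As one concrete example covering the first three items, here is PCA computed by hand with numpy on made-up correlated data: center the data, build the covariance matrix, take its eigenvectors, and project onto the directions with the largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up 2-D data with strong correlation between the two features.
x = rng.normal(0, 1, size=200)
data = np.column_stack([x, 0.5 * x + rng.normal(0, 0.1, size=200)])

# 1. Center the data.
centered = data - data.mean(axis=0)

# 2. Covariance matrix of the features.
cov = np.cov(centered, rowvar=False)

# 3. Eigenvalues/eigenvectors (eigh suits symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort by decreasing eigenvalue; the top eigenvector is the
#    direction of greatest variance (the first principal component).
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Project onto the first principal component: 2 columns become 1.
projected = centered @ eigenvectors[:, :1]

print("explained variance ratio:", eigenvalues[0] / eigenvalues.sum())
```

On data like this, the first component captures nearly all the variance, which is why PCA is such a common first step for shrinking huge data sets down to something tractable.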
The complexity of your algorithms is crucial when you are crunching tons of data and trying to find the ... un-find-able. No matter how many hardware resources you have at hand (and hardware resources are never infinite nor cheap at this level), the problems you have to solve will often blow up exponentially. And it is never good to have exponential algorithms running.
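A classic toy illustration of that last point: the naive recursive Fibonacci below takes an exponential number of calls, while the memoized version of the same function is linear. At big data scale, the difference between those two shapes is the difference between getting an answer and never getting one.

```python
from functools import lru_cache

def fib_naive(n):
    """Exponential time: each call spawns two more calls."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    """Linear time: each sub-result is computed once and cached."""
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

# fib_naive(35) already takes seconds; fib_memo(350) is instant.
print(fib_memo(350))
```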
And, somehow, on top of all this science, you also have to have a certain feel for the expected behavior of the user, for reasonable ranges, for top-level engagement, and so on. Yes, data science is a science and an art at the same time.
So, please, next time you think big data, think in terms of exabytes of data collected and processed, sophisticated machine learning mechanisms, thousands of servers and storage units, large corporations or the government. And think of some of the most brilliant minds on planet Earth writing predictive analysis code, and of millions of dollars in research and development budgets. [And, yes, maybe us one day too, but not now. :)] For everything else that's "new" and "cool", please check out our website http://wittywebnow.com.
Make it a great day!
Adrian Corbuleanu
Miami, FL
http://wittywebnow.com
Note: To research this blog post I used online resources from the following sites. I thank them for making this information available.
1. http://wikipedia.com
2. http://quora.com
3. http://linkedin.com