Monday, July 6, 2015

On big data and data science

These days there is a lot of talk in Miami's start-up and mid-size organization communities about big data. Usually seen as something "new" and "cool" that you can do with your data, it prompts people to ask generic questions such as "Do you guys do big data?" or even to venture statements like "Oh, I see, you guys do big data!".

I would like to clarify it right here: we currently do not do big data. While our typical projects collect and process significant amounts of data, and while we do implement BI and analytics, it would be a stretch to claim that we are currently doing big data or data science.


Fig. 1 Big Data is a lot of Data!

I am writing this article to educate the public on what big data actually is, how it gets collected and processed, what kind of resources and technologies you would need to implement it, and what kinds of projects, initiatives and budgets big data is meant for. I will also touch on aspects of data science, as data science often goes hand in hand with big data.

This article does not aim to be comprehensive; however, if you are really interested in big data, it will give you a starting point.

According to Wikipedia "Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy."

Big data applications collect and process exceptionally large amounts of data, i.e. hundreds of terabytes, petabytes or even exabytes of data of great variety (not just alphanumeric data or binary pictures). These data are generated at great velocity and come with no guaranteed quality. The data sets are generally not easily linkable or immediately related to each other.

Examples of applications that involve big data are the ones under way in various Government agencies (including those under the Obama Administration's Big Data Initiative), applications implemented by large consumer product or social media companies (which try to understand consumer behavior in order to increase sales), by car manufacturers (who collect a lot of technical data on their deployed fleets of vehicles in order to predict a component failure and replace it just in time), by national insurance companies, by telcos, or by professional sports teams and franchises. Companies like eBay, Google, Facebook, IBM and SAP have the budgets and practical reasons to implement big data. GM does big data. And of course the Government does big data.

An example of an unusual or unstructured piece of data would be a "like" triggered by a certain post on social media, or a shortened URL. How do you collect, store and analyze that in conjunction with other likes from the same (or from a different) person on the same (or on a different) type of post? Remember, posts can be of any kind: pictures, videos, texts, binaries, URLs, etc.
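To make that concrete, here is a minimal Python sketch (with purely hypothetical field names, not taken from any particular platform's API) of what a single "like" event might look like before it ever reaches a database:

import json
from datetime import datetime, timezone

# A hypothetical, simplified record for a single "like" event.
# Field names are illustrative only.
like_event = {
    "event_type": "like",
    "user_id": "u_48213",
    "post_id": "p_991273",
    "post_kind": "photo",   # could just as well be video, text, binary, url ...
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# In a big data pipeline this would typically be serialized and appended to a
# log or message queue, not written to a normalized SQL table.
print(json.dumps(like_event, indent=2))

Correlating millions of such loosely structured events with each other is exactly where the "big" part of big data starts.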

The objective of the systems handling these data is generally to collect and process them, to correlate them and find patterns in these large amounts of unstructured data, and to design predictive algorithms or software that, with a certain degree of accuracy, can predict the behavior of certain systems or processes.


For example, at CERN (the place where the web was born, and which is actually a nuclear physics lab hosting the world's most powerful particle accelerator) scientists run experiments that try to explain some of the secrets of our universe. Their big data playground looks like this: about 60,000 processors distributed across thousands of computers in more than 100 data centers around the globe. They collect and process some 30 petabytes of data.

Fig. 2 Supercomputer
Whether it's a supercomputer like Pleiades or a massively parallel / distributed / networked system, you will need very serious infrastructure to handle big data. Your average shared (or even private) cloud host will most likely not be able to handle big data projects. The reason is that such hosts are generally designed to run general-purpose business applications and lack the processing horsepower, memory and storage capacity for these kinds of projects.

Processing big data and coming up with relevant, useful results also involves specialized technologies. They have names like Hadoop (used at Yahoo), MapReduce, Hive, Pig and MongoDB. Predictive analytics are written in packages like MATLAB, Mathematica, RapidMiner or Pervasive. Machine learning mechanisms are written in things like R, dlib or ELKI. Your average MySQL database with some back-end PHP code will not be able to successfully handle big data. While you can implement some traditional scripting on the data collection and data communication side of things, you will need specialized tools and complicated formulas to dig into the data you collected.
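To give a flavor of the MapReduce pattern that Hadoop popularized, here is a toy, single-machine Python sketch of a word count. A real job would distribute the map and reduce phases across a cluster; this only illustrates the shape of the computation.

from collections import defaultdict

# Toy in-memory simulation of MapReduce: count word occurrences.
documents = [
    "big data is a lot of data",
    "data science extracts knowledge from data",
]

# Map phase: emit (word, 1) pairs.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group the emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each key.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # e.g. {'data': 4, 'big': 1, ...}

The point of the real frameworks is that the map and reduce steps can run on thousands of machines at once, over data that never fits on a single box.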

On the business side, last year we had an interesting discussion with the CTO of the Dolphins at the Dolphins' stadium. As they are currently running a big data initiative to collect and dig into data gathered from fans, in order to improve the fan experience at the stadium and to increase ticket sales, they were struggling with a very simple notion: "The big question is ... what question to ask the system?" And that is exactly it: what question(s) to ask, and how to design the system so that it actually delivers a benefit in increased sales. So before collecting large amounts of data and designing anything, think about your questions and your objectives. Or maybe the answer will come out ... it will jump at you ... as you start digging into those quadrillions of bits and pieces ...

Now, a few words about data science, as it does relate to big data and is many times used in conjunction with it. Here we go, back to a simple definition; I quote from Wikipedia: "Data science is the extraction of knowledge from large volumes of data that are structured or unstructured, which is a continuation of the field of data mining and predictive analytics [...]."

So what are some of the things your programmers have to be competent in to call themselves data scientists? Here we go, just a few key concepts.

To do data science you have to know statistics. Concepts to start with are as follows:

- parameter estimation
- confidence intervals
- p-values

I will not bother you with long definitions but, just an example, "confidence interval (CI) is a type of interval estimate of a population parameter. It is an observed interval (i.e., it is calculated from the observations), in principle different from sample to sample, that frequently includes the value of an unobservable parameter of interest if the experiment is repeated. How frequently the observed interval contains the parameter is determined by the confidence level or confidence coefficient." [quote from Wikipedia]
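As a quick illustration (a sketch with made-up numbers, using numpy and scipy), here is how a 95% confidence interval and a p-value might be computed for a small sample:

import numpy as np
from scipy import stats

# A 95% confidence interval for the mean of a small sample,
# using the t-distribution. The numbers are invented.
sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])

mean = sample.mean()
sem = stats.sem(sample)                      # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1,
                                   loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")

# And a p-value: a one-sample t-test against a hypothesized mean of 12.0.
t_stat, p_value = stats.ttest_1samp(sample, popmean=12.0)
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")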

Fig. 3 Graphical representation of p-value. Authors Repapetilto @ Wikipedia and Chen-Pan Liao @ Wikipedia

To do data science you have to be very familiar with cost functions:


- log-loss
- DCG
- NDCG

Here is a fairly intuitive (even if not that straightforward) explanation of log-loss, or logarithmic loss, from data scientist Alice Zheng: "Log-loss measures the accuracy of a classifier. It is used when the model outputs a probability for each class, rather than just the most likely class. Log-loss is a “soft” measurement of accuracy that incorporates the idea of probabilistic confidence. It is intimately tied to information theory: log-loss is the cross entropy between the distribution of the true labels and the predictions. Intuitively speaking, entropy measures the unpredictability of something. Cross entropy incorporates the entropy of the true distribution plus the extra unpredictability when one assumes a different distribution than the true distribution. So log-loss is an information-theoretic measure to gauge the “extra noise” that comes from using a predictor as opposed to the true labels. By minimizing the cross entropy, one maximizes the accuracy of the classifier."


And another explanation by Software Engineer Artem Onuchin:


"Log-loss can be useful when your goal is not only say if an object belongs to class A or class B, but provide its probability (say object belong to class A with probability 30%). Good example of case where log-loss can be useful is predicting CTR or click probability in on-line advertising: Google uses log loss as CTR prediction metric." 

Let's say you are collecting data on millions of posted pictures and you want to write an image classifier that can tell the difference between a banana and a ... boat, for example. We know that pictures are nothing but matrices of pixels with different colors and levels of illumination, but we also need to implement the formulas that classify these objects.
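A toy sketch of that idea, using scikit-learn and random stand-in pixel data (so the "banana" and "boat" labels here are purely hypothetical), might look like this:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each image is just a matrix of pixel intensities, flattened into a
# feature vector; a classifier then learns to separate the two classes.
# Real image classifiers use far richer features (or deep networks).
rng = np.random.default_rng(0)

n_images, height, width = 200, 16, 16
images = rng.random((n_images, height, width))    # fake 16x16 "pictures"
labels = rng.integers(0, 2, size=n_images)        # 0 = "banana", 1 = "boat"

X = images.reshape(n_images, -1)                  # flatten to pixel vectors
clf = LogisticRegression(max_iter=1000).fit(X, labels)

new_image = rng.random((height, width))
print("predicted class:", clf.predict(new_image.reshape(1, -1))[0])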

To do data science you also have to be competent in machine learning and be able to understand concepts like:

- classification
- regression
- ranking
- overfitting
- convex optimization
- trees

"Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been over fit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data." [quote from Wikipedia]


To do data science you have to be conversant in some of the following technologies and tools:


- R
- Python
- Mathematica
- Weka
- Kaggle

R is a specialized programming language and environment used in statistics and data mining. It is a fairly straightforward, open-source, command-line language that provides powerful statistical functions which do not exist in other programming languages, but it is a "different kind of language" than the ones your average programmer is used to.


Fig. 4 Data Types in R by Assoc. Prof. Roger Peng from the Johns Hopkins School of Public Health

To do data science you also have to have a good understanding of algorithms and their complexity, with things like:

- eigenvectors
- singular values
- PCA
- LDA
- Gibbs sampling
- bottlenecks

Algorithmic complexity is crucial when you are crunching tons of data and trying to find the ... un-find-able. No matter how many hardware resources you have at hand (and hardware resources are never infinite nor cheap at this level), the problems you have to solve will often blow up exponentially. And it's never good to have exponential algorithms running.
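To tie a few of the concepts above together, here is a bare-bones PCA sketch via the eigen-decomposition of a covariance matrix, run on random data for illustration only:

import numpy as np

# Bare-bones PCA: eigenvectors of the covariance matrix give the directions
# of greatest variance; keeping the top few reduces the dimensionality.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))                 # 500 samples, 5 features
X -= X.mean(axis=0)                           # center the data

cov = np.cov(X, rowvar=False)                 # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: for symmetric matrices

order = np.argsort(eigvals)[::-1]             # sort by explained variance
components = eigvecs[:, order[:2]]            # keep the top 2 components
X_reduced = X @ components                    # project to 2 dimensions

print("explained variance ratio:", eigvals[order[:2]] / eigvals.sum())
print("reduced shape:", X_reduced.shape)

On real data sets with millions of rows, how you compute such decompositions (and how their cost scales) matters as much as the math itself.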



And, somehow, with all this science, you also have to have a certain feel for the expected behavior of the user, for reasonable ranges, for top-level engagement, etc. Yes, data science is a science and an art at the same time.

So, please, next time you think big data, think in terms of exabytes of data collected and processed, sophisticated machine learning mechanisms, thousands of servers and storage units, large corporations or the Government. And think of some of the most brilliant minds on Planet Earth writing predictive analytics code, and of millions of dollars in research and development budgets. [And, yes, maybe us one day too, but not now :)]

For everything else that's "new" and "cool" please check out our website http://wittywebnow.com.

Make it a great day!

Adrian Corbuleanu
Miami, FL
http://wittywebnow.com

Note: To document this blog post I used online resources from the following sites. I thank them for making this information available.

1. http://wikipedia.com
2. http://quora.com
3. http://linkedin.com