Saturday, August 17, 2013

Six steps in data science

"Data science" and "Big data" has become a very hot term in last 2 years.  Since the data storage price dropped significantly, enterprises and web companies has collected huge amount of their customer's behavioral data even before figuring out how to use them.  Enterprises also start realizing that these data, if use appropriately, can turn into useful insight that can guide its business along a more successful path.

From my observation, data science efforts can be roughly categorized into the following areas.
  • Data acquisition
  • Data visualization
  • OLAP, Report generation
  • Response automation
  • Predictive analytics
  • Decision optimization

Data Acquisition

Data acquisition is about collecting data into the system.  This is an important step before any meaningful processing or analysis can begin.  Most companies start by collecting business transaction records from their OLTP system.  Typically there is an ETL (Extract/Transform/Load) process involved to ingest the raw data, clean it and transform it appropriately.  Finally, the data is loaded into a data warehouse where the subsequent data analytics exercises are performed.  In today's big data world, the destination has shifted from the traditional data warehouse to the Hadoop Distributed File System (HDFS).
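
Below is a minimal sketch of such an ETL flow in Python, assuming the raw transactions arrive as a CSV export from the OLTP system and the warehouse is a SQL database (a local SQLite file here); the file name and column names are purely illustrative.

```python
# Minimal ETL sketch: extract raw transactions from a CSV export,
# clean/transform them, and load them into a (toy) warehouse table.
# File names, column names, and the SQLite destination are illustrative.
import pandas as pd
import sqlite3

def extract(path):
    # Extract: read the raw OLTP export
    return pd.read_csv(path, parse_dates=["order_date"])

def transform(df):
    # Clean: drop malformed rows, normalize types, derive columns
    df = df.dropna(subset=["customer_id", "amount"])
    df["amount"] = df["amount"].astype(float)
    df["order_month"] = df["order_date"].dt.to_period("M").astype(str)
    return df

def load(df, conn):
    # Load: append into the warehouse fact table
    df.to_sql("fact_orders", conn, if_exists="append", index=False)

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    load(transform(extract("raw_orders.csv")), conn)
```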

Data Visualization

Data visualization is usually the first step in analyzing the data.  It is typically done by plotting the data in different ways to get a quick sense of its shape, in order to guide the data scientist in determining what subsequent analysis should be conducted.  This is also where a more experienced data scientist can be distinguished from a less experienced one, based on how fast they can spot common patterns or anomalies in the data.  Most plotting packages work only on data that fits into a single machine.  Therefore, in the big data world, data sampling is typically conducted first to reduce the data size, and the sample is then imported into a single machine where R, SAS, or SPSS can be used to visualize it.
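
Here is a rough sketch of this sample-then-plot workflow, assuming Python with pandas/matplotlib instead of R/SAS/SPSS; the file and column names are made up for illustration.

```python
# Quick-look visualization sketch: sample a large dataset down to a size
# that fits on one machine, then plot it.  The file name and columns are
# illustrative; in practice the sample might come from Hive/HDFS instead.
import pandas as pd
import matplotlib.pyplot as plt

# Sample ~1% of rows while reading in chunks, so the full file never sits in memory.
chunks = pd.read_csv("transactions.csv", chunksize=100_000)
sample = pd.concat(chunk.sample(frac=0.01, random_state=42) for chunk in chunks)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sample["amount"].hist(bins=50, ax=axes[0])                          # shape of the distribution
axes[0].set_title("Order amount")
sample.plot.scatter(x="items", y="amount", ax=axes[1], alpha=0.3)   # spot outliers
axes[1].set_title("Items vs amount")
plt.tight_layout()
plt.show()
```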

OLAP / Reporting

OLAP is about aggregating transaction data (e.g. revenue) along different dimensions (e.g. month, location, product), along which the enterprise defines the KPIs/business metrics that measure its performance.  This can be done either in an ad hoc manner (OLAP) or in a predefined manner (report templates).  Report writers (e.g. Tableau, MicroStrategy) are used to produce the reports.  The data is typically stored in a regular RDBMS or in a multidimensional cube which is optimized for OLAP processing (i.e. slice, dice, rollup, drill-down).  In the big data world, Hive provides a SQL-like access mechanism and is commonly used to access data stored in HDFS.  Most popular report writers have integrated with Hive (or have declared plans to integrate with it) to access big data stored in HDFS.
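
As a rough illustration of the aggregation itself (independent of any particular report writer), here is a pandas sketch that rolls revenue up along the month/location/product dimensions; on HDFS the equivalent would typically be written as a Hive query.  The column names are assumptions.

```python
# OLAP-style aggregation sketch using pandas: roll revenue up along the
# Month / Location / Product dimensions.  Column names are illustrative.
import pandas as pd

orders = pd.read_csv("fact_orders.csv", parse_dates=["order_date"])
orders["month"] = orders["order_date"].dt.to_period("M").astype(str)

# Revenue by month x location, with product as a drill-down level
cube = pd.pivot_table(
    orders,
    values="amount",
    index=["month", "location"],
    columns="product",
    aggfunc="sum",
    margins=True,          # adds an "All" total row/column, similar to a rollup
)
print(cube.head())
```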

Response Automation

Response automation is about leveraging domain-specific knowledge to encode a set of "rules" in the form of event/condition/action.  The system monitors all observed events, matches them against the conditions (which can be a boolean expression over event attributes, or a sequence of event occurrences), and triggers the appropriate actions.  In the big data world, automating such responses is typically done with stream processing mechanisms (such as Flume and Storm).  Notice that the "rules" need to be well-defined and unambiguous.
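
A toy sketch of the event/condition/action structure follows, with invented events and rules; in production the matching would run inside a stream processing system such as Storm rather than a single Python loop.

```python
# Minimal event/condition/action sketch: each rule pairs a boolean condition
# over event attributes with an action to trigger.  The events and rules
# here are invented for illustration.
from dataclasses import dataclass
from typing import Any, Callable, Dict

Event = Dict[str, Any]

@dataclass
class Rule:
    name: str
    condition: Callable[[Event], bool]   # boolean expression over event attributes
    action: Callable[[Event], None]      # what to do when the condition matches

rules = [
    Rule(
        name="large-order-alert",
        condition=lambda e: e["type"] == "order" and e["amount"] > 10_000,
        action=lambda e: print(f"ALERT: large order {e['order_id']}"),
    ),
]

def process(event: Event) -> None:
    # Match the event against every rule and fire the matching actions.
    for rule in rules:
        if rule.condition(event):
            rule.action(event)

process({"type": "order", "order_id": 42, "amount": 12_500})
```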

Predictive Analytics

Prediction is about estimating unknown data based on observed data through a statistical/probabilistic approach.  Depending on the data type of the output, "prediction" can be subdivided into "classification" (when the output is a category) or "regression" (when the output is a number).

Prediction is typically done by first "training" a predictive model using historical data (where all input and output values are known).  This training is an iterative process in which the performance of the model is measured at the end of each iteration.  Additional input data or different model parameters are then used in the next iteration.  When the predictive performance is good enough and no significant improvement is made between subsequent iterations, the process stops and the best model created during the process is used.

Once we have the predictive model, we can use it to predict information we haven't observed, whether that is information that is hidden or information that hasn't happened yet (i.e. predicting the future).
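
Here is a compact sketch of that train/evaluate/predict cycle using scikit-learn on synthetic data; the model choice and the parameter values iterated over are assumptions for illustration, not a prescription.

```python
# Predictive-modeling sketch: train candidate models on historical data,
# keep the one with the best held-out performance, then score unseen records.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Historical data where both inputs (X) and the output label (y) are known.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

best_model, best_score = None, 0.0
for n_trees in (10, 50, 200):                     # iterate over model parameters
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    if score > best_score:                        # keep the best model so far
        best_model, best_score = model, score

print(f"best held-out accuracy: {best_score:.3f}")

# Use the trained model to predict outputs we have not observed yet.
X_new, _ = make_classification(n_samples=5, n_features=20, random_state=1)
predictions = best_model.predict(X_new)
```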

For a more detailed description of what is involved in performing predictive analytics, please refer to my following blog post.

Decision Optimization

Decision optimization is about making the best decision after carefully evaluating the possible options against some measure of business objectives.  The business objective is defined by an "objective function", expressed as a mathematical formula of some "decision variables".  Through various optimization techniques, the system figures out what the decision variables should be in order to maximize (or minimize) the value of the objective function.  Optimization over discrete decision variables is typically done using exhaustive search, greedy/heuristic search, or integer programming techniques, while optimization over continuous decision variables is done using linear/quadratic programming.
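
For a concrete (if toy) example of optimizing continuous decision variables with linear programming, here is a sketch using scipy's linprog; the products, profit coefficients, and constraints are invented for illustration.

```python
# Decision-optimization sketch with linear programming: choose how many units
# of two products to produce to maximize profit subject to capacity limits.
# All numbers are invented for illustration.
from scipy.optimize import linprog

# Decision variables: x = [units_of_A, units_of_B]
# Objective: maximize 20*x_A + 30*x_B  ->  linprog minimizes, so negate it.
c = [-20, -30]

# Constraints: 1*x_A + 2*x_B <= 100 (machine hours), x_A + x_B <= 80 (labor hours)
A_ub = [[1, 2], [1, 1]]
b_ub = [100, 80]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal production plan:", result.x)
print("maximum profit:", -result.fun)
```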

From my observation, "decision optimization" is the end goal of most data science effort.  But "decision optimization" relies on previous effort.  To obtain the overall picture (not just observed data, but also hidden or future data) at the optimization phase, we need to make use of the output of the prediction.  And in order to train a good predictive model, we need to have the data acquisition to provide clean data.  All these efforts are inter-linked with each other in some sense.