Sunday, December 5, 2010

BI at large scale

As more and more data being collected everywhere from pretty much everything a user do, such as transactions activities, social interactions, information search ... enterprises has been actively looking into ways to turn these vast amount of raw data into useful information.

BI process flow

It include the following stages of processing
  1. ETL: Extract operational data (inside enterprise or external sources) into data warehouse (typically organized in Star/Snowflake schema with Fact and Dimension tables).
  2. Data exploration: Get insight into data using simple visualization tools (e.g. histogram, summary statistics) or sophisticated OLAP tools (slice, dice, rollup, drilldown)
  3. Report generation: Produce executive reports
  4. Data mining: Extract patterns of the underlying data to form models (e.g. bayesian networks, linear regression, neural networks, decision trees, support vector machines, nearest neighbors, association rules, principal component analysis)
  5. Feedback: The model will be used to assist business decision making (predicting the future)
The gap of processing BIG data
Many data mining and machine learning algorithms are available in both commercial packages (e.g. SAS, SPSS) as well as open source libraries (e.g. Weka, R). Nevertheless, most of these ML algorithms implementation are based on fitting al data in memory and not designed to process big data (e.g. Tera byte data volume).

On the other hand, massively parallel processing platform such as Hadoop, Map/Reduce, over the last few years, has been proven in processing Terabyte or even Petabyte range of data. Although many sequential algorithm can be restructured to run in map reduce, including a big portion of machine learning algorithm, there isn't a corresponding parallel implementation of ML available in massively parallel form.

Approach 1: Apache Mahout
One approach is to "re-implement" the ML algorithm in Map/Reduce and this is the path of Apache Mahout project. Mahout seems to have implemented an impressive list of algorithms although I haven't used them for my projects yet.

Approach 2: Ensemble of parallel independent learners
This is an alternative path that doesn't require re-implementation of existing algorithms. It works in the following way.
  1. Draw samples from the Big data into many sample data sets, which can fit into the memory of a single, individual learner.
  2. Assign each sample data set to an individual learner, who use existing algorithms to learn the model. After learning, each individual learner keep their own learned model
  3. When a decision / prediction request is received, each individual learner will come up with its own prediction and then combine their results in some ways. (e.g. for classification task, the learners will vote for the predicted class and the majority wins. for regression, the average of the estimate values will be used to predict the output value)

I also found this approach can smoothly fade out outdated model. As user's behavior may change over time, same happens to the validity of a learned model. With this ensemble approach, I can have multiple learners each learn their model periodically. Everytime when a prediction is needed, I will pick the latest k models and combine the final prediction based on a time-decayed weighted voting model. Outdated model will automatically slide out the k-size window automatically.

One gotchas of sampling approach is the handling of rare events (since you may lost those rare events in sampling). In this case, stratified sampling (instead of simple random sampling) should be used.

4 comments:

Unknown said...

Ricky, hi!

What do you think: are the infrastructure requirements, and solutions thereof, are fundamentally different between:
a) the business intelligence (BI) tasks [computation-/storage- intensive, math-rich] and
b) the business logic (BL) tasks [user IO responsive, data-rich]?

You seem to be familiar with both (as I do also), and I wondered your opinion...

Michael D. Healy said...

Ricky:

You echo the sentiment I have heard from statisticians, that big data can become an endeavor unto itself which wears out the team before real insights are gained.

Stratified sampling is definitely the way to go, although it needs to be implemented correctly.

If more decision makers would appropriately grasp statistical significance and the magnitude of the effect, we could spend more time analyzing data and less time pushing it around the block.

Michael D. Healy
http://michaeldhealy.com/
http://twitter.com/michaeldhealy

Ricky Ho said...

Regarding infrastructure, I don't think one size will fit all. I am leaning towards a combination of Map/Reduce (to batch-process data at large scale) and NoSQL (to allow data to be retrieved at real-time). I also think CEP technique should be part of it as well.

Should we do "big data with simple processing", or "small data (sampled) with sophisticated logic ? I think this will be case by case basis. But I would like to have a wide spectrum of solution so I can pick the most optimal point.

I am a big fan of "approximation algorithm" where the user can tune the tradeoff between accuracy and workload capacity.

As Michael point out, how to correctly do the sample is the key to success.

Unknown said...

Ricky,

I got your point. Diversity is good. But is it good for a Developer as well as a Researcher? Is it inevitable?

For now it seems like the world had split in two: 1) Small Data vs. Big Data, 2) Declarative/Imperative Programming vs. Functional Programming, 3) Single Machine vs. Cluster, 4) Relational (Schema-wise) vs. NoSQL.

I mean, right now, it is black and white! And if you're not an expert in both, you can never pick the right solution without countless trials. So a lot of time is wasted on the tool, instead of concentrating on a problem. (Take MapReduce programming, for example. Or schema-less BigTable approach.)

Long ago I switched to Java from C/C++. I did so because I was sick and tired by dealing with the programming language mess... And so many people just keep on doing it!!! I would never go back.

That's what I'm eager for in, what I call, BI/BL gap: arrival of some core common "THING". (The gray matter, if you want, as opposed to the B/W today. ;-) )

What's you opinion: might it be a common programming language to fit both? Other ideas?