Rolling Window Regression: A Simple Approach for Time Series Next-Value Predictions

Given a time series, predicting the next value is a problem that has fascinated programmers for a long time. Obviously, a key reason for this attention is stock markets, which promise untold riches if you can crack them. However, except for a few (see A rare interview with the mathematician who cracked Wall Street), those riches have proved elusive.

Thanks to IoT (Internet of Things), time series analysis is poised to make a comeback into the limelight. IoT lets us place ubiquitous sensors everywhere, collect data, and act on that data. IoT devices collect data through time, and the resulting data are almost always time series data.

Following are a few use cases for time series prediction.

  1. Power load prediction
  2. Demand prediction for Retail Stores
  3. Client volume prediction for services (e.g. airline check-in counters, government offices)
  4. Revenue forecasts
  5. ICU care vital monitoring
  6. Yield and crop prediction

Let’s explore the techniques available for time series forecasts.

The first question is, "isn't this just regression?". It is close, but not the same as regression. In a time series, each value is affected by the values just preceding it. For example, if there is a lot of traffic at 4:55 at a junction, chances are that there will be some traffic at 4:56 as well. This is called autocorrelation. Plain regression would only consider X(t), while due to autocorrelation, X(t-1), X(t-2), … also affect the outcome. So we can think of time series forecasting as regression that factors in autocorrelation as well.

For this discussion, let's consider the "Individual household electric power consumption Data Set", which contains data collected from one household over four years at one-minute intervals. Let's only consider three fields, so the data set will look like the following.

The first question to ask is: how do we measure success? We do this via a loss function, which we try to minimize. There are several loss functions, and they have different pros and cons; a small code sketch of each appears after the list.

  1. MAE (Mean Absolute Error) — here all errors, big and small, are treated equally.
  2. Root Mean Square Error (RMSE) — this penalizes large errors due to the squared term. For example, with errors [0.5, 0.5] and [0.1, 0.9], MAE for both will be 0.5, while RMSE is 0.5 and about 0.64 respectively.
  3. MAPE (Mean Absolute Percentage Error) — since #1 and #2 depend on the value range of the target variable, they cannot be compared across data sets. In contrast, MAPE is a percentage, hence relative. It is like accuracy in a classification problem, where everyone knows 99% accuracy is pretty good.
  4. RMSEP (Root Mean Square Percentage Error) — this is a hybrid between #2 and #3.
  5. Almost Correct Predictions Error rate (AC_errorRate) — the percentage of predictions that are within p% of the true value.
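
To make these metrics concrete, here is a minimal Python/NumPy sketch (the function names and the default 10% threshold for AC_errorRate are my own choices; MAPE and RMSEP assume the true values are non-zero):

import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    # mean absolute error relative to the true values, as a percentage
    return 100 * np.mean(np.abs((y_true - y_pred) / y_true))

def rmsep(y_true, y_pred):
    # root mean square of the percentage errors
    return 100 * np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))

def ac_error_rate(y_true, y_pred, p=0.1):
    # percentage of predictions within p*100% of the true value, per the definition above
    within = np.abs(y_true - y_pred) <= p * np.abs(y_true)
    return 100 * np.mean(within)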

If we are trying to forecast the next value, we have several choices.

ARIMA Model

The gold standard for this kind of problem is the ARIMA model. The core idea behind ARIMA is to break the time series into different components, such as a trend component and a seasonality component, and carefully estimate a model for each component. See Using R for Time Series Analysis for a good overview.

However, ARIMA has an unfortunate problem: it needs an expert (a good statistics degree or a grad student) to calibrate the model parameters. If you want to do multivariate ARIMA, that is, to factor in multiple fields, then things get even harder.

However, R has a function called auto.arima, which estimates the model parameters for you. I tried that out.

library("forecast")
....
x_train <- train data set
X-test <- test data set
..
powerTs <- ts(x_train, frequency=525600, start=c(2006,503604))
arimaModel <- auto.arima(powerTs)
powerforecast <- forecast.Arima(arimaModel, h=length(x_test))
accuracy(powerforecast)

You can find a detailed discussion of how to do ARIMA from the links given above. I only used 200k data points from the data set, as our focus is mid-size data sets. It gave a MAPE of 19.5.

Temporal Features

The second approach is to come up with a list of features that capture the temporal aspects so that the autocorrelation information is not lost. For example, stock market technical analysis uses features built from moving averages. In the simple case, an analyst will track 7-day and 21-day moving averages and make decisions based on crossover points between those values.

Following are some feature ideas:

  1. Collection of moving averages/medians (e.g. 7, 14, 30, 90 day)
  2. Time since a certain event
  3. Time between two events
  4. Mathematical measures such as entropy, Z-scores, etc.
  5. Transformations of X(t) such as power(X(t), n), cos(X(t)/k), etc.

A common trick is to feed those features to techniques like Random Forest and Gradient Boosting, which can provide relative feature importances. We can use that information to keep good features and drop ineffective features.
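
As a rough sketch of this trick (the column name "power" and the window choices are placeholders for illustration, not part of the original data set), one could build moving-average/median features with pandas and let a Random Forest rank them:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# df: a DataFrame with a numeric column "power" indexed by timestamp
features = pd.DataFrame(index=df.index)
for w in (7, 14, 30, 90):
    features["ma_%d" % w] = df["power"].rolling(w).mean()
    features["med_%d" % w] = df["power"].rolling(w).median()
features["target"] = df["power"].shift(-1)   # the next value we want to predict
features = features.dropna()

model = RandomForestRegressor(n_estimators=100)
model.fit(features.drop(columns="target"), features["target"])

# relative feature importances, highest first; weak features can be dropped
importances = pd.Series(model.feature_importances_,
                        index=features.columns.drop("target"))
print(importances.sort_values(ascending=False))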

I will not dwell too long on this topic. However, with some hard work, this method has been shown to give very good results. For example, most competitions are won using this method (e.g. http://blog.kaggle.com/2016/02/03/rossmann-store-sales-winners-interview-2nd-place-nima-shahbazi/).

The downside, however, is that crafting features is a black art. It takes a lot of work and experience to craft the features.

Rolling Window-based Regression

Now we get to the interesting part. It seems there is another method that gives pretty good results without a lot of hand-holding.

The idea is that to predict X(t+1), the next value in the time series, we feed not only X(t) but also X(t-1), X(t-2), etc. to the model. A similar idea has been discussed in Rolling Analysis of Time Series, although it is used to solve a different problem.

Let's look at an example. Let's say that we need to predict X(t+1) given X(t). Then the source and target variables will look like the following.

The data set would look like the following after being transformed with a rolling window of three.

Then, we will use the transformed data set with a well-known regression algorithm such as linear regression or Random Forest regression. The expectation is that the regression algorithm will figure out the autocorrelation coefficients from X(t-2) to X(t).
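
Here is a minimal sketch of that transformation plus a linear-regression fit, assuming the series is already loaded into a NumPy array called series (this is my own illustration of the idea, not the exact code behind the numbers below):

import numpy as np
from sklearn.linear_model import LinearRegression

def to_rolling_window(series, window):
    # each row holds the `window` values preceding time t; the target is the value at t
    X, y = [], []
    for t in range(window, len(series)):
        X.append(series[t - window:t])
        y.append(series[t])
    return np.array(X), np.array(y)

X, y = to_rolling_window(series, window=14)
split = int(len(X) * 0.8)                 # simple chronological train/test split
model = LinearRegression().fit(X[:split], y[:split])
predictions = model.predict(X[split:])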

For example, with the above data set, applying linear regression on the transformed data set using a rolling window of 14 data points provided the following results. Here AC_errorRate considers a forecast to be correct if it is within 10% of the actual value.

LR AC_errorRate=44.0 RMSEP=29.4632 MAPE=13.3814 RMSE=0.261307

This is pretty interesting, as this beats auto ARIMA right away (MAPE 19.5 with auto ARIMA vs. 13.4 with rolling windows).

So far we have only tried linear regression. I then tried out several other methods, and the results are given below.

Linear regression still does pretty well; however, it is weak on keeping the error rate within 10%. Deep learning is better on that aspect, although it took some serious tuning. Please note that tests were done with 200k data points, as my main focus is on mid-size data sets.

I got the best results from a neural network with 2 hidden layers of 20 units each, zero dropout or regularisation, "relu" activation, and the Adam optimizer (lr=0.001), running for 500 epochs. The network is implemented with Keras. While tuning, I found articles [1] and [2] pretty useful.
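
For reference, the described network can be reconstructed roughly as follows in Keras (the loss function and batch size are my assumptions; X and y are the rolling-window inputs and targets with a window of 14):

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(20, activation="relu", input_dim=14))  # first hidden layer
model.add(Dense(20, activation="relu"))                # second hidden layer
model.add(Dense(1))                                    # next-value output
model.compile(optimizer=Adam(lr=0.001), loss="mse")    # MSE loss is my assumption
model.fit(X, y, epochs=500, batch_size=32, verbose=0)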

Then I tried out the same idea with a few more datasets.

  1. Milk production data set (small, < 200 data points)
  2. Bike sharing data set (about 18,000 data points)
  3. USD to Euro exchange rate (about 6,500 data points)
  4. Apple stock prices (about 13,000 data points)

Forecasts are done as univariate time series. That is, we only consider the time stamps and the value we are forecasting. Any missing value is imputed using padding (i.e. using the most recent value). For all tests, we used a window of size 14 as the rolling window.
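
The padding imputation is just a forward fill; a short pandas sketch, assuming the series sits in a column named "value":

import pandas as pd

# replace each missing reading with the most recent observed value
df["value"] = df["value"].ffill()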

The following table shows the results. Here, except for Auto.Arima, the other methods use a rolling-window-based data set.

There is no clear winner. However, the rolling window method we discussed, coupled with a regression algorithm, seems to work pretty well.

Conclusion

We discussed three methods for time series next-value forecasts with mid-size data sets: ARIMA, using features to represent time effects, and rolling-window-based regression.

Among the three, the third method provides good results comparable with the auto ARIMA model, while needing only minimal hand-holding from the end user.

Hence we believe that "Rolling Window-based Regression" is a useful addition to the forecaster's bag of tricks!

However, this does not discredit ARIMA; with expert tuning, it will do much better. At the same time, with hand-crafted features, methods two and three will also do better.

One crucial consideration is picking the size of the window for the rolling window method. Often we can get a good idea from the domain. Users can also do a parameter search over the window size, as sketched below.
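
Such a search can be as simple as the following sketch, which reuses the to_rolling_window helper and the mape metric sketched earlier (the candidate window sizes are arbitrary):

from sklearn.linear_model import LinearRegression

best_window, best_score = None, float("inf")
for window in (3, 7, 14, 30, 60):
    X, y = to_rolling_window(series, window)
    split = int(len(X) * 0.8)                 # chronological split, as before
    model = LinearRegression().fit(X[:split], y[:split])
    score = mape(y[split:], model.predict(X[split:]))
    if score < best_score:
        best_window, best_score = window, score
print("best window:", best_window, "MAPE:", best_score)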

Following are a few things that need further exploration.

  • Can we use RNNs and CNNs? I tried RNNs, but could not get good results so far.
  • It might be useful to feed other features such as time of day, day of the week, and also moving averages of different time windows.

References

  1. An overview of gradient descent optimization algorithms
  2. CS231n Convolutional Neural Networks for Visual Recognition

Understanding Causality and Big Data: Complexities, Challenges, and Tradeoffs

image credit: Wikipedia, Amitchell125

"Does smoking cause cancer?"

We have heard that a lot of smokers have lung cancer. However, can we mathematically tell that smoking causes cancer?

We can look at cancer patients and check how many of them smoke. We can look at smokers and check whether they develop cancer. Let's assume the answers come up 100%. That is, hypothetically, we see a 1–1 relationship between smokers and cancer.

OK great, can we claim that smoking causes cancer? Apparently, it is not easy to make that claim. Let's assume that there is a gene that causes cancer and also makes people like to smoke. If that is the case, we will still see the 1–1 relationship between cancer and smoking, yet the cancer is caused by the gene. That means there may be an innocent explanation for the 1–1 relationship we saw between cancer and smoking.

This example shows two interesting concepts from statistics: correlation and causality, which play a key role in Data Science and Big Data. Correlation means that we see two readings behave together (e.g. smoking and cancer), while causality means one is the cause of the other. The key point is that if there is causality, removing the first will change or remove the second. That is not the case with correlation.

Correlation does not mean Causation!

This difference is critical when deciding how to react to an observation. If there is causality between A and B, then A is responsible, and we might decide to punish A in some way or to control A. However, correlation alone does not warrant such actions.

For example, as described in the post The Blagojevich Upside, the state of Illinois found that having books at home is highly correlated with better test scores, even if the kids have not read them. So they decided to distribute books. In retrospect, we can easily find a common cause. Having books in a home could be an indicator of how studious the parents are, which will help with better scores. Sending books home, however, is unlikely to change anything.

You see correlation without causality when there is a common cause that drives both readings. This is a common theme of the discussion. You can find a detailed discussion on causality in the talk "Challenges in Causality" by Isabelle Guyon.

Can we prove Causality?

Great, so how can we show causality? Causality is measured through randomized experiments (a.k.a. randomized trials or A/B tests). A randomized experiment selects samples and randomly breaks them into two groups, called control and variation. Then we apply the cause (e.g. send a book home) to the variation group and measure the effects (e.g. test scores). Finally, we measure the causality by comparing the effect in the control and variation groups. This is how medications are tested.

To be precise, if the error bars for the two groups do not overlap, then there is causality. Check https://www.optimizely.com/ab-testing/ for more details.
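
One common way to formalise that comparison is a two-sample significance test; the error-bar overlap check above is a rougher version of the same idea. A Python sketch (Welch's t-test is my choice of formalisation; control and variation are arrays holding the measured outcomes of the two randomly assigned groups):

from scipy import stats

# control, variation: measured outcomes (e.g. test scores) for the two groups
t_stat, p_value = stats.ttest_ind(control, variation, equal_var=False)
if p_value < 0.05:
    print("the difference is unlikely to be due to chance alone")
else:
    print("no significant difference between the groups")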

However, that is not always practical. For example, if you want to prove that smoking causes cancer, you need to first select a population, place them randomly into two groups, make half of them smoke, and make sure the other half does not smoke. Then wait for about 50 years and compare.

Did you see the catch? It is not good enough to compare smokers and non-smokers, as there may be a common cause, like the gene, that makes them do so. To prove causality, you need to randomly pick people and ask some of them to smoke. Well, that is not ethical. So this experiment can never be done. Actually, this argument has been used before (e.g. https://en.wikipedia.org/wiki/A_Frank_Statement).

This can get funnier. If you want to prove that greenhouse gases cause global warming, you need to find another copy of Earth, apply greenhouse gases to one, and wait a few hundred years!!

To summarize, causality sometimes might be very hard to prove, and you really need to differentiate between correlation and causality.

Following are examples of when causality is needed.

  • Before punishing someone
  • Diagnosing a patient
  • Measuring the effectiveness of a new drug
  • Evaluating the effect of a new policy (e.g. a new tax)
  • Changing a behavior

Big Data and Causality

Most big data datasets are observational data collected from the real world, so there is no control group. Therefore, most of the time all you can show is correlation, and it is very hard to prove causality.

There are two reactions to this problem.

First: "Big data guys do not understand what they are doing. It is stupid to try to draw conclusions without a randomized experiment".

I find this view lazy.

Obviously, there is a lot of interesting knowledge in observational data. If we can find a way to use it, that will let us apply these techniques in many more applications. We need to figure out a way to use it and stop complaining. If current statistics does not know how to do it, we need to find a way.

The second is: "forget causality! correlation is enough".

I find this view blind.

Playing ostrich does not make the problem go away. This kind of crude generalization makes people do stupid things and can limit the adoption of Big Data technologies.

We need to find the middle ground!

When do we need Causality?

The answer depends on what we are going to do with the data. For example, if we are just going to recommend a product based on the data, chances are that correlation is enough. However, if we are making a life-changing decision or a major policy decision, we might need causality.

Let us investigate both types of cases.

Correlation is enough when the stakes are low, or when we can later verify our decision. Following are a few examples.

  1. When stakes are low (e.g. marketing, recommendations) — when showing an advertisement or recommending a product to buy, one has more freedom to make an error.
  2. As a starting point for an investigation — correlation is never enough to prove someone is guilty; however, it can show us useful places to start digging.
  3. Sometimes it is hard to know what things are connected, but easy to verify the quality of a given choice. For example, if you are trying to match candidates to a job or decide good dating pairs, correlation might be enough. In both these cases, given a pair, there are good ways to verify the fit.

There are other cases where causality is crucial. Following are a few examples.

  1. Find a cause for disease
  2. Policy decisions (would a $15 minimum wage be better? would free health care be better?)
  3. When stakes are too high (shutting down a company, passing a verdict in court, sending a book to each kid in the state)
  4. When we are acting on the decision (firing an employee)

Even in these cases, correlation might be useful to find good experiments that you want to run. You can find factors that are correlated and design experiments to test causality, which will reduce the number of experiments you need to do. In the book example, the state could have run an experiment by selecting a population, sending the book to half of them, and looking at the outcome.

In some cases, you can build your system to inherently run experiments that let you measure causality. Google is famous for A/B testing every small thing, down to the placement of a button and the shade of a color. When they roll out a new feature, they select a population, roll out the feature to only part of that population, and compare the two.

So in either kind of case, correlation is pretty useful. However, the key is to make sure that the decision makers understand the difference when they act on the results.

Closing Remarks

Causality can be a pretty hard thing to prove. Since most big data is observational data, often we can only show correlation, not causality. If we mix up the two, we can end up doing stupid things.

The most important thing is to have a clear understanding of the difference at the point when we act on the decisions. Sometimes, when the stakes are low, correlation might be enough. In other cases, it is best to run an experiment to verify our claims. Finally, some systems might warrant building experiments into the system itself, letting you draw strong causality results. Choose wisely!

Original Post from my Medium account: https://medium.com/@srinathperera/understanding-causality-and-big-data-complexities-challenges-and-tradeoffs-db6755e8e220#.ca4j2smy3

Value Proposition of Big Data after a Decade

Big data is an umbrella term for many technologies: search, NoSQL, distributed file systems, batch and real-time processing, and machine learning (data science). These different technologies have been developed and proven to varying degrees. After 10 years, is it real? Following are a few success stories of what big data has done.

  1. Nate Silver predicted the outcomes of 49 of the 50 states in the 2008 U.S. Presidential election
  2. Moneyball (baseball drafting)
  3. Cancer detection from biopsy cells (big data found 12 tell-tale patterns while doctors only knew about nine). See http://go.ted.com/CseS
  4. Bristol-Myers Squibb reduced the time it takes to run clinical trial simulations by 98%
  5. Xerox used big data to reduce the attrition rate in its call centre by 20%
  6. Kroger loyalty programs (growth in 45 consecutive quarters)

As these examples show, big data can indeed work. Could it work for you? Let's explore this a bit.

The premise of big data goes as follows.

If you collect data about your business and feed it to a Big Data system, you will find useful insights that will provide a competitive advantage — (e.g. Analysis of data sets can find new correlations to “spot business trends, prevent diseases, combat crime and so on”. [Wikipedia])

When we say Big Data will make a difference, the underlying assumption is that the way we and our organisations operate is inefficient.

This means Big Data is an optimization technique. Hence, you must know what is worth optimizing. If your boss asked you to make sure the organization is using big data, doing "Big Data Washing" is easy.

  1. Publish or collect the data you can with a minimal effort
  2. Do a lot of simple aggregations
  3. Figure out which data combinations make the prettiest pictures
  4. Throw in some machine learning algorithms, predict something, but don't compare
  5. Create a cool dashboard and do a cool demo. Claim that you are just scratching the surface!!

However, adding value to your organization through big data is not that easy, because insights are not automatic. Insights are possible only if we have the right data, we look in the right place, such insights exist, and we actually find them.

Making a difference requires you to understand what is possible with big data and the tools it offers, as well as the pain points in your domain and organization. The following picture shows some of the applications of big data within an organization.

The first step is asking which of those applications can make a difference for your organization.

The next step is understanding the tools in the "Big Data toolbox". They come in many forms.

KPIs (Key Performance Indicators) — People used to take canaries into coal mines. Since those small birds are very sensitive to the oxygen level in the air, if one got knocked out, you needed to run out of the mine. KPIs are canaries for your organization. They are numbers that give you an idea about the performance of something — e.g. GDP, per capita income, and the HDI index for a country; or company revenue, lifetime value of a customer, and revenue per square foot (in the retail industry). Chances are your organization or your domain has already defined them. The idea is to use Big Data to monitor the KPIs.

Dashboards — Think about a car dashboard. It gives you an idea about the overall system at a glance. It is boring when all is good, but it grabs attention when something is wrong. However, unlike car dashboards, big data dashboards support drilling down to find the root cause.

Alerts — Alerts are notifications (sent via email, SMS, pager, etc.). Their goal is to give you peace of mind by saving you from having to check all the time. They should be specific, infrequent, and have a very low false positive rate.

Sensors — Sensors collect data and make them available to the rest of the system. They are expensive and time-consuming to install.

Analytics — Analytics make decisions. They come in four forms: batch, interactive, real-time, and predictive.

  • Batch analytics — processes data that resides on disk. If you can wait (e.g. more than an hour) for the data to be available, this is what you use.
  • Interactive analytics — used by a human to issue ad-hoc queries and understand a dataset. Think of it as having a conversation with the data.
  • Real-time analytics — used to detect something quickly, within a few milliseconds to a few seconds. Real-time analytics is very powerful for detecting conditions over time (e.g. football analytics). Alerts are implemented through real-time analytics.
  • Predictive analytics — learns a solution from examples. For example, it is very hard to write a program to drive a car because there are too many edge conditions. We solve that kind of problem by giving lots of examples and asking the computer to figure out a program that solves the problem (which we call a model). Two common forms are predicting the next value (e.g. electricity load prediction) and predicting a category (e.g. is this email spam?).

Drill down — To make decisions, operators need to see the data in context and drill down into the details to understand the root cause. The typical model is to start from an alert or dashboard, see the data in context (other transactions around the same time, what the same user did before and after, etc.), and then let the user drill down. For example, see the WSO2 Fraud Detection Solution Demo.

The process of deriving insight from the data using the above tools looks like the following.

Here, different roles work together to explore and understand data, define KPIs, and create dashboards, alerts, etc.

In this process, keeping the system running is a key challenge. This includes DevOps challenges, integrating data continuously, updating models, and getting feedback about the effectiveness of decisions (e.g. the accuracy of fraud detection). Hence, doing things in production is expensive.

On the other hand, "doing it once" is cheap. Hence, you should first try your scenarios in an ad-hoc manner (hire some expertise if you must) and make sure they can add value to the organization before setting up a system that does it every day.

Actionable Insights are the Key!!

Insights that you generate must be actionable. That means several things.

  1. The information you share is significant and warrants attention, and it is presented with its ramifications (e.g. more than two technical issues would lead the customer to churn)
  2. Decision makers can identify the context associated with the insight (e.g. operators can see the history of customers who qualify)
  3. Decision makers can do something about the insight (e.g. they can work with the customers to reassure them and fix the issue)

For each piece of information you show the user, think hard: "Why am I showing them this?", "What can they do with this information?", and "What other information can I show to help them understand the context?"

Where to Start?

Big Data projects can take many forms.

  1. Use an existing dataset: I already have a data set and a list of potential problems. I will use big data to solve some of those problems.
  2. Fix a known problem: find a problem, collect data about it, analyse, visualize, build a model, and improve. Then build a dashboard to monitor it.
  3. Improve an overall process: instrument the processes (start with the most crucial parts), find KPIs, analyze and visualize the processes, and improve them.
  4. Find correlations: collect all available data, data mine or visualize it, and find interesting correlations.

My recommendation is to start with #2: fix a known problem in the organization. That is the least expensive option, and it will let you demonstrate the value of big data right away.

Finally, the following are the key takeaway points.

  • Big Data provides a way to optimize. However, blind application does not guarantee success.
  • Learn the tools in the Big Data toolbox: KPIs, analytics (batch, real-time, interactive, predictive), visualizations, dashboards, alerts, and sensors.
  • Start small. Try things out with data sets before investing in a system.
  • Find a high-impact problem and make it work end to end.