In this article by Jay Gendron, author of the book Introduction to R for Business Intelligence, we look at time series analysis, which is often regarded as the most difficult analysis technique. It is certainly a challenging topic. However, an introductory awareness of a difficult topic is better than perfect ignorance about it. Time series analysis is a technique designed to look at chronologically ordered data that may form cycles over time.
Time series analysis is taught as an upper-level college statistics course, and it is also a demanding topic in econometrics. This article gives you an understanding of a useful but difficult analysis technique through a combination of theoretical learning and hands-on practice. The goal is to give you a basic understanding of working with time series data and a foundation to learn more.
Use Case: forecasting future ridership
The finance group approached the BI team and asked for help with forecasting future trends. They heard about your great work for the marketing team and wanted to get your perspective on their problem. Once a year, they prepare an annual report that includes ridership details. They are hoping to include not only last year's ridership levels, but also a forecast of ridership levels in the coming year. These types of time-based predictions are called forecasts. The Ch6_ridership_data_2011-2012.csv data file is available at http://jgendron.github.io/com.packtpub.intro.r.bi/.
This data is a subset of the bike sharing data. It contains two years of observations, including the date and a count of users by hour.
You just applied a linear regression model to time series data and saw it did not work. The biggest problem was not a failure in fitting a linear model to the trend. For this well-behaved time series, the average formed a linear plot over time. Where was the problem?
The problem was in the seasonal fluctuations. They were one year in length and then repeated. Most of the data points lay above or below the fitted line, instead of on it or near it. As we saw, the ability to make a point estimate prediction was poor. There is an old adage that says even a broken clock is correct twice a day. This is a good analogy for analyzing seasonal time series data with linear regression: the fitted line is a good predictor only about twice every cycle, where the series crosses it. You will need to do something about the seasonal fluctuations in order to make better forecasts; otherwise, your forecasts will simply be straight lines that take no account of the seasonality.
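The broken-clock point can be checked on simulated data. The sketch below is only an illustration: the series, its slope, and its seasonal amplitude are made-up assumptions rather than the ridership data. It fits a straight line to a trend-plus-season series and counts how often the series crosses that line.

```r
# Simulated monthly series: linear trend plus a 12-point seasonal cycle.
# All numbers are illustrative assumptions, not the ridership data.
month <- 1:48
y <- 10 + 0.5 * month + 5 * sin(2 * pi * month / 12)

fit <- lm(y ~ month)      # straight-line fit to the seasonal series
res <- residuals(fit)     # the part of the data the line does not explain

# A change of sign in the residuals means the series crossed the fitted line
crossings <- sum(diff(sign(res)) != 0)
cycles <- length(month) / 12

slope <- unname(coef(fit)["month"])
cat("fitted slope:", round(slope, 3), "\n")
cat("crossings per cycle:", crossings / cycles, "\n")
cat("largest miss:", round(max(abs(res)), 2), "\n")
```

The fitted slope stays close to the simulated trend, yet the largest miss is about as big as the seasonal swing itself, which is why point forecasts from the line alone are poor.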
With seasonality in mind, there are functions in R that can break apart the trend, seasonality, and random components of a time series. The decompose() function, which ships with base R in the stats package, shows how each of these three components influences the data. You can think of this technique as being similar to creating the correlogram plot during exploratory data analysis. It captures a greater understanding of the data in a single plot:
data(airpass, package = "TSA")               # the airpass dataset ships with the TSA package
library(forecast); plot(decompose(airpass))  # decompose() itself comes from stats
The output of this code is shown here:
This decomposition capability is nice as it gives you insights about approaches you may want to take with the data.
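The components can also be inspected as numbers rather than as a plot. The following is a minimal sketch that assumes the built-in AirPassengers series as a stand-in for airpass, so that no extra package is needed; the variable names are illustrative.

```r
# AirPassengers is a built-in monthly series (datasets package). It is used
# here as a stand-in assumption for the airpass data.
parts <- decompose(AirPassengers)   # additive decomposition by default

# The seasonal "figure" holds one effect per position in the 12-month cycle
print(round(parts$figure, 1))

# The trend is a centered moving average, so both ends of it are NA
keep <- !is.na(parts$trend)
cat("points without a trend estimate:", sum(!keep), "\n")

# In an additive decomposition the three components sum back to the data
rebuilt <- parts$trend + parts$seasonal + parts$random
same <- all.equal(as.numeric(rebuilt[keep]), as.numeric(AirPassengers[keep]))
cat("components add back up to the data:", isTRUE(same), "\n")
```

Seeing that trend, seasonal, and random parts add back up to the original series makes it clear that decomposition only separates the data; it does not change it.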
There is an assumption for creating time series models: the data must be stationary. Data is stationary when its mean and variance do not change as a function of time. If you decompose a time series and witness a trend, a seasonal component, or both, then you have non-stationary data. You can transform it into stationary data in order to meet the required assumption.
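One rough way to see what "the mean does not change over time" means is to compare the mean of the first half of a series with the mean of the second half. The sketch below uses simulated data and a hypothetical helper, half_gap(), both invented for this illustration; it is an informal check, not a formal stationarity test.

```r
set.seed(42)                           # fixed seed so the sketch is repeatable
n <- 200
trended <- 0.5 * (1:n) + rnorm(n)      # simulated: the mean drifts upward
steady <- 5 + rnorm(n)                 # simulated: noise around a fixed mean

# Hypothetical helper: absolute gap between first-half and second-half means
half_gap <- function(x) {
  h <- length(x) %/% 2
  abs(mean(x[1:h]) - mean(x[(h + 1):length(x)]))
}

gap_trended <- half_gap(trended)
gap_steady <- half_gap(steady)
cat("gap in means, trended series:", round(gap_trended, 2), "\n")
cat("gap in means, steady series: ", round(gap_steady, 2), "\n")
```

A large gap for the trended series and a small one for the steady series matches the definition above. The same comparison could be made with var() in place of mean().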
Using a linear model for comparison, there is randomness around a mean—represented by data points scattered randomly around a fitted line. The data is independent of time and it does not follow other data in a cycle. This means that the data is stationary. Not all the data lies on the fitted line, but it is not moving. In order to analyze time series data, you need your data points to stay still.
Imagine trying to count a class of primary school students while they are on the playground during recess. They are running about back and forth. In order to count them, you need them to stay still—be stationary. Transforming non-stationary data into stationary data allows you to analyze it. You can transform non-stationary data into stationary data using a technique called differencing.
Differencing subtracts each data point from the data point that immediately follows it in the series. This is done with the diff() function. Mathematically, it works as follows:
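In symbols, using notation assumed here for illustration, let $y_t$ be the observation at time $t$ in a series of $n$ points and $y'_t$ the differenced value. First-order differencing, as computed by diff() with its default lag of 1, can then be written as:

```latex
y'_t = y_t - y_{t-1}, \qquad t = 2, \dots, n
```

A series of $n$ points therefore yields $n - 1$ differenced values.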
Seasonal differencing is similar, but it subtracts each data point from the corresponding data point in the next cycle. This is done with the diff() function, with the lag parameter set to the number of data points in a cycle. Mathematically, it works as follows:
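With the same assumed notation, and writing $m$ for the number of data points in one cycle (the value passed as lag), seasonal differencing can be written as:

```latex
y'_t = y_t - y_{t-m}, \qquad t = m + 1, \dots, n
```

Here a series of $n$ points yields $n - m$ values, because the first $m$ points have no counterpart one cycle earlier.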
Look at the results of differencing in this toy example. Build a small sample dataset of 36 data points that include an upward trend and seasonal component, as shown here:
seq_down <- seq(.625, .125, -0.125)   # 5 falling points
seq_up <- seq(0, 1.5, 0.25)           # 7 rising points
y <- c(seq_down, seq_up, seq_down + .75, seq_up + .75,   # three 12-point cycles,
       seq_down + 1.5, seq_up + 1.5)                     # each shifted up by 0.75
Then, plot the original data and the results obtained after calling the diff() function:
par(mfrow = c(1, 3))
plot(y, type = "b", ylim = c(-.1, 3))
plot(diff(y), ylim = c(-.1, 3), xlim = c(0, 36))
plot(diff(diff(y), lag = 12), ylim = c(-.1, 3), xlim = c(0, 36))
par(mfrow = c(1, 1))
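The effect of each differencing step on the toy series can also be checked numerically instead of visually. This sketch rebuilds the same y so that it runs on its own; the names d1 and d12 are introduced here only for illustration.

```r
# Rebuild the toy series so this sketch is self-contained
seq_down <- seq(.625, .125, -0.125)
seq_up <- seq(0, 1.5, 0.25)
y <- c(seq_down, seq_up, seq_down + .75, seq_up + .75,
       seq_down + 1.5, seq_up + 1.5)

d1 <- diff(y)               # first difference: takes out the upward trend
d12 <- diff(d1, lag = 12)   # seasonal difference: takes out the 12-point cycle

cat("lengths:", length(y), length(d1), length(d12), "\n")
cat("distinct values after diff():", sort(unique(d1)), "\n")
cat("largest absolute value after seasonal diff:", max(abs(d12)), "\n")
```

After the first difference only the step sizes of the repeating pattern remain, and after the seasonal difference nothing is left to explain, which is the behavior you want from a stationary series.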
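These three panels show the results of differencing and seasonal differencing. They are described as follows:
Left: the original data, with an upward trend and a repeating 12-point seasonal pattern.
Center: the first difference, diff(y), which removes the trend but leaves the seasonal pattern.
Right: the seasonal difference of the first difference, diff(diff(y), lag = 12), which removes the seasonal pattern as well and leaves a flat series at zero.
Detach the TSA package to avoid conflicts with other functions in the forecast library we will use, as follows:
detach(package:TSA, unload=TRUE)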
Congratulations, you truly deserve recognition for getting through a very tough topic. You now have more awareness about time series analysis than some people with formal statistical training.