0.1 Abstract

In this project we address a typical time series problem, forecasting: predicting the monthly sales of Beer, Wine, and Distilled Alcoholic Beverages in the US. As one may notice in the data, this series is well suited to time series forecasting because it exhibits clear seasonality and increasing variance. By successfully predicting the sales of such a product, we can apply the same approach to the sales of other products, which would help us understand the market better.

0.2 Introduction

We address the problem of forecasting alcoholic beverage sales. The dataset contains the total monthly sales (in millions of dollars) of alcoholic beverages from 1/1/1992 to 1/1/2019. After splitting the data into training and test sets, we forecast the test-set values and check how accurate the predictions are. The result is successful: we obtain an accurate forecast by fitting a \(SARIMA(4,1,0)\times(4,1,0)_{12}\) model. Thanks to RStudio for providing the packages used and to the US Census Bureau for providing the dataset.

0.3 Analysis

0.3.1 Split Training and Testing

We split the data into training and test sets in this part. We hold out the last 25 observations (2017/1/1 through 2019/1/1) as the test set for prediction and use the rest as the training set; see the sketch below, and the Appendix for the full code.
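A minimal sketch of the split, mirroring the Appendix code (alc_data holds the 325 monthly observations from 1/1/1992 through 1/1/2019):

alc_train = alc_data[1:300]     # 1992/1 - 2016/12, used for model fitting
alc_test = alc_data[301:325]    # 2017/1 - 2019/1, held out for forecast evaluation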

0.3.2 Original plot and analysis

First, we plot the time series data above. Notice that the graph shows increasing variance, which implies that a Box-Cox transformation is needed. There is also a linear trend over time, and the ACF reveals a seasonality of period 12. In addition, the ACF decays slowly, which indicates that the original data are non-stationary. Last but not least, notice the periodic sharp drop from December to January.

0.3.3 Data Transformation and differencing

Using the boxcox function in R, we choose \(\lambda = 0.3\) (since the estimate is neither close to 0 nor to 0.5) and apply the transformation \(\frac{1}{\lambda}(x^{\lambda}-1)\). Observing the time series plot again, the increasing variance is gone after the Box-Cox transformation. We then difference the data to remove the seasonality and the trend.

*******R output

## the variance after taking difference at lag 1 once 5.264656
## the variance after taking difference at lag 12 once and lag 1 once 1.216632
## the variance after taking difference at lag 12 once and lag 1 twice 3.857639
## the variance after taking difference at lag 12 twice and lag 1 once 2.053683

*******R output

Based on the output, notice that the variance starts to increase once we difference more than once at lag 12 or at lag 1, so we difference once at lag 1 and once at lag 12.
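For reference, the comparison can be reproduced with a short sketch (it mirrors the Appendix code; alc.bc is the Box-Cox transformed training series):

alc.diff1 = diff(alc.bc, lag = 1)           # remove the linear trend
alc.diff1_12 = diff(alc.diff1, lag = 12)    # remove the seasonality of period 12
var(alc.diff1); var(alc.diff1_12)           # variance drops, so we keep d = 1, D = 1
var(diff(alc.diff1_12, lag = 1))            # differencing further only raises the variance
var(diff(alc.diff1_12, lag = 12))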

After the transformation and differencing, one can observe that there is no longer any trend, seasonality, or increasing variance. The ACF and PACF plots suggest an AR model, so we tentatively consider an \(ARMA(p,0)\times(P,0)_{12}\) model. Based on the PACF plot, we may set P = 4 or 3, since there are significant values around lags 36 and 48, and we may set p = 4 or 2, since the significant PACF values lie either 2 or 4 lags away from lag 12k for k = 1, 2, 3, 4. Thus we will try the four candidate models \(SARIMA(2,1,0)\times(4,1,0)_{12}\), \(SARIMA(4,1,0)\times(4,1,0)_{12}\), \(SARIMA(4,1,0)\times(3,1,0)_{12}\), and \(SARIMA(2,1,0)\times(3,1,0)_{12}\) below for diagnostic checking.

0.3.4 Model selection

*******R output

## 
## Call:
## arima(x = alc.bc, order = c(2, 1, 0), seasonal = list(order = c(4, 1, 0), period = 12), 
##     method = "ML")
## 
## Coefficients:
##           ar1      ar2     sar1     sar2     sar3     sar4
##       -0.7781  -0.4944  -0.4488  -0.5108  -0.5708  -0.5533
## s.e.   0.0599   0.0576   0.0506   0.0485   0.0505   0.0518
## 
## sigma^2 estimated as 0.2224:  log likelihood = -206.3,  aic = 426.6
## 
## Call:
## arima(x = alc.bc, order = c(4, 1, 0), seasonal = list(order = c(4, 1, 0), period = 12), 
##     method = "ML")
## 
## Coefficients:
##           ar1      ar2     ar3      ar4     sar1     sar2     sar3     sar4
##       -0.8009  -0.6037  -0.162  -0.1471  -0.4548  -0.5080  -0.5901  -0.5424
## s.e.   0.0624   0.0784   0.077   0.0601   0.0512   0.0487   0.0503   0.0533
## 
## sigma^2 estimated as 0.2172:  log likelihood = -203.09,  aic = 424.17
## 
## Call:
## arima(x = alc.bc, order = c(4, 1, 0), seasonal = list(order = c(3, 1, 0), period = 12), 
##     method = "ML")
## 
## Coefficients:
##           ar1      ar2      ar3      ar4     sar1     sar2     sar3
##       -0.8988  -0.7891  -0.2408  -0.2166  -0.2417  -0.3216  -0.4573
## s.e.   0.0670   0.0795   0.0787   0.0595   0.0590   0.0530   0.0651
## 
## sigma^2 estimated as 0.3011:  log likelihood = -240.92,  aic = 497.85
## 
## Call:
## arima(x = alc.bc, order = c(2, 1, 0), seasonal = list(order = c(3, 1, 0), period = 12), 
##     method = "ML")
## 
## Coefficients:
##           ar1      ar2     sar1    sar2     sar3
##       -0.8970  -0.6313  -0.2557  -0.327  -0.3916
## s.e.   0.0631   0.0542   0.0627   0.055   0.0695
## 
## sigma^2 estimated as 0.3181:  log likelihood = -247.53,  aic = 507.06
## [1] "The AICC for each model is the following:"
## The AICc for SARIMA(2,1,0)(4,1,0)_12 is,  426.8824
## The AICc for SARIMA(4,1,0)(4,1,0)_12 is,  424.6686
## The AICc for SARIMA(4,1,0)(3,1,0)_12 is,  498.2327
## The AICc for SARIMA(2,1,0)(3,1,0)_12 is,  507.2602

*******R output

Based on the results above and the Principle of Parsimony, we pick \(SARIMA(2,1,0)\times(4,1,0)_{12}\) and \(SARIMA(4,1,0)\times(4,1,0)_{12}\) for diagnostic checking because they have the lowest AICc values.
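The Appendix does not show how the AICc values were computed; a minimal sketch of the standard small-sample correction \(AICc = AIC + \frac{2k(k+1)}{n-k-1}\) is below. Here k is taken as the number of estimated coefficients and n as the length of the training series, which reproduces the reported values up to rounding (other conventions count \(\sigma^2\) as a parameter or use the effective sample size):

## AICc = AIC + 2k(k+1)/(n - k - 1), with k = number of estimated coefficients
AICc = function(fit, n) {
  k = length(fit$coef)
  fit$aic + 2 * k * (k + 1) / (n - k - 1)
}
AICc(fit1, n = length(alc_train))   # SARIMA(4,1,0)x(4,1,0)_12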

0.3.5 Diagnostic Checking for Models

0.3.6 checking for SARIMA(2,1,0)(4,1,0)_12

*******R output

## 
##  Box-Pierce test
## 
## data:  res
## X-squared = 20.415, df = 11, p-value = 0.03996
## 
##  Box-Ljung test
## 
## data:  res
## X-squared = 21.231, df = 11, p-value = 0.03106
## 
##  Box-Ljung test
## 
## data:  res^2
## X-squared = 21.87, df = 17, p-value = 0.1898
## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.99452, p-value = 0.3583

*******R output

One can observe that \(SARIMA(2,1,0)\times(4,1,0)_{12}\) does not pass the Box-Pierce and Box-Ljung tests: their p-values are smaller than 0.05, so we reject the null hypothesis that the residuals are uncorrelated. (The tests use lag \(h = \sqrt{n} \approx 17\) and fitdf = 6, the number of estimated coefficients, giving 11 degrees of freedom.) In addition, the residual ACF and PACF show significant values at lags 3 and 17. Thus this model does not pass the diagnostic checking, and we move on to the next candidate.

0.3.7 checking for SARIMA(4,1,0)(4,1,0)_12

*******R output

## 
##  Box-Pierce test
## 
## data:  res1
## X-squared = 16.564, df = 9, p-value = 0.05599
## 
##  Box-Ljung test
## 
## data:  res1
## X-squared = 17.362, df = 9, p-value = 0.04335
## 
##  Box-Ljung test
## 
## data:  res1^2
## X-squared = 19.613, df = 17, p-value = 0.2945
## 
##  Shapiro-Wilk normality test
## 
## data:  res1
## W = 0.99559, p-value = 0.5585

*******R output

Based on the results above, this model passes the Box-Pierce test, the McLeod-Li test (Box-Ljung on the squared residuals), and the Shapiro-Wilk normality test, since their p-values are all greater than 0.05. The Box-Ljung test on the residuals is borderline, with a p-value of 0.043. All residual ACF and PACF values lie within the confidence bands, and the Normal Q-Q plot looks linear. Thus we conclude that \(SARIMA(4,1,0)\times(4,1,0)_{12}\) is our desired model. Since arima in R fits the AR polynomial as \(1-\phi_1B-\cdots\) and every estimated coefficient is negative, the algebraic form has plus signs: \[ (1 + 0.8009B + 0.6037B^2 + 0.162B^3 + 0.1471B^4)(1 + 0.4548B^{12} + 0.5080B^{24} + 0.5901B^{36} + 0.5424B^{48})(1-B)(1-B^{12})U_t \\= Z_t \] where \(U_t = \frac{Y_t^{0.3} - 1}{0.3}\) and \(Z_t\) is white noise.

0.3.8 Stationarity Check

*******R output

## The mod of polynomial ar is 1.359662 1.359662 1.91762 1.91762
## The mod of polynomial ar_12 is 1.072313 1.266247 1.266247 1.072313

*******R output

We compute the roots using the polyroot function in R and then take their moduli with Mod. All the moduli listed above are greater than 1, so every root lies outside the unit circle and both the non-seasonal and the seasonal AR polynomials are causal; since the model has no MA part, invertibility holds automatically. Thus we can conclude that the model is stationary and invertible.
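For reference, a minimal sketch of the root check, mirroring the Appendix code:

Mod(polyroot(c(1, 0.8009, 0.6037, 0.162, 0.1471)))     # non-seasonal AR polynomial
Mod(polyroot(c(1, 0.4548, 0.5080, 0.5901, 0.5424)))    # seasonal AR polynomial (in B^12)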

0.3.9 Spectral Analysis

*******R output

## By Fisher test, the p-value is: 0.3470133

*******R output

By spectral analysis of the residuals, the model passes both the Fisher test and the Kolmogorov-Smirnov test: the Fisher p-value is larger than 0.05, and the cumulative periodogram in the Kolmogorov-Smirnov plot stays inside the confidence bands. In addition, the periodogram generated above shows no dominant frequency. Thus we are confident that the residuals are Gaussian white noise.

0.3.10 Forecasting

Based on the plot above, notice that the observed alcoholic sales in the test period lie inside the prediction interval, which means that our prediction is accurate enough.
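Since the model was fitted on the Box-Cox scale, the point forecasts \(\hat{U}_t\) and the prediction-interval endpoints are transformed back to the original scale before plotting; this is what the original_dat helper in the Appendix computes, with the \(\lambda\) chosen earlier: \[ \hat{Y}_t = (\lambda \hat{U}_t + 1)^{1/\lambda} \]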

0.4 Conclusion

As a conclusion, we have successfully predicted the sales of alcoholic beverage products using the \(SARIMA(4,1,0)\times(4,1,0)_{12}\) model, which passed all of the diagnostic checks described above. Its algebraic form is, again: \[ (1 + 0.8009B + 0.6037B^2 + 0.162B^3 + 0.1471B^4)(1 + 0.4548B^{12} + 0.5080B^{24} + 0.5901B^{36} + 0.5424B^{48})(1-B)(1-B^{12})U_t \\= Z_t \] There are numerous products on the market with the same properties as alcoholic beverages; forecasting them is left as an exercise for the reader. Thanks to the instruction of Professor Raya Feldman; this project could not have been done without her help.

0.5 References:

https://fred.stlouisfed.org/series/S4248SM144NCEN

All content is based on the PSTAT 174/274 course at the University of California, Santa Barbara, taught by Raya Feldman.

0.6 Appendix

## time series plot
alc = read.csv("alcohol_Sales.csv", col.names = c("date", "Millions_of_Dollars"))
alc_data = alc$Millions_of_Dollars

## test train split
alc_train = alc_data[1:300]
alc_test = alc_data[301:325]

## Original Data Plot and Analysis
par(mfrow = c(3,1))
plot.ts(alc_train, xlab = "Time", ylab = "Alcohol Sales in millions of Dollars")
acf(alc_train, lag.max = 50)
pacf(alc_train, lag.max = 50)

## Box-Cox Transformation
par(mfrow = c(2,1))
require(MASS)
bcTransform = boxcox(alc_train ~ as.numeric(1:length(alc_train)))
lambda = bcTransform$x[which(bcTransform$y == max(bcTransform$y))]
alc.bc = (1/lambda)*(alc_train^lambda-1)
plot.ts(alc.bc, main = "plot after doing box-cox transform")


## apply diff 1 to get rid of the linear trend
alc.diff1 = diff(alc.bc, lag = 1)
cat("the variance after taking difference at lag 1 once", var(alc.diff1), "\n")

## apply diff 12 to get rid of seasonality of 12 (plot acf to show it)
alc.diff1_12 = diff(alc.diff1, lag = 12)
cat("the variance after taking difference at lag 12 once and lag 1 once", var(alc.diff1_12), "\n")

alc.diff1_1_12 = diff(alc.diff1_12, lag = 1)
cat("the variance after taking difference at lag 12 once and lag 1 twice", var(alc.diff1_1_12), "\n")

alc.diff1_12_12 = diff(alc.diff1_12, lag = 12)
cat("the variance after taking difference at lag 12 twice and lag 1 once", var(alc.diff1_12_12))

## draw time series plot after differencing and its ACF and PACF
par(mfrow = c(3,1))
x = 1:length(alc.diff1_12)
plot.ts(alc.diff1_12, main = "plot after differencing")
abline(h = mean(alc.diff1_12))
abline(lm(alc.diff1_12~x))
acf(alc.diff1_12, lag.max = 70)
pacf(alc.diff1_12, lag.max = 70)

## Model Fit
fit = arima(alc.bc, order = c(2,1,0), seasonal = list(order = c(4,1,0), period = 12), method = "ML")
fit

fit1 = arima(alc.bc, order = c(4,1,0), seasonal = list(order = c(4,1,0), period = 12), method = "ML")
fit1

fit2 = arima(alc.bc, order = c(4,1,0), seasonal = list(order = c(3,1,0), period = 12), method = "ML")
fit2

fit3 = arima(alc.bc, order = c(2,1,0), seasonal = list(order = c(3,1,0), period = 12), method = "ML")
fit3

## Diagonostic Checking Process 
### Plot of Residuals and its ACF and PACF
par(mfrow = c(2,2))
res = residuals(fit)
hist(res)
ts.plot(res)
abline(h=mean(res))
abline(lm(res~as.numeric(1:length(res))))  # fitted trend line of the residuals

acf(res)
pacf(res)


# Test for independence of residuals
h = round(sqrt(length(alc_train)))
Box.test(res, lag = h, type = c("Box-Pierce"), fitdf = 6)

Box.test(res, lag = h, type = c("Ljung-Box"), fitdf = 6)

Box.test(res^2, lag = h, type = c("Ljung-Box"), fitdf = 0)

# Test for normality
shapiro.test(res)
qqnorm(res)

# Residuals plot for another model and its ACF and PACF
par(mfrow = c(2,2))
res1 = residuals(fit1)
hist(res1)
ts.plot(res1)
abline(h=mean(res1))
abline(lm(res1~as.numeric(1:length(res1))))  # fitted trend line of the residuals

acf(res1)
pacf(res1)


# Test for independence of residuals for another model
h = round(sqrt(length(alc_train)))
Box.test(res1, lag = h, type = c("Box-Pierce"), fitdf = 8)

Box.test(res1, lag = h, type = c("Ljung-Box"), fitdf = 8)

Box.test(res1^2, lag = h, type = c("Ljung-Box"), fitdf = 0)

#Test for Normality for another model
shapiro.test(res1)
qqnorm(res1)

## check for stationary
require(dse)
Z_vector1 = polyroot(c(1, 0.8009, 0.6037, 0.162, 0.1471))
Z_vector2 = polyroot(c(1, 0.4548, 0.5080, 0.5901, 0.5424))
Mod(Z_vector1)
Mod(Z_vector2)


## Spectral Analysis
library("GeneCycle")
cat("By Fisher test, the p-value is:", fisher.g.test(res1))

par(mfrow = c(2,1))
cpgram(res1)

require(TSA) 
periodogram(res1)
abline(h=0)

## Prediction 
pred.tr <- predict(fit1, n.ahead = 25)
U.tr= pred.tr$pred + 2*pred.tr$se
L.tr= pred.tr$pred - 2*pred.tr$se
ts.plot(alc.bc, xlim=c(1,length(alc.bc)+25), ylim = c(min(alc.bc),max(U.tr)), 
        main = "Predictions on Cox-Box transformed data", ylab = "Sales of Cox-Box transformed data")
lines(U.tr, col="blue", lty="dashed")
lines(L.tr, col="blue", lty="dashed")
points((length(alc.bc)+1):(length(alc.bc)+25), pred.tr$pred, col="red")


original_dat <- function(number){
  return((number*lambda + 1)^(1/lambda))
}
pred.orig <- original_dat(pred.tr$pred)
U= original_dat(U.tr)
L= original_dat(L.tr)
ts.plot(alc_train, xlim=c(1,length(alc_train)+25), ylim = c(min(alc_train),max(U)), 
        main = "Predictions on Original Alcoholic Data", ylab = "Sales in Millions of Dollars")
lines(U, col="blue", lty="dashed")
lines(L, col="blue", lty="dashed")
points((length(alc_train)+1):(length(alc_train)+25), pred.orig, col="red")

# To zoom in on the graph, starting from entry 200:
ts.plot(alc_data, xlim = c(200,length(alc_train)+25), ylim = c(250,max(U)),
        main = "Zoomed Version of Predictions on Original Alcoholic Data", ylab = "Sales in Millions of Dollars")
lines(U, col="blue", lty="dashed")
lines(L, col="blue", lty="dashed")
points((length(alc_train)+1):(length(alc_train)+25), pred.orig, col="red")