Using Probabilistic Machine Learning Methods to Improve Beef Cattle Price Modeling and Promote Beef Production Efficiency and Sustainability in Canada


3.2. Data Acquisition, Description, and Exploration

This study uses historical cattle price data gathered from CanFax [33], a market research firm that is a widely relied-upon source of cattle market information in Canada. CanFax Research Services (CRS) delivers comprehensive statistical and market information on domestic and worldwide beef trends in the Canadian beef sector. The dataset includes regularly updated monthly cattle prices (CAD/cwt) from January 2005 to September 2023 for fed steer cattle in Alberta, along with several other cattle classes. To perform a multivariate machine learning analysis for predicting cattle prices, we constructed a variable matrix comprising several key data series known to be related to cattle prices. For our analysis, we include the consumer price index for all items in Canada [34], which is a measure of overall price inflation; the monthly average Alberta natural gas price [35] (CAD/gigajoule (GJ)), which is strongly related to agricultural production costs; the Canadian–US dollar exchange rate [36]; and Alberta barley prices [37] (CAD/tonne), as barley is the main feed grain used in Alberta beef production. Table 2 summarizes the annual average values for these variables. Figure 2 and Figure 3 visualize the time series trends of fed steer prices and these related variables used for predicting cattle prices from January 2005 to September 2023. The natural gas price and exchange rate display the highest volatility over time. Meanwhile, the fed steer price, barley price, and Canadian consumer price index show a noticeably steep upward trend starting in 2020, coinciding with the onset of the COVID-19 pandemic.
We used Pearson correlation analysis to investigate the correlations between the predictors and fed steer prices. The correlation coefficients are statistically significant and are presented in Figure 4.

It shows a strong positive correlation between the price of fed steer cattle and the Canadian consumer price index (0.86), Alberta barley prices (0.73), and the exchange rate (0.66). Cattle prices, on the other hand, are negatively correlated with the natural gas price (−0.52).
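As an illustration of how such a correlation matrix can be produced, the following pandas sketch computes pairwise Pearson correlations; the file name and column names are hypothetical placeholders for the merged monthly series, not identifiers taken from the study.

```python
import pandas as pd

# Hypothetical file and column names for the merged monthly series
df = pd.read_csv("alberta_cattle_monthly.csv", parse_dates=["date"], index_col="date")

predictors = ["cpi", "natural_gas_price", "cad_usd_rate", "barley_price"]
corr = df[predictors + ["fed_steer_price"]].corr(method="pearson")

# Correlation of each predictor with the fed steer price
print(corr["fed_steer_price"].round(2))
```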

3.3. Data Preprocessing, Partitioning (Train-Test), and Tuning

First, the quality of the data was visually checked for obvious errors, outliers, and missing values. To test the stationarity assumption, we applied Augmented Dickey–Fuller (ADF) tests and determined that first-order differencing was sufficient to make all variables stationary, with the exception of the consumer price index, which required second-order differencing. Consequently, applying second-order differencing to the multivariate dataset rendered the entire dataset stationary.
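A minimal sketch of this stationarity check, assuming the monthly series have been merged into the DataFrame from the previous sketch, could use the ADF test from Statsmodels as follows; the 0.05 significance threshold is an illustrative assumption.

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def adf_pvalue(series: pd.Series) -> float:
    """Return the p-value of the Augmented Dickey-Fuller test for a series."""
    return adfuller(series.dropna(), autolag="AIC")[1]

# Difference each series until the ADF test rejects a unit root (at most twice)
for col in ["fed_steer_price", "barley_price", "natural_gas_price", "cad_usd_rate", "cpi"]:
    series, order = df[col], 0
    while adf_pvalue(series) > 0.05 and order < 2:
        series, order = series.diff(), order + 1
    print(f"{col}: differencing order {order}, ADF p-value {adf_pvalue(series):.3f}")
```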

Then, we estimated the multivariate models on the scaled training data (min-max scaler). Preprocessing the data with the min-max scaler constrains the range of the dataset, and scaling enhances model stability and facilitates machine learning analysis. Subsequently, we used these models to make predictions and assessed their performance on the scaled test dataset. We used 80% of the data for training and 20% for testing. The next step is to ‘tune’ the machine learning models, that is, to select the hyperparameter values that govern each algorithm’s behavior and strongly influence overall prediction performance. We tuned the models on the training datasets by searching over candidate hyperparameter settings for each machine learning algorithm individually and comparing the resulting performance metric to identify the best combination of settings. The tuned hyperparameters are then used to improve the performance of the model [38].
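The scaling, 80/20 split, and tuning steps could be sketched with Scikit-learn as below; the use of GridSearchCV, the random forest example, and the parameter grid are illustrative assumptions rather than the exact settings searched in the study, and `df_stationary` is a hypothetical name for the differenced dataset from Section 3.3.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

# Predictors and target, assumed already differenced to stationarity as in Section 3.3
X = df_stationary[["cpi", "natural_gas_price", "cad_usd_rate", "barley_price"]]
y = df_stationary["fed_steer_price"]

# 80/20 split without shuffling, preserving the temporal order of the series
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Fit the min-max scaler on the training data only, then apply it to both subsets
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Illustrative hyperparameter grid and cross-validation scheme
param_grid = {"n_estimators": [100, 300, 500], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_train_s, y_train)
print(search.best_params_)
```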

3.4. An Introduction to Machine Learning Algorithms and Description

Despite limited applications in agricultural commodity price analysis, machine learning is widely used in a variety of other fields to address complex problems that are difficult to solve with traditional analytical methods [39]. ML is a branch of artificial intelligence that uses algorithms rather than model-based analysis [40] to systematically synthesize the core connections between data and information with the purpose of predicting future scenarios. The main strength of machine learning is identifying underlying relationships within datasets via pattern detection and prediction. ML systems can also detect disruptions to existing models and redesign and retrain themselves to adapt to and coevolve with new information. By relying on historical experience, the machine learning process plays a critical role in generalizing prediction problems, allowing maximum extraction of useful information from previously observed behaviors and patterns. Thus, historically observed data become ‘training’ datasets for the machine learning algorithms, enabling the ML model to generate largely accurate predictions even in novel situations. Many big data applications use ML to run at optimal efficiency. Here, we applied ML techniques to analyze fed steer prices using two different approaches: multivariate and univariate modeling.

3.4.1. Multivariate Analysis

After preparing the data matrix, we applied multivariate and univariate algorithms to predict Alberta fed steer prices. For multivariate machine learning regression modeling, we applied three robust and widely used algorithms: random forest, Adaboost, and support vector machines.

Support Vector Machines (SVM) are a widely used classification technique that separates classes by finding a maximum-margin hyperplane. Theoretically, an SVM requires only a small training sample and is relatively insensitive to the number of dimensions, but it can be computationally intensive. Effective approaches for training SVMs continue to be developed at a swift pace, and the method can also be used for regression with minor modifications [41,42,43].
Random forests (RF), introduced by Breiman [44], are a set of tree predictors in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. As a widely used classification and regression approach, the random forest has proven to be quite effective: it aggregates numerous randomized decision trees and averages their predictions. It performs well even in situations where the number of variables exceeds the number of observations. Furthermore, it is adaptable to a variety of unstructured learning tasks and provides measures of variable importance, making it suitable for large-scale problems [45]. The RF algorithm assesses the significance of every feature in the prediction process and is relatively insensitive to feature scaling and normalization, which makes it simpler to train and tune.
The Adaboost algorithm, or adaptive boosting, is another multivariate method applied in this study. Adaboost, among the first practical boosting techniques, was pioneered by Freund and Schapire [46]. Its core idea is to merge multiple classifiers, termed weak learners, into a single strong classifier through an optimized weighted linear combination, integrating one weak classifier at each step.
Boosting is an ensemble method that employs multiple predictors to enhance accuracy in regression and classification tasks. To amplify and diversify the training dataset, boosting involves sequential sampling, repeatedly drawing samples with replacement from the original data. The component models are learned in sequence, which primarily benefits unstable learners like neural networks or decision trees. There is evidence that boosting leads to higher accuracy [47,48].
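For illustration, the three multivariate learners can be instantiated in Scikit-learn as sketched below; the hyperparameter values are placeholders rather than the tuned values reported in this study, and the scaled training and test arrays are those from the earlier sketch.

```python
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.svm import SVR

# Placeholder hyperparameters; the tuned values were selected as described in Section 3.3
models = {
    "SVR": SVR(kernel="rbf", C=10.0, epsilon=0.1),
    "Random Forest": RandomForestRegressor(n_estimators=300, random_state=0),
    "AdaBoost": AdaBoostRegressor(n_estimators=200, learning_rate=0.05, random_state=0),
}

for name, model in models.items():
    model.fit(X_train_s, y_train)                        # scaled training data
    print(name, round(model.score(X_test_s, y_test), 3))  # R^2 on the held-out test set
```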
Each machine learning algorithm has its own set of strengths and weaknesses. For example, Random Forest and Adaboost might be susceptible to overfitting, especially when confronted with noisy datasets. It is important to highlight that Adaboost, despite being designed for improved generalization, can still be vulnerable to overfitting. To mitigate the impact of noisy data and outliers in the original dataset, particularly for Adaboost, we implemented specific data preprocessing strategies as detailed in the paper (in Section 3.3).

Moreover, these models may demand significant computational resources, particularly when handling complex datasets. In situations involving imbalanced datasets, they may show a bias toward the dominant or majority class. Another weakness of RF is its sensitivity to hyperparameters, necessitating meticulous tuning to achieve optimal performance. Additionally, RF models typically operate on fixed datasets, presenting obstacles to the seamless integration of continuous updates with new data. The support vector regression (SVR) algorithm shares similar weaknesses, with its main challenge lying in its sensitivity to hyperparameters. The efficacy of SVR is greatly contingent on the precise tuning of kernel parameters and the identification of their optimal values. Furthermore, interpreting SVR can be challenging due to its black-box nature, impeding a clear understanding of the relationships between features and the output. Therefore, proper parameter tuning and feature scaling are imperative for ensuring the effective application of SVR.

Despite these weaknesses, these ML algorithms remain widely used and effective in various applications. Addressing these limitations often involves precise hyperparameter tuning, feature engineering, and considering alternative models based on the dataset’s specific characteristics. Also, careful consideration of these limitations and appropriate preprocessing strategies can help mitigate some of these challenges.

3.4.2. Univariate Analysis

For the univariate approach, we used the autoregressive integrated moving average (ARIMA) model, seasonal ARIMA (SARIMA), and the seasonal autoregressive integrated moving average with exogenous factors (SARIMAX). Here, to predict fed steer prices, we used only the historical fed steer price series itself. Univariate time-series analysis is a method for explaining sequential problems over regular time intervals. When a continuous variable is time-dependent, it is advantageous to apply this method, especially when searching for consistent patterns in market data.

ARIMA is a class of models that explains a time series based on its own past values. ARIMA models can be used to model any non-seasonal time series that has patterns and is not random white noise. Making the time series stationary is the first step in creating an ARIMA model, which is achieved through differencing. Depending on the complexity of the series, multiple levels of differencing may be required. Linear regression machine learning models work best when the predictors are not correlated and are independent of one another.

The problem with the basic ARIMA model is that it does not account for seasonality. To capture the seasonality effect, seasonal terms are added to the ARIMA model to create the seasonal ARIMA (SARIMA) model. SARIMA uses seasonal differencing, which is similar to regular differencing except that the value from the same period in the previous season is subtracted rather than the immediately preceding value.
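As a small illustration of the difference, assuming the monthly fed steer price series from the earlier sketches:

```python
# Regular first-order differencing: subtract the immediately preceding month
regular_diff = df["fed_steer_price"].diff(1)

# Seasonal differencing for monthly data: subtract the value from the same month one year earlier
seasonal_diff = df["fed_steer_price"].diff(12)
```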

The SARIMAX model can additionally deal with external factors. We can include an external predictor, also known as an ‘exogenous variable’, in the model along with the seasonal terms. The seasonal index repeats every frequency cycle.
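A minimal SARIMAX sketch with exogenous regressors, using Statsmodels, could look like the following; the (p, d, q)(P, D, Q, s) orders shown are illustrative placeholders, not the orders selected in this study.

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Exogenous predictors aligned with the fed steer price series
exog = df[["cpi", "natural_gas_price", "cad_usd_rate", "barley_price"]]

# Placeholder orders; a monthly seasonal period of 12 is assumed
model = SARIMAX(df["fed_steer_price"], exog=exog,
                order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)
print(result.summary())
```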

In univariate time series modeling, ARIMA, SARIMA, and SARIMAX are very popular prediction models, but like every other model, they have some weaknesses. ARIMA models assume that the underlying relationships in the time series are linear; they may not capture non-linear relationships effectively, which limits their flexibility. They are also sensitive to order selection, and determining the optimal orders is the main challenge for these models. The SARIMA model, despite accounting for seasonality, may not capture complex seasonal patterns. Additionally, these models assume that the variance of the residuals is constant over time. Table 3 presents a summary of the multivariate and univariate machine learning algorithms assessed in this study.

3.5. Validation Methods

A comparison between the multivariate and univariate algorithms was performed to identify the models that perform best at cattle price prediction. To assess prediction errors, predicted prices are evaluated using the mean absolute error (MAE), root mean square error (RMSE), mean square error (MSE), mean absolute percentage error (MAPE), mean percentage error (MPE), and root mean square percentage error (RMSPE) [53,54,55,56].
The mean absolute error (MAE) is a widely used metric for verifying a deterministic prediction and reports the magnitude of the error regardless of its sign [57,58]. The root mean squared error (RMSE) is the average vertical distance between a data point and the fitted line; it is sensitive to outliers and treats under- and over-estimation symmetrically. The mean squared error (MSE) measures the average squared gap between observed and predicted values. Because it works in squared units rather than the original data units, it magnifies the influence of larger errors, penalizing them more heavily than smaller errors; this attribute is useful when searching for a model that avoids large errors. MAPE represents the average absolute error in percentage terms by averaging the absolute percentage errors, providing a straightforward interpretation of errors as percentages [57]. MPE is similar to MAPE but does not take the absolute value of the errors, which is valuable when one wants to know whether the model tends to under-forecast or over-forecast; its primary advantage is enabling comparisons between datasets with different scales. RMSPE is the square root of the mean of the squared percentage errors and penalizes larger errors more heavily than MAPE. These metrics gauge the distance between predicted and actual values, referred to as the “residual” or prediction error, and indicate how closely the actual data align with the fitted values; a lower score reflects better predictive performance [59]. The formulas for each metric are listed below.

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|O_i-P_i\right|$$

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(O_i-P_i\right)^2}$$

$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(O_i-P_i\right)^2$$

$$\mathrm{MAPE}=\frac{100}{n}\sum_{i=1}^{n}\left|\frac{O_i-P_i}{O_i}\right|$$

$$\mathrm{MPE}=\frac{100}{n}\sum_{i=1}^{n}\frac{O_i-P_i}{O_i}$$

$$\mathrm{RMSPE}=100\times\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{O_i-P_i}{O_i}\right)^2}$$

where $O_i$ is the observed value, $P_i$ is the deterministic prediction, and $n$ is the number of observations. Models with the lowest metric values were assumed to be the best models, as these metrics are negatively oriented.
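A direct translation of these formulas into NumPy, shown as a sketch with made-up numbers rather than results from the study:

```python
import numpy as np

def forecast_metrics(obs: np.ndarray, pred: np.ndarray) -> dict:
    """Compute the six error metrics defined above for observed (O_i) and predicted (P_i) values."""
    err = obs - pred
    pct = err / obs
    return {
        "MAE":   np.mean(np.abs(err)),
        "RMSE":  np.sqrt(np.mean(err ** 2)),
        "MSE":   np.mean(err ** 2),
        "MAPE":  100 * np.mean(np.abs(pct)),
        "MPE":   100 * np.mean(pct),
        "RMSPE": 100 * np.sqrt(np.mean(pct ** 2)),
    }

# Illustrative values only
print(forecast_metrics(np.array([100.0, 110.0, 120.0]), np.array([98.0, 112.0, 118.0])))
```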

3.6. Technical Approach at a Glance

In this research, we utilized Python, a powerful and versatile programming language, as the foundation for our entire data processing pipeline. Python’s extensive libraries and tools have enabled us to seamlessly integrate various stages of our workflow, including data exploration, feature engineering, data analytics, and visualization. This approach ensures consistency and efficiency throughout our research process.

For ML modeling, we employed Scikit-learn, a widely used Python library, to implement the various tasks in our machine learning pipeline. This choice was driven by Scikit-learn’s comprehensive suite of tools for data preprocessing, cross-validation, hyperparameter tuning, and model training, specifically for algorithms like SVM, RF, and Adaboost. Scikit-learn is particularly favored in data analytics research for its ease of use, robustness, and the comprehensive nature of its algorithms, which are well suited to a wide range of data types and machine learning tasks.

Our regression pipeline for the multivariate models entailed the following steps: (1) in data preprocessing, we selected the relevant samples and features and ensured the data were appropriate for the models, accounting for the stationarity assumption and scaling; (2) in data splitting, the dataset was split into training and testing subsets, a standard practice in machine learning for evaluating model performance; (3) in data normalization, we applied MinMaxScaler to bring all variables to a similar scale, which is crucial for algorithms like SVM that are sensitive to the scale of input features; (4) in hyperparameter tuning, each algorithm underwent separate hyperparameter tuning to identify the optimal parameters, a step critical for enhancing model performance and preventing issues like overfitting; and (5) in model evaluation, after choosing the best parameters, the models were trained on the training sample and subsequently evaluated on the testing sample to assess their predictive accuracy and generalizability.

Statistical analysis was done with the Statsmodels library for our univariate analyses of the ARIMA, SARIMA, and SARIMAX models. Statsmodels is an essential tool in our research due to its extensive capabilities in statistical modeling, hypothesis testing, and data exploration, making it ideal for detailed statistical analysis. The use of Statsmodels in our research is justified by its strong statistical foundation, offering precise and reliable tools for time series analysis, which are vital for making informed predictions and understanding temporal dynamics in our data.

The pipeline for the univariate models involves the following: (1) model identification, which determines the order of the ARIMA/SARIMA model by examining the autocorrelation and partial autocorrelation functions of the time series; (2) parameter estimation, in which we used techniques such as maximum likelihood estimation; (3) model diagnostics, in which we assessed the model’s performance by checking for autocorrelation in the residuals and ensuring the residuals are normally distributed; and (4) forecasting, in which we used the model to make forecasts and evaluated their accuracy against real data. Through the combined use of Scikit-learn and Statsmodels, our research leverages the strengths of both machine learning and statistical modeling, ensuring a robust and comprehensive analysis of the data. This integration allows us to capitalize on the predictive power of machine learning while also benefiting from the inferential capabilities of statistical models, thereby enriching our research findings.
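The four univariate steps could be sketched with Statsmodels as follows; the ARIMA order, lag choices, and forecast horizon are illustrative assumptions rather than the settings selected in this study.

```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.arima.model import ARIMA

y = df["fed_steer_price"]

# (1) Identification: inspect ACF/PACF of the differenced series to suggest candidate orders
plot_acf(y.diff().dropna(), lags=24)
plot_pacf(y.diff().dropna(), lags=24)

# (2) Estimation: fit the model by maximum likelihood (order shown is a placeholder)
result = ARIMA(y, order=(1, 1, 1)).fit()

# (3) Diagnostics: residual autocorrelation (Ljung-Box) and normality plots
print(acorr_ljungbox(result.resid, lags=[12]))
result.plot_diagnostics(figsize=(10, 8))

# (4) Forecasting: predict a held-out horizon and compare with observed prices
forecast = result.get_forecast(steps=12).predicted_mean
print(forecast.head())
```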
