Predicting User Preference for Innovative Features in Intelligent Connected Vehicles from a Cultural Perspective

3.3. Measures and Procedure

The independent variables of the questionnaire consisted of two parts, basic demographic information and individual cultural orientation scores, while the dependent variables were users’ preferences for the 18 innovative HMI features. The basic demographic information included gender, age, city, car brand, average weekly mileage and average weekly usage count of the in-vehicle information system (IVIS). The individual cultural orientation part was scored on a Likert scale, with 1 being the lowest degree and 5 the highest; the specific wording of the options varied with the questions. To ensure high standardization and validity, the questions in this part were selected from the standard questionnaire used by Hofstede and adapted with reference to previous studies to fit the present study [34]. A total of 15 questions were selected for this study, and each user’s orientation toward each cultural dimension was assessed by the scores on three related questions. User preferences were measured by a two-part question. First, two ranking questions about the main factors and features that users pay the most attention to when using a car provided the reference for the final ranking and weighting comparison of the features preferred by each user group. Second, respondents made a binary choice (like/dislike) for each of the 18 features, yielding the user preference data used in the subsequent analysis.

After the questionnaires were collected, the data were initially processed in SPSS 25. Since users’ basic information was recorded as character strings, the options were recoded as the numbers 1–5 for subsequent data processing. Moreover, to avoid interference in this process, nominal variables and ordered categorical variables with unequal distances between levels were converted into dummy variables, so that each variable encoded at most two meanings. The answers in the cultural dimension part were converted into scores of 1–5 points, and the total score (0–15 points) of the 3 questions on a given cultural dimension was calculated to represent individual cultural orientation. The score of each cultural dimension was then converted into 5 grades (1–5 from low to high) at 3-point intervals to balance the weights among the variables and avoid the bias that overly large score differences would introduce. In the user preference part, a binary variable coded 0 or 1 indicated whether users did not want or wanted the car to have a certain feature.
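A minimal pandas sketch of this preprocessing, assuming hypothetical file and column names (dummy coding of nominal variables, summing the three Likert items of one Hofstede dimension, and binning the total into five grades at 3-point intervals):

```python
import pandas as pd

# Hypothetical raw questionnaire data; the file and column names are
# illustrative, not the authors' actual variable names.
df = pd.read_csv("questionnaire.csv")

# Dummy-code nominal variables so that each resulting 0/1 column
# carries at most two meanings.
df = pd.get_dummies(df, columns=["gender", "city", "car_brand"])

# Sum the three Likert items belonging to one cultural dimension,
# e.g. individualism, to obtain the dimension's total score.
idv_items = ["idv_q1", "idv_q2", "idv_q3"]
df["idv_total"] = df[idv_items].sum(axis=1)

# Convert the total into 5 grades at 3-point intervals
# (0-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4, 13-15 -> 5).
df["idv_grade"] = pd.cut(df["idv_total"], bins=[0, 3, 6, 9, 12, 15],
                         labels=[1, 2, 3, 4, 5], include_lowest=True)
```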

Common machine learning algorithms for binary classification problems include logistic regression, support vector machine (SVM) and random forest [48].
Random forest is a machine learning algorithm built on the concept of ensemble learning. It combines multiple CART decision trees and achieves better predictive performance by outputting the mode of the individual trees’ predictions [49]. Random forest not only assesses the role of different features in a classification problem effectively and with high accuracy [50,51,52], but also handles a large number of high-dimensional input variables without dimensionality reduction. What is more, the random forest algorithm is highly tolerant of missing values and unbalanced datasets; it is thus suitable for this study. Taking “ambient lighting” as an example, the random forest algorithm achieves a predictive accuracy of 93.521%, indicating satisfactory overall performance.
Logistic regression is a simple, efficient and commonly used classification model. Its advantages are that it yields a classification probability and is robust to small amounts of noise in the data [53]. However, it may not handle categorical variables with many levels well, and this study contains several categorical variables, such as gender, age group and usage frequency, which makes logistic regression less suitable. Moreover, the classification probability it produces is not the focus of this study, so this strength would not be brought into play. Feeding the ambient lighting data into a logistic regression model yields a classification accuracy of 86.982%, only a moderate performance. Therefore, logistic regression was not adopted in this study.

The SVM algorithm excels at handling large feature spaces and interactions among nonlinear features. However, its efficiency decreases with large sample sizes, and it performs poorly on imbalanced datasets. With the SVM algorithm, the accuracy of classifying user preferences for “ambient lighting” is 91.765%, slightly weaker than the random forest model. Unlike decision trees, SVM does not provide feature importance values, so its interpretability is significantly weaker than that of random forest. Moreover, the data obtained in this study form an unbalanced sample set. The SVM algorithm was therefore not adopted either.
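The comparison of the three candidates can be reproduced in outline with scikit-learn, as in the following sketch; the synthetic, mildly imbalanced data here only stand in for the questionnaire data, and the accuracies quoted above come from the study’s own dataset, not from this code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: the real task predicts a like/dislike label from
# demographic and cultural-orientation variables.
X, y = make_classification(n_samples=500, n_features=12,
                           weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: accuracy = {model.score(X_test, y_test):.3%}")
```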

Comparing the different algorithms [54,55] showed that random forest achieved the highest predictive accuracy and offered strong interpretability. Therefore, the model was built with the random forest algorithm to identify the human factors that influence user preference for innovative HMI features and to analyze in depth the influence mechanism of individual cultural orientation.
Once multiple features enter the model, decision trees commonly employ the Gini index to decide which feature to split on. The Gini index measures a decision tree’s ability to distinguish sample data and select attributes for classification; a lower Gini index indicates higher dataset purity. The Gini index is defined as follows:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^{2}$$

where $D$ is the dataset to be partitioned, $C_k$ is the subset of samples in $D$ belonging to the $k$th class, and $|\cdot|$ denotes the number of samples in a set.

Starting from the root node of the decision tree, CART calculates the Gini index for all candidate feature splits. When attribute A divides the training sample set D into D1 and D2, the Gini index of D after the split is (taking as an example the test of whether attribute A = a holds)

$$\mathrm{Gini}(D, A=a) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

where $|D_k|/|D|$ is the proportion of samples falling into subset $D_k$ ($k$ = 1, 2).
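As a concrete illustration of the two formulas, the sketch below computes the Gini index of a label set and the weighted Gini index after a binary split; the helper names and toy data are purely illustrative.

```python
import numpy as np

def gini(labels):
    # Gini(D) = 1 - sum_k (|C_k| / |D|)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(labels, mask):
    # Gini(D, A=a): Gini indices of the subsets D1 (mask True) and
    # D2 (mask False), weighted by the subsets' proportions.
    d1, d2 = labels[mask], labels[~mask]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# Eight users' like(1)/dislike(0) labels split on a binary attribute test.
y = np.array([1, 1, 1, 0, 0, 1, 0, 0])
a_is_true = np.array([True, True, True, True, False, False, False, False])
print(gini(y))                    # 0.5
print(gini_split(y, a_is_true))   # 0.375
```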

Subsequently, the results for all features are compared, and the feature with the smallest Gini index is selected as the best classification attribute; following this principle, the data in the set are classified accordingly. In a random forest, n samples are randomly drawn from the dataset and k features are randomly selected from all features; this process is repeated m times, building m decision trees. The final classification is decided by the vote of the m decision trees, as shown in the following equation [49], where y is the output (class) variable, x is the vector of input variables, h_i is the ith decision tree, and I(·) is the indicator function.

$$H(x) = \arg\max_{y} \sum_{i=1}^{m} I\big(h_i(x) = y\big)$$
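A minimal sketch of this voting rule, assuming the per-tree predictions h_i(x) are already available (the helper name is hypothetical):

```python
import numpy as np

def majority_vote(tree_predictions):
    # tree_predictions: shape (m_trees, n_samples), each row one tree's
    # class predictions h_i(x). For every sample, count the votes per
    # class and return the class with the most votes, i.e.
    # H(x) = argmax_y sum_i I(h_i(x) = y).
    preds = np.asarray(tree_predictions)
    classes = np.unique(preds)
    votes = np.array([(preds == c).sum(axis=0) for c in classes])
    return classes[np.argmax(votes, axis=0)]

# Three trees voting on four samples: the per-column mode wins.
print(majority_vote([[1, 0, 1, 1],
                     [1, 1, 0, 1],
                     [0, 1, 1, 1]]))  # -> [1 1 1 1]
```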

In this study, the random forest model was built in Python with the sklearn module, using a 75% training set and a 25% test set and 100 decision trees; the remaining model parameters were kept at their default values. The factors influencing each dependent variable were explored by looping the model over the 18 preference variables.
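A sketch of how this setup might look is given below; the file name and the “pref_” prefix marking the 18 preference columns are assumptions rather than the authors’ actual code, and the fixed random_state is added only for reproducibility.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical processed dataset: predictor columns (demographics and
# cultural-orientation grades) plus one 0/1 column per HMI feature.
df = pd.read_csv("processed_questionnaire.csv")
targets = [c for c in df.columns if c.startswith("pref_")]        # 18 features
predictors = [c for c in df.columns if not c.startswith("pref_")]

# Loop over the dependent variables: 75%/25% split, 100 trees,
# other parameters left at scikit-learn defaults.
for target in targets:
    X_train, X_test, y_train, y_test = train_test_split(
        df[predictors], df[target], test_size=0.25, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    print(f"{target}: accuracy = {model.score(X_test, y_test):.3%}")
    # Gini-based importance of each human factor for this preference
    importances = pd.Series(model.feature_importances_, index=predictors)
    print(importances.sort_values(ascending=False).head(5))
```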

Finally, a clustering algorithm divided the young potential customers in China into several user groups with different cultural orientations, and the characteristics of each group were fed into the random forest model to predict that group’s preferences, helping enterprises to make R&D plans and marketing programs [56].
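The section does not name the clustering algorithm, so the sketch below assumes k-means over the five cultural-dimension grades, with synthetic data and hypothetical column names, to show how such segments could be formed and profiled.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in: each row is a respondent's five cultural-dimension
# grades (1-5); the column names are hypothetical.
rng = np.random.default_rng(0)
cultural_cols = ["pdi_grade", "idv_grade", "mas_grade", "uai_grade", "lto_grade"]
df = pd.DataFrame(rng.integers(1, 6, size=(200, 5)), columns=cultural_cols)

# Cluster respondents into user groups by cultural orientation.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(df[cultural_cols])
df["segment"] = km.labels_

# Each segment's mean cultural profile can then be fed into the trained
# random forests to predict that group's feature preferences.
print(df.groupby("segment")[cultural_cols].mean())
```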
