Robust Data Sampling in Machine Learning: A Game-Theoretic Framework for Training and Validation Data Selection
In this section, we define the problem of selecting data for training ML models. We then formulate this problem as a two-player game and introduce the states, actions, rules, scores, and rewards of the game.
3.1. Problem Statement
Suppose $f$ is a machine learning model mapping features to labels, $y = f(x; \theta)$, where $x$ denotes the features and $\theta$ the model parameters. Function $f$ represents the general form of machine learning models such as neural networks. We consider a 1-dimensional label $y$ for simplicity in this paper, and note that the results and conclusions persist if a multi-dimensional label is considered.
Given a model $f$ and a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, the whole process of machine learning is divided into three stages.
- Sampling: The dataset is divided into training, validation, and test sets, i.e., $\mathcal{D} = \mathcal{D}_{train} \cup \mathcal{D}_{valid} \cup \mathcal{D}_{test}$. The training set is used to update the model parameter $\theta$ by implementing back-propagation algorithms. The validation set is used to monitor the model's out-of-sample performance during training, and the test set is used for the final evaluation of the model $f$ after training. The sizes of the training, validation, and test sets are $N_{train}$, $N_{valid}$, and $N_{test}$, respectively.
- Training: The training set $\mathcal{D}_{train}$ is used to update the model parameter $\theta$, and the validation set $\mathcal{D}_{valid}$ is used to avoid overfitting.
- Evaluation: The trained model $f$ is evaluated using the test set $\mathcal{D}_{test}$.
In this process, $\theta_0$ and $\theta^*$ denote the initial and optimal model parameters, respectively. This problem can also be viewed as a bi-level optimization problem: at the high level, an optimal sampling is sought to split the training and validation sets, which are then used to find the optimal model parameter $\theta^*$ at the low level. In this paper, we focus on optimizing the sampling process and use existing training methods to update the parameters.
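To make the three stages concrete, below is a minimal sketch of the standard pipeline that our sampling game replaces. The synthetic data, the linear model standing in for $f$, and the least-squares fit are illustrative assumptions, not the setup of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

# Stage 1 (sampling): a uniformly random split, the baseline that the
# game described below replaces with learned trainer/validator moves.
idx = rng.permutation(len(X))
tr, va, te = idx[:60], idx[60:80], idx[80:]

# Stage 2 (training): fit the parameters on the training set; the
# validation set monitors out-of-sample error (e.g., early stopping).
theta, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
valid_mse = np.mean((X[va] @ theta - y[va]) ** 2)

# Stage 3 (evaluation): final test MSE of the trained model.
test_mse = np.mean((X[te] @ theta - y[te]) ** 2)
```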
3.2. Two-Player Game
The game starts with empty training and validation sets, and the trainer and validator take turns selecting or skipping the next data point (i.e., the next feature-label pair) until all data points have been considered. Below is an example game built on a small dataset with 10 data points in total. The validator moves first by selecting the first data point into the validation set; the trainer then decides to skip the first data point. The validator moves on according to the historical moves of both players. After 10 turns, the validator has selected the 1st, 6th, and 8th data points into the validation set $\mathcal{D}_{valid}$, and the trainer has selected the 2nd, 3rd, and 9th data points into the training set $\mathcal{D}_{train}$. The training and evaluation processes then follow to compute the reward, which is used to update the trainer and validator. In the remainder of this section, we introduce the details of the two-player game, including its states, actions, rules, scores, and rewards. The architecture of each game player, which contains reinforcement learning (RL) and Monte Carlo tree search (MCTS) components, will also be detailed.
3.2.1. Game State
The game is fully observed, i.e., both the trainer and the validator have access to the history of data selections. The current game state can be represented as a binary indicator vector $s = [s^{tr}; s^{val}] \in \{0,1\}^{2N}$, where $N$ is the size of the dataset; the first $N$ elements indicate whether each data point is selected into the training set, and the last $N$ elements indicate selection into the validation set. For example, $s^{tr}_1 = 1$ means the 1st data point is selected into the training set, i.e., $(x_1, y_1) \in \mathcal{D}_{train}$. In the initial state, all elements of the game state are 0.
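As an illustration, the state can be stored as two length-$N$ indicator arrays, one per player; concatenating them into a single $2N$-vector mirrors the definition above.

```python
import numpy as np

N = 10                                   # dataset size
s_train = np.zeros(N, dtype=np.int8)     # s_train[i] = 1 iff point i in D_train
s_valid = np.zeros(N, dtype=np.int8)     # s_valid[i] = 1 iff point i in D_valid

s_train[0] = 1                           # the 1st data point joins D_train
state = np.concatenate([s_train, s_valid])   # the 2N-dimensional game state
```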
3.2.2. Game Action
Each player's action is binary: it either selects the next data point into its set or skips it. A player has no legal moves once all data points have been traversed, and the game stops when neither player has any legal moves left. The final state records the data selection.
3.2.3. Game Rule
The game is initialized with an empty game state, and the validator makes the first move. Each player determines the action to select or skip the next data point, starting from the leftmost one (i.e., the 1st data point). After making its move, the current player hands the turn to its opponent. The game terminates when both players have considered all the data; the selected data is then applied to train the model $f$, which is finally evaluated on the holdout test set $\mathcal{D}_{test}$.
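The rules can be summarized by the loop below. The policy interface and the detail that each player scans the data with its own pointer are our assumptions where the text leaves the mechanics implicit.

```python
def play_game(N, validator_policy, trainer_policy):
    """One full game: players alternate select/skip over points 0..N-1.

    A policy is a callable (s_train, s_valid, i) -> bool, returning
    True to select point i and False to skip it (an assumed interface).
    """
    s_train, s_valid = [0] * N, [0] * N
    pointers = {"valid": 0, "train": 0}    # each player's next data point
    current = "valid"                      # the validator moves first
    while pointers["valid"] < N or pointers["train"] < N:
        i = pointers[current]
        if i < N:                          # the player still has legal moves
            policy = validator_policy if current == "valid" else trainer_policy
            if policy(s_train, s_valid, i):                 # action: select
                (s_valid if current == "valid" else s_train)[i] = 1
            pointers[current] += 1         # move made: select or skip
        current = "train" if current == "valid" else "valid"
    return s_train, s_valid                # final state records the selection
```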
3.2.4. Game Score
In the example game, the mean squared error (MSE) is used to evaluate the performance of the model, and it can thus be considered as a score to evaluate the data selection. If the score is high, the validator wins the game, and vice versa. A non-trivial question is then how to determine the MSE threshold for claiming a winner. In contrast with traditional two-player games such as chess and Go, the trainer and validator play asymmetric roles in the data-sampling game. To address this, we first measure the average test MSE of a random sampler, which we call the base MSE, and we claim the trainer (validator) as the winner if the test MSE is significantly lower (higher) than the base MSE. Suppose our two players play randomly for sufficiently many games; the test MSE of the trained model is then a random variable, denoted as $M = \mathrm{MSE}(f(\cdot;\theta^*), \mathcal{D}_{test})$, where $\theta^*$ is the optimal parameter. The test MSEs of different games are independent and identically distributed. The expectation $\mu = \mathbb{E}[M]$, which is in fact the base MSE, should be a moderate value because both players lack intelligence, so there is no guarantee that one could always outperform the other. To win the game, the test MSE should be significantly higher (for the validator) or lower (for the trainer) than the base MSE $\mu$. In practice, we use the sample mean of $M$ to approximate $\mu$: $\bar{M} = \frac{1}{n}\sum_{i=1}^{n} M_i$, where $n$ is the number of games played and $M_i$ is the test MSE of the $i$-th game.
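The base MSE $\mu$ can then be estimated by Monte Carlo, as sketched below. This reuses `play_game` from the earlier sketch, and `train_and_eval` is an assumed callback that runs the training and evaluation stages on a given selection and returns the test MSE.

```python
def estimate_base_mse(n_games, N, train_and_eval, rng):
    """Approximate mu = E[M] by the sample mean over random games."""
    # Both players select or skip uniformly at random (no intelligence).
    random_policy = lambda s_tr, s_va, i: rng.random() < 0.5
    mses = []
    for _ in range(n_games):
        s_train, s_valid = play_game(N, random_policy, random_policy)
        mses.append(train_and_eval(s_train, s_valid))  # test MSE M_i
    return sum(mses) / len(mses)                       # M_bar, approximating mu
```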
To discourage selecting excessively many data points, the score augments the test MSE with a size penalty, $c(s) = M + \lambda \lVert s \rVert_0$, where $s$ is the final game state, $\lambda$ controls the penalty for the selected data size, and $\lVert \cdot \rVert_0$ is the $\ell_0$ norm that counts the number of non-zero elements. It is hard to find a unified form of the score for both the trainer and validator because they have opposite targets. Therefore, scores are calculated for the trainer and validator separately, and the means of the scores after sufficiently many games are $\bar{c}^{tr}$ and $\bar{c}^{val}$ for the trainer and validator, respectively.
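A minimal sketch of the score computation, assuming the additive form $c(s) = M + \lambda \lVert s \rVert_0$ reconstructed above:

```python
import numpy as np

def game_score(test_mse, state, lam):
    """Score c(s) = M + lam * ||s||_0; lam penalizes selecting many points."""
    return test_mse + lam * np.count_nonzero(state)
```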
3.2.5. Game Reward
The game reward, ranging from −1 to 1, describes how favorable the game outcome is to each player. As the two players play competitively, the rewards for the trainer and validator are opposite. We adopt a mapping function to convert scores to rewards. Let $c^{tr}$ and $c^{val}$ denote the trainer's and validator's scores of a new game. If both scores are lower or higher than the average scores $\bar{c}^{tr}$ and $\bar{c}^{val}$ by a tolerance $\epsilon$, we announce a winner, whose reward is +1 while the loser's is −1. Otherwise, the game result is a tie. The reward is defined as follows:
- If $c^{tr} < \bar{c}^{tr} - \epsilon$ and $c^{val} < \bar{c}^{val} - \epsilon$, then the trainer wins: $r^{tr} = +1$ and $r^{val} = -1$.
- If $c^{tr} > \bar{c}^{tr} + \epsilon$ and $c^{val} > \bar{c}^{val} + \epsilon$, then the validator wins: $r^{tr} = -1$ and $r^{val} = +1$.
- Otherwise, a piece-wise linear function is adopted to map the score to the reward, as shown in Figure 2. We randomly sample the training and validation data, then train and evaluate a CF model (introduced in Section 4); Figure 2 shows the distribution of the test MSE (left y-axis). The blue line shows the piece-wise mapping function of the reward (right y-axis), and the validator's score-to-reward mapping is the opposite of the trainer's.
The tolerance $\epsilon$ should be on the same scale as the average scores $\bar{c}^{tr}$ and $\bar{c}^{val}$. If $\epsilon$ is too large, most games end in draws and return zero rewards; if $\epsilon$ is too small, a win by a narrow margin receives a high reward value, imposing fluctuation on the game results. In our experiments, $\epsilon$ is a hyperparameter to tune.
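The mapping can be implemented as below, shown from the trainer's perspective (the validator's reward is its negation). The linear clipping form is our reading of Figure 2 and is an assumption.

```python
def trainer_reward(score, base_score, eps):
    """Map a game score to a reward in [-1, 1] for the trainer.

    A score at or below base_score - eps is a clear win (+1), at or
    above base_score + eps a clear loss (-1); in between, the reward
    interpolates linearly. The validator's reward is the negation.
    """
    z = (base_score - score) / eps   # >= 1 win margin, <= -1 loss margin
    return max(-1.0, min(1.0, z))
```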
3.3. Monte Carlo Tree Search (MCTS)
To avoid myopic and greedy moves, a player chooses its next action by considering its opponent's future moves. In other words, each player solves a search problem whose search tree grows exponentially when multiple moves ahead are taken into consideration. To tackle this issue, MCTS iteratively explores the search space and gradually concentrates its exploration on the most promising region. Below, we briefly introduce the tree components before moving on to the tree search strategy.
A node $s$ represents the state of the game at a certain stage. For example, the root node represents the initial game state with empty training and validation sets, and a termination node represents the final game state where the training and validation sets have been sampled. All nodes, except for termination nodes, connect to their child nodes through edges, which represent the actions of the current player. A tree starts with the root node as its only component and grows by appending one child node at a time. A newly added child node is called a leaf node, and it turns into a non-leaf node once one of its own children is appended to the tree. Note that the leaf/non-leaf attribute applies only to nodes that already exist in the tree; a termination node is not a leaf node until it is visited and appended to the tree.
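One possible bookkeeping for these nodes and their edge statistics is sketched below; the field names are our own.

```python
class Node:
    """A search-tree node holding a game state and its edge statistics."""

    def __init__(self, state, parent=None):
        self.state = state        # binary 2N-vector at this stage of the game
        self.parent = parent
        self.children = {}        # action -> Node, filled during expansion
        self.n = 0                # n(s): number of visits of this node
        self.n_sa = {}            # n(s, a): visits of each outgoing edge
        self.r_bar = {}           # r_bar(s, a): average reward of each edge

    def is_leaf(self):
        return len(self.children) == 0
```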
- Selection. The first phase starts from the root node and sequentially selects the next node to visit until a leaf node is encountered. Each selection is based on $a^* = \arg\max_{a} \left\{ \bar{r}(s,a) + c \sqrt{\frac{\ln n(s)}{1 + n(s,a)}} \right\}$, where $\bar{r}(s,a)$ is the average reward gathered over all tree-walks through edge $a$, $n(s,a)$ is the number of visits of edge $a$, $n(s)$ is the number of visits of node $s$, and $c$ is a constant determining the level of exploration (a Python sketch of the phases follows at the end of this subsection).
- Expansion. When a leaf node is encountered, one of its child nodes is appended, and the tree thus grows.
- Playout. After the expansion phase, a random playout is used to finish the remaining search. That is, each player moves randomly for the rest of the game until the termination node is reached, and the associated reward is computed.
- Backup. The node and edge statistics are updated in the last phase of a search iteration. First, the visit counts of all traversed nodes and edges are incremented by one. Second, the reward computed in the playout phase is back-propagated along the traversed path and is used to update the average reward, $\bar{r}(s,a) \leftarrow \bar{r}(s,a) + \frac{r - \bar{r}(s,a)}{n(s,a)}$.
After the search iterations are completed, the current player selects its actual move by sampling $a \sim \pi(a \mid s) \propto n(s,a)^{1/\tau}$, where $\tau$ is a constant tunable hyperparameter that controls the level of exploration. A large $\tau$ encourages random move selection, and a small $\tau$ prefers the move with the highest number of visits. After that, the other player takes its turn to continue the game.
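The sketch below, building on the `Node` class above, illustrates the selection rule, the backup update, and the temperature-based final move. The `legal_actions()` helper is an assumption (it would enumerate the select/skip moves available at a node), and the sign flipping of rewards between the two players is omitted for brevity.

```python
import math
import random

def select_action(node, c):
    """Selection: the UCB rule from the Selection phase above."""
    def ucb(a):
        n_sa = node.n_sa.get(a, 0)
        # log(n + 1) guards the n(s) = 0 case of an unvisited node
        return node.r_bar.get(a, 0.0) + c * math.sqrt(
            math.log(node.n + 1) / (1 + n_sa))
    return max(node.legal_actions(), key=ucb)

def backup(path, reward):
    """Backup: bump visit counts and update average rewards on the path."""
    for node, a in path:
        node.n += 1
        node.n_sa[a] = node.n_sa.get(a, 0) + 1
        r_bar = node.r_bar.get(a, 0.0)
        node.r_bar[a] = r_bar + (reward - r_bar) / node.n_sa[a]

def choose_move(node, tau):
    """Final move: sample an action proportionally to n(s, a)^(1/tau);
    a small tau concentrates probability on the most-visited move."""
    actions = list(node.n_sa)
    weights = [node.n_sa[a] ** (1.0 / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]
```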