Robust Data Sampling in Machine Learning: A Game-Theoretic Framework for Training and Validation Data Selection

By inergency On Mar 5, 2024

In this section, we will define the problem of selecting data for training ML models. Then, we will formulate this problem as a two-player game, which is followed by the introduction of the states, actions, rules, scores, and rewards of the game.

3.1. Problem Statement

Suppose f is a machine learning model mapping features to labels, $f : X \to y$ . Function f represents the general form of machine learning models like neural networks. We consider a 1-dimensional label y for simplicity in this paper, and note that the results and conclusion persist if a multi-dimensional label is considered.

Given a model f and a dataset $D = {D^{(i)}}_{i = 1}^{N} = {_{(X^{(i)}, y^{(i)})}^{}} i = 1 N$ , the whole process of machine learning is divided into three stages.

Sampling: The dataset $D$ is divided into training, validation, and test sets, i.e., $D = D_{t r a i n} \cup D_{v a l} \cup D_{t e s t}$ . The training set $D_{t r a i n}$ is used to update the model parameter $θ$ by implementing back-propagation algorithms. The validation set $D_{v a l}$ is used to monitor the model’s out-of-sample performance during training, and the test set $D_{t e s t}$ is for the final evaluation of model f after training. The sizes of the training, validation, and test datasets are $N_{t r a i n}$ , $N_{v a l}$ , and $N_{t e s t}$ , respectively.
Training: The training set $D_{t r a i n}$ is used to update the model parameter $θ$ , and the validation set $D_{v a l}$ is used to avoid overfitting.
Evaluation: The trained model f is evaluated using the test set $D_{t e s t}$ .

The goal of this paper is to propose a new sampling method that makes the trained model more robust and generalizes better on unseen data. To compare the effectiveness of different sampling methods, the test set

D_{t e s t}

is pre-selected by random sampling. That is, the sampling methods differ in their ways of splitting the training and validation sets, and they use the same test set for evaluation. Following this, the problem of training an ML model can be formulated as follows:

$\begin{matrix} \min \sum_{i}^{N_{t e s t}} | y_{t e s t}^{(i)} & - f (X_{t e s t}^{(i)} | θ^{*}) |^{2} / N_{t e s t} \\ s . t . D_{t r a i n}, D_{v a l} & = Sampling (D \ D_{t e s t}) \\ θ^{*} & = Training (f, D_{t r a i n}, D_{v a l}, θ^{0}), \end{matrix}$

(1)

where $θ^{0}$ and $θ^{*}$ are the initial and optimal parameters, respectively. This problem can also be viewed as a bi-level optimization problem. In the high level, optimal sampling is found to split the training and validation sets, which are then used to find the optimal model parameter in the low level. In this paper, we focus on optimizing the sampling process, and use the existing training method to update the parameter.

3.2. Two-Player Game

We formulate the sampling process as a two-player game, which is illustrated in Figure 1. A trainer aims to select the training data that minimizes the test error, while the validator selects the validation data to maximize the test error. The purpose of the validator is to provide adversarial examples during the training process so as to enhance the model generalizability. The game information, including each player’s move and the game result, is stored to update the strategy of each player. After several rounds of games and the convergence of the player’s strategy, the equilibrium is achieved so that the optimal data set components and sizes (i.e.,

N_{t r a i n}

and

N_{v a l}

) are determined.

The game starts with empty training and validation sets, and the trainer and validator take turns to select or skip the next data point (i.e., next feature-label pair) until all data points have been chosen. Below is an example game built on a small dataset with 10 data points in total. The validator moves first by selecting the first data points into the validation sets, and the trainer then decides to skip the first data. The validator moves on according to the historical moves of both the trainer and validator. After 10 turns, the validator selects the 1st, 6th, and 8th data points into the validation set $D_{v a l}$ , and the trainer selects the 2nd, 3rd, and 9th data points into the training set $D_{t r a i n}$ . The training and evaluation processes are followed to calculate the reward, which will be used to update the trainer and validator. In the remainder of this section, we will introduce the details of the two-player game, including its states, actions, rules and scores. The architecture of each game player, which contains reinforcement learning (RL) and Monte Carlo tree search (MCTS) components, will also be detailed.

3.2.1. Game State

The game is fully observed, i.e., both the training and validation player has access to the historical data selection. The current game state can be mathematically represented as a binary indicator vector $s = [t_{1}, \dots, t_{N}, v_{1}, \dots, v_{N}]$ to indicate whether the corresponding training and validation data points are selected, where N is the size of the dataset, $\sum_{i = 1}^{N} t_{i} = N_{t r a i n}$ and $\sum_{i = 1}^{N} v_{i} = N_{v a l}$ . For example, $t_{1} = 1$ means the 1st data point is selected into the training set, i.e., $D^{(1)} \in D_{t r a i n}$ . In the initial state, all elements of the game state are 0.

3.2.2. Game Action

In the example game, the actions of both players are the same: given the current state, each player chooses whether to select the next data point or not. We denote the action set as

A = {0, 1}

, where 0/1 indicates selecting/skipping the next data. Each player starts with the first data, which corresponds to the leftmost side of the dataset, and determines the choice of the next data by outputting the action

a \in A

. The player policy, i.e., the probability of each action set element to be selected, will be discussed later as shown in Equation (4).

The player will not have legal moves when all data has been traversed, and the game stops when both players don’t have any legal moves. The final state records our data selection.

3.2.3. Game Rule

The game is initialized with an empty game state. The game starts with the validator making the first move. Each player determines the action to select/skip the next data, starting from the leftmost data point (i.e., $D^{(1)}$ ). After making the selection, the current player is switched. The game terminates when both players have considered all the data, and the selected data is applied to train the model f, which will then be evaluated using the holdout test set.

3.2.4. Game Score

In the example game, the mean squared error (MSE) is used to evaluate the performance of the model, and thus can be considered as a score to evaluate the data selection. If the score is high, the validator wins the game, and vice versa. Then, a non-trivial question is how to determine the MSE threshold for claiming a winner. In contrast with traditional two-player games, like chess and go games, the trainer and validator play asymmetric roles in the game of sampling data. To address the matter, we first measure the average test MSE of a random sampler, which we call the base MSE. We claim the trainer (validator) as the winner if the test MSE is significantly lower (higher) than the base MSE. Suppose our two players play randomly for sufficient games; the test MSE of the trained model is then a random variable, denoted as $M = \sum_{i}^{N_{t e s t}} {| y_{t e s t}^{(i)} - f}^{(X_{t e s t}^{(i)} | θ^{*}) |} 2 / N_{t e s t}$ , where $θ^{*}$ is the optimal parameter. The test MSEs of different games are independently and identically distributed. The expectation of M, which is in fact the base MSE, should be a moderate value because both players lack intelligence, and thus there is no guarantee one could always outperform the other. To win the game, the test MSE should be significantly higher (for validator) or lower (for trainer) than the base MSE $E (M)$ . In practice, we use the mean of M to approximate $E (M)$ : $\sum M_{i} / n \approx E (M)$ , where n is the number of games played.

Another non-trivial question is that, in implementation the player may use as much data as possible, which is a conservative policy. However, the redundant data will incur computation costs and also may hide the essential one. To solve that, we penalize the amount of data each player chooses. We encourage both players to choose as little data as possible while trying to win the game. The benefit of this trick is to unveil the most optimal and efficient data points. Our score definitions are as follows:

$\{\begin{matrix} S_{t r a i n} = M S E + {α | | s | |}_{0} \\ S_{v a l i d} = M S E - {α | | s | |}_{0} \end{matrix}$

(2)

where s is the final game state, $α$ controls the penalty for the selected data size, and $| | \cdot {| |}_{0}$ is the $L_{0}$ norm that computes the number of non-zero elements. It is hard to find a unified form of the score for both the trainer and validator because they have opposite targets. Therefore, scores are calculated for both the trainer and validator separately, and the means of scores after sufficient games are ${\bar{S}}_{t r a i n} = \sum_{i = 1}^{n} S_{t r a i n}^{(i)} / n$ and ${\bar{S}}_{v a l i d} = \sum_{i = 1}^{n} S_{v a l i d}^{(i)} / n$ for the trainer and validator, respectively.

3.2.5. Game Reward

The game reward, usually ranging from −1 to 1, describes how favorable the game outcome is towards each player. As two players in our game play competitively, the rewards for the trainer and validator are opposite. We adopt a mapping function to convert the scores to rewards. Let $S_{t r a i n}^{n e w}$ and $S_{v a l i d}^{n e w}$ denote the score of a new game. If both the scores are bigger or lower than the average scores ${\bar{S}}_{t r a i n}$ and ${\bar{S}}_{v a l i d}$ by a tolerance, we may announce a winner, and reward for the winner is +1 and for the loser is −1. Otherwise, the game result is a tie. The reward is defined as follows:

If $S_{t r a i n}^{n e w} < {\bar{S}}_{t r a i n} - t o l$ and $S_{v a l i d}^{n e w} < {\bar{S}}_{v a l i d} - t o l$ , then $r e w a r d_{t r a i n} = 1$ and $r e w a r d_{v a l i d} = - 1$
If $S_{t r a i n}^{n e w} > {\bar{S}}_{t r a i n} + t o l$ and $S_{v a l i d}^{n e w} > {\bar{S}}_{v a l i d} + t o l$ , then $r e w a r d_{t r a i n} = - 1$ and $r e w a r d_{v a l i d} = 1$
Otherwise, a piece-wise linear function is adopted to map the score to the reward, which is shown in Figure 2. We randomly sample the training and validation data, then train and evaluate a CF model (the CF model will be introduced in Section 4), and Figure 2 shows the distribution of the test MSE (left y-axis). The blue line shows the piece-wise mapping function of the reward (right axis), and the validator’s score-to-reward mapping is the opposite of the trainer.

The range of $t o l$ is the same as the average scores ${\bar{S}}_{t r a i n}$ and ${\bar{S}}_{v a l i d}$ . Otherwise, if $t o l$ is too large, most games will end up with draws and return zero rewards; if $t o l$ is too small, a win with a narrow margin will receive a high reward value, imposing fluctuation on the game results. In experiments, $t o l$ is a hyperparameter to tune.

3.3. Monte Carlo Tree Search (MCTS)

In avoidance of myopic and greedy moves, a player always chooses its next action considering its opponent’s future moves. In other words, each player solves a search problem, where the size of the search tree grows exponentially if multiple moves ahead are taken into consideration. To tackle this issue, MCTS iteratively explores the searching space, and gradually adjusts its exploration towards the most promising region. Below, we briefly introduce the tree components before moving on to the tree search strategy.

A node s represents the state of the game at a certain stage. For example, a root node represents the initial game state with empty training and validation sets, and a termination node represents the final game state where the training and validation sets are sampled. All nodes, except for the termination node, connect to their children nodes through edges, which represent actions of the current player. A tree starts with the root node as its only component and grows by appending one child node. The newly added child node is called a leaf node, and it turns into a non-leaf node if one of its children nodes is also appended to the tree. Note that either the leaf or non-leaf attribute is specifically for the existing nodes of the tree. The termination node is not a leaf node until it is visited and appended to the tree.

The search iteration (Figure 3) involves traversing the existing tree from the root node to one leaf node, together with extending the tree by appending a new leaf node. In Figure 4, the red circle means the current player is the validator, and the blue square means the current player is the trainer. Specifically, each iteration includes four phases:

Selection. The first phase starts with the root node and sequentially selects the next node to visit until a leaf node is encountered. Each selection is based on:

$SELECT \underset{a}{argmax} {\bar{r} (s, a) + c_{puct} \frac{\sqrt{\sum_{b} n (s, b)}}{1 + n (s, a)}},$

(3)

where $\bar{r} (s, a)$ is the average reward gathered over all tree-walks, $n (s, a)$ is the number of visit of edge a, and $\sum_{b} n (s, b)$ is the number of visit of node s. $c_{p u c t}$ is a constant determining the level of exploration.
Expansion. When a leaf node is encountered, one of its children nodes is appended and the tree thus grows.
Playout. After the expansion phase, a random playout is used to finish the remaining search. That is, each player will randomly move in the rest of the game the termination node is reached and computing the associated reward.
Backup. The node and edge statistics are updated in the last phase of a searching iteration. First, the number of the visit of all traversed nodes and edges are incremented by one. Second, the current reward computed in the playout phase is back-propagated along the traversed path, and is used to update the average reward $\bar{r} (s, a)$ .

After the statistics of the nodes and edges are finished being backed up, MCTS starts the next iteration from the root node. The searching process terminated when the number of iterations reaches a preset value

n u m I t e r s

, which is tuned to allow the game to reach equilibrium while avoiding excessive iterations. Then, the current player makes its next move based on the following policy

$π (a | s) = \frac{n {(s, a)}^{1 / τ}}{\sum_{b} n {(s, b)}^{1 / τ}},$

(4)

where $τ$ is a constant tunable hyperparameter that controls the level of exploration. A large $τ$ encourages random move selection, and a small $τ$ prefers the move with the highest number of visits. After that, the other player takes turns to continue the game.