Vertical XGBoost Distributed

Introduction

The vertical XGBoost distributed model uses the Ray cluster for distributed computing, making it suitable for training models with large amounts of data.

Parameters List

Label_trainer

identity: "label_trainer"

model_info:

name: str Model name, should be vertical_xgboost_distributed

input:

trainset:
- type: str Training set type, supports csv.
- path: str When type is csv, it means the path to the folder where the training set is located.
- name: str or list Support single csv file or list of csv files when type is csv
- has_id: bool When type is csv, it indicates whether there is an id column
- has_label: bool When type is csv, it indicates whether there is a label column.
- missing_values: list Indicates missing values.
- is_centralized: bool Indicates whether the file is read uniformly from the ray_head node, currently only true is supported.
valset:
- type: str Validation set type, supports csv
- path: str When type is csv, it means the path to the folder where the validation set is located.
- name: str or list Support single csv file or list of csv files when type is csv.
- has_id: bool When type is csv, it indicates whether there is an id column.
- has_label: bool When type is csv, it indicates whether there is a label column.
- missing_values: list Indicates missing values.
- is_centralized: bool Indicates whether the file is read uniformly from the rayhead node. Currently only true is supported.

output:

path: str Output directory path.
model:
- name: str Model file name.
metric_train:
- name: str Training set metric file name.
metric_val:
- name: str Validation set metric file name
prediction_train:
- name: str Training set prediction result file name
prediction_val:
- name: str Validation set prediction result file name
ks_plot_train:
- name: str Training set ks table file name
ks_plot_val:
- name: str Validation set ks table file name
decision_table_train:
- name: str Training set decision table file name
decision_table_val:
- name: str Validation set decision table file name
feature_importance:
- name: str Feature importance table file name

train_info:

interaction_params:
- save_frequency: int Frequency of model saving in terms of trees, -1 means do not save intermediate models.
- echo_training_metrics: bool Whether to save training set metrics.
- write_training_prediction: bool Whether to save training set prediction result.
- write_validation_prediction: bool Whether to save validation set prediction result.
train_params:
- lossfunc: map Loss function configuration. The format is: {loss function name: {specific configuration}}. For example: “BCEWithLogitsLoss”: {}.
- num_trees: int Number of trees.
- learning_rate: float Learning rate.
- gamma: float L1 regularization term for the number of leaf nodes.
- lambda: float L2 regularization term for weights.
- max_depth: int Maximum depth of the tree.
- num_bins: int Number of bins.
- min_split_gain: float Minimum split gain, positive.
- min_sample_split: int Minimum number of samples in a tree node.
- feature_importance_type: str Type of feature importance, supports gain and split.
- downsampling: map
  
  column: map
  
  rate: float Feature dimension sampling rate.
  
  row: map
  
  run_goss: bool Whether to use goss sampling.
  
  top_rate: float The retain ratio of large gradient data in goss.
  
  other_rate: float The retain ratio of small gradient data in goss, 0 < top_rate + other_rate <= 1
- category: map
  
  cat_smooth: float Parameter used to reduce the effect of noise on categorical features. The default value is 0.
  
  cat_feature: map Configures the categorical features. The formula is: features that column indexes are in col_index if col_index_type is ‘inclusive’ or not in col_index if col_index_type is ‘exclusive’. union` featuresthat column names are in col_names if col_names_type is ‘inclusive’ or not in col_names if col_names_type is ‘exclusive’. union if max_num_value_type is ‘union’ or intersect if max_num_value_type is ‘intersection’ features that number of different values is less equal than max_num_value.
  
  col_index str: Index of the feature column that is (or is not) a categorical feature. Accepts slices or numbers, such as: “1, 4:5”. The default value is “”.
  
  col_names list<str>: Name of the feature column that is (or is not) a categorical feature. The default value is [].
  
  max_num_value int: If the number of unique values in a feature column is greater than or equal to this value, the feature column is a categorical feature. The default value is 0.
  
  col_index_type str: Supports ‘inclusive’ and ‘exclusive’. The default value is ‘inclusive’.
  
  col_names_type str: Supports ‘inclusive’ and ‘exclusive’. The default value is ‘inclusive’.
  
  max_num_value_type str: Supports ‘intersection’ and ‘union’. The default value is ‘union’.
- metric: map Performance evaluation indicators. All the following key values are optional.
  
  decision_table: map
  
  method: str Supports “equalfrequency” and “equalwith”.
  
  bins: int The number of divisions in the decision table.
  
  acc: {}
  
  precision: {}
  
  recall: {}
  
  f1_score: {}
  
  auc: {}
  
  ks: {}
- early_stopping:
  
  key: str The name of the metric used to judge whether to stop training. The metic name should have already been filled in the “metric” variable.
  
  patience: int Number of steps with no improvement after which training will be stopped.
  
  delta: float Minimum change in the value of metric to qualify as an improvement.
- encryption:
  
  paillier:
  
  key_bit_size: int The bit length of Paillier key, which should be at least 2048 or more.
  
  precision: int A precision-related parameter that can be null or a positive integer, such as 7.
  
  djn_on: bool Whether to use the DJN method to generate key pair.
  
  parallelize_on: bool Whether to use multi-core parallel computing.
  
  plain: map No encryption. Select either “plain” or “paillier”.
- batch_size_val: int The batch size for prediction on the validation set.
- atomic_row_size_per_cpu_core: int The maximum number of rows per segment after the data is partitioned.
- pack_grad_hess: bool Whether to pack the gradient and hessian into plaintext during encryption.

Trainer

identity: "trainer"

model_info:

name: str Model name, should be vertical_xgboost_distributed

input:

trainset:
- type: str Training set type, supports csv.
- path: str When type is csv, it means the path to the folder where the training set is located.
- name: str or list Support single csv file or list of csv files when type is csv
- has_id: bool When type is csv, it indicates whether there is an id column
- has_label: bool When type is csv, it indicates whether there is a label column.
- missing_values: list Indicates missing values.
- is_centralized: bool Indicates whether the file is read uniformly from the ray_head node, currently only true is supported.
valset:
- type: str Validation set type, supports csv
- path: str When type is csv, it means the path to the folder where the validation set is located.
- name: str or list Support single csv file or list of csv files when type is csv.
- has_id: bool When type is csv, it indicates whether there is an id column.
- has_label: bool When type is csv, it indicates whether there is a label column.
- missing_values: list Indicates missing values.
- is_centralized: bool Indicates whether the file is read uniformly from the rayhead node. Currently only true is supported.

output:

path: str Output directory path.
model:
- name: str Model file name.

train_info:

train_params:
- downsampling: map
  
  column: map
  
  rate: float Feature dimension sampling rate.
- category: map
  
  cat_feature: map Configures the categorical features. The formula is: features that column indexes are in col_index if col_index_type is ‘inclusive’ or not in col_index if col_index_type is ‘exclusive’. union` featuresthat column names are in col_names if col_names_type is ‘inclusive’ or not in col_names if col_names_type is ‘exclusive’. union if max_num_value_type is ‘union’ or intersect if max_num_value_type is ‘intersection’ features that number of different values is less equal than max_num_value.
  
  col_index str: Index of the feature column that is (or is not) a categorical feature. Accepts slices or numbers, such as: “1, 4:5”. The default value is “”.
  
  col_names list<str>: Name of the feature column that is (or is not) a categorical feature. The default value is [].
  
  max_num_value int: If the number of unique values in a feature column is greater than or equal to this value, the feature column is a categorical feature. The default value is 0.
  
  col_index_type str: Supports ‘inclusive’ and ‘exclusive’. The default value is ‘inclusive’.
  
  col_names_type str: Supports ‘inclusive’ and ‘exclusive’. The default value is ‘inclusive’.
  
  max_num_value_type str: Supports ‘intersection’ and ‘union’. The default value is ‘union’.
- batch_blocks_on_recv: int Number of data segments processed per batch on receive.
- ray_col_step: int Number of data columns processed simultaneously in a ray computing node. Automatically set by the algorithm if null.