Vertical XGBoost Distributed
Introduction
The vertical XGBoost distributed model uses a Ray cluster for distributed computation, which makes it suitable for training on large datasets.
Parameters List
Label_trainer
identity: "label_trainer"
- model_info:
  - name: str. Model name; should be vertical_xgboost_distributed.
- input:
  - trainset:
    - type: str. Training set type; supports csv.
    - path: str. When type is csv, the path to the folder where the training set is located.
    - name: str or list. When type is csv, a single csv file or a list of csv files.
    - has_id: bool. When type is csv, whether there is an id column.
    - has_label: bool. When type is csv, whether there is a label column.
    - missing_values: list. Values that indicate missing data.
    - is_centralized: bool. Whether the file is read centrally from the ray_head node. Currently only true is supported.
  - valset:
    - type: str. Validation set type; supports csv.
    - path: str. When type is csv, the path to the folder where the validation set is located.
    - name: str or list. When type is csv, a single csv file or a list of csv files.
    - has_id: bool. When type is csv, whether there is an id column.
    - has_label: bool. When type is csv, whether there is a label column.
    - missing_values: list. Values that indicate missing data.
    - is_centralized: bool. Whether the file is read centrally from the ray_head node. Currently only true is supported.
- output:
  - path: str. Output directory path.
  - model:
    - name: str. Model file name.
  - metric_train:
    - name: str. Training set metric file name.
  - metric_val:
    - name: str. Validation set metric file name.
  - prediction_train:
    - name: str. Training set prediction result file name.
  - prediction_val:
    - name: str. Validation set prediction result file name.
  - ks_plot_train:
    - name: str. Training set KS table file name.
  - ks_plot_val:
    - name: str. Validation set KS table file name.
  - decision_table_train:
    - name: str. Training set decision table file name.
  - decision_table_val:
    - name: str. Validation set decision table file name.
  - feature_importance:
    - name: str. Feature importance table file name.
- train_info:
  - interaction_params:
    - save_frequency: int. How often the model is saved, in number of trees; -1 means intermediate models are not saved.
    - echo_training_metrics: bool. Whether to save training set metrics.
    - write_training_prediction: bool. Whether to save the training set prediction result.
    - write_validation_prediction: bool. Whether to save the validation set prediction result.
  - train_params:
    - lossfunc: map. Loss function configuration, in the format {loss function name: {specific configuration}}. For example: {"BCEWithLogitsLoss": {}}.
    - num_trees: int. Number of trees.
    - learning_rate: float. Learning rate.
    - gamma: float. L1 regularization term on the number of leaf nodes.
    - lambda: float. L2 regularization term on the weights.
    - max_depth: int. Maximum depth of a tree.
    - num_bins: int. Number of bins.
    - min_split_gain: float. Minimum split gain; must be positive.
    - min_sample_split: int. Minimum number of samples in a tree node.
    - feature_importance_type: str. Type of feature importance; supports gain and split.
    - downsampling: map.
      - column: map.
        - rate: float. Feature dimension sampling rate.
      - row: map.
        - run_goss: bool. Whether to use GOSS sampling.
        - top_rate: float. Fraction of large-gradient data retained in GOSS.
        - other_rate: float. Fraction of small-gradient data retained in GOSS; 0 < top_rate + other_rate <= 1.
    - category: map.
      - cat_smooth: float. Parameter used to reduce the effect of noise on categorical features. The default value is 0.
      - cat_feature: map. Configures the categorical features. The selected set is: (features whose column index is in col_index if col_index_type is 'inclusive', or not in col_index if col_index_type is 'exclusive') union (features whose column name is in col_names if col_names_type is 'inclusive', or not in col_names if col_names_type is 'exclusive'), then combined by union if max_num_value_type is 'union', or by intersection if max_num_value_type is 'intersection', with (features whose number of distinct values is less than or equal to max_num_value).
        - col_index: str. Indexes of the feature columns that are (or are not) categorical features. Accepts slices or numbers, such as "1, 4:5". The default value is "".
        - col_names: list<str>. Names of the feature columns that are (or are not) categorical features. The default value is [].
        - max_num_value: int. If the number of unique values in a feature column is less than or equal to this value, the feature column is a categorical feature. The default value is 0.
        - col_index_type: str. Supports 'inclusive' and 'exclusive'. The default value is 'inclusive'.
        - col_names_type: str. Supports 'inclusive' and 'exclusive'. The default value is 'inclusive'.
        - max_num_value_type: str. Supports 'intersection' and 'union'. The default value is 'union'.
    - metric: map. Performance evaluation metrics. All of the following keys are optional.
      - decision_table: map.
        - method: str. Supports "equal_frequency" and "equal_width".
        - bins: int. Number of bins in the decision table.
      - acc: {}
      - precision: {}
      - recall: {}
      - f1_score: {}
      - auc: {}
      - ks: {}
    - early_stopping:
      - key: str. Name of the metric used to decide whether to stop training. The metric name must already be listed under "metric".
      - patience: int. Number of steps with no improvement after which training is stopped.
      - delta: float. Minimum change in the metric value that counts as an improvement.
    - encryption:
      - paillier:
        - key_bit_size: int. Bit length of the Paillier key; should be at least 2048.
        - precision: int. A precision-related parameter; either null or a positive integer, such as 7.
        - djn_on: bool. Whether to use the DJN method to generate the key pair.
        - parallelize_on: bool. Whether to use multi-core parallel computing.
      - plain: map. No encryption. Select either "plain" or "paillier".
    - batch_size_val: int. Batch size for prediction on the validation set.
    - atomic_row_size_per_cpu_core: int. Maximum number of rows per segment after the data is partitioned.
    - pack_grad_hess: bool. Whether to pack the gradient and hessian into a single plaintext for encryption.
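
Several of the label_trainer parameters above carry explicit constraints: the GOSS ratios, the positive split gain, the early-stopping key, the choice between "plain" and "paillier", and the Paillier key size. The following is a minimal sketch of how those constraints could be checked before a run. The function name check_train_params, the example values, and the assumption that all of these blocks sit directly under train_params are illustrative and not part of the framework.

```python
from typing import Any, Dict, List


def check_train_params(train_params: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: return a list of problems; empty means all checks passed."""
    problems: List[str] = []

    # GOSS: the two retain ratios must satisfy 0 < top_rate + other_rate <= 1.
    row = train_params.get("downsampling", {}).get("row", {})
    if row.get("run_goss"):
        total = row.get("top_rate", 0.0) + row.get("other_rate", 0.0)
        if not 0 < total <= 1:
            problems.append("top_rate + other_rate must be in (0, 1]")

    # min_split_gain is documented as positive.
    gain = train_params.get("min_split_gain")
    if gain is not None and gain <= 0:
        problems.append("min_split_gain must be positive")

    # The early-stopping key must name a metric listed under "metric".
    metric = train_params.get("metric", {})
    key = train_params.get("early_stopping", {}).get("key")
    if key is not None and key not in metric:
        problems.append(f"early_stopping.key {key!r} is not listed in metric")

    # Encryption: select either "plain" or "paillier", not both and not neither.
    encryption = train_params.get("encryption", {})
    chosen = [m for m in ("plain", "paillier") if m in encryption]
    if len(chosen) != 1:
        problems.append("encryption must contain exactly one of plain/paillier")
    elif chosen[0] == "paillier":
        if encryption["paillier"].get("key_bit_size", 0) < 2048:
            problems.append("paillier.key_bit_size should be at least 2048")

    return problems


# Illustrative values only; they are assumptions, not recommended settings.
example_params = {
    "lossfunc": {"BCEWithLogitsLoss": {}},
    "num_trees": 10,
    "learning_rate": 0.3,
    "min_split_gain": 1e-6,
    "downsampling": {
        "column": {"rate": 1.0},
        "row": {"run_goss": True, "top_rate": 0.4, "other_rate": 0.4},
    },
    "metric": {"acc": {}, "auc": {}, "ks": {}},
    "early_stopping": {"key": "ks", "patience": 10, "delta": 0.001},
    "encryption": {
        "paillier": {
            "key_bit_size": 2048,
            "precision": 7,
            "djn_on": True,
            "parallelize_on": True,
        }
    },
}

print(check_train_params(example_params))  # prints []
```

Returning a list of messages rather than raising on the first failure lets a caller report every misconfigured field in one pass.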
Trainer
identity: "trainer"
- model_info:
  - name: str. Model name; should be vertical_xgboost_distributed.
- input:
  - trainset:
    - type: str. Training set type; supports csv.
    - path: str. When type is csv, the path to the folder where the training set is located.
    - name: str or list. When type is csv, a single csv file or a list of csv files.
    - has_id: bool. When type is csv, whether there is an id column.
    - has_label: bool. When type is csv, whether there is a label column.
    - missing_values: list. Values that indicate missing data.
    - is_centralized: bool. Whether the file is read centrally from the ray_head node. Currently only true is supported.
  - valset:
    - type: str. Validation set type; supports csv.
    - path: str. When type is csv, the path to the folder where the validation set is located.
    - name: str or list. When type is csv, a single csv file or a list of csv files.
    - has_id: bool. When type is csv, whether there is an id column.
    - has_label: bool. When type is csv, whether there is a label column.
    - missing_values: list. Values that indicate missing data.
    - is_centralized: bool. Whether the file is read centrally from the ray_head node. Currently only true is supported.
- output:
  - path: str. Output directory path.
  - model:
    - name: str. Model file name.
- train_info:
  - train_params:
    - downsampling: map.
      - column: map.
        - rate: float. Feature dimension sampling rate.
    - category: map.
      - cat_feature: map. Configures the categorical features. The selected set is: (features whose column index is in col_index if col_index_type is 'inclusive', or not in col_index if col_index_type is 'exclusive') union (features whose column name is in col_names if col_names_type is 'inclusive', or not in col_names if col_names_type is 'exclusive'), then combined by union if max_num_value_type is 'union', or by intersection if max_num_value_type is 'intersection', with (features whose number of distinct values is less than or equal to max_num_value).
        - col_index: str. Indexes of the feature columns that are (or are not) categorical features. Accepts slices or numbers, such as "1, 4:5". The default value is "".
        - col_names: list<str>. Names of the feature columns that are (or are not) categorical features. The default value is [].
        - max_num_value: int. If the number of unique values in a feature column is less than or equal to this value, the feature column is a categorical feature. The default value is 0.
        - col_index_type: str. Supports 'inclusive' and 'exclusive'. The default value is 'inclusive'.
        - col_names_type: str. Supports 'inclusive' and 'exclusive'. The default value is 'inclusive'.
        - max_num_value_type: str. Supports 'intersection' and 'union'. The default value is 'union'.
    - batch_blocks_on_recv: int. Number of data segments processed per batch on receive.
    - ray_col_step: int. Number of data columns processed simultaneously on a Ray computing node. Set automatically by the algorithm if null.
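
The cat_feature selection rule is easier to follow as set operations. The sketch below is one reading of that rule, not the framework's implementation. The helper names parse_col_index and resolve_cat_features are hypothetical. It assumes that a slice such as "4:5" follows Python semantics (end excluded) and that indexes count feature columns only, starting at 0.

```python
from typing import Dict, List, Sequence, Set


def parse_col_index(spec: str, num_cols: int) -> Set[int]:
    """Parse a spec such as "1, 4:5" into a set of column indexes.

    Assumption: "a:b" follows Python slice semantics, so b is excluded.
    """
    selected: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if ":" in part:
            start, stop = part.split(":")
            lo = int(start) if start.strip() else 0
            hi = int(stop) if stop.strip() else num_cols
            selected.update(range(lo, min(hi, num_cols)))
        else:
            selected.add(int(part))
    return selected


def resolve_cat_features(
    columns: Dict[str, Sequence],
    col_index: str = "",
    col_names: Sequence[str] = (),
    max_num_value: int = 0,
    col_index_type: str = "inclusive",
    col_names_type: str = "inclusive",
    max_num_value_type: str = "union",
) -> List[str]:
    """Return the names of the columns treated as categorical, in column order."""
    names = list(columns)
    everything = set(names)

    # Features picked (or excluded) by position.
    indexes = parse_col_index(col_index, len(names))
    by_index = {names[i] for i in indexes if 0 <= i < len(names)}
    if col_index_type == "exclusive":
        by_index = everything - by_index

    # Features picked (or excluded) by name.
    by_name = set(col_names) & everything
    if col_names_type == "exclusive":
        by_name = everything - by_name

    # Features with at most max_num_value distinct values.
    by_cardinality = {n for n in names if len(set(columns[n])) <= max_num_value}

    explicit = by_index | by_name
    if max_num_value_type == "union":
        chosen = explicit | by_cardinality
    else:  # "intersection"
        chosen = explicit & by_cardinality
    return [n for n in names if n in chosen]


# Toy data, made up for illustration.
toy = {
    "age": [23, 35, 35, 41],
    "city": ["a", "b", "a", "c"],
    "gender": [0, 1, 1, 0],
    "income": [1.5, 2.0, 3.5, 4.0],
}
print(resolve_cat_features(toy, col_index="1", max_num_value=2))
# prints ['city', 'gender']
```

With the documented defaults (empty col_index and col_names, max_num_value of 0, all types 'inclusive' and 'union'), this reading selects no categorical features.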