- Basic training configuration
Training: 2021-12-14 14:49:44,690-rank_id: 0
Training: 2021-12-14 14:49:44,720-: loss cosface
Training: 2021-12-14 14:49:44,720-: network r50
Training: 2021-12-14 14:49:44,720-: resume False
Training: 2021-12-14 14:49:44,720-: output model
Training: 2021-12-14 14:49:44,720-: dataset ms1m-retinaface-t1
Training: 2021-12-14 14:49:44,720-: embedding_size 512
Training: 2021-12-14 14:49:44,721-: fp16 True
Training: 2021-12-14 14:49:44,721-: model_parallel True
Training: 2021-12-14 14:49:44,721-: sample_rate 0.1
Training: 2021-12-14 14:49:44,721-: partial_fc 0
Training: 2021-12-14 14:49:44,721-: graph True
Training: 2021-12-14 14:49:44,721-: synthetic False
Training: 2021-12-14 14:49:44,721-: scale_grad False
Training: 2021-12-14 14:49:44,721-: momentum 0.9
Training: 2021-12-14 14:49:44,721-: weight_decay 0.0005
Training: 2021-12-14 14:49:44,722-: batch_size 128
Training: 2021-12-14 14:49:44,722-: lr 0.1
Training: 2021-12-14 14:49:44,722-: val_image_num {'lfw': 12000, 'cfp_fp': 14000, 'agedb_30': 12000}
Training: 2021-12-14 14:49:44,722-: ofrecord_path /dataset/18fad635/v1/ofrecord
Training: 2021-12-14 14:49:44,722-: num_classes 93432
Training: 2021-12-14 14:49:44,722-: num_image 5179510
Training: 2021-12-14 14:49:44,722-: num_epoch 25
Training: 2021-12-14 14:49:44,722-: warmup_epoch -1
Training: 2021-12-14 14:49:44,722-: decay_epoch [10, 16, 22]
Training: 2021-12-14 14:49:44,723-: val_targets ['lfw', 'cfp_fp', 'agedb_30']
Training: 2021-12-14 14:49:44,723-: ofrecord_part_num 32
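
To make the numbers in this configuration easier to read, here is a rough sketch that derives the step counts implied by num_image, batch_size and num_epoch, and shows what a step-decay schedule over decay_epoch could look like. It is not the training script's own code. Two things are assumptions: that training runs on a single device (so the global batch equals batch_size) with the last partial batch dropped, and that the learning rate is multiplied by 0.1 at each milestone. The helper `lr_at_epoch` and its `gamma` parameter are hypothetical names.

```python
# Values copied from the configuration log.
num_image = 5179510
batch_size = 128
num_epoch = 25
base_lr = 0.1
decay_epoch = [10, 16, 22]

# Assumption: one device, so the global batch size equals batch_size,
# and the final partial batch of each epoch is dropped.
steps_per_epoch = num_image // batch_size
total_steps = steps_per_epoch * num_epoch


def lr_at_epoch(epoch, gamma=0.1):
    """Hypothetical step schedule: scale the lr by gamma at each milestone.

    The decay factor is an assumption; the log only lists the milestones.
    """
    milestones_passed = sum(1 for m in decay_epoch if epoch >= m)
    return base_lr * gamma ** milestones_passed


if __name__ == "__main__":
    print("steps per epoch:", steps_per_epoch)
    print("total steps:", total_steps)
    for epoch in (0, 10, 16, 22):
        print("epoch", epoch, "lr", lr_at_epoch(epoch))
```

If the run uses several devices, steps_per_epoch would shrink accordingly, so treat these values as an upper bound for the single-device case.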
- Log output while loading the validation sets
Training: 2021-12-14 14:49:50,124-loading bin:0
Training: 2021-12-14 14:49:51,372-loading bin:1000
Training: 2021-12-14 14:49:52,649-loading bin:2000
Training: 2021-12-14 14:50:17,039-loading bin:9000
Training: 2021-12-14 14:50:18,300-loading bin:10000
Training: 2021-12-14 14:50:19,576-loading bin:11000
Training: 2021-12-14 14:50:20,839-loading bin:12000
Training: 2021-12-14 14:50:22,099-loading bin:13000
Training: 2021-12-14 14:50:23,353-oneflow.Size([14000, 3, 112, 112])
Training: 2021-12-14 14:50:23,709-loading bin:0
Training: 2021-12-14 14:50:24,991-loading bin:1000
Training: 2021-12-14 14:50:26,292-loading bin:2000
Training: 2021-12-14 14:50:27,590-loading bin:3000
Training: 2021-12-14 14:50:28,886-loading bin:4000
Training: 2021-12-14 14:50:30,174-loading bin:5000
Training: 2021-12-14 14:50:31,463-loading bin:6000
Training: 2021-12-14 14:50:32,744-loading bin:7000
Training: 2021-12-14 14:50:34,029-loading bin:8000
Training: 2021-12-14 14:50:35,315-loading bin:9000
Training: 2021-12-14 14:50:36,593-loading bin:10000
Training: 2021-12-14 14:50:37,867-loading bin:11000
Training: 2021-12-14 14:50:39,144-oneflow.Size([12000, 3, 112, 112])
- Basic information during training (speed, loss trend, estimated time remaining, etc.)
Training: 2021-12-14 14:51:02,452-Speed 883.82 samples/sec Loss 52.6974 LearningRate 0.1000 Epoch: 0 Global Step: 100 Required: 202 hours
Training: 2021-12-14 14:51:09,722-Speed 880.33 samples/sec Loss 53.4146 LearningRate 0.1000 Epoch: 0 Global Step: 150 Required: 149 hours
Training: 2021-12-14 14:51:16,968-Speed 883.24 samples/sec Loss 51.8446 LearningRate 0.1000 Epoch: 0 Global Step: 200 Required: 122 hours
Training: 2021-12-14 14:51:24,237-Speed 880.57 samples/sec Loss 50.9537 LearningRate 0.1000 Epoch: 0 Global Step: 250 Required: 106 hours
Training: 2021-12-14 14:51:31,526-Speed 877.99 samples/sec Loss 50.5335 LearningRate 0.1000 Epoch: 0 Global Step: 300 Required: 95 hours
Training: 2021-12-14 14:51:38,831-Speed 876.17 samples/sec Loss 49.6624 LearningRate 0.1000 Epoch: 0 Global Step: 350 Required: 87 hours
Training: 2021-12-14 14:51:46,151-Speed 874.42 samples/sec Loss 48.9462 LearningRate 0.1000 Epoch: 0 Global Step: 400 Required: 82 hours
Training: 2021-12-14 14:51:53,476-Speed 873.76 samples/sec Loss 48.3082 LearningRate 0.1000 Epoch: 0 Global Step: 450 Required: 77 hours
Training: 2021-12-14 14:52:00,810-Speed 872.72 samples/sec Loss 48.0000 LearningRate 0.1000 Epoch: 0 Glo
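
Progress lines in this format are easy to turn into structured records, for example to plot the loss or to sanity-check the reported speed. The following is a minimal sketch, not part of the training code: the regular expression is inferred only from the lines shown here, and `parse_line` and `throughput` are hypothetical helper names. The throughput check assumes a single device with batch_size 128, as listed in the configuration log.

```python
import re
from datetime import datetime

# Pattern inferred from the progress lines in this log (an assumption,
# not a format specification from the training code).
LINE_RE = re.compile(
    r"Training: (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})-"
    r"Speed (?P<speed>[\d.]+) samples/sec\s+"
    r"Loss (?P<loss>[\d.]+)\s+"
    r"LearningRate (?P<lr>[\d.]+)\s+"
    r"Epoch: (?P<epoch>\d+)\s+"
    r"Global Step: (?P<step>\d+)\s+"
    r"Required: (?P<hours>\d+) hours"
)


def parse_line(line):
    """Return a dict for one progress line, or None if it does not match
    (for example, a truncated line)."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "time": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f"),
        "speed": float(m["speed"]),
        "loss": float(m["loss"]),
        "lr": float(m["lr"]),
        "epoch": int(m["epoch"]),
        "step": int(m["step"]),
        "hours_left": int(m["hours"]),
    }


def throughput(prev, curr, batch_size=128):
    """Samples per second between two parsed records.

    Assumes one device, so each step consumes batch_size samples.
    """
    elapsed = (curr["time"] - prev["time"]).total_seconds()
    return (curr["step"] - prev["step"]) * batch_size / elapsed


# Two lines copied from the log.
sample = [
    "Training: 2021-12-14 14:51:02,452-Speed 883.82 samples/sec Loss 52.6974 "
    "LearningRate 0.1000 Epoch: 0 Global Step: 100 Required: 202 hours",
    "Training: 2021-12-14 14:51:09,722-Speed 880.33 samples/sec Loss 53.4146 "
    "LearningRate 0.1000 Epoch: 0 Global Step: 150 Required: 149 hours",
]

if __name__ == "__main__":
    first, second = (parse_line(s) for s in sample)
    print(second["step"], second["loss"], second["hours_left"])
    print(round(throughput(first, second), 2))
```

For these two lines the recomputed throughput agrees with the logged Speed value of the second line to within rounding, which suggests the speed is measured over each 50-step window. That reading comes from the numbers alone and is not confirmed by the source.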