Hi, fellow explorers of the AI field! In today's wave of artificial intelligence, we are often drawn to flashy models and algorithms. But once we get into the real practice of training AI models, we discover that the cornerstone of success is usually not the algorithm itself but the data. High-quality data labeling in particular is the "food supply" of an AI model: its quality directly determines both the ceiling of what the model can learn and whether the final application succeeds.

Have you run into this scenario: model performance is stubbornly mediocre, and no amount of hyperparameter tuning breaks through the plateau? Very likely the problem is in your training data. Low-quality, inconsistent, or inaccurate labels teach the model the wrong patterns and ultimately lead to disastrous results in production.
Problem code example: when data quality is poor, the model "acts dumb"
```python
# Pseudocode: a model trained on low-quality labeled data
import numpy as np

def train_model_with_bad_data(features, labels):
    print("Training a model on low-quality labeled data...")
    # Assume the labels here contain a large amount of errors and noise.
    # In reality this would instantiate and train a machine-learning model.
    class SomeSupervisedLearningModel:
        def fit(self, X, y):
            # Simulated training process
            print("The model is 'learning' wrong patterns from low-quality data.")
        def predict(self, x):
            # Simulated poor prediction result
            return "wrong prediction"  # in practice: low prediction accuracy
    model = SomeSupervisedLearningModel()
    model.fit(features, labels)
    print("Training finished, but the results are likely disappointing.")
    return model

# Example: create some dummy data
X_train = np.random.rand(100, 10)             # 100 samples, 10 features
y_train_noisy = np.random.randint(0, 2, 100)  # random labels (simulated noise -> low data quality)
X_test = np.random.rand(10, 10)
bad_model = train_model_with_bad_data(X_train, y_train_noisy)
prediction = bad_model.predict(X_test[0])
print(f"Prediction of the bad model: {prediction} (probably far from what we expect)")

# For contrast: with clean data the model would learn the correct patterns
# good_model = train_model_with_good_data(X_train_clean, y_train_clean)
# print(f"Prediction of the good model: {good_model.predict(X_test[0])} (close to the true value)")
```
To avoid the embarrassment above, we need a powerful and reliable Data Labeling System. It is not merely a tool but a comprehensive solution combining process, people, and technology, designed to produce high-quality training data for AI models efficiently and accurately. Today, let's dig into data labeling systems together, from principles to hands-on practice, and master the key to building the cornerstone of AI success!
1. What Is a Data Labeling System? Core Components and How It Works

A data labeling system, as the name suggests, is a software platform for classifying, recognizing, marking, or annotating raw data (images, video, text, audio, and so on). Its core goal: turn unstructured or semi-structured raw data into structured, high-value Ground Truth data that AI models can understand and learn from.

Its main components usually include:

- Data management module: handles upload, storage, partitioning, and version control of raw data.
- Annotation tool module: provides annotation UIs and features for various types (bounding boxes, point selection, text highlighting, semantic segmentation brushes, etc.).
- Task management module: handles creation, assignment, progress tracking, and deadline management of labeling tasks.
- Quality control module: reviews, corrects, and evaluates annotation quality, and computes inter-annotator agreement.
- User and permission management: manages roles such as labeler, reviewer, and project manager, along with their permissions.
- Data export module: exports finished annotations in specific formats (COCO, Pascal VOC, JSON, etc.) for model training.
Let's build the core skeleton of a data labeling system with a simple Python class and see how the pieces work together.
```python
# Core code: a simplified data labeling system framework (DataLabelingSystem)
import json
import datetime

class DataLabelingSystem:
    """
    A simplified data labeling system covering the core features:
    user management, data upload, task creation, data assignment,
    annotation submission, review, and data export.
    """
    def __init__(self, admin_user_id="admin_001", admin_user_name="System Admin"):
        self.raw_data_pool = []    # raw data items
        self.labeling_tasks = {}   # all task definitions
        self.labeled_data = []     # annotations that passed review
        self.users = {admin_user_id: {"name": admin_user_name, "role": "admin", "capacity": -1, "current_tasks": 0}}
        self.next_data_id = 0
        print(f"Data labeling system initialized; admin {admin_user_name} created.")

    def add_user(self, user_id: str, user_name: str, role: str = "labeler", capacity: int = 10):
        """
        Add a new user to the system.
        Args:
            user_id (str): unique user ID.
            user_name (str): user name.
            role (str): user role, one of "labeler", "reviewer", "admin".
            capacity (int): max concurrent tasks for a labeler; -1 means unlimited.
        """
        if user_id in self.users:
            print(f"User ID '{user_id}' already exists.")
            return False
        self.users[user_id] = {"name": user_name, "role": role, "capacity": capacity, "current_tasks": 0}
        print(f"User '{user_name}' ({user_id}) added as {role}.")
        return True

    def upload_raw_data(self, data_sources: list):
        """
        Upload raw data into the pool awaiting annotation.
        Args:
            data_sources (list): raw data sources; file paths, URLs, or raw content.
        """
        uploaded_count = 0
        for source in data_sources:
            data_item = {
                "id": self.next_data_id,
                "source": source,      # e.g. a file name, URL, or the data itself
                "status": "pending",   # pending, assigned, in_progress, submitted_for_review, labeled, rejected
                "assigned_to": None,
                "annotations": [],
                "history": []          # operation history
            }
            self.raw_data_pool.append(data_item)
            self.next_data_id += 1
            uploaded_count += 1
        print(f"Uploaded {uploaded_count} raw data items. Current pool size: {len(self.raw_data_pool)}")

    def create_labeling_task(self, task_name: str, data_ids: list, label_schema_json: str, project_manager_id: str = "admin_001"):
        """
        Create a labeling task and define its annotation schema.
        Args:
            task_name (str): task name.
            data_ids (list): IDs of the data items included in the task.
            label_schema_json (str): JSON string defining label classes, types, and rules.
            project_manager_id (str): project manager's ID.
        """
        if task_name in self.labeling_tasks:
            print(f"Task name '{task_name}' already exists.")
            return False
        try:
            schema = json.loads(label_schema_json)
            print(f"Annotation schema parsed: {schema}")
        except json.JSONDecodeError:
            print("Annotation schema is not valid JSON.")
            return False
        task = {
            "task_name": task_name,
            "data_ids": data_ids,
            "schema": schema,
            "status": "open",  # open, in_progress, completed
            "manager": project_manager_id,
            "created_at": datetime.datetime.now().isoformat()
        }
        self.labeling_tasks[task_name] = task
        print(f"Task '{task_name}' created with {len(data_ids)} data items.")
        return True

    def assign_data_to_labeler(self, task_name: str, data_id: int, labeler_id: str):
        """Assign a specific data item to a labeler."""
        if task_name not in self.labeling_tasks:
            print(f"Task '{task_name}' does not exist.")
            return False
        if labeler_id not in self.users or self.users[labeler_id]["role"] != "labeler":
            print(f"Labeler '{labeler_id}' does not exist or lacks permission.")
            return False
        if self.users[labeler_id]["capacity"] != -1 and self.users[labeler_id]["current_tasks"] >= self.users[labeler_id]["capacity"]:
            print(f"Labeler '{labeler_id}' has reached the task limit.")
            return False
        for item in self.raw_data_pool:
            if item["id"] == data_id and item["status"] in ["pending", "rejected"]:  # pending or rejected items can be (re)assigned
                item["status"] = "assigned"
                item["assigned_to"] = labeler_id
                item["history"].append(f"assigned to {labeler_id} at {datetime.datetime.now().isoformat()}")
                self.users[labeler_id]["current_tasks"] += 1
                print(f"Data {data_id} from task '{task_name}' assigned to labeler {self.users[labeler_id]['name']}.")
                return True
        print(f"Data {data_id} cannot be assigned; it may already be assigned or not exist.")
        return False

    def submit_annotations(self, data_id: int, annotations: list, labeler_id: str):
        """
        Labeler submits annotation results.
        Args:
            data_id (int): data item ID.
            annotations (list): annotation results; format depends on the task schema.
            labeler_id (str): ID of the submitting labeler.
        """
        for item in self.raw_data_pool:
            if item["id"] == data_id and item["assigned_to"] == labeler_id and item["status"] == "assigned":
                # Core step: validate against the task schema.
                # Simplification: take the first task that contains this data ID.
                task_schema = None
                for task_name, task_info in self.labeling_tasks.items():
                    if data_id in task_info["data_ids"]:
                        task_schema = task_info["schema"]
                        break
                if not task_schema:
                    print(f"No task schema found for data {data_id}.")
                    return False
                if not self._validate_annotations(annotations, task_schema, data_id):
                    print(f"Annotations for data {data_id} submitted by {labeler_id} are malformed or invalid.")
                    return False
                item["annotations"] = annotations
                item["status"] = "submitted_for_review"
                item["history"].append(f"submitted by {labeler_id} at {datetime.datetime.now().isoformat()}")
                self.users[labeler_id]["current_tasks"] -= 1  # submitting frees the labeler's capacity
                print(f"Labeler {self.users[labeler_id]['name']} submitted annotations for data {data_id}; awaiting review.")
                return True
        print(f"Cannot find or submit annotations for data {data_id}; check its status and assignment.")
        return False

    def _validate_annotations(self, annotations: list, schema: dict, data_id: int) -> bool:
        """
        Internal method: validate annotation results against the task schema.
        Args:
            annotations (list): annotation results.
            schema (dict): the task's annotation schema.
            data_id (int): data item ID.
        Returns:
            bool: validation result.
        """
        if not isinstance(annotations, list):
            print(f"[validation] Data {data_id}: annotations must be a list.")
            return False
        if not annotations and schema.get("required_annotations", True):  # empty is allowed only if the schema permits it
            print(f"[validation] Data {data_id}: annotations must not be empty.")
            return False
        # Example: validate based on 'task_type' and 'labels' in the schema
        task_type = schema.get("task_type")
        allowed_labels = schema.get("labels", [])
        for ann in annotations:
            if not isinstance(ann, dict):
                print(f"[validation] Data {data_id}: each annotation must be a dict.")
                return False
            if task_type == "image_classification":
                if 'label' not in ann:
                    print(f"[validation] Data {data_id}: classification annotation missing 'label' field.")
                    return False
                if allowed_labels and ann['label'] not in allowed_labels:
                    print(f"[validation] Data {data_id}: label '{ann['label']}' is not in the allowed class list.")
                    return False
            elif task_type == "object_detection":
                if not all(k in ann for k in ['label', 'box']):
                    print(f"[validation] Data {data_id}: detection annotation missing 'label' or 'box' field.")
                    return False
                if not (isinstance(ann['box'], list) and len(ann['box']) == 4 and all(isinstance(coord, (int, float)) for coord in ann['box'])):
                    print(f"[validation] Data {data_id}: malformed bounding box (should be [x1, y1, x2, y2]).")
                    return False
                # Checking that the box lies inside the image would need the image size; simplified here.
                if not (0 <= ann['box'][0] < ann['box'][2] and 0 <= ann['box'][1] < ann['box'][3]):
                    print(f"[validation] Data {data_id}: invalid bounding box coordinates.")
                    return False
                if allowed_labels and ann['label'] not in allowed_labels:
                    print(f"[validation] Data {data_id}: detection label '{ann['label']}' is not in the allowed class list.")
                    return False
            # Validation logic for additional task types can be added here
            else:
                print(f"[validation] Data {data_id}: unknown task type '{task_type}' with no validation logic.")
        return True

    def review_and_approve(self, data_id: int, reviewer_id: str, passed: bool = True, feedback: str = ""):
        """Reviewer approves or rejects submitted annotations."""
        if reviewer_id not in self.users or self.users[reviewer_id]["role"] not in ["admin", "reviewer"]:
            print(f"Reviewer '{reviewer_id}' does not exist or lacks permission.")
            return False
        for item in self.raw_data_pool:
            if item["id"] == data_id and item["status"] == "submitted_for_review":
                if passed:
                    item["status"] = "labeled"
                    self.labeled_data.append({"data_id": data_id, "annotations": item["annotations"], "source": item["source"], "reviewed_by": reviewer_id})
                    item["history"].append(f"approved by {reviewer_id} at {datetime.datetime.now().isoformat()}")
                    print(f"Annotations for data {data_id} passed review; approved by {self.users[reviewer_id]['name']}.")
                else:
                    item["status"] = "rejected"   # needs reassignment or correction
                    item["assigned_to"] = None    # drop the original labeler's assignment
                    # item["annotations"] = []    # usually kept so the labeler can revise them
                    item["history"].append(f"rejected by {reviewer_id} at {datetime.datetime.now().isoformat()} with feedback: '{feedback}'")
                    print(f"Annotations for data {data_id} rejected by {self.users[reviewer_id]['name']}. Reason: {feedback}.")
                return True
        print(f"Cannot find or review data {data_id}; check its status.")
        return False

    def export_labeled_data(self, format_type: str = "json") -> str:
        """
        Export all approved annotations.
        Args:
            format_type (str): export format; currently only "json" is supported.
        Returns:
            str: JSON string of the exported data.
        """
        if format_type == "json":
            print(f"Exporting {len(self.labeled_data)} approved annotation records as JSON...")
            return json.dumps(self.labeled_data, indent=2, ensure_ascii=False)
        else:
            print(f"Export format '{format_type}' is not supported yet.")
            return ""
```
```python
# --- Simulating the data labeling system ---
print("--- Initialization, users, and data management ---")
my_dls = DataLabelingSystem()
my_dls.add_user("L001", "Alice", "labeler", capacity=2)  # Alice handles at most 2 concurrent tasks
my_dls.add_user("R001", "Bob", "reviewer")
my_dls.add_user("L002", "Charlie", "labeler", capacity=3)
my_dls.upload_raw_data(["image_001.jpg", "image_002.jpg", "image_003.jpg", "image_004.jpg", "image_005.jpg"])

# Define the annotation schema for an image classification task (as a JSON string)
image_classification_schema = json.dumps({
    "task_type": "image_classification",
    "labels": ["cat", "dog", "bird", "car"],
    "required_annotations": True  # annotations are mandatory
})
my_dls.create_labeling_task("Image_Classification_Project", [0, 1, 2], image_classification_schema)

# Define the annotation schema for an object detection task
object_detection_schema = json.dumps({
    "task_type": "object_detection",
    "labels": ["person", "bicycle", "car", "motorcycle", "bus", "truck"],
    "required_annotations": False  # empty annotations allowed when there is no object
})
# Keep the two tasks' data sets disjoint so schema lookup by data ID is unambiguous
my_dls.create_labeling_task("Vehicle_Detection_Project", [3, 4], object_detection_schema)

print("\n--- Task assignment and annotation submission ---")
my_dls.assign_data_to_labeler("Image_Classification_Project", 0, "L001")
my_dls.assign_data_to_labeler("Image_Classification_Project", 1, "L001")
my_dls.assign_data_to_labeler("Image_Classification_Project", 2, "L001")  # Alice is at her limit; assignment fails
my_dls.submit_annotations(0, [{"label": "cat", "confidence": 0.98}], "L001")
my_dls.submit_annotations(1, [{"label": "dog"}], "L001")
my_dls.assign_data_to_labeler("Image_Classification_Project", 2, "L002")  # assign to Charlie instead
my_dls.submit_annotations(2, [{"label": "lion"}], "L002")  # schema-violating label; validation fails
my_dls.submit_annotations(2, [{"label": "bird"}], "L002")  # Charlie submits a valid label

# Simulate the object detection task
my_dls.assign_data_to_labeler("Vehicle_Detection_Project", 3, "L002")
my_dls.submit_annotations(3, [{"label": "car", "box": [10, 20, 100, 200]}, {"label": "bicycle", "box": [150, 120, 250, 220]}], "L002")

print("\n--- Review stage ---")
my_dls.review_and_approve(0, "R001", True)
my_dls.review_and_approve(1, "R001", True)
my_dls.review_and_approve(2, "R001", False, "Label 'bird' is fine, but please check whether other objects in the image need classifying.")  # rejected; revision requested
my_dls.review_and_approve(3, "R001", True)

# Reassign the rejected item and let Charlie revise it
my_dls.assign_data_to_labeler("Image_Classification_Project", 2, "L002")
my_dls.submit_annotations(2, [{"label": "bird"}, {"label": "cat"}], "L002")  # added the missing label
my_dls.review_and_approve(2, "R001", True)

print("\n--- Final results and export ---")
print(f"Approved annotation records: {len(my_dls.labeled_data)}")
print("Sample of approved data:")
for item in my_dls.labeled_data[:3]:
    print(f"  Data ID: {item['data_id']}, Annotations: {item['annotations']}")
exported_json = my_dls.export_labeled_data()
# print("\n--- Exported JSON data ---")
# print(exported_json[:500] + "..." if len(exported_json) > 500 else exported_json)
```
Application scenarios: data labeling systems are widely used in:
- Computer Vision: object detection, image segmentation, image classification, face recognition, action recognition, etc.; the foundation of autonomous driving, smart security, and medical image analysis.
- Natural Language Processing (NLP): named entity recognition (NER), sentiment analysis, text classification, relation extraction, machine translation, etc.; powering intelligent customer service, content moderation, and opinion monitoring.
- Speech Recognition: speech-to-text transcription, speaker identification, speaking rate/intonation analysis, etc.; widely used in smart speakers, voice assistants, and meeting transcription.
- Autonomous Driving: LiDAR point cloud annotation, road element annotation (lane lines, traffic signs), vehicle behavior prediction, etc., with extremely high safety requirements.
2. Core Features of a Data Labeling System and Hands-On Construction

An efficient, reliable data labeling system needs far more than a simple "draw a box" feature. It has to solve technical challenges across large-scale data, multi-user collaboration, and quality control.

2.1 Anatomy of the core feature modules
- Data import and management:
  - Support multiple data formats (images, video, text, audio, 3D point clouds, etc.).
  - Provide efficient upload, storage (usually backed by object storage such as AWS S3 or Alibaba Cloud OSS), indexing, and search.
  - Support dataset versioning so that training data stays traceable.
- Rich annotation tools:
  - Provide diverse annotation UIs and tools for different data types and labeling tasks.
  - For example: bounding boxes, polygons, keypoints, semantic segmentation brushes, text highlighting, timestamps, 3D cuboids, etc.
  - Best practice: annotation tools should support keyboard shortcuts, zooming, undo/redo, and template presets to maximize labeling throughput.
- Task assignment and progress tracking:
  - Create projects, assign data to labelers in batches, and monitor per-task and per-labeler progress in real time.
  - Support task priorities and urgency levels so that critical data is handled first.

Code example: a simple task distribution logic
Let's look at an example of load-balanced distribution of labeling tasks.

```python
import collections
import random

class TaskDistributor:
    """
    A simulated task distributor implementing load balancing.
    It assigns pending data items to labelers based on each
    labeler's capacity and current task count.
    """
    def __init__(self, data_items_pool: list, labelers_info: dict):
        self.data_items = data_items_pool  # raw data pool containing pending items
        # labeler info: ID -> {"name", "capacity", "current_tasks"}
        self.labelers = {l_id: {"name": l_name, "capacity": capacity, "current_tasks": 0}
                         for l_id, (l_name, capacity) in labelers_info.items()}
        # use a queue of pending data indices to guarantee fairness
        self.pending_data_indices = collections.deque(
            [i for i, item in enumerate(self.data_items) if item["status"] in ["pending", "rejected"]])
        print(f"Task distributor initialized: {len(self.pending_data_indices)} pending items, {len(self.labelers)} labelers.")

    def assign_next_task(self):
        """Try to assign one task to an available labeler."""
        if not self.pending_data_indices:
            print("All data has been assigned!")
            return False, None
        # filter labelers that are still below their task limit
        available_labelers = [l_id for l_id, info in self.labelers.items()
                              if info["capacity"] == -1 or info["current_tasks"] < info["capacity"]]
        if not available_labelers:
            print("No available labelers (all at capacity). Wait for tasks to complete.")
            return False, None
        # simple load-balancing strategy: pick a random available labeler
        selected_labeler_id = random.choice(available_labelers)
        # pop one pending data index from the head of the queue
        data_idx = self.pending_data_indices.popleft()
        data_item = self.data_items[data_idx]
        # update the data item's state and the labeler's task count
        data_item["status"] = "assigned"
        data_item["assigned_to"] = selected_labeler_id
        self.labelers[selected_labeler_id]["current_tasks"] += 1
        print(f"Data {data_item['id']} assigned to {self.labelers[selected_labeler_id]['name']} ({selected_labeler_id}).")
        return True, {"data_id": data_item["id"], "labeler_id": selected_labeler_id}

    def complete_task(self, data_id: int, labeler_id: str):
        """
        Simulate task completion and free the labeler's capacity.
        In a real system this is triggered when annotations are submitted.
        """
        if labeler_id not in self.labelers:
            print(f"Labeler '{labeler_id}' does not exist.")
            return False
        for item in self.data_items:
            if item["id"] == data_id and item["assigned_to"] == labeler_id:
                if item["status"] == "assigned":
                    # a real flow would go through 'submitted_for_review'; simplified to direct completion
                    item["status"] = "submitted_for_review"
                    self.labelers[labeler_id]["current_tasks"] -= 1
                    print(f"Labeler {self.labelers[labeler_id]['name']} completed data {data_id}.")
                    return True
                else:
                    print(f"Data {data_id} is in the wrong state ({item['status']}); cannot mark as completed.")
                    return False
        print(f"Could not find a task for data {data_id} (it may not be assigned to this labeler).")
        return False

    def get_labeler_status(self):
        """Get every labeler's current task load."""
        return {l_id: f"{info['name']}: {info['current_tasks']} / {info['capacity']} tasks"
                for l_id, info in self.labelers.items()}

# Simulated data pool (same structure as raw_data_pool in DataLabelingSystem)
sample_raw_data_pool_for_distributor = [
    {"id": i, "source": f"img_{i:03d}.jpg", "status": "pending", "assigned_to": None, "annotations": []}
    for i in range(20)
]
# Simulated labeler info (ID: (name, capacity))
sample_labelers_info = {"L001": ("Zhang San", 5), "L002": ("Li Si", 8), "L003": ("Wang Wu", 3)}
distributor = TaskDistributor(sample_raw_data_pool_for_distributor, sample_labelers_info)

print("--- Simulating task distribution ---")
for _ in range(15):  # try to distribute 15 tasks
    success, task_info = distributor.assign_next_task()
    if not success:
        print("No more tasks can be distributed, or all have been distributed.")
        break
print(f"Current labeler load: {distributor.get_labeler_status()}")

# Simulate Zhang San completing a task (this may fail if data 0 happened
# to be assigned to someone else, since the strategy picks labelers at random)
print("--- Simulating task completion and freeing capacity ---")
distributor.complete_task(sample_raw_data_pool_for_distributor[0]["id"], "L001")
print(f"Zhang San after completing a task: {distributor.get_labeler_status()['L001']}")

# Try to distribute another task (capacity may now be free again)
print("--- Distributing one more task ---")
distributor.assign_next_task()
print(f"After redistribution: {distributor.get_labeler_status()['L001']}")
```
- Data export and integration:
  - Support mainstream annotation formats (COCO, Pascal VOC, YOLO, JSON, etc.) for easy hand-off to AI training frameworks (TensorFlow, PyTorch, PaddlePaddle).
  - Provide API endpoints for seamless integration into MLOps pipelines.
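To make the export bullet concrete, here is a minimal, illustrative converter from the internal record shape used earlier in this article into a COCO-style detection dict. It emits only the fields a detection trainer minimally needs (`images`, `annotations`, `categories`); the function name and the record shape are assumptions for this sketch, not a full COCO implementation.

```python
import json

def to_coco(labeled_items, categories):
    """Convert internal labeled records into a minimal COCO-style dict.

    Each record follows the shape used earlier in this article:
    {"data_id": ..., "source": ..., "annotations": [{"label": ..., "box": [x1, y1, x2, y2]}]}.
    """
    cat_ids = {name: i + 1 for i, name in enumerate(categories)}  # COCO category IDs start at 1
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for n, i in cat_ids.items()],
    }
    ann_id = 1
    for item in labeled_items:
        coco["images"].append({"id": item["data_id"], "file_name": item["source"]})
        for ann in item["annotations"]:
            x1, y1, x2, y2 = ann["box"]
            coco["annotations"].append({
                "id": ann_id,
                "image_id": item["data_id"],
                "category_id": cat_ids[ann["label"]],
                "bbox": [x1, y1, x2 - x1, y2 - y1],  # COCO uses [x, y, width, height]
                "area": (x2 - x1) * (y2 - y1),
                "iscrowd": 0,
            })
            ann_id += 1
    return coco

labeled = [{"data_id": 3, "source": "image_004.jpg",
            "annotations": [{"label": "car", "box": [10, 20, 100, 200]}]}]
coco = to_coco(labeled, ["person", "car"])
print(json.dumps(coco, indent=2)[:200])
```

Note the corner-to-size conversion: the internal `[x1, y1, x2, y2]` boxes become COCO's `[x, y, width, height]`, a frequent source of silent export bugs.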
2.2 Technical challenges and solutions when building one

- Large-scale data handling:
  - Challenge: how to store, load, and process terabytes or even petabytes of data efficiently while keeping the system responsive.
  - Solution: use distributed storage (HDFS, Ceph, object storage) combined with streaming and batch loading. The front end loads only the data shards it needs; the back end handles preprocessing and index optimization.
- Concurrency and collaboration:
  - Challenge: with many labelers annotating online at the same time, how to guarantee data consistency, avoid conflicts, and keep the system stable.
  - Solution: handle concurrent edits with optimistic or pessimistic locking, use a message queue (e.g. Kafka) for event notification, and adopt a microservice architecture for scalability and fault tolerance.
- Usability and extensibility of annotation tools:
  - Challenge: design an intuitive, efficient UI/UX that can quickly adapt to new annotation needs and data types.
  - Solution: modular front-end components with configurable annotation templates and a plugin mechanism. For example, build a base library of annotation widgets and compose or extend them per task type.
- Integration and compatibility:
  - Challenge: integrate seamlessly with different AI frameworks, data stores (object storage, databases), and permission systems.
  - Solution: offer a standardized RESTful API or SDK, support multiple import/export formats, and use OAuth/JWT for unified authentication.
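To illustrate the optimistic-locking idea from the concurrency bullet, here is a minimal in-memory sketch: every record carries a version number, and a write succeeds only if the writer still holds the current version. The class and method names are invented for this example; a real system would enforce the same check at the database layer (e.g. a conditional UPDATE).

```python
class VersionConflict(Exception):
    pass

class AnnotationStore:
    """In-memory store with optimistic locking: each record carries a
    version counter, and a write only succeeds if the caller read the
    current version. All names here are illustrative."""
    def __init__(self):
        self.records = {}  # data_id -> {"annotations": [...], "version": int}

    def read(self, data_id):
        rec = self.records.setdefault(data_id, {"annotations": [], "version": 0})
        return rec["annotations"], rec["version"]

    def write(self, data_id, annotations, expected_version):
        rec = self.records[data_id]
        if rec["version"] != expected_version:
            # someone else wrote in between: reject instead of silently overwriting
            raise VersionConflict(f"expected v{expected_version}, found v{rec['version']}")
        rec["annotations"] = annotations
        rec["version"] += 1
        return rec["version"]

store = AnnotationStore()
_, v = store.read(42)                       # labeler A reads version 0
store.write(42, [{"label": "cat"}], v)      # A writes; record is now version 1
try:
    store.write(42, [{"label": "dog"}], v)  # B still holds version 0 -> conflict
except VersionConflict as e:
    print("rejected:", e)
```

The rejected writer then re-reads the record, merges or redoes the edit, and retries, which is exactly the behavior you want when two labelers open the same item.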
2.3 Code comparison: hardcoded labels vs. a dynamic schema

In real projects we frequently need to adjust label classes. If labels are hardcoded, changing them is painful and error-prone. A dynamic schema greatly improves flexibility.
```python
# Bad practice: hardcoded labels
def process_hardcoded_label(label_text: str):
    """Process a label based on hardcoded strings."""
    if label_text == "cat":
        return "animal-cat"
    elif label_text == "dog":
        return "animal-dog"
    else:
        return "unknown"

print(f"Hardcoded handling of 'cat': {process_hardcoded_label('cat')}")
print(f"Hardcoded handling of 'car': {process_hardcoded_label('car')}\n")

# Good practice: schema-driven labels
def process_schema_label(label_text: str, schema: dict):
    """Map a label according to a dynamic schema."""
    label_map = schema.get("label_mapping", {})
    return label_map.get(label_text, "unknown")

# Assume this schema is loaded from the task configuration
dynamic_schema = {
    "task_type": "image_classification",
    "labels": ["cat", "dog", "bird", "car"],
    "label_mapping": {
        "cat": "animal-cat",
        "dog": "animal-dog",
        "bird": "animal-bird",
        "car": "vehicle-car"
    }
}
print(f"Schema handling of 'cat': {process_schema_label('cat', dynamic_schema)}")
print(f"Schema handling of 'car': {process_schema_label('car', dynamic_schema)}")

# Adding a new label only requires a schema change; the processing logic stays untouched
new_schema = {**dynamic_schema, "label_mapping": {**dynamic_schema["label_mapping"], "truck": "vehicle-truck"}}
print(f"New schema handling of 'truck': {process_schema_label('truck', new_schema)}")
```
3. Advanced Practice: Quality Control and Efficiency Optimization

Annotation quality is the key to a model's success. We cannot rely on labelers' self-discipline alone; we must establish a complete quality control (QC) mechanism and raise labeling efficiency with automated and semi-automated techniques.
3.1 High-precision quality control strategies

- Cross-review:
  - Each data item is labeled independently by several labelers, then a reviewer compares the results and decides.
  - Alternatively, one labeler annotates and another reviews.
- Agreement check:
  - Compute agreement scores between labelers on the same data items to assess annotation quality and individual performance.
  - Common metrics include Cohen's Kappa and Fleiss' Kappa.
- Gold standard / labeler exams:
  - Prepare a small set of high-quality "gold standard" data in advance, used to evaluate new labelers or to spot-check ongoing annotation quality.
- Reject-and-redo workflow:
  - When annotations fall short, the reviewer can reject the task with detailed feedback and ask the labeler to revise.
Code example: a simple annotation agreement score

Here we implement a simple majority-vote agreement function. Real applications would use more rigorous statistical methods.
```python
from collections import Counter

def calculate_agreement_score(annotations_list: list) -> tuple:
    """
    Compute an agreement score for a group of annotation results. Here we use
    simple majority voting and report the share of the majority label.
    Args:
        annotations_list (list): a list of annotation results.
            Each result may be a plain label string or a dict with a 'label' field.
            Example: [["cat"], ["dog"], ["cat"]] (simple classification)
            or: [[{"label": "cat"}], [{"label": "dog"}], [{"label": "cat"}]] (complex annotations)
    Returns:
        tuple: (majority label, share of the majority label)
    """
    if not annotations_list:
        return None, 1.0  # an empty list counts as full agreement, but with no majority label
    # Extract all annotation values. Each element is assumed to be a list of annotation objects.
    # For simple classification, the label string can be taken directly.
    # For complex annotations, objects would need a hashable form (e.g. JSON-serialized) for comparison.
    all_labels = []
    for ann_set in annotations_list:
        if not ann_set:  # empty annotation sets are allowed
            all_labels.append("NO_ANNOTATION")  # special marker for "no annotation"
            continue
        # Simplification: for complex annotations, only the first label represents the set.
        # Real comparisons need richer logic, e.g. IoU computation or matching all attributes.
        if isinstance(ann_set[0], dict) and 'label' in ann_set[0]:
            all_labels.append(ann_set[0]['label'])
        elif isinstance(ann_set[0], str):
            all_labels.append(ann_set[0])
        else:
            all_labels.append("UNSUPPORTED_FORMAT")  # unrecognized format
    if not all_labels:
        return None, 1.0
    label_counts = Counter(all_labels)
    most_common_label, count = label_counts.most_common(1)[0]
    agreement_ratio = count / len(all_labels)
    print(f"[QC] Majority label: '{most_common_label}', occurrences: {count}, total annotations: {len(all_labels)}")
    return most_common_label, agreement_ratio

# Example 1: agreement on a simple classification task
print("\n--- Annotation agreement check examples ---")
annotations_set_1 = [
    [{"label": "cat"}],
    [{"label": "dog"}],
    [{"label": "cat"}],
    [{"label": "cat"}]
]
majority_label_1, score_1 = calculate_agreement_score(annotations_set_1)
print(f"Agreement score (classification): {score_1:.2f} (majority label: {majority_label_1})\n")

# Example 2: several labelers' detection results on the same image (reduced to label agreement)
annotations_set_2 = [
    [{"label": "car", "box": [10, 20, 30, 40]}],
    [{"label": "car", "box": [12, 22, 32, 42]}],  # boxes differ slightly, but labels still agree
    [{"label": "truck", "box": [50, 60, 70, 80]}]
]
majority_label_2, score_2 = calculate_agreement_score(annotations_set_2)
print(f"Agreement score (detection labels): {score_2:.2f} (majority label: {majority_label_2})\n")

# Example 3: with an empty annotation set
annotations_set_3 = [
    [{"label": "person"}],
    [{"label": "person"}],
    []  # one labeler believes there is no target
]
majority_label_3, score_3 = calculate_agreement_score(annotations_set_3)
print(f"Agreement score (with empty annotations): {score_3:.2f} (majority label: {majority_label_3})\n")
```
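The majority-vote score above is crude. Cohen's Kappa, mentioned in section 3.1, corrects observed agreement for the agreement two annotators would reach by pure chance. A minimal sketch for two annotators over the same items (assuming single-label classification):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance, computed from each
    annotator's label frequencies."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in freq_a.keys() | freq_b.keys())
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators always emit the same single label
    return (p_o - p_e) / (1 - p_e)

a = ["cat", "cat", "dog", "cat", "dog", "bird"]
b = ["cat", "dog", "dog", "cat", "dog", "bird"]
print(f"kappa = {cohens_kappa(a, b):.3f}")
```

Because kappa discounts chance agreement, it is far more informative than the raw agreement ratio when the label distribution is skewed (e.g. 95% of images are "background").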
3.2 Efficiency optimization techniques

- Pre-labeling:
  - Train an initial model on an existing model or a small amount of manually labeled data, and use it to pre-label new data.
  - Labelers then only correct the pre-labels instead of starting from scratch, which boosts efficiency dramatically.
- Active learning:
  - The model picks the data it is "least certain" about or that is "most valuable" and sends it for manual labeling.
  - This maximizes the model's performance gain for the smallest manual labeling cost.
- Rule-based automatic labeling:
  - Data with obvious patterns (e.g. specific keywords in text) can be labeled automatically with hand-written rules.
  - For example, recognizing ID numbers or phone numbers via regular expressions.
- Keyboard shortcuts and batch operations:
  - Annotation tools should offer rich shortcuts and batch processing to cut down repetitive work.
  - For example: one-key copy of the previous annotation, batch delete, batch attribute edits.
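The rule-based bullet above can be sketched as follows: a few regular expressions produce span pre-labels that a human then corrects in the annotation tool. The patterns below are deliberately simplified stand-ins (real ID-number validation needs checksum logic, and real phone validation needs carrier-prefix rules):

```python
import re

# Simplified patterns, for illustration only
RULES = {
    "PHONE": re.compile(r"\b1[3-9]\d{9}\b"),     # mainland-China-style 11-digit mobile number
    "ID_CARD": re.compile(r"\b\d{17}[\dXx]\b"),  # 18-character ID number (17 digits + digit/X)
}

def rule_prelabel(text):
    """Return (start, end, label) spans found by the rules, ready to be
    loaded into an annotation tool as pre-labels for human correction."""
    spans = []
    for label, pattern in RULES.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)

text = "Contact 13812345678, ID 11010519491231002X, for details."
for start, end, label in rule_prelabel(text):
    print(label, text[start:end])
```

Even imperfect rules like these pay off: the labeler's job shifts from finding every entity to verifying and fixing proposed spans, which is usually much faster.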
3.3 Common pitfalls and how to fix them

- Pitfall 1: unclear or inconsistent annotation guidelines
  - Problem: labelers understand the same concept differently, producing wildly inconsistent annotations.
  - Solution: write detailed, well-illustrated guideline documents, run regular labeler training and assessments, and maintain an FAQ plus a communication channel.
- Pitfall 2: labeler fatigue and bias
  - Problem: long stretches of repetitive work erode attention and produce systematic errors.
  - Solution: schedule workloads and breaks sensibly, use "gold standard" items and agreement checks to catch and correct drift in time, and rotate task types.
- Pitfall 3: data drift
  - Problem: as time passes, the real-world data distribution shifts and the old guidelines and data no longer apply.
  - Solution: review and update guidelines regularly, label new data in small batches and retrain the model, and monitor performance changes.
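As a small illustration of the drift monitoring from Pitfall 3, one cheap signal is to compare the label distribution of a new labeling batch against an older one. This is a naive sketch; the threshold is an assumption to tune per project, and a production pipeline would prefer a proper statistical test (chi-squared, population stability index):

```python
from collections import Counter

def label_distribution(labels):
    """Relative frequency of each label in a batch."""
    total = len(labels)
    return {l: c / total for l, c in Counter(labels).items()}

def max_share_shift(old_labels, new_labels):
    """A crude drift signal: the largest absolute change in any single
    label's share between an old and a new batch."""
    old_d, new_d = label_distribution(old_labels), label_distribution(new_labels)
    return max(abs(old_d.get(l, 0) - new_d.get(l, 0)) for l in old_d.keys() | new_d.keys())

old_batch = ["car"] * 70 + ["truck"] * 30
new_batch = ["car"] * 40 + ["truck"] * 35 + ["bus"] * 25  # "bus" appears; "car" share drops
shift = max_share_shift(old_batch, new_batch)
print(f"max label-share shift: {shift:.2f}")
if shift > 0.15:  # threshold is illustrative; tune per project
    print("possible data drift -- review the labeling guidelines")
```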
Best-practice checklist:
1. Guidelines first, labeling second: labelers must have clear, thorough annotation guidelines before they start.
2. Pilot in small batches, give feedback fast: run a small trial before bulk labeling and iterate quickly on guidelines and tools.
3. Multiple QC rounds, checks at every layer: combine cross-review, gold standards, and agreement checks.
4. Labelers are partners, not tools: invest in labeler training, communication, and feedback to grow their expertise.
5. Embrace automation for efficiency: actively adopt pre-labeling, active learning, and similar techniques to lighten the manual load.
6. Data security and privacy first: make sure sensitive data is anonymized, handled privately, and stored securely.
7. Version everything for traceability: annotation data, tools, and guidelines should all be under version control.

Tool recommendations:

There are many excellent data labeling tools on the market, for example:
* Open-source tools: LabelImg (image object detection), Label Studio (multi-modal, general purpose), Doccano (text annotation).
* Commercial platforms: AWS SageMaker Ground Truth, Google Cloud AI Platform Data Labeling, Baidu EasyData, Datafiniti, etc.; they usually offer more complete workflow management, quality control, and large-team collaboration features.
4. Summary and Outlook

Today we took a deep dive into data labeling systems: from core concepts and components, through the technical challenges and solutions of building one, to producing high-quality "AI food" via quality control and efficiency optimization. We saw that a good data labeling system is the cornerstone of a successful AI project: it raises data production efficiency and safeguards data quality, which is what makes our AI models genuinely "smart".

Key takeaways:
- A data labeling system is how AI models obtain high-quality training data.
- Its core modules cover data management, annotation tools, task assignment, quality control, user management, and data export.
- Facing the challenges of large-scale data, concurrent collaboration, and high-precision quality control calls for distributed, modular, and intelligent solutions.
- Pre-labeling, active learning, agreement checks, and a solid review workflow significantly raise both labeling efficiency and data quality.
Practical advice:
1. Pin down the labeling goal: before launching any labeling project, communicate closely with the AI model development team to clarify model requirements and define clear labeling goals and guidelines.
2. Start small and iterate: don't try to build a perfect system in one go. Start with a minimum viable product (MVP) and iterate based on real feedback and needs.
3. Care about the labeler experience: labelers are the core of data production; easy-to-use, efficient tools and a good working environment measurably improve quality and throughput.
4. Combine automation with human work: make full use of machine learning for pre-labeling and active learning, and focus human effort on the most valuable and most difficult data.
5. Build a data flywheel: close the loop between the labeling system, model training, and deployment. Model predictions can serve as pre-labels, and data the model struggles with can be routed back to the labeling system for refinement first, creating a virtuous cycle.
Looking ahead: as AI technology evolves, data labeling systems keep evolving too:
* Smarter AI-assisted labeling: large language models (LLMs) and multi-modal models will be woven deeper into the labeling workflow, enabling smarter pre-labeling, label suggestions, and quality checks.
* MaaS (Model as a Service) and data-as-a-service: labeling systems will integrate further with MLOps platforms and offer end-to-end data services, making data preparation a more seamless, more automated part of the AI development workflow.
* The rise of synthetic data: in some scenarios, synthetic data created by generative AI will assist or replace part of real-data labeling, an important trend for the future.

Data is the lifeblood of AI, and the data labeling system is the heart that pumps it. Mastering and building an efficient data labeling system lets us feed our AI models a steady supply of high-quality nutrients, and together push artificial intelligence toward a broader future!
æ°æ®æ¯AIçå½èï¼èæ°æ®æ æ³¨ç³»ç»æ£æ¯è¿æ¡å½èçå¿èãææ¡å¹¶æå»ºé«æçæ°æ®æ 注系ç»ï¼å°ä½¿æä»¬è½å¤ä¸ºAI模åæä¾æºæºä¸æçé«è´¨éå »åï¼å ±åæ¨å¨äººå·¥æºè½ææ¯èµ°åæ´å¹¿éçæªæ¥ï¼