2.3 Extracting Information from Data
1. Exam Points
- Identify the data or data sets needed to get the desired information.
- Identify problems or challenges with data processing for a given scenario.
- Determine what information can be extracted from given data sets.
2. Knowledge Points
(1) Data and Metadata
Metadata are data about data, providing additional information about data.
- Data: e.g. the image itself. (the data itself)
- Metadata: e.g. date of creation, file size, file name, etc. (data about the data)
- Metadata are used for finding, organizing, and managing information.
- Changes and deletions made to metadata do not change the primary data.
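To make the data/metadata split concrete, here is a minimal Python sketch. The `ImageFile` class, the `find_by_name` helper, and all field names and values are hypothetical, chosen only for illustration. It shows metadata being used to find a file, and metadata being changed and deleted while the primary data stay the same.

```python
from dataclasses import dataclass, field


@dataclass
class ImageFile:
    """Hypothetical container: primary data plus metadata about it."""
    pixels: bytes                                   # the data itself
    metadata: dict = field(default_factory=dict)    # data about the data


photo = ImageFile(
    pixels=bytes([255, 0, 0, 0, 255, 0]),
    metadata={"file_name": "cat.png", "created": "2024-05-01", "size_bytes": 6},
)

original_pixels = photo.pixels

# Changing or deleting metadata does not touch the primary data.
photo.metadata["file_name"] = "kitten.png"
del photo.metadata["created"]


def find_by_name(files, name):
    """Use metadata to find files: return those whose file name matches."""
    return [f for f in files if f.metadata.get("file_name") == name]


matches = find_by_name([photo], "kitten.png")
print(photo.pixels == original_pixels)  # the image bytes are unchanged
```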
(2) Process Data
Information is the collection of facts and patterns (trends) extracted from data.
- So we process data to extract information.
- Common steps of processing data:
Combining: combine data from different sources.
Cleaning: remove corrupt or incomplete data and make the data uniform (handle invalid values, missing values, etc., so that the data are consistent).
- Example: remove invalid ages; replace "cn" with "China" so that every record uses the same name for the same place.
Filtering: identify and extract useful subsets.
- Example: filter records of females.
Classifying: group data based on common features.
- Example: group data based on categories.
Finding patterns: identify patterns (trends) in data.
- Example: find weather trends for prediction.
- Scalability of systems is an important consideration when processing data.
- Scalability is the ability to increase the capacity of a resource without having to go to a completely new solution.
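The processing steps listed in this section (combining, cleaning, filtering, classifying, finding patterns) can be sketched in a few lines of Python. This is only an illustration under assumptions: the records, field names, the 0 to 120 age range, and the `clean` helper are all made up, and "most common age" stands in for a real pattern such as a weather trend.

```python
from collections import Counter, defaultdict

# Made-up records from two hypothetical sources.
source_a = [
    {"name": "Ann", "gender": "F", "age": 17, "country": "cn"},
    {"name": "Bob", "gender": "M", "age": -3, "country": "China"},
]
source_b = [
    {"name": "Cat", "gender": "F", "age": 16, "country": "China"},
    {"name": "Dee", "gender": "F", "age": None, "country": "US"},
    {"name": "Eve", "gender": "F", "age": 17, "country": "US"},
]

# Combining: merge data from different sources.
records = source_a + source_b


def clean(rows):
    """Cleaning: drop missing/invalid ages and make country names uniform."""
    cleaned_rows = []
    for row in rows:
        age = row["age"]
        if age is None or not 0 <= age <= 120:   # assumed valid range
            continue
        fixed = dict(row)                        # copy; leave the source intact
        if fixed["country"] == "cn":
            fixed["country"] = "China"
        cleaned_rows.append(fixed)
    return cleaned_rows


cleaned = clean(records)

# Filtering: extract a useful subset (records of females).
females = [r for r in cleaned if r["gender"] == "F"]

# Classifying: group records by a common feature (country).
by_country = defaultdict(list)
for r in females:
    by_country[r["country"]].append(r["name"])

# Finding patterns: which age occurs most often in the filtered data?
age_counts = Counter(r["age"] for r in females)
most_common_age, count = age_counts.most_common(1)[0]

print(dict(by_country))
print(most_common_age, count)
```

Note that the order of the steps matters here: cleaning before filtering keeps invalid records out of every later result.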
3. Exercises