数据清洗(删除重复行、map映射添加列)

116 阅读8分钟

携手创作,共同成长!这是我参与「掘金日新计划 · 8 月更文挑战」的第1天,点击查看活动详情

 


data_1=pd.DataFrame({'food':['bacon火腿','sausage红肠','pulled pork手撕肉','sirloin牛里脊肉','beef jerky牛肉干','mutton shashlik烤羊肉串','Chicken salad鸡肉','Minced chicken鸡肉泥','dried squids鱿鱼干','dried fish鱼干','sausage红肠','bacon火腿','sausage红肠','bacon火腿','Chicken salad鸡肉','dried squids鱿鱼干','dried fish鱼干','mutton shashlik烤羊肉串','beef jerky牛肉干','Minced chicken鸡肉泥','pulled pork手撕肉','sirloin牛里脊肉','mutton shashlik烤羊肉串','Chicken salad鸡肉','Minced chicken鸡肉泥','bacon火腿','sausage红肠','pulled pork手撕肉','sirloin牛里脊肉','beef jerky牛肉干','mutton shashlik烤羊肉串','Chicken salad鸡肉','Minced chicken鸡肉泥','dried squids鱿鱼干','dried fish鱼干','sausage红肠','bacon火腿','sausage红肠','bacon火腿','Chicken salad鸡肉','dried squids鱿鱼干','dried fish鱼干','mutton shashlik烤羊肉串','beef jerky牛肉干','sirloin牛里脊肉','mutton shashlik烤羊肉串','Chicken salad鸡肉','Minced chicken鸡肉泥'],'weight(g)':abs((np.random.normal(0,1,size=48)*24+10)).astype(np.int32)})

print(data_1)

                   food  weight(g)

0               bacon火腿         47

1             sausage红肠         48

2        pulled pork手撕肉          5

3           sirloin牛里脊肉         21

4         beef jerky牛肉干         35

5   mutton shashlik烤羊肉串         23

6       Chicken salad鸡肉         11

7     Minced chicken鸡肉泥          8

8       dried squids鱿鱼干          3

9          dried fish鱼干         13

10            sausage红肠         20

11              bacon火腿         26

12            sausage红肠          5

13              bacon火腿         10

14      Chicken salad鸡肉         31

15      dried squids鱿鱼干         19

16         dried fish鱼干          4

17  mutton shashlik烤羊肉串          4

18        beef jerky牛肉干         18

19    Minced chicken鸡肉泥         21

20       pulled pork手撕肉          6

21          sirloin牛里脊肉          6

22  mutton shashlik烤羊肉串          0

23      Chicken salad鸡肉         17

24    Minced chicken鸡肉泥         46

25              bacon火腿         22

26            sausage红肠          4

27       pulled pork手撕肉          2

28          sirloin牛里脊肉          4

29        beef jerky牛肉干         67

30  mutton shashlik烤羊肉串          3

31      Chicken salad鸡肉         15

32    Minced chicken鸡肉泥         43

33      dried squids鱿鱼干          3

34         dried fish鱼干         10

35            sausage红肠         42

36              bacon火腿         35

37            sausage红肠          5

38              bacon火腿          2

39      Chicken salad鸡肉         18

40      dried squids鱿鱼干          0

41         dried fish鱼干         19

42  mutton shashlik烤羊肉串          4

43        beef jerky牛肉干         20

44          sirloin牛里脊肉         38

45  mutton shashlik烤羊肉串         14

46      Chicken salad鸡肉          1

47    Minced chicken鸡肉泥         46

 

发现数据框‘food’列中有很多重复。如果我们给数据假如新列,并要求新列与food存在某种逻辑关系。例如给‘food’标记食材来源。

 

通过map函数可以很轻松实现上面的要求。

map的实质是把一种映射法则应用到所制定的列上,比如这里的‘food’列。这种映射法则通常通过字典与函数来体现。即map(lambda x:meat_source[x]

即从food列中取值与meat_source的键对应,返回该键对应的值,animal的值即为对应的值

DataFrame中对应的是applymap()函数,DataFrame还有apply()函数

 


meat_source={'bacon火腿':'猪','sausage红肠':'猪','pulled pork手撕肉':'猪','sirloin牛里脊肉':'牛','beef jerky牛肉干':'牛','mutton shashlik烤羊肉串':'羊','chicken salad鸡肉':'鸡','minced chicken鸡肉泥':'鸡','dried squids鱿鱼干':'鱿鱼','dried fish鱼干':'鱼'}

data_1['animal']=data_1['food'].map(meat_source)

print(data_1)

                   food  weight(g) animal

0               bacon火腿         47      猪

1             sausage红肠         48      猪

2        pulled pork手撕肉          5      猪

3           sirloin牛里脊肉         21      牛

4         beef jerky牛肉干         35      牛

5   mutton shashlik烤羊肉串         23      羊

6       Chicken salad鸡肉         11    NaN

7     Minced chicken鸡肉泥          8    NaN

8       dried squids鱿鱼干          3     鱿鱼

9          dried fish鱼干         13      鱼

10            sausage红肠         20      猪

11              bacon火腿         26      猪

12            sausage红肠          5      猪

13              bacon火腿         10      猪

14      Chicken salad鸡肉         31    NaN

15      dried squids鱿鱼干         19     鱿鱼

16         dried fish鱼干          4      鱼

17  mutton shashlik烤羊肉串          4      羊

18        beef jerky牛肉干         18      牛

19    Minced chicken鸡肉泥         21    NaN

20       pulled pork手撕肉          6      猪

21          sirloin牛里脊肉          6      牛

22  mutton shashlik烤羊肉串          0      羊

23      Chicken salad鸡肉         17    NaN

24    Minced chicken鸡肉泥         46    NaN

25              bacon火腿         22      猪

26            sausage红肠          4      猪

27       pulled pork手撕肉          2      猪

28          sirloin牛里脊肉          4      牛

29        beef jerky牛肉干         67      牛

30  mutton shashlik烤羊肉串          3      羊

31      Chicken salad鸡肉         15    NaN

32    Minced chicken鸡肉泥         43    NaN

33      dried squids鱿鱼干          3     鱿鱼

34         dried fish鱼干         10      鱼

35            sausage红肠         42      猪

36              bacon火腿         35      猪

37            sausage红肠          5      猪

38              bacon火腿          2      猪

39      Chicken salad鸡肉         18    NaN

40      dried squids鱿鱼干          0     鱿鱼

41         dried fish鱼干         19      鱼

42  mutton shashlik烤羊肉串          4      羊

43        beef jerky牛肉干         20      牛

44          sirloin牛里脊肉         38      牛

45  mutton shashlik烤羊肉串         14      羊

46      Chicken salad鸡肉          1    NaN

47    Minced chicken鸡肉泥         46    NaN

 

发现‘animal’列有空值。原因是meat_source中的键与‘food’列值不完全一致。

通过map(str.lower)把food列中的值的第一个字母转变成小写,这样meat_source中的键与‘food’列值完全一致,这样Nan会自动消失。

 


data_1['animal']=data_1['food'].map(str.lower).map(meat_source)

或data_1['animal']=data_1['food'].map(lambda x:meat_source[x.lower()])

print(data_1)

                   food  weight(g) animal

0               bacon火腿         47      猪

1             sausage红肠         48      猪

2        pulled pork手撕肉          5      猪

3           sirloin牛里脊肉         21      牛

4         beef jerky牛肉干         35      牛

5   mutton shashlik烤羊肉串         23      羊

6       Chicken salad鸡肉         11      鸡

7     Minced chicken鸡肉泥          8      鸡

8       dried squids鱿鱼干          3     鱿鱼

9          dried fish鱼干         13      鱼

10            sausage红肠         20      猪

11              bacon火腿         26      猪

12            sausage红肠          5      猪

13              bacon火腿         10      猪

14      Chicken salad鸡肉         31      鸡

15      dried squids鱿鱼干         19     鱿鱼

16         dried fish鱼干          4      鱼

17  mutton shashlik烤羊肉串          4      羊

18        beef jerky牛肉干         18      牛

19    Minced chicken鸡肉泥         21      鸡

20       pulled pork手撕肉          6      猪

21          sirloin牛里脊肉          6      牛

22  mutton shashlik烤羊肉串          0      羊

23      Chicken salad鸡肉         17      鸡

24    Minced chicken鸡肉泥         46      鸡

25              bacon火腿         22      猪

26            sausage红肠          4      猪

27       pulled pork手撕肉          2      猪

28          sirloin牛里脊肉          4      牛

29        beef jerky牛肉干         67      牛

30  mutton shashlik烤羊肉串          3      羊

31      Chicken salad鸡肉         15      鸡

32    Minced chicken鸡肉泥         43      鸡

33      dried squids鱿鱼干          3     鱿鱼

34         dried fish鱼干         10      鱼

35            sausage红肠         42      猪

36              bacon火腿         35      猪

37            sausage红肠          5      猪

38              bacon火腿          2      猪

39      Chicken salad鸡肉         18      鸡

40      dried squids鱿鱼干          0     鱿鱼

41         dried fish鱼干         19      鱼

42  mutton shashlik烤羊肉串          4      羊

43        beef jerky牛肉干         20      牛

44          sirloin牛里脊肉         38      牛

45  mutton shashlik烤羊肉串         14      羊

46      Chicken salad鸡肉          1      鸡

47    Minced chicken鸡肉泥         46      鸡

 

添加amount=weigh×price项,看看就行,不懂也罢

 


product=['bacon火腿','sausage红肠','pulled pork手撕肉','sirloin牛里脊肉','beef jerky牛肉干','mutton shashlik烤羊肉串',

         'Chicken salad鸡肉','Minced chicken鸡肉泥','dried squids鱿鱼干','dried fish鱼干']

price=[1,1.5,2,2.3,2.6,3.6,2,2.6,8,9.9]

get_Data={k:p*i for i,k in zip(list(data_1['weight(g)'].values),range(len(data_1['weight(g)'])))\

          for m,p in zip(product,price)\

          if m==data_1['food'][k]}

data_1['amount']=data_1.index.map(get_Data)

print(data_1.reindex(columns=['food','weight(g)','animal','amount']))

                   food  weight(g) animal  amount

0               bacon火腿         47      猪    47.0

1             sausage红肠         48      猪    72.0

2        pulled pork手撕肉          5      猪    10.0

3           sirloin牛里脊肉         21      牛    48.3

4         beef jerky牛肉干         35      牛    91.0

5   mutton shashlik烤羊肉串         23      羊    82.8

6       Chicken salad鸡肉         11      鸡    22.0

7     Minced chicken鸡肉泥          8      鸡    20.8

8       dried squids鱿鱼干          3     鱿鱼    24.0

9          dried fish鱼干         13      鱼   128.7

10            sausage红肠         20      猪    30.0

11              bacon火腿         26      猪    26.0

12            sausage红肠          5      猪     7.5

13              bacon火腿         10      猪    10.0

14      Chicken salad鸡肉         31      鸡    62.0

15      dried squids鱿鱼干         19     鱿鱼   152.0

16         dried fish鱼干          4      鱼    39.6

17  mutton shashlik烤羊肉串          4      羊    14.4

18        beef jerky牛肉干         18      牛    46.8

19    Minced chicken鸡肉泥         21      鸡    54.6

20       pulled pork手撕肉          6      猪    12.0

21          sirloin牛里脊肉          6      牛    13.8

22  mutton shashlik烤羊肉串          0      羊     0.0

23      Chicken salad鸡肉         17      鸡    34.0

24    Minced chicken鸡肉泥         46      鸡   119.6

25              bacon火腿         22      猪    22.0

26            sausage红肠          4      猪     6.0

27       pulled pork手撕肉          2      猪     4.0

28          sirloin牛里脊肉          4      牛     9.2

29        beef jerky牛肉干         67      牛   174.2

30  mutton shashlik烤羊肉串          3      羊    10.8

31      Chicken salad鸡肉         15      鸡    30.0

32    Minced chicken鸡肉泥         43      鸡   111.8

33      dried squids鱿鱼干          3     鱿鱼    24.0

34         dried fish鱼干         10      鱼    99.0

35            sausage红肠         42      猪    63.0

36              bacon火腿         35      猪    35.0

37            sausage红肠          5      猪     7.5

38              bacon火腿          2      猪     2.0

39      Chicken salad鸡肉         18      鸡    36.0

40      dried squids鱿鱼干          0     鱿鱼     0.0

41         dried fish鱼干         19      鱼   188.1

42  mutton shashlik烤羊肉串          4      羊    14.4

43        beef jerky牛肉干         20      牛    52.0

44          sirloin牛里脊肉         38      牛    87.4

45  mutton shashlik烤羊肉串         14      羊    50.4

46      Chicken salad鸡肉          1      鸡     2.0

47    Minced chicken鸡肉泥         46      鸡   119.6