Removing non-numeric characters from a Pandas DataFrame


We have a set of data stored in the following format:

Accuracy 26.15%, error rate 0.00%, not classified 73.85%
Accuracy 29.68%, error rate 0.00%, not classified 70.32%
Accuracy 33.98%, error rate 0.00%, not classified 66.02%
Accuracy 35.34%, error rate 0.00%, not classified 64.66%
Accuracy 35.75%, error rate 0.00%, not classified 64.25%
Accuracy 37.51%, error rate 0.00%, not classified 62.49%
Accuracy 38.63%, error rate 0.00%, not classified 61.37%
Accuracy 40.81%, error rate 0.00%, not classified 59.19%
Accuracy 41.22%, error rate 0.00%, not classified 58.78%
Accuracy 41.99%, error rate 0.00%, not classified 58.01%
Accuracy 42.34%, error rate 0.00%, not classified 57.66%
Accuracy 42.40%, error rate 0.00%, not classified 57.60%
Accuracy 43.05%, error rate 0.00%, not classified 56.95%
Accuracy 44.29%, error rate 0.00%, not classified 55.71%
Accuracy 44.35%, error rate 0.00%, not classified 55.65%
Accuracy 44.76%, error rate 0.00%, not classified 55.24%
Accuracy 45.29%, error rate 0.00%, not classified 54.71%
Accuracy 45.35%, error rate 0.00%, not classified 54.65%
Accuracy 95.35%, error rate 4.24%, not classified 0.41%
Accuracy 95.76%, error rate 4.24%, not classified 0.00%
Stats on test data
Accuracy 94.74%, error rate 5.26%, not classified 0.00%

We want to load this data into a Pandas DataFrame with the columns named 'Accuracy', 'Error rate' and 'Not classified'. We also want to strip the non-numeric characters from the data fields.

2. Solution

Method 1: using pandas.DataFrame.replace()

import pandas as pd

df = pd.read_csv("test.csv", names=['Accuracy', 'Error rate', 'Not classified'])

df.replace(r'[a-zA-Z%]', '', regex=True, inplace=True)

# If the ultimate goal is to convert those values to numbers, run:
df = df.apply(pd.to_numeric)

# or do it column by column
df['Accuracy'] = pd.to_numeric(df['Accuracy'])  # and so on
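
Method 1 can be tried without a file on disk. The sketch below is only an illustration under a few assumptions: it reads two sample rows from an in-memory buffer instead of test.csv, and it swaps in the pattern [^0-9.] (drop everything that is not a digit or a decimal point) so that leftover spaces are removed as well. The variable names raw, cleaned and numeric are made up for the example.

```python
import io

import pandas as pd

# Two sample rows in the same format as the data above (stand-in for test.csv).
raw = io.StringIO(
    "Accuracy 26.15%, error rate 0.00%, not classified 73.85%\n"
    "Accuracy 95.35%, error rate 4.24%, not classified 0.41%\n"
)

df = pd.read_csv(raw, names=["Accuracy", "Error rate", "Not classified"])

# Assumption: a stricter pattern than [a-zA-Z%]; it removes every character
# that is not a digit or a dot, including spaces and the percent sign.
cleaned = df.replace(r"[^0-9.]", "", regex=True)

# Convert the cleaned strings to floats, one column at a time.
numeric = cleaned.apply(pd.to_numeric)

print(numeric)
print(numeric.dtypes)
```

Note that replace() and apply() both return new frames here, so the results are assigned to new names instead of relying on inplace=True.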

Method 2: using Series.str.replace()

import pandas as pd

df = pd.read_csv("test.csv", names=['Accuracy', 'Error rate', 'Not classified'])

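# Strip letters and the percent sign from each column.
# regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching.
df['Accuracy'] = df['Accuracy'].str.replace(r"[a-zA-Z%]", '', regex=True)
df['Error rate'] = df['Error rate'].str.replace(r"[a-zA-Z%]", '', regex=True)
df['Not classified'] = df['Not classified'].str.replace(r"[a-zA-Z%]", '', regex=True)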

print(df)

Demo:

repl.it/@SanyAhmed/…

Output:

   Accuracy  Error rate  Not classified
0     26.15        0.00           73.85
1     29.68        0.00           70.32
2     33.98        0.00           66.02
3     35.34        0.00           64.66
4     35.75        0.00           64.25
5     37.51        0.00           62.49
6     38.63        0.00           61.37
7     40.81        0.00           59.19
8     41.22        0.00           58.78
9     41.99        0.00           58.01
10    42.34        0.00           57.66
11    42.40        0.00           57.60
12    43.05        0.00           56.95
13    44.29        0.00           55.71
14    44.35        0.00           55.65
15    44.76        0.00           55.24
16    45.29        0.00           54.71
17    45.35        0.00           54.65
18    95.35        4.24            0.41
19    95.76        4.24            0.00
20    94.74        5.26            0.00
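
The sample data also contains a "Stats on test data" line that does not appear in the output above. One possible way to handle such lines, sketched here as an assumption and not as the method used in the demo, is to extract the first number from each field with Series.str.extract() and then drop the rows where nothing was found. The in-memory buffer and the variable names are made up for the example.

```python
import io

import pandas as pd

# Sample rows including the non-data line (stand-in for test.csv).
raw = io.StringIO(
    "Accuracy 95.76%, error rate 4.24%, not classified 0.00%\n"
    "Stats on test data\n"
    "Accuracy 94.74%, error rate 5.26%, not classified 0.00%\n"
)

columns = ["Accuracy", "Error rate", "Not classified"]
df = pd.read_csv(raw, names=columns)

for col in columns:
    # Capture the first integer or decimal number in each field;
    # fields without a number (or missing fields) become NaN.
    extracted = df[col].str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    df[col] = pd.to_numeric(extracted)

# Drop rows such as "Stats on test data" that produced no numbers.
df = df.dropna().reset_index(drop=True)

print(df)
```

Extracting the number, instead of deleting the characters around it, keeps the cleaning step independent of which labels or symbols surround the value.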