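In today's Python exercise, we will build a one-hot encoding algorithm that converts categorical data into a numeric format.
We need to do this because machine learning algorithms work on numbers, not strings.
What is one-hot encoding?
One-hot encoding represents categorical information in binary form, with the constraint that exactly one digit of the binary number is set to 1. That is where the name comes from: at any time, only one bit is "hot" (ON).
The categorical data we are talking about here is the kind where order does not apply (nominal data).
If the category has a natural order, for example the days of the week (Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday), then you don't need one-hot encoding. You can simply assign each day an integer, starting from zero.
In general, assigning ordinal values to a nominal category (for example, cat or dog) is a bad idea, because the machine learning algorithm will assume that the category has a natural order.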
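As a small illustration of the difference, the sketch below maps an ordered category to integers and then shows what goes wrong when the same trick is applied to a nominal one. The names `days`, `day_to_int`, `pets`, and `pet_to_int` are made up for this example.

```python
# Ordered category: the integer codes preserve a meaningful order.
days = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
day_to_int = {day: i for i, day in enumerate(days)}
print(day_to_int["Sunday"], day_to_int["Wednesday"])  # 0 3

# Nominal category: integer codes invent an order that does not exist.
pets = ["cat", "dog"]
pet_to_int = {pet: i for i, pet in enumerate(pets)}
# "dog > cat" now holds numerically, although it means nothing.
print(pet_to_int["dog"] > pet_to_int["cat"])  # True
```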
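In this article, I will show you several ways to create one-hot encodings for the different kinds of data you may encounter.
First, let's look at a simple one-hot encoding example.
The number of items in the category gives us the length of our binary number.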
categories=["cat", "fox", "badger", "person"]
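As you can see, this category is nominal, meaning there is no natural order that could be applied. That makes it a good candidate for one-hot encoding.
To apply the algorithm, we need to count the number of elements in the category. That count is the length of our binary number.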
length_of_binary_number = len(categories)
print(length_of_binary_number)
4
Manual one-hot encoding
To represent this category with one-hot encoding, we can simply define the code for each item by hand.
one_hot_encodings = dict()
one_hot_encodings["cat"] = 0b1000
one_hot_encodings["fox"] = 0b0100
one_hot_encodings["badger"] = 0b0010
one_hot_encodings["person"] = 0b0001
print(one_hot_encodings)
{'cat': 8, 'fox': 4, 'badger': 2, 'person': 1}
You'll notice that we didn't assign 0b0000 to any element. That is intentional: a one-hot code must have exactly one bit set.
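The manual table can also be generated with a bit shift. This is only a sketch; the names `n` and `one_hot` are mine, and the first item gets the most significant bit so that the result matches the hand-written values.

```python
categories = ["cat", "fox", "badger", "person"]
n = len(categories)

# Item i gets a 1 shifted into position (n - 1 - i), counted from the right.
one_hot = {name: 1 << (n - 1 - i) for i, name in enumerate(categories)}

for name, code in one_hot.items():
    # Pad the binary representation to n digits
    print(name, format(code, f"0{n}b"))

# Every code has exactly one bit set, so 0b0000 can never appear.
assert all(bin(code).count("1") == 1 for code in one_hot.values())
```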
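One-hot encoding for large categories
For categories with only a few elements, encoding by hand with a binary number works fine. But when you deal with large categories containing thousands of items, a single fixed-width binary number cannot represent every possible item.
Consider, for example, a 32-bit integer, which is how computers commonly store integers.
One-hot encoding allows us to use only one bit per item in a category. So if we are restricted to a 32-bit number, we can encode at most 32 elements of a category!
To represent categories with a large number of items, we need to use arrays.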
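As a rough sketch of the array approach, each item becomes a vector with one slot per item in the category, so the width is no longer tied to the size of a machine integer. The helper `one_hot_vector` and the lookup `index_of` are hypothetical names used only here.

```python
import numpy as np

categories = ["cat", "fox", "badger", "person"]
index_of = {name: i for i, name in enumerate(categories)}

def one_hot_vector(name):
    """Return a 1-D array with a single 1 at the item's index."""
    vector = np.zeros(len(categories), dtype=np.uint8)
    vector[index_of[name]] = 1
    return vector

print(one_hot_vector("badger"))  # [0 0 1 0]
```

The same function works unchanged for a category with thousands of items; only the length of the vector grows.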
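Encoding the words of the Bible using one-hot encoding
In this exercise, we will use a NumPy array to store one-hot encoded features, produced by finding every distinct word in the English text of Genesis.
First, we download the text of Genesis using the requests HTTP library.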
import requests
r = requests.get("http://www.stewartonbibleschool.org/bible/text/genesis.txt")
# Check that the text was retrieved successfully
if r.status_code == 200:
    bible_genesis_text = r.text
    print(bible_genesis_text[0:2000])
The First Book of Moses called
GENESIS
1:1: In the beginning God created the heaven and the earth.
1:2: And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
1:3: And God said, Let there be light: and there was light.
1:4: And God saw the light, that it was good: and God divided the light from the darkness.
1:5: And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.
1:6: And God said, Let there be a firmament in the midst of the waters, and let it divide the waters from the waters.
1:7: And God made the firmament, and divided the waters which were under the firmament from the waters which were above the firmament: and it was so.
1:8: And God called the firmament Heaven. And the evening and the morning were the second day.
1:9: And God said, Let the waters under the heaven be gathered together unto one place, and let the dry land appear: and it was so.
1:10: And God called the dry land Earth; and the gathering together of the waters called he Seas: and God saw that it was good.
1:11: And God said, Let the earth bring forth grass, the herb yielding seed, and the fruit tree yielding fruit after his kind, whose seed is in itself, upon the earth: and it was so.
1:12: And the earth brought forth grass, and herb yielding seed after his kind, and the tree yielding fruit, whose seed was in itself, after his kind: and God saw that it was good.
1:13: And the evening and the morning were the third day.
1:14: And God said, Let there be lights in the firmament of the heaven to divide the day from the night; and let them be for signs, and for seasons, and for days, and years:
1:15: And let them be for lights in the firmament of the heaven to give light upon the earth: and it was so.
1:16: And God made two great lights; the greater light to rule the day, and the lesser light to rule the night: he made the stars also.
1:17: And
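Removing verse numbers using Python regular expressions
We are going to do some very light data cleaning. This is something you should do in any data science project.
Now we need to go through every verse in Genesis and carefully discard the verse numbers. We don't want those numbers to end up in the one-hot encoding.
Because the verse numbers vary in length, we can't just remove a fixed number of characters from each line.
This is a good place to use a simple regular expression.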
import re
lines = bible_genesis_text.splitlines()
verses_without_number = ""
for line in lines:
    # Remove the "chapter:verse:" prefix, e.g. "1:10:"
    verse_without_number = re.sub(r"\d+:\d+:", "", line)
    verses_without_number += verse_without_number + "\n"
print(verses_without_number[0:2000])
The First Book of Moses called
GENESIS
In the beginning God created the heaven and the earth.
And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
And God said, Let there be light: and there was light.
And God saw the light, that it was good: and God divided the light from the darkness.
And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.
And God said, Let there be a firmament in the midst of the waters, and let it divide the waters from the waters.
And God made the firmament, and divided the waters which were under the firmament from the waters which were above the firmament: and it was so.
And God called the firmament Heaven. And the evening and the morning were the second day.
And God said, Let the waters under the heaven be gathered together unto one place, and let the dry land appear: and it was so.
And God called the dry land Earth; and the gathering together of the waters called he Seas: and God saw that it was good.
And God said, Let the earth bring forth grass, the herb yielding seed, and the fruit tree yielding fruit after his kind, whose seed is in itself, upon the earth: and it was so.
And the earth brought forth grass, and herb yielding seed after his kind, and the tree yielding fruit, whose seed was in itself, after his kind: and God saw that it was good.
And the evening and the morning were the third day.
And God said, Let there be lights in the firmament of the heaven to divide the day from the night; and let them be for signs, and for seasons, and for days, and years:
And let them be for lights in the firmament of the heaven to give light upon the earth: and it was so.
And God made two great lights; the greater light to rule the day, and the lesser light to rule the night: he made the stars also.
And God set them in the firmament of the heaven to give light upon the earth,
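To try the pattern in isolation, here is a small sketch that runs on two verses copied from the text above. It is a variation on the loop, not the same code: it anchors the pattern to the start of each line with `re.MULTILINE` and also removes the space after the prefix. The names `sample`, `verse_prefix`, and `cleaned` are mine.

```python
import re

sample = (
    "1:3: And God said, Let there be light: and there was light.\n"
    "1:13: And the evening and the morning were the third day.\n"
)

# ^\d+:\d+: matches "chapter:verse:" at the start of a line;
# \s* also swallows the whitespace that follows the prefix.
verse_prefix = re.compile(r"^\d+:\d+:\s*", flags=re.MULTILINE)
cleaned = verse_prefix.sub("", sample)
print(cleaned)
```

Note that the colon in "light:" is left alone, because the pattern requires digits on both sides of the first colon.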
Creating a NumPy array from the text of Genesis and one-hot encoding it
print(verses_without_number.split()[0:25])
['The', 'First', 'Book', 'of', 'Moses', 'called', 'GENESIS', 'In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth.', 'And', 'the', 'earth', 'was', 'without', 'form,', 'and', 'void;']
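Notice that we still have punctuation attached to some of the words. We don't want that!
Luckily, we don't need to come up with a regular expression for this. Python has a trick up its sleeve: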
import string
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
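By importing the string module, we get access to a handy attribute that contains all the punctuation characters you would normally find in text.
Using str.strip(), we can remove punctuation from the start and end of each word before applying one-hot encoding.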
for word in verses_without_number.split()[0:25]:
    word_without_punctuation = word.strip(string.punctuation)
    print(word_without_punctuation)
The
First
Book
of
Moses
called
GENESIS
In
the
beginning
God
created
the
heaven
and
the
earth
And
the
earth
was
without
form
and
void
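One detail worth checking is that `str.strip()` only touches the two ends of a string. The sketch below runs it on a few tokens; "Abel-mizraim" appears later in this article, while "Abraham's" and "(light)" are illustrative tokens I picked to show the behavior.

```python
import string

tokens = ["earth.", "void;", "Abel-mizraim", "Abraham's", "(light)"]
stripped = [token.strip(string.punctuation) for token in tokens]

for before, after in zip(tokens, stripped):
    print(before, "->", after)
```

Punctuation inside a word, such as a hyphen or an apostrophe, survives, so hyphenated names stay in one piece.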
Generating the one-hot encoding using NumPy arrays
Now that our data is clean, we are ready to generate our one-hot encoding.
words_index = {}
words_in_genesis_list = verses_without_number.split()
for word in words_in_genesis_list:
    word_without_punctuation = word.strip(string.punctuation)
    # Check the cleaned word, so each distinct word keeps the index it got first
    if word_without_punctuation not in words_index:
        words_index[word_without_punctuation] = len(words_index)
list(words_index.items())[0:10]
[('The', 0), ('First', 1), ('Book', 2), ('of', 3), ('Moses', 4), ('called', 5), ('GENESIS', 6), ('In', 7), ('the', 8), ('beginning', 9)]
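The same kind of first-seen index can be built more compactly. This is a sketch on a single verse rather than the full text; it relies on `dict.fromkeys` keeping insertion order, and the names `tokens`, `unique_words`, and `word_to_index` are mine.

```python
import string

tokens = "In the beginning God created the heaven and the earth.".split()
cleaned = [t.strip(string.punctuation) for t in tokens]

# dict.fromkeys keeps only the first occurrence of each word, in order
unique_words = list(dict.fromkeys(cleaned))
word_to_index = {word: i for i, word in enumerate(unique_words)}
print(word_to_index)
```

Each distinct word gets exactly one index, and no index is skipped or shared.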
Now it's time to generate the one-hot encoding using NumPy.
import numpy as np
result = np.zeros((len(words_in_genesis_list), len(words_index)))
for index, word in enumerate(words_in_genesis_list):
    word_without_punctuation = word.strip(string.punctuation)
    hot_index = words_index[word_without_punctuation]
    # Indices already start at 0, so hot_index is used directly as the column
    result[index][hot_index] = 1
print(result.shape)
result[0:100]
(38267, 2670)
array([[1., 0., 0., ..., 0., 0., 0.],
[0., 1., 0., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
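The row-by-row loop can also be replaced by a single indexed assignment. Below is a minimal sketch on a five-word list; `words`, `index`, `hot`, and `encoded` are illustrative names, not part of the code above.

```python
import numpy as np

words = ["the", "earth", "and", "the", "heaven"]
# First-seen index: the -> 0, earth -> 1, and -> 2, heaven -> 3
index = {w: i for i, w in enumerate(dict.fromkeys(words))}

hot = np.array([index[w] for w in words])
encoded = np.zeros((len(words), len(index)))
# For every row r, set column hot[r] to 1 in one vectorized step
encoded[np.arange(len(words)), hot] = 1
print(encoded)
```

Each row sums to 1, which is a quick sanity check that the result really is one-hot.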
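One-hot encoding using Pandas
Because one-hot encoding is used so often in data science, you will find it already implemented in the most popular data science libraries.
In the Pandas library, you can use the get_dummies() function to apply one-hot encoding to a column of a Pandas DataFrame.
First, let's create a Pandas DataFrame with a single column from our existing data set.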
import pandas as pd
# we use a lambda function and map to strip all the punctuation
words_in_genesis_list = list(map( lambda x: x.strip(string.punctuation), words_in_genesis_list))
df = pd.DataFrame(words_in_genesis_list, columns=["Word"])
df
| | Word |
|---|---|
| 0 | The |
| 1 | First |
| 2 | Book |
| 3 | of |
| 4 | Moses |
| ... | ... |
| 38262 | in |
| 38263 | a |
| 38264 | coffin |
| 38265 | in |
| 38266 | Egypt |
38267 rows × 1 columns
To apply one-hot encoding, all we need to do is the following.
pd.get_dummies(df["Word"], prefix="word")
| | word_A | word_Abel | word_Abel-mizraim | word_Abida | word_Abide | word_Abimael | word_Abimelech | word_Abimelech's | word_Abraham | word_Abraham's | ... | word_yoke | word_yonder | word_you | word_young | word_younger | word_youngest | word_your | word_yours | word_yourselves | word_youth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38262 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38263 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38264 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38265 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38266 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
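38267 rows × 2670 columns
We have added a huge number of columns to the Pandas table. Clearly, this is not meant for human consumption. But for a machine learning algorithm, it is exactly what is needed to make some sense of the data.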
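On a tiny column the result is easier to read. The sketch below uses four words instead of the whole book; `df_small` and `dummies` are names I made up, and `dtype=int` is passed so the cells print as 0 and 1 regardless of the Pandas version's default.

```python
import pandas as pd

df_small = pd.DataFrame({"Word": ["God", "said", "Let", "God"]})
# One column per distinct word, named with the given prefix
dummies = pd.get_dummies(df_small["Word"], prefix="word", dtype=int)
print(dummies)
```

The columns come out sorted by word, not in order of first appearance, which is why the full table starts at word_A.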
One-hot encoding with Sklearn
import sklearn.preprocessing as preprocessing
labelEncoder = preprocessing.LabelEncoder()
sk_words_index = labelEncoder.fit_transform(words_in_genesis_list)
print(sk_words_index)
onehotEnc = preprocessing.OneHotEncoder()
onehotEnc.fit(sk_words_index.reshape(-1, 1))
one_hot_encoded_words = onehotEnc.transform(sk_words_index.reshape(-1, 1))
print("The One-Hot-Encoded verses")
print(one_hot_encoded_words.toarray())
print(one_hot_encoded_words.shape)
[ 600 208 102 ... 973 1535 159]
The One-Hot-Encoded verses
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
(38267, 2670)
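As far as I know, recent scikit-learn versions (0.20 and later) let OneHotEncoder take string categories directly, so the LabelEncoder step may not be needed. A minimal sketch under that assumption, using a four-word sample and my own variable names:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature
words = np.array(["God", "said", "Let", "God"]).reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(words)  # returns a sparse matrix by default

print(encoder.categories_[0])
print(encoded.toarray())
```

The sparse result is worth keeping in mind for the full text: a dense 38267 × 2670 array stores over a hundred million values, nearly all of them zero.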
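One-hot encoding with Keras
Keras's one-hot encoding comes with some extra features. For example, by default all words are converted to lowercase, so that the same word in different cases is not treated as two different words. You can also specify the maximum number of words to keep when building the word_index, based on word frequency, which can be useful if you want to drop outliers. And you can customize the filter Keras uses to strip unwanted characters (punctuation, digits, and so on) from the text.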
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(
    lower=False,
    num_words=None
)
tokenizer.fit_on_texts(words_in_genesis_list)
#sequences = tokenizer.texts_to_sequences(verses_without_number_list)
one_hot_results = tokenizer.texts_to_matrix(words_in_genesis_list, mode='binary')
one_hot_results.shape
(38267, 2686)
Interestingly, the result from the Keras one-hot encoding is slightly different from the others. Why is that?
keras_word_index = tokenizer.word_index
for word in words_index:
    if word not in keras_word_index:
        print(word)
Tubal-cain
Hazar-maveth
El-paran
En-mishpat
Hazezon-tamar
Beer-lahai-roi
Beer-sheba
Jehovah-jireh
Kirjath-arba
Lahai-roi
Padan-aram
Jegar-sahadutha
El-elohe-Israel
El-beth-el
Allon-bachuth
Ben-oni
Baal-hanan
Zaphnath-paaneah
Poti-pherah
Abel-mizraim
Aha. It seems Keras treats words containing a hyphen (-) a bit differently.
keras_word_index = tokenizer.word_index
for word in keras_word_index:
    if word not in words_index:
        print(word)
Beer
sheba
aram
El
roi
Poti
pherah
cain
Lahai
Baal
hanan
Hazar
maveth
paran
En
mishpat
Hazezon
tamar
lahai
Jehovah
jireh
Kirjath
arba
Jegar
sahadutha
elohe
beth
el
Allon
bachuth
Ben
oni
Zaphnath
paaneah
mizraim
I think the culprit is the filters list, which strips out every "-". Let's remove "-" from the filters argument and see what happens.
tokenizer = Tokenizer(
    lower=False,
    num_words=None,
    filters='!"#$%&()*+,./:;<=>?@[\\]^_`{|}~\t\n',
    split=' '
)
tokenizer.fit_on_texts(words_in_genesis_list)
#sequences = tokenizer.texts_to_sequences(verses_without_number_list)
one_hot_results = tokenizer.texts_to_matrix(words_in_genesis_list, mode='binary')
one_hot_results.shape
(38267, 2671)
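To see why the filter string matters, here is a plain-Python approximation of a filter-then-split tokenizer. It is a sketch of the general idea, not the Keras source: every filter character is replaced by the split character, and the text is then split. The function `tokenize` and both filter variables are hypothetical names.

```python
def tokenize(text, filters, split=" "):
    """Replace every filter character with the split character, then split."""
    table = str.maketrans({ch: split for ch in filters})
    return [tok for tok in text.translate(table).split(split) if tok]

with_dash = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'  # filter set that includes '-'
without_dash = with_dash.replace("-", "")

print(tokenize("Beer-sheba", with_dash))     # ['Beer', 'sheba']
print(tokenize("Beer-sheba", without_dash))  # ['Beer-sheba']
```

With the hyphen in the filter set, one name turns into two tokens, which matches the extra words seen in the first comparison.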
It seems we still have one extra column in the Keras result (2671 vs 2670).
keras_word_index = tokenizer.word_index
for word in keras_word_index:
    if word not in words_index:
        print(word)
for word in words_index:
    if word not in keras_word_index:
        print(word)
print(len(keras_word_index))
print(len(words_index))
2670
2670
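The word indexes appear to be identical. So why is there an extra column?
The clue is in the Keras documentation: index 0 is reserved and never assigned to any word. As a result, the first column is never set to 1 for any word.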
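The effect of a reserved index can be reproduced without Keras. The sketch below only mimics the layout: word indices start at 1, so the matrix needs one more column than there are words. It numbers words in order of first appearance for simplicity, which is an assumption of this example and not how Keras orders its index. All names are mine.

```python
import numpy as np

words = ["God", "said", "Let", "God"]
# Indices start at 1, leaving 0 unused (mimicking the reserved index)
word_index = {w: i + 1 for i, w in enumerate(dict.fromkeys(words))}

matrix = np.zeros((len(words), len(word_index) + 1))
for row, w in enumerate(words):
    matrix[row, word_index[w]] = 1

print(matrix.shape)         # (4, 4): 3 distinct words + 1 reserved column
print(matrix[:, 0].sum())   # 0.0: column 0 is never set
trimmed = matrix[:, 1:]     # drop the reserved column
print(trimmed.shape)        # (4, 3)
```

If the extra column gets in the way, slicing it off as in `trimmed` gives a matrix whose width matches the number of distinct words.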