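DataFrame is one of the core data structures in pandas and one of the structures used most often in data analysis. Once you are comfortable with DataFrame, you have the basic skills needed to start learning data analysis.
1. What is the pandas DataFrame structure
- A DataFrame is a tabular data structure with row labels and column labels. Data is laid out in rows and columns: each column represents one attribute of an entity, and each row holds the data for one entity.
- It is also called a heterogeneous data table. Heterogeneous means that each column in the table can have a different data type, such as string, integer, or float.
- Each row of a DataFrame can be viewed as a Series, with the difference that the DataFrame attaches a column label to every value in the row. In that sense a DataFrame is built on top of Series. Its layout resembles an Excel spreadsheet.
- Like a Series, a DataFrame has its own row label index. By default this is an "implicit index" that starts at 0 and increases by 1, with one label for each row of data. You can also set the row labels yourself with an "explicit index".
- Every value in a DataFrame can be modified, rows and columns can be added or deleted, there are two label axes (row labels and column labels), and arithmetic can be performed on rows and columns.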
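The properties listed above can be checked with a short sketch. The column names and values here are made up for illustration and are not part of the original examples.

```python
import pandas as pd

# Hypothetical table: each column holds a different kind of data.
df = pd.DataFrame(
    {"Name": ["Tom", "Bob"], "Age": [30, 25], "Score": [88.5, 92.0]}
)

# Heterogeneous columns: a string column, an integer column, a float column.
print(df.dtypes)

# A single row, selected by its implicit label 0, comes back as a Series
# whose index is the list of column labels.
first_row = df.loc[0]
print(type(first_row).__name__)  # Series
print(list(first_row.index))     # ['Name', 'Age', 'Score']

# Arithmetic applies to a whole column at once.
df["Age_next_year"] = df["Age"] + 1
print(df)
```

The row is taken before the extra column is added, so its index contains only the three original column labels.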
2. How to create a pandas DataFrame object
- The syntax for creating a DataFrame object is as follows:
```
# import the pandas library
import pandas as pd

# call the DataFrame() constructor to create a pandas DataFrame object.
pd.DataFrame(data, index, columns, dtype, copy)

data    : The input data. It can be an ndarray, Series, list, dict, scalar, or another DataFrame object.
index   : Row labels. If no index is passed, the default row labels are 0, 1, ..., n-1, where n is the number of rows.
columns : Column labels. If no columns value is passed, the default column labels are 0, 1, ..., n-1.
dtype   : The data type to apply to the columns.
copy    : Whether to copy the input data. False means the input is not copied; the default depends on the pandas version and the type of the input.
```
- Create an empty DataFrame object.
```
>>> import pandas as pd   # import the pandas module.
>>>
>>> df = pd.DataFrame()   # create an empty DataFrame object.
>>>
>>> print(df)             # print out the empty DataFrame object.
Empty DataFrame
Columns: []
Index: []
```
- Create a DataFrame from a one-dimensional list.
```
>>> import pandas as pd
>>>
>>> data = ['python', 100, 'javascript', 'java', 199]   # create a one-dimensional list.
>>>
>>> df = pd.DataFrame(data)   # create a DataFrame object from the list above.
>>>
>>> print(df)   # print out the DataFrame object.
            0
0      python
1         100
2  javascript
3        java
4         199
>>>
```
- Create a DataFrame object from a nested list.
```
>>> import pandas as pd
>>>
>>> # define a two-dimensional list that holds employee data (name, title, salary).
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]]
>>>
>>> columns_array = ['Name','Title','Salary']
>>>
>>> # create the DataFrame object from the data above and use columns_array as the column labels.
>>> df = pd.DataFrame(data,columns = columns_array)
>>>
>>> print(df)
    Name      Title  Salary
0    Tom  Developer   10000
1    Bob         QA   12000
2  Jerry    Manager   13000
>>>
```
- Specify float as the data type of the numeric elements. The string columns cannot be cast to float, so this call emits a FutureWarning saying that a future version will either cast every column or raise an error.
```
>>> import pandas as pd
>>>
>>> # define a two-dimensional list that holds employee data (name, title, salary).
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]]
>>>
>>> columns_array = ['Name','Title','Salary']
>>>
>>> # create the DataFrame object, use columns_array as the column labels, and request the float data type.
>>> df = pd.DataFrame(data,columns = columns_array, dtype = float)
sys:1: FutureWarning: Could not cast to float64, falling back to object. This behavior is deprecated. In a future version, when a dtype is passed to 'DataFrame', either all columns will be cast to that dtype, or a TypeError will be raised
>>>
>>> print(df)
    Name      Title   Salary
0    Tom  Developer  10000.0
1    Bob         QA  12000.0
2  Jerry    Manager  13000.0
>>>
```
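Because a single dtype passed to the constructor applies to every column, one possible alternative (not part of the original example) is to build the DataFrame first and then convert only the numeric column with astype(). A minimal sketch, reusing the employee data from above:

```python
import pandas as pd

data = [["Tom", "Developer", 10000], ["Bob", "QA", 12000], ["Jerry", "Manager", 13000]]
df = pd.DataFrame(data, columns=["Name", "Title", "Salary"])

# Convert only the numeric column; the string columns are left untouched.
df["Salary"] = df["Salary"].astype(float)

print(df["Salary"].dtype)  # float64
print(df)
```

This avoids asking pandas to cast the Name and Title strings to float at all.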
- Create a DataFrame from a Python dictionary object. In the dictionary, the values for every key must have the same length (that is, all value lists must be equally long). If an index is passed, its length must equal the length of the value lists. If no index is passed, the index defaults to range(n), where n is the length of the value lists.
```
>>> import pandas as pd
>>>
>>> # define a Python dictionary object: each key is a DataFrame column and each list value is that column's data.
>>> data = {'Name':['Tom', 'Jerry', 'Bill', 'Bob'], 'Title':['Developer', 'QA', 'PM', 'Manager'], 'Salary':[12800,13400,12900,14200]}
>>>
>>> df = pd.DataFrame(data)   # create a DataFrame object from the dictionary above.
>>>
>>> print(df)   # print out the DataFrame object.
    Name      Title  Salary
0    Tom  Developer   12800
1  Jerry         QA   13400
2   Bill         PM   12900
3    Bob    Manager   14200
>>>
>>> # The example above uses the default row labels, which are generated by range(n): 0, 1, 2, 3, one for each element in the lists.
>>>
>>> # We can also add custom row labels to the example, as shown below.
>>>
>>> import pandas as pd
>>>
>>> data = {'Name':['Tom', 'Jerry', 'Bill', 'Bob'], 'Title':['Developer', 'QA', 'PM', 'Manager'], 'Salary':[12800,13400,12900,14200]}
>>>
>>> index = ['employee_1', 'employee_2', 'employee_3', 'employee_4']   # define a Python list of row labels.
>>>
>>> df = pd.DataFrame(data, index = index)   # use the list above as the DataFrame row labels.
>>>
>>> print(df)
             Name      Title  Salary
employee_1    Tom  Developer   12800
employee_2  Jerry         QA   13400
employee_3   Bill         PM   12900
employee_4    Bob    Manager   14200
```
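The length rules described above can be seen by deliberately breaking them. The following sketch assumes nothing beyond the constructor itself; the variable names bad_data and good_data are made up for illustration.

```python
import pandas as pd

# 'Salary' has 3 values while the other columns have 4 (deliberately wrong).
bad_data = {
    "Name": ["Tom", "Jerry", "Bill", "Bob"],
    "Title": ["Developer", "QA", "PM", "Manager"],
    "Salary": [12800, 13400, 12900],
}

try:
    pd.DataFrame(bad_data)
    error_raised = False
except ValueError as exc:
    error_raised = True
    print("ValueError:", exc)

# An index whose length does not match the value lists is rejected as well.
good_data = {"Name": ["Tom", "Jerry"], "Salary": [12800, 13400]}
try:
    pd.DataFrame(good_data, index=["employee_1"])
    index_error_raised = False
except ValueError as exc:
    index_error_raised = True
    print("ValueError:", exc)
```

The exact wording of the error messages differs between pandas versions, so the sketch only relies on the exception type.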
- A list of dictionaries can be passed to the DataFrame constructor as input data. By default, the dictionary keys are used as the column names.
```
>>> import pandas as pd
>>>
>>> data_list_dict = [{'python': 90, 'java': 88},{'python': 99, 'java': 68, 'javascript': 100}]
>>>
>>> df = pd.DataFrame(data_list_dict)
>>>
>>> print(df)   # the first row has no 'javascript' key, so its 'javascript' value is NaN.
   python  java  javascript
0      90    88         NaN
1      99    68       100.0
>>>
```
- Create a DataFrame object from a list of dictionaries and supply both row and column labels.
```
>>> import pandas as pd
>>>
>>> data_list_dict = [{'python': 90, 'java': 88},{'python': 99, 'java': 68, 'javascript': 100}]
>>>
>>> # data_list_dict contains only 2 rows of data, so the row label list must also have 2 elements;
>>> # otherwise it will throw the ValueError: Shape of passed values is (2, 3), indices imply (3, 3).
>>> index_label = ['Tom', 'Jerry']
>>>
>>> column_label = ['python', 'java', 'javascript']
>>>
>>> df = pd.DataFrame(data_list_dict, index = index_label, columns = column_label)
>>>
>>> print(df)   # the first row has no 'javascript' key, so its 'javascript' value is NaN.
       python  java  javascript
Tom        90    88         NaN
Jerry      99    68       100.0
>>>
```
- You can also pass Series objects inside a dictionary to create a DataFrame object. The row labels of the result are the union of the indexes of all the Series in the dictionary, and positions with no matching label are filled with NaN.
```
>>> import pandas as pd
>>>
>>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'], index=['name_row_1', 'name_row_2', 'row_3', 'row_4'])
>>>
>>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'], index=['row_a', 'row_2', 'row_c', 'row_d'])
>>>
>>> series_3 = pd.Series([12800,13400,12900,14200], index=['row_', 'row_2', 'row_c', 'row_d'])
>>>
>>> data = {'Name':series_1, 'Title':series_2, 'Salary':series_3 }
>>>
>>> df = pd.DataFrame(data)
>>>
>>> print(df)
             Name      Title   Salary
name_row_1    Tom        NaN      NaN
name_row_2  Jerry        NaN      NaN
row_          NaN        NaN  12800.0
row_2         NaN         QA  13400.0
row_3        Bill        NaN      NaN
row_4         Bob        NaN      NaN
row_a         NaN  Developer      NaN
row_c         NaN         PM  12900.0
row_d         NaN    Manager  14200.0
```
2. How to query, add, and delete pandas DataFrame data
2.1 Manipulating a DataFrame object with column index labels
- Query column data: you can use a column index label to query the data of a DataFrame column. Here is an example.
```
>>> import pandas as pd
>>>
>>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'])
>>>
>>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'])
>>>
>>> series_3 = pd.Series([12800,13400,12900,14200])
>>>
>>> data = {'Name':series_1, 'Title':series_2, 'Salary':series_3 }
>>>
>>> print(data)
{'Name': 0      Tom
1    Jerry
2     Bill
3      Bob
dtype: object, 'Title': 0    Developer
1           QA
2           PM
3      Manager
dtype: object, 'Salary': 0    12800
1    13400
2    12900
3    14200
dtype: int64}
>>>
>>>
>>> df = pd.DataFrame(data)
>>>
>>> print(df)
    Name      Title  Salary
0    Tom  Developer   12800
1  Jerry         QA   13400
2   Bill         PM   12900
3    Bob    Manager   14200
>>>
>>> print(df['Name'])   # print out the 'Name' column.
0      Tom
1    Jerry
2     Bill
3      Bob
Name: Name, dtype: object
```
- Add column data: you can add a new column either by assigning to a column index label or by calling the DataFrame object's insert() method. Here is an example.
```
>>> import pandas as pd
>>>
>>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'])
>>>
>>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'])
>>>
>>> data = {'Name':series_1}
>>>
>>> df = pd.DataFrame(data)   # create the DataFrame object.
>>>
>>> print(df)
    Name
0    Tom
1  Jerry
2   Bill
3    Bob
>>>
>>> df['Title'] = series_2   # add the Title column.
>>>
>>> print(df)
    Name      Title
0    Tom  Developer
1  Jerry         QA
2   Bill         PM
3    Bob    Manager
>>>
>>> df['Name - Title'] = df['Name'] + ' - ' + df['Title']   # add a new column built from the Name and Title columns.
>>>
>>> print(df)
    Name      Title     Name - Title
0    Tom  Developer  Tom - Developer
1  Jerry         QA       Jerry - QA
2   Bill         PM        Bill - PM
3    Bob    Manager    Bob - Manager
>>>
>>> series_3 = pd.Series([12800,13400,12900,14200])   # define the Series object for the Salary column.
>>>
>>> df.insert(2, column = 'Salary', value = series_3)   # insert the Salary column at position 2 (as the third column).
>>>
>>> print(df)
    Name      Title  Salary     Name - Title
0    Tom  Developer   12800  Tom - Developer
1  Jerry         QA   13400       Jerry - QA
2   Bill         PM   12900        Bill - PM
3    Bob    Manager   14200    Bob - Manager
```
- Delete column data: you can delete a column of a DataFrame object with the Python del statement or with the DataFrame object's pop() method. Here is an example.
```
>>> import pandas as pd
>>>
>>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'])
>>>
>>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'])
>>>
>>> series_3 = pd.Series([12800,13400,12900,14200])
>>>
>>> data = {'Name':series_1, 'Title':series_2, 'Salary':series_3 }
>>>
>>> df = pd.DataFrame(data)   # create the DataFrame object.
>>>
>>> print(df)
    Name      Title  Salary
0    Tom  Developer   12800
1  Jerry         QA   13400
2   Bill         PM   12900
3    Bob    Manager   14200
>>>
>>> del df['Salary']   # delete the Salary column with the del statement.
>>>
>>> print(df)   # the Salary column has been removed from the original DataFrame object.
    Name      Title
0    Tom  Developer
1  Jerry         QA
2   Bill         PM
3    Bob    Manager
>>>
>>>
>>> df.pop('Title')   # delete the Title column with the pop() method, which returns the removed column.
0    Developer
1           QA
2           PM
3      Manager
Name: Title, dtype: object
>>>
>>> print(df)   # the Title column has been removed as well.
    Name
0    Tom
1  Jerry
2   Bill
3    Bob
>>>
```
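Both del and pop() modify the DataFrame in place. When the original object should stay unchanged, one option (not covered in the example above) is drop() with the columns argument, which returns a new object. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Tom", "Jerry"],
        "Title": ["Developer", "QA"],
        "Salary": [12800, 13400],
    }
)

# drop() returns a new object; df itself keeps all three columns.
without_salary = df.drop(columns=["Salary"])
print(list(without_salary.columns))  # ['Name', 'Title']
print(list(df.columns))              # ['Name', 'Title', 'Salary']

# pop() removes the column from df and hands it back as a Series.
titles = df.pop("Title")
print(titles.tolist())               # ['Developer', 'QA']
print(list(df.columns))              # ['Name', 'Salary']
```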
2.2 Manipulating a DataFrame object with row index labels
- Query row data: to query a row, pass its row label to the DataFrame object's loc attribute, or pass its integer position to the iloc attribute. Here is an example.
```
>>> import pandas as pd
>>>
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]]
>>>
>>> columns_array = ['Name','Title','Salary']
>>>
>>> row_index_label_arr = ['a', 'b', 'c']
>>>
>>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
>>>
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>>
>>>
>>> print(df.loc['a'])   # use the loc attribute to get one row by its row label.
Name            Tom
Title     Developer
Salary        10000
Name: a, dtype: object
>>>
>>> print(df.iloc[2])   # use the iloc attribute to get one row by its integer position.
Name        Jerry
Title     Manager
Salary      13000
Name: c, dtype: object
>>>
```
- You can also use slicing to select several rows at once. Here is an example.
```
>>> print(df[1:3])   # return the 2 rows at positions 1 and 2 of the DataFrame object.
  Name    Title  Salary
b   Bob       QA   12000
c  Jerry  Manager   13000
```
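Slicing behaves slightly differently depending on whether positions or labels are used. The sketch below rebuilds the same three-row table and compares the two; it is an illustration added here, not part of the original session.

```python
import pandas as pd

data = [["Tom", "Developer", 10000], ["Bob", "QA", 12000], ["Jerry", "Manager", 13000]]
df = pd.DataFrame(data, columns=["Name", "Title", "Salary"], index=["a", "b", "c"])

# Position-based slice: rows at positions 1 and 2 (the end position 3 is excluded).
by_position = df.iloc[1:3]

# Label-based slice: rows 'b' through 'c' (the end label is included).
by_label = df.loc["b":"c"]

print(by_position.equals(by_label))  # True
print(list(by_label.index))          # ['b', 'c']
```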
- Add row data: with the append() method you can add the rows of another DataFrame object to the current one; the new rows are attached after the existing rows. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat() is the replacement, so the example below only runs on older versions.
```
>>> import pandas as pd
>>>
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000]]
>>>
>>> columns_array = ['Name','Title','Salary']
>>>
>>> df = pd.DataFrame(data,columns = columns_array)   # create the first DataFrame object.
>>>
>>> print(df)
  Name      Title  Salary
0  Tom  Developer   10000
1  Bob         QA   12000
>>>
>>> data1 = [['Jerry', 'Manager', 13000]]
>>>
>>> df1 = pd.DataFrame(data1,columns = columns_array)   # create the second DataFrame object.
>>>
>>> print(df1)
    Name    Title  Salary
0  Jerry  Manager   13000
>>>
>>> df2 = df.append(df1)   # append df1 to the end of df; the original row labels are kept.
>>>
>>> print(df2)
    Name      Title  Salary
0    Tom  Developer   10000
1    Bob         QA   12000
0  Jerry    Manager   13000
>>>
>>> # ignore_index = True discards the original row labels and creates a new index.
>>> df = df.append(df1, ignore_index = True, sort = True)
>>>
>>> print(df)
    Name  Salary      Title
0    Tom   10000  Developer
1    Bob   12000         QA
2  Jerry   13000    Manager
```
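On pandas 2.0 and later, where DataFrame.append() is no longer available, the same result can be obtained with pd.concat(). The following is a minimal sketch using the same data; the names stacked and renumbered are made up for illustration.

```python
import pandas as pd

columns_array = ["Name", "Title", "Salary"]
df = pd.DataFrame([["Tom", "Developer", 10000], ["Bob", "QA", 12000]], columns=columns_array)
df1 = pd.DataFrame([["Jerry", "Manager", 13000]], columns=columns_array)

# Keep the original row labels: the result has labels 0, 1, 0.
stacked = pd.concat([df, df1])
print(list(stacked.index))     # [0, 1, 0]

# Discard the original labels and number the rows from 0 again.
renumbered = pd.concat([df, df1], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2]
print(renumbered)
```

pd.concat() takes a list of objects, so several DataFrames can be combined in a single call.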
- Delete row data: to delete a row, pass its row label to the DataFrame object's drop() method. If several rows share the same label, they are all deleted together. Here is an example.
```
>>> print(df)   # print out the original DataFrame object.
    Name  Salary      Title
0    Tom   10000  Developer
1    Bob   12000         QA
2  Jerry   13000    Manager
>>>
>>> df.drop(0)   # drop the first row; a new object is returned.
    Name  Salary    Title
1    Bob   12000       QA
2  Jerry   13000  Manager
>>>
>>> print(df)   # the original DataFrame object is not changed.
    Name  Salary      Title
0    Tom   10000  Developer
1    Bob   12000         QA
2  Jerry   13000    Manager
>>>
>>> df.drop(0, inplace = True)   # pass inplace = True to drop() to modify the original DataFrame object.
>>>
>>> print(df)
    Name  Salary    Title
1    Bob   12000       QA
2  Jerry   13000  Manager
```
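The remark about duplicate labels can be illustrated with a table in which two rows deliberately share the label 0. The data below is made up for this sketch.

```python
import pandas as pd

# Two rows share the row label 0 on purpose.
df = pd.DataFrame(
    {"Name": ["Tom", "Bob", "Jerry"], "Salary": [10000, 12000, 13000]},
    index=[0, 1, 0],
)

remaining = df.drop(0)          # removes both rows labelled 0
print(remaining)
print(len(df), len(remaining))  # 3 1
```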
3. DataFrame attributes and methods
- T (transpose): this attribute returns the transpose of a DataFrame object, that is, it swaps the rows and columns.
```
>>> import pandas as pd
>>>
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]]
>>>
>>> columns_array = ['Name','Title','Salary']
>>>
>>> row_index_label_arr = ['a', 'b', 'c']
>>>
>>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
>>>
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>>
>>> print(df.T)   # swap the DataFrame object's rows and columns.
                a      b        c
Name          Tom    Bob    Jerry
Title   Developer     QA  Manager
Salary      10000  12000    13000
>>>
```
- dtypes: returns the data type of each column.
```
>>> print(df.dtypes)
Name      object
Title     object
Salary     int64
dtype: object
```
- axes: returns a list containing the row labels and the column labels.
```
>>> print(df.axes)
[Index(['a', 'b', 'c'], dtype='object'), Index(['Name', 'Title', 'Salary'], dtype='object')]
```
- empty: returns a boolean that indicates whether the DataFrame is empty. True means the object contains no data.
```
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>>
>>> print(df.empty)
False
>>>
>>> df.drop('a', inplace=True)
>>>
>>> df.drop('b', inplace=True)
>>>
>>> df.drop('c', inplace=True)
>>>
>>> print(df)
Empty DataFrame
Columns: [Name, Title, Salary]
Index: []
>>>
>>> print(df.empty)   # now the DataFrame object's empty attribute returns True.
True
```
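- ndim: returns the number of dimensions of the data object. A DataFrame is a two-dimensional data structure.
```
>>> print(df.ndim)
2
```
- size: returns the number of elements in the DataFrame object.
```
>>> print(df.size)   # df is still empty after the drop() calls above, so the size is 0.
0
```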
- shape: returns a tuple (a, b) that describes the dimensions of the DataFrame, where a is the number of rows and b is the number of columns.
```
>>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
>>>
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>>
>>> print(df.shape)
(3, 3)
```
- values: returns the data in the DataFrame object as a two-dimensional array object.
```
>>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
>>>
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>>
>>> print(df.values)
[['Tom' 'Developer' 10000]
 ['Bob' 'QA' 12000]
 ['Jerry' 'Manager' 13000]]
```
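The attributes above are related to each other, which is easier to see on a table that is not square. The sketch below uses a made-up table with 2 rows and 3 columns.

```python
import pandas as pd

# Hypothetical 2-row, 3-column table, so rows and columns can be told apart.
data = [["Tom", "Developer", 10000], ["Bob", "QA", 12000]]
df = pd.DataFrame(data, columns=["Name", "Title", "Salary"], index=["a", "b"])

rows, cols = df.shape
print(df.shape)                # (2, 3)
print(df.ndim)                 # 2
print(df.size == rows * cols)  # True
print(df.T.shape)              # (3, 2)
print(df.values.shape)         # (2, 3)
print(df.empty)                # False
```

The size is the product of the two shape entries, and transposing swaps them.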
- head(n): returns the first n rows of data. If n is not provided, the first 5 rows are returned.
- tail(n): returns the last n rows of data. The default value of n is 5.
```
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>>
>>> print(df.head(1))
  Name      Title  Salary
a  Tom  Developer   10000
>>>
>>> print(df.tail(1))
    Name    Title  Salary
c  Jerry  Manager   13000
```
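The default of 5 rows only becomes visible on a table with more than 5 rows. A minimal sketch with a made-up single-column table of 8 rows:

```python
import pandas as pd

# Hypothetical table with 8 rows so that the default of 5 is visible.
df = pd.DataFrame({"n": range(8)})

print(len(df.head()))            # 5
print(len(df.tail()))            # 5
print(df.head(2)["n"].tolist())  # [0, 1]
print(df.tail(2)["n"].tolist())  # [6, 7]
```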
- shift(): shifts rows or columns. Its periods parameter gives the number of steps to move along the chosen axis. The syntax of the **shift()** function is as follows.
```
DataFrame.shift(periods=1, freq=None, axis=0, fill_value=<no_default>)
1. periods : An int that gives the number of steps to shift. It can be positive or negative. The default value is 1.
2. freq : Date offset. The default value is None. It applies to time series data and can be given as an offset alias string.
3. axis : If it is 0 or "index", the data moves up and down. If it is 1 or "columns", the data moves left and right.
4. fill_value : The value used to fill the positions left empty by the shift, in place of the default missing value.
```
- Below is an example of the shift() method.
```
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>>
>>> df.shift(axis=0, periods=1)   # every row moves down by one; the first row is filled with NaN.
   Name      Title   Salary
a   NaN        NaN      NaN
b   Tom  Developer  10000.0
c   Bob         QA  12000.0
>>>
>>> df1 = df.shift(axis=1, periods=1)   # every column moves right by one; the first column is filled with NaN.
>>>
>>> df1
   Name  Title     Salary
a   NaN    Tom  Developer
b   NaN    Bob         QA
c   NaN  Jerry    Manager
>>> df1 = df.shift(axis=1, periods=1, fill_value='')   # use an empty string instead of NaN.
>>>
>>> df1
   Name  Title     Salary
a          Tom  Developer
b          Bob         QA
c        Jerry    Manager
```
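A common use of shift() is comparing each row with the previous one. The sketch below is an illustration with made-up monthly figures; the column names Previous and Change are hypothetical.

```python
import pandas as pd

# Hypothetical monthly salary figures.
df = pd.DataFrame({"Salary": [10000, 12000, 13000]}, index=["jan", "feb", "mar"])

# shift(1) moves every value down one row, so each row sees the previous value.
df["Previous"] = df["Salary"].shift(1)
df["Change"] = df["Salary"] - df["Previous"]
print(df)

# A negative period moves values up; fill_value replaces the resulting gap.
shifted_up = df["Salary"].shift(-1, fill_value=0)
print(shifted_up.tolist())  # [12000, 13000, 0]
```

The first row of Change is NaN because there is no earlier row to compare with.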