What Is a Pandas DataFrame: An Explanation with Examples

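DataFrame is one of the core data structures in pandas and one of the structures used most often when analyzing data with pandas. A solid grasp of how to use DataFrame is a foundation for learning data analysis.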

1. What Is the Pandas DataFrame Structure

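A DataFrame is a tabular data structure with row labels and column labels. Data is laid out in rows and columns: each column represents one attribute of an entity, and each row holds the data for one entity.
It is also described as a heterogeneous data table, meaning that each column can have a different data type, such as string, integer, or float.
Each row of a DataFrame can be viewed as a Series, with the DataFrame attaching a column label to every value in that row. In this sense a DataFrame builds on Series, and its structure resembles an Excel spreadsheet.
Like a Series, a DataFrame has a row index. By default this is an "implicit index" that starts at 0 and increases by 1, with each row label corresponding to one row of data. You can also set an "explicit index" to supply your own row labels.
Every value in a DataFrame can be modified, rows and columns can be added or deleted, the DataFrame has two label axes (row labels and column labels), and arithmetic can be performed on rows and columns.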
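
The properties described above can be checked with a short sketch. The data and the variable names (`column_types`, `first_row`) are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

# Illustrative data: each column holds a different data type.
df = pd.DataFrame({
    'Name': ['Tom', 'Bob'],    # strings
    'Age': [30, 25],           # integers
    'Score': [88.5, 92.0],     # floats
})

# Each column keeps its own dtype (heterogeneous table).
column_types = df.dtypes

# A single row comes back as a Series indexed by the column labels.
first_row = df.iloc[0]

# An individual value can be modified by row label and column label.
df.loc[1, 'Age'] = 26

# Arithmetic is applied to a whole column at once.
df['Score'] = df['Score'] + 1.0

print(column_types)
print(type(first_row))
print(df)
```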

2. How to Create a Pandas DataFrame Object

The syntax for creating a DataFrame object is as follows:

# import the pandas library
import pandas as pd

# call the DataFrame() constructor to create a pandas DataFrame object.
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

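data : The input data. It can be an ndarray, Series, list, dict, scalar, or another DataFrame object.

index : Row labels. If no index is passed, the default row labels are 0, 1, ..., n-1, where n is the number of rows.

columns : Column labels. If no columns value is passed, the default column labels are 0, 1, ..., n-1, where n is the number of columns.

dtype : The data type to force on the data. Only a single dtype can be passed; if it is omitted, pandas infers a type for each column.

copy : Whether to copy the input data. In recent pandas versions the default is None, which copies dict input but does not copy 2D ndarray or DataFrame input; older versions defaulted to False (no copy).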
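
As a minimal sketch of how these parameters fit together, the following builds a DataFrame from a NumPy array. The array values and the labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# A 2 x 3 NumPy array used as input data (values are illustrative).
scores = np.array([[90, 88, 75], [99, 68, 100]])

df = pd.DataFrame(
    data=scores,
    index=['Tom', 'Jerry'],                    # row labels
    columns=['python', 'java', 'javascript'],  # column labels
    dtype=float,                               # cast every column to float
)

# Without index/columns, pandas falls back to 0, 1, 2, ... labels.
df_default = pd.DataFrame(scores)

print(df)
print(df_default)
```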

Create an empty DataFrame object.

>>> import pandas as pd # import the pandas module.
>>>
>>> df = pd.DataFrame() # create an empty DataFrame object.
>>>
>>> print(df) # print out the empty DataFrame object.
Empty DataFrame
Columns: []
Index: []

Create a DataFrame from a one-dimensional list.

>>> import pandas as pd
>>>
>>> data = ['python', 100, 'javascript', 'java', 199] # create a one-dimensional list with mixed value types.
>>>
>>> df = pd.DataFrame(data) # create a DataFrame object based on the above array.
>>>
>>> print(df) # print out the above DataFrame object.
            0
0      python
1         100
2  javascript
3        java
4         199
>>>

Create a DataFrame object from a nested list.

>>> import pandas as pd
>>>
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]] # define the 2 dimension array to save employee data ( name, title, salary ).
>>>
>>> columns_array = ['Name','Title','Salary']
>>>
>>> df = pd.DataFrame(data,columns = columns_array) # create the DataFrame object with the above data, and set the column labels with columns_array.
>>>
>>> print(df)
    Name      Title  Salary
0    Tom  Developer   10000
1    Bob         QA   12000
2  Jerry    Manager   13000
>>>

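Specify float as the data type for the numeric elements. Because dtype applies to the whole DataFrame and the Name and Title columns cannot be converted, the pandas version used here emits a FutureWarning and falls back to object for those columns; as the warning itself states, later versions may raise an error instead.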

>>> import pandas as pd
>>>
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]] # define the 2 dimension array to save employee data ( name, title, salary ).
>>>
>>> columns_array = ['Name','Title','Salary']
>>>
>>> df = pd.DataFrame(data,columns = columns_array, dtype = float) # create the DataFrame object with the above data, set the column labels with columns_array, and specify float as the data type.
sys:1: FutureWarning: Could not cast to float64, falling back to object. This behavior is deprecated. In a future version, when a dtype is passed to 'DataFrame', either all columns will be cast to that dtype, or a TypeError will be raised
>>>
>>> print(df)
    Name      Title   Salary
0    Tom  Developer  10000.0
1    Bob         QA  12000.0
2  Jerry    Manager  13000.0
>>>
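
One way to avoid the warning is to build the DataFrame first and then convert only the numeric column. The sketch below uses `astype()` with a per-column mapping; it is an alternative approach, not the method used in the example above.

```python
import pandas as pd

data = [['Tom', 'Developer', 10000], ['Bob', 'QA', 12000], ['Jerry', 'Manager', 13000]]
df = pd.DataFrame(data, columns=['Name', 'Title', 'Salary'])

# Convert only the Salary column; the text columns are left untouched.
df = df.astype({'Salary': float})

print(df.dtypes)
print(df)
```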

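Create a DataFrame from a Python dict. Every key's value must have the same length (that is, all the value lists must be equally long). If an index is passed, its length must equal the length of the lists. If no index is passed, the index defaults to range(n), where n is the length of the lists.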

>>> import pandas as pd
>>>
>>> data = {'Name':['Tom', 'Jerry', 'Bill', 'Bob'], 'Title':['Developer', 'QA', 'PM', 'Manager'], 'Salary':[12800,13400,12900,14200]} # define a python dictionary object, the key is the DataFrame column & the list value is the column data.
>>>
>>> df = pd.DataFrame(data) # create a DataFrame object based on the above python dictionary object.
>>>
>>> print(df) # print out the DataFrame object.
    Name      Title  Salary
0    Tom  Developer   12800
1  Jerry         QA   13400
2   Bill         PM   12900
3    Bob    Manager   14200
>>>
>>># The above example uses the default row label, which is generated by the function range(n). It generates 0,1,2,3 and corresponds to each element value in the list.
>>>
>>># We can also add custom row labels to the above example like below.
>>>
>>> import pandas as pd
>>>
>>> data = {'Name':['Tom', 'Jerry', 'Bill', 'Bob'], 'Title':['Developer', 'QA', 'PM', 'Manager'], 'Salary':[12800,13400,12900,14200]}
>>>
>>> index = ['employee_1', 'employee_2', 'employee_3', 'employee_4'] # define a python list array.
>>>
>>> df = pd.DataFrame(data, index = index) # use the above list array as the DataFrame row label.
>>>
>>> print(df)
             Name      Title  Salary
employee_1    Tom  Developer   12800
employee_2  Jerry         QA   13400
employee_3   Bill         PM   12900
employee_4    Bob    Manager   14200
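
The length rules stated above can be observed with a small sketch. The helper name `try_build` is hypothetical and only exists to capture the error; the exact error text depends on the pandas version, so it is printed rather than assumed.

```python
import pandas as pd

def try_build(data, index=None):
    """Return the ValueError message raised by pd.DataFrame, or None on success."""
    try:
        pd.DataFrame(data, index=index)
        return None
    except ValueError as exc:
        return str(exc)

# The value lists have different lengths (3 names, 2 salaries).
unequal_lists = try_build({'Name': ['Tom', 'Jerry', 'Bill'], 'Salary': [12800, 13400]})

# The index has 1 label but each list has 2 values.
short_index = try_build({'Name': ['Tom', 'Jerry']}, index=['employee_1'])

# Matching lengths build without error.
matching = try_build({'Name': ['Tom', 'Jerry']}, index=['employee_1', 'employee_2'])

print(unequal_lists)
print(short_index)
print(matching)
```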

A list of dicts can be passed to the DataFrame constructor as input data. By default, the dict keys are used as column names.

>>> import pandas as pd
>>>
>>> data_list_dict = [{'python': 90, 'java': 88},{'python': 99, 'java': 68, 'javascript': 100}]
>>>
>>> df = pd.DataFrame(data_list_dict)
>>>
>>> print(df) # the first row does not contain the 'javascript' column, so the 'javascript' column's value is NaN.
   python  java  javascript
0      90    88         NaN
1      99    68       100.0
>>>

Create a DataFrame object from a list of dicts and supply both row labels and column labels.

>>> import pandas as pd
>>>
>>> data_list_dict = [{'python': 90, 'java': 88},{'python': 99, 'java': 68, 'javascript': 100}]
>>>
>>> index_label = ['Tom', 'Jerry'] # because the data_list_dict only contains 2 rows of data, so the row index label should have 2 elements also, otherwise it will throw the ValueError: Shape of passed values is (2, 3), indices imply (3, 3).
>>>
>>> column_label = ['python', 'java', 'javascript']
>>>
>>> df = pd.DataFrame(data_list_dict, index = index_label, columns = column_label)
>>>
>>> print(df) # the first row does not contain the 'javascript' column, so the 'javascript' column's value is NaN.
        python  java  javascript
Tom        90    88         NaN
Jerry      99    68       100.0
>>>

You can also pass a dict of Series to create a DataFrame object. The row index of the result is the union of the indexes of all the Series in the dict.

>>> import pandas as pd
>>>
>>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'], index=['name_row_1', 'name_row_2', 'row_3', 'row_4'])
>>>
>>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'], index=['row_a', 'row_2', 'row_c', 'row_d'])
>>>
>>> series_3 = pd.Series([12800,13400,12900,14200], index=['row_', 'row_2', 'row_c', 'row_d'])
>>>
>>> data = {'Name':series_1, 'Title':series_2, 'Salary':series_3 }
>>>
>>> df = pd.DataFrame(data)
>>>
>>> print(df)
             Name      Title   Salary
name_row_1    Tom        NaN      NaN
name_row_2  Jerry        NaN      NaN
row_          NaN        NaN  12800.0
row_2         NaN         QA  13400.0
row_3        Bill        NaN      NaN
row_4         Bob        NaN      NaN
row_a         NaN  Developer      NaN
row_c         NaN         PM  12900.0
row_d         NaN    Manager  14200.0

2. How to Query, Add, and Delete Pandas DataFrame Data

2.1 Manipulating a DataFrame Object by Column Label

Querying column data: you can use a column label to query a column of a DataFrame object. Here is an example.

>>> import pandas as pd
>>>
>>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'])
>>>
>>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'])
>>>
>>> series_3 = pd.Series([12800,13400,12900,14200])
>>>
>>> data = {'Name':series_1, 'Title':series_2, 'Salary':series_3 }
>>>
>>> print(data)
{'Name': 0      Tom
1    Jerry
2     Bill
3      Bob
dtype: object, 'Title': 0    Developer
1           QA
2           PM
3      Manager
dtype: object, 'Salary': 0    12800
1    13400
2    12900
3    14200
dtype: int64}
>>>
>>>
>>> df = pd.DataFrame(data)
>>>
>>> print(df)
    Name      Title  Salary
0    Tom  Developer   12800
1  Jerry         QA   13400
2   Bill         PM   12900
3    Bob    Manager   14200
>>>
>>> print(df['Name']) # print out the 'Name' column.
0      Tom
1    Jerry
2     Bill
3      Bob
Name: Name, dtype: object
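
Several columns can be selected at once by passing a list of labels. The sketch below reuses the employee data from the example above; the variable names `name_column` and `subset` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'Jerry', 'Bill', 'Bob'],
    'Title': ['Developer', 'QA', 'PM', 'Manager'],
    'Salary': [12800, 13400, 12900, 14200],
})

# A single label returns a Series.
name_column = df['Name']

# A list of labels returns a DataFrame with only those columns.
subset = df[['Name', 'Salary']]

print(type(name_column))
print(subset)
```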

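Adding column data: you can add a new column either by assigning to a column label or by calling the DataFrame object's insert() method. Here is an example.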

>>> import pandas as pd
>>>
>>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'])
>>>
>>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'])
>>>
>>> data = {'Name':series_1}
>>>
>>> df = pd.DataFrame(data) # create the DataFrame object.
>>>
>>> print(df)
    Name
0    Tom
1  Jerry
2   Bill
3    Bob
>>>
>>> df['Title'] = series_2 # add the Title column.
>>>
>>> print(df)
    Name      Title
0    Tom  Developer
1  Jerry         QA
2   Bill         PM
3    Bob    Manager
>>>
>>> df['Name - Title'] = df['Name'] + ' - ' +  df['Title'] # add a new column based on the Name & Title column.
>>>
>>> print(df)
    Name      Title     Name - Title
0    Tom  Developer  Tom - Developer
1  Jerry         QA       Jerry - QA
2   Bill         PM        Bill - PM
3    Bob    Manager    Bob - Manager
>>>
>>> series_3 = pd.Series([12800,13400,12900,14200]) # define the third column Series object.
>>>
>>> df.insert(2, column = 'Salary', value = series_3) # insert the third column into the DataFrame object.
>>>
>>> print(df)
    Name      Title  Salary     Name - Title
0    Tom  Developer   12800  Tom - Developer
1  Jerry         QA   13400       Jerry - QA
2   Bill         PM   12900        Bill - PM
3    Bob    Manager   14200    Bob - Manager

Deleting column data: you can delete a column of a DataFrame object with Python's del statement or with the DataFrame object's pop() method. Here is an example.

>>> import pandas as pd
>>>
>>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'])
>>>
>>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'])
>>>
>>> series_3 = pd.Series([12800,13400,12900,14200])
>>>
>>> data = {'Name':series_1, 'Title':series_2, 'Salary':series_3 }
>>>
>>> df = pd.DataFrame(data) # create the DataFrame object.
>>>
>>> print(df)
    Name      Title  Salary
0    Tom  Developer   12800
1  Jerry         QA   13400
2   Bill         PM   12900
3    Bob    Manager   14200
>>>
>>> del df['Salary'] # delete the Salary column with the del command.
>>>
>>> print(df) # we can see that the Salary column has been removed from the original DataFrame object.
    Name      Title
0    Tom  Developer
1  Jerry         QA
2   Bill         PM
3    Bob    Manager
>>>
>>>
>>> df.pop('Title') # delete the Title column with the pop() function.
0    Developer
1           QA
2           PM
3      Manager
Name: Title, dtype: object
>>>
>>> print(df) # the Title column has been removed also.
    Name
0    Tom
1  Jerry
2   Bill
3    Bob
>>>
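
Both `del` and `pop()` modify the DataFrame in place. When the original should stay intact, `drop(columns=...)` is an option because it returns a new DataFrame. A minimal sketch with the same employee data; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'Jerry', 'Bill', 'Bob'],
    'Title': ['Developer', 'QA', 'PM', 'Manager'],
    'Salary': [12800, 13400, 12900, 14200],
})

# drop() returns a new DataFrame; df itself still has all three columns.
without_salary = df.drop(columns=['Salary'])

# pop() removes the column from df and returns it as a Series.
title = df.pop('Title')

print(without_salary)
print(df)
print(title)
```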

2.2 Manipulating a DataFrame Object by Row Label

Querying row data: pass a row label to the DataFrame object's loc attribute, or pass an integer row position to its iloc attribute, to query a row. Here is an example.

>>> import pandas as pd
>>> 
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]]
>>> 
>>> columns_array = ['Name','Title','Salary']
>>> 
>>> row_index_label_arr = ['a', 'b', 'c']
>>> 
>>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
>>> 
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> 
>>> print(df.loc['a']) # call the DataFrame object's loc attribute to get one row by row index label.
Name            Tom
Title     Developer
Salary        10000
Name: a, dtype: object
>>> 
>>> print(df.iloc[2]) # call the DataFrame object's iloc attribute to get one row by row index number.
Name        Jerry
Title     Manager
Salary      13000
Name: c, dtype: object
>>>

You can also use slicing to select several rows at once. Here is an example.

>>> print(df[1:3]) # return 2 rows from the DataFrame object. 
    Name    Title  Salary
b    Bob       QA   12000
c  Jerry  Manager   13000
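
Slicing also works through `loc` and `iloc`, with one difference worth noting: a label slice includes its end label, while a position slice excludes its stop position. A small sketch using the same data; the variable names are illustrative:

```python
import pandas as pd

data = [['Tom', 'Developer', 10000], ['Bob', 'QA', 12000], ['Jerry', 'Manager', 13000]]
df = pd.DataFrame(data, columns=['Name', 'Title', 'Salary'], index=['a', 'b', 'c'])

by_label = df.loc['a':'b']     # label slice: both 'a' and 'b' are included
by_position = df.iloc[0:2]     # position slice: stop position 2 is excluded

# loc also accepts lists of row labels and column labels.
picked = df.loc[['a', 'c'], ['Name', 'Salary']]

print(by_label)
print(by_position)
print(picked)
```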

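Adding row data: the append() method adds the rows of another DataFrame object to the end of the current one and returns the result as a new DataFrame. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the example below only runs on older versions; pd.concat() is the replacement. Here is an example.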

>>> import pandas as pd
>>> 
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000]]
>>> 
>>> columns_array = ['Name','Title','Salary']
>>> 
>>> df = pd.DataFrame(data,columns = columns_array) # create the first DataFrame object.
>>> 
>>> print(df)
  Name      Title  Salary
0  Tom  Developer   10000
1  Bob         QA   12000
>>> 
>>> data1 = [['Jerry', 'Manager', 13000]] 
>>> 
>>> df1 = pd.DataFrame(data1,columns = columns_array) # create the second DataFrame object.
>>> 
>>> print(df1)
    Name    Title  Salary
0  Jerry  Manager   13000
>>> 
>>> df2 = df.append(df1) # append df1 to the end of df; the result is stored in df2 so that df itself keeps its two rows.
>>>
>>> print(df2)
    Name      Title  Salary
0    Tom  Developer   10000
1    Bob         QA   12000
0  Jerry    Manager   13000
>>>
>>> df = df.append(df1, ignore_index = True, sort = True) # append df1 to df with the parameters, ignore_index means ignore the original index and create new index
>>> 
>>> print(df)
    Name  Salary      Title
0    Tom   10000  Developer
1    Bob   12000         QA
2  Jerry   13000    Manager
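
On pandas 2.0 and later, where `append()` is no longer available, the same row concatenation can be sketched with `pd.concat()`. The variable names `kept_index` and `new_index` are illustrative.

```python
import pandas as pd

columns_array = ['Name', 'Title', 'Salary']
df = pd.DataFrame([['Tom', 'Developer', 10000], ['Bob', 'QA', 12000]], columns=columns_array)
df1 = pd.DataFrame([['Jerry', 'Manager', 13000]], columns=columns_array)

# Keep the original row labels: the result has labels 0, 1, 0.
kept_index = pd.concat([df, df1])

# Discard the original row labels and number the rows 0, 1, 2.
new_index = pd.concat([df, df1], ignore_index=True)

print(kept_index)
print(new_index)
```

Like `append()`, `pd.concat()` returns a new DataFrame and leaves `df` and `df1` unchanged.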

Deleting row data: pass a row index label to the DataFrame object's drop() method to delete that row. If several rows share the same label, they are all deleted together. Here is an example.

>>> print(df) # print out the original DataFrame object.
    Name  Salary      Title
0    Tom   10000  Developer
1    Bob   12000         QA
2  Jerry   13000    Manager
>>> 
>>> df.drop(0)  # drop the first row.
    Name  Salary    Title
1    Bob   12000       QA
2  Jerry   13000  Manager
>>> 
>>> print(df) # the original DataFrame object is not changed.
    Name  Salary      Title
0    Tom   10000  Developer
1    Bob   12000         QA
2  Jerry   13000    Manager
>>> 
>>> df.drop(0, inplace = True) # add the inplace = True argument when invoke the DataFrame object's drop() method to modify the original DataFrame object.
>>> 
>>> print(df)
    Name  Salary    Title
1    Bob   12000       QA
2  Jerry   13000  Manager
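
The behavior with duplicate labels is not shown in the example above, so here is a minimal sketch of it. The DataFrame is built with a repeated label 0 purely for illustration.

```python
import pandas as pd

# Two rows share the row label 0.
df = pd.DataFrame({'Name': ['Tom', 'Bob', 'Jerry']}, index=[0, 1, 0])

# drop(0) removes every row labeled 0 and returns a new DataFrame.
result = df.drop(0)

print(df)
print(result)
```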

3. DataFrame Attributes and Methods

T (transpose): returns the transpose of the DataFrame object, that is, it swaps the rows and columns.

>>> import pandas as pd
>>> 
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]]
>>> 
>>> columns_array = ['Name','Title','Salary']
>>> 
>>> row_index_label_arr = ['a', 'b', 'c']
>>> 
>>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
>>> 
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> print(df.T) # exchange the DataFrame object's rows and columns
                a      b        c
Name          Tom    Bob    Jerry
Title   Developer     QA  Manager
Salary      10000  12000    13000
>>>

dtypes: returns the data type of each column.

>>> print(df.dtypes)
Name      object
Title     object
Salary     int64
dtype: object

axes: returns a list containing the row labels and the column labels.

>>> print(df.axes)
[Index(['a', 'b', 'c'], dtype='object'), Index(['Name', 'Title', 'Salary'], dtype='object')]

empty: returns a boolean that indicates whether the DataFrame object is empty. True means the object has no elements.

>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> print(df.empty)
False
>>> 
>>> df.drop('a', inplace=True)
>>> 
>>> df.drop('b', inplace=True)
>>> 
>>> df.drop('c', inplace=True)
>>> 
>>> print(df)
Empty DataFrame
Columns: [Name, Title, Salary]
Index: []
>>> 
>>> print(df.empty) # now the DataFrame object's empty attribute returns True.
True

ndim: returns the number of dimensions of the data object. A DataFrame is a two-dimensional data structure.
```
>>> print(df.ndim)
2
```
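size: returns the number of elements in the DataFrame object. The result here is 0 because all rows were dropped in the empty example above.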
```
>>> print(df.size)
0
```

shape: returns a tuple representing the dimensions of the DataFrame object, in the form (a, b), where a is the number of rows and b is the number of columns.

>>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
>>> 
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> print(df.shape)
(3, 3)
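
The three attributes ndim, size, and shape are related: size is the product of the values in shape, and ndim is the length of shape. A short sketch that checks this on the same data; `no_rows` is an illustrative name for a copy that keeps the columns but has no rows:

```python
import pandas as pd

data = [['Tom', 'Developer', 10000], ['Bob', 'QA', 12000], ['Jerry', 'Manager', 13000]]
df = pd.DataFrame(data, columns=['Name', 'Title', 'Salary'], index=['a', 'b', 'c'])

rows, cols = df.shape
print(rows, cols, df.size, df.ndim)

# An empty slice keeps the three columns but has zero rows.
no_rows = df.iloc[0:0]
print(no_rows.shape, no_rows.size, no_rows.empty)
```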

values: returns the data in the DataFrame object as a two-dimensional array.

>>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
>>> 
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> print(df.values)
[['Tom' 'Developer' 10000]
 ['Bob' 'QA' 12000]
 ['Jerry' 'Manager' 13000]]
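
The pandas documentation recommends the `to_numpy()` method over the `values` attribute for the same purpose. The sketch below shows it on the same data; note that mixed column types produce an array of dtype object, while a purely numeric selection keeps a numeric dtype.

```python
import pandas as pd

data = [['Tom', 'Developer', 10000], ['Bob', 'QA', 12000], ['Jerry', 'Manager', 13000]]
df = pd.DataFrame(data, columns=['Name', 'Title', 'Salary'], index=['a', 'b', 'c'])

# Mixed text and integer columns fall back to a common dtype of object.
all_values = df.to_numpy()

# Selecting only the numeric column keeps an integer dtype.
salary_values = df[['Salary']].to_numpy()

print(all_values.dtype, all_values.shape)
print(salary_values.dtype, salary_values.shape)
```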

head(n): returns the first n rows; if n is not given, the first 5 rows are returned.
tail(n): returns the last n rows; the default value of n is 5.

```

>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> print(df.head(1))
  Name      Title  Salary
a  Tom  Developer   10000
>>> 
>>> print(df.tail(1))
    Name    Title  Salary
c  Jerry  Manager   13000
```

shift(): shifts rows or columns. Its periods parameter gives the number of steps to shift along the chosen axis. The syntax of shift() is as follows.

```
DataFrame.shift(periods=1, freq=None, axis=0, fill_value=<no_default>)

1. periods : An int giving the number of steps to shift. It can be positive or negative. The default value is 1.

2. freq : A date offset. The default value is None. It applies to time series data; the value can be an offset object or a string that follows the time-rule aliases.

3. axis : If it is 0 or "index", rows are shifted up or down. If it is 1 or "columns", columns are shifted left or right.

4. fill_value : The value used to fill the positions left empty by the shift, in place of the default missing value (NaN).
```

Here is an example of the shift() method.

```
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> df.shift(axis=0, periods=1) # every row has moved down by one position; the first row is now NaN and the last row is dropped.
  Name      Title   Salary
a  NaN        NaN      NaN
b  Tom  Developer  10000.0
c  Bob         QA  12000.0
>>> 
>>> df1 = df.shift(axis=1, periods=1) # you can find the first column has been shifted to the right.
>>> 
>>> df1
  Name  Title     Salary
a  NaN    Tom  Developer
b  NaN    Bob         QA
c  NaN  Jerry    Manager
>>> df1 = df.shift(axis=1, periods=1, fill_value='') # use empty string to replace the NaN value.
>>> 
>>> df1
  Name  Title     Salary
a         Tom  Developer
b         Bob         QA
c       Jerry    Manager
```
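
A common use of shift() is comparing each row with the one before it. The following is a minimal sketch with made-up monthly sales figures; the column names `Previous`, `Change`, and `Next` are illustrative.

```python
import pandas as pd

# Made-up monthly sales figures.
df = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar'], 'Sales': [100, 120, 90]})

# Shift down by one row so each row sees the previous month's value.
df['Previous'] = df['Sales'].shift(1)

# The first row has no previous month, so its change is NaN.
df['Change'] = df['Sales'] - df['Previous']

# A negative period shifts up; fill_value replaces the gap at the end.
df['Next'] = df['Sales'].shift(-1, fill_value=0)

print(df)
```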