What Is a Pandas DataFrame? An Explanation with Examples


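DataFrame is one of the core data structures in pandas and one of the most frequently used when analyzing data with pandas. Once you are comfortable with DataFrame, you have the basic foundation for learning data analysis.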

1. What is the Pandas DataFrame structure

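  1. A DataFrame is a tabular data structure with row labels and column labels. Data is laid out in rows and columns: each column represents one attribute of an entity, and each row holds the data of one entity.
  2. It is also called a heterogeneous data table. Heterogeneous means that each column in the table can have a different data type, such as string, integer or float.
  3. Each row of a DataFrame can be viewed as a Series, and the DataFrame adds a column label to every value in those rows. In that sense a DataFrame is an extension of the Series. Its structure resembles an Excel spreadsheet.
  4. Like a Series, a DataFrame has its own row label index. By default this is an "implicit index" that starts at 0 and increases by one, so that each row label corresponds to one row of data. You can also set the row labels yourself with an "explicit index".
  5. Every value in a DataFrame can be modified, rows and columns can be added or deleted, the DataFrame has two labeled axes (row labels and column labels), and arithmetic operations can be applied to its rows and columns.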
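
The properties listed above can be sketched in a few lines. This is a minimal illustration only; the column names and values (`Name`, `Age`, `Score`) are made up for the example and do not come from the rest of this article.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different data type.
df = pd.DataFrame({
    "Name": ["Tom", "Bob"],      # strings
    "Age": [30, 25],             # integers
    "Score": [88.5, 92.0],       # floats
})

column_types = df.dtypes         # one data type per column
first_row = df.iloc[0]           # a single row comes back as a Series
row_labels = list(df.index)      # default implicit index: 0, 1

df.loc[0, "Age"] = 31            # modify a single value
df["Score"] = df["Score"] + 1.5  # arithmetic applied to a whole column
```

Printing `column_types` shows one entry per column, which is what "heterogeneous" means in practice.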

2. How to create a Pandas DataFrame object

  1. The syntax for creating a DataFrame object is as follows:

    # import the pandas library
    import pandas as pd
    
    # call the DataFrame() constructor to create a pandas DataFrame object.
    pd.DataFrame(data, index, columns, dtype, copy)
    
    data  : The input data. It can be an ndarray, Series, list, dict, scalar or another DataFrame object.
    
    index : Row labels. If no index is passed, the default row labels are 0, 1, ..., n-1 (like np.arange(n)), where n is the number of rows.
    
    columns : Column labels. If no columns value is passed, the default column labels are 0, 1, ..., n-1, where n is the number of columns.
    
    dtype : The data type to force on the data. Only a single dtype is allowed; if it is omitted, the type of each column is inferred.
    
    copy : Whether to copy the input data. In recent pandas versions the default is None, in which case dict input is copied; whether other inputs are copied depends on the input type and the pandas version.
    
  2. Create an empty DataFrame object.

    >>> import pandas as pd # import the pandas module.
    >>>
    >>> df = pd.DataFrame() # create an empty DataFrame object.
    >>>
    >>> print(df) # print out the empty DataFrame object.
    Empty DataFrame
    Columns: []
    Index: []
    
  3. Create a DataFrame from a one-dimensional list.

    >>> import pandas as pd
    >>>
    >>> data = ['python', 100, 'javascript', 'java', 199] # create a one dimension array.
    >>>
    >>> df = pd.DataFrame(data) # create a DataFrame object based on the above array.
    >>>
    >>> print(df) # print out the above DataFrame object.
                0
    0      python
    1         100
    2  javascript
    3        java
    4         199
    >>>
    
  4. Create a DataFrame object from a nested list.

    >>> import pandas as pd
    >>>
    >>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]] # define the 2 dimension array to save employee data ( name, title, salary ).
    >>>
    >>> columns_array = ['Name','Title','Salary']
    >>>
    >>> df = pd.DataFrame(data,columns = columns_array) # create the DataFrame object with the above data, and set the column labels from columns_array.
    >>>
    >>> print(df)
        Name      Title  Salary
    0    Tom  Developer   10000
    1    Bob         QA   12000
    2  Jerry    Manager   13000
    >>>
    
  5. Specify float as the data type of the numeric elements.

    >>> import pandas as pd
    >>>
    >>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]] # define the 2 dimension array to save employee data ( name, title, salary ).
    >>>
    >>> columns_array = ['Name','Title','Salary']
    >>>
    >>> df = pd.DataFrame(data,columns = columns_array) # create the DataFrame object with the above data and column labels.
    >>>
    >>> df = df.astype({'Salary': float}) # cast only the numeric Salary column to float. Passing dtype = float to the constructor would also try to cast the string columns, which triggers a FutureWarning in older pandas versions and an error in newer ones.
    >>>
    >>> print(df)
        Name      Title   Salary
    0    Tom  Developer  10000.0
    1    Bob         QA  12000.0
    2  Jerry    Manager  13000.0
    >>>
    
    
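  6. Create a DataFrame from a Python dictionary. The values of all keys in the dictionary must have the same number of elements (that is, the value lists must have the same length). If an index is passed, its length must equal the length of the value lists. If no index is passed, the index defaults to range(n), where n is the length of the value lists.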

    >>> import pandas as pd
    >>>
    >>> data = {'Name':['Tom', 'Jerry', 'Bill', 'Bob'], 'Title':['Developer', 'QA', 'PM', 'Manager'], 'Salary':[12800,13400,12900,14200]} # define a python dictionary object, the key is the DataFrame column & the list value is the column data.
    >>>
    >>> df = pd.DataFrame(data) # create a DataFrame object based on the above python dictionary object.
    >>>
    >>> print(df) # print out the DataFrame object.
        Name      Title  Salary
    0    Tom  Developer   12800
    1  Jerry         QA   13400
    2   Bill         PM   12900
    3    Bob    Manager   14200
    >>>
    >>># The above example uses the default row label, which is generated by the function range(n). It generates 0,1,2,3 and corresponds to each element value in the list.
    >>>
    >>># We can also add custom row labels to the above example like below.
    >>>
    >>> import pandas as pd
    >>>
    >>> data = {'Name':['Tom', 'Jerry', 'Bill', 'Bob'], 'Title':['Developer', 'QA', 'PM', 'Manager'], 'Salary':[12800,13400,12900,14200]}
    >>>
    >>> index = ['employee_1', 'employee_2', 'employee_3', 'employee_4'] # define a python list array.
    >>>
    >>> df = pd.DataFrame(data, index = index) # use the above list array as the DataFrame row label.
    >>>
    >>> print(df)
                 Name      Title  Salary
    employee_1    Tom  Developer   12800
    employee_2  Jerry         QA   13400
    employee_3   Bill         PM   12900
    employee_4    Bob    Manager   14200
    
    
  7. A list of dictionaries can also be passed to the DataFrame constructor as input data. By default, the dictionary keys are used as column names.

    >>> import pandas as pd
    >>>
    >>> data_list_dict = [{'python': 90, 'java': 88},{'python': 99, 'java': 68, 'javascript': 100}]
    >>>
    >>> df = pd.DataFrame(data_list_dict)
    >>>
    >>> print(df) # the first row does not contain the 'javascript' column, so the 'javascript' column's value is NaN.
       python  java  javascript
    0      90    88         NaN
    1      99    68       100.0
    >>>
    
  8. Create a DataFrame object from a list of dictionaries, and provide row and column labels.

    >>> import pandas as pd
    >>>
    >>> data_list_dict = [{'python': 90, 'java': 88},{'python': 99, 'java': 68, 'javascript': 100}]
    >>>
    >>> index_label = ['Tom', 'Jerry'] # because the data_list_dict only contains 2 rows of data, so the row index label should have 2 elements also, otherwise it will throw the ValueError: Shape of passed values is (2, 3), indices imply (3, 3).
    >>>
    >>> column_label = ['python', 'java', 'javascript']
    >>>
    >>> df = pd.DataFrame(data_list_dict, index = index_label, columns = column_label)
    >>>
    >>> print(df) # the first row does not contain the 'javascript' column, so the 'javascript' column's value is NaN.
            python  java  javascript
    Tom        90    88         NaN
    Jerry      99    68       100.0
    >>>
    
  9. You can also pass a dictionary of Series objects to create a DataFrame. The row index of the result is the union of the indexes of all the Series in the dictionary.

    >>> import pandas as pd
    >>>
    >>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'], index=['name_row_1', 'name_row_2', 'row_3', 'row_4'])
    >>>
    >>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'], index=['row_a', 'row_2', 'row_c', 'row_d'])
    >>>
    >>> series_3 = pd.Series([12800,13400,12900,14200], index=['row_', 'row_2', 'row_c', 'row_d'])
    >>>
    >>> data = {'Name':series_1, 'Title':series_2, 'Salary':series_3 }
    >>>
    >>> df = pd.DataFrame(data)
    >>>
    >>> print(df)
                 Name      Title   Salary
    name_row_1    Tom        NaN      NaN
    name_row_2  Jerry        NaN      NaN
    row_          NaN        NaN  12800.0
    row_2         NaN         QA  13400.0
    row_3        Bill        NaN      NaN
    row_4         Bob        NaN      NaN
    row_a         NaN  Developer      NaN
    row_c         NaN         PM  12900.0
    row_d         NaN    Manager  14200.0
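
The parameter list above mentions ndarray input, and the dictionary rule above requires value lists of equal length. The following sketch illustrates both under stated assumptions: the `scores` array and its labels are hypothetical sample data, and NumPy is assumed to be installed alongside pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: 2 rows (students) x 3 columns (subjects).
scores = np.array([[90, 88, 75],
                   [99, 68, 100]])

# A 2-D ndarray plus explicit row and column labels.
df = pd.DataFrame(scores,
                  index=["Tom", "Jerry"],
                  columns=["python", "java", "javascript"])

# Omitting index and columns falls back to 0..n-1 labels on both axes.
df_default = pd.DataFrame(scores)

# A dictionary whose value lists differ in length cannot form a table.
try:
    pd.DataFrame({"Name": ["Tom", "Jerry"], "Salary": [12800]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
```

Here `mismatch_error` ends up holding the `ValueError` that pandas raises for the unequal lists.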
    
    

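2. How to query, add and delete Pandas DataFrame data

2.1 Manipulating a DataFrame object with column index labels

  1. Query column data: you can use a column index label to query the data of a DataFrame column. Here is an example.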

    >>> import pandas as pd
    >>>
    >>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'])
    >>>
    >>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'])
    >>>
    >>> series_3 = pd.Series([12800,13400,12900,14200])
    >>>
    >>> data = {'Name':series_1, 'Title':series_2, 'Salary':series_3 }
    >>>
    >>> print(data)
    {'Name': 0      Tom
    1    Jerry
    2     Bill
    3      Bob
    dtype: object, 'Title': 0    Developer
    1           QA
    2           PM
    3      Manager
    dtype: object, 'Salary': 0    12800
    1    13400
    2    12900
    3    14200
    dtype: int64}
    >>>
    >>>
    >>> df = pd.DataFrame(data)
    >>>
    >>> print(df)
        Name      Title  Salary
    0    Tom  Developer   12800
    1  Jerry         QA   13400
    2   Bill         PM   12900
    3    Bob    Manager   14200
    >>>
    >>> print(df['Name']) # print out the 'Name' column.
    0      Tom
    1    Jerry
    2     Bill
    3      Bob
    Name: Name, dtype: object
    
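  2. Add column data: you can add a new column either by assigning to a column index label or by calling the DataFrame object's insert() method. Here is an example.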

    >>> import pandas as pd
    >>>
    >>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'])
    >>>
    >>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'])
    >>>
    >>> data = {'Name':series_1}
    >>>
    >>> df = pd.DataFrame(data) # create the DataFrame object.
    >>>
    >>> print(df)
        Name
    0    Tom
    1  Jerry
    2   Bill
    3    Bob
    >>>
    >>> df['Title'] = series_2 # add the Title column.
    >>>
    >>> print(df)
        Name      Title
    0    Tom  Developer
    1  Jerry         QA
    2   Bill         PM
    3    Bob    Manager
    >>>
    >>> df['Name - Title'] = df['Name'] + ' - ' +  df['Title'] # add a new column based on the Name & Title column.
    >>>
    >>> print(df)
        Name      Title     Name - Title
    0    Tom  Developer  Tom - Developer
    1  Jerry         QA       Jerry - QA
    2   Bill         PM        Bill - PM
    3    Bob    Manager    Bob - Manager
    >>>
    >>> series_3 = pd.Series([12800,13400,12900,14200]) # define the third column Series object.
    >>>
    >>> df.insert(2, column = 'Salary', value = series_3) # insert the third column into the DataFrame object.
    >>>
    >>> print(df)
        Name      Title  Salary     Name - Title
    0    Tom  Developer   12800  Tom - Developer
    1  Jerry         QA   13400       Jerry - QA
    2   Bill         PM   12900        Bill - PM
    3    Bob    Manager   14200    Bob - Manager
    
  3. Delete column data: you can delete a column of a DataFrame object with the Python del statement or with the DataFrame object's pop() method. Here is an example.

    >>> import pandas as pd
    >>>
    >>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'])
    >>>
    >>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'])
    >>>
    >>> series_3 = pd.Series([12800,13400,12900,14200])
    >>>
    >>> data = {'Name':series_1, 'Title':series_2, 'Salary':series_3 }
    >>>
    >>> df = pd.DataFrame(data) # create the DataFrame object.
    >>>
    >>> print(df)
        Name      Title  Salary
    0    Tom  Developer   12800
    1  Jerry         QA   13400
    2   Bill         PM   12900
    3    Bob    Manager   14200
    >>>
    >>> del df['Salary'] # delete the Salary column with the del command.
    >>>
    >>> print(df) # we can see that the Salary column has been removed from the original DataFrame object.
        Name      Title
    0    Tom  Developer
    1  Jerry         QA
    2   Bill         PM
    3    Bob    Manager
    >>>
    >>>
    >>> df.pop('Title') # delete the Title column with the pop() function.
    0    Developer
    1           QA
    2           PM
    3      Manager
    Name: Title, dtype: object
    >>>
    >>> print(df) # the Title column has been removed also.
        Name
    0    Tom
    1  Jerry
    2   Bill
    3    Bob
    >>>
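
As a compact recap of the column operations above, here is a small sketch. It also uses `drop(columns=...)`, which is not covered in the steps above; unlike `del` and `pop()`, it returns a new DataFrame and leaves the original unchanged. The `Bonus` column and its 10% rate are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "Jerry", "Bill", "Bob"],
    "Title": ["Developer", "QA", "PM", "Manager"],
    "Salary": [12800, 13400, 12900, 14200],
})

# Query: a list of labels selects several columns and returns a DataFrame.
name_and_salary = df[["Name", "Salary"]]

# Add: compute a new column from an existing one (hypothetical 10% bonus).
df["Bonus"] = df["Salary"] * 0.1

# Delete: pop() removes the column in place and returns it as a Series.
bonus = df.pop("Bonus")

# Delete: drop(columns=...) returns a new DataFrame and leaves df untouched.
without_title = df.drop(columns=["Title"])
```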
    

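2.2 Manipulating a DataFrame object with row index labels

  1. Query row data: you can pass a row label to the DataFrame object's loc attribute, or a row position number to its iloc attribute, to query the data of a row. Here is an example.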

    >>> import pandas as pd
    >>> 
    >>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]]
    >>> 
    >>> columns_array = ['Name','Title','Salary']
    >>> 
    >>> row_index_label_arr = ['a', 'b', 'c']
    >>> 
    >>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
    >>> 
    >>> print(df)
        Name      Title  Salary
    a    Tom  Developer   10000
    b    Bob         QA   12000
    c  Jerry    Manager   13000
    >>> 
    >>> 
    >>> print(df.loc['a']) # call the DataFrame object's loc attribute to get one row by row index label.
    Name            Tom
    Title     Developer
    Salary        10000
    Name: a, dtype: object
    >>> 
    >>> print(df.iloc[2]) # call the DataFrame object's iloc attribute to get one row by row index number.
    Name        Jerry
    Title     Manager
    Salary      13000
    Name: c, dtype: object
    >>>
    
  2. You can also use slicing to select several rows at once. Here is an example.

    >>> print(df[1:3]) # return 2 rows from the DataFrame object. 
        Name    Title  Salary
    b    Bob       QA   12000
    c  Jerry  Manager   13000
    
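  3. Add row data: you can add the rows of another DataFrame object to the end of the current one. Older pandas versions offered the DataFrame.append() method for this; it was removed in pandas 2.0, where pd.concat() does the same job. Here is an example.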

    >>> import pandas as pd
    >>> 
    >>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000]]
    >>> 
    >>> columns_array = ['Name','Title','Salary']
    >>> 
    >>> df = pd.DataFrame(data,columns = columns_array) # create the first DataFrame object.
    >>> 
    >>> print(df)
      Name      Title  Salary
    0  Tom  Developer   10000
    1  Bob         QA   12000
    >>> 
    >>> data1 = [['Jerry', 'Manager', 13000]] 
    >>> 
    >>> df1 = pd.DataFrame(data1,columns = columns_array) # create the second DataFrame object.
    >>> 
    >>> print(df1)
        Name    Title  Salary
    0  Jerry  Manager   13000
    >>> 
    >>> df2 = pd.concat([df, df1]) # add the rows of df1 to the end of df (df.append(df1) in pandas versions before 2.0); the original row labels are kept, so label 0 appears twice.
    >>>  
    >>> print(df2)
        Name      Title  Salary
    0    Tom  Developer   10000
    1    Bob         QA   12000
    0  Jerry    Manager   13000
    >>>
    >>> df = pd.concat([df, df1], ignore_index = True, sort = True) # ignore_index = True discards the original row labels and creates a new 0..n-1 index; sort = True sorts the column labels.
    >>> 
    >>> print(df)
        Name  Salary      Title
    0    Tom   10000  Developer
    1    Bob   12000         QA
    2  Jerry   13000    Manager
    
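  4. Delete row data: you can call the DataFrame object's drop() method with a row index label to delete that row. If several rows share the same index label, they are all deleted together. Here is an example.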

    >>> print(df) # print out the original DataFrame object.
        Name  Salary      Title
    0    Tom   10000  Developer
    1    Bob   12000         QA
    2  Jerry   13000    Manager
    >>> 
    >>> df.drop(0)  # drop the first row.
        Name  Salary    Title
    1    Bob   12000       QA
    2  Jerry   13000  Manager
    >>> 
    >>> print(df) # the original DataFrame object is not changed.
        Name  Salary      Title
    0    Tom   10000  Developer
    1    Bob   12000         QA
    2  Jerry   13000    Manager
    >>> 
    >>> df.drop(0, inplace = True) # add the inplace = True argument when invoke the DataFrame object's drop() method to modify the original DataFrame object.
    >>> 
    >>> print(df)
        Name  Salary    Title
    1    Bob   12000       QA
    2  Jerry   13000  Manager
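
The remark above about duplicate index labels can be checked with a short sketch. It uses `pd.concat` to combine the rows, since `DataFrame.append` is not available in pandas 2.0 and later, and it also shows adding a single row by assigning to a new label through `loc`, which the steps above do not cover. The `Bill` row is hypothetical sample data.

```python
import pandas as pd

columns = ["Name", "Title", "Salary"]
df = pd.DataFrame([["Tom", "Developer", 10000], ["Bob", "QA", 12000]],
                  columns=columns)
extra = pd.DataFrame([["Jerry", "Manager", 13000]], columns=columns)

# Without ignore_index, the added row keeps its original label 0,
# so the label 0 now appears twice.
combined = pd.concat([df, extra])

# loc selects by label and therefore returns both rows labelled 0.
rows_labelled_0 = combined.loc[0]

# drop(0) removes every row that carries the label 0.
remaining = combined.drop(0)

# Assigning to a new label through loc adds a single row in place.
df.loc[2] = ["Bill", "PM", 12900]
```

After `drop(0)`, only the row for Bob is left in `remaining`, because both Tom and Jerry carried the label 0.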
    

3. DataFrame attributes and methods

  1. T (transpose): this attribute returns the transpose of a DataFrame object, that is, it swaps the rows and columns.

    >>> import pandas as pd
    >>> 
    >>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]]
    >>> 
    >>> columns_array = ['Name','Title','Salary']
    >>> 
    >>> row_index_label_arr = ['a', 'b', 'c']
    >>> 
    >>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
    >>> 
    >>> print(df)
        Name      Title  Salary
    a    Tom  Developer   10000
    b    Bob         QA   12000
    c  Jerry    Manager   13000
    >>> 
    >>> print(df.T) # exchange the DataFrame object's rows and columns
                    a      b        c
    Name          Tom    Bob    Jerry
    Title   Developer     QA  Manager
    Salary      10000  12000    13000
    >>>
    
  2. dtypes: returns the data type of each column.

    >>> print(df.dtypes)
    Name      object
    Title     object
    Salary     int64
    dtype: object
    
  3. axes: returns a list containing the row labels and the column labels.

    >>> print(df.axes)
    [Index(['a', 'b', 'c'], dtype='object'), Index(['Name', 'Title', 'Salary'], dtype='object')]
    
  4. empty: returns a Boolean value that tells whether the DataFrame object is empty. True means the object is empty.

    >>> print(df)
        Name      Title  Salary
    a    Tom  Developer   10000
    b    Bob         QA   12000
    c  Jerry    Manager   13000
    >>> 
    >>> print(df.empty)
    False
    >>> 
    >>> df.drop('a', inplace=True)
    >>> 
    >>> df.drop('b', inplace=True)
    >>> 
    >>> df.drop('c', inplace=True)
    >>> 
    >>> print(df)
    Empty DataFrame
    Columns: [Name, Title, Salary]
    Index: []
    >>> 
    >>> print(df.empty) # now the DataFrame object's empty attribute returns True.
    True
    
  5. ndim: returns the number of dimensions of the data object. A DataFrame is a two-dimensional data structure.

    >>> print(df.ndim)
    2
    
  6. size: returns the number of elements in the DataFrame object.

    >>> print(df.size)
    0
    
  7. shape: returns a tuple (a, b) that represents the dimensions of the DataFrame, where a is the number of rows and b is the number of columns.

    >>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
    >>> 
    >>> print(df)
        Name      Title  Salary
    a    Tom  Developer   10000
    b    Bob         QA   12000
    c  Jerry    Manager   13000
    >>> 
    >>> print(df.shape)
    (3, 3)
    
    
  8. values: returns the data in the DataFrame object as a two-dimensional array.

    >>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
    >>> 
    >>> print(df)
        Name      Title  Salary
    a    Tom  Developer   10000
    b    Bob         QA   12000
    c  Jerry    Manager   13000
    >>> 
    >>> print(df.values)
    [['Tom' 'Developer' 10000]
     ['Bob' 'QA' 12000]
     ['Jerry' 'Manager' 13000]]
    
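  9. head(n): returns the first n rows of data. If n is not provided, the first 5 rows are returned.

  10. tail(n): returns the last n rows of data. The default value of n is 5.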

```
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> print(df.head(1))
  Name      Title  Salary
a  Tom  Developer   10000
>>> 
>>> print(df.tail(1))
    Name    Title  Salary
c  Jerry  Manager   13000
```

11. shift(): shifts the rows or columns. Its periods parameter gives the number of steps to move along the chosen axis. The syntax of shift() is shown below.

```
DataFrame.shift(periods=1, freq=None, axis=0, fill_value=<no_default>)

1. periods : An int that gives the number of steps to move. It can be positive or negative. The default value is 1.

2. freq : Date offset. The default value is None. It applies to time series data. The value is a string that conforms to the time rule.

3. axis : If it is 0 or "index", the data moves up or down. If it is 1 or "columns", the data moves left or right.

4. fill_value : The value used to fill the positions that the shift leaves empty. If it is not given, those positions hold missing values (NaN).
```

12. Here is an example of the shift() method.

```
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> df.shift(axis=0, periods=1) # every row is shifted down by one position, and the first row is filled with NaN.
  Name      Title   Salary
a  NaN        NaN      NaN
b  Tom  Developer  10000.0
c  Bob         QA  12000.0
>>> 
>>> df1 = df.shift(axis=1, periods=1) # every column is shifted one position to the right, and the first column is filled with NaN.
>>> 
>>> df1
  Name  Title     Salary
a  NaN    Tom  Developer
b  NaN    Bob         QA
c  NaN  Jerry    Manager
>>> df1 = df.shift(axis=1, periods=1, fill_value='') # use empty string to replace the NaN value.
>>> 
>>> df1
  Name  Title     Salary
a         Tom  Developer
b         Bob         QA
c       Jerry    Manager
```
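
A common use of `shift()` is comparing each row with the previous one. The sketch below illustrates that idea together with the `shape`, `ndim` and `size` attributes described earlier; the monthly salary history is hypothetical sample data.

```python
import pandas as pd

# Hypothetical monthly salary history for one employee.
salary = pd.DataFrame({"Salary": [10000, 12000, 13000]},
                      index=["Jan", "Feb", "Mar"])

# shift(1) moves every value down one row, so each row sees the previous month.
salary["Previous"] = salary["Salary"].shift(1)

# The first row has no previous value, so its difference is NaN.
salary["Change"] = salary["Salary"] - salary["Previous"]

# With fill_value the vacated slot gets a chosen value instead of NaN.
previous_filled = salary["Salary"].shift(1, fill_value=0)

summary = {
    "shape": salary.shape,   # (rows, columns)
    "ndim": salary.ndim,     # number of dimensions
    "size": salary.size,     # rows * columns
}
```

Because the shifted column contains NaN, pandas stores `Previous` and `Change` as floats, which matches the float `Salary` values in the shift() output shown earlier.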