上QQ阅读APP看书，第一时间看更新

Putting data together

Finally, pandas DataFrames can be created by merging series or other list-like data. Note that scalars are transformed into lists, as follows:

In: import pandas as pd
    my_own_dataset = pd.DataFrame({'Col1': range(5), 
                                   'Col2': [1.0]*5, 
                                   'Col3': 1.0, 
                                   'Col4': 'Hello World!'})
    my_own_dataset

Here is the output for my_own_dataset:

It can be easily said that for each of the columns you want to be stacked together, you provide their names (as the dictionary key) and values (as the dictionary value for that key). As seen in the preceding example, Col2 and Col3 are created in two different ways, but they provide the same resulting column of values. In this way, you can create a pandas DataFrame that contains multiple types of data with a very simple function.

In this process, please ensure that you don't mix lists of different sizes; otherwise, an exception will be raised, as shown here:

In: my_wrong_own_dataset = pd.DataFrame({'Col1': range(5), 
                          'Col2': 'string', 'Col3': range(2)})

Out: ...
     ValueError: arrays must all be same length

In order to assemble entire already existing DataFrames, you have to use a different approach based on concatenation. The pandas package offers the concat command, which operates on pandas data structures (Series and DataFrames) by stacking rows when working on axis 0 (the default option) or stacking columns when concatenating on axis 1:

In: col5 = pd.Series([4, 3, 2, 1, 0])
    col6 = pd.Series([0, 0, 1, 1, 1])
    a_new_dataset = pd.concat([col5, col6], axis=1,
                              ignore_index = True,
                              keys=['Col5', 'Col6'])

    my_new_dataset = pd.concat([my_own_dataset, a_new_dataset], axis=1)
    my_new_dataset

The resulting dataset is a concatenation of the col5 and col6 series:

In the preceding example, we created a new DataFrame, a_new_dataset, based on two Series. We just stacked the two series together, regardless of their indexes, because we used the ignore_index parameter, which is set to True. If matching accordingly to the indexes is important for your project, just don't use the ignore_index parameter (its default value is False) and you'll have a new DataFrame based on the union of the two indexes or on only the index elements that match as a result.

Joining two distinct datasets on the basis of a common column is achieved in pd.concat by adding the parameter join='inner', which is e quivalent to a SQL inner join, (more on the topic about joins will be dealt with after the following example).

Matching based on indexes could sometimes not be enough for your needs. Sometimes, you may need to match different Series or DataFrames on specific columns or series of columns. In that case, you need the merge method, which can be run from every DataFrame.

In order to see the merge method in action, we will create a reference table containing some values to be matched based on Col5:

In: key = pd.Series([1, 2, 4])
    value = pd.Series(['alpha', 'beta', 'gamma'])
    reference_table = pd.concat([key, value], axis=1,
                                ignore_index = True,
                                keys=['Col5', 'Col7'])
    reference_table

Here is the concatenation between key and value into a DataFrame:

The merge is operated by setting the how parameter to left, thus achieving a SQL left outer join. Apart from left, other possible settings of this parameter are as follows:

right: Equivalent to a SQL right outer join
outer: Equivalent to a SQL full outer join
inner: Equivalent to a SQL inner join (as previously mentioned)

In: my_new_dataset.merge(reference_table,
                         on='Col5', how='left')

The resulting DataFrame is a left outer join:

Getting back to our initial my_own_dataset, in order to check the type of data present in each column, you can check the output of the dtypes attribute:

In: my_own_dataset.dtypes

Out:  Col1      int64
      Col2    float64
      Col3    float64
      Col4     object
      dtype: object

The last method seen in this example is very handy if you wish to check whether a datum is categorical, integer numerical, or floating point, and its precision. In fact, sometimes, it is possible to increase the processing speed by rounding up floats to integers and casting double-precision floats to single-precision floats, or by using only a single type of data. Let's see how you can cast the type in the following example. This example can also be seen as a broad example on how to reassign column data:

In:  my_own_dataset['Col1'] = my_own_dataset['Col1'].astype(float)
     my_own_dataset.dtypes

Out: Col1    float64
     Col2    float64
     Col3    float64
     Col4     object
     dtype: object

You can also obtain information about your DataFrame structure and data types using the info() as shown in this example: my_own_dataset.info().