
Exploratory analysis

Before applying the classification algorithm, we will conduct an exploratory analysis to understand how the data is distributed and to extract some preliminary knowledge. To display the first twenty rows of the imported DataFrame, we can use the head() function, as follows:

print(Data.head(20))

The following results are returned:

The first 20 rows are displayed. The head() function returns the first n rows of the object, based on position, which is useful for quickly checking whether the object contains the right kind of data. Now that the dataset is available in our Python environment, we can extract some information about it by invoking the info() function, as follows:

print(Data.info())

This method prints a concise summary of a DataFrame, including the index dtype, the column dtypes, the non-null counts, and the memory usage. The following results are returned:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302 entries, 0 to 301
Data columns (total 14 columns):
age 302 non-null int64
sex 302 non-null int64
cp 302 non-null int64
trestbps 302 non-null int64
chol 302 non-null int64
fbs 302 non-null int64
restecg 302 non-null int64
thalach 302 non-null int64
exang 302 non-null int64
oldpeak 302 non-null float64
slope 302 non-null int64
ca 302 non-null object
hal 302 non-null object
HeartDisease 302 non-null int64
dtypes: float64(1), int64(11), object(2)
memory usage: 33.1+ KB
None

The output reports useful information. The DataFrame has 302 entries and 14 columns. For each feature, the summary lists the number of non-null values and the data type, which already gives us an idea of the kinds of variables we are about to analyze. Three types have been identified: float64 (1 column), int64 (11 columns), and object (2 columns). The first two raise no doubts: they are real and integer numbers, respectively. The anomaly is the two columns labeled as object. To understand what happened, it is useful to check the data types provided by the pandas library, as shown in the following table:

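Now, everything is clear: the two columns have been labeled as containing text. Why did this happen? The cause is the presence of missing values. Since info() reports 302 non-null entries for both columns, the missing values are evidently not stored as NaN but as some non-numeric marker, which forces pandas to treat the whole column as text. Keep this in mind, as we will have to deal with this problem before proceeding with the construction of the model.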
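One way to confirm this kind of diagnosis is to look for the entries in the text columns that cannot be parsed as numbers. The following is only a minimal sketch on a small made-up DataFrame, not on the real dataset: the sample values and the `?` marker are assumptions chosen for illustration, and only the column names `ca` and `hal` are borrowed from the output above.

```python
import pandas as pd

# Toy stand-in for the imported data.
# The values and the "?" marker are assumptions for illustration only.
toy = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "ca": ["0.0", "3.0", "?", "1.0"],
    "hal": ["6.0", "?", "3.0", "7.0"],
})

# Columns that pandas did not store as numbers.
text_cols = toy.select_dtypes(exclude="number").columns

# Try to parse each text column as numbers; entries that cannot
# be parsed become NaN, which lets us list the offending values.
non_numeric = {}
for col in text_cols:
    parsed = pd.to_numeric(toy[col], errors="coerce")
    non_numeric[col] = toy.loc[parsed.isna(), col].unique().tolist()

print(non_numeric)
```

Applied to the real DataFrame, the same loop would list whatever marker the data file actually uses for missing values, which is the information needed to handle them later.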

To get a preview of the data contained in the DataFrame, we can calculate a series of basic statistics. To do so, we will use the describe() function in the following way:

summary = Data.describe()
print(summary)

The following results are returned:

The describe() function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values. It can analyze both numeric and object series, as well as DataFrame column sets of mixed data types, and the output varies depending on what is provided. For a DataFrame of mixed types, only the numeric columns are summarized by default, so the two object columns are absent from this summary. To continue, it is therefore necessary to address the problem of missing values.
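To see how the output of describe() depends on the column types, the following sketch compares the default call with `include="all"`. It runs on a small made-up DataFrame rather than on the real dataset; the values, including the `?` marker, are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the imported data; all values are assumed.
toy = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "oldpeak": [2.3, 1.5, 1.4, 0.8],
    "ca": ["0.0", "3.0", "?", "1.0"],
})

# Default call: only the numeric columns are summarized.
numeric_summary = toy.describe()

# include="all" summarizes every column; text columns get
# count, unique, top, and freq instead of mean, std, and quartiles.
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the full summary, the statistics that do not apply to a given column type are filled with NaN, so a text column shows values only in the count, unique, top, and freq rows.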