
In this blog post, we’re going to take a look at the common Data Preprocessing tools used in Python. Download the “Titanic” dataset to follow along.

Libraries we will use:

  • pandas
  • NumPy
  • scikit-learn

Importing Data as a DataFrame

import pandas as pd
import numpy as np

data = pd.read_csv("Data/titanic/train.csv")
data.head()

Now let’s see how many observations and features we have, and what datatypes they use. This is especially useful when you have a large number of observations and features.

print(len(data))
print(len(data.columns))
print(data.dtypes.unique())

#########################################################

891
12
[dtype('int64') dtype('O') dtype('float64')]

We have a small dataset with 12 columns (11 potential features), and some of them are Python objects. Now let’s check whether we have any NaN or empty values in our observations.

print(data.isnull().any().sum(), "/", len(data.columns))
print(data.isnull().sum())

##########################################################

3 / 12
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

So, 3 out of 12 columns have NaN values: “Age”, “Cabin”, and “Embarked”. Here is how I think we can deal with this:

  1. 177 out of 891 observations in the “Age” column are missing. We can fill them with the mean or median of the values that are provided.
  2. “Embarked” is a categorical feature, so we cannot calculate a mean or median. We will fill the 2 NaN observations with the most frequent value.
  3. 687 out of 891 observations are missing in the “Cabin” column. It won’t make sense to rely on this feature, since almost 8 out of 10 values are NaN.
# Drop the Cabin column. We can also drop Name since we have ids instead
data = data.drop(['Cabin', 'Name'], axis=1)
# Fill the NaN values in Age with the column mean
data['Age'] = data['Age'].fillna(data['Age'].mean())
# Use sklearn's SimpleImputer to deal with the remaining missing data
from sklearn.impute import SimpleImputer
# Since Embarked is a categorical feature, we cannot use the mean or median strategy
imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
# SimpleImputer expects a 2D array, so reshape the column to a single-column matrix
imp.fit(data['Embarked'].values.reshape(-1, 1))
data['Embarked'] = imp.transform(data['Embarked'].values.reshape(-1, 1)).ravel()
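
As a side note, both fills can also be done with pandas alone, without SimpleImputer. Here is a minimal sketch of that alternative, using the median for “Age” and the column mode for “Embarked”:

# Median is less sensitive to outliers than the mean
data['Age'] = data['Age'].fillna(data['Age'].median())
# mode() returns a Series, so take the first (most frequent) value
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])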

Now let’s check if we have any missing values left:

print(data.isnull().any().sum(), "/", len(data.columns))
print(data.isnull().sum())

#########################################################

0 / 10
PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

Remember we had Python objects in our dataset? Let’s check on them.

data.select_dtypes(include="O").columns.tolist()

#########################################################

['Sex', 'Ticket', 'Embarked']

We have 3 columns stored as Python objects. We need to turn them into numerical features so we can use them in our machine learning equations.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Create one encoder per column so each mapping can be inverted later
le_sex = LabelEncoder()
le_embarked = LabelEncoder()
le_ticket = LabelEncoder()
data['Sex'] = le_sex.fit_transform(data['Sex'])
data['Embarked'] = le_embarked.fit_transform(data['Embarked'])
data['Ticket'] = le_ticket.fit_transform(data['Ticket'])
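
A quick API note: LabelEncoder is documented for encoding target labels, and scikit-learn recommends OrdinalEncoder for input features. A minimal sketch of that alternative, which handles all three columns in one pass:

from sklearn.preprocessing import OrdinalEncoder
# OrdinalEncoder works on 2D input, so we can encode all object columns at once
obj_cols = ['Sex', 'Ticket', 'Embarked']
data[obj_cols] = OrdinalEncoder().fit_transform(data[obj_cols])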

With these label encoders, we have transformed our objects into numerical features. We are not done yet, though. If you look at the data, you’ll see that “Embarked” has 3 categories. In a machine learning equation, larger encoded values may be given more weight even though no such ordering exists. If the “Embarked” feature were something like “Size”, then it would make sense to keep it like that.

In order to prevent this prioritization, we are going to One Hot Encode them.

# OneHotEncoder's categorical_features argument was removed from newer
# scikit-learn versions, so we use a ColumnTransformer to one hot encode
# Pclass and Embarked (the third and the last columns) and pass the rest
# through unchanged (sparse_output requires scikit-learn >= 1.2)
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('onehot', OneHotEncoder(sparse_output=False),
                         ['Pclass', 'Embarked'])], remainder='passthrough')
data = ct.fit_transform(data)
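
If you prefer to stay in pandas, pd.get_dummies achieves the same thing directly on the DataFrame (applied instead of the ColumnTransformer step above). A minimal sketch; drop_first=True drops one dummy column per feature, since it is fully determined by the others:

# One hot encode Pclass and Embarked directly on the DataFrame
data = pd.get_dummies(data, columns=['Pclass', 'Embarked'], drop_first=True)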

We have successfully imported a sample dataset, dealt with missing values, label encoded the categorical features, and One Hot Encoded them.