Feature Engineering in Machine Learning
What is a feature?
A feature is an individual measurable property of a sample. I like to think of it as the smallest unit of a dataset that helps us analyze it and put it to use. For instance, in order to determine whether a patient’s tumor is benign or malignant, features of each cell nucleus, such as radius, perimeter, texture, area, and so on, are considered.
This blog will cover:
- Feature Scaling
- Data Imputation
- Outliers
- Encoding techniques
Feature scaling
Feature scaling applies to numerical variables. Often a myriad of features in our dataset contribute towards the final prediction. A tremendous difference between the magnitudes of these features, a very wide range, or an inconsistency in their units can distort a model's predictions. Take features such as height in cm (values: 180, 168, etc.) and weight in kg (values: 40, 50, etc.): their magnitudes differ vastly. If we use an algorithm based on Euclidean distance, such as K Nearest Neighbors, on this data, the distances between data points will be dominated by the large-magnitude feature, giving certain features unintentional influence. Thus we must scale these features down. Scaling is particularly recommended for algorithms that use gradient descent (Linear Regression, Neural Networks), Principal Component Analysis, or distance measures (KNN, K Means, SVM). Ensemble techniques do not necessarily need feature scaling since they are not distance based.
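To make this concrete, here is a minimal sketch (the height and weight values are made up) of how the larger-magnitude feature dominates Euclidean distance until the data is scaled:

```python
# A minimal sketch: unscaled magnitudes dominate Euclidean distance.
import numpy as np

a = np.array([180.0, 40.0])  # [height in cm, weight in kg]
b = np.array([168.0, 50.0])

# Raw distance: the height difference (large magnitude) dominates.
print(np.linalg.norm(a - b))

# After min-max scaling each feature, both contribute comparably.
data = np.array([a, b])
scaled = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
print(np.linalg.norm(scaled[0] - scaled[1]))
```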
General terms
Scaling: transforming feature values into a definite range, usually 0 to 1 or -1 to +1.
Standardization- It scales data values so that their mean is 0 and their standard deviation is 1.
Normalization- It can be applied to features that are not normally distributed in order to bring them closer to a Gaussian (normal) distribution. This can be achieved using logarithmic, reciprocal, square root, exponential, or Box-Cox transformations.
Types of feature scaling:
1. Standardization (also called z-score normalization)- This replaces each feature value with its z score:
x' = (x − x̄) / σ
where x is the data point, x̄ is the mean of the entire column for a particular feature, and σ is the standard deviation. After applying this technique, the distribution is transformed so that mean = 0 and standard deviation = 1. In practice it is done with the StandardScaler class from sklearn.
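A short example of standardization with scikit-learn's StandardScaler, on hypothetical height values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[180.0], [168.0], [175.0], [160.0]])  # hypothetical heights in cm

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # learns the mean and std, then transforms

print(X_scaled.mean(), X_scaled.std())  # approximately 0 and 1
```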
2. Mean normalization- Values are centered around 0 and scaled by the feature's range:
x' = (x − mean(x)) / (max(x) − min(x))
where x is the original value and x' is the normalized value. Note that this centers and rescales the data, but does not by itself reshape it into a Gaussian (bell curve) distribution.
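scikit-learn has no dedicated class for mean normalization, but it is a one-liner in NumPy (illustrative values):

```python
import numpy as np

x = np.array([180.0, 168.0, 175.0, 160.0])
# Center at the mean, scale by the range: result is roughly within [-1, 1].
x_norm = (x - x.mean()) / (x.max() - x.min())
print(x_norm)
```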
3. Min max scaling- This is a type of normalization. Compared with mean normalization, the mean in the numerator is replaced by the minimum value:
x' = (x − min(x)) / (max(x) − min(x))
The minimum value of x renders x' as 0 and the maximum value renders x' as 1, so this method scales values into the range 0 to 1.
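A minimal MinMaxScaler example, again on made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[40.0], [50.0], [65.0], [80.0]])  # hypothetical weights in kg
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)  # the minimum maps to 0 and the maximum maps to 1
```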
4. Robust Scaler- It scales features using the median and quartiles: the difference between each feature value and the feature's median is divided by the Interquartile Range (IQR). This makes it a robust way to scale data containing outliers.
5. Unit vector scaling- Each sample vector is divided by its norm, which brings the values into the range 0 to 1 for non-negative data. It is often used for image data.
Other techniques include Max Abs Scaler, Quantile Transformer Scaler and Power Transformer Scaler.
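For completeness, a sketch comparing RobustScaler, MaxAbsScaler and the row-wise Normalizer (unit vector scaling) on toy data:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, Normalizer, MaxAbsScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100 is an outlier

# RobustScaler: (x - median) / IQR, so the outlier barely distorts the rest.
print(RobustScaler().fit_transform(X).ravel())

# MaxAbsScaler: divides by the maximum absolute value.
print(MaxAbsScaler().fit_transform(X).ravel())

# Normalizer works per row: each sample vector is scaled to unit norm.
rows = np.array([[3.0, 4.0], [1.0, 1.0]])
print(Normalizer().fit_transform(rows))  # [[0.6, 0.8], [0.707..., 0.707...]]
```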
Example: Consider the Titanic dataset, predicting whether an individual survived. Even after cleaning, the numeric features span very different ranges. Notice the use of fit_transform, which fits the scaler and transforms the training data in one step, with a plain transform for the test data.
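Since the original screenshot is not reproduced here, the following sketch conveys the idea with made-up values; the column names Age and Fare are assumptions, not the exact cleaned dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the cleaned Titanic data.
df = pd.DataFrame({"Age": [22, 38, 26, 35, 54, 2],
                   "Fare": [7.25, 71.28, 7.93, 53.1, 51.86, 21.08]})

X_train, X_test = train_test_split(df, test_size=0.33, random_state=42)

scaler = StandardScaler()
# fit_transform on the training set: learn mean/std AND transform in one step.
X_train_scaled = scaler.fit_transform(X_train)
# transform only on the test set: reuse the training statistics, never refit.
X_test_scaled = scaler.transform(X_test)
```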
Data Imputation
The presence of missing values in a dataset can often result in unsatisfactory model performance. There are various techniques to handle these missing values.
The first is to delete rows or records that contain missing or null values. This works well when we have a large dataset and only a minimal number of missing records; on a small dataset it would cause a loss of valuable data, adversely affecting model performance. It can also be used to drop a particular feature that has too many missing values.
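A quick pandas illustration of both variants on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [4.0, 5.0, np.nan]})

df_rows_dropped = df.dropna()             # drop every row containing a null
df_col_dropped = df.drop(columns=["f2"])  # drop a feature with too many nulls
```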
Another option is to build a separate model to predict the missing values.
Consider a hypothetical dataset with a missing value for feature 1. In order to handle it, we create a model where:
Train set = rows without missing values
Test set = the row with the missing value (i.e. row 1)
The model is trained on features 2, 3 and 4 (of the training data) and predicts the value of feature 1 for the test data. This method is quite accurate; however, a large number of features with missing values would require a separate model for each of them, which can be a cumbersome process.
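A minimal sketch of this model-based imputation on a hypothetical four-feature dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"f1": [np.nan, 2.0, 3.0, 4.0],
                   "f2": [1.0, 2.0, 3.0, 4.0],
                   "f3": [2.0, 4.0, 6.0, 8.0],
                   "f4": [1.0, 1.0, 2.0, 2.0]})

train = df[df["f1"].notna()]  # rows without missing values
test = df[df["f1"].isna()]    # the row with the missing value

# Train on features 2-4, then predict the missing value of feature 1.
model = LinearRegression().fit(train[["f2", "f3", "f4"]], train["f1"])
df.loc[df["f1"].isna(), "f1"] = model.predict(test[["f2", "f3", "f4"]])
```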
The third option is to replace the missing values with measures of central tendency (mean, median, mode). It is an efficient and widely used data imputation method. Let us explore this concept further with the help of a diabetes dataset.
Initially 5 of the columns had missing values. To begin with, I explored the Glucose column further.
A box plot with Outcome on the x-axis, depicting whether a patient has diabetes (0 or 1), and Glucose levels on the y-axis shows the median Glucose level for each of these classes. For records with missing Glucose values, I replaced them with the median of the class they belonged to, instead of the more arbitrary approach of replacing nulls with the median of the entire column.
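A sketch of this class-wise median imputation with pandas; the numbers are made up rather than taken from the actual diabetes data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Glucose": [110.0, np.nan, 150.0, 148.0, np.nan, 95.0],
                   "Outcome": [0, 0, 1, 1, 1, 0]})

# Compute the median per Outcome class, then fill nulls with their class median.
df["Glucose"] = df["Glucose"].fillna(
    df.groupby("Outcome")["Glucose"].transform("median"))
```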
A missing-value heatmap then shows that the remaining 4 columns still contain missing values, while Glucose no longer does.
Dealing with Outliers
Any data point that lies far outside the distribution of a dataset falls under the category of outliers. These data points are not always useless. In the diabetes dataset we looked at earlier, any record where the patient was diagnosed with diabetes is important and must be considered. However, outliers caused by human error or incorrect data entry can hinder our model's performance. On whether to remove or retain outliers, the guidance can be summarized as follows:
Outliers should be fixed or removed if they are the result of data entry errors or are not significant to the research question at hand. However, unusual data points that are simply a result of the natural variation in the data must be kept, since they accurately represent the existing variability and uncertainty. Removing them might improve your model's metrics, but it would make the process appear much more predictable than it actually is.
(The original figure showed these outliers as an encircled region in a graph of the data.)
Identifying and handling outliers-
1. Interquartile range (IQR)- The IQR is calculated by subtracting the first quartile (Q1) from the third quartile (Q3), i.e. the difference between the 75th and 25th percentiles. The lower bound is Q1 - 1.5*IQR, and the upper bound is Q3 + 1.5*IQR. Any value beyond these lower and upper bounds is classified as an outlier.
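A sketch of the IQR rule on illustrative values:

```python
import numpy as np

x = np.array([10, 12, 14, 15, 16, 18, 95])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(x[(x < lower) | (x > upper)])  # 95 falls outside the bounds
```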
2. Z score-
In the distribution curve, any point that lies beyond -3 or +3 standard deviations from the mean is classified as an outlier.
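And the z-score rule, here on synthetic data with one injected outlier:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(loc=100, scale=10, size=500), 190.0)  # inject outlier

z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])  # points beyond +/- 3 standard deviations
```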
3. Visualization- Scatter plots, box plots and line graphs can also reveal outliers at a glance.
Handling outliers- Certain algorithms like Naive Bayes, Support Vector Machines, Decision Trees, Random Forests and Gradient Boosting classifiers are not very sensitive to outliers. However, others like Linear Regression, Logistic Regression, K Means and Neural Networks are sensitive to outliers.
Encoding techniques
Encoding techniques deal with categorical variables or features. A feature in which order is irrelevant is called a nominal variable (for instance, gender), whereas a feature in which rank matters is called an ordinal variable (for instance, salary level, education level, and more).
These variables need to be converted into integer or float values in order to be effectively used in our model. This is done using encoding techniques. Types of encoding techniques:
Nominal encoding techniques
1. One hot encoding- Consider a dataset with 7 rows and “Country” as one of its columns.
Creating a dummy variable for each of the countries gives us one binary column per country.
This might seem efficient at first glance. However, would removing the third column, “Japan”, lose any information? Or is it derivable from the first two columns?
This is called the dummy variable trap. In one hot encoding, we only need n-1 columns, where n is the number of classes in the particular categorical variable. One disadvantage of this method is that as the number of distinct classes increases, the number of columns increases too.
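A sketch with pandas; apart from “Japan”, the country names are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["India", "Japan", "Germany", "Japan",
                               "India", "Germany", "India"]})

# drop_first=True keeps n-1 columns and so avoids the dummy variable trap.
dummies = pd.get_dummies(df["Country"], drop_first=True)
print(dummies)
```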
2. One hot encoding with multiple categories
This method entails applying one hot encoding only to the most recurring categories.
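A minimal sketch, assuming we keep only the two most frequent categories:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["India", "Japan", "India", "Germany",
                               "India", "Japan", "France"]})

top = df["Country"].value_counts().index[:2]  # the 2 most frequent categories
for category in top:
    df[category] = (df["Country"] == category).astype(int)
print(df)
```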
3. Mean encoding
Consider each of the distinct categories of a nominal feature in your dataset and calculate the mean of the output labels (which can be either 0 or 1) for each category. Say we have classes X, Y and Z with respective means 0.7, 0.6 and 0.59. Each occurrence of a category is then replaced by its respective mean.
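A short pandas sketch of mean encoding on hypothetical categories and labels:

```python
import pandas as pd

df = pd.DataFrame({"category": ["X", "Y", "X", "Z", "Y", "X"],
                   "label":    [1,   1,   0,   1,   0,   1]})

# Replace each category with the mean of the target for that category.
means = df.groupby("category")["label"].mean()
df["category_encoded"] = df["category"].map(means)
print(df)
```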
Ordinal encoding
1. Label encoding
Education levels such as Bachelors, Masters, PhD and so on will be encoded as 1, 2, 3 respectively, based on their increasing rank.
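An explicit mapping is a simple way to do this while preserving the known order (sklearn's LabelEncoder would instead assign codes alphabetically):

```python
import pandas as pd

df = pd.DataFrame({"education": ["Bachelors", "PhD", "Masters", "Bachelors"]})

# Spell out the ordering by hand so the ranks reflect the domain knowledge.
order = {"Bachelors": 1, "Masters": 2, "PhD": 3}
df["education_encoded"] = df["education"].map(order)
print(df)
```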
2. Target guided encoding
This works in a similar manner to mean encoding. However, instead of replacing the categories with the calculated mean values, we use the means to assign ranks. We find the mean of the output values (0/1) for each category and assign a rank accordingly.
Consider data with three categories A, B and C. Calculate the mean of the output values for each. Say the mean of the output labels for class A is 0.8, for class B is 0.5 and for class C is 0.45. These will be ranked as 3, 2 and 1 respectively.
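A sketch of target guided encoding on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"category": ["A", "B", "C", "A", "B", "C", "A"],
                   "label":    [1,   1,   0,   1,   0,   0,   1]})

# Rank categories by their mean target value: the lowest mean gets rank 1.
means = df.groupby("category")["label"].mean().sort_values()
ranks = {cat: rank for rank, cat in enumerate(means.index, start=1)}
df["category_encoded"] = df["category"].map(ranks)
print(df)
```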