DataSklr


Multi-layer Perceptron

Revised Feb. 16, 2020

TABLE OF CONTENTS:

In order to arrive at the most accurate prediction, machine learning models are built, tuned and compared against each other. The reader can click on the links below to assess the models or sections of the exercise. Each section has a short explanation of theory and a description of applied machine learning with Python:

  1. Exploratory Data Analysis

  2. LDA/QDA/Naive Bayes Classifier

  3. Multi-Layer Perceptron (Current Blog)

  4. K-Nearest Neighbors

  5. Support Vector Machines

  6. Ensemble Learning

  7. Model Comparisons

OBJECTIVES:

This blog is part of a series of models showcasing applied machine learning models in a classification setting. By clicking on any of the tabs above, the reader can navigate to other methods of analysis applied to the same data. This was designed so that one could see what a data scientist would do from soup to nuts when faced with a problem like the one presented here. Note that the overall focus of this blog is Artificial Neural Networks. More specifically,

  • Understand the basics of Artificial Neural Networks;

  • Know that several ANNs exist;

  • Learn how to fit and evaluate a Multi-layer Perceptron; and

  • Use machine learning to tune a Multi-layer Perceptron model.

What are Artificial Neural Networks?

Artificial neural networks mimic the neuronal makeup of the brain. These networks represent the complex sets of switches that transmit electric or chemical impulses. In fact, the network is so complex that each neuron may be connected to thousands of other neurons, and these connections are repeatedly activated. Repeated activation results in learning.

So how do we train an artificial neural network to recognize and distinguish two classes? The answer is repetition. We repeatedly pass data through the ANN and pass back information about the learning performance. The repeated learning and feedback results in true system learning, much like the brain.

Neural networks do have some typical components: (a) an input layer, (b) hidden layers (their number can range from zero to many), (c) an output layer, (d) weights and biases, and (e) an activation function.

Activation Function:

In an artificial neural network, the activation function serves the same task as the neuron does in the brain. For classification this is usually a sigmoid function, similar to the way the sigmoid function is used for classification in logistic regression. The sigmoid output moves from 0 toward 1 as x reaches and surpasses a certain value (in this case 0). Of course, other functions are also available. In fact, Scikit-Learn allows the tuning of a feed-forward neural network called the MLPClassifier (more on that later), and it offers four different activation functions when doing so:

  • ‘identity’, no-op activation, useful to implement linear bottleneck, returns f(x) = x

  • ‘logistic’, the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)). This is the sigmoid function discussed above.

  • ‘tanh’, the hyperbolic tan function, returns f(x) = tanh(x).

  • ‘relu’, the rectified linear unit function, returns f(x) = max(0, x)

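The embedded snippet from the original post is not reproduced here; as a rough sketch, the four activations can be written with NumPy as follows:

import numpy as np

def identity(x):
    return x                          # no-op: f(x) = x

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))   # sigmoid: f(x) = 1 / (1 + exp(-x))

def tanh(x):
    return np.tanh(x)                 # hyperbolic tangent

def relu(x):
    return np.maximum(0.0, x)         # rectified linear unit: f(x) = max(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(logistic(x))   # stays between 0 and 1 and equals 0.5 at x = 0
print(relu(x))       # negative inputs are clipped to 0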

Perceptron:

The activation functions (the analogue of neurons in the brain) are connected with each other through layers of nodes. Nodes are connected so that the output of one node serves as an input to another. The inputs a node receives are weighted, then summed, and the activation function is applied to the sum; the result becomes the node's output. Take a look at the definition of a perceptron below.

Node with Inputs:
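The original figure is not shown here; as a minimal sketch, a single node multiplies each input by a weight, sums the products, adds a bias, and applies the activation function (the input values and weights below are made up for illustration):

import numpy as np

def node_output(inputs, weights, bias):
    z = np.dot(inputs, weights) + bias   # weighted sum of inputs plus bias
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation of that sum

inputs  = np.array([0.5, -1.0, 2.0])     # hypothetical inputs to the node
weights = np.array([0.8,  0.2, -0.4])    # hypothetical weights
print(node_output(inputs, weights, bias=0.1))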

Bias:

Bias will change the sigmoid function in terms of when it turns on vis-a-vis the value of x. The example below shows that the activation function gets activated (i.e. moves toward 1) at a different value of x, which is caused by the bias.
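A small sketch of this effect (the bias value of 2 is arbitrary): adding a bias to the input shifts the point at which the sigmoid crosses 0.5, i.e. where the node "turns on".

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-4, 4, 9)
print(sigmoid(x))        # crosses 0.5 at x = 0
print(sigmoid(x + 2))    # with a bias of 2 the crossing shifts to x = -2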

In the Perceptron and Bias sections we talked about weights and biases. These two constructs largely determine the strength of a neural network's predictions. In fact, computing predicted values is called feedforward, while updating the weights and biases is called backpropagation. Backpropagation uses gradient descent to arrive at better weights and biases, as sketched below.

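The embedded example is not reproduced here; as a hedged sketch of the feedforward/backpropagation loop, the single-node model below repeatedly computes a prediction, measures the error, and nudges the weights and bias with gradient descent (the data and learning rate are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # one training example (hypothetical)
y = 1.0                          # its true class label
w = np.zeros(3)                  # weights start at zero
b = 0.0                          # bias starts at zero
lr = 0.1                         # learning rate for gradient descent

for _ in range(100):
    p = sigmoid(np.dot(w, x) + b)   # feedforward: compute the prediction
    error = p - y                   # backpropagation: log-loss error signal
    w -= lr * error * x             # gradient descent update of the weights
    b -= lr * error                 # gradient descent update of the bias

print(p)   # the prediction approaches the true label as training repeats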

Layers:

In neural networks, nodes can be connected in a myriad of different ways. The most basic arrangement is an input layer, a hidden layer, and an output layer. Layer 1 in the image below is the input layer, while layer 2 is a hidden layer. It is considered hidden because it is neither input nor output. Finally, layer 3 is the output layer.

Source: Adventures in Machine Learning

The structure of the layers determines the type of artificial neural network. The Asimov Institute published a really nice visual depicting a good cross-section of neural networks. The original article can be found at The Neural Network Zoo.

Multi-layer Perceptron:

In the next section, I will be focusing on the multi-layer perceptron (MLP), which is available from Scikit-Learn. Other types of neural networks require other libraries or platforms, such as Keras.

MLP is a relatively simple form of neural network because the information travels in one direction only: it enters through the input nodes and exits through the output nodes. This is referred to as forward propagation. Backpropagation, by contrast, is a training algorithm in which the values are fed forward, the error is calculated, and the error is then propagated back to the earlier layers. In other words, forward propagation is part of the backpropagation algorithm but happens before the error signals are propagated back from the output nodes.

The network does not even have to have a hidden layer. Because information only flows forward, MLP belongs to a group of artificial neural networks called feed-forward neural networks.

Predict Donations with Python:

As usual, load all required libraries and ingest data for analysis. The first step is to load all libraries and the charity data for classification. Note that I created three separate datasets: 1.) the original dataset with 21 variables, partitioned into train and test sets; 2.) a dataset that contains second order polynomials and interaction terms, also partitioned; and 3.) a dataset that contains third order polynomials and interaction terms, partitioned into train and test sets. Each dataset was standardized and the variables with VIF scores greater than 5 were removed. All datasets were pickled, and those pickles are called and loaded below. The pre-work described above can be seen by navigating to the Linear and Quadratic Discriminant Analysis blog.

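The original loading code is not reproduced here. As a sketch, the pickled train/test splits could be loaded roughly as follows; the file names are placeholders standing in for the pickles created in the LDA/QDA blog:

import pickle
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical pickle names for the standardized, VIF-filtered splits
with open('charity_train.pkl', 'rb') as f:
    X_train, y_train = pickle.load(f)
with open('charity_test.pkl', 'rb') as f:
    X_test, y_test = pickle.load(f)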

The first model we’ll fit will be untuned and will serve as a baseline to compare to when assessing the accuracy of tuned models.

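A minimal sketch of that baseline, assuming the train/test splits loaded above; default parameters are kept except for the iteration cap:

baseline = MLPClassifier(random_state=1, max_iter=1000)   # untuned, default settings
baseline.fit(X_train, y_train)
print('Baseline test accuracy:',
      accuracy_score(y_test, baseline.predict(X_test)))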

Multi-layer Perceptron parameters can be tuned automatically; we will tune them using GridSearchCV(). A list of tunable parameters can be found on the MLP Classifier page of Scikit-Learn. One issue to pay attention to is that the choice of solver influences which parameters can be tuned. As a result, I split the work into three grid searches with three different solver setups.

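The grids below are illustrative and not necessarily the values searched in the original post; note that learning_rate and momentum only apply to the sgd solver, while beta_1 and beta_2 only apply to adam, which is why the grids are split:

param_grid_any = {                     # let GridSearchCV choose among solvers
    'solver': ['lbfgs', 'sgd', 'adam'],
    'hidden_layer_sizes': [(10,), (50,), (100,)],
    'activation': ['logistic', 'tanh', 'relu'],
    'alpha': [0.0001, 0.001, 0.01],
}
param_grid_sgd = {                     # sgd-specific parameters
    'solver': ['sgd'],
    'hidden_layer_sizes': [(10,), (50,), (100,)],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'momentum': [0.5, 0.9],
}
param_grid_adam = {                    # adam-specific parameters
    'solver': ['adam'],
    'hidden_layer_sizes': [(10,), (50,), (100,)],
    'beta_1': [0.9, 0.95],
    'beta_2': [0.999, 0.9999],
}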

We now fit several models: there are three datasets (1st, 2nd, and 3rd degree polynomials) to try and three different solver setups to iterate with (the first grid includes all three solvers and asks GridSearchCV to pick the best one, while the second and third grids specify the sgd and adam solvers, respectively):

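A sketch of the fitting loop for one dataset; the 2nd and 3rd degree polynomial splits would be added to the datasets dictionary once loaded (their variable names are assumptions, not the original code):

datasets = {'original': (X_train, y_train)}   # add the polynomial splits here
grids = {'any_solver': param_grid_any,
         'sgd': param_grid_sgd,
         'adam': param_grid_adam}

results = {}
for data_name, (X_tr, y_tr) in datasets.items():
    for grid_name, grid in grids.items():
        search = GridSearchCV(MLPClassifier(random_state=1, max_iter=1000),
                              grid, cv=5, scoring='accuracy', n_jobs=-1)
        search.fit(X_tr, y_tr)                # exhaustive search over the grid
        results[(data_name, grid_name)] = search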

We can compute the accuracy associated with each of the models. The best model appears to be the one using the automatically selected solver based on the original training data. This model had a test accuracy of 90.6%.

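A sketch of the accuracy comparison, assuming the results dictionary built above (the 90.6% figure comes from the original post, not from this sketch):

for key, search in results.items():
    test_acc = accuracy_score(y_test, search.best_estimator_.predict(X_test))
    print(key, search.best_params_, round(test_acc, 3))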

Let us compute some additional accuracy statistics for the winning model of each of the three grid searches. An interesting dilemma appears, since the recall of the model using the adam solver is better:

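One way to obtain precision, recall, and F1 for each winning model is a classification report (again a sketch, assuming the objects defined above):

for key, search in results.items():
    print(key)
    print(classification_report(y_test, search.best_estimator_.predict(X_test)))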