# TINY MACHINE LEARNING LESSON 8

## TOPICS INDEX

- Warnings
- Copyright Notice
- Deep Learning and the Neural Network
- Activation functions of neurons
- Sigmoid function
- Smoothstep function
- Hyperbolic Tangent Function - Tanh
- ReLU function
- Definition of the Model to train for Hello World with Keras
- PAI-019: Hello World Model (Base)
- Training the Hello World Model (Basic)
- Graphical representation of training data
- PAI-020: Hello World Model (Improved)

## Warnings

As regards safety: the projects are powered at very low voltage, either from the USB port of the PC or from batteries or power supplies with a maximum output of 9 V, so there are no particular electrical risks. Note, however, that a short circuit caused during an exercise could damage the PC or the furnishings and, in extreme cases, even cause burns. For this reason, always assemble or modify a circuit with the power disconnected, and at the end of the exercise disconnect the circuit by removing both the USB cable to the PC and any batteries or external power connectors. Also for safety reasons, it is strongly recommended to carry out the projects on an insulating, heat-resistant mat, which can be purchased in any electronics store or on specialized websites.

At the end of the exercises it is advisable to wash your hands, as electronic components may carry processing residues that could cause harm if ingested or if they come into contact with eyes, mouth, skin, etc. Although the individual projects have been tested and found safe, those who decide to follow what is reported in this document assume full responsibility for whatever may happen while carrying out the exercises. For younger children and/or first experiences in the field of electronics, it is advisable to perform the exercises with the help and in the presence of an adult.

## Copyright Notice

*All trademarks are the property of their respective owners; third-party trademarks, product names, trade names, corporate names and companies mentioned may be trademarks owned by their respective owners or registered trademarks of other companies, and have been used for purely explanatory purposes and for the benefit of the owner, without any intention of violating the copyright in force. What is reported in this document is the property of Roberto Francavilla, and Italian and European laws on copyright are applicable to it; any texts taken from other sources are also protected by copyright and are the property of their respective owners. All the information and contents (texts, graphics, images, etc.) reported are, to the best of my knowledge, in the public domain. If, unintentionally, material subject to copyright or in violation of the law has been published, please notify **info@bemaker.org** by email and I will promptly remove it.*

## Roberto Francavilla

## Deep Learning and the Neural Network

Before continuing in the realization of our Machine Learning project called Hello World, I need to explain some theory, but, as I always say in these cases, do not worry, I will be as elementary as possible in the explanation (so much so that the super experts will turn up their noses …. but it doesn’t matter, I know they will understand).

First of all, let’s look at the term “**Deep Learning**” (in Italian, **Apprendimento Profondo**): learning that takes place sequentially over several levels, starting from the outermost level and gradually moving to the deeper ones. Each level is a “layer” made up of “neurons”. From this it is easy to see that, in general, the more layers there are, the better the model can learn to make precise predictions, up to a point beyond which adding layers no longer lets the model learn anything more.

The first layer of neurons is called the input layer, and the ones below it are called the hidden layers.

Let’s make a graphic representation of the above, using the example of a model for a weather station that has 6 input values: a first layer of 6 neurons, a second layer of 4 neurons and a last layer with a single neuron that emits the output, that is, the forecast: sun or rain.

This model can be represented as follows:

Each neuron receives an input and produces an output that feeds the neurons of the layer below.

By zooming in on the image of the neuron … let’s see what happens inside the neuron itself:

We see that an x value at the input of the neuron corresponds to a y value at its output, according to a certain function. This function is called the **neuron activation function** and is represented as a generic mathematical function y = f(x): for an input x, the neuron produces an output y. So basically an activation function maps the input to the output, and it helps a neural network learn complex relationships and patterns in data.

## Activation functions of neurons

In Machine Learning there are many activation functions, and researchers and scholars of the ML world keep proposing new ones, depending on the application. It is not my intention to explain how to create an “activation function”, because it is complex and requires university-level mathematics, but I will explain how to use those already defined for our projects.

Just for general knowledge and drawing from the immense online encyclopedia Wikipedia (of which I am a supporter and I also invite you to make donations) I will show you some examples of neuron activation functions.

## Sigmoid function

[Source: Wikipedia]

## Smoothstep function

[Source: Wikipedia]

## Hyperbolic Tangent Function - Tanh

[Source: Wikipedia]

## ReLU function

[Source: Wikipedia]

… and many others. Each of them has one or more uses in Machine Learning; for example, the Smoothstep function is normally used in models that deal with graphics processing.
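As a rough illustration of three of the functions listed above, here is a minimal sketch in plain Python. The function names and the sample inputs are my own choices for the example, and the formulas are the standard textbook definitions, not code taken from TensorFlow or Keras.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def smoothstep(x: float) -> float:
    """Smoothstep: 0 below 0, 1 above 1, and 3x^2 - 2x^3 in between."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return 3.0 * x**2 - 2.0 * x**3


def tanh(x: float) -> float:
    """Hyperbolic tangent: squashes any real number into the range (-1, 1)."""
    return math.tanh(x)


# Print a small table of values for a few sample inputs
for x in (-2.0, 0.0, 0.5, 2.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.3f}  "
          f"smoothstep={smoothstep(x):.3f}  tanh={tanh(x):.3f}")
```

Each function maps an input x to an output y = f(x), exactly as described for the neuron; only the shape of the curve changes.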

We, however, will focus on one, the simplest, among those seen above, namely the ReLU function.

ReLU stands for Rectified Linear Unit. When an activation function takes an input value (x) and returns the numerical value of a mathematical function, f(x), the model is said to have a regression activation function, or simply a **regression function**, or is even called a **regression model**.

The ReLU is a particular regression function: it returns 0 for values of x less than or equal to 0, and returns x itself for values of x greater than 0.

The graph below shows its graphical representation together with the table of values: for values of x from -10 to 0 (inclusive) the function returns 0, while for values of x greater than 0 it returns a value equal to x.
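The behaviour just described can be reproduced in a few lines of plain Python. This is only a sketch of the definition (the function name relu and the step of 5 in the table are my own choices), not the implementation that Keras uses internally.

```python
def relu(x: float) -> float:
    """Rectified Linear Unit: 0 for x <= 0, x itself for x > 0."""
    return max(0.0, x)


# Small table of values from -10 to 10 in steps of 5
for x in range(-10, 11, 5):
    print(f"x = {x:4d}   relu(x) = {relu(x):4.1f}")
```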

## Definition of the Model to train for Hello World with Keras

At this point we have the theoretical tools to understand how to create our model to train for the Hello World project.

So the steps are as follows:

- Fix the number of layers of neurons (usually you go by trial and error: you start with two layers and then increase until the result improves significantly)
- Insert the activation function in the neurons, which in our case will be a regression function (we will use the ReLU)
- Analyze the learning results to verify the need for corrective actions and, when appropriate, move on to the compilation to create the TinyML code.

For the realization of the model we will use another resource, which will be completely transparent to us because it is managed through the Colab Notebook. This resource is called **Keras**, and it is TensorFlow’s high-level **Application Programming Interface (API)** for building deep learning networks.

Note: The following is the result of the elaboration of the examples made available by TensorFlow and in particular at the link:

*https://www.tensorflow.org/lite/microcontrollers/get_started_low_level*

and at the link:

Released under the Apache 2.0 license

## PAI-019: Hello World Model (Base)

So, let’s start with our practical project.

Let’s resume our Seno_Data project.

Once the Seno_Data Notebook is loaded, we copy the cells and paste them into a new Notebook, after which we name the new Notebook as Seno_Function.

Let’s add another cell of code and write the following code:

    # We'll use Keras to create a simple model architecture
    from tensorflow.keras import layers
    model_1 = tf.keras.Sequential()

    # First layer takes a scalar input and feeds it through 16 "neurons". The
    # neurons decide whether to activate based on the 'relu' activation function.
    model_1.add(layers.Dense(16, activation='relu', input_shape=(1,)))

    # Final layer is a single neuron, since we want to output a single value
    model_1.add(layers.Dense(1))

    # Compile the model using a standard optimizer and loss function for regression
    model_1.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])

    # Print a summary of the model's architecture
    model_1.summary()

We analyze line by line what we are doing.

    from tensorflow.keras import layers

With the first line (above) we import the “layers” module of Keras, which provides the layers of neurons.

    model_1 = tf.keras.Sequential()

With the second line we tell Keras that the model we will build is of the sequential type, that is, a stack of layers in which the inputs are processed by the first layer and its outputs are passed, in sequence, to the next one.

    model_1.add(layers.Dense(16, activation='relu', input_shape=(1,)))

With this line we create the first layer and define it as “Dense”, meaning that every one of its 16 neurons is connected to every input of the layer (a layer of this type is in fact called “**fully connected**”); the activation function used is the ReLU and the input consists of a single scalar value. Note that the mathematical work behind this line is fully managed and carried out by Keras and TensorFlow, so we do not need to add anything beyond what is already written in the code.

    model_1.add(layers.Dense(1))

With the line of code above, we add a further layer consisting of a single neuron fully connected to the neurons of the previous layer. The neuron will therefore receive 16 inputs, one from each neuron of the previous layer.

As you can see, no activation function has been specified for this layer, so the neuron simply computes a weighted sum of the values it receives (plus a bias term) and passes it on unchanged; all of this happens transparently to our code. The result of this processing is the output.
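To make the path from input to output more concrete, here is a minimal sketch in plain Python of what a network of this shape computes: 16 neurons with ReLU, followed by one neuron without activation. The function predict and the randomly drawn weights are assumptions made only for illustration; in the real project the weights are found by Keras during training.

```python
import random


def relu(x: float) -> float:
    """ReLU activation: 0 for x <= 0, x otherwise."""
    return max(0.0, x)


def predict(x, w1, b1, w2, b2):
    """Forward pass: scalar input -> 16 ReLU neurons -> 1 linear neuron."""
    # First layer: each neuron scales the input, adds its bias, applies ReLU
    hidden = [relu(w * x + b) for w, b in zip(w1, b1)]
    # Final layer: weighted sum of the 16 hidden outputs plus a bias,
    # with no activation function applied
    return sum(w * h for w, h in zip(w2, hidden)) + b2


# Hypothetical, untrained weights (random values, for illustration only)
random.seed(0)
w1 = [random.uniform(-1.0, 1.0) for _ in range(16)]
b1 = [0.0] * 16
w2 = [random.uniform(-1.0, 1.0) for _ in range(16)]
b2 = 0.0

print(predict(1.5, w1, b1, w2, b2))
```

With untrained weights the printed value has no meaning; training is what turns this chain of sums into an approximation of the sine function.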

The complete cycle from the input (initial value) to the output (final value) is called, as we said in the previous lessons, “**inference**”. During training, TensorFlow runs these inference cycles on the training data (for the Hello World project we have prepared 1,000 input data in total) and, on the basis of the error measured on the outputs, automatically modifies the weights that each neuron applies to the values it receives.

    model_1.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])

The line of code above concerns the compilation. With the “optimizer” argument we specify the algorithm (‘rmsprop’) used to adjust the model during training, while ‘mse’ and ‘mae’ are two statistical functions: for the “loss” argument, the criterion is the mean squared error, while for “**metrics**” it is the mean absolute error. These functions are normally chosen by trial and error; however, for regression models such as ours, the parameters used in this line of code are the usual choice. For more information on these aspects, refer to the Keras documentation.
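To see what the two error measures actually compute, here is a minimal sketch in plain Python. The helper names and the four sample values are hypothetical, chosen only so that the arithmetic is easy to follow by hand; Keras computes these quantities internally.

```python
def mean_squared_error(expected, predicted):
    """Average of the squared differences (what loss='mse' measures)."""
    return sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected)


def mean_absolute_error(expected, predicted):
    """Average of the absolute differences (what metrics=['mae'] measures)."""
    return sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)


# Hypothetical values, for illustration only
expected = [0.0, 0.5, 1.0, 0.5]
predicted = [0.5, 0.5, 0.5, 0.0]

print(mean_squared_error(expected, predicted))   # 0.1875
print(mean_absolute_error(expected, predicted))  # 0.375
```

Both values are always positive; squaring penalizes large deviations more heavily, which is why the two numbers differ.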

    model_1.summary()

With the last line we ask Keras to print a summary of the characteristics of the model.
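The summary lists, among other things, the number of parameters (weights and biases) of each layer. As a sketch of where those numbers come from, assuming the usual rule for a fully connected layer (one weight per input per neuron, plus one bias per neuron), the count for model_1 can be worked out by hand; dense_params is a helper name introduced only for this example.

```python
def dense_params(inputs: int, units: int) -> int:
    """Parameters of a fully connected layer: one weight per
    (input, neuron) pair plus one bias per neuron."""
    return inputs * units + units


first_layer = dense_params(inputs=1, units=16)   # 16 weights + 16 biases
final_layer = dense_params(inputs=16, units=1)   # 16 weights + 1 bias

print(first_layer)                # 32
print(final_layer)                # 17
print(first_layer + final_layer)  # 49
```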

At this point we run the individual cells by clicking on the play symbol placed on each of them.

In the screenshot above, only the final results of the last cell are shown.

## Training the Hello World Model (Basic)

At this point we move on to training. Here too we will exploit the power of Keras to simplify the process: it is enough to call a simple “**fit(…)**” function, passing all the necessary parameters, to train the model.

Let’s see how to do it …

Add another cell of code to the Notebook and write the following code:

    # Train the model on our training data while validating on our validation set
    history_1 = model_1.fit(x_train, y_train, epochs=1000, batch_size=16,
                            validation_data=(x_validate, y_validate))

Let’s see in detail what we wrote:

We assign the output of the fit function applied to our model_1 to a variable named (arbitrarily) history_1. This variable is a Keras History object whose “history” attribute holds the values recorded at every epoch, that is, the entire training history, so it will be important to analyze it.

As parameters to the fit () function we insert the two arrays for training: x_train and y_train (I remind you that the two arrays have 600 elements each).

Then we establish the epochs, that is, the number of times the model will run through the whole training data. In this case it is 1,000 (so there will be 600 x 1,000, i.e. 600,000 inferences). The number of epochs comes from experience and experimentation: we must not set too low a number, otherwise the model does not “learn” well, and we must not set too high a number, otherwise we risk “overfitting” (that is, the model makes near-perfect predictions on the training data but is not capable of making good predictions on new input data). So it is advisable to start with a value of 1,000 and then adjust this number until there is no further improvement in learning by the model.

The next parameter is the batch_size, that is, the number of training data fed into the model in one batch; after the inference on each batch, the deviations of the forecasts from the expected training values are measured and the model is updated.

Here too, experience suggests using the number of neurons of the first layer as batch_size. In this way we will have, for the 600 training data, 38 moments of verification per epoch (600 / 16 = 37.5, rounded up to 38), each with an automatic update of the weights and biases of the model; considering that we have chosen 1,000 epochs, we will have 38,000 moments of potential improvement of the model. I would like to point out that batch_size = 16 is valid in this case, but it may not be a correct value for other models, so for this value too you need to make several attempts and experiment with the results before setting it definitively. For some models, especially when using more than 16 neurons in the first layer, the batch_size is given a value of 32.
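The arithmetic behind these figures can be checked with a few lines of plain Python. This is only a sketch using the numbers quoted in the text (600 training data, batch_size of 16, 1,000 epochs); the variable names are my own.

```python
import math

samples = 600      # number of training data
batch_size = 16    # data fed to the model per batch
epochs = 1000      # passes over the whole training set

# 600 / 16 = 37.5, so the last batch of each epoch is only partly filled
batches_per_epoch = math.ceil(samples / batch_size)
last_batch_size = samples - (batches_per_epoch - 1) * batch_size
total_updates = batches_per_epoch * epochs

print(batches_per_epoch)  # 38
print(last_batch_size)    # 8
print(total_updates)      # 38000
```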

Finally, the last parameter of the fit function, validation_data, is the validation data set.

This dataset is supplied to the model and processed by it for two reasons:

- to detect overfitting: after the training data, the model is asked to make predictions on the basis of new inputs
- to measure the deviations between the values forecast by the model during the training phase and the real, already validated results.

As you can see, the test data that we prepared in Lesson 7 are not used in this phase; in fact, it is not always necessary to use the test data for training (they are an additional set of “new” data, meant to avoid overfitting and to improve training).

At this point we launch the code contained in the cell by clicking on the play at the top left and the processing will appear which will take a few minutes:

At the end of the processing, our model was trained.

Let’s analyze the values generated during training and focus our attention on the evolution of the following values:

- **Loss**: the mean squared error (a particular mean whose values are always positive) that the model makes in its prediction based on the training values
- **Mae**: the mean absolute error (also always positive) that the model makes in its prediction based on the training values
- **Val_loss**: the mean squared error that the model makes in its prediction based on the validation values
- **Val_mae**: the mean absolute error that the model makes in its prediction based on the validation values

As can be seen, the “loss” value decreases substantially, as does the “mae”; this means that the model is learning to make more and more precise predictions. Let’s now compare the values of “mae” and “val_mae”: they are very similar, but those of “mae” are smaller than those of “val_mae”. This means that the model is learning to make better predictions with the training data than with the validation data, so the model is heading towards overfitting.

However, although our model has learned something and is starting to overfit, the error is still very high. In fact, an absolute error of 0.3 on a sine value that varies between -1 and +1 means an error of 30% of the maximum amplitude, and this is not acceptable to us. So let’s try to improve our model.
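The two checks made above (how large the error is relative to the sine amplitude, and how far the validation error sits above the training error) can be sketched in plain Python. The helper names are my own; the value 0.30 is the approximate error quoted in the text, while the val_mae value is hypothetical and used only to show the comparison.

```python
def relative_error_percent(mae: float, amplitude: float = 1.0) -> float:
    """Mean absolute error expressed as a percentage of the signal amplitude."""
    return 100.0 * mae / amplitude


def generalization_gap(mae: float, val_mae: float) -> float:
    """Positive when the model does better on training than on validation data."""
    return val_mae - mae


mae = 0.30      # approximate training error quoted in the text
val_mae = 0.31  # hypothetical validation error, for illustration only

print(f"error relative to amplitude: {relative_error_percent(mae):.0f}%")
print(f"validation minus training error: {generalization_gap(mae, val_mae):.2f}")
```

A small positive gap that keeps growing over the epochs is the sign of overfitting described in the text.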

Before moving on to improving the model, let’s see how to graphically represent the data that is stored during the training phase.

## Graphical representation of training data

Let’s add a code cell and copy the following code:

    # Draw a graph of the loss, which is the distance between
    # the predicted and actual values during training and validation.
    loss = history_1.history['loss']
    val_loss = history_1.history['val_loss']

    epochs = range(1, len(loss) + 1)

    plt.plot(epochs, loss, 'g.', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()

Let’s understand what we wrote. The variable history_1, which we defined in the previous cell with the training function, also stores the error values as it is filled, so with the lines:

    loss = history_1.history['loss']
    val_loss = history_1.history['val_loss']

we extract the error values.

With the line:

    epochs = range(1, len(loss) + 1)

we define the values for the horizontal axis, that is, the epoch numbers from 1 to 1000 (the “+ 1” is needed because range stops just before its upper limit).

Then with the commands plot, title, xlabel, ylabel and legend we print the graph that will show the errors as the model is trained.

At this point we launch the code contained in the cell and we obtain:

As you can see from the graph, already after the first 100 epochs the error stabilizes and the model no longer learns anything (overfitting occurs). To better understand the point at which the model stops learning, we are going to plot the error values after the first 100 epochs.

Let’s add another cell of code and write:

    # Exclude the first few epochs so the graph is easier to read
    SKIP = 100

    plt.plot(epochs[SKIP:], loss[SKIP:], 'g.', label='Training loss')
    plt.plot(epochs[SKIP:], val_loss[SKIP:], 'b.', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()

With the “SKIP” constant, used to slice the data, we simply skip the first 100 positions, so everything that happens from epoch 101 onwards is graphically represented:
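The effect of range and of slicing with SKIP can be checked without training anything. The sketch below uses a placeholder list of 1,000 zeros in place of the real loss history; that placeholder is an assumption made only for illustration.

```python
# Placeholder standing in for history_1.history['loss'] (one value per epoch)
loss = [0.0] * 1000

# Epoch numbers for the horizontal axis: 1, 2, ..., 1000
epochs = range(1, len(loss) + 1)
print(epochs[0], epochs[-1], len(epochs))  # 1 1000 1000

# Slicing with [SKIP:] drops the first 100 positions
SKIP = 100
print(epochs[SKIP:][0])  # 101
print(len(loss[SKIP:]))  # 900
```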

With this representation we are comparing the mean squared values of the error. If you wish to view the error in absolute terms, you need a few more lines of code, so add another cell and write the following code:

    # Draw a graph of mean absolute error, which is another way of
    # measuring the amount of error in the prediction.
    mae = history_1.history['mae']
    val_mae = history_1.history['val_mae']

    plt.plot(epochs[SKIP:], mae[SKIP:], 'g.', label='Training MAE')
    plt.plot(epochs[SKIP:], val_mae[SKIP:], 'b.', label='Validation MAE')
    plt.title('Training and validation mean absolute error')
    plt.xlabel('Epochs')
    plt.ylabel('MAE')
    plt.legend()
    plt.show()

Basically, this is the same code as in the previous cell, except that instead of loss and val_loss, the values of “mae” and “val_mae” are plotted. By clicking on play you get:

This graph represents the absolute value (always positive) of the error, given by the difference between the values predicted from the training data and the expected values, together with the absolute value of the error between the values predicted from the validation data and the expected ones. As you can see, after about epoch 600 the blue curve takes a slightly different direction from the curve with the green points and tends to stay above it. This means that the model essentially stops learning, as the error remains on average the same, and that the forecasts made from the training data have a slightly smaller error than those made from the validation data, i.e. the graph is highlighting overfitting. Furthermore, since the error has a value of about 0.3 against a real value of 1 (the absolute maximum of the sine wave), the model is making an error of about 30%.

In the data analysis we are doing, it is also advisable to make a direct comparison between the value predicted from the training input data and the actual value. We use the test data as the actual values for this analysis. So we add another cell of code and write the code below:

    # Use the model to make predictions from our training data
    predictions = model_1.predict(x_train)

    # Plot the predictions along with the test data
    plt.clf()
    plt.title('Training data predicted vs actual values')
    plt.plot(x_test, y_test, 'b.', label='Actual')
    plt.plot(x_train, predictions, 'r.', label='Predicted')
    plt.legend()
    plt.show()

The important line I wish to focus on is the following:

    predictions = model_1.predict(x_train)

In this line we obtain the forecasts made by our model: they are stored in a variable called “predictions”, and they are produced by the “predict” function applied to our model, “model_1”, on the basis of the input data in “x_train”.

The comparison values, on the other hand, are simply the pair “x_test” and “y_test”.

By clicking on play, you get:

This graph gives us very important information: as can be seen from the red curve (which is practically a straight line), the model is unable to reproduce the actual values (blue curve). This means that the model is unable to learn beyond a certain limit, that is, it does not have enough neurons.

In conclusion, two important things emerge from the analysis of the graphs above:

- The model fails to learn beyond a certain level
- After about 600 epochs it goes into overfitting

## PAI-020: Hello World Model (Improved)

On the basis of the previous considerations we are going to make those improvements required to ensure that the prediction of the model is as close as possible to a sinusoid.

First we take the Seno_Function Notebook and save a copy on Drive, to do this: File -> Save a Copy on Drive

A new notebook called “Copia_Seno_Function” will be opened, which we then rename “Seno_Function_2”.

At this point, the first action to improve the model is to add a new layer of neurons; indeed, since the prediction deviates a lot from the expected data, we add two new layers of neurons (it is as if we were giving the model more brain matter, i.e. a bigger brain!). To do this we go to the cell where we defined the model and replace its code with the following:

    # We'll use Keras to create a simple model architecture
    from tensorflow.keras import layers
    model_2 = tf.keras.Sequential()

    # First layer takes a scalar input and feeds it through 16 "neurons". The
    # neurons decide whether to activate based on the 'relu' activation function.
    model_2.add(layers.Dense(16, activation='relu', input_shape=(1,)))

    # The new second layer
    model_2.add(layers.Dense(16, activation='relu'))

    # The new third layer
    model_2.add(layers.Dense(16, activation='relu'))

    # Final layer is a single neuron, since we want to output a single value
    model_2.add(layers.Dense(1))

    # Compile the model using a standard optimizer and loss function for regression
    model_2.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])

    # Show a summary of the model
    model_2.summary()

As you can see, compared to the previous model, model_1, this new model, model_2, has two additional layers of 16 neurons each, defined like the previous ones, that is, with the lines:

    # The new second layer
    model_2.add(layers.Dense(16, activation='relu'))

    # The new third layer
    model_2.add(layers.Dense(16, activation='relu'))

As you can see, in the definition of these further layers it is not necessary to specify the input: since the layers are sequential, the outputs of the 16 neurons of one layer are automatically used by the model as the inputs of the next layer of 16 neurons.
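The way each layer takes its input size from the layer before it can be sketched in plain Python. The helper describe_stack is a name introduced only for this example, and the parameter count assumes the usual rule for fully connected layers (inputs x neurons weights, plus one bias per neuron); it is not a Keras function.

```python
def describe_stack(input_size, units_per_layer):
    """Return (inputs, units, parameters) for each fully connected
    layer of a sequential stack."""
    rows = []
    inputs = input_size
    for units in units_per_layer:
        rows.append((inputs, units, inputs * units + units))
        inputs = units  # the outputs of this layer feed the next one
    return rows


# model_2: one scalar input, three layers of 16 neurons, one output neuron
rows = describe_stack(1, [16, 16, 16, 1])
for number, (inputs, units, params) in enumerate(rows, start=1):
    print(f"layer {number}: {inputs:2d} inputs -> {units:2d} neurons, {params:3d} parameters")

print(sum(params for _, _, params in rows))  # 593
```

The two added layers account for most of the parameters, which is the “bigger brain” given to the model.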

At this point we modify the code in the following cells: instead of the model_1 variable we insert model_2, instead of history_1 we insert history_2, and we change the number of epochs from 1000 to 600.

For the above, the code in the various cells below becomes:

    # Train the model on our training data while validating on our validation set
    history_2 = model_2.fit(x_train, y_train, epochs=600, batch_size=16,
                            validation_data=(x_validate, y_validate))

__________________________________________________________

    # Draw a graph of the loss, which is the distance between
    # the predicted and actual values during training and validation.
    loss = history_2.history['loss']
    val_loss = history_2.history['val_loss']

    epochs = range(1, len(loss) + 1)

    plt.plot(epochs, loss, 'g.', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()

__________________________________________________________

    # Exclude the first few epochs so the graph is easier to read
    SKIP = 100

    plt.plot(epochs[SKIP:], loss[SKIP:], 'g.', label='Training loss')
    plt.plot(epochs[SKIP:], val_loss[SKIP:], 'b.', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()

__________________________________________________________

    # Draw a graph of mean absolute error, which is another way of
    # measuring the amount of error in the prediction.
    mae = history_2.history['mae']
    val_mae = history_2.history['val_mae']

    plt.plot(epochs[SKIP:], mae[SKIP:], 'g.', label='Training MAE')
    plt.plot(epochs[SKIP:], val_mae[SKIP:], 'b.', label='Validation MAE')
    plt.title('Training and validation mean absolute error')
    plt.xlabel('Epochs')
    plt.ylabel('MAE')
    plt.legend()
    plt.show()

__________________________________________________________

    # Use the model to make predictions from our training data
    predictions = model_2.predict(x_train)

    # Plot the predictions along with the test data
    plt.clf()
    plt.title('Training data predicted vs actual values')
    plt.plot(x_test, y_test, 'b.', label='Actual')
    plt.plot(x_train, predictions, 'r.', label='Predicted')
    plt.legend()
    plt.show()

__________________________________________________________

After making the changes, click on: Edit -> Clear all outputs

And then on: Runtime -> Restart and run all [you will be asked to confirm, click on YES]

The result is amazing:

Furthermore, I would like to point out that the error values for the forecasts made from the training data (green curve) are slightly higher than those for the validation data (blue curve); this means that overfitting has substantially disappeared.

The very last thing I would like you to observe, to conclude the lesson, is that a working self-learning model is built, as you have seen, basically by trial and error, on the basis of experimentation. So I leave you to your own insights and to any further improvements.