Principles from Theory and Fundamentals¶
Machine Learning (ML) is a field of study that uses algorithms to learn patterns from data. An Artificial Neural Network (ANN) is a type of computer architecture that processes data in order to solve a task without being explicitly programmed on how to solve it (see basics of ML and AI.ipynb:Table 1)
An ANN is said to learn $(L)$ from experience $(E)$ for a specific task $(T)$, based on a Performance Measure $(P)$, if $P(T)$ improves as $E$ increases, where $E$, Experience $\Rightarrow$ the acquisition of new data representative of $T$ (Data)
$\therefore$ $AI$ := an ML-optimized ANN capable of high $P(T)$ for a specific Task
Where$\quad$$\quad$$L$ := optimization of network output with respect to the loss function
$\quad$$\quad$$T$ is some specific algorithmic process
Note $\because$ $T$ is a specific algorithmic process only $\Rightarrow$ modern (2024) AI systems are not (yet) capable of general optimization $\&\therefore$ AI $\neq$ General AI or General Intelligence, but rather very specific, task-focused systems
The Challenges to AI¶
Despite all the marvelous implementations of AI technology, AI adoption and implementation still face some serious challenges.
1. Semantics
Modern AI systems currently (2024) cannot handle ambiguity, subjectivity, or common-sense (cultural/social sense) oriented tasks, i.e., non-algorithmic tasks
2. Power Consumption
Total energy consumption in general is:
TOTAL ENERGY CONSUMPTION = ENERGY FOR TRAINING + ENERGY FOR INFERENCE
$$E_{total} = ( P_{hardware}\cdot t_{train}) + ( P_{inference}\cdot t_{inference}\cdot N_{inferences}) $$
$\quad$where,
$\quad\quad P_{hardware}$ = Power consumption of the hardware during training
$\quad\quad t_{train}$ = time taken to train the model
$\quad\quad P_{inference}$ = Power consumption per inference
$\quad\quad t_{inference}$ = time for a single inference
$\quad\quad N_{inferences}$ = total number of inferences performed
For modern (2024) large-scale, real-world application systems (AlphaGo, AlphaFold, ChatGPT $\dots$), $E_{total}$ accumulates at a rate of hundreds of kilowatt-hours per hour of continuous electricity use $\equiv$ an industrial manufacturing factory, but only the size of a large office. This does not include the electricity required to power the cooling and supporting infrastructure.
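As a worked illustration of the formula above, here is a minimal Python sketch; all power and time figures are hypothetical placeholders chosen for readability, not measurements of any real system.

```python
# Worked example of the E_total formula above.
# All numbers are hypothetical placeholders, not real measurements.

P_hardware = 300_000.0      # W : average power draw of the training hardware
t_train = 720.0             # h : training time (~1 month of continuous training)

P_inference = 400.0         # W : power draw of the serving hardware per inference
t_inference = 0.5 / 3600.0  # h : 0.5 s per single inference, converted to hours
N_inferences = 10_000_000   # total number of inferences performed

E_train = P_hardware * t_train                      # Wh spent on training
E_infer = P_inference * t_inference * N_inferences  # Wh spent on inference
E_total = E_train + E_infer                         # Wh

print(f"Training  : {E_train / 1000:,.0f} kWh")
print(f"Inference : {E_infer / 1000:,.0f} kWh")
print(f"Total     : {E_total / 1000:,.0f} kWh")
```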
3. Training Time
For large AI models, training can take days to months $\Rightarrow$ training is an energy-expensive process $\therefore$ live updating (or even a minimum daily update cadence) is difficult and expensive
4. Training Data and Storage
Large AI models require huge amounts of highly specific and structured data (ranging from terabytes to petabytes), spanning as much of the experience $(E)$ space as possible for a given task $(T)$. The storage of this data is no easy task either.
The Basic Cell¶
The fundamental unit of any ANN = Basic Cell = one-node Feed Forward Network (FFN) = Perceptron (Figure 1). A perceptron is the simplest type of neural network, inspired by how neurons in the brain work: information flows in one direction, from input to output, without loops. Modern ANN architectures are considered fully connected, yet the introduction of layers $\rightarrow$ organization = structure $\rightarrow$ modularity and reduced connectivity $\rightarrow$ higher-order functions $\Rightarrow$ the emergence of apparent intelligence in the system.

A perceptron alone is best suited to solving linearly separable classification problems, that is, problems where the data can be divided into two classes using a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions), e.g. identifying spam by keywords in emails. This is because the perceptron computes a weighted sum of inputs, adds a bias, and passes the result through an activation function (e.g., a step function), which gates the activation in a binary process. This simple mechanism is sufficient to separate data into two groups, but for more complex data multi-layer perceptrons (MLPs) or deep neural networks (DNNs) are needed. With the addition of multiple (tens to hundreds of) layers it is possible to progressively extract higher-level features from the raw input, e.g. the early layers in an image-recognition DNN might detect edges, while deeper layers might identify shapes, objects, or faces.
Figure 1 – Schematic representation of a Perceptron, equivalent to a single-node feed-forward ANN.
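To make the mechanism concrete, below is a minimal sketch of a single perceptron with a step activation applied to a linearly separable toy problem (a logical AND); the weights and bias are hand-picked for illustration rather than learned.

```python
# Minimal perceptron sketch: weighted sum + bias -> step activation.
# Weights/bias are hand-picked for a linearly separable toy task (logical AND).

def step(z):
    """Classical perceptron activation: binary output."""
    return 1 if z >= 0 else 0

def perceptron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias  # net input
    return step(z)                                          # gated activation

weights = [1.0, 1.0]   # one weight per input feature
bias = -1.5            # shifts the decision boundary

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(x, weights, bias))   # only (1, 1) fires
```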
ANN Structure and Organization : Inference Layer¶
An Artificial Neural Network (ANN) is a simplified computational model designed to mimic the structure and function of neural networks in the human brain. Unlike their biological counterparts, ANNs are implemented in software as organized arrangements of interconnected nodes, often referred to as cells, which are structured into distinct layers (Figure 2). These layers facilitate the flow of information through the network, enabling complex computations.
Fjodor van Veen (2016, 2017) of the Asimov Institute has curated an extensive collection of diagrammatic representations of various ANN architectures and their functions, based on primary literature sources. His work highlights that most modern neural networks are constructed using only a handful of fundamental cell types, which are combined in diverse ways to produce the wide array of architectures we use today.
Figure 2 – A schematic representation of a generic ANN architecture represented in code
The connection between a presynaptic node in one layer and a postsynaptic node in the subsequent layer is often represented using index notation. This notation specifies the particular relationship (a brief indexing sketch in code follows the list below):
$i^{th}$ Connection,
of $j^{th}$ node (post-synaptic),
in $L^{th}$ Layer (post-synaptic),
from $k^{th}$ node in $(L-1)$ Layer (pre-synaptic)
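A small, hypothetical indexing sketch may help map this notation onto code; the layer sizes, the use of NumPy, and the choice to store one weight matrix per layer transition are illustrative assumptions, not part of the figure above.

```python
import numpy as np

# Hypothetical indexing sketch: weights stored per layer as matrices,
# so w[L][j, k] is the weight from pre-synaptic node k in layer L-1
# to post-synaptic node j in layer L.

layer_sizes = [3, 4, 2]   # e.g. 3 inputs, 4 hidden nodes, 2 outputs
rng = np.random.default_rng(0)

# Random real-valued initialization, one weight matrix per layer transition.
w = {L: rng.standard_normal((layer_sizes[L], layer_sizes[L - 1]))
     for L in range(1, len(layer_sizes))}

print(w[1].shape)   # (4, 3): 4 post-synaptic nodes, each fed by 3 pre-synaptic nodes
print(w[1][2, 0])   # weight into node j=2 of layer L=1, from node k=0 of layer L-1
```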
The signal strength, $S$, of connection $i$ to cell $j$ in layer $L$ is the product of the input feature $(\theta^{L}_{i,j})$ into the $i^{th}$ connection to the $j^{th}$ cell in the $L^{th}$ layer AND the connection weight $(w^{L}_{i,j})$. This weight determines the relative importance of the connection. Upon initialization of the artificial neural network (ANN), these weights are typically assigned random real values (positive, negative, or zero). $$S_{i} = \theta_{i} \cdot w_{i} \quad\quad\quad (1)$$
A Perceptron aggregates all incoming inputs into a weighted sum, also known as the net input, $(z)$. This aggregation includes an activation bias, $b$, a constant whose value depends on the initialization strategy and the optimization process, and which allows the network to adjust the output independently of the input. The bias term helps prevent the perceptron from being stuck at output values like zero and can enhance flexibility in modeling complex data.
$$z = \sum_{i=1}^{n} S_{i} + b_{j} = \sum_{i=1}^{n} \theta_{i}w_{i} + b_{j} \quad\quad\quad (2)$$
The net input is transformed into an output value through an Activation Function $(F)$. In the classical perceptron introduced by Frank Rosenblatt (1958), the activation function was a step function, producing discrete outputs (e.g., binary classifications like 0 or 1). While step functions work well for simple tasks, modern neural networks, which evolved from perceptrons, use more sophisticated activation functions such as sigmoid, tanh, or ReLU (Rectified Linear Unit). These functions enable the network to model more complex relationships between inputs and outputs, improving its ability to approximate real-world data. $$ F(z) = a \quad\quad\quad\quad (3)$$
Where $a$ is the output of the activation function. During inference this output is either passed to subsequent nodes in the next layer or used as the network’s final output. During the training phase it is evaluated in the update layer of the network.
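Putting equations (1)–(3) together, here is a minimal sketch of the inference step for a single node; the sigmoid used as $F$ is just one common choice of activation function, not the one prescribed above.

```python
import math

# Forward (inference) pass of a single node, following equations (1)-(3).
# The sigmoid is used here as an example activation function F.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def node_forward(theta, w, b, F=sigmoid):
    """theta: input features, w: connection weights, b: activation bias."""
    S = [t * wi for t, wi in zip(theta, w)]   # (1) signal strength per connection
    z = sum(S) + b                            # (2) net input: weighted sum + bias
    a = F(z)                                  # (3) activation output
    return a

a = node_forward(theta=[0.2, 0.7, 0.1], w=[0.5, -0.3, 0.8], b=0.1)
print(a)   # passed to the next layer, or used as the network's final output
```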
ANN Structure and Organization : Update Layer¶
During the training phase of a neural network, input data is propagated through the perceptron, and its performance is evaluated at the output. The optimization process, central to machine learning, enables the perceptron to independently update its connection weights $(w_{i,j})$ using input data $X = (\vec{x}, y) = ((\theta_1, \dots, \theta_n), y)$. This is achieved through iterative adjustments aimed at minimizing the error, as defined by the loss function $\phi$.
The optimization process improves the network’s performance measure, $P(T)$, by applying a learning algorithm that descends the gradient of the loss function, $\nabla\phi(w,b)$. This gradient descent approach guides the network toward optimal predictions. The step size of each adjustment is determined by the learning rate $(\eta)$, which regulates the rate of convergence:
$\quad$ Smaller steps $(\eta)$ : Enable finer adjustments, leading to higher accuracy but slower convergence.
$\quad$ Larger steps $(\eta)$ : Speed up training but risk overshooting and reduced accuracy.
The weight update rule, which lies at the heart of the training loop, is expressed as:
$$w^{*}_{i,j} = w_{i,j} - \eta \, \frac{\partial \phi}{\partial w_{i,j}} \quad\quad\quad (4)$$
Here $w^{*}_{i,j}$ represents the updated weight, and $\phi$ quantifies the error between the network’s predictions and the target outcomes. By iteratively minimizing $\phi$, the network becomes increasingly adept at generating accurate predictions.
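As a rough illustration of this training loop, below is a minimal sketch of gradient-descent weight updates for a single sigmoid node with a squared-error loss; the choices of activation $F$ and loss $\phi$ are illustrative assumptions, since this article deliberately leaves them unspecified.

```python
import math

# Minimal training-loop sketch: gradient descent on a single sigmoid node with
# a squared-error loss. F and phi are illustrative choices, not prescribed above.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_node(data, w, b, eta=0.1, epochs=100):
    """data: list of ((theta_1, ..., theta_n), y) pairs; eta: learning rate."""
    for _ in range(epochs):
        for theta, y in data:
            z = sum(t * wi for t, wi in zip(theta, w)) + b   # net input (eq. 2)
            a = sigmoid(z)                                    # activation (eq. 3)
            # d(phi)/dz for phi = 0.5 * (a - y)^2 with a sigmoid activation
            delta = (a - y) * a * (1.0 - a)
            # weight update rule (eq. 4): w* = w - eta * d(phi)/dw
            w = [wi - eta * delta * t for t, wi in zip(theta, w)]
            b = b - eta * delta                               # bias update
    return w, b

# Toy usage: learn the logical AND function from its four examples.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_node(data, w=[0.0, 0.0], b=0.0, eta=0.5, epochs=2000)
print([round(sigmoid(sum(t * wi for t, wi in zip(x, w)) + b), 2) for x, _ in data])
```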
Keynote¶
In this article the Activation Function, $(F)$, and the loss function, $(\phi)$, were not discussed in any detail. A great deal of the complexity and intricacy surrounding the training and inference processes depends on these functions, so they will be covered in more detail at another time.
Sources¶
Acknowledgements¶
A big thank you to everyone at OpenAI; ChatGPT was super helpful in brainstorming, editing, and preliminary research.