6 Math Foundations to Start Learning Machine Learning

For Data Scientists, machine learning is the arsenal we use to do our job. I am pretty sure that, these days, everyone employed as a Data Scientist uses machine learning to analyze their data and produce valuable patterns. So why do we need to learn math for machine learning? There are a few arguments I could give, including:

  • Math helps you select the correct machine learning algorithm. Understanding the math gives you insight into how the model works and helps you choose the right model parameters and validation strategies.
  • Estimating how confident we are in a model’s result, by producing the right confidence intervals and uncertainty measurements, requires an understanding of math.
  • Choosing the right model means weighing many aspects, such as metrics, training time, model complexity, number of parameters, and number of features, and understanding all of these requires math.
  • Knowing the math behind a machine learning model lets you develop a customized model that fits your own problem.

The main problem is: which math subjects do you need to understand machine learning? Math is a vast field, after all. That is why, in this article, I want to outline the math subjects you need for machine learning and a few important points for getting started with each of them.

Machine Learning Math

We could learn many topics within math, but if we want to focus on the math used in machine learning, we need to narrow it down. In this case, I like to use the math references explained in the book Mathematics for Machine Learning by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021.

Their book lays out the math foundations that are important for machine learning. The math subjects are:

Image created by Author

Six math subjects form the foundation for machine learning. The subjects are intertwined, and together they let us develop a machine learning model and reach the “best” model for generalizing from the dataset.

Let’s dive deeper into each subject to see what it is.

Linear Algebra

What is Linear Algebra? It is a branch of mathematics concerned with the study of vectors and certain rules for manipulating them. When we formalize intuitive concepts, the common approach is to construct a set of objects (symbols) and a set of rules to manipulate these objects. This is what we know as algebra.

When we talk about Linear Algebra in machine learning, it is defined as the part of mathematics that uses vector spaces and matrices to represent linear equations.

When talking about vectors, people might flash back to high school, where a vector was an arrow with a direction, just like the image below.

Geometric Vector (Image by Author)

This is a vector, but not the kind of vector discussed in Linear Algebra for machine learning. Instead, the image below shows the kind we will talk about.

Vector 4×1 Matrix (Image by Author)

What we have above is also a vector, just another kind. You might be familiar with the matrix form (the image below). This vector is a matrix with only one column, which is known as a column vector. In other words, we can think of a matrix as a group of column vectors or row vectors. In summary, vectors are special objects that can be added together and multiplied by scalars to produce another object of the same kind. Many different objects can be called vectors.

Matrix (Image by Author)
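To make the two vector rules concrete, here is a minimal sketch in plain Python. The helper names (add, scale, columns) and the numbers are my own illustration, not something from the book, and in real work you would usually reach for a numerical library instead of plain lists.

```python
# Hypothetical helpers for illustration; plain Python lists stand in for vectors.

def add(u, v):
    """Add two vectors of the same length, element by element."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a + b for a, b in zip(u, v)]

def scale(c, v):
    """Multiply every component of the vector v by the scalar c."""
    return [c * a for a in v]

def columns(matrix):
    """Split a matrix (given as a list of rows) into its column vectors."""
    return [list(col) for col in zip(*matrix)]

x = [1, 2, 3, 4]      # a 4x1 column vector, written flat
y = [10, 20, 30, 40]

vector_sum = add(x, y)    # [11, 22, 33, 44], still a vector
scaled = scale(2, x)      # [2, 4, 6, 8], still a vector

A = [[1, 2],
     [3, 4],
     [5, 6]]              # a 3x2 matrix
A_columns = columns(A)    # [[1, 3, 5], [2, 4, 6]], two column vectors
```

Both operations return an object of the same kind as their inputs, which is exactly the property that makes something a vector.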

Linear algebra is a systematic representation of data that computers can understand, and all the operations in linear algebra are systematic rules. That is why linear algebra is such an important subject in modern machine learning.

One example of how linear algebra is used is the linear equation. Linear algebra is the tool of choice here because so many problems can be expressed systematically in a linear way. A typical linear equation is presented in the form below.

Linear Equation (Image by Author)

To solve the linear equation problem above, we use linear algebra to write the equations in a systematic representation. This way, we can use the properties of the matrix to look for the solution.

Linear Equation in Matrix Representation (Image by Author)
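As a rough sketch of what solving a system in matrix form looks like, the snippet below runs Gaussian elimination on A x = b. The solve function and the 2×2 example system are assumptions I made up for illustration; they are not the equations from the image.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Build the augmented matrix [A | b] on copies so the inputs stay untouched.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x

# Example system (made up for illustration):
#   2*x1 + 1*x2 = 5
#   1*x1 + 3*x2 = 10
A = [[2, 1],
     [1, 3]]
b = [5, 10]
solution = solve(A, b)    # x1 = 1, x2 = 3
```

Writing the coefficients as the matrix A and the right-hand side as the vector b is what lets one generic routine handle any system of this shape.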

To summarize the Linear Algebra subject, there are three terms you might want to learn more about as a starting point:

  • Vector
  • Matrix
  • Linear Equation

Analytic Geometry (Coordinate Geometry)

Analytic geometry is the study of the position of data (points) using ordered pairs of coordinates. It is concerned with defining and representing geometrical shapes numerically and with extracting numerical information from those definitions and representations. In simpler terms, we project the data onto a plane and read numerical information from there.

Cartesian Coordinate (Image by Author)

Above is an example of how we acquire information from data points by projecting the dataset onto a plane. How we acquire information from this representation is the heart of Analytic Geometry. To help you start learning this subject, here are some important terms you might need.

  • Distance Function

A distance function is a function that provides numerical information about the distance between the elements of a set. If the distance is zero, the elements are equivalent; otherwise, they are different from each other.

An example of a distance function is the Euclidean distance, which calculates the straight-line distance between two data points.

Euclidean Distance Equation (Image by Author)

  • Inner Product

The inner product is an operation that lets us define intuitive geometrical concepts, such as the length of a vector and the angle or distance between two vectors. It is often denoted as ⟨x,y⟩ (or occasionally (x,y) or ⟨x|y⟩).
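The two terms above can be sketched in a few lines of Python. I am assuming the ordinary dot product as the inner product here (inner products are more general than that), and the function names and sample points are mine, chosen only for illustration.

```python
import math

def dot(x, y):
    """Dot product, used here as the inner product <x, y>."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    """Length of a vector, derived from the inner product."""
    return math.sqrt(dot(x, x))

def euclidean_distance(x, y):
    """Straight-line distance: the length of the difference vector."""
    return norm([a - b for a, b in zip(x, y)])

def angle(x, y):
    """Angle between two non-zero vectors, in radians."""
    cos_theta = dot(x, y) / (norm(x) * norm(y))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cos_theta)))

p, q = [0, 0], [3, 4]
distance = euclidean_distance(p, q)    # 5.0, the classic 3-4-5 triangle
same_point = euclidean_distance(q, q)  # 0.0, so the elements are equivalent
right_angle = angle([1, 0], [0, 1])    # pi / 2 radians
```

Notice that length, distance, and angle are all built from the single dot function, which is why the inner product is treated as the starting point.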

Matrix Decomposition

Matrix Decomposition is the study of ways to reduce a matrix into its constituent parts. It aims to simplify complex matrix operations by performing them on the decomposed matrices rather than on the original matrix.

A common analogy for matrix decomposition is factoring numbers, such as factoring 8 into 2 × 4. This is why matrix decomposition is also called matrix factorization. There are many ways to decompose a matrix, so there is a range of different matrix decomposition techniques. An example is the LU Decomposition in the image below.

LU Decomposition (Image by Author)
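Here is a minimal sketch of an LU decomposition, assuming the Doolittle variant without row swaps (so it fails on a zero pivot, which library implementations avoid by pivoting). The function names and the 2×2 matrix are made up for illustration and are not taken from the image.

```python
def lu_decompose(A):
    """Doolittle LU decomposition without pivoting, so that A = L * U."""
    n = len(A)
    # L starts as the identity matrix; U starts as all zeros.
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Fill row i of the upper-triangular factor U.
        for j in range(i, n):
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        if U[i][i] == 0:
            raise ZeroDivisionError("zero pivot; this sketch does no row swaps")
        # Fill column i of the lower-triangular factor L, below the diagonal.
        for r in range(i + 1, n):
            partial = sum(L[r][k] * U[k][i] for k in range(i))
            L[r][i] = (A[r][i] - partial) / U[i][i]
    return L, U

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

A = [[4, 3],
     [6, 3]]
L, U = lu_decompose(A)      # L = [[1, 0], [1.5, 1]], U = [[4, 3], [0, -1.5]]
rebuilt = matmul(L, U)      # multiplying the factors gives A back
```

Just like 2 × 4 gives back 8, multiplying L and U gives back the original matrix, while the triangular factors are easier to work with than A itself.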

Vector Calculus

Calculus is the mathematical study of continuous change, built mainly on functions and limits. Vector calculus is concerned with the differentiation and integration of vector fields. Vector calculus is often called multivariate calculus, although the two have slightly different scopes: multivariate calculus deals with the calculus of functions of multiple independent variables.

There are a few important terms I feel people need to know when they start learning Vector Calculus:

  • Derivative and Differentiation

The derivative of a function of real numbers measures the change in the function value (output value) with respect to a change in its argument (input value). Differentiation is the act of computing a derivative.

Derivative Equation (Image by Author)

  • Partial Derivative

The partial derivative is the derivative of a function of several variables with respect to one of those variables, where that variable is allowed to vary and the other variables are held constant (as opposed to the total derivative, in which all variables are allowed to vary).

  • Gradient

The gradient is closely related to the derivative, or the rate of change of a function; you might consider gradient a fancy word for derivative. The term gradient is typically used for functions with several inputs and a single (scalar) output. It collects the partial derivatives with respect to each input into a vector, and that vector points in the direction in which the function increases fastest from the current location.
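The three terms above can be approximated numerically, which I find helps the intuition. This is only a sketch using central finite differences; the function names, the step size h, and the example functions are my own assumptions, and the results are approximations rather than exact derivatives.

```python
def derivative(f, x, h=1e-6):
    """Approximate f'(x) with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

def gradient(f, point, h=1e-6):
    """Approximate the gradient: one partial derivative per input."""
    grad = []
    for i in range(len(point)):
        # Vary only input i and hold every other input constant.
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad

def square(x):
    return x ** 2

def bowl(p):
    x, y = p
    return x ** 2 + 3 * y ** 2

slope = derivative(square, 3.0)           # close to 6, since d/dx x^2 = 2x
bowl_grad = gradient(bowl, [1.0, 2.0])    # close to [2, 12], i.e. [2x, 6y]
```

Each entry of bowl_grad is a partial derivative, and the whole list is the gradient at the point (1, 2).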

Probability and Distribution

Probability is, loosely speaking, the study of uncertainty. Probability can be thought of as the fraction of times an event occurs, or as the degree of belief about an event’s occurrence. A probability distribution is a function that measures the probability that a particular outcome (or set of outcomes) associated with a random variable will occur. A common probability distribution function is shown in the image below.

Normal Distribution Probability Function (Image by Author)
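For readers who prefer code to formulas, the normal density can be sketched as below. The function name, its default parameters, and the crude numerical integration at the end are my own choices for illustration.

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution with the given mean and std."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

peak = normal_pdf(0.0)       # highest density sits at the mean
left = normal_pdf(-1.0)
right = normal_pdf(1.0)      # equal to left: the curve is symmetric

# A crude Riemann sum over [-8, 8]; the total area should be very close to 1.
step = 0.001
area = sum(normal_pdf(-8 + i * step) * step for i in range(16000))
```

The area check reflects the defining property of any probability density: the probabilities of all possible outcomes add up to one.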

Probability theory and statistics are often presented together, but they concern different aspects of uncertainty:

  • In probability theory, we define a model of some process in which random variables capture the underlying uncertainty, and we use the rules of probability to derive what happens.

  • In statistics, we observe that something has happened and try to figure out the underlying process that explains the observations.

When we talk about machine learning, it is close to statistics because its goal is to construct a model that adequately represents the process that generated the data.
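A small simulation can illustrate the two directions. In this sketch the "true" process, its parameters, the seed, and the sample size are all made-up assumptions; the point is only the contrast between generating data from a known model and estimating the model from data.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Probability view: we know the process (a normal distribution with these
# made-up parameters) and use it to generate outcomes.
true_mean, true_std = 5.0, 2.0
observations = [rng.gauss(true_mean, true_std) for _ in range(10_000)]

# Statistics view: we only see the observations and try to recover
# the process that produced them.
estimated_mean = statistics.mean(observations)
estimated_std = statistics.stdev(observations)
```

With enough observations the estimates land near the true parameters, which is the same thing we ask of a machine learning model fitted to data.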

Optimization

Training a machine learning model is all about finding a good set of parameters. What we consider “good” is determined by the objective function or the probabilistic model. This is what optimization algorithms are for: given an objective function, we try to find the best value.

Commonly, the goal in machine learning is to minimize the objective function, which means the best value is the minimum value. Intuitively, finding the best value is like finding the valleys of the objective function. The gradient points us uphill, so we move downhill (opposite to the gradient) and hope to find the lowest (deepest) point. This is the concept of gradient descent.

Gradient Descent (Image by Author)
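The idea of walking downhill can be sketched in a few lines. The objective (x - 3)², the starting point, the learning rate, and the number of steps are assumptions I picked so the example is easy to follow; real objective functions are rarely this friendly.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient and record the visited points."""
    x = start
    path = [x]
    for _ in range(steps):
        x = x - learning_rate * grad(x)  # move opposite to the gradient
        path.append(x)
    return x, path

def objective(x):
    """A simple valley whose lowest point is at x = 3."""
    return (x - 3) ** 2

def objective_grad(x):
    """Derivative of the objective: positive to the right of 3, negative to the left."""
    return 2 * (x - 3)

minimum, path = gradient_descent(objective_grad, start=0.0)
# minimum ends up very close to 3, the bottom of the valley
```

Each step shrinks the remaining distance to the minimum, so the objective value goes down along the whole path.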

There are a few terms that make a good starting point when learning optimization:

  • Local Minima and Global Minima

The point at which a function takes its smallest value overall is called the global minimum. However, when we minimize a function with an optimization algorithm such as gradient descent, the function may have low points at several different locations. Points that are lower than everything in their neighborhood, but are not where the function actually takes its smallest value, are called local minima.

Local and Global Minima (Image by Author)

  • Unconstrained Optimization and Constrained Optimization

Unconstrained Optimization is an optimization problem in which we find the minimum of a function under the assumption that the parameters can take any possible value (no parameter limitation). Constrained Optimization simply limits the possible values by introducing a set of constraints.

Gradient descent is an unconstrained optimization if there is no parameter limitation. If we set some limit, for example, x > 1, it becomes a constrained optimization.
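Both ideas can be seen on one made-up function. In this sketch the quartic x⁴ - 3x² + x, the starting points, the learning rate, and the bound x ≥ 1.5 are all my own assumptions. The constraint is handled by clipping x back into the allowed range after each step, which is one simple way to respect a bound (often called projected gradient descent), not the only one.

```python
def f(x):
    """A made-up function with two valleys of different depth."""
    return x ** 4 - 3 * x ** 2 + x

def f_grad(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def descend(grad, start, learning_rate=0.01, steps=2000, lower_bound=None):
    """Gradient descent; if lower_bound is set, clip x back into x >= lower_bound."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * grad(x)
        if lower_bound is not None:
            x = max(x, lower_bound)  # project back onto the allowed range
    return x

# Unconstrained: the answer depends on where we start.
from_right = descend(f_grad, start=2.0)    # settles near x = 1.13 (a local minimum)
from_left = descend(f_grad, start=-2.0)    # settles near x = -1.30 (the global minimum)

# Constrained to x >= 1.5: neither valley is allowed, so we stop at the bound.
bounded = descend(f_grad, start=2.0, lower_bound=1.5)
```

Starting on the right, gradient descent gets stuck in the shallower valley even though a deeper one exists, and once the bound is added the best allowed point is the boundary itself.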

Conclusion

Machine learning is an everyday tool that Data Scientists use to obtain the valuable patterns we need. Learning the math behind machine learning can give you an edge in your work. There are many math subjects out there, but six matter the most when we start learning machine learning math:

  • Linear Algebra
  • Analytic Geometry
  • Matrix Decomposition
  • Vector Calculus
  • Probability and Distribution
  • Optimization

If you are starting to learn math for machine learning, you could read my other article to avoid the common study pitfalls. I also list math material you might want to check out in that article.

 

By: Cornellius Yudha Wijaya

Source: 6 Math Foundations to Start Learning Machine Learning | by Cornellius Yudha Wijaya | Towards Data Science

Critics:

Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.

Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the “signal” or “feedback” available to the learning system:

  • Supervised learning: The computer is presented with example inputs and their desired outputs, given by a “teacher”, and the goal is to learn a general rule that maps inputs to outputs.
  • Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
  • Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent). As it navigates its problem space, the program is provided feedback that’s analogous to rewards, which it tries to maximize.
