Mathematical and Computational Prerequisites

Introduction to Deep Learning

Part of the book series: Undergraduate Topics in Computer Science (UTICS)

Abstract

The mathematical part starts with a review of functions, derivatives, vectors and matrices, covering all the prerequisites for understanding gradient descent and calculating gradients by hand. The chapter also provides an overview of basic probability concepts, as deep learning today (as opposed to the historical approach) is mainly perceived as calculating either conditional probabilities or probability distributions. The following section gives a brief overview of logic and Turing machines, aimed at a better understanding of the XOR problem and memory-based architectures. Threshold logic gates are only briefly touched upon and placed in the context of a metatheory for deep learning. The remainder of the chapter is a quick introduction to Python, the language used in the examples in the book. The introduction to Python presented here is sufficient to understand all the code in the book.

Notes

  1. Notice that they also have the same number of members or cardinality, namely 2.

  2. The counting starts with 0, and we will use this convention throughout the book.

  3. The traditional definition uses sets to define tuples, tuples to define relations and relations to define functions, but that is an overly logical approach for our needs in the present volume. This definition provides a much wider class of entities to be considered functions.

  4. A function with n arguments is called an n-ary function.

  5. The ReLU or rectified linear unit, defined by \(\rho (x) = \max (x,0)\), is an example of a function that is continuous even though it is (usually) defined by cases. We will be using ReLU extensively from Chap. 6 onwards.
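
     A minimal sketch of this in Python (the book's language for examples); the function name relu is ours, just for illustration:

       # ReLU returns x for positive inputs and 0 otherwise.
       # Despite being defined by cases, it is continuous at 0.
       def relu(x):
           return max(x, 0.0)

       print(relu(-3.0))  # 0.0
       print(relu(2.5))   # 2.5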

  6. This is why \(0.999\dots 9\), with any finite number of nines, never equals 1, even though it can come arbitrarily close.

  7. This is especially true in programming, since when we program we need to approximate functions over the real numbers by functions over the rational numbers. This approximation also goes a long way in terms of intuition, so it is good to keep it in mind when trying to figure out how a function will behave.

  8. With the exception of division where the divisor is 0. In this case, the division function is undefined, and therefore the notion of continuity has no meaning at this point.

  9. Rational functions are of the form \(\frac{f(x)}{g(x)}\), where f and g are polynomial functions.

  10. The process of finding derivatives is called ‘differentiation’.

  11. Which is a 0-ary function, i.e. a function that gives the same value regardless of the input.

  12. The chain rule in Lagrange notation is clumsier and lacks the intuitive similarity with fractions: \(h'(x)=f'(g(x))g'(x)\).

  13. Keep in mind that \(h(x)=g(f(x))=(g\circ f)(x)\), i.e. \(h(x)=g(u)\) where \(u=f(x)\), which means that h is the composition of the functions g and f. It is very important not to mix up a composition of functions like \(f(x)=(3-2x)^5\) with an ordinary function like \(f(x)=3-2x^5\), or with a product like \(f(x)=\sin x \cdot x^5\).
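
     For instance, differentiating the composition \(f(x)=(3-2x)^5\) with the chain rule, taking \(g(u)=u^5\) as the outer function and \(u=3-2x\) as the inner one:

     \(f'(x)=5(3-2x)^4\cdot \frac{d}{dx}(3-2x)=5(3-2x)^4\cdot (-2)=-10(3-2x)^4\)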

  14. These rules are not independent, since both ChainExp and Exp are a consequence of ChainRule.

  15. We deliberately avoid talking about fields here, since we only use \(\mathbb {R}\) and there is no reason to complicate the exposition.

  16. One for each dimension.

  17. A minimal subset such that a property P holds is a subset (of some larger set) no proper subset of which still satisfies P.

  18. Matrix subtraction works in exactly the same way, only with subtraction instead of addition.

  19. To get the actual minimum value f(x), we just need to plug in the minimizing x and calculate f(x).

  20. In the case of multiple dimensions, we shall do the same calculation for every pair of \(x_i\) and \(\nabla _i f(\mathbf {x})\).

  21. Note that a function can have many local minima or minimal points, but only one global minimum value. Gradient descent can get ‘stuck’ in a local minimum, but our example has only one local minimum, which is also the global minimum.

  22. We stop simply because we consider it to be ‘good enough’; there is no mathematical reason for stopping here.
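
     A minimal Python sketch of this procedure on the one-dimensional function \(f(x)=x^2\) (an illustrative stand-in, not the chapter's example); the names f, grad_f and learning_rate are ours:

       # Gradient descent on f(x) = x**2, whose derivative is 2x.
       def f(x):
           return x ** 2

       def grad_f(x):
           return 2 * x

       x = 5.0              # starting point
       learning_rate = 0.1  # step size
       for step in range(100):
           x = x - learning_rate * grad_f(x)  # step against the gradient
           if abs(grad_f(x)) < 1e-6:          # stop when 'good enough'
               break

       print(x, f(x))  # x ends up close to the minimum at 0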

  23. This book is available online for free at https://www.probabilitycourse.com/.

  24. Properties are called features in machine learning, while in statistics they are called variables. This can be quite confusing, but it is standard terminology.

  25. Note that the mean is equally useless for describing the first four members and the last member taken in isolation.

  26. The sequence can be sorted in ascending or descending order; it does not matter.

  27. This is the ‘official’ name for the mean, median and mode.

  28. Not 5 on one die or the other, but 5 as in when you need to roll a 5 in \(\text {Monopoly}^{\circledR }\) to buy that last street you need to start building houses.

  29. In \(6^2\), the 6 denotes the number of values on each die, and the 2 denotes the number of dice used.
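
     A quick Python check of this counting argument (an illustrative sketch): enumerating all \(6^2=36\) ordered outcomes of two dice and keeping those that sum to 5 leaves exactly 4 of them, so the probability is 4/36.

       # Enumerate all ordered outcomes of rolling two six-sided dice.
       outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]
       favourable = [o for o in outcomes if sum(o) == 5]
       print(len(outcomes))                     # 36, i.e. 6**2
       print(len(favourable))                   # 4: (1,4), (2,3), (3,2), (4,1)
       print(len(favourable) / len(outcomes))   # 0.111..., i.e. 4/36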

  30. What we have called ‘basic probabilities’ here are actually called priors in the literature, and we will refer to them as such in later chapters.

  31. All machine learning algorithms are estimators.

  32. Note that ideally we would like an estimator to be a perfect predictor of the future in all cases, but this would amount to having foresight. Scientifically speaking, we have models and we try to make them as accurate as possible, but perfect prediction is simply not on the table.

  33. ‘Disjoint’ means \(A\cap B = \emptyset \).

  34. There are others, but they are in disguise.

  35. A version of Bayes’ original manuscript is available at http://www.stat.ucla.edu/history/essay.pdf.

  36. This is not exactly how it behaves, but it is a simplification which is more than enough for our needs.

  37. Examples of text editors are Notepad, Vim, Emacs, Sublime, Notepad++, Atom, Nano, cat and many others. Feel free to experiment and find the one you like most (most are free). You might have heard of so-called IDEs or Integrated Development Environments. They are basically text editors with additional functions. Some IDEs you might know of are Visual Studio, Eclipse and PyCharm. Unlike text editors, most IDEs are not freely available, but there are free and trial versions, so you can experiment with them before buying. Remember, there is nothing essential an IDE can do that a text editor cannot, but IDEs do offer additional conveniences. My personal preference is to use Vim.

  38. Never call this an ‘if-loop’, since it is simply wrong.

  39. In programming jargon, when we say ‘the syntax is the same’ or ‘you can use a similar syntax’, we mean that you should reproduce the same style but with the new values or objects.

  40. Note that even though the name we assign to a library is arbitrary, there are standard abbreviations used in the Python community. Examples are np for Numpy, tf for TensorFlow, pd for Pandas and so on. This is important to know, since on StackOverflow you might find a solution without its import statements. So if the solution has np somewhere in it, it means that you should have a line which imports Numpy under the name np.
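
     For example (a minimal illustration, assuming NumPy is installed):

       import numpy as np  # the community-standard abbreviation

       a = np.array([1, 2, 3])
       print(np.mean(a))   # 2.0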

  41. In Python, technically speaking, every function returns something. If no return command is issued, the function will return None, which is a special Python keyword for ‘nothing’. This is a subtle point, but also the cause of many intermediate-level bugs, and therefore it is worth noting now.
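
     A small illustration of this behaviour (the function name greet is made up for the example):

       def greet(name):
           print("Hello, " + name)  # prints, but has no return statement

       result = greet("Ada")
       print(result)  # None, because greet returns nothing explicitly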

  42. In Python 3, this is no longer exactly that list, but this is a minor issue at this stage of learning Python. What you need to know is that you can count on it to behave exactly like that list.

  43. Notice that the code as it stands does not exhibit this problem, but it is still a bug: the problem would arise if the room temperature turned out to be an odd number rather than an even number, as it is now.

  44. JSON stands for JavaScript Object Notation, and JSONs (i.e. Python dictionaries) are referred to as objects in JavaScript.
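
     A small sketch of this correspondence using Python's standard json module (illustrative, not code from the book):

       import json

       settings = {"temperature": 21, "unit": "C"}  # a Python dictionary
       as_json = json.dumps(settings)               # serialise it to a JSON string
       print(as_json)                               # {"temperature": 21, "unit": "C"}
       print(json.loads(as_json)["unit"])           # back to a dictionary: 'C'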

References

  1. J.R. Hindley, J.P. Seldin, Lambda-Calculus and Combinators: An Introduction (Cambridge University Press, Cambridge, 2008)

  2. G.S. Boolos, J.P. Burgess, R.C. Jeffrey, Computability and Logic (Cambridge University Press, Cambridge, 2007)

  3. P. Renteln, Manifolds, Tensors, and Forms: An Introduction for Mathematicians and Physicists (Cambridge University Press, Cambridge, 2013)

  4. R. Courant, F. John, Introduction to Calculus and Analysis, vol. 1 (Springer, New York, 1999)

  5. S. Axler, Linear Algebra Done Right (Springer, New York, 2015)

  6. P.N. Klein, Coding the Matrix (Newtonian Press, London, 2013)

  7. H. Pishro-Nik, Introduction to Probability, Statistics, and Random Processes (Kappa Books Publishers, Blue Bell, 2014)

  8. D.P. Bertsekas, J.N. Tsitsiklis, Introduction to Probability (Athena Scientific, Nashua, 2008)

  9. S.M. Stigler, Laplace’s 1774 memoir on inverse probability. Stat. Sci. 1, 359–363 (1986)

  10. A. Hald, Laplace’s Theory of Inverse Probability, 1774–1786 (Springer, New York, 2007), pp. 33–46

  11. W. Rautenberg, A Concise Introduction to Mathematical Logic (Springer, New York, 2006)

  12. D. van Dalen, Logic and Structure (Springer, New York, 2004)

  13. A.M. Turing, On computable numbers, with an application to the Entscheidungsproblem. Proc. Lond. Math. Soc. 42(2), 230–265 (1936)

Author information

Corresponding author: Sandro Skansi

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Skansi, S. (2018). Mathematical and Computational Prerequisites. In: Introduction to Deep Learning. Undergraduate Topics in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-73004-2_2

  • DOI: https://doi.org/10.1007/978-3-319-73004-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73003-5

  • Online ISBN: 978-3-319-73004-2
