By Nicholas Tavares
Course: CPE 695 — Applied Machine Learning, Stevens Institute of Technology (Schaefer School of Engineering & Science)
Primary sources: Stevens Institute of Technology CPE 695 Lecture Slides (Instructor: Shucheng Yu)
Textbooks from the syllabus:
A comprehensive study companion covering core theory, definitions, formulas with proper notation, and key distinctions across Modules 1–8. Read this first, then test yourself with the practice exam.
Machine learning is about building systems that improve their performance on a task through experience, rather than being explicitly programmed for every scenario. Tom Mitchell's 1997 definition formalizes this: a program learns from experience E with respect to task T and performance measure P if its performance at T, as measured by P, improves with E. The three pillars of this definition are the task T, the experience E, and the performance measure P. Note that a specific model or hypothesis class H is something we choose when we design the system — it is not part of the formal definition itself.
ML is broadly split into supervised learning (labeled input–output pairs), unsupervised learning (no labels — the algorithm finds structure on its own), and reinforcement learning (an agent learns by receiving rewards or penalties). A separate distinction, orthogonal to that split, is between batch learning (train on the entire dataset at once) and online learning (update incrementally as new data arrives). Similarly, model-based learning fits explicit parameters (like regression coefficients), while instance-based learning stores training examples and makes predictions by comparing new inputs to stored ones (like k-NN).
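The instance-based idea can be made concrete with a minimal k-NN sketch. The toy data and the `knn_predict` helper are illustrative, not from the course material:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Instance-based prediction: store examples, compare at query time.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))
    labels = [label for _, label in nearest[:k]]
    return Counter(labels).most_common(1)[0][0]

# Two clusters: points near (0,0) labeled "a", points near (5,5) labeled "b"
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)))  # → a
print(knn_predict(train, (5.5, 5.5)))  # → b
```

There is no training step at all — the "model" is just the stored examples, which is exactly what distinguishes instance-based from model-based learning.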
Linear regression models the relationship between input features and a continuous target by fitting a linear function. Given features x₁, x₂, …, xₙ, the model predicts ŷ = w₀ + w₁x₁ + … + wₙxₙ. The weights w are chosen to minimize a cost function, most commonly the Mean Squared Error (MSE). Unlike most ML models, linear regression has a closed-form solution called the normal equation, which directly computes the optimal weights in one step by solving w = (XᵀX)⁻¹Xᵀy. This works well when the number of features is modest, but becomes expensive as the feature count grows, because inverting the n×n matrix XᵀX costs roughly O(n³).
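A minimal sketch of the normal equation on a made-up dataset where the true relationship is y = 2 + 3x:

```python
import numpy as np

# Fit y = 2 + 3x via the normal equation w = (XᵀX)⁻¹Xᵀy.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 + 3 * x
X = np.column_stack([np.ones_like(x), x])  # prepend a bias column for w0
w = np.linalg.inv(X.T @ X) @ X.T @ y       # one-step closed-form solve
print(np.round(w, 6))                      # recovers [2. 3.]
```

In practice a least-squares solver (e.g. `np.linalg.lstsq`) is preferred over explicit inversion for numerical stability; the explicit form here just mirrors the equation in the text.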
When the normal equation is too costly, we use gradient descent (GD) — an iterative optimization method that adjusts weights in the direction that reduces the cost. The learning rate α controls step size: too small and convergence is painfully slow, too large and the algorithm oscillates or diverges. Batch GD uses all training examples per update, Stochastic GD (SGD) uses one example per update (noisier but faster), and Mini-batch GD is the practical middle ground. Regularization adds a penalty to large weights: Ridge (L2) shrinks all coefficients toward zero but rarely zeroes them out, while Lasso (L1) can drive some coefficients exactly to zero, performing implicit feature selection.
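The same toy fit, this time with batch gradient descent; the data, learning rate, and iteration count are illustrative choices, not prescribed values:

```python
import numpy as np

# Batch gradient descent for linear regression with MSE cost.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 + 3 * x
X = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)
alpha = 0.1  # learning rate: too small is slow, too large diverges
for _ in range(5000):
    grad = (2 / len(y)) * X.T @ (X @ w - y)  # gradient of the MSE cost
    w -= alpha * grad                        # step against the gradient

print(np.round(w, 3))  # converges to [2. 3.]
```

Swapping the full-batch gradient for a single randomly chosen example per step would turn this into SGD; using a small random subset gives mini-batch GD.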
Logistic regression is a classification model, despite the name. It models the probability that an input belongs to class 1 by passing a linear combination of features through the sigmoid function σ(z) = 1/(1 + e⁻ᶻ). The output is always between 0 and 1, making it interpretable as a probability. The model is trained by minimizing cross-entropy loss (also called log loss), which penalizes confident wrong predictions heavily. A key advantage: with cross-entropy, the optimization landscape is convex, so gradient descent can find the global optimum. Unlike linear regression, there is no closed-form normal equation solution.
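A quick sketch of the sigmoid and the per-example cross-entropy loss (function names are illustrative):

```python
import math

def sigmoid(z):
    """Squash any real z into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def log_loss(y, p):
    """Cross-entropy for one example with true label y ∈ {0, 1}
    and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(sigmoid(0))                  # → 0.5 (the decision boundary)
print(round(log_loss(1, 0.9), 3))  # → 0.105 (confident and right: small loss)
print(round(log_loss(1, 0.1), 3))  # → 2.303 (confident and wrong: large loss)
```

The last two lines show the key property from the text: the loss grows sharply as a confident prediction moves to the wrong side.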
Once a classifier is trained, we need to measure how well it actually works. Accuracy is the simplest metric but can be dangerously misleading on imbalanced datasets (e.g., 99% negative class → always predicting negative gives 99% accuracy). Precision tells you how reliable your positive predictions are, while Recall tells you how many actual positives you caught. The F1 score is the harmonic mean of precision and recall — useful when you need a single balanced metric. The ROC curve plots True Positive Rate vs. False Positive Rate at all thresholds; a curve hugging the top-left corner is ideal, while one near the diagonal is no better than random guessing.
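These metrics follow directly from confusion-matrix counts; a minimal sketch (the counts are made up):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)  # how reliable are positive predictions?
    recall = tp / (tp + fn)     # how many actual positives were caught?
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# e.g. 8 true positives, 2 false positives, 4 false negatives
p, r, f = precision_recall_f1(8, 2, 4)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.8 0.667 0.727
```

Note the F1 of 0.727 sits between precision and recall but closer to the smaller one — the harmonic mean punishes imbalance between the two.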
Decision trees are hierarchical, rule-based models that recursively split the data on feature values to create regions that are as pure (homogeneous) as possible. At each internal node, the algorithm picks the feature and threshold that best separates the data according to some impurity measure. ID3 uses information gain (based on entropy) to choose splits, while CART uses Gini impurity. Training is greedy — each split is locally optimal, but there is no guarantee the overall tree structure is globally optimal.
Trees are used for both classification (discrete targets) and regression (continuous targets). A fully-grown tree will memorize the training data (overfit), so we use pruning to cut back branches that don't improve generalization. Reduced error pruning removes subtrees and checks if validation performance improves. Chi-squared pruning tests whether a split is statistically significant. Entropy of 0 means a node is perfectly pure (all one class), while maximum entropy means the classes are evenly mixed.
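The impurity measures above can be computed directly from class counts; a small sketch (helper names are illustrative):

```python
import math

def entropy(counts):
    """Shannon entropy of a class distribution, in bits."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c)

def gini(counts):
    """Gini impurity: chance that two random draws disagree on class."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(entropy([5, 5]))         # → 1.0 (evenly mixed: maximum for 2 classes)
print(entropy([10, 0]))        # → 0.0 (perfectly pure node)
print(round(gini([5, 5]), 2))  # → 0.5
```

ID3 would pick the split that most reduces `entropy` (information gain); CART would pick the split that most reduces `gini`.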
Every model's prediction error can be decomposed into three components: bias (systematic error from wrong assumptions — the model is too simple), variance (sensitivity to fluctuations in the training set — the model is too complex), and irreducible noise (inherent randomness in the data). This is the bias-variance tradeoff: as model complexity increases, bias decreases but variance increases. The sweet spot is where total error is minimized. K-fold cross-validation helps find this sweet spot by repeatedly splitting the data into K folds, training on K−1 folds and validating on the remaining one. After selecting the best hyperparameters, you retrain on all training data and test once on the held-out test set.
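The K-fold splitting scheme can be sketched in a few lines (index-based, and assuming n is divisible by K for simplicity):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, val_indices) pairs for K-fold cross-validation.
    Assumes n is divisible by k; each example validates exactly once."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, val

for train, val in k_fold_splits(6, 3):
    print(val)  # each fold serves as the validation set once
# → [0, 1] then [2, 3] then [4, 5]
```

In practice you would shuffle (or stratify) the indices first; the sketch keeps them ordered so the fold boundaries are easy to see.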
Ensemble methods combine multiple weak learners to build a stronger predictor. Bagging (bootstrap aggregating) trains multiple models on random subsets sampled with replacement, then averages their predictions. This reduces variance and is especially effective for unstable learners like decision trees (Random Forest is bagging + random feature subsets). Pasting is the same idea but samples without replacement. Boosting works sequentially: each new model focuses on the mistakes of the previous ones, gradually reducing bias. Stacking trains a meta-model (blender) to learn how to combine the outputs of several base learners optimally.
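Bagging's two ingredients, bootstrap resampling and vote aggregation, can be sketched as follows (helper names are illustrative):

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement — bagging's resampling step.
    Some examples repeat, others are left out entirely."""
    return [rng.choice(data) for _ in data]

def bagged_predict(predictions):
    """Aggregate base-learner outputs by majority vote
    (for regression you would average instead)."""
    return max(set(predictions), key=predictions.count)

rng = random.Random(0)
sample = bootstrap_sample([1, 2, 3, 4, 5], rng)
print(len(sample))                      # same size as the original dataset
print(bagged_predict(["a", "b", "a"]))  # → a (majority of three learners)
```

Switching `rng.choice` to sampling without replacement would turn this into pasting; Random Forest additionally restricts each tree to a random subset of features at every split.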
SVMs find the optimal separating hyperplane that maximizes the margin between classes. The margin is the distance between the decision boundary and the nearest data points from each class — these nearest points are called support vectors. A larger margin typically leads to better generalization. For data that isn't linearly separable, the kernel trick maps inputs into a higher-dimensional space where a linear separator can be found, without explicitly computing the transformation. Common kernels include polynomial and RBF (Gaussian). The hyperparameter C controls the tradeoff: large C penalizes violations heavily (tighter margin, risk of overfitting), small C allows more violations (wider margin, more regularization).
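A sketch of the RBF kernel and the hinge loss that underlies soft-margin SVM training; the gamma value and example points are illustrative:

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: similarity in an implicit
    infinite-dimensional feature space, never computed explicitly."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def hinge_loss(y, score):
    """Margin violation cost for label y ∈ {-1, +1}:
    zero once y·score ≥ 1 (safely outside the margin)."""
    return max(0.0, 1 - y * score)

print(rbf_kernel((0, 0), (0, 0)))  # → 1.0 (identical points)
print(hinge_loss(+1, 2.0))         # → 0.0 (correct, outside the margin)
print(hinge_loss(+1, 0.5))         # → 0.5 (correct but inside the margin)
```

The hyperparameter C from the text scales how heavily these hinge violations count against the margin-maximization objective.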
Bayes' theorem is the foundation: it lets us update our belief about a hypothesis after observing evidence. P(h|D) is the posterior probability of hypothesis h given data D; P(D|h) is the likelihood; P(h) is the prior. The MAP (Maximum A Posteriori) approach picks the single hypothesis with the highest posterior probability. The Bayes optimal classifier goes further — it considers all hypotheses, weighted by their posterior probabilities, making it the best possible classifier (but often intractable). Naive Bayes makes the simplifying assumption that all features are conditionally independent given the class, which makes computation tractable and works surprisingly well in practice. The Gibbs algorithm randomly selects a hypothesis according to the posterior distribution; its expected error is at most twice that of the Bayes optimal classifier.
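A toy naive Bayes posterior computation, assuming conditional independence of two word features; all the probabilities below are made up for illustration, not course data:

```python
# P(h|D) ∝ P(D|h)·P(h): multiply prior by each feature likelihood
# (the "naive" conditional-independence step), then normalize.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {  # P(word present | class)
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham":  {"offer": 0.1, "meeting": 0.6},
}

def posterior(words):
    scores = {c: priors[c] for c in priors}
    for c in scores:
        for w in words:
            scores[c] *= likelihoods[c][w]   # naive independence assumption
    total = sum(scores.values())             # evidence P(D)
    return {c: s / total for c, s in scores.items()}

post = posterior(["offer"])
print(round(post["spam"], 3))  # 0.28 / (0.28 + 0.06) ≈ 0.824
```

This is the MAP idea in miniature: the class with the larger posterior (here "spam") wins; the normalization by the evidence doesn't change which class that is.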
Neural networks are composed of layers of interconnected nodes (neurons). The simplest form is the perceptron — a single-layer model that can only learn linearly separable patterns (it famously fails on XOR). Adding hidden layers creates a multilayer perceptron (MLP), which can represent much more complex functions. The universal approximation theorem says that with enough neurons in a single hidden layer, an MLP can approximate any bounded continuous function to arbitrary accuracy; any Boolean function can be represented exactly with a single hidden layer, and arbitrary functions can be approximated with two hidden layers.
Training uses backpropagation, which efficiently computes gradients by applying the chain rule backward through the network. The sigmoid activation σ(z) = 1/(1+e⁻ᶻ) was historically popular because it's smooth and differentiable, with the elegant derivative σ′(z) = σ(z)(1−σ(z)). For binary classification output layers, sigmoid is typical; for multiclass, softmax is used to produce a probability distribution. Regression networks and classification networks use different loss functions (e.g., MSE for regression, cross-entropy for classification). Small initial weights are preferred with sigmoid to avoid saturating in flat regions where gradients vanish.
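The derivative identity σ′(z) = σ(z)(1−σ(z)) can be checked numerically, a useful sanity test when implementing backpropagation by hand:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

z = 0.3
s = sigmoid(z)
analytic = s * (1 - s)  # the closed-form derivative σ(z)(1 − σ(z))

# Central finite difference as an independent numerical check
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(abs(analytic - numeric) < 1e-8)  # → True: the identity holds
```

This kind of gradient check is the standard way to verify a hand-written backprop implementation before trusting it on a full network.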