By Nicholas Tavares
Course: CPE 695 — Applied Machine Learning, Stevens Institute of Technology (Schaefer School of Engineering & Science)
Primary sources: Stevens Institute of Technology CPE 695 Lecture Slides (Instructor: Shucheng Yu)
Textbooks from the syllabus:
A comprehensive study companion covering core theory, definitions, formulas with proper notation, and key distinctions across Modules 1–8. Read this first, then test yourself with the practice exam.
Machine learning is about building systems that improve their performance on a task through experience, rather than being explicitly programmed for every scenario. Tom Mitchell's 1997 definition formalizes this: a program learns from experience E with respect to task T and performance measure P if its performance at T, as measured by P, improves with E. The three pillars of this definition are the task T, the experience E, and the performance measure P. Note that a specific model or hypothesis class H is something we choose when we design the system — it is not part of the formal definition itself.
ML is broadly split into supervised learning (labeled input–output pairs), unsupervised learning (no labels — the algorithm finds structure on its own), and reinforcement learning (an agent learns by receiving rewards or penalties). A separate distinction, orthogonal to that split, is between batch learning (train on the entire dataset at once) and online learning (update incrementally as new data arrives). Similarly, model-based learning fits explicit parameters (like regression coefficients), while instance-based learning stores training examples and makes predictions by comparing new inputs to stored ones (like k-NN).
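The instance-based idea can be made concrete with a minimal k-NN sketch. The toy data and the `knn_predict` helper are illustrative, not from the course material:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Instance-based prediction: store examples, compare at query time.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))
    labels = [label for _, label in nearest[:k]]
    return Counter(labels).most_common(1)[0][0]

# Two clusters: points near (0,0) labeled "a", points near (5,5) labeled "b"
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)))  # → a
print(knn_predict(train, (5.5, 5.5)))  # → b
```

There is no training step at all — the "model" is just the stored examples, which is exactly what distinguishes instance-based from model-based learning.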
Linear regression models the relationship between input features and a continuous target by fitting a linear function. Given features x₁, x₂, …, xₙ, the model predicts ŷ = w₀ + w₁x₁ + … + wₙxₙ. The weights w are chosen to minimize a cost function, most commonly the Mean Squared Error (MSE). Unlike most ML models, linear regression has a closed-form solution called the normal equation, which directly computes the optimal weights in one step by solving w = (XᵀX)⁻¹Xᵀy. This works well when the number of features is modest, but becomes expensive as the feature count grows, because inverting the n×n matrix XᵀX costs roughly O(n³).
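A minimal sketch of the normal equation on a made-up dataset where the true relationship is y = 2 + 3x:

```python
import numpy as np

# Fit y = 2 + 3x via the normal equation w = (XᵀX)⁻¹Xᵀy.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 + 3 * x
X = np.column_stack([np.ones_like(x), x])  # prepend a bias column for w0
w = np.linalg.inv(X.T @ X) @ X.T @ y       # one-step closed-form solve
print(np.round(w, 6))                      # recovers [2. 3.]
```

In practice a least-squares solver (e.g. `np.linalg.lstsq`) is preferred over explicit inversion for numerical stability; the explicit form here just mirrors the equation in the text.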
When the normal equation is too costly, we use gradient descent (GD) — an iterative optimization method that adjusts weights in the direction that reduces the cost. The learning rate α controls step size: too small and convergence is painfully slow, too large and the algorithm oscillates or diverges. Batch GD uses all training examples per update, Stochastic GD (SGD) uses one example per update (noisier but faster), and Mini-batch GD is the practical middle ground. Regularization adds a penalty to large weights: Ridge (L2) shrinks all coefficients toward zero but rarely zeroes them out, while Lasso (L1) can drive some coefficients exactly to zero, performing implicit feature selection.
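The same toy fit, this time with batch gradient descent; the data, learning rate, and iteration count are illustrative choices, not prescribed values:

```python
import numpy as np

# Batch gradient descent for linear regression with MSE cost.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 + 3 * x
X = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)
alpha = 0.1  # learning rate: too small is slow, too large diverges
for _ in range(5000):
    grad = (2 / len(y)) * X.T @ (X @ w - y)  # gradient of the MSE cost
    w -= alpha * grad                        # step against the gradient

print(np.round(w, 3))  # converges to [2. 3.]
```

Swapping the full-batch gradient for a single randomly chosen example per step would turn this into SGD; using a small random subset gives mini-batch GD.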
Logistic regression is a classification model, despite the name. It models the probability that an input belongs to class 1 by passing a linear combination of features through the sigmoid function σ(z) = 1/(1 + e⁻ᶻ). The output is always between 0 and 1, making it interpretable as a probability. The model is trained by minimizing cross-entropy loss (also called log loss), which penalizes confident wrong predictions heavily. A key advantage: with cross-entropy, the optimization landscape is convex, so gradient descent can find the global optimum. Unlike linear regression, there is no closed-form normal equation solution.
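A quick sketch of the sigmoid and the per-example cross-entropy loss (function names are illustrative):

```python
import math

def sigmoid(z):
    """Squash any real z into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def log_loss(y, p):
    """Cross-entropy for one example with true label y ∈ {0, 1}
    and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(sigmoid(0))                  # → 0.5 (the decision boundary)
print(round(log_loss(1, 0.9), 3))  # → 0.105 (confident and right: small loss)
print(round(log_loss(1, 0.1), 3))  # → 2.303 (confident and wrong: large loss)
```

The last two lines show the key property from the text: the loss grows sharply as a confident prediction moves to the wrong side.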
Once a classifier is trained, we need to measure how well it actually works. Accuracy is the simplest metric but can be dangerously misleading on imbalanced datasets (e.g., 99% negative class → always predicting negative gives 99% accuracy). Precision tells you how reliable your positive predictions are, while Recall tells you how many actual positives you caught. The F1 score is the harmonic mean of precision and recall — useful when you need a single balanced metric. The ROC curve plots True Positive Rate vs. False Positive Rate at all thresholds; a curve hugging the top-left corner is ideal, while one near the diagonal is no better than random guessing.
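These metrics follow directly from confusion-matrix counts; a minimal sketch (the counts are made up):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)  # how reliable are positive predictions?
    recall = tp / (tp + fn)     # how many actual positives were caught?
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# e.g. 8 true positives, 2 false positives, 4 false negatives
p, r, f = precision_recall_f1(8, 2, 4)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.8 0.667 0.727
```

Note the F1 of 0.727 sits between precision and recall but closer to the smaller one — the harmonic mean punishes imbalance between the two.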
Decision trees are hierarchical, rule-based models that recursively split the data on feature values to create regions that are as pure (homogeneous) as possible. At each internal node, the algorithm picks the feature and threshold that best separates the data according to some impurity measure. ID3 uses information gain (based on entropy) to choose splits, while CART uses Gini impurity. Training is greedy — each split is locally optimal, but there is no guarantee the overall tree structure is globally optimal.
Trees are used for both classification (discrete targets) and regression (continuous targets). A fully-grown tree will memorize the training data (overfit), so we use pruning to cut back branches that don't improve generalization. Reduced error pruning removes subtrees and checks if validation performance improves. Chi-squared pruning tests whether a split is statistically significant. Entropy of 0 means a node is perfectly pure (all one class), while maximum entropy means the classes are evenly mixed.
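The impurity measures above can be computed directly from class counts; a small sketch (helper names are illustrative):

```python
import math

def entropy(counts):
    """Shannon entropy of a class distribution, in bits."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c)

def gini(counts):
    """Gini impurity: chance that two random draws disagree on class."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(entropy([5, 5]))         # → 1.0 (evenly mixed: maximum for 2 classes)
print(entropy([10, 0]))        # → 0.0 (perfectly pure node)
print(round(gini([5, 5]), 2))  # → 0.5
```

ID3 would pick the split that most reduces `entropy` (information gain); CART would pick the split that most reduces `gini`.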
Every model's prediction error can be decomposed into three components: bias (systematic error from wrong assumptions — the model is too simple), variance (sensitivity to fluctuations in the training set — the model is too complex), and irreducible noise (inherent randomness in the data). This is the bias-variance tradeoff: as model complexity increases, bias decreases but variance increases. The sweet spot is where total error is minimized. K-fold cross-validation helps find this sweet spot by repeatedly splitting the data into K folds, training on K−1 folds and validating on the remaining one. After selecting the best hyperparameters, you retrain on all training data and test once on the held-out test set.
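The K-fold splitting scheme can be sketched in a few lines (index-based, and assuming n is divisible by K for simplicity):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, val_indices) pairs for K-fold cross-validation.
    Assumes n is divisible by k; each example validates exactly once."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, val

for train, val in k_fold_splits(6, 3):
    print(val)  # each fold serves as the validation set once
# → [0, 1] then [2, 3] then [4, 5]
```

In practice you would shuffle (or stratify) the indices first; the sketch keeps them ordered so the fold boundaries are easy to see.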
Ensemble methods combine multiple weak learners to build a stronger predictor. Bagging (bootstrap aggregating) trains multiple models on random subsets sampled with replacement, then averages their predictions. This reduces variance and is especially effective for unstable learners like decision trees (Random Forest is bagging + random feature subsets). Pasting is the same idea but samples without replacement. Boosting works sequentially: each new model focuses on the mistakes of the previous ones, gradually reducing bias. Stacking trains a meta-model (blender) to learn how to combine the outputs of several base learners optimally.
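Bagging's two ingredients, bootstrap resampling and vote aggregation, can be sketched as follows (helper names are illustrative):

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement — bagging's resampling step.
    Some examples repeat, others are left out entirely."""
    return [rng.choice(data) for _ in data]

def bagged_predict(predictions):
    """Aggregate base-learner outputs by majority vote
    (for regression you would average instead)."""
    return max(set(predictions), key=predictions.count)

rng = random.Random(0)
sample = bootstrap_sample([1, 2, 3, 4, 5], rng)
print(len(sample))                      # same size as the original dataset
print(bagged_predict(["a", "b", "a"]))  # → a (majority of three learners)
```

Switching `rng.choice` to sampling without replacement would turn this into pasting; Random Forest additionally restricts each tree to a random subset of features at every split.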
SVMs find the optimal separating hyperplane that maximizes the margin between classes. The margin is the distance between the decision boundary and the nearest data points from each class — these nearest points are called support vectors. A larger margin typically leads to better generalization. For data that isn't linearly separable, the kernel trick maps inputs into a higher-dimensional space where a linear separator can be found, without explicitly computing the transformation. Common kernels include polynomial and RBF (Gaussian). The hyperparameter C controls the tradeoff: large C penalizes violations heavily (tighter margin, risk of overfitting), small C allows more violations (wider margin, more regularization).
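A sketch of the RBF kernel and the hinge loss that underlies soft-margin SVM training; the gamma value and example points are illustrative:

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: similarity in an implicit
    infinite-dimensional feature space, never computed explicitly."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def hinge_loss(y, score):
    """Margin violation cost for label y ∈ {-1, +1}:
    zero once y·score ≥ 1 (safely outside the margin)."""
    return max(0.0, 1 - y * score)

print(rbf_kernel((0, 0), (0, 0)))  # → 1.0 (identical points)
print(hinge_loss(+1, 2.0))         # → 0.0 (correct, outside the margin)
print(hinge_loss(+1, 0.5))         # → 0.5 (correct but inside the margin)
```

The hyperparameter C from the text scales how heavily these hinge violations count against the margin-maximization objective.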
Bayes' theorem is the foundation: it lets us update our belief about a hypothesis after observing evidence. P(h|D) is the posterior probability of hypothesis h given data D; P(D|h) is the likelihood; P(h) is the prior. The MAP (Maximum A Posteriori) approach picks the single hypothesis with the highest posterior probability. The Bayes optimal classifier goes further — it considers all hypotheses, weighted by their posterior probabilities, making it the best possible classifier (but often intractable). Naive Bayes makes the simplifying assumption that all features are conditionally independent given the class, which makes computation tractable and works surprisingly well in practice. The Gibbs algorithm randomly selects a hypothesis according to the posterior distribution; its expected error is at most twice that of the Bayes optimal classifier.
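A toy naive Bayes posterior computation, assuming conditional independence of two word features; all the probabilities below are made up for illustration, not course data:

```python
# P(h|D) ∝ P(D|h)·P(h): multiply prior by each feature likelihood
# (the "naive" conditional-independence step), then normalize.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {  # P(word present | class)
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham":  {"offer": 0.1, "meeting": 0.6},
}

def posterior(words):
    scores = {c: priors[c] for c in priors}
    for c in scores:
        for w in words:
            scores[c] *= likelihoods[c][w]   # naive independence assumption
    total = sum(scores.values())             # evidence P(D)
    return {c: s / total for c, s in scores.items()}

post = posterior(["offer"])
print(round(post["spam"], 3))  # 0.28 / (0.28 + 0.06) ≈ 0.824
```

This is the MAP idea in miniature: the class with the larger posterior (here "spam") wins; the normalization by the evidence doesn't change which class that is.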
Neural networks are composed of layers of interconnected nodes (neurons). The simplest form is the perceptron — a single-layer model that can only learn linearly separable patterns (it famously fails on XOR). Adding hidden layers creates a multilayer perceptron (MLP), which can represent much more complex functions. The universal approximation theorem says that with enough neurons in a single hidden layer, an MLP can approximate any bounded continuous function to arbitrary accuracy; any Boolean function can be represented exactly with a single hidden layer, and arbitrary functions can be approximated with two hidden layers.
Training uses backpropagation, which efficiently computes gradients by applying the chain rule backward through the network. The sigmoid activation σ(z) = 1/(1+e⁻ᶻ) was historically popular because it's smooth and differentiable, with the elegant derivative σ′(z) = σ(z)(1−σ(z)). For binary classification output layers, sigmoid is typical; for multiclass, softmax is used to produce a probability distribution. Regression networks and classification networks use different loss functions (e.g., MSE for regression, cross-entropy for classification). Small initial weights are preferred with sigmoid to avoid saturating in flat regions where gradients vanish.
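The derivative identity σ′(z) = σ(z)(1−σ(z)) can be checked numerically, a useful sanity test when implementing backpropagation by hand:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

z = 0.3
s = sigmoid(z)
analytic = s * (1 - s)  # the closed-form derivative σ(z)(1 − σ(z))

# Central finite difference as an independent numerical check
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(abs(analytic - numeric) < 1e-8)  # → True: the identity holds
```

This kind of gradient check is the standard way to verify a hand-written backprop implementation before trusting it on a full network.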