While practical methods for data science and machine learning are advancing rapidly, their theoretical foundations often lag behind; MILD aims to fill this gap.
The researchers in MILD bring expertise from several disciplines. Graduate students involved in MILD will receive cross-disciplinary training and be exposed to ideas from all of the following areas.
Probability theory provides a wealth of tools that are invaluable for work in this area. Examples include concentration of measure in high dimensions, sub-Gaussian and sub-exponential random variables, random matrix theory, and uniform deviation inequalities.
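As a toy illustration of concentration of measure, Hoeffding's inequality bounds the probability that the average of bounded random variables deviates from its mean; the sketch below (function name and parameters are illustrative, not from any MILD course) checks the bound empirically for fair coin flips.

```python
import math
import random

random.seed(0)

def empirical_deviation(n, trials=2000, eps=0.1):
    """Fraction of trials in which the sample mean of n fair coin
    flips deviates from 1/2 by more than eps."""
    count = 0
    for _ in range(trials):
        mean = sum(random.random() < 0.5 for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            count += 1
    return count / trials

# Hoeffding's inequality: P(|mean - 1/2| > eps) <= 2*exp(-2*n*eps^2),
# so the deviation probability decays rapidly as n grows.
for n in (10, 100, 1000):
    bound = min(2 * math.exp(-2 * n * 0.1 ** 2), 1.0)
    print(n, empirical_deviation(n), bound)
```

Even this small simulation shows the qualitative story: with 10 flips large deviations are common, while with 1000 flips they essentially never occur.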
Information theory is a discipline that studies how to remove redundancy and represent information in the most efficient manner (compression), and how to add redundancy and convey information reliably through a noisy medium (error correction). The answers to these questions require fundamental information measures, such as entropy, that capture the amount of uncertainty in a probabilistic model. These techniques are used for making decisions under uncertainty, choosing informative features in machine learning, and comparing probabilistic models.
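For a concrete feel of the entropy measure mentioned above, here is a minimal sketch (the function name is ours) computing the Shannon entropy of a discrete distribution: the more predictable the outcome, the fewer bits of uncertainty it carries.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution,
    H(p) = -sum_i p_i * log2(p_i), with the convention 0*log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))        # a fair coin: one full bit of uncertainty
print(entropy([0.9, 0.1]))        # a biased coin is more predictable, so lower entropy
print(entropy([0.25] * 4))        # a uniform 4-way choice: two bits
```

Entropy also lower-bounds lossless compression: a source with H bits of entropy per symbol cannot be compressed below H bits per symbol on average.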
Differential privacy provides provable privacy guarantees for individuals whose data is used to train machine learning models. It has been applied across a variety of ML tasks, including clustering, classification, and generative modeling.
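The simplest differentially private primitive is the Laplace mechanism: add Laplace noise scaled to a query's sensitivity before releasing its answer. The sketch below (function names are illustrative) privatizes a counting query, whose sensitivity is 1 because one individual can change the count by at most 1.

```python
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two iid exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(true_count, epsilon, rng=random):
    """Release a count with epsilon-differential privacy via the Laplace
    mechanism; a counting query has L1 sensitivity 1, so the noise scale
    is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng=rng)

rng = random.Random(0)
# Weaker privacy (larger epsilon) means less noise; stronger privacy means more.
for eps in (10.0, 1.0, 0.1):
    print(eps, private_count(1000, eps, rng=rng))
```

Individual releases are noisy, but the noise has mean zero, so aggregate statistics remain accurate; this accuracy-privacy trade-off is exactly what the theory quantifies.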
Algorithm design is a discipline that studies how to efficiently analyze and manipulate data on a computer. These techniques are crucial for processing modern large-scale data sets, particularly in machine learning. Modern algorithms are frequently randomized, so tools from probability theory are often essential here as well.
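A classic example of a randomized algorithm for large-scale data is reservoir sampling (Algorithm R), which draws a uniform sample from a stream of unknown length in a single pass using only O(k) memory; the sketch below is a standard textbook version, not MILD-specific code.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items drawn uniformly at random from a stream of
    unknown length, in one pass and O(k) memory (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)        # keep item i with prob k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

rng = random.Random(0)
print(reservoir_sample(range(10 ** 6), 5, rng=rng))
```

The probabilistic analysis (each item survives with probability exactly k/n) is a small instance of the interplay between algorithm design and probability theory described above.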
Learning theory gives a theoretical foundation to many of the ideas in machine learning. Examples include measuring the complexity of hypothesis classes, proving sample-complexity bounds for learning algorithms, and analyzing online learning methods.
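For intuition on sample-complexity bounds, the standard Hoeffding-plus-union-bound argument for a finite hypothesis class H says that n ≥ ln(2|H|/δ) / (2ε²) samples suffice for every hypothesis's empirical risk to be within ε of its true risk with probability at least 1−δ. This one-liner (the function name is ours) just evaluates that bound.

```python
import math

def sample_complexity(h_size, eps, delta):
    """Hoeffding + union bound sample-complexity bound for a finite
    hypothesis class of size h_size: with n >= ln(2*h_size/delta)/(2*eps^2)
    samples, all empirical risks are eps-accurate with prob. >= 1 - delta."""
    return math.ceil(math.log(2 * h_size / delta) / (2 * eps ** 2))

# The bound grows only logarithmically in the size of the hypothesis class.
for h in (10, 1000, 10 ** 6):
    print(h, sample_complexity(h, 0.05, 0.01))
```

Note the logarithmic dependence on |H|: doubling the hypothesis class costs only a constant number of extra samples, which is why uniform-convergence arguments scale to rich model classes.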
Mathematical statistics applies probabilistic techniques to provide rigorous backing for various statistical methods. Examples include optimality of model selection, consistency of estimation, tight confidence sets for unknown parameters, and generally analyzing high-dimensional data.
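As a minimal sketch of a rigorous confidence set (names and parameters are illustrative), Hoeffding's inequality yields a distribution-free (1−δ) confidence interval for the mean of i.i.d. samples bounded in [0, b], with no normality assumption.

```python
import math
import random

def mean_confidence_interval(samples, delta=0.05, bound=1.0):
    """Hoeffding-based (1 - delta) confidence interval for the mean of
    i.i.d. samples taking values in [0, bound]; the half-width shrinks
    at the rate O(1/sqrt(n))."""
    n = len(samples)
    mean = sum(samples) / n
    half_width = bound * math.sqrt(math.log(2 / delta) / (2 * n))
    return mean - half_width, mean + half_width

rng = random.Random(0)
data = [rng.random() for _ in range(10000)]   # uniform on [0, 1], true mean 0.5
lo, hi = mean_confidence_interval(data)
print(lo, hi)   # an interval around the sample mean that covers 0.5
```

Such finite-sample, assumption-light guarantees, as opposed to purely asymptotic ones, are typical of the high-dimensional statistics mentioned above.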