Apache MADlib

Machine Learning at Scale

Powerful open source library of scalable in-database algorithms for machine learning

Data is at the center of digital transformation: using data to drive action is how transformation happens. It is therefore important to extract patterns from data efficiently in order to identify the insights and actions needed. Machine learning algorithms allow organizations not only to identify patterns and trends in their datasets but also to make high-value predictions that can guide better decisions and smart actions in near real time, without human intervention.

Apache™ MADlib® is an open source library for scalable in-database analytics. It provides data-parallel implementations of machine learning, mathematical, statistical, and graph methods on the PostgreSQL family of databases, including Pivotal Greenplum®. MADlib uses the full compute power of a massively parallel processing (MPP) architecture to process very large datasets, whereas other products are limited by the amount of data that can be loaded into memory on a single node. MADlib algorithms are invoked from a familiar SQL interface, so they are easy to use.

Massively Parallel Machine Learning

The methods in MADlib are designed to run on shared-nothing, “scale-out” MPP architectures. This allows machine learning computations to be executed close to the data and at very high speed.

MADlib runs analytics on extremely large datasets. It differentiates itself from other analytical packages by not limiting execution to memory-only structures on a single computing node. MADlib users can add more nodes as data scales. Training on all of the data rather than a sample can improve model accuracy.
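
To illustrate how a computation can run close to the data on a shared-nothing system, the following is a minimal Python sketch, not MADlib code. It assumes each segment (node) holds a slice of the rows and folds them into a small state of running sums, and that only those states are merged to produce a simple linear regression fit. The names `State`, `transition`, `merge`, `final`, and `fit_on_segments` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class State:
    """Running sums needed to fit y = slope * x + intercept."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def transition(state, row):
    """Fold one (x, y) row into a segment-local state."""
    x, y = row
    return State(state.n + 1, state.sx + x, state.sy + y,
                 state.sxx + x * x, state.sxy + x * y)


def merge(a, b):
    """Combine the states produced independently by two segments."""
    return State(a.n + b.n, a.sx + b.sx, a.sy + b.sy,
                 a.sxx + b.sxx, a.sxy + b.sxy)


def final(state):
    """Turn the merged state into (slope, intercept)."""
    denom = state.n * state.sxx - state.sx ** 2
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (state.n * state.sxy - state.sx * state.sy) / denom
    intercept = (state.sy - slope * state.sx) / state.n
    return slope, intercept


def fit_on_segments(segments):
    """Each inner list stands in for the rows stored on one node."""
    partials = [reduce(transition, rows, State()) for rows in segments]
    return final(reduce(merge, partials, State()))
```

Because only the fixed-size state crosses segment boundaries in this sketch, adding segments spreads the row-level work without changing the result.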

Rich Portfolio of Analytical Methods

The MADlib community has steadily added new methods in the areas of mathematics, statistics, machine learning, graph analytics, and data transformation. The current library includes a comprehensive collection of algorithms, operators, and utility functions.

Extensive Support for Popular Data Science Interfaces

PivotalR is an R wrapper that allows practitioners who know R but very little SQL to leverage the performance and scalability benefits of MADlib. It translates R model formulas into corresponding SQL statements (via MADlib), executes these statements in the database or on Hadoop, and returns summarized model output to R.
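
The idea of turning a model formula into a SQL statement can be sketched in a few lines of Python. This is an illustration only and is not how PivotalR is implemented: the helper `formula_to_linregr_sql` is hypothetical, it handles only additive formulas such as `price ~ tax + bath + size`, and it assumes the target is MADlib's `madlib.linregr_train` function with an explicit intercept term in the independent-variable array.

```python
import re

# Accept only plain SQL identifiers so user input is never spliced in raw.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check(name):
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"not a plain identifier: {name!r}")
    return name


def formula_to_linregr_sql(formula, source_table, out_table):
    """Translate 'y ~ a + b' into a madlib.linregr_train call (sketch)."""
    if "~" not in formula:
        raise ValueError("formula must look like 'y ~ a + b'")
    lhs, rhs = formula.split("~", 1)
    dependent = _check(lhs.strip())
    terms = [_check(term.strip()) for term in rhs.split("+")]
    # The leading 1 asks for an intercept term.
    independent = "ARRAY[1, " + ", ".join(terms) + "]"
    return (
        f"SELECT madlib.linregr_train('{_check(source_table)}', "
        f"'{_check(out_table)}', '{dependent}', '{independent}');"
    )
```

The returned string would then be sent to the database by whatever client the caller uses; the table names passed in are placeholders supplied by the caller.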

MADlib Methods

Supervised Learning
Regression Models
  • Cox Proportional Hazards Regression
  • Elastic Net Regularization
  • Generalized Linear Models
  • Logistic Regression
  • Marginal Effects
  • Multinomial Regression
  • Ordinal Regression
  • Robust Variance, Clustered Variance
  • Support Vector Machines
  • Linear Regression
Tree Methods
  • Decision Tree
  • Random Forest
Other Methods
  • Conditional Random Field
  • Naive Bayes
  • Neural Networks
Time Series
Model Evaluation
  • Cross Validation
  • Prediction Metrics
  • Train-Test Split
Nearest Neighbors
  • k-Nearest Neighbors
Unsupervised Learning
  • Association Rules (Apriori)
  • Clustering (K-means)
  • Topic Modeling (LDA)
Graph
  • All Pairs Shortest Path (APSP)
  • Breadth-First Search
  • Hyperlink-Induced Topic Search (HITS)
  • Average Path Length
  • Closeness Centrality
  • Graph Diameter
  • In-Out Degree
  • PageRank
  • Single Source Shortest Path (SSSP)
  • Weakly Connected Component
Other Modules
  • Conjugate Gradient
  • Linear Solvers
    • Dense Linear Systems
    • Sparse Linear Systems
  • PMML Export
  • Random Sampling
  • Stratified Sampling
  • Balanced Sampling
  • Term Frequency for Text
  • Path Functions
  • Sessionization
Data Types and Transformations
  • Array Operations
  • Dimensionality Reduction (PCA)
  • Encoding Categorical Variables
  • Matrix Operations
  • Matrix Factorization (SVD, Low Rank)
  • Norms and Distance Functions
  • Sparse Vectors
  • Pivot
  • Stemming
Statistics
  • Cardinality Estimators
  • Correlation and Covariance
  • Summary
  • Hypothesis Tests
Other Statistics
  • Probability Functions

MADlib Architecture

  • User Interface (SQL)
  • High-level Abstraction Layer (Python): Iteration Controller
  • RDBMS Built-in Functions
  • Functions for Inner Loops (C++): Implements Convex Optimization
  • Low-level Abstraction Layer (C++): Matrix Operations, C++ to DB Type Bridge
  • RDBMS Query Processing
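
The split between an iteration controller and inner-loop functions can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not MADlib's implementation: here both layers are plain Python, the hypothetical `segment_gradient` stands in for an inner-loop function that makes one pass over a segment's rows, and the hypothetical `iterate` stands in for a controller that repeats that pass until a convex least-squares objective for `y = w * x` stops improving.

```python
def segment_gradient(w, rows):
    """Inner loop: one pass over a segment, returning (gradient sum, row count)."""
    grad = 0.0
    for x, y in rows:
        grad += 2.0 * x * (w * x - y)  # derivative of (w*x - y)^2 w.r.t. w
    return grad, len(rows)


def iterate(segments, learning_rate=0.05, tol=1e-10, max_iter=1000):
    """Controller: repeat the data pass until the update is below tol."""
    w = 0.0
    for _ in range(max_iter):
        # Each segment computes its contribution independently.
        partials = [segment_gradient(w, rows) for rows in segments]
        grad = sum(g for g, _ in partials)
        n = sum(count for _, count in partials)
        if n == 0:
            raise ValueError("no rows to fit")
        step = learning_rate * grad / n
        w -= step
        if abs(step) < tol:
            break
    return w
```

In this sketch the controller holds only the current parameter and the stopping rule, while all row-level arithmetic lives in the inner function. The learning rate is a fixed value chosen for the example and may need tuning for other data.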

The MADlib Analytics Approach

The MADlib approach to analytics is based on the MAD acronym:

  • Magnetic: Designed to draw different types of data sources and data scientists to a single environment where best practices on analytics can be shared
  • Agile: Built for fast, exploratory and iterative analytics where lightweight modeling is possible and integration of new data is extremely easy
  • Deep: An environment where advanced machine learning and statistical algorithms are supported