Apache MADlib (incubating)

Machine Learning at Scale

Powerful open source library of scalable in-database algorithms for machine learning

Machine learning is not new, but with the recent explosion in data volumes and sources it has become a critical component of big data analytics. Machine learning algorithms allow organizations not only to identify patterns and trends in their large datasets but also to make high-value predictions that can guide better decisions and smart actions in near real time and without human intervention. Machine learning analytical packages have evolved as external platforms, often running outside of large data repositories such as MPP data warehouses or Hadoop systems in production. Customers looking for massively parallel processing, analytical speed, proximity to data and easy-to-use interfaces require a new approach to machine learning.

Apache™ MADlib® is an open source library for scalable in-database analytics. It provides data-parallel implementations of machine learning, mathematical and statistical methods on Pivotal Greenplum™, PostgreSQL and the Apache™ HAWQ® (incubating) Hadoop Native SQL platform. MADlib uses the MPP architecture’s full compute power to process very large data sets, whereas other products are limited by the amount of data that can be loaded into memory on a single node. MADlib algorithms are invoked from a familiar SQL interface, so they are easy to use.


Massively Parallel Machine Learning

The methods in MADlib are designed to run on shared-nothing, “scale-out” MPP architectures. This allows machine learning computations to be executed close to the data and at very high speed.
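
As a rough illustration of why this kind of computation scales out, the sketch below fits a straight line by having each segment summarize only its own rows and then merging the small summaries. It is plain Python written for this document, not MADlib code; the function names (transition, merge, final, fit_line) and the sample data are assumptions chosen for the example.

```python
from functools import reduce

# Hypothetical segment-local state: (n, sum_x, sum_y, sum_xx, sum_xy).
EMPTY = (0, 0.0, 0.0, 0.0, 0.0)

def transition(state, row):
    """Fold one (x, y) row into a segment-local state."""
    n, sx, sy, sxx, sxy = state
    x, y = row
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)

def merge(a, b):
    """Combine two segment states; plain addition, so order does not matter."""
    return tuple(u + v for u, v in zip(a, b))

def final(state):
    """Turn the merged sufficient statistics into (slope, intercept)."""
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

def fit_line(segments):
    # Each segment scans only its own rows (this is the step that would
    # run in parallel on a shared-nothing system).
    states = [reduce(transition, rows, EMPTY) for rows in segments]
    # Only the five-number states are combined, never the raw rows.
    return final(reduce(merge, states, EMPTY))

# Made-up rows lying on y = 2x + 1, spread over three "segments".
segments = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
slope, intercept = fit_line(segments)
```

Because the merge step is just addition, adding more segments changes how the rows are divided but not the fitted result.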


Sophisticated In-Database/On-Hadoop Analytics

MADlib's core functionality is exposed through declarative SQL statements. The technology occupies a unique position in data science and machine learning by supporting diverse SQL engines and data stores. MADlib currently supports Pivotal Greenplum, PostgreSQL and Apache HAWQ.


Scalable Data and Accurate Insights

MADlib runs analytics on extremely large datasets. It differentiates itself from other analytical packages by not limiting execution to memory-only structures on a single computing node. MADlib users can add more nodes as data scales. Using all of the data rather than a sample can improve accuracy.


Rich Portfolio of Analytical Methods

The MADlib community has steadily added new methods in the areas of mathematics, statistics, machine learning and data transformation. The current library includes over 30 principal algorithms as well as many additional operators and utility functions.


Extensive Support for Popular Data Science Interfaces

PivotalR is an R wrapper that allows practitioners who know R but very little SQL to leverage the performance and scalability benefits of MADlib. It translates R model formulas into corresponding SQL statements (via MADlib), executes these statements in the database or on Hadoop and returns summarized model output to R.
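
To make the idea of translating a model formula into SQL concrete, here is a minimal sketch in Python. It is not PivotalR's translation logic: the helper formula_to_linregr_sql is hypothetical, it handles only simple "y ~ a + b" formulas, and the table and column names are made up. The generated statement calls madlib.linregr_train, MADlib's linear regression training function, with the source table, output table, dependent variable and an array expression of independent variables.

```python
def formula_to_linregr_sql(formula, source_table, out_table):
    """Build a MADlib linear regression call from a formula like 'y ~ a + b'.

    Illustrative only: identifiers are not validated or escaped, so this
    should not be fed untrusted input.
    """
    response, _, rhs = formula.partition("~")
    response = response.strip()
    terms = [t.strip() for t in rhs.split("+") if t.strip()]
    if not response or not terms:
        raise ValueError("expected a formula of the form 'y ~ a + b'")
    # The leading 1 adds an intercept term to the independent variables.
    predictors = "ARRAY[" + ", ".join(["1"] + terms) + "]"
    return "SELECT madlib.linregr_train('{}', '{}', '{}', '{}');".format(
        source_table, out_table, response, predictors
    )

# Hypothetical table 'houses' with columns price, tax, bath and size.
sql = formula_to_linregr_sql(
    "price ~ tax + bath + size", "houses", "houses_linregr"
)
```

The resulting string would be sent to the database, where the model is fitted next to the data; only the summarized model output needs to travel back to the client.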


MADlib Methods


Predictive Analytics Library

Supervised Learning
Regression Models
  • Cox Proportional Hazards Regression
  • Elastic Net Regularization
  • Generalized Linear Models
  • Logistic Regression
  • Marginal Effects
  • Multinomial Regression
  • Ordinal Regression
  • Robust Variance, Clustered Variance
  • Support Vector Machines
Tree Methods
  • Decision Tree
  • Random Forest
Other Methods
  • Conditional Random Field
  • Naive Bayes
Unsupervised Learning
  • Association Rules (Apriori)
  • Clustering (K-means)
  • Topic Modeling (LDA)
Time Series
  • ARIMA
Model Evaluation
  • Cross Validation
Other Modules
  • Conjugate Gradient
  • Linear Solvers
  • PMML Export
  • Random Sampling
  • Term Frequency for Text
Data Types and Transformations
  • Array Operations
  • Dimensionality Reduction (PCA)
  • Encoding Categorical Variables
  • Matrix Operations
  • Matrix Factorization (SVD, Low Rank)
  • Norms and Distance Functions
  • Sparse Vectors
Statistics
Descriptive
  • Cardinality Estimators
  • Correlation
  • Summary
Inferential
  • Hypothesis Tests
Other Statistics
  • Probability Functions


MADlib Architecture


  • User Interface (SQL)
  • High-level Abstraction Layer (Python): Iteration Controller
  • RDBMS Built-in Functions
  • Functions for Inner Loops (C++): Implements Convex Optimization
  • Low-level Abstraction Layer (C++): Matrix Operations, C++ to DB Type Bridge
  • RDBMS Query Processing
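
The division of labor between an iteration controller and inner-loop functions can be sketched roughly as follows. This is a toy in plain Python, not MADlib's implementation: segment_gradient stands in for an inner-loop function that makes one pass over a segment's rows, iterate stands in for the controller that decides when to stop, and the model (a least-squares fit of y = w * x by gradient descent, a simple convex problem) and the data are assumptions made for the example.

```python
def segment_gradient(rows, w):
    """Inner loop: one pass over a segment's rows for the current w.

    Returns the segment's summed gradient of 0.5 * (w*x - y)**2 and its
    row count.
    """
    g = 0.0
    for x, y in rows:
        g += (w * x - y) * x
    return g, len(rows)

def iterate(segments, step=0.05, tol=1e-10, max_iter=1000):
    """Controller: repeat full passes over the data until w stops moving."""
    w = 0.0
    for iteration in range(1, max_iter + 1):
        # One pass over all segments; each returns a small partial result.
        partials = [segment_gradient(rows, w) for rows in segments]
        grad = sum(g for g, _ in partials)
        n = sum(count for _, count in partials)
        new_w = w - step * grad / n
        if abs(new_w - w) < tol:
            return new_w, iteration
        w = new_w
    return w, max_iter

# Made-up rows lying on y = 3x, spread over three "segments".
segments = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)], [(4.0, 12.0)]]
w, iterations = iterate(segments)
```

The controller only ever sees the per-segment partial results, so the expensive row-by-row work stays in the inner loop.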


The MADlib Analytics Approach

The MADlib approach to analytics is based on the MAD acronym:

  • Magnetic: Designed to draw different types of data sources and data scientists to a single environment where best practices in analytics can be shared
  • Agile: Built for fast, exploratory and iterative analytics where lightweight modeling is possible and integration of new data is extremely easy
  • Deep: An environment where advanced machine learning and statistical algorithms are supported