Powerful open source library of scalable in-database algorithms for machine learning
Data is at the center of digital transformation: using data to drive action is how transformation happens. It is therefore important to extract patterns from data efficiently in order to identify the insights and actions needed. Machine learning algorithms allow organizations not only to identify patterns and trends in their datasets but also to make high-value predictions that can guide better decisions and smart actions in near real time, without human intervention.
Apache™ MADlib® is an open source library for scalable in-database analytics. It provides data-parallel implementations of machine learning, mathematical, statistical, and graph methods on the PostgreSQL family of databases, including Pivotal Greenplum®. MADlib uses the MPP architecture’s full compute power to process very large datasets, whereas other products are limited by the amount of data that can be loaded into memory on a single node. MADlib algorithms are invoked from a familiar SQL interface so they are easy to use.
Massively Parallel Machine Learning
The methods in MADlib are designed to run on shared-nothing, “scale-out” MPP architectures. This allows machine learning computations to be executed close to the data and at very high speed.
MADlib runs analytics on extremely large datasets. It differentiates itself from other analytical packages by not limiting execution to memory-only structures on a single computing node. MADlib users can add more nodes as data scales. Training on the full dataset rather than a sample can improve model accuracy.
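One way to picture the shared-nothing model described above is as a three-step aggregate: each node folds its local rows into a small running state, the per-node states are merged, and a final step turns the merged state into a model. The Python sketch below illustrates that pattern for a one-variable linear regression. It is an illustration only, not MADlib's code; the names `State`, `transition`, `merge`, `final`, and `fit_on_segments` are hypothetical, and the in-memory lists stand in for rows held on separate segments.

```python
from dataclasses import dataclass


@dataclass
class State:
    """Running sums needed for simple least squares (hypothetical name)."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def transition(state, x, y):
    # Fold one local row into the running state; runs where the row lives.
    return State(state.n + 1, state.sx + x, state.sy + y,
                 state.sxx + x * x, state.sxy + x * y)


def merge(a, b):
    # Combine two partial states; only these five numbers cross the network.
    return State(a.n + b.n, a.sx + b.sx, a.sy + b.sy,
                 a.sxx + b.sxx, a.sxy + b.sxy)


def final(state):
    # Turn the merged sums into (slope, intercept) for y = slope * x + intercept.
    denom = state.n * state.sxx - state.sx ** 2
    slope = (state.n * state.sxy - state.sx * state.sy) / denom
    intercept = (state.sy - slope * state.sx) / state.n
    return slope, intercept


def fit_on_segments(segments):
    # Each inner list plays the role of one segment's local data.
    partials = []
    for rows in segments:
        state = State()
        for x, y in rows:
            state = transition(state, x, y)
        partials.append(state)
    total = State()
    for partial in partials:
        total = merge(total, partial)
    return final(total)


# Three "segments" holding points on the line y = 2x + 1.
segments = [[(0, 1), (1, 3)], [(2, 5)], [(3, 7), (4, 9)]]
print(fit_on_segments(segments))  # (2.0, 1.0)
```

Because the merged state has a fixed size regardless of row count, adding segments spreads the row-level work without increasing the amount of data that has to be moved between nodes.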
Rich Portfolio of Analytical Methods
The MADlib community has steadily added new methods in the areas of mathematics, statistics, machine learning, graph analytics, and data transformation. The current library includes a comprehensive collection of algorithms, operators, and utility functions.
Extensive Support for Popular Data Science Interfaces
PivotalR is an R wrapper that allows practitioners who know R but very little SQL to leverage the performance and scalability benefits of MADlib. It translates R model formulas into corresponding SQL statements (via MADlib), executes these statements in the database or on Hadoop, and returns summarized model output to R.
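To make the formula-to-SQL idea concrete, here is a minimal Python sketch (Python rather than R, purely for illustration) that turns a simple additive formula such as `price ~ size + bedrooms` into a call to MADlib's `linregr_train` function. The helper name `formula_to_sql` and the table and column names are hypothetical, and the sketch handles only plain `+` terms and unqualified table names; PivotalR's actual translation covers far more than this.

```python
import re

# Accept only plain SQL identifiers so nothing unexpected reaches the query text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_ident(name):
    if not _IDENT.match(name):
        raise ValueError(f"not a plain identifier: {name!r}")
    return name


def formula_to_sql(formula, source_table, out_table):
    """Translate 'y ~ x1 + x2' into a madlib.linregr_train call (hypothetical helper)."""
    lhs, sep, rhs = formula.partition("~")
    if not sep:
        raise ValueError("formula must contain '~'")
    response = _check_ident(lhs.strip())
    terms = [_check_ident(term.strip()) for term in rhs.split("+")]
    # The leading 1 adds an intercept term to the feature array.
    features = "ARRAY[1, " + ", ".join(terms) + "]"
    return (
        "SELECT madlib.linregr_train("
        f"'{_check_ident(source_table)}', "
        f"'{_check_ident(out_table)}', "
        f"'{response}', "
        f"'{features}');"
    )


print(formula_to_sql("price ~ size + bedrooms", "houses", "houses_model"))
# SELECT madlib.linregr_train('houses', 'houses_model', 'price', 'ARRAY[1, size, bedrooms]');
```

The generated statement would then be sent to the database, where the training runs next to the data and only the fitted model summary comes back to the client.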
Graph Processing on Greenplum Database using Apache MADlib
Data Science Reveals Extraordinary Insights into Drivers and Their Behavior
Video: The MADlib Project: SQL Toolkit for Large Scale Predictive Analytics
Video: The Evolution of MADlib
Pivotal Data Science Transport Demo
White Paper: MAD Skills: New Analysis Practices for Big Data
White Paper: The MADlib Analytics Library
Machine Learning on Greenplum with Apache MADlib
MADlib Methods
 Cox Proportional Hazards Regression
 Elastic Net Regularization
 Generalized Linear Models
 Logistic Regression
 Marginal Effects
 Multinomial Regression
 Ordinal Regression
 Robust Variance, Clustered Variance
 Support Vector Machines
 Linear Regression
 Decision Tree
 Random Forest
 Conditional Random Field
 Naive Bayes
 Neural Networks
 ARIMA
 Cross Validation
 Prediction Metrics
 Train-Test Split
 k-Nearest Neighbors
 Association Rules (Apriori)
 Clustering (k-means)
 Topic Modeling (LDA)
 All Pairs Shortest Path (APSP)
 Breadth-First Search
 Hyperlink-Induced Topic Search (HITS)
 Average Path Length
 Closeness Centrality
 Graph Diameter
 In-Out Degree
 PageRank
 Single Source Shortest Path (SSSP)
 Weakly Connected Components
 Conjugate Gradient

Linear Solvers
 Dense Linear Systems
 Sparse Linear Systems
 PMML Export
 Random Sampling
 Stratified Sampling
 Balanced Sampling
 Term Frequency for Text
 Path Functions
 Sessionization
 Array Operations
 Dimensionality Reduction (PCA)
 Encoding Categorical Variables
 Matrix Operations
 Matrix Factorization (SVD, Low Rank)
 Norms and Distance Functions
 Sparse Vectors
 Pivot
 Stemming
 Cardinality Estimators
 Correlation and Covariance
 Summary
 Hypothesis Tests
 Probability Functions
MADlib Architecture
The MADlib Analytics Approach
The MADlib approach to analytics is based on the MAD acronym:
 Magnetic: Designed to draw different types of data sources and data scientists to a single environment where best practices in analytics can be shared
 Agile: Built for fast, exploratory, and iterative analytics where lightweight modeling is possible and new data is easy to integrate
 Deep: An environment where advanced machine learning and statistical algorithms are supported