Powerful open source library of scalable in-database algorithms for machine learning
Machine learning is not new, but with the recent explosion of data volumes and sources, it has become a critical component of data analytics. Machine learning algorithms allow organizations not only to identify patterns and trends in their datasets but also to make high-value predictions that can guide better decisions and smart actions in near real time and without human intervention. Machine learning analytical packages have evolved as external platforms, often running outside of large data repositories such as MPP data warehouses or Hadoop systems in production. Customers looking for massively parallel processing, analytical speed, proximity to data and easy-to-use interfaces require a new approach to machine learning.
Apache™ MADlib® is an open source library for scalable in-database analytics. It provides data-parallel implementations of machine learning, mathematical and statistical methods on Pivotal Greenplum™, PostgreSQL and the Apache™ HAWQ® (incubating) Hadoop Native SQL platform. MADlib uses the MPP architecture’s full compute power to process very large data sets, whereas other products are limited by the amount of data that can be loaded into memory on a single node. MADlib algorithms are invoked from a familiar SQL interface, so they are easy to use.
Massively Parallel Machine Learning
The methods in MADlib are designed to run on shared-nothing, “scale-out” MPP architectures. This allows machine learning computations to be executed close to the data and at very high speed.
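To make the shared-nothing idea concrete, here is a minimal Python sketch of a data-parallel least-squares fit. It is an illustration only, not MADlib's source code: the function names `transition`, `merge`, `final` and `fit` are assumptions chosen for this example, and the "segments" are plain Python lists standing in for the nodes of an MPP cluster. Each segment reduces its own rows to a small summary, and only those summaries are combined.

```python
from functools import reduce

# Per-segment state: (row count, sum x, sum y, sum x*x, sum x*y).
EMPTY = (0, 0.0, 0.0, 0.0, 0.0)

def transition(state, row):
    """Fold one (x, y) row into a segment-local state."""
    n, sx, sy, sxx, sxy = state
    x, y = row
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)

def merge(a, b):
    """Combine two partial states; no raw rows are exchanged."""
    return tuple(u + v for u, v in zip(a, b))

def final(state):
    """Turn the combined sums into (slope, intercept) for y = slope*x + intercept."""
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

def fit(segments):
    """Reduce each segment independently, then merge the small partial states."""
    partials = [reduce(transition, rows, EMPTY) for rows in segments]
    return final(reduce(merge, partials, EMPTY))

# Hypothetical data: y = 2x + 1, spread round-robin over three "segments".
rows = [(float(x), 2.0 * x + 1.0) for x in range(12)]
segments = [rows[i::3] for i in range(3)]
slope, intercept = fit(segments)
```

Because the merged state has a fixed size regardless of the number of rows, the same pattern works whether the data sits on one node or many, which is the property the paragraph above relies on.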
Sophisticated In-Database/On-Hadoop Analytics
MADlib core functionality is written in declarative SQL statements. The technology occupies a unique position in data science and machine learning by supporting diverse SQL engines and data stores. MADlib currently supports Pivotal Greenplum, PostgreSQL and Apache HAWQ.
Scalable Data and Accurate Insights
MADlib runs analytics on extremely large datasets. It differentiates itself from other analytical packages by not limiting execution to memory-only structures on a single computing node. MADlib users can add more nodes as data scales. Using all of the data rather than a sample can improve model accuracy.
Rich Portfolio of Analytical Methods
The MADlib community has steadily added new methods in the areas of mathematics, statistics, machine learning and data transformation. The current library includes over 30 principal algorithms as well as many additional operators and utility functions.
Extensive Support for Popular Data Science Interfaces
PivotalR is an R wrapper that allows practitioners who know R but very little SQL to leverage the performance and scalability benefits of MADlib. It translates R model formulas into corresponding SQL statements (via MADlib), executes these statements in the database or on Hadoop and returns summarized model output to R.
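The formula-to-SQL idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not PivotalR's actual translation logic: the helper `formula_to_madlib_sql` is hypothetical, it handles only simple additive formulas such as `y ~ a + b`, and it does no identifier quoting or validation. The generated statement calls MADlib's `linregr_train` with its four leading arguments (source table, output table, dependent variable, independent-variable expression); the table names in the usage line are made up.

```python
def formula_to_madlib_sql(formula, source_table, out_table):
    """Turn an additive formula like 'y ~ a + b' into a MADlib linear-regression call.

    Hypothetical helper for illustration; real wrappers handle far more syntax.
    """
    response, _, rhs = formula.partition("~")
    response = response.strip()
    terms = [t.strip() for t in rhs.split("+") if t.strip()]
    if not response or not terms:
        raise ValueError("expected a formula of the form 'y ~ a + b'")
    # The leading 1 supplies the intercept term as an explicit constant column.
    independent = "ARRAY[1, " + ", ".join(terms) + "]"
    return (
        "SELECT madlib.linregr_train("
        f"'{source_table}', '{out_table}', '{response}', '{independent}');"
    )

# Usage with assumed table names 'houses' and 'houses_linregr'.
sql = formula_to_madlib_sql("price ~ tax + bath + size", "houses", "houses_linregr")
print(sql)
```

The resulting string would be sent to the database by whatever client the wrapper uses; the model is then fitted inside the database and only the summary table is read back.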
- Cox Proportional Hazards Regression
- Elastic Net Regularization
- Generalized Linear Models
- Logistic Regression
- Marginal Effects
- Multinomial Regression
- Ordinal Regression
- Robust Variance, Clustered Variance
- Support Vector Machines
- Decision Tree
- Random Forest
- Conditional Random Field
- Naive Bayes
- Association Rules (Apriori)
- Clustering (K-means)
- Topic Modeling (LDA)
- Cross Validation
- Conjugate Gradient
- Linear Solvers
- PMML Export
- Random Sampling
- Term Frequency for Text
- Array Operations
- Dimensionality Reduction (PCA)
- Encoding Categorical Variables
- Matrix Operations
- Matrix Factorization (SVD, Low Rank)
- Norms and Distance Functions
- Sparse Vectors
- Cardinality Estimators
- Hypothesis Tests
- Probability Functions
The MADlib Analytics Approach
The MADlib approach to analytics is based on the MAD acronym:
- Magnetic: Designed to draw different types of data sources and data scientists to a single environment where best practices in analytics can be shared
- Agile: Built for fast, exploratory and iterative analytics where lightweight modeling is possible and integration of new data is extremely easy
- Deep: An environment where advanced machine learning and statistical algorithms are supported