Querying External Data Sources with Hadoop
In large enterprises, it is not uncommon to have big data in different formats and sizes, stored across different systems. Moreover, enterprises typically have a multitude of systems holding gold mines of information that can be put to use for strategic insights. Linking these existing storage systems with HDFS can be very challenging.
Pivotal helps you leverage your existing data infrastructure investments with HDFS and begin to shift your legacy enterprise data warehouse, analytical data marts and data silos into a modern, centrally governed business data lake, where all data types are stored and accessible for on-demand analytics.
Pivotal HD is able to connect all data across multiple systems without having to move or copy the data to and from HDFS for analysis. This is possible through the Pivotal Extension Framework (PXF). PXF is an external table interface in HAWQ (a fast, scalable, production-grade, 100% SQL compliant query engine on HDFS) that allows you to read and query data stored both inside and outside of the Hadoop ecosystem – HDFS, Hive, HBase, etc. – while supporting a wide range of data formats such as Text, Avro, RCFile and many more. PXF also delivers a fast, extensible framework by exposing parallel APIs to connect HAWQ with additional data sources, namely GemFire XD, JSON, Accumulo and Cassandra.
Join us for this technical preview on how to add extensibility into Hadoop:
- Eliminate the need to copy data from the underlying storage system to HDFS
- Leverage rich, deep and fast analytics on Hadoop data files of various kinds
- Conduct statistical and analytical functions from HAWQ on HBase or Hive Data
- Run complex analytical queries that join in-database dimensional data with fact data stored in HBase
- Easily write your own custom PXF connectors to external data sources
- Cut down time and operational costs
Nikesh: I wanted to thank everyone for joining today's webinar series, What You Can Do with Hadoop. My name is Nikesh Shah. I'm a Principal Product Marketing Manager for Pivotal HD. Joining me today is Alon Goldshuv, who is our Engineering Manager for Pivotal Extension Framework. We're very excited to kick off this installment of our webinar series. This month's focus is on Querying External Data Sources with Hadoop.
Before we jump into today's presentation there are some housekeeping items to cover. Your phones have been placed in listen-only mode, so if you wish to submit a question please do so via the WebEx Q&A chat box to all panelists. We'll have time at the end of today's session to answer your questions, but feel free to submit them as we move along. Please note today's session will be recorded and available for replay under the resources tab on GoPivotal.com.
In today's webinar we'll cover a quick review of HAWQ, our SQL query engine on Hadoop. We'll outline some recent benchmark results as well, and we'll then follow that with a deep look at the Pivotal Extension Framework and its extensibility in connecting to other data sources. We'll actually show you a live demonstration of how to use PXF with HDFS, HBase and other external data sources outside of the Hadoop ecosystem and, as mentioned, conclude with a Q&A session.
Pivotal HD is our enterprise Hadoop distribution. As you can see in this architecture diagram we provide a lot of value add-ons to the Apache Hadoop distribution. We fundamentally believe that Pivotal HD, which again stands for Hadoop distribution, is our foundation for a business data lake type reference architecture that's able to store all your data, provide accessibility to analyze all your data and, more importantly, help organizations turn that insight into their operations through the development of applications. We really believe this because Pivotal HD provides some of the world's most advanced real time analytics. We have something called HAWQ, which again is our SQL query engine and we'll deep dive on that, but we also provide GemFire XD, which provides real time data services on top of Hadoop, so we're able to provide online transactional processing for Hadoop, which is very unique in today's market landscape.
In addition we provide a lot of advanced analytical tool sets. We'll cover MADlib in a little bit, but we also provide things like Open MPI and GraphLab for graph analytics on top of that as well. We also have tools for developers, from Spring to Spring XD, and obviously tools for the Hadoop admins and operation teams to manage their cluster environments in what we call Command Center. All of these are value add-ons that we add to Hadoop. We package it, as you can see here, with Apache Hadoop 2.2, so you see your standard HDFS, MapReduce, HBase, YARN, Oozie and additional open source tools from within the Hadoop ecosystem.
HAWQ is one of our biggest differentiators as far as it relates to our distribution and we've heard a lot of feedback from you all out there on some of the performance results and compliance with HAWQ and how that is very beneficial to many of your organizations as you guys begin to move workloads from the EDW to Hadoop. HAWQ was the first SQL compliant query engine in Hadoop when we announced it back in February 2013 and it contains some really strong capabilities such as the world's leading optimizer, robust data management capabilities and deep analytics.
A lot of that was available because we were able to leverage the 10 years of innovation in R&D that came from the Greenplum database which was an analytical data warehouse. We were able to leverage MPP architecture in a shared nothing approach where we're using parallelism and parallel loading and unloading and resource management and things like obviously SQL compliance and ODBC and JDBC compliance as well in addition to having a deep rich third party tool integration with various BI tools, ETL tools and data mining tools.
At a very high level HAWQ delivers high performance query processing and, as I mentioned, we have interactive and true ANSI SQL support. One of the biggest things I wanted to talk about is our new cost-based parallel query optimizer, which is codenamed Orca. We built this new query optimizer from the ground up and it includes a lot of new functionality to really help support big data use cases today. Some of that is also leveraged from feedback from you and our customer base. The great thing about this query optimizer is it's leveraged both in HAWQ and in our Greenplum database, our analytical data warehouse, so it's providing a lot of different capabilities in how to run and execute queries. It's a cost-based query optimizer that looks for the most efficient plans. It physically plans out the scans, the joins, sorts and aggregations to give you the optimal results as you're running your queries on top of Hadoop.
In addition to that, HAWQ is providing rich enterprise class database services and management. Everything from scatter-gather data loading to row and column storage and, as I mentioned, a lot of third-party tools and open client interfaces. We've also recently added support for Parquet, a hybrid columnar open storage format, which is delivering significant performance and scalability improvements to querying on top of Hadoop.
The other aspect of HAWQ is that it provides pre-integrated, deep, scalable analytics, so we package something called MADlib, which is an open source library of machine learning algorithms. We've recently added more of these algorithms natively into HAWQ to unlock deeper predictive insights that many data scientists are trying to uncover. In addition, we realize that not all data scientists are big users of SQL, so we've added support for user defined functions in procedural languages like R, Python and Java. This allows data scientists and analysts to leverage the full power of these additional languages within the business logic procedures that they're most comfortable with.
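As a hedged illustration of the procedural language support just described, here is a minimal sketch of a user defined function written in Python. The function and table names are invented for this example, and it assumes the PL/Python language has been enabled in the database:

```sql
-- Hypothetical PL/Python UDF; assumes the plpythonu language is installed.
CREATE OR REPLACE FUNCTION discount_price(price float8, pct float8)
RETURNS float8 AS $$
    # Plain Python runs inside the database
    return price * (1.0 - pct / 100.0)
$$ LANGUAGE plpythonu;
```

Once defined, it can be called like any built-in function, e.g. `SELECT discount_price(price, 10) FROM sales;` (where `sales` is a hypothetical table).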
We recently worked on a SIGMOD paper which is an industry-wide paper entitled ORCA: A Modular Query Optimizer Architecture for Big Data. This was recently published at the SIGMOD Conference back in June and in this paper we really outline HAWQ's leading query optimizer architecture in great technical depth and the importance of SQL compliance and we outline the results of a standard TPC-DS benchmark test.
Out of this standard benchmark test, HAWQ was able to run all 111 queries in the report without rewriting, and this is very significant because rewriting queries involves manual processes, complexity, probability of error and slower time to market. Because HAWQ is the leading SQL compliant query engine on Hadoop, it provides things like correlated subqueries, window functions, large numbers of joins including multiple fact and dimension tables, and dynamic partition elimination, and because of those kinds of rich features that you would typically find in a tool that's SQL compliant, we're able to run all the queries within this benchmark. That's really critical because, again, many organizations are moving certain workloads off the EDW to Hadoop. You want to have that same type of expectation and experience that you have in your EDW environment, and with HAWQ we're trying to deliver that same experience to organizations.
In addition, in this report we also ran some benchmark results and, as you saw on that prior graph and slide, HAWQ was able to run 80 more queries. We were only able to test 30 or so of the queries against other SQL-like query engines on Hadoop, and we can see that HAWQ delivers 6x faster performance against Impala and about 21x faster performance against Stinger. As I mentioned, we recently outlined all of this in our SIGMOD paper, which you can find on our website, and we even wrote a blog recently on the paper that goes into greater depth and detail, which can also be found on our website. I highly encourage you to go take a look at it.
In addition to the capabilities around deep scaled analytics, robust data management, SQL compliance and the leading query engine, HAWQ also includes something called the Pivotal Extension Framework, or PXF for short. This enables intelligent query integration with HDFS, HBase and Hive. It supports many common data formats such as Avro, protocol buffers and sequence files, and in addition it provides an extensible framework for connecting to other data sources such as JSON or Accumulo.
What we're trying to deliver here: you just saw some of the leading benchmark results that we published. We're able to provide that not only within the Hadoop ecosystem but outside it as well, so it's my pleasure now to turn things over to Alon Goldshuv, who's our Engineering Manager for the Pivotal Extension Framework, and he'll go into a deeper dive of PXF. Alon.
Alon: Hey Nikesh, thank you. Hello everybody. I am going to talk about the Pivotal Extension Framework. I guess we'll start with some basics. What is the PXF? What is the value that it brings to the table? How is it used? We'll go over the list of existing features. Then I'll talk about the extensibility aspect of it which is a key capability of PXF and we'll end up with a short demo and show some basic usage examples.
As you can see here, I'm starting with a really high level visualization in order to understand the PXF use cases and the reason it's really useful. At the core here we have the Pivotal HD distributed file system, and on top of it we have our high performance data processing engines, HAWQ being one of them, shown here. As Nikesh said, this is our top choice SQL engine for analytics in our product. HAWQ and Pivotal Hadoop work together to give you the ability to store and analyze huge amounts of data in a very fast and efficient way. To make it visually clear, data that is loaded through HAWQ into Pivotal Hadoop is represented here as green boxes. You can think about them as data blocks. That's just a visualization.
Now, the Pivotal Hadoop ecosystem is comprised of many other data processing engines, as Nikesh mentioned. We have GemFire XD. There's Hive. There's HBase. There are others, and also a wide range of available data storage techniques such as Avro and JSON and some others. Similarly, users of those systems will inject data onto the disk, represented here by the orange boxes. Along the right you have another system, something that is not HAWQ. It may be a Hadoop system. It could be GemFire. It has its own set of users and they inject data into the system, into the file system, and each one of them can actually operate independently.
What we end up with here is a really nice set of techniques and engines. Each can produce a lot of value on its own. It can process and work with the data that was loaded into it, but yet, technically speaking, the data is all stored on Hadoop. It is accessible to all. When we looked at that while designing Pivotal's big data suite, it became obvious that it would enable deep sharing of data between those various systems. It would really open the door for a lot of innovative capabilities, so this is really where PXF comes in.
This is how we chose to define PXF: a fast, extensible framework connecting HAWQ to a data store of choice that exposes a parallel API. So what does that mean? Fast means, in this context, parallel and highly optimized for speed. Data store of choice means that PXF enables HAWQ to connect to almost any other data store that is available to us, and extensible means that it's a pluggable framework and new functionality can be added by the customers themselves. I will talk more about that shortly.
When HAWQ uses PXF it can directly read and analyze data that comes from other systems. If before HAWQ could read only those green boxes, with the help of PXF it can also directly analyze some of those orange boxes, which is basically data coming from foreign sources that are unfamiliar to HAWQ. This can be done in two different ways. One way is to run SQL directly on all the data from the other system. It can also be done indirectly: HAWQ can use PXF to first convert the data into a HAWQ-friendly format and then run analytical queries. Both of those options work. It really depends on the user and the use case, but the flexibility is there.
This is all done at a really low level, in a single step, in parallel. There is no data staging; no pre-conversion of data formats is needed. Another key point to understand here is that, given that functionality, it opens the door for thinking differently about designing the data lake as a whole. For example, data sets that are optimized for a real time transactional system could be stored on those systems, while data that is optimized for analytical usage could be stored on the analytical systems, and still be shared. You can still join the data together. You can use HAWQ to query the data from that single source when you need to, so this is a really fresh idea. I will explain how this can be done.
Basically, in order to get HAWQ to use and analyze data from another system, there are two simple steps. The first one is to define the PXF table in HAWQ. PXF utilizes the external table mechanism in HAWQ. HAWQ has quite a few different external table types that could be used and PXF is one of them, so we will need to define that. I will show an example shortly of how this can be done. When you define a PXF table you need to pick the plugin that is right for the job. The plugin can be something that we already ship with the product, or it could be something that you write yourself and plug in.
The second thing is specifying the data source of choice, so if you want to analyze data on HDFS, maybe your data source of choice is a directory or a set of files. If you want to analyze an HBase table, or read the data from it, your data source of choice is the HBase table name. Those are just a few examples. After defining the PXF table, all that is left to do is to run SQL on that table and basically you're done. You can do it directly, or indirectly by copying the data to a HAWQ table first.
A few examples here. I just want to show the syntax to show how easy that is. For those of you who are familiar with SQL, which I assume is most of you, this is the syntax to create a PXF external table. You have the CREATE EXTERNAL TABLE. You follow it with a table name of your choice and the column list of the actual data that you expect to receive. The second row is probably the most important one. This is where you specify the location of your external data source, and here we specify the PXF protocol, so pxf://. Then we add the address of the PXF service, which is, by the way, something that is going to go away in the next release, so this will not be relevant anymore. It will be even simpler.
Then comes the name of the data source that you're interested in, and then any plugin options. The FORMAT clause specifies the data format that you expect. In most cases in PXF this will be an internal binary format, but sometimes it is also possible to receive the data as delimited text. I will show both of those examples in the demo. Lastly, it's also possible to utilize the single-row error handling mechanism of HAWQ's external tables. This is also a really cool capability.
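The general shape just described can be sketched as follows. This is a hedged reconstruction, not the exact slide: the host, port, path, profile name and column list are placeholders.

```sql
-- Sketch of a generic PXF external table definition; all names here are
-- illustrative placeholders, not values from the webinar slides.
CREATE EXTERNAL TABLE ext_example (id int, name text, price float8)
LOCATION ('pxf://pxf-host:50070/some/data/source?PROFILE=SomeProfile')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

The LOCATION string carries the PXF service address, the data source name and the plugin options; the FORMAT clause here requests PXF's internal binary format.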
Basically, this allows you to ignore and log any formatting errors that happen while the data is in transit. Occasionally, since the data is external and not in the database, it is possible that as we are streaming it into HAWQ in parallel the data has formatting issues. The same happens when we load any data set into any database. Sometimes the data has issues and we don't want to abort the whole operation. Maybe one data row, or 10 data rows out of a few million, are badly formatted, so using this clause we can just let HAWQ ignore those errors and log them aside in a special table for us so that we can later go and examine them.
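As a hedged sketch of that clause, the single-row error handling described above looks roughly like this in Greenplum-style external table syntax (table names, the error table and the reject limit are illustrative):

```sql
-- Same illustrative table, now with single-row error handling: badly
-- formatted rows are logged to sales_errors instead of aborting the query,
-- up to a limit of 100 rejected rows per segment.
CREATE EXTERNAL TABLE ext_sales (id int, vendor text, price float8)
LOCATION ('pxf://pxf-host:50070/data/sales?PROFILE=SomeProfile')
FORMAT 'TEXT' (DELIMITER ',')
LOG ERRORS INTO sales_errors SEGMENT REJECT LIMIT 100 ROWS;
```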
After we do the first step of defining the table, we can actually run SQL on it. You can basically run anything directly on the table, really any SQL besides probably updating or deleting the remote data, and you can run any analytics on it that you want.
The first example here shows a select and the second one shows an insert-into-select, so indirectly you can first select all the data, or maybe even a subset of the data or query results, store it in a HAWQ table, and then run queries directly on the HAWQ table. The reason you would want to do that: if you want to run analytics on the same data set quite a few times, you may want to first store it in HAWQ. Obviously the HAWQ internal format will be faster than any external format that has to be translated and shipped to HAWQ. Even though PXF is extremely fast, in order to skip the repeated translation for repeating queries it is also possible to load the data into HAWQ first, so it really depends on the use case.
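The direct and indirect paths just described can be sketched like this (all table and column names are illustrative):

```sql
-- Direct: query the external table in place.
SELECT vendor, avg(price) FROM ext_sales GROUP BY vendor;

-- Indirect: materialize the external data into a native HAWQ table first,
-- then run repeated analytics against the faster local copy.
CREATE TABLE local_sales AS SELECT * FROM ext_sales;
SELECT vendor, avg(price) FROM local_sales GROUP BY vendor;
```

The indirect path pays a one-time load cost in exchange for HAWQ's internal storage format on every subsequent query.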
Let's take an example of how this looks in practice. We recently finished writing a plugin for Accumulo, and defining an Accumulo data source looks like this. We give the table a name, but let's look at the second row here, which is the most important. We specify an Accumulo table name, and then you specify that the PXF profile you're interested in is Accumulo, which makes PXF aware that it's going to use the Accumulo connector. Here's how it looks in an example.
We create the table. We call it t. We specify that we want the record key from Accumulo and two different qualifiers. Accumulo has qualifiers and column families, very similar to HBase, and we specify the ones that we want. Then in the LOCATION clause we specify the instance name and the Accumulo table, which here is called sales, and then PROFILE equal to Accumulo. In the FORMAT clause we specify the PXF internal binary format as described before.
Lastly, we can run any queries on that. For example, we can do a GROUP BY as in this example and get the maximum price. Pick any columns that we want. Run ORDER BY. Join it with a HAWQ table. Just about anything like that.
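A hedged reconstruction of that Accumulo example follows. The service address, qualifier names and column types are guesses; only the table name t, the Accumulo table sales and the Accumulo profile come from the description above.

```sql
-- Illustrative reconstruction: record key plus two qualifiers from the
-- Accumulo table "sales", read through the Accumulo profile.
CREATE EXTERNAL TABLE t (recordkey text, "cf1:item" text, "cf1:price" float8)
LOCATION ('pxf://pxf-host:50070/sales?PROFILE=Accumulo')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- Then any SQL works against it, e.g. the GROUP BY with a maximum price:
SELECT "cf1:item", max("cf1:price") FROM t GROUP BY "cf1:item";
```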
Let's talk about the features. Here's a list of most of the basic features; I will talk just briefly about some more advanced ones as we go. We can start with the HBase connector. That is a connector that allows HAWQ to read data from HBase tables. One of the cool features there is that it supports filter push-down, so whenever a query against an HBase table includes a WHERE clause, we actually push that WHERE clause and all the filters down into the actual HBase scan. This way we're really speeding up the execution. This also allows us to do better planning and to understand where the parts of interest in the HBase table are, so that's a really nice feature.
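To make the filter push-down concrete, a query like the following (table and column names are illustrative) would have its predicates translated into HBase scan filters rather than being applied after a full table scan:

```sql
-- Both predicates are candidates for push-down into the HBase scan,
-- so only matching regions and rows are read.
SELECT recordkey, "cf2:price"
FROM hbase_sales
WHERE "cf2:price" > 100 AND "cf1:vendor" = 'acme';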
We also support Hive, with Hive's partition exclusion, so if you run a query with a filter on the Hive partitioning column, at run time we will understand which partitions are interesting and ignore the rest. Another cool feature in the Hive connector is that we're able to support transparently all the file storage techniques that Hive supports, so that is sequence files and text and RC and ORC, and that works automatically.
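As a sketch of partition exclusion, assuming a Hive table partitioned by a hypothetical sale_date column:

```sql
-- With partition exclusion, only the 2014-07-01 partition of the Hive
-- table is scanned; all other partitions are skipped at run time.
SELECT count(*) FROM hive_sales WHERE sale_date = '2014-07-01';
```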
For HDFS files we can analyze a lot of different types: delimited text, CSV, sequence files, Avro files, splittable and non-splittable files, and also different types of compression. We also recently added the write feature, with which you can actually export data from HAWQ into HDFS. That works for delimited text, CSV and sequence files as of today, and it also enables you to pick any compression codec that you'd like, among other options for how to write the data to disk.
We also have some other connectors, and here I list GemFire XD, JSON, Cassandra and Accumulo. They are marked as not shipped. Actually, the GemFire XD one is not beta, it's GA, so that is a mistake on the slide. The other ones are not yet shipped with the product. I will talk about them briefly soon. There is something unique about each of them.
Another really cool feature of PXF is that it supports stats collection. Basically, today we support it on HDFS files that are read through PXF. You can actually run the ANALYZE command on a PXF table that points to HDFS data and it will collect statistics on it. Today we collect really basic statistics, but in the next release we're going to collect both table-level and column-level statistics. This will be done on any type of connector transparently, so you could run stats on GemFire data, on Hive, on HBase data. Of course, the main reason to do that is to enable the HAWQ query optimizer, Orca, to have much better information about the data that is coming in, so that it will be able to plan the query better, especially when we're joining it with some other data sets. That is also very important and a really nice capability.
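As a minimal sketch, the stats collection described above is just the standard ANALYZE command pointed at a PXF table (the table name is illustrative):

```sql
-- Collect statistics on a PXF external table backed by HDFS data;
-- the gathered statistics feed the Orca optimizer's cost estimates.
ANALYZE ext_sales;
```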
We do automatic data locality optimization, so HAWQ, when it's using PXF, will try to match the various HAWQ segments, the HAWQ executors on the various nodes, with the actual data on disk. Whenever we can use locality, we use it, so that reduces network traffic and in many cases also improves response time.
We support Kerberized Hadoop as well.
We support high availability in Hadoop, so if, let's say, a NameNode goes down, that doesn't affect our execution, and also if the PXF service goes down it doesn't affect execution.
Lastly, and probably most importantly, extensibility. We'll talk about extensibility in a second, but first a simple example of how a writable PXF table would look.
You define it very similarly. The syntax is almost the same. The main difference is that here we specify a destination, which is an HDFS directory that we want to export the HAWQ data into. We specify a profile, which is HDFS text in this example, because we want the data written as delimited text, and in this example we also specify a compression codec, as we want the data to be compressed with the gzip codec. The FORMAT clause likewise specifies text format and the delimiter of choice that we want to use. There are other formatting options that are not shown here; this one is the significant part. Then you can just run a query in HAWQ, insert the data into the table, and the data will get exported in parallel and very quickly.
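A hedged sketch of that writable table follows. The host, directory, profile name and codec string are illustrative guesses at the slide's contents:

```sql
-- Illustrative writable PXF table: exports HAWQ data to an HDFS directory
-- as gzip-compressed, pipe-delimited text.
CREATE WRITABLE EXTERNAL TABLE ext_out (id int, vendor text, price float8)
LOCATION ('pxf://pxf-host:50070/exported/sales?PROFILE=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
FORMAT 'TEXT' (DELIMITER '|');

-- Exporting is then a plain INSERT, executed in parallel by the segments.
INSERT INTO ext_out SELECT * FROM local_sales;
```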
Now the extensibility feature. As I said, this is a really key feature. Basically, every connector in PXF is comprised of three different plugins: fragmenter, accessor and resolver. For the built-in connectors, we already implemented those three sets of plugins here in the PXF development team. For HBase we have a fragmenter, an accessor, a resolver. For Hive we have those as well, but these are really open.
These are really open for anybody, any customer, who wants to implement them in order to use HAWQ to query another data source that we don't have built-in support for. That is fairly easy to do. We supply an actual API in our documentation that explains how this would be done. I will explain it very briefly here, because that's a topic we could go into more deeply and that would take more time. In general, the fragmenter module returns a list of source data fragments and their locations.
Let's take an example. The HBase fragmenter, for example, will return a list of HBase table regions. The HDFS fragmenter will return a list of HDFS data splits on disk. The accessor gets that list of fragments, after HAWQ has done a few optimizations with them, and it reads them and returns records. For whoever is familiar with Hadoop: at its core it will use an input format and a record reader, and read from a specific fragment that the fragmenter returned. It will read the data and return records. The resolver will take those records and deserialize each record according to a given schema or technique, so maybe the data is serialized with Avro, or maybe in some special proprietary technique. The resolver is the one that's going to be able to resolve it and actually build a record that will be streamed back into HAWQ.
Let's look at a few examples. In the syntax you saw that we use a profile name. At the bottom of the slide you can see the LOCATION clause and PROFILE equals HBase. Each profile is specified in an XML file that describes the various plugins that comprise the profile. This is our HBase profile, and you can see the fragmenter, accessor and resolver. These are the actual Java class names that implement those plugins, but as I said this is extensible, so anybody can implement their own accessor, resolver and fragmenter, define a new profile, and then use that new profile in an external table definition and read data with it.
Here's just a simple example, a profile called YYZ. Here I chose to use the HDFS data fragmenter, which is already available. It's a built-in fragmenter that can read data from HDFS, and I want it because I want to read my HDFS data. But my HDFS data is actually in some proprietary file format that I wrote myself and nobody else can read, and I also serialized my data in a special way that nobody else will be able to deserialize, so I wrote my own accessor and resolver and included those in the profile, and that enables me to use them transparently. In here, you can see, I can specify some value and another key. I can specify any keys and values that I want, and they will automatically get shipped to my fragmenter, accessor and resolver.
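Using such a custom profile from SQL would look roughly like this. The table, path and key/value pairs are hypothetical; only the YYZ profile name comes from the example above:

```sql
-- Hypothetical use of the custom YYZ profile; any extra key=value pairs
-- in the LOCATION string are passed through to the profile's fragmenter,
-- accessor and resolver at run time.
CREATE EXTERNAL TABLE my_data (k text, v text)
LOCATION ('pxf://pxf-host:50070/my/dir?PROFILE=YYZ&SOME-KEY=some-value')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```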
I said I'm going to talk about GemFire, JSON, Accumulo and Cassandra. These are actually plugins that the development team did not write. They were written by other teams, either by customers or by other development teams. We have the plugins for GemFire XD; these were written by the GemFire engineering team. We have JSON and Cassandra, which were written by a field engineer, Adam Shook, specifically, and then an Accumulo connector as well, which was very recently completed and was written by another company, called ClearEdge, which works closely with Accumulo.
There are also some other proprietary connectors that some of our customers wrote. We here in the development team have not seen the code. We don't even know how it works; they just wrote it, plugged it in and they use it, so we think this is a really cool capability. By the way, there are, I think, a couple of blogs online; I think Adam wrote a blog about the JSON connector that he built, so there's some more in-depth information there for any of you who are interested in seeing what implementing one looks like. Also, in our documentation there is the API, which is public, so you can look at that as well.
Last slide before we go into the demo. This is a rough architecture. I'll go through it very quickly because I think we don't have much time, and the demo will probably be more interesting. We have here at the bottom three data nodes. On each data node there is a HAWQ executor (QE), basically a data node and various other services, HBase region servers and things like that, and also the PXF service on each one of the nodes. At the top right there is a NameNode with a PXF service, and on the left there is the HAWQ master node, which has the query dispatcher and is where the planner is also being used.
In Step #1, the user feeds SQL into the HAWQ master. In Step #2, a metadata request comes into the PXF service, and that's where the fragmenter is actually executed. As I said before, the fragmenter returns a list of fragments and some other metadata back to the HAWQ master, and then in Step #3 the HAWQ master dispatches a query plan to the query executors. The query plan includes the list of fragments that each query executor should actually process. Then in Step #4, each query executor requests the data for its specific fragments. The PXF service runs the accessor and the resolver, constructs a chunk of records and returns them into HAWQ, into the query executor. The query executor does what it needs to do with that, so basically all the SQL processing happens there, and then it returns the results to the query dispatcher and back to the user.
Yeah, I think let's go to the demo. Nikesh, can you let me know if this is clear enough? Is this visible?
Nikesh: Yeah, we can see the screen.
Alon: Okay, so this is a really, really basic example. I'm not using any big data here. I just want to show how PXF is used and what can be done with it, but you should know that this really does work with really big data and runs really fast. Actually, let me go back one step.
I'm using here what we call single cluster Hadoop. It's not the Hadoop VM that we ship with the product, but a really cool single-node tool that we wrote here in the development team. I'm running on my one node. This is just the one node. You can see three data nodes here, and the HBase region servers and Zookeeper and Hive, so everything runs here on my machine, and I'm going to do the demo on that.
What I'm going to do is create some data in HBase and some data in HDFS, and then we're going to create external tables for both with PXF and see if we can query the data. For HBase, I'm going to create a purchase table. It has two column families, CS1 and CS2, and various qualifiers: date, vendor code and location are in the first column family, and price is in the second one. I'm going to insert three rows here, so let's do that first. Okay, let's go into HBase; I'm going to create the table and insert the data into it. Actually, let me make this a bit smaller. Here we go.
Let's scan the table to make sure that everything is in there. Okay, so we can see those three rows. The row key is on the left, and the various qualifiers and column families are on the right with their respective values. Now let's go back. Actually, let's go into HAWQ. I'll use the same [inaudible 00:43:35] here. I created a HAWQ database called demo. Log in to it. There are no relations here yet, so let's create the first one and go quickly over the [inaudible 00:43:57].
I created an external table and called it hbase_purchase. I'm asking to be able to read the record key, and here I'm asking for three of the four qualifiers. I'm doing that just so you can see it's possible; this is actually a feature. We see many cases where a table has a really large number of qualifiers, maybe even a million or more. We don't want to always read all of them; we want to analyze specific things.
So we pick a subset of qualifiers that we're interested in, and we'll still be able to project onto only the ones that we want. We don't always have to use all of them; this is just the subset that interests me, such as date, vendor code, and price, and I'm not interested in the others. The LOCATION here points to the PXF service and to purchase, which is the HBase table name that we just created, and profile equals HBase tells PXF that we're going to read HBase data. The format is the binary format that I spoke about before.
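The external table being described might be sketched roughly like this, using HAWQ's PXF syntax for HBase. The host, port, column-family names, and qualifier names are assumptions reconstructed from the narration, not taken from the actual demo:

```sql
-- Hypothetical reconstruction of the demo DDL. PXF maps external-table
-- columns to HBase cells: "recordkey" is the row key, and quoted
-- "family:qualifier" names select specific qualifiers.
CREATE EXTERNAL TABLE hbase_purchase (
    recordkey          TEXT,     -- HBase row key
    "cs1:date"         TEXT,     -- "date" qualifier in the first column family
    "cs1:vendor_code"  INTEGER,  -- "vendor code" qualifier
    "cs2:price"        NUMERIC   -- "price" qualifier in the second column family
)
LOCATION ('pxf://localhost:51200/purchase?PROFILE=HBase')  -- host/port assumed
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```

Note how only three of the four qualifiers are declared; the location qualifier is simply left out, which is the project-on-a-subset feature being described.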
Let's run a simple query, and as you can see, this is the exact data that I just inserted into the HBase table. Now, this is already really cool, but I can do anything I want with it, so I have two other queries here which are still very simple but show that you can really do anything you want on this data set. In one I'm ordering by price; in the other I'm asking for the average price. So really, anything that you'd like to do, you can do here.
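The three queries mentioned might look like the following; the table and column names are assumptions reconstructed from the narration, not taken from the demo:

```sql
-- Select everything back out of HBase through PXF
SELECT * FROM hbase_purchase;

-- Order the purchases by price
SELECT * FROM hbase_purchase ORDER BY "cs2:price";

-- Compute the average price with a regular SQL aggregate
SELECT AVG("cs2:price") FROM hbase_purchase;
```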
Okay, so let's go back and look at the HDFS example. I'm going to create a directory called demo_data on HDFS and load some data files into it. We'll see what the data looks like in those files, and then I'd like to create a PXF table that knows how to read it. I apologize for the warnings here; those are unrelated to the product.
We have three different files. One of them is called "uninteresting" because I don't want to read that file; I only want to read the vendor A and vendor B files, and I'll show you how that can be done with PXF. Let's actually look at the data. Let's take this one: there's only one row here, some uninteresting data, so we'll want to ignore that. And let's look at this one: it has 20 rows of vendor ids and vendor names, and the other one has 10 more, so this is the data set that I'm actually interested in.
Let's go back and now create the PXF table that will be able to read that data, going back to [T-SQL 00:47:41]. In the table I'm specifying a vendor id, which I can give any name that I want, as an integer, and a vendor name. The PXF LOCATION is my data of interest, so it's under the demo_data directory, and I'm specifying a wildcard pattern here: I only want vendors_*.txt to be read. I'm also specifying that the profile is HdfsTextSimple, which is the built-in delimited-text profile, that the original data is in text format, and that the delimiter is [inside 00:48:31] as you may remember from the [Hdfs 00:48:33] [inaudible 00:48:34].
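A rough sketch of the HDFS external table being described; the host, port, path, and delimiter are assumptions, since the recording is unclear on those details:

```sql
-- Hypothetical reconstruction: read only the vendors_*.txt files under
-- /demo_data, using the built-in delimited-text profile.
CREATE EXTERNAL TABLE hdfs_vendors (
    vid    INTEGER,  -- vendor id
    vname  TEXT      -- vendor name
)
LOCATION ('pxf://localhost:51200/demo_data/vendors_*.txt?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (delimiter = ',');  -- delimiter assumed
```

The wildcard in the LOCATION path is what makes PXF skip the "uninteresting" file entirely.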
Let's run a SELECT here, and there you go, you see the data. You see that the data from the other file in that directory is not shown; it was correctly ignored by PXF. And we can do the same with order by vid, really anything that we want, like in the HBase example. Now, finally, I'll run one last query. This query shows how you can actually run a join on those two PXF tables, the HDFS vendors table and the HBase purchase table. I'm running the join on vendor id in the HDFS data and vendor code in the HBase data, so both of those tables are defined in HAWQ, but the data is not in HAWQ. Got it. Okay, so actually I'll just go back here.
Here is the query, and here's the result. You can see that in the HBase table we wrote only three vendor codes, and they match only three vendors in the HDFS data, so this is the correct result. As I said, you can also join with internal HAWQ tables. You can really do anything that you want here, so with that, Nikesh, I'll turn it back to you.
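The join being described could be sketched like this; the table and column names are assumptions reconstructed from the narration. The point is that HAWQ plans and executes an ordinary SQL join even though neither side's data lives in HAWQ:

```sql
-- Join HDFS vendor data with HBase purchase data entirely in SQL;
-- vendor id in HDFS matches vendor code in HBase.
SELECT v.vname, p."cs1:date", p."cs2:price"
FROM   hdfs_vendors   v
JOIN   hbase_purchase p
  ON   v.vid = p."cs1:vendor_code";
```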
Nikesh: Okay, perfect. Thank you, Alon, for that overview of the Pivotal Extension Framework and the demo with HDFS and HBase. I think that was some pretty cool stuff; you definitely showed the extensibility and, more critically, the ease with which anybody can build connectors on their own, whether within the Hadoop ecosystem or outside it. Just to summarize real quick here, let me pull up my slide.
We outlined a lot in this past hour, really at a high level: the leading architecture that HAWQ delivers with our new query optimizer; our breadth and depth of SQL compliance and the importance of that as organizations move workloads off of the enterprise data warehouse; and obviously the extension framework that Alon just went through and its ability to connect to other data sources. These are some of the key, fundamental things we're working on, really enabling HAWQ to be a leading query engine within the business data lake reference architecture that we keep talking about here at Pivotal.
With that, why don't we move over to the question and answer portion of today's session. For those still with us, please remember you can submit your questions via the Q&A chat box to all panelists. Alon, there were questions that came through while you were going over the demo. One interesting question I saw come through was from an individual who asked: is it possible to read a HAWQ table, not the external table data, using a MapReduce program?
Alon: Yeah, this is out of the scope of PXF, but it is possible. HAWQ delivers dedicated input and output formats that can be used in any Hadoop workload, so basically, if you want to run a MapReduce job that reads from a HAWQ table, you'll need to use the HAWQ input format in your [mapper 00:52:52], and then you'll be able to do it.
Nikesh: Okay, and then another question that came through: does HAWQ run on YARN?
Alon: I'm probably not the best person to answer that. I think that in the current release it does not, but I think there are plans to run [inaudible 00:53:24] on.
Nikesh: Yeah, you are correct. As of right now HAWQ is not YARN-aware, but we have every intention to make it so, and that will be coming in upcoming releases this year.
Another question came through, and I'm not sure I understand the full question, but I think the individual is asking, referring to what you were talking about earlier, whether HAWQ can query other relational databases. The answer to that is simply yes: GemFire XD is its own in-memory database that HAWQ can run queries against, as well as some of the others we mentioned.
Anything to add to that, Alon?
Alon: I think that depends on whether the other database system exposes a parallel API. If any database system has an API to talk to its master, the entry point, through JDBC, ODBC, and things like that, you could use PXF through that, but that is really not useful, because it would just be really slow, streaming the data through one stream. You could always do that, but if the database system exposes a parallel API, so that you can actually access the various nodes of the parallel database and ask for data from those nodes, then you should be able to write a PXF plugin that will do that.
In the case of Hive, Hive is a bit different because it's Hadoop based. It is sort of a data warehouse, and you can run SQL on it and everything; it is relational, but it has the various HDFS capabilities underneath. So in our Hive connector we don't really talk to Hive; it's not that Hive exposes a parallel API. We just ask Hive for the table's metadata, and then we talk to HDFS directly. That may be a special case, I would say, but in general almost anything is possible. It's extensible, and as long as the API is there, you can.
Nikesh: I think another question, and I know we're running short on time, so maybe this will be the last one before we wrap up. Again, this might have been asked before you went through some of your slides, Alon: somebody was asking, do we provide some examples of data stores providing a parallel API?
Alon: Can you clarify that again?
Nikesh: They wanted to know whether we provide examples of data stores that provide a parallel API.
Alon: Got it.
Nikesh: You had mentioned it's only available when they have a parallel API. Are there examples?
Alon: Sure, let's take HBase for example. HBase has a Java API that allows you to get data from a specific region server: in HBase, the table data is divided into regions, and regions are served by region servers, so it exposes a Java API that lets you contact those directly. If we take another example, HDFS also allows you to access data that lives on a specific data node. Usually what happens is that the name node is accessed first and supplies the metadata, but the client actually gets the data from the data nodes themselves, so that also constitutes a parallel API. Those are two examples.
Nikesh: Okay, perfect. Before we leave, I just wanted to remind you all that there are plenty of resources available today on our website, GoPivotal.com. You can easily mimic the demo shown today by downloading our Pivotal HD single-node VM, which includes both HAWQ and GemFire XD, or our 50-node version of Pivotal HD for unlimited use with a 60-day version of HAWQ. We have plenty of documentation and self-paced tutorials, along with available support and knowledge-base forums. In addition, you can find third-party technical whitepapers that outline our query optimizer, architecture, SQL compliance, and performance, and as Alon mentioned earlier, be sure to check out some of our blogs: one directly on PXF, one on JSON in Hadoop, and another on analyzing raw Twitter data with HAWQ and PXF.
Well, that pretty much concludes our session for today. I wanted to thank all of you who are still on the call past the hour, and on behalf of Alon and myself here at Pivotal, thank you for your participation in today's webinar.