Scale Out Your Big Data Apps: The Latest on Pivotal GemFire and GemFire XD

Companies across all industries and sizes are investing in strategic custom applications to enhance their competitive advantages. Developing these applications requires continuous improvement, based on insights gleaned from collecting and analyzing the data that they generate.

Big Data for high-performing, scalable and reliable applications requires a new set of tools and technologies. Pivotal GemFire is a distributed in-memory NoSQL data management solution for creating high-scale custom applications. Pivotal GemFire XD supports structured data as part of the industry’s first Hadoop-based platform for creating closed loop analytics solutions – enabling businesses to continuously optimize real-time automation in their applications.

In this webinar, Pivotal experts share information about Pivotal GemFire and GemFire XD and how modern Big Data technology can enhance applications.


Gregory Chase
Director of Data Product Marketing, Pivotal

Greg Chase is an enterprise software marketing executive with more than 20 years of experience in marketing, sales and engineering with software companies. Most recently, Greg has been passionately advocating for innovation and transformation of business and IT practices through big data, cloud computing and business process management in his role as Director of Product Marketing at Pivotal Software.

Luke Shannon
Field Engineer, Pivotal

Luke Shannon is a Field Engineer at Pivotal specializing in Spring, tc Server, RabbitMQ, GemFire and GemFire XD. Luke previously worked as a Senior Sales Engineer at JasperSoft, a Developer at TurboPromote, an IT consultant at NativeMinds and a Technical Writer at 9003 Inc.


Host: Hello and welcome. Thank you for attending the webcast Scale Out Your Big Data Apps: The Latest on Pivotal GemFire and GemFire XD. I'm happy to introduce our presenters for today, Gregory Chase, Director of Data Product Marketing at Pivotal; and Pivotal Field Engineer, Luke Shannon.

Please feel free to submit your questions anytime during the webcast via the chat console. We will do our best to answer them at the end, or if we can't get to them we'll get back to you via email. Thanks again for being here and now I'll hand it off to Greg and Luke.

Gregory: Hello, good morning, good evening, to all depending on where you're calling in from. What we're talking about today is Pivotal GemFire and Pivotal GemFire XD. We wanted to update you really quick on what's in the latest releases and cover very briefly the kinds of use cases where GemFire and GemFire XD are used. We also have a really great demo that Luke will show you and then we'll take any questions at the end.

These tools are extremely powerful and comprehensive. So, there's a lot, especially from the developer point of view that you may want to learn about these tools that we will not have time to cover in about 30 minutes. If you have questions and you want to know more information, or have a longer conversation, please let us know. We'll be more than happy to spend time with you on what it is you're trying to build and how GemFire can help you take this to amazing scale.

What's new with GemFire and GemFire XD? We recently had two releases of both. GemFire 8 was released in September and GemFire XD had a new version, version 1.3, in October. Both of these offerings are distributed in memory databases – in the case of GemFire, for NoSQL applications; and in the case of GemFire XD, for SQL-based applications.

They're built for applications that need horizontal scale out performance, that need consistent database operations across all of these different share-nothing, globally distributed nodes. They need to be highly available, they need resilient grids and they need to achieve global scalability. In other words, the applications and the data need to not only be accessible across the world with low latency, but in fact, you may need to deal with geo-specific, data-specific laws.

Then, when you need powerful developer or database-based features or you want to work with database standards, these applications can help you. And then the same for if you have a globally distributed share-nothing grid you're going to need to simplify the administration of these.

Pivotal GemFire and Pivotal GemFire XD are part of Pivotal's Big Data Suite. They're the real-time component of Big Data Suite. Big Data Suite is a package that includes interactive insight and analytics with Greenplum Database and HAWQ and then includes unlimited Pivotal HD (Pivotal Hadoop). Essentially, if you look at it from a full circle point of view, you store data using Pivotal HD. You analyze it either using Greenplum Database, the massively parallel analytic data warehouse, or using HAWQ if you're going to be addressing data directly in your Hadoop cluster. Then, you use GemFire and GemFire XD to create applications that implement the insights that you discovered and also store data back into the Hadoop installation.

What's new? What's in the latest releases for GemFire and GemFire XD? Top line new capability is for GemFire 8. First, we offer more data per node. This is achieved through in-memory compression. I think we conservatively say you can add at least 50% more data per node. Higher resilience mainly in the automation of, should a node go down or become disconnected from the grid, it knows now how to rapidly bring itself back up, update itself and reassociate itself with the grid – all without the need of the administrator. More languages supported, in particular through a new REST interface, which I believe Luke will be actually showing you a bit of in the demo today.

In Pivotal GemFire XD 1.3, this is from many people's point of view more of an incremental release, but we did something really amazing with this application. We had a prior version of GemFire called SQLFire. This was a standalone SQL driven in-memory database; however, it was not connected necessarily to Hadoop. Then, we decided to combine the two of them into one single offer with Pivotal GemFire XD 1.3. If you are a prior customer of SQLFire, congratulations, you now have the ability to have GemFire XD 1.3.

With this, you gain the ability to choose your persistence. You can persist data standalone in local files as you would normally do, or you can persist data into HDFS. This has a couple of interesting opportunities. Now you can access that data through MapReduce in addition to the SQL-based tools in GemFire. You actually can start treating Hadoop as a read/write update file system through GemFire. Some of the other additional enhancements that we have made to GemFire XD are performance and resilience enhancements and we have radically improved the .NET support.

Let's talk about how GemFire and GemFire XD are commonly used. Scale up performance is really the first major reason people call Pivotal and want to work with us. When they start running into problems with traditional relational databases or just they know they need to serve a large number of concurrent users, they like the fact that all operational data is stored in memory. They like the fact that the grids are linearly scalable. If we need more capacity, if we need more processing capability, add more nodes and the nodes easily become part of the grid without too much administrative work.

The data distribution, the sharding if you will, of data between the different nodes is handled automatically and is very easy to configure. This means that you can take advantage of both individual nodes and replica nodes and really distribute work and queries all across the grid as it makes sense.
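To make the sharding idea concrete, here is a minimal Python sketch of hash-based partitioning. It is not GemFire code: the bucket count, the `bucket_for` and `member_for` helpers and the member names are all assumptions made for illustration. It only shows how a stable hash can spread keys across members without the application choosing a node.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 113  # assumed bucket count for this sketch


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket using a stable hash (crc32, not Python's salted hash())."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def member_for(bucket, members):
    """Assign buckets to members round-robin so each member owns a similar share."""
    return members[bucket % len(members)]


def distribute(keys, members):
    """Count how many keys land on each (hypothetical) member."""
    return Counter(member_for(bucket_for(k), members) for k in keys)


if __name__ == "__main__":
    keys = ["order-%d" % i for i in range(10000)]
    print(distribute(keys, ["server1", "server2"]))
```

Because the key-to-bucket mapping never changes, adding a member only requires moving whole buckets, which is one common way to keep rebalancing cheap.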

My favorite customer story related to this is about massive scale capability: the China Railway Corporation. Many of the workers in China have to leave their hometowns to go find work in other cities and they take the train to do so. Every Chinese New Year, they all want to come back to their families and they all want to take the train at about the same time. This spikes ticket purchasing like you've never seen on holidays and especially at that time of year.

The systems they had used to implement this before just could not handle the load. By taking advantage of GemFire to run their system, not only did they make the system more scalable and able to handle this demand in a pleasing way for their customers, in fact, they actually consolidated and used fewer systems to achieve this.

Another important use case for GemFire and GemFire XD is when people don't want just massive scalability, what they want is consistency across nodes. You can imagine if you have a lot of nodes that are all operating and sharing on data it doesn't take much to start screwing up the consistency of the data. In fact, GemFire and GemFire XD can run a distributed database grid with consistency.

How do we do this? First of all, as I mentioned, all operational data is stored in memory; however, it is persisted to disk. This means that should a node go down or should the grid go down, it will come back up and it will have a consistent durable state of the data. Second of all, there is a lot of work, several layers of consistency checks, built into the grid itself. You can configure this depending on how you want to optimize the performance of your grid. You can even disable the consistency checks if what you really want is a cache rather than a distributed database. This ensures that your application never gets an object that is inconsistent with the data.

As a result of this, another important thing is when you're working with grids and you're working with distributed data across the systems, your queries need to know where to go to be effectively operated on. It doesn't make sense, for example, to have one grid operator trying to query data that's held by another grid unit. What you really want is the query engine to be able to dispatch subqueries out to the appropriate grid members as it makes sense.

Then, the last thing is since you know that certain grids operate and shard on certain data you want to be able to send function calls and operations to the grid member that has the data, so you don't need to send data across the wire to another processing member. All of this is built in for you as part of GemFire. Then, of course, we have indexing available. We have triggers and event notification; all the kinds of things you would expect to find in a traditional database are actually here and are distributed aware.
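The idea of sending the function to the data, rather than the data to the function, can be sketched in a few lines of Python. This is an illustration only: the `Member` and `Grid` classes and the key names are hypothetical and do not mirror the GemFire API.

```python
import zlib


class Member:
    """A hypothetical grid member holding a slice of the data in a local dict."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.calls = 0  # how many functions were executed on this member

    def execute(self, func, key):
        """Run func next to the data; only the small result travels back."""
        self.calls += 1
        return func(self.data[key])


class Grid:
    """Routes puts and function calls to the member that owns each key."""

    def __init__(self, members):
        self.members = members

    def owner(self, key):
        index = zlib.crc32(key.encode("utf-8")) % len(self.members)
        return self.members[index]

    def put(self, key, value):
        self.owner(key).data[key] = value

    def execute_on_owner(self, func, key):
        return self.owner(key).execute(func, key)
```

For example, `grid.execute_on_owner(lambda c: sum(c["orders"]), "cust-1")` runs the sum on whichever member holds `cust-1` and returns only the total.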

One of my favorite stories of this is Newedge. Newedge is a financial trading processing company. They do post trade processing. If you trade stocks on an exchange, they're basically the people that help figure out the settlements after the fact. They wanted to have a global system across their many different offices to be able to show them the state of their business at any time. If you have many different locations around the globe all operating on the same data, consistency (especially when you're dealing with financial data) is very, very important.

The last major use case of where GemFire is used is in the case of high availability. High availability and resilience can mean a few things. Yes, you want your data to always be available if possible, or at least you want your data to not get corrupted if it goes down. If you're dealing with the grid you want the ability to have it heal itself and not have to call your operator and have him come out and spend hours reconfiguring nodes and reconfiguring systems to bring it back up. All of this, split brain detection and reassembling clusters is all built into GemFire's functionality.

A second piece – both in terms of high availability and disaster recovery, as well as achieving global scale – is WAN connectivity between sites. In this particular case, you can run clusters in different sites and this automatically gives you disaster recovery capabilities as well as multi-site failover capability. This is also a mechanism we use for achieving global scale.

Here, for example, let's say you have a global application but you're dealing with European countries and in European countries personally identifiable information cannot leave country borders. You can have a cluster dedicated to each European country and achieve the requirements but still also be able to run a global application for your company or your customers.

One of my favorite examples of a customer who needed to achieve this is GIRE. GIRE is a payment processing company out of South America. They have a bunch of kiosks and branches all around the country that are linked together with intermittent network connections. Whether that's an issue with the actual network or its design I don't really know, but the key thing here is sometimes they're connected to the grid and sometimes they're not. They actually take advantage of the resilience that's been built into GemFire to be able to continue to operate and take payments, no matter what the connectivity state is at the time. When the grid and the network and the clients are all connected back together, it handles changes. It keeps everything consistent and they're able to process 19 million transactions a month from 4,000 points of sale.

Really quick and there's a lot of information in these slides and I won't have much time to go into the details. Basically, there are two main deployment methods. You can see how we start from being maybe part of a single web server. In this typical example here you're actually caching information, taking advantage of hybrid functionality in something like tc Server. This you can do without even changing your code. If you need to scale up, you can create a cluster. You can create a tiered cluster if you're getting into application-specific data types and then when you need performance and availability you bring in the WAN connectivity to go between multiple sites and achieve geodistributed capabilities.

GemFire XD's a little bit different. GemFire XD uses standard based connectivity, so in fact because it's using ODBC and JDBC you do not have a client. You're using your standard good old-fashioned Java or .NET or C++ database connectivity capabilities. As I mentioned, GemFire XD has the ability to do local disk persistence or persist to a Pivotal HD cluster. Again, you have the capability of creating a tiered cluster and taking the WAN connectivity to go into geodistributed capabilities.

Going deeper, if you were using this in a Hadoop context, here's an example of how you could use GemFire XD for that persistence, into your Hadoop. You can use HAWQ for doing inquiries on structured data that's stored and you can also take advantage of the MapReduce capabilities to do additional information there as well. Then, in the case of what would you put in here, you'd be putting in SQL objects, JSON and you'd be taking advantage of say serving online applications, working with Internet of Things sensors and then your analytic tables and apps would hit against HAWQ.

If I haven't taken all your time here, Luke, how would you like to grab the ball and show us a little GemFire?

Luke: Thanks, Greg. Great introduction for the demo. What I want to show today is an application that is powered by GemFire. This is a key value application, so key value store, we're using that model in GemFire. It contains a client that connects to the grid. I just want to walk through some of the features that GemFire would get through an application like this and then talk about how it can scale.

This is a Spring-based MVC application. One nice thing when you're working with GemFire is we have a very nice Spring story. So, the programming model is very simple. You can use Spring Data repositories to communicate with the grid, so it becomes a very familiar model to a lot of people out there.

We can do things like search for our customer. You'll notice as I'm typing I'm getting an autocomplete here. We're basically calling the grid and passing in characters and doing LIKE searches and due to GemFire's low latency we get back results very quickly. We can do CRUD operations, so I can actually update this record. Very common features. Very easy to implement.

Now let's place an order. We'll go here to the order section. Once again, as I start typing I'm able to get an autocomplete. We'll go with that record we updated and let's put in a small order for three. The order's completed. If we view the customer's profile, so all this data are objects that are being stored in GemFire and we'll take a look how that looks on the grid side in a second.

The idea is there's also a transaction score here. The transaction score is a count of orders; however, if there are more than three orders on a single day or an order occurs on someone's birthday the score is actually higher. There's a bit more logic than just regular counting. We'll see that this is actually a function that executes on the GemFire grid, so it's some Java logic we have deployed to the grid. We'll just save that for later. You can see here there's all the orders in the system.
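The scoring rule described here could look something like the following Python sketch. The demo's real function is Java deployed to the grid and its exact weights are not given in the webinar, so the two bonus constants and the `transaction_score` name are assumptions chosen only to show the shape of the logic.

```python
from collections import Counter
from datetime import date

BUSY_DAY_BONUS = 1   # assumed: extra point for each day with more than three orders
BIRTHDAY_BONUS = 1   # assumed: extra point for each order placed on the birthday


def transaction_score(order_dates, birthday):
    """Count orders, then add the assumed bonuses for busy days and birthday orders."""
    score = len(order_dates)

    orders_per_day = Counter(order_dates)
    busy_days = sum(1 for n in orders_per_day.values() if n > 3)
    score += BUSY_DAY_BONUS * busy_days

    birthday_orders = sum(
        1 for d in order_dates if (d.month, d.day) == (birthday.month, birthday.day)
    )
    score += BIRTHDAY_BONUS * birthday_orders
    return score


if __name__ == "__main__":
    orders = [date(2014, 11, 3)] * 4 + [date(2014, 5, 17)]
    print(transaction_score(orders, birthday=date(1980, 5, 17)))
```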

If we go back home, our order count is one. Now, just as a little background, these modules are actually pinging the GemFire grid every three seconds and getting aggregates of stock by type and stock by brand, as well as this count. GemFire's ability to handle a lot of connections and provide low latency response is critical here.

Now you'll see this backorder is zero. Let's take a look at what a backorder is like. This time we'll go with soccer shoes. We'll stick with our customer that we've been working with here. Let's do a pretty large quantity. This item is backordered. You'll see the backorders are now one.

This is a situation where the data is in memory in the grid, but it actually may have some value to other parts of our organization. If we log onto the server here for a second, I have a MySQL instance running. If you see, the order I just put through for a quantity of 9,654 has been written to a database on a server. What's happened is our client, our application which contains the GemFire client, put a record into this backorder region. GemFire builds them up in a queue and then asynchronously writes them to this database. This is actually exactly what GemFire XD does when it’s writing to HDFS. In this case, I've actually extended that functionality to have it write to the MySQL database. Now someone from the organization that I work for with SQL tools can just come in and build up a backorder report.
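The queue-then-write-asynchronously pattern described here is often called write-behind. Below is a minimal Python sketch of it, assuming SQLite as a stand-in for the MySQL instance in the demo. The `WriteBehindQueue` class, the `backorder` table layout and the `demo` helper are hypothetical and are not part of GemFire.

```python
import os
import queue
import sqlite3
import tempfile
import threading

_STOP = object()  # sentinel that tells the background writer to finish


class WriteBehindQueue:
    """Accept puts immediately and write them to a database from a background thread."""

    def __init__(self, db_path):
        self._db_path = db_path
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def put(self, order_id, quantity):
        """Return right away; the caller never waits on the database."""
        self._queue.put((order_id, quantity))

    def _drain(self):
        # The connection is created and used only inside the writer thread.
        conn = sqlite3.connect(self._db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS backorder "
            "(order_id TEXT PRIMARY KEY, quantity INTEGER)"
        )
        conn.commit()
        while True:
            item = self._queue.get()
            if item is _STOP:
                break
            conn.execute(
                "INSERT OR REPLACE INTO backorder (order_id, quantity) VALUES (?, ?)",
                item,
            )
            conn.commit()
        conn.close()

    def close(self):
        """Flush everything queued so far and stop the writer."""
        self._queue.put(_STOP)
        self._worker.join()


def demo():
    """Queue one backorder, flush, and read it back with plain SQL."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "backorders.db")
        writer = WriteBehindQueue(path)
        writer.put("order-1", 9654)
        writer.close()
        conn = sqlite3.connect(path)
        try:
            return conn.execute("SELECT order_id, quantity FROM backorder").fetchall()
        finally:
            conn.close()
```

The design choice to notice is that `put` only touches the in-memory queue, so a slow or briefly unavailable database delays the report, not the application.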

Now that we're out of the server, let's take a look. We have a tool called Pulse. This is a visual tool to see how the grid looks. At present, we have one server with three processes. Each GemFire process is its own JVM. We have a locator process, which is like a membership coordinator and then data processes that actually store the data. Those two processes combined give us 2 GB of heap. You'll notice we have one client connected, which is this application. We'll talk a little bit about these queries shortly.

Now if we take a look at the data, as we put data into the grid we put it into different regions. These are basically hash maps, collections of key values. This transaction region is partitioned. What this means is I have two member processes and we have about 10,000 records and they're spread evenly across the two processes. What this means is each process also has the functionality to do the order count. When my client makes the call to request that, these two processes start cranking on the records they have in their JVM in parallel to give the answer back to my client that made the call, and that allows me to aggregate the results over that large data set here in my client. The processing is all offloaded and distributed across the grid. We'll see how that scales shortly.
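The parallel count described above is a scatter-gather: each member computes a partial result over its own records and the client combines the partials. Here is a small Python sketch under that assumption, using threads to stand in for member processes. The `count_orders` and `scatter_gather` names and the record layout are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_orders(partition):
    """Runs on one member: count only the order records held locally."""
    return sum(1 for record in partition if record["type"] == "order")


def scatter_gather(partitions, func, combine=sum):
    """Run func on every partition in parallel, then combine the partial results."""
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(func, partitions))
    return combine(partials)


if __name__ == "__main__":
    member_a = [{"type": "order"}] * 3
    member_b = [{"type": "order"}] * 2 + [{"type": "backorder"}]
    print(scatter_gather([member_a, member_b], count_orders))
```

Only the per-member counts cross the network in this model, which is why adding members adds compute capacity without adding much traffic.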

In GemFire 8, we have added the Pulse data browser. GemFire has an object query language, OQL, and what this allows me to do is execute a query and actually see data here right in the UI. This one is product 586, with 31 on hand, and it's a pair of red shoes.

Greg mentioned the REST UI. We have indeed added a REST UI for the GemFire grid with version 8. Traditionally, up till now, when you've been developing you used one of our client APIs. The one that we're looking at in the application I'm showing you is Java-based and specifically there's a Spring project called Spring Data GemFire that I've used to wrap the Java APIs to make my programming model a little bit more straightforward. There's a C Sharp and a C++ library as well.

You would bring in those libraries, use those APIs and that would allow you to work effectively with the grid. That's doing your CRUD operations, calling functions, getting metadata, etcetera.

The REST API is now opening up the grid for something like Python. We can execute functions and we can do queries on the data. Let's take a look at the region operations. With regions, I can list all the regions. Let's see how that looks. These are all the regions. This is everywhere I put my data. There's my partitioned transaction region, which is so important.

Let's take a look at getting an object. We'll get the one that I just queried. Indeed, we see our red shoes with stock on hand of 31. With this new REST functionality now all kinds of languages, basically anything that can do a REST call, can now leverage the power of the grid.
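Since Python is mentioned as a REST client, here is a hedged sketch of what such calls might look like using only the standard library. The webinar does not show the URLs, so the host, port, `/gemfire-api/v1` base path, the `queries/adhoc` path and the `Product` region name are all assumptions to check against the GemFire REST documentation for your version.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumed host, port and base path; adjust to your own deployment.
BASE = "http://localhost:8080/gemfire-api/v1"


def regions_url(base=BASE):
    """URL assumed to list the regions exposed over REST."""
    return base


def entry_url(region, key, base=BASE):
    """URL assumed to address a single entry by region name and key."""
    return "%s/%s/%s" % (base, quote(region, safe=""), quote(str(key), safe=""))


def adhoc_query_url(oql, base=BASE):
    """URL assumed to run an ad hoc OQL query passed in the q parameter."""
    return "%s/queries/adhoc?%s" % (base, urlencode({"q": oql}))


def get_json(url, timeout=5):
    """Issue a GET and decode the JSON body (requires a reachable server)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a running grid, `get_json(entry_url("Product", 586))` would be the Python counterpart of the lookup shown in the demo, assuming the region is named `Product`.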

That gives you a little bit of an idea of how everything is working. Let's put some transactions into the system and see a few more things that GemFire can do and then see how it scales. What I've just started up is a Spring Boot application. It has a GemFire client built in. That GemFire client connects to the grid, so you'll see our client connections are now two, and it starts pumping transactions into my transaction region, 25 every five seconds. We have some data flow coming in now.

If you look at our UI, you'll see that there are some updates happening. GemFire's low latency gives us this great ability to make calls and adapt these lighter-weight UIs to how the data's changing. You can see this is showing a trend: as we poll the GemFire grid for these aggregates, it's actually going downwards. We can indicate that with an arrow.

This little component pings the grid about every three seconds. This is a Highcharts chart, a lightweight graph, and it gives us a count of orders in the system at this time. This is all polling. This area is very interesting. This is what's called a continuous query. When my client connected to the grid, it registered this query with the grid, and this is where you can see these two CQs and the subscription.

The application is basically saying to the grid: if a product object is changed and its quantity attribute becomes 22 or less, please let me know about that object. Every single client that connects can register such queries. What I've done is just displayed what is getting pushed from the grid to my application in this little module. This gives me a chance at run time to take advantage of that by placing an order for some stock that might be decreasing. There are all sorts of implications here for giving people run time control over things.
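A continuous query inverts the usual polling model: the client registers a predicate once and the grid pushes matching changes to it. The Python sketch below imitates that behavior in a single process. The `ContinuousQueryRegistry` class and its methods are hypothetical and are not the GemFire CQ API.

```python
class ContinuousQueryRegistry:
    """Hold registered predicates and push matching updates to their listeners."""

    def __init__(self):
        self._queries = []  # list of (predicate, listener) pairs
        self._data = {}

    def register(self, predicate, listener):
        """Register interest once; the listener fires on every later match."""
        self._queries.append((predicate, listener))

    def put(self, key, value):
        """Store the value, then notify every query whose predicate matches it."""
        self._data[key] = value
        for predicate, listener in self._queries:
            if predicate(value):
                listener(key, value)


if __name__ == "__main__":
    registry = ContinuousQueryRegistry()
    registry.register(
        lambda product: product["quantity"] <= 22,
        lambda key, product: print("low stock:", key, product["quantity"]),
    )
    registry.put(586, {"quantity": 31})  # no notification
    registry.put(586, {"quantity": 20})  # listener fires
```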

That gives us a good idea of some of the run time capabilities. Let's begin to take a look at how the system would scale out. I have another server here. These are just centralized Linux servers. What I'm going to do rather than remotely make calls is just log in to show you it’s a different server. What I'm going to do is start up two more GemFire processes. These GemFire processes contact this membership coordinator, the locator and the grid I first started. Thanks to our new cluster configuration service they get the configs and they're also admitted to the grid.

What's really interesting is you'll see our heap is going up, so we're scaling out our capacity. Also, if we look at how this region was distributed, right now you can see we still have our data spread across two. Now you'll see it's spread across four but unevenly. If you give it a second, we now have it spread across four fairly evenly. Our 10,000+ is now in small chunks across all those systems. We have redundancy set to one. That means every time I do a put into the grid, the grid will write a backup copy to another node. That means, if I lose a node, I won't lose any data.

Let's actually explore that right now. Adding those two members added to our capacity and our compute availability, but didn't affect anything here in our display. Let's actually take a member down, so we're basically going to kill one of these processes. Remember, these processes, they're just Java processes. Their configurations explain how the Java process can communicate with the grid.

You'll see our member count just dropped down to three and you'll notice our distribution. Now one member is actually taking more entries, but because of our redundancy there is no drop in the order count, because the grid was able to say we lost some objects here but we have backups. What the grid will do is appoint new backups on another member and then reassign the primaries. We always keep that data. Then, if I want, I can actually kick this member back on. Let's say someone unplugged a server accidentally and now they're going to plug it back in; what the server again will do is contact the primary membership coordinator and say, 'I'm coming back in.'
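The behavior just shown, a redundancy of one with backups promoted when a member is lost, can be modeled in a short Python sketch. This is a toy model with hypothetical names (`RedundantGrid`, `fail`), not GemFire's algorithm. It assumes the grid starts with at least two members and never loses its last one.

```python
class RedundantGrid:
    """Toy model: every entry has one primary copy and one backup copy."""

    def __init__(self, members):
        self.store = {m: {} for m in members}  # member name -> local entries
        self.primary = {}                      # key -> member holding the primary
        self.backup = {}                       # key -> member holding the backup

    def _least_loaded(self, exclude):
        """Pick the member with the fewest entries, skipping one member."""
        candidates = [m for m in self.store if m != exclude]
        if not candidates:
            return None
        return min(candidates, key=lambda m: len(self.store[m]))

    def put(self, key, value):
        """Write the value to its primary and to a backup on a different member."""
        if key not in self.primary:
            owner = self._least_loaded(exclude=None)
            self.primary[key] = owner
            self.backup[key] = self._least_loaded(exclude=owner)
        for member in (self.primary[key], self.backup[key]):
            if member is not None:
                self.store[member][key] = value

    def get(self, key):
        return self.store[self.primary[key]][key]

    def fail(self, member):
        """Drop a member, promote backups to primary and appoint new backups."""
        lost = self.store.pop(member)
        for key in lost:
            if self.primary[key] == member:
                self.primary[key] = self.backup[key]  # promote the surviving copy
            survivor = self.primary[key]
            new_backup = self._least_loaded(exclude=survivor)
            self.backup[key] = new_backup
            if new_backup is not None:
                self.store[new_backup][key] = self.store[survivor][key]
```

After `fail("m1")` every key is still readable and again has two copies on distinct members, which matches the unchanged order count seen in the demo.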

These members can have data persisted on disk. In this particular case, that data would be known to be stale, so the member would join, be admitted to the system and then get new data redistributed. It might take on the secondary copies for a bunch of objects that need secondary copies and then, as new puts come in, the primaries may go there. The grid attempts to balance the traffic and also maintain consistency.

In a second or two, we should see this member. Now we have four members again. We should see this entry count get redistributed. There we go. Now we're back to our even distribution. All of this happened without a blip on the application side thanks to GemFire's resiliency of data. We were able to do all that scaling without it affecting performance.

That's really all I have to show today. Greg, I'll turn it back over to you.

Gregory: Very good. That was a very nice fast demo. I really liked how you were able to show the responsiveness of your application and, at the same time, how you were able to actually take advantage of the in-memory speed to create pleasing user experiences such as that really super fast autocomplete.

We have a number of good questions. We're a little bit over time, so I apologize for that in advance. If you want to stick around and hear our questions we'll be happy to answer them as we get them. First one is, is this demo public? We are in the process of doing that. Matter of fact, I believe, we're even eventually trying to make it possible to get your hands on the code. So stay tuned for that.

Let's see. Some other questions. Luke, here's a good one for you. Explain what you mean by NoSQL? What can GemFire do?

Luke: In this case, NoSQL just means I'm working with the key value store. The keys are how I look up an object and that key is an object itself. The value can be an object or even an object graph, so we can have nested objects. What this means is when I'm working very close to my application code I can actually persist the objects that I'm working with without an ORM mapping layer.

I can persist an object into the grid, flood the grid with a whole bunch of them, do a calculation in memory and get back a result and the objects that I'm putting in I can get back without any kind of translation. I would say that would be the context that I was referring to for NoSQL at this point.

Gregory: How does that work with JSON?

Luke: GemFire can actually support JSON. It can take JSON as a type, and OQL can actually work with JSON. Of course, the REST API is JSON compliant. If I was using an application that was working with JSON, GemFire has full built-in support for that.

Gregory: Very good. Here's another for you. How much data can GemFire handle?

Luke: That sort of depends on your Java tuning and your data policies. I was only using a redundancy of one here. You might want to up your data redundancy. Indexes add memory overhead and of course with any Java process you need to have room for garbage collection and regular Java management. We have lots of customers with data of five terabytes and upwards. So, a fair amount. Thanks to compression you can actually fit a little bit more. I didn't use this in this current example because my data sets are quite small.

Gregory: Okay, a third question, I'll handle this one. What Hadoop distributions can GemFire XD handle? Today, we are currently supporting Pivotal HD (Pivotal Hadoop). We do test. Pivotal HD is extremely Apache compliant and we do test on Apache. It's not necessarily a technical limitation. It is our process of partnering between the different major Hadoop vendors in the ecosystem and this continues to be an evolving thing as the Hadoop ecosystem itself evolves.

Stay tuned, but for right now if you want support you are working against Pivotal HD. However, if you buy GemFire XD, chances are you will (also through Pivotal Big Data Suite) already have the ability to use unlimited Pivotal HD because we do not charge based on volume. We charge based on number of cores and processes used.

Let's see we have one last one here and Luke I'll pass this one to you. Please list the types of APIs supported by GemFire and GemFire XD.

Luke: For GemFire, we have our REST API as I showed you, so you can make the REST calls. We have a Java library, which basically allows you to do your CRUD operations and function execution. We have a C Sharp library, so from the .NET environment you can configure the regions, do CRUD operations, execute functions, query data. There's a query service. Same with the C++.

In terms of GemFire XD, as Greg said, it's JDBC or ODBC. GemFire XD is GemFire but with a SQL layer. As opposed to having regions, which are essentially hash maps (collections of key value), I actually have the concept of tables in memory. I have to make my rows and columns. In that case, I would specify DDL and in the DDL we have some reserved words to figure out how to distribute and replicate data across the grid and what level of redundancy you may have. Then, from that point, you would just use in the language you're working with whatever ODBC or JDBC support is present.

Of course, in Java with GemFire there's a great Spring project, Spring Data GemFire, which builds upon Spring Data and uses the repository approaches. Then, Spring Boot has support for GemFire, so you can build up a Spring Boot project and it will automatically (if you pick the GemFire starter kit) bring in everything you need to build an application. The app I demoed for you today is actually a Spring Boot application.

Gregory: We're well over time, but I think there are some really great questions and I'll take one last one to end up here. The question is: is GemFire suited to do distributed analytics at scale?

If it wasn’t made abundantly clear in the kinds of the things that we do, GemFire is a transactional database. It is built to do OLTP at incredible scale. It can do some analytics with the query capabilities that it has, but it is not an OLAP-style database. In fact, we have (in Big Data Suite) a different technology called HAWQ, which does massively parallel advanced analytics tied to one of the most powerful open source data science libraries in the market. That's what we use for analytics at scale on petabytes of detailed information. There really are very specific technologies for different specific use cases.

With that I will turn this back over to Katherine.

Host: Thank you. We’ll go ahead and bring this webcast to a close. We will be sending out the links to this webcast recording and the slide share deck of the session within the next few days. There's also lots of information on the Pivotal website if you're interested. Thank you again for joining us and have a great day.