GPU Powered: Accelerated Computing for the Future

GPUs are creating a big buzz in the market; the phrase "GPU powered" is becoming the norm. You can also see it in NVIDIA's share price rising at a fast pace. A couple of months back, I had to look into the GPU world, since there were multiple requests around doing data science using GPUs and several people had started asking me about it.

We all know that GPUs are used in mobile devices for things like games. My curiosity started rising, and some of the questions that came to mind were: how does a Graphics Processing Unit relate to the application/database/analytics world? How is it different from a CPU? Why is everyone suddenly talking about it?

The following are some of the links that helped me understand this world. Sharing my experience so far; you may want to start by watching these videos and posts:
What is a GPU and how does it work?
https://www.youtube.com/watch?v=0_TN845dxUU&t=1s
What is GPU Accelerated computing?
http://www.nvidia.com/object/what-is-gpu-computing.html
CPUs and GPUs – There’s enough room for everyone
http://sqream.com/cpu-and-gpus-theres-room-for-everyone/

First things first. You will realize that there are two types of GPUs: server grade and consumer grade.

Server-grade GPUs are designed so that they can be installed next to each other with no space separation, and fill all the PCI slots available on the server, thus optimizing space usage and maximizing the amount of compute power per rack space unit. You can fit four Tesla K80 boards in a 1U server; that’s 8 GPUs total (K80 boards have 2 GPUs each), and that’s an impressive amount of compute throughput. The same applies to Tesla Pascal P100 models, with the due differences (one GPU per board). If you are building a supercomputer or a GPU-based server farm and buying hundreds or thousands of GPUs, these details matter a lot.

Consumer-grade GPUs typically have active cooling with a fan that ingests airflow orthogonally to the longitudinal axis of the board. That requires space clearance to accommodate the cooling airflow. That leads to less dense configurations than in servers. Typical consumers do not care because they have computer cases with more vertical development, less need for density, and most users only have one GPU card per host.

Via : Daniele Paolo Scarpazza, https://www.quora.com/What-is-the-different-between-gaming-GPU-vs-professional-graphics-programming-GPU

My interests are around server-grade GPUs. Where can I find a server-grade GPU to explore?
Use AWS. AWS has G2, P2 and F1 instances with accelerator support; P2 generally fits my ecosystem very well. Refer to this presentation: Deep Dive on Amazon EC2 Instances (slide 42 onwards). Google and Microsoft also provide similar instances. NVIDIA has a GPU Cloud, which I believe is in private beta.

How do I start programming GPUs?
To program a GPU, NVIDIA created a platform called CUDA about 10 years ago. One can also use OpenCL. But these are low-level languages, like C.
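To make that concrete, here is a minimal sketch, in plain Python rather than actual CUDA C, of the idea behind a kernel: you write the work for a single thread, and the launch runs that body across a whole grid of thread indices. The function names and the `launch` helper are illustrative stand-ins, not a real API.

```python
# Conceptual sketch of the CUDA execution model in plain Python.
# A real CUDA kernel runs the body once per GPU thread, in parallel;
# here we emulate the thread grid with an ordinary loop.

def vector_add_kernel(thread_idx, a, b, out):
    """Body of a CUDA-style kernel: each 'thread' handles one element."""
    if thread_idx < len(out):          # bounds check, as in real kernels
        out[thread_idx] = a[thread_idx] + b[thread_idx]

def launch(kernel, grid_size, *args):
    """Stand-in for a kernel launch like kernel<<<blocks, threads>>>(...)."""
    for tid in range(grid_size):       # the GPU runs these concurrently
        kernel(tid, *args)

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * 4
launch(vector_add_kernel, len(out), a, b, out)
print(out)  # [11.0, 22.0, 33.0, 44.0]
```

On a real GPU the loop inside `launch` disappears: thousands of hardware threads execute the kernel body concurrently, which is why the per-thread bounds check matters.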
These may not be for everyone. Are there higher-level abstractions available? Yes. Let us start looking from the databases side.

Welcome to the world of GPU-powered databases. MapD and Kinetica are very promising in this space, and there are other databases like BlazingDB. Some of the benchmarks these vendors quote show 30x to 100x differences versus a typical MPP database in the market.
How do you set up one of these databases? AMIs are available in the AWS Marketplace. Use the open-source one to play with (it will work out cheaper). Try MapD's New York City taxi rides demo in your environment. When I saw the demo for the first time, I was speechless for some time.

Watch these videos of Todd Mostak, the founder of MapD, talking about the space. Very insightful.
The Rise of the GPU: How GPUs Will Change the Way You Look at Big Data
https://www.youtube.com/watch?v=mwpd13urFog&t=25s
The Promise of GPU Analytics
https://www.youtube.com/watch?v=qxWSSz8x6NI
On a side note, it is also important to understand why column stores work better in an in-memory world; the same columnar, in-memory ideas already power CPU-based in-memory databases today.
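A minimal sketch of why the columnar layout helps, using a toy table of taxi fares (real engines add compression, vectorized execution and GPU kernels on top of this idea):

```python
# Row-oriented vs column-oriented layouts of the same tiny table.

rows = [  # row store: one record per entry
    {"city": "NYC", "fare": 12.5},
    {"city": "BOS", "fare": 8.0},
    {"city": "NYC", "fare": 15.0},
]

columns = {  # column store: one contiguous list per column
    "city": ["NYC", "BOS", "NYC"],
    "fare": [12.5, 8.0, 15.0],
}

# Aggregating one column in a row store walks every record and skips
# the fields it does not need...
total_rows = sum(r["fare"] for r in rows)

# ...while a column store scans one contiguous array: cache-friendly,
# and a natural fit for the thousands of parallel lanes a GPU offers.
total_cols = sum(columns["fare"])

assert total_rows == total_cols == 35.5
```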

What else can we use GPUs for? How about visualization? If the visualization layer doesn't support this kind of model, you still will not get the interactivity you are looking for.
I liked the way MapD has done their visualization: taking the power of GPUs to the consumer end and rendering there. Look at OpenGL, Vega and D3.js, and you will see how GPUs can be used for visualization.
http://developer.download.nvidia.com/presentations/2008/NVISION/NVISION08_MGPU.pdf
You may want to take a look at these JS libraries:
https://stardustjs.github.io/
http://gpu.rocks/

What else can we do with GPUs? The real advantage of GPUs comes when we leverage their power for parallel processing. That means they are extremely useful for complex mathematical calculations and for scenarios like deep learning.
Today all the major deep learning frameworks (TensorFlow, Deeplearning4j, H2O) support GPUs natively. All you need to do is install the GPU-enabled build and establish the device mapping; TensorFlow will then take care of it automatically. Setting this up is straightforward. This approach enables support not only for CPUs and GPUs but also for TPUs (Google's latest) or anything that comes in the future.
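The device-mapping idea can be sketched in a few lines of plain Python. Note that this is a conceptual stand-in, not TensorFlow's actual API (there, the equivalent is automatic placement plus constructs like `tf.device`); all names here are hypothetical.

```python
# Conceptual sketch: the framework probes for accelerators and picks the
# best one, so the model code never has to name a device explicitly.

def available_devices():
    """Hypothetical probe; a real framework queries drivers (CUDA, TPU...)."""
    return ["CPU"]  # pretend this machine has no accelerator

def pick_device(preference=("TPU", "GPU", "CPU")):
    """Choose the fastest available device, falling back to CPU."""
    devices = set(available_devices())
    for dev in preference:
        if dev in devices:
            return dev
    raise RuntimeError("no usable device")

print(pick_device())  # CPU
```

Because the model code never names a device directly, the same script can run on a CPU today and on a GPU or TPU tomorrow.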

To understand more, watch this video: Effective TensorFlow for Non-Experts
https://www.youtube.com/watch?v=5DknTFbcGVM

BTW, I run Keras with TensorFlow as a backend and use Python (PyCharm) on my machine. :-)

Conclusion
I believe this is a great technology and that it is here to stay. In my opinion it is not CPU vs. GPU; it is GPGPU, and that will be the way forward. MapD has raised $25M and Kinetica $50M recently, which definitely shows the potential in this space. I also believe this model will bring back the scale-up model from our current scale-out model. Power consumption/energy conservation is one definite plus, along with reduced maintenance effort. Though the cost of the server (a P2 instance vs. a T2 instance) is high right now, it is not an apples-to-apples comparison, and prices will come down over time.

One thing I liked most about the GPU-powered database world is that, since you can run these queries and get sub-second performance, it helps you move away from all the pre-computations and aggregations we are used to in the analytics world. It also doesn't have the 32-concurrent-user restriction a typical MPP database may have. You will see a lot more traction in this area, with all the major companies moving in this direction and acquisitions happening.

As always, it’s great to play on the bleeding edge; you get to learn new things on a regular basis.

Happy Learning!

Data Infrastructure, Data Pipeline and Analytics – Reading List – Sep 27, 2016

Splunk vs ELK: The Log Management Tools Decision Making Guide
Much like promises made by politicians during an election campaign, production environments produce massive files filled with endless lines of text in the form of log files. Unlike election periods, they’re doing it all year around, with multiple GBs of unstructured plain text data generated each day.
http://blog.takipi.com/splunk-vs-elk-the-log-management-tools-decision-making-guide/

Building a Modern Bank Backend
https://monzo.com/blog/2016/09/19/building-a-modern-bank-backend/

An awesome list of microservices architecture principles and technologies.
https://github.com/mfornos/awesome-microservices#api-gateways–edge-services

Stream-based Architecture
Part of the Stream Architecture Book. An excellent overview on the topic.
https://www.mapr.com/ebooks/streaming-architecture/chapter-02-stream-based-architecture.html

The Hardest Part About Microservices: Your Data
Of the reasons we attempt a microservices architecture, chief among them is allowing your teams to be able to work on different parts of the system at different speeds with minimal impact across teams. So we want teams to be autonomous, capable of making decisions about how to best implement and operate their services, and free to make changes as quickly as the business may desire. If we have our teams organized to do this, then the reflection in our systems architecture will begin to evolve into something that looks like microservices.
http://blog.christianposta.com/microservices/the-hardest-part-about-microservices-data/

New Ways to Discover and Use Alexa Skills
Alexa, Amazon’s cloud-based voice service, powers voice experiences on millions of devices, including Amazon Echo and Echo Dot, Amazon Tap, Amazon Fire TV devices, and devices like Triby that use the Alexa Voice Service. One year ago, Amazon opened up Alexa to developers, enabling you to build Alexa skills with the Alexa Skills Kit and integrate Alexa into your own products with the Alexa Voice Service.
http://www.allthingsdistributed.com/2016/06/new-ways-to-discover-and-use-alexa-skills.html

Happy Learning!

Data Infrastructure, Data Pipeline and Analytics – Reading List – Sep 20, 2016

Hadoop architectural overview
An excellent series of posts talking about Hadoop and related components, and the key metrics to monitor in production.
https://www.datadoghq.com/blog/hadoop-architecture-overview/

Surviving and Thriving in a Hybrid Data Management World
The vast majority of our customers who are moving to cloud applications also have a significant current investment in on premise operational applications and on premise capabilities around data warehousing, business intelligence and analytics. That means that most of them will be working with a hybrid cloud/on premise data management environment for the foreseeable future.
http://blogs.informatica.com/2016/08/19/surviving-thriving-hybrid-data-management-world/#fbid=dlbfZB7A1Sd

Data Compression in Hadoop
File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.
http://comphadoop.weebly.com/

What is “Just-Enough” Governance for the Data Lake?
Just-enough governance is similar to the Lean Startup methodology concept of building of a Minimum Viable Product (MVP). From an enterprise perspective, just-enough governance means building only the process and control necessary to solve a particular business problem.
https://infocus.emc.com/rachel_haines/just-enough-governance-data-lake/

Mind map on SAP HANA
https://www.mindmeister.com/353051849/sap-hana-platform

Should I use SQL or NoSQL?
Every application needs persistent storage — data that persists across program restarts. This includes usernames, passwords, account balances, and high scores. Deciding how to store your application’s important data is one of the first and most important architectural decisions to be made.
https://www.databaselabs.io/blog/Should-I-use-SQL-or-NoSQL
Happy Learning!

Data Infrastructure, Data Pipeline and Analytics – Reading List – Sep 12, 2016

Three incremental, manageable steps to building a “data first” data lake
Applications have always dictated the data. That has made sense historically, and to some extent, continues to be the case. But an “applications first” approach creates data silos that are causing operational problems and preventing organizations from getting the full value from their business intelligence initiatives.
http://www.networkworld.com/article/3098937/analytics/three-incremental-manageable-steps-to-building-a-data-first-data-lake.html

Azure SQL Data Warehouse: Introduction
Azure SQL Data Warehouse is a fully-managed and scalable cloud service.
https://www.simple-talk.com/cloud/azure-sql-data-warehouse/

The Informed Data Lake: Beyond Metadata
Historically, the volume and extent of data that an enterprise could store, assemble, analyze and act upon exceeded the capacity of their computing resources and was too expensive. The solution was to model some extract of a portion of the available data into a data model or schema, presupposing what was “important,” and then fit the incoming data into that structure.
https://hiredbrains.wordpress.com/2016/05/13/the-informed-data-lake-beyond-metadata/

Real Time Streaming with Spring XD, Apache Geode (GemFire), and Greenplum
Spring XD is a unified, distributed, and extensible service for data ingestion, real-time analytics, batch processing, and data export.
http://zdatainc.com/2016/01/real-time-streaming-with-spring-xd-apache-geode-gemfire-and-greenplum-earthquake-data-demo/

Data Orchestration using Hortonworks DataFlow (HDF)
Hortonworks DataFlow (HDF), powered by Apache NiFi, is the first integrated platform that solves the real time complexity and challenges of collecting and transporting data from a multitude of sources, be they big or small, fast or slow, always connected or intermittently available.
http://zdatainc.com/2016/02/hello-nifi-data-orchestration-using-hortonworks-dataflow-hdf/

Happy Learning!