Data Platform : Exploratory Data Analysis

Today, everyone talks about storing data in its raw format so that it can be analyzed to generate insights at a later point in time. Fantastic idea. Data lakes deliver exactly that promise. However, the complexity of data is increasing day by day, and new data sources are being added on a regular basis.

You do not always get to work with data sets you are familiar with. Given the new types of data that keep getting added, you will most likely end up dealing with data sets outside your comfort zone.
Data science teams spend most of their time exploring and understanding data.

  • If you must deliver quick insights on a data set, will you go through it manually, or is there something within the data lake that you can use?
  • What would be an easy way for the data science team to understand the data set quickly and spot patterns and relationships, so that they can generate hypotheses?

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to

  1. Maximize insight into a data set;
  2. Uncover underlying structure;
  3. Extract important variables;
  4. Detect outliers and anomalies;
  5. Test underlying assumptions;
  6. Develop parsimonious models; and
  7. Determine optimal factor settings.

Via : http://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
Exploratory Data Analysis : http://www.statgraphics.com/exploratory-data-analysis
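
Two of the goals listed above, maximizing insight and detecting outliers, can be illustrated with a few lines of Python. This is only a minimal sketch using the standard library: the `orders` sample is made up, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": median,
        "q1": q1,
        "q3": q3,
        "min": min(values),
        "max": max(values),
    }

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical sample: daily order counts with one suspicious spike.
orders = [12, 15, 14, 13, 16, 15, 14, 120, 13, 15]
print(summarize(orders))
print(iqr_outliers(orders))
```

A summary like this is usually the first thing to look at before plotting anything; the graphical techniques then help explain why a value such as the spike above stands out.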

If you search for exploratory data analysis, you will find tons of material about doing EDA using R or Python.

Given shortened timelines, if you expect your data science teams to write code just to understand the data, you may not be able to deliver value at the speed your business expects.

There are a couple of tools that may help you understand your data faster. Search for data profiling and you will find plenty of results on this topic. My favorite tools in this area right now are

  1. Trifacta Wrangler
  2. Exploratory.io

Both are easy to use and have a simple user interface. You can use the free version to get started. If you have an automated data pipeline using Spark, you can also generate profile statistics about the incoming data and store them as part of your catalog.

I really like this presentation on the topic: Data Profiling and Pipeline Processing with Spark.

Once you do this with Spark, you may want to update the data profile information and store it as part of your catalog. If you index your catalog with Elasticsearch, you can provide an API for your data science teams to search for files with certain quality attributes.
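
To make the idea of profile statistics and a searchable catalog concrete, here is a small plain-Python sketch. It is not Spark or Elasticsearch code: the in-memory `catalog` list merely stands in for an indexed catalog, and the function names, data set names and the null-ratio quality attribute are all my own assumptions for illustration.

```python
def profile_column(values):
    """Compute simple profile statistics for one column of a data set."""
    total = len(values)
    present = [v for v in values if v not in (None, "")]
    return {
        "row_count": total,
        "null_ratio": (total - len(present)) / total if total else 0.0,
        "distinct_count": len(set(present)),
    }

def profile_dataset(name, rows):
    """Profile every column of a data set given as a list of dicts."""
    columns = sorted({key for row in rows for key in row})
    return {
        "dataset": name,
        "columns": {c: profile_column([row.get(c) for row in rows]) for c in columns},
    }

def search_catalog(catalog, max_null_ratio):
    """Return names of data sets whose columns all meet a null-ratio limit."""
    return [
        entry["dataset"]
        for entry in catalog
        if all(col["null_ratio"] <= max_null_ratio for col in entry["columns"].values())
    ]

# Hypothetical incoming files.
clean = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Chennai"}]
sparse = [{"id": 1, "city": None}, {"id": 2, "city": ""}, {"id": 3, "city": "Delhi"}]
catalog = [
    profile_dataset("customers_clean", clean),
    profile_dataset("customers_sparse", sparse),
]
print(search_catalog(catalog, max_null_ratio=0.1))
```

In a real pipeline the same statistics would be computed by the Spark job as data arrives, and the search function would become a query against the catalog index.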

The above tools will help you get a quick understanding of your data. But what if you want pointers on where to start your analysis? A profiler alone will not help in this case. You may want to explore this product from IBM (yes, you heard it right: it’s from IBM, and I use it daily). Check it out here: IBM Watson Analytics

Watson Analytics is a smart discovery service, and it is super smart. It is available for $80 per user per month. For the value you get out of it, $80 per month is really nothing.
You can use it for data exploration and predictive analytics, and it is effortless. A free one-month subscription is available for you to play with.

I have looked at various products and couldn’t find anything that comes close to what Watson offers. If I have to mention a drawback, it is that it doesn’t provide connectivity to S3. You may have to connect to PostgreSQL or Redshift to extract data.

If you can integrate it in your platform and use it effectively, you will be able to add value to your customers in literally no time.

Happy Learning!

Getting Started with Serverless Architecture


Serverless architecture is still relatively new. I’ve been exploring it of late for our new platform architecture. Though it is very interesting, there is obviously a reasonable learning curve, and I don’t see a lot of best practices out there yet.

The grass always looks greener on the other side. We will learn as we move forward.

Since we use AWS as our cloud provider, most of the examples you will see are related to AWS Lambda.

Specific Reasons for exploring Serverless Architecture 

  1. No operating systems to choose, secure, patch, or manage.
  2. No servers to right size, monitor, or scale out.
  3. No risk to your cost by over-provisioning.
  4. No risk to your performance by under-provisioning.

https://d0.awsstatic.com/whitepapers/AWS_Serverless_Multi-Tier_Architectures.pdf

One thing I have learnt in the last few years about developing distributed applications is that it is not about learning new things; it is always about unlearning what you have done in the past.

If you are particular about avoiding vendor lock-in, this may not be an option for you at all.

The following is my reading list on Serverless Architecture.

What is Serverless?
https://auth0.com/blog/what-is-serverless/

Serverless Architectures
http://martinfowler.com/articles/serverless.html

What is Serverless Computing and Why is it Important?
https://www.iron.io/what-is-serverless-computing/

Serverless Architecture in short
https://specify.io/concepts/serverless-architecture

Is “Serverless” architecture just a finely-grained rebranding of PaaS?
http://www.ben-morris.com/is-serverless-architecture-just-a-finely-grained-rebranding-of-paas/

IAAS, PAAS, Serverless.
https://read.acloud.guru/iaas-paas-serverless-the-next-big-deal-in-cloud-computing-34b8198c98a2#.m9us1c5fe

Serverless Delivery: Architecture
https://stelligent.com/2016/03/17/serverless-delivery-architecture-part-1/

Principles of Serverless Architectures
There are five principles of serverless architecture that describe how an ideal serverless system should be built. Use these principles to help guide your decisions when you create serverless architecture.
1. Use a compute service to execute code on demand (no servers)
2. Write single-purpose stateless functions
3. Design push-based, event-driven pipelines
4. Create thicker, more powerful front ends
5. Embrace third-party services
https://dzone.com/articles/serverless-architectures-on-aws
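
The second principle, single-purpose stateless functions, is easy to picture with a small example. The sketch below uses the `handler(event, context)` shape that AWS Lambda expects for Python functions; the temperature conversion and the `celsius` field in the event are purely hypothetical.

```python
import json

def handler(event, context):
    """Single-purpose, stateless function: convert one temperature reading.

    Everything it needs arrives in `event`; nothing is kept between calls.
    """
    celsius = float(event["celsius"])
    return {
        "statusCode": 200,
        "body": json.dumps({"fahrenheit": celsius * 9 / 5 + 32}),
    }

# Local invocation with a hypothetical event; on Lambda the platform calls handler().
response = handler({"celsius": 100}, None)
print(response["body"])
```

Because the function holds no state, the platform is free to run as many copies as the incoming events demand, which is what makes the first principle (no servers to manage) workable.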

Serverless Architectures – Building a Serverless system to solve a problem
https://serverless.zone/serverless-architectures-9e23af71097a#.j9z60nxw1

Serverless architecture: Driving toward autonomous operations
https://www.slalom.com/thinking/serverless-architecture

Serverless Developers
https://serverless-developers.com/

The essential guide to Serverless technologies and architectures
http://techbeacon.com/essential-guide-serverless-technologies-architectures

Using AWS Lambda and API Gateway to create a serverless schedule
https://www.import.io/post/using-amazon-lambda-and-api-gateway/

Five Reasons to Consider Amazon API Gateway for Your Next Microservices Project
http://thenewstack.io/five-reasons-to-consider-amazon-api-gateway-for-your-next-microservices-project/

AWS Lambda and the Evolution of the Cloud
https://blog.fugue.co/2016-01-31-aws-lambda-and-the-evolution-of-the-cloud.html

SquirrelBin: A Serverless Microservice Using AWS Lambda
https://aws.amazon.com/blogs/compute/the-squirrelbin-architecture-a-serverless-microservice-using-aws-lambda/

A Crash Course in Amazon Serverless Architecture
http://cloudacademy.com/blog/amazon-serverless-api-gateway-lambda-cloudfront-s3/

AWS Lambda and Endless Serverless Possibilities
https://abhishek-tiwari.com/post/aws-lambda-and-endless-serverless-possibilities

Awesome Serverless – A Curated List
https://github.com/JustServerless/awesome-serverless

Happy Learning!

Data Infrastructure, Data Pipeline and Analytics – Reading List – Sep 27, 2016

Splunk vs ELK: The Log Management Tools Decision Making Guide
Much like promises made by politicians during an election campaign, production environments produce massive files filled with endless lines of text in the form of log files. Unlike election periods, they’re doing it all year around, with multiple GBs of unstructured plain text data generated each day.
http://blog.takipi.com/splunk-vs-elk-the-log-management-tools-decision-making-guide/

Building a Modern Bank Backend
https://monzo.com/blog/2016/09/19/building-a-modern-bank-backend/

An awesome list of Micro Services Architecture related principles and technologies.
https://github.com/mfornos/awesome-microservices#api-gateways--edge-services

Stream-based Architecture
Part of the Stream Architecture Book. An excellent overview on the topic.
https://www.mapr.com/ebooks/streaming-architecture/chapter-02-stream-based-architecture.html

The Hardest Part About Micro services: Your Data
Of the reasons we attempt a micro services architecture, chief among them is allowing your teams to be able to work on different parts of the system at different speeds with minimal impact across teams. So we want teams to be autonomous, capable of making decisions about how to best implement and operate their services, and free to make changes as quickly as the business may desire. If we have our teams organized to do this, then the reflection in our systems architecture will begin to evolve into something that looks like micro services.
http://blog.christianposta.com/microservices/the-hardest-part-about-microservices-data/

New Ways to Discover and Use Alexa Skills
Alexa, Amazon’s cloud-based voice service, powers voice experiences on millions of devices, including Amazon Echo and Echo Dot, Amazon Tap, Amazon Fire TV devices, and devices like Triby that use the Alexa Voice Service. One year ago, Amazon opened up Alexa to developers, enabling you to build Alexa skills with the Alexa Skills Kit and integrate Alexa into your own products with the Alexa Voice Service.
http://www.allthingsdistributed.com/2016/06/new-ways-to-discover-and-use-alexa-skills.html

Happy Learning!

Developing a Robust Data Platform : Key Considerations


Developing a robust data platform definitely requires more than HDFS, Hive, Sqoop and Pig. Today there is a real need to bring data and compute as close together as possible. More and more requirements are forcing us to deal with high-throughput/low-latency scenarios. Thanks to in-memory solutions, this definitely seems possible now.

One of the lessons I have learnt in the last few years is that it is hard to resist developing your own technology infrastructure while developing a platform. It is always important to remind ourselves that we are here to build solutions and not technology infrastructure.

Some of the key questions to consider while embarking on such a journey are:

  1. How do we handle the ever growing volume of data (Data Repository)?
  2. How do we deal with the growing variety of data (Polyglot Persistence)?
  3. How do we ingest large volumes of data as we start growing (Ingestion Pipelines/Write Efficient)?
  4. How do we scale in terms of faster data retrieval so that the Analytics engine can provide something meaningful at a decent pace?
  5. How do we deal with the need for Interactive Analytics with a large dataset?
  6. How do we keep our cost per terabyte low while taking care of our platform growth?
  7. How do we move data securely between on-premise infrastructure and cloud infrastructure?
  8. How do we handle data governance, data lineage, data quality?
  9. What kind of monitoring infrastructure would be required to support distributed processing?
  10. How do we model metadata so that we can address domain specific problems?
  11. How do we test this infrastructure? What kind of automation is required?
  12. How do we create a service delivery platform for build and deployment?

One of the challenges I am seeing right now is the urge to use multiple technologies to solve similar problems. Though this gives my developers the edge to do things differently or more efficiently, from a platform perspective it increases the total cost of operations.

  1. How do we support our customers in production?
  2. How can we make life better for our operations teams?
  3. How do we take care of the reliability, durability, scalability, extensibility and maintainability of this platform?

I will talk about the data repository and possible choices in the next post.

Happy Learning!

Software Architecture, Customer Success

I happened to watch a couple of good videos last week on Software Architecture, Design and Customer Success.

How the World Wide Web just happened – Tim Berners-Lee
https://www.youtube.com/watch?v=yF5-6AcohQw
Great session. He talks about the importance of being in the right place at the right time.

Mary Poppendieck (Poppendieck.LLC) – The New New Software Development Game: Containers, Micro Services
http://m.ustream.tv/recorded/61477219?rmalang=de_DE
Complexity grows non-linearly with software size. Software size continues to grow, so software complexity will continue to grow even faster. She explains what we can do about the complexity.

A summary of this talk is available here
http://highscalability.com/blog/2015/4/27/how-can-we-build-better-complex-systems-containers-microserv.html

Zen and the art of Customer Relationships
https://www.youtube.com/watch?v=G_2UP4-J7Vc
I loved the Zen and the Art of Customer Relationships presentation from Zendesk. Awesome presentation!
Pointers for building long-lasting relationships:

  1. Don’t overestimate your importance in your customer’s life
  2. Consider the entire customer experience
  3. Recognize the right relationships and adapt
  4. Be something actual humans can relate to
  5. Be Transparent
  6. Empower your best people to do what’s best
  7. Put a face to your customers

Framework to Build a Killer Customer Success Scorecard
https://www.youtube.com/watch?v=lhx06h8RZ3Q
Another fantastic presentation from the trenches. A good overview of how to define customer success and which metrics to monitor (Customer, Financial, Practice and Inter-team).

Building the Customer Success Management Team
https://www.youtube.com/watch?v=XIx5HhfG56w

Happy Learning!

Microservices : Reading List

Modern-day businesses require agility to survive and to lead. Translated into a technology requirement, this means X deploys a day (time to market).

The big, bloated, complex applications that we have built over time do not allow us to meet this X deploys a day without compromising quality. If there is a way to decompose the big, bloated monolith into smaller chunks, it will help the business to extend, manage and deploy, and eventually X deploys a day could become a reality.

How do we get there? Is there a way to achieve this? Microservices (lots of small applications) are one way to get there.

Microservices means developing a single, small, meaningful functional feature as single service, each service has its own process and communicate with lightweight mechanism, deployed in single or multiple servers.
Source
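
To give a feel for what a single, small, meaningful service looks like, here is a minimal sketch of one written as a plain WSGI application with only the Python standard library. The price lookup feature, the `PRICES` data and the `/prices/<item-id>` route are hypothetical; a real service would own its data store and would normally be built with one of the frameworks listed further down.

```python
import json

# Hypothetical data owned by this one service.
PRICES = {"book-101": 12.5, "book-102": 8.0}

def price_service(environ, start_response):
    """WSGI application serving exactly one feature: GET /prices/<item-id>."""
    path = environ.get("PATH_INFO", "")
    item_id = path.rsplit("/", 1)[-1]
    if path.startswith("/prices/") and item_id in PRICES:
        status, payload = "200 OK", {"item": item_id, "price": PRICES[item_id]}
    else:
        status, payload = "404 Not Found", {"error": "not found"}
    body = json.dumps(payload).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]

# To run it as its own process, talking plain HTTP:
#   from wsgiref.simple_server import make_server
#   make_server("127.0.0.1", 8080, price_service).serve_forever()
```

The service runs in its own process and speaks a lightweight protocol (HTTP with JSON), so it can be deployed, scaled and replaced independently of the other services around it.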

Additional Reading List
The Twelve-Factor App
http://12factor.net/

Microservices Reading List
http://www.mattstine.com/microservices

Understanding Microservices
http://kpbird.com/2014/11/Monolithic-vs-MicroService-Architecture/
http://shakayumi.tumblr.com/post/95688359079/whats-the-big-idea-with-microservices
http://kpbird.com/2014/06/Microservice-Architecture-A-Quick-Guide/
http://www.infoq.com/articles/microservices-intro
http://www.slideshare.net/mstine/microservices-cf-summit
http://java.dzone.com/articles/microservice-architecture
http://tech.gilt.com/post/35711763311/how-gilt-com-give-came-to-be

Microservices Architecture and Scalability
http://www.pst.ifi.lmu.de/Lehre/wise-14-15/mse/microservice-architectures.pdf
http://technologyconversations.com/2015/01/26/microservices-development-with-scala-spray-mongodb-docker-and-ansible/

Microservices Patterns
http://blog.arungupta.me/microservice-design-patterns/
http://microservices.io/patterns/index.html

Simon Brown’s Video : Software Architecture & Balance with Agility
https://vimeo.com/user22258446/review/79382531/91467930a4

Books
Building Microservices
Software Architecture for Developers

Frameworks
http://gilliam.github.io/concepts.html
http://projects.spring.io/spring-boot/
http://fabric8.io/
http://azure.microsoft.com/en-us/campaigns/service-fabric/

Digital Businesses and APIs

I first heard the term API during 1996-97, when I was programming in VB (Win32 APIs), so this is certainly not a new term. But you hear it quite frequently these days. What is happening? What has changed? Let us take a couple of traditional businesses and see how they have operated.

Bank
  Before 1990: A bank used to transact only between certain business hours till some time back.
  During 1990-2010: Internet banking came into the picture. One needed a desktop to operate. Center and Web as channels.
  Now: Internet and mobile banking. All one needs is some form of mobile device to operate. Operations are 24/7 and the channels are multiple.

University
  Before 1990: The business model was for a university to run courses on-premise between certain business hours.
  During 1990-2010: Universities started providing online courses. Online and offline courses available.
  Now: Courses are provided via Web, mobile and tablet channels. Newer models like MOOCs are coming into the picture.

Book store
  Before 1990: A book store sold books in its stores or chain of stores between certain business hours.
  During 1990-2010: Internet shopping via browsers. Stores still exist.
  Now: Most of the shopping happens via tablets, mobile and Web. Price comparison sites/applications. Selling via blogs and other websites (widgets). Retargeting.

What is the underlying trend? The Internet explosion and the growth of smartphones and smart devices have forced companies to rethink the way they do business. The disruption has made companies rethink their business models. Newer digital business models are evolving, enabling companies to reach newer markets and global customers and to gain competitive advantage.

A few examples:

You can pay utility bills via the popular chat application you are using.
http://www.innovativechina.com/2013/07/china-merchants-bank-launches-its-own-wechat-bank/
http://www.opptrends.com/2014/04/after-alibaba-tencent-now-baidu-inc-bidu-comes-with-mobile-wallet/

Internet companies are providing interest on your deposits.
http://qz.com/160589/alibaba-yu-e-bao-money-market-account-serious-threat-to-chinese-banks/

It’s a great thing for consumers and definitely a great opportunity for companies. To address the increasing number of channels and to look at newer business opportunities and models, companies needed a way to expose and consume data. APIs have become the common way to expose, consume and communicate with the various channels and to fuel innovation.

An API — Application Programming Interface — at its most basic level, allows your product or service to talk to other products or services. Via What is an API?

To see an API in action, check this:
https://developer.pearson.com/apis/topten-travel-guides/#!/travel/listCategories_get_0
The portal shows all the APIs that are available. Good one. This is one of my favorite examples.

Is this happening in just one industry? No, the changes are across industries: Banking, Retail, Healthcare, Energy, Transportation and Automotive, to name a few. Check this link to see the fastest growing API categories:
http://www.slideshare.net/programmableweb/fastest-growing-web-api-categories-last-6-months/

Different types of APIs

  1. Open APIs (Public, Web APIs – Open to All)
  2. Partner APIs  (Protected, Open to Select Few)
  3. Enterprise APIs (Private, Your traditional SOA Based mostly, Open to Employees only)

In a nutshell, APIs are a must-have in any technology strategy today. APIs power the digital business and act as the glue in the SMAC stack.

References:
http://apievangelist.com/index.html
http://www.cutter.com/content-and-analysis/resource-centers/agile-project-management/sample-our-research/apmu1306.html

Happy Learning!!!