Developing a Robust Data Platform: Key Considerations


Developing a robust data platform requires much more than HDFS, Hive, Sqoop, and Pig. Today there is a real need to bring data and compute as close together as possible, and more and more requirements force us to deal with high-throughput/low-latency scenarios. Thanks to in-memory solutions, this definitely seems possible now.

One of the lessons I have learnt in the last few years is that it is hard to resist building your own technology infrastructure while developing a platform. It is always important to remind ourselves that we are here to build solutions, not technology infrastructure.

Some of the key questions that need to be considered while embarking on such a journey are:

  1. How do we handle the ever growing volume of data (Data Repository)?
  2. How do we deal with the growing variety of data (Polyglot Persistence)?
  3. How do we ingest large volumes of data as we start growing (Ingestion Pipelines/Write Efficient)?
  4. How do we scale in terms of faster data retrieval so that the analytics engine can provide something meaningful at a decent pace?
  5. How do we deal with the need for Interactive Analytics with a large dataset?
  6. How do we keep our cost per terabyte low while taking care of our platform growth?
  7. How do we move data securely between on-premise and cloud infrastructure?
  8. How do we handle data governance, data lineage, data quality?
  9. What kind of monitoring infrastructure would be required to support distributed processing?
  10. How do we model metadata so that we can address domain specific problems?
  11. How do we test this infrastructure? What kind of automation is required?
  12. How do we create a service delivery platform for build and deployment?

One of the challenges I am seeing right now is the urge to use multiple technologies to solve similar problems. Though this gives developers the freedom to do things differently or more efficiently, from a platform perspective it increases the total cost of operations.

  1. How do we support our customers in production?
  2. How can we make the lives of our operations teams better?
  3. How do we take care of the reliability, durability, scalability, extensibility, and maintainability of this platform?

I will talk about the data repository and possible choices in the next post.

Happy Learning!

“Data is long-term, Applications are temporary.”

Think data first. Data is long-term; applications are temporary. I recently read this in a blog post, and I couldn’t agree more. Data remains one of the most strategic projects for most companies.

Every fifth person you talk to, every other start-up you come across, and every job posting has something or other to say about data and analytics. But when I speak to the people I come across in my ecosystem, a lot of them think it is only about doing cool stuff in R.

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.

If someone has been an application developer for the last 10 years, can he/she suddenly become an expert in statistics and algorithms? Suddenly you start calling yourself a Data Scientist? Maybe… nothing is impossible. But if that were your passion, you wouldn’t have been an application developer for the last 10 years. Right?

Is there anything else one can learn and contribute in the data world? I thought of sharing a couple of valuable links that give a very good idea of the various aspects and where one can fit in.

#1 Will Balkanization of Data Science lead to one Empire or many Republics? via http://www.kdnuggets.com/2015/11/balkanization-data-science.html
#2 Becoming a Data Scientist via http://nirvacana.com/thoughts/becoming-a-data-scientist/
#3 Difference between Data Engineering and Data Science via http://www.galvanize.com/blog/difference-between-data-engineering-and-data-science/
#4 The world of data science: Who does what in the data world? via http://cloudtweaks.com/2015/11/booming-world-data-science/

Data is one of the hottest stacks right now and it is growing at a crazy speed. It would be extremely difficult for any individual to cope with this change unless one’s basics are right.

Once you have the basics right, it is about Meta learning and evolving from there.

Having worked on various large-scale data projects for the last 15 months, the following is my high-level list of things one needs to know to have a reasonable understanding of data (big or small). This list is in no specific order.

General
  • A basic overview of Descriptive, Diagnostic, Predictive, Prescriptive, and Cognitive Analytics: an understanding of the concepts and the differences between them.
Data Warehouses
  • OLAP vs OLTP
  • Dimensional Modelling (Star Schemas, Snowflake Schemas)
  • Difference between multidimensional, relational, and hybrid OLAP
  • In-Memory OLAP
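
To make dimensional modelling concrete, here is a minimal star-schema sketch in plain Scala (the sales fact and dimension tables are hypothetical, made up for illustration): the fact table holds surrogate keys plus additive measures, and the dimensions hold the descriptive attributes you slice by.

```scala
// A minimal, hypothetical star schema: one fact table referencing
// three dimension tables by surrogate keys.
case class DateDim(dateKey: Int, date: String, month: Int, year: Int)
case class ProductDim(productKey: Int, name: String, category: String)
case class StoreDim(storeKey: Int, city: String, region: String)

// The fact row holds only keys and additive measures (quantity, amount),
// which is what makes star schemas easy to aggregate ("slice and dice").
case class SalesFact(dateKey: Int, productKey: Int, storeKey: Int,
                     quantity: Int, amount: Double)

object StarSchemaDemo extends App {
  val facts = Seq(
    SalesFact(20150101, 1, 10, 2, 199.0),
    SalesFact(20150101, 2, 10, 1, 499.0),
    SalesFact(20150102, 1, 11, 3, 298.5)
  )
  // A typical OLAP-style query: total amount grouped by a dimension key.
  val amountByProduct = facts.groupBy(_.productKey)
                             .map { case (k, rows) => k -> rows.map(_.amount).sum }
  println(amountByProduct)   // e.g. Map(1 -> 497.5, 2 -> 499.0)
}
```
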
No SQL Databases
  • CAP Theorem
  • If you are from application development, this is where the most important change will be. So far, you would have dealt primarily with key-value stores and document stores. For analytics purposes (write-efficient), it is important to start understanding column-oriented databases (e.g., Cassandra) and graph databases (e.g., Neo4j). This is again a big shift from what you would have done as an application developer. Spend some time on it.
  • In-Memory databases in general.
  • Apart from Cassandra and Neo4j, get an understanding of what MemSQL offers. Yes, it is MemSQL and not MySQL 🙂 It seems very impressive.
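
To give a feel for why column-family stores are write-efficient, here is a rough, driver-free sketch in plain Scala of the Cassandra-style partition key + clustering key layout (the sensor-reading example and names are hypothetical):

```scala
import scala.collection.immutable.TreeMap

// Rough illustration of Cassandra-style modelling (no driver, just plain Scala):
// a partition key groups all rows for one entity, and a clustering key keeps the
// rows inside that partition sorted, which makes appends cheap and range scans fast.
object WideRowSketch extends App {
  type PartitionKey  = String   // e.g. sensor id
  type ClusteringKey = Long     // e.g. epoch millis
  type Partition     = TreeMap[ClusteringKey, Double]

  // Write path: append a reading into the right partition (write-efficient).
  def write(table: Map[PartitionKey, Partition], sensor: PartitionKey,
            ts: ClusteringKey, value: Double): Map[PartitionKey, Partition] = {
    val partition = table.getOrElse(sensor, TreeMap.empty[ClusteringKey, Double])
    table.updated(sensor, partition + (ts -> value))
  }

  // Read path: a range scan inside one partition, analogous to
  // "WHERE sensor = ? AND ts >= ? AND ts < ?" in CQL.
  def readRange(table: Map[PartitionKey, Partition], sensor: PartitionKey,
                from: ClusteringKey, until: ClusteringKey): Partition =
    table.getOrElse(sensor, TreeMap.empty[ClusteringKey, Double]).range(from, until)

  val t0 = Map.empty[PartitionKey, Partition]
  val t1 = write(write(write(t0, "s1", 1000L, 21.5), "s1", 2000L, 21.7), "s2", 1500L, 30.0)
  println(readRange(t1, "s1", 0L, 1500L))   // TreeMap(1000 -> 21.5)
}
```
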
Outside EDWs
  • MPPs/PDWs – Difference between traditional EDWs and MPPs?
  • DWH on the cloud: AWS Redshift, Azure SQL Data Warehouse
Data Mining
  • What does it mean?
  • Data Mining Algorithms
Hadoop
  • Hadoop and Various Hadoop Components
  • When to use Hadoop?
  • Parallelization and Map Reduce Fundamentals
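
If MapReduce is new to you, this tiny framework-free word count (plain Scala, not Hadoop itself) shows the map → shuffle → reduce phases that the framework distributes for you:

```scala
// Plain-Scala word count, structured to mirror the three MapReduce phases.
object MapReduceSketch extends App {
  val docs = Seq("the quick brown fox", "the lazy dog", "the quick dog")

  // Map phase: each input record is turned into (key, value) pairs.
  val mapped: Seq[(String, Int)] =
    docs.flatMap(_.split("\\s+")).map(word => (word, 1))

  // Shuffle phase: pairs are grouped by key (this is what the framework
  // does for you across the network in real Hadoop).
  val shuffled: Map[String, Seq[Int]] =
    mapped.groupBy(_._1).map { case (w, pairs) => w -> pairs.map(_._2) }

  // Reduce phase: values for each key are combined.
  val reduced: Map[String, Int] = shuffled.map { case (w, ones) => w -> ones.sum }

  println(reduced)   // e.g. Map(the -> 3, quick -> 2, dog -> 2, ...)
}
```
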
Outside Hadoop
  • Difference between Hadoop, Spark and Storm (I personally prefer Spark. RDDs give me the same comfort I had with ADO.NET)
  • When to use Hadoop/Spark/Storm over MPP?
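
Here is roughly the same word count as a Spark RDD pipeline, assuming a local SparkContext and the spark-core dependency on the classpath; it is only a sketch, but it shows why RDDs feel like working with an ordinary collection:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal RDD word count; assumes spark-core is available.
object SparkWordCount extends App {
  val conf = new SparkConf().setAppName("word-count").setMaster("local[*]")
  val sc   = new SparkContext(conf)

  val counts = sc.parallelize(Seq("the quick brown fox", "the lazy dog"))
    .flatMap(_.split("\\s+"))        // map side: line -> words
    .map(word => (word, 1))          // emit (key, value) pairs
    .reduceByKey(_ + _)              // shuffle + reduce, distributed by Spark

  counts.collect().foreach(println)  // e.g. (the,2), (quick,1), ...
  sc.stop()
}
```
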
ETL
  • Data Munging/Wrangling
  • Scrubbing
  • Transforming
  • Reading and Loading Data
  • Exception Handling
  • Jobs/Tasks
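
A toy extract → scrub → transform → load pipeline in plain Scala (hypothetical CSV-like order records), with bad rows routed to an error bucket instead of failing the whole job:

```scala
import scala.util.{Try, Success, Failure}

// A toy extract -> scrub -> transform -> load pipeline over hypothetical CSV lines.
object EtlSketch extends App {
  case class Order(id: Int, customer: String, amount: Double)

  val rawLines = Seq(
    "1, alice ,120.50",
    "2,bob,not-a-number",     // bad record: should go to an error bucket
    "3, carol ,80.00"
  )

  // Scrub + transform one line; Try captures parse failures instead of crashing the job.
  def parse(line: String): Try[Order] = Try {
    val Array(id, customer, amount) = line.split(",").map(_.trim)
    Order(id.toInt, customer.toLowerCase.capitalize, amount.toDouble)
  }

  val (good, bad) = rawLines.map(l => l -> parse(l)).partition(_._2.isSuccess)

  // "Load": here we just print; in a real pipeline this would write to a sink.
  good.collect { case (_, Success(o)) => o }.foreach(println)
  bad.foreach { case (line, Failure(e)) => println(s"rejected [$line]: ${e.getMessage}") }
}
```
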
Real-time Analytics / Working with Streams
Real-time analytics is something everyone talks about, but without understanding what stream processing means, you will never be able to figure it out.
Coming from an application background:

  • Reactive Architecture (Responsive, Resilient, Elastic and Message driven)
  • Understand the difference between an Event and a Transaction.
  • Event Processing (CQRS, Actor Model [Akka], Complex Event Processing)

If you don’t understand the above, it will be difficult to move forward. Spend time on these before moving on to the other items.
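
To make the event vs transaction distinction concrete, here is a small plain-Scala sketch (no Akka, hypothetical account domain): events are immutable facts, and the current state is just a fold over the event log, which is the intuition behind CQRS/event sourcing.

```scala
// Events are immutable facts about the past; state is derived by folding over them.
object EventSketch extends App {
  sealed trait AccountEvent
  case class Deposited(amount: BigDecimal) extends AccountEvent
  case class Withdrawn(amount: BigDecimal) extends AccountEvent

  case class Account(balance: BigDecimal)

  // Applying one event to the current state: this is the "write model" logic.
  def applyEvent(state: Account, event: AccountEvent): Account = event match {
    case Deposited(a) => state.copy(balance = state.balance + a)
    case Withdrawn(a) => state.copy(balance = state.balance - a)
  }

  // Replaying the event log rebuilds the current state (the read side can build
  // completely different views from the same log - that is the CQRS idea).
  val log = Seq(Deposited(100), Withdrawn(30), Deposited(5))
  val current = log.foldLeft(Account(0))(applyEvent)
  println(current)   // Account(75)
}
```
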
Messaging/Data bus

  • Kafka
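
For a feel of the producer side, here is a minimal sketch using the standard Kafka Java client from Scala; the broker address and topic name are placeholders for whatever your environment uses:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Minimal Kafka producer (Java client used from Scala); broker address and
// topic name below are hypothetical placeholders.
object KafkaSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // Each record goes to a partition of the topic based on its key.
  producer.send(new ProducerRecord[String, String]("page-views", "user-42", """{"page":"/home"}"""))
  producer.flush()
  producer.close()
}
```
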

Processing Streams

  • Spark/Storm

Lambda Architecture
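
A toy illustration of the lambda idea in plain Scala (hypothetical page-view counters, no real frameworks): a batch view that is complete but stale, a speed view that is fresh but partial, merged at query time.

```scala
// Lambda architecture in miniature: batch layer (complete but stale) +
// speed layer (incomplete but fresh), merged when serving a query.
object LambdaSketch extends App {
  // Batch view: precomputed from the full, immutable master dataset.
  val batchView: Map[String, Long] = Map("page/home" -> 1000L, "page/cart" -> 250L)

  // Speed view: incremental counts from events that arrived after the last batch run.
  val speedView: Map[String, Long] = Map("page/home" -> 12L, "page/checkout" -> 3L)

  // Serving layer: merge both views at query time.
  def query(key: String): Long =
    batchView.getOrElse(key, 0L) + speedView.getOrElse(key, 0L)

  println(query("page/home"))      // 1012
  println(query("page/checkout"))  // 3
}
```
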

Machine Learning

  • Difference between Data Mining and Machine Learning
  • ML Algorithms

A couple of very good posts to read on this:
Machine Learning for Programmers: Leap from developer to machine learning practitioner via http://machinelearningmastery.com/machine-learning-for-programmers/
What Every Manager Should Know About Machine Learning via https://hbr.org/2015/07/what-every-manager-should-know-about-machine-learning
Most of what we are doing can be achieved at some level using Excel Analytics Data Pack. In fact, I would say Excel is the most powerful tool out there.

Recommendation Engines
  • Collaborative Filtering
  • Content-based Filtering
  • Hybrid

Once you are clear on the concepts, start implementing them using Apache Mahout.
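
Before jumping into Mahout, here is a tiny plain-Scala sketch of user-based collaborative filtering with cosine similarity on a made-up ratings matrix, just to make the idea concrete:

```scala
// User-based collaborative filtering in miniature: find the user most similar
// to the target (cosine similarity over ratings) and recommend items the
// target has not rated yet.
object CollabFilterSketch extends App {
  val ratings: Map[String, Map[String, Double]] = Map(
    "alice" -> Map("sofa" -> 5.0, "table" -> 3.0, "lamp" -> 4.0),
    "bob"   -> Map("sofa" -> 4.0, "table" -> 3.0, "chair" -> 5.0),
    "carol" -> Map("lamp" -> 2.0, "chair" -> 4.0)
  )

  def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
    val dot   = a.keySet.intersect(b.keySet).toSeq.map(k => a(k) * b(k)).sum
    val normA = math.sqrt(a.values.map(v => v * v).sum)
    val normB = math.sqrt(b.values.map(v => v * v).sum)
    if (normA == 0 || normB == 0) 0.0 else dot / (normA * normB)
  }

  def recommend(user: String): Seq[String] = {
    val mine = ratings(user)
    val (neighbour, _) = (ratings - user)
      .map { case (u, rs) => u -> cosine(mine, rs) }
      .maxBy(_._2)
    // Recommend the neighbour's items the user has not rated, best-rated first.
    (ratings(neighbour) -- mine.keySet).toSeq.sortBy(-_._2).map(_._1)
  }

  println(recommend("alice"))   // List(chair) - from the most similar user (bob)
}
```
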

Communication Protocols
  • JSON, Avro, Protocol Buffers, and Thrift: if you are from application development, you would have used JSON extensively. It is time to understand the others as well. I keep arguing this with my friend Sendhil (IMO, Avro seems to be the way to go where things are evolving and there is a need for self-documentation – cowboy friendly).
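
For context, here is a minimal Avro schema for a hypothetical User record, parsed with the standard Avro library (the avro dependency is assumed); the schema travels with the data, and fields with defaults are what make schema evolution painless:

```scala
import org.apache.avro.Schema

// A minimal Avro schema for a hypothetical "User" record, parsed with the
// standard Avro library. The "age" field has a default, so older readers and
// newer writers can still understand each other (schema evolution).
object AvroSchemaSketch extends App {
  val schemaJson =
    """
      |{
      |  "type": "record",
      |  "name": "User",
      |  "namespace": "com.example",
      |  "fields": [
      |    {"name": "id",   "type": "long"},
      |    {"name": "name", "type": "string"},
      |    {"name": "age",  "type": ["null", "int"], "default": null}
      |  ]
      |}
      |""".stripMargin

  val schema: Schema = new Schema.Parser().parse(schemaJson)
  println(schema.getFields)   // self-describing: field names and types live with the data
}
```
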
Time Series
  • Modelling
  • Databases (OpenTSDB)
  • Forecasting
  • Trend Analysis
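
A quick plain-Scala sketch of a simple moving average over a made-up daily series; it is about the simplest possible smoothing/trend-analysis step before going near OpenTSDB or a forecasting library.

```scala
// Simple moving average: the most basic smoothing step in trend analysis.
object MovingAverageSketch extends App {
  val dailyValues = Seq(10.0, 12.0, 11.0, 15.0, 18.0, 17.0, 21.0)

  def movingAverage(xs: Seq[Double], window: Int): Seq[Double] =
    xs.sliding(window).map(w => w.sum / w.size).toSeq

  // A 3-day window smooths out day-to-day noise and makes the upward trend visible.
  println(movingAverage(dailyValues, 3))
  // List(11.0, 12.666..., 14.666..., 16.666..., 18.666...)
}
```
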
Modern day HOLAP Engines
  • Apache Kylin (My favourite at this point)
Data Visualization
Self-service is the mantra here. Read this article: Data Scientists Should be Good Storytellers

“Most of the people in an organization cannot understand the outcome of analytics, however they do need the proof of analysis and data. Data storytellers incorporate data and analytics in a compelling way as their stories involve real people and organizations” via https://dzone.com/articles/data-scientists-should-be-good-storytellers

  • How to represent data (Graphs/Charts)?
  • Excel Power Pivot/ Power BI (Polybase)
  • Lumira
  • D3.js
Deep Learning
Though it may or may not be important at this point, try to understand what deep learning is. Read this: Deep Learning in a Nutshell: Core Concepts via http://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/
Data Lake
One of my favorite topics, and something I learnt only after burning my hands, is the data lake.

  • Understand what a data lake means, why you need one, and how to build one on your own.
  • Extract Load and Transform (ELT)
  • ELT vs ETL

Read this: https://azure.microsoft.com/en-in/solutions/data-lake/

Language
Though there are a bunch of things you can do with Python, R, Java, etc., my choice is Scala (I love the way the language allows you to express yourself. Wish someone could afford me as a developer again 🙂).
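
As a small illustration of the expressiveness point, here is some hypothetical order data summarised in a few lines of plain Scala:

```scala
// The kind of terse, readable data manipulation that makes Scala pleasant for data work.
object ExpressiveScala extends App {
  case class Order(customer: String, product: String, amount: Double)

  val orders = Seq(
    Order("alice", "sofa", 499.0),
    Order("bob",   "lamp",  35.0),
    Order("alice", "table", 120.0)
  )

  // Revenue per customer, highest first - one pipeline, no loops or mutation.
  val revenueByCustomer = orders
    .groupBy(_.customer)
    .view.mapValues(_.map(_.amount).sum)
    .toSeq
    .sortBy(-_._2)

  revenueByCustomer.foreach { case (c, total) => println(f"$c%-6s $total%8.2f") }
}
```
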

If you have a good grasp of the above, then it is time for you to figure out when to use what (creating solutions).

 “If all you have is a hammer, everything looks like a nail”

Read this:  The Ethics of Wielding an Analytical Hammer via http://sloanreview.mit.edu/article/the-ethics-of-wielding-an-analytical-hammer/

“Data is having an impact on business models and profitability. It’s hard to find a non-trivial application that doesn’t use data in a significant manner.” ~ Ben Lorica, O’Reilly Media

Ok, this looks like a large list. Where do I start?

  1. Focus on the basics. Get a good overview of the ecosystem
  2. Decide your area of specialization.
  3. Focus on your specialization and build skills.
  4. Iterate and change course as required.
  • If you have more than 10 years of experience, understand the business situation and figure out when to use what. Maybe pick 1 or 2 items and start implementing them in your environment.
  • If you have less than 10 years of experience, pick a scenario, try to implement it, and see if it makes any business sense.

What have I not covered in this list? I haven’t gone into the details of:

  1. Hadoop Ecosystem and components (Pig/Hive etc.)
  2. Algorithms
    1. Nearest Neighbour
    2. K-Means Clustering
    3. Linear Regression
    4. Decision Trees etc.
  3. R in detail
  4. Infrastructure
    1. Env Setup
    2. Zookeeper, Yarn, Mesos
    3. Replication
  5. Vertical Industry Solutions
  6. Operational Systems (like Splunk)
  7. Data Governance

I keep hearing/seeing people who have never seen more than 1 GB of data saying that they do Big Data Analytics. Don’t learn or do something for the sake of doing it.

There is no shortcut to a place worth going.

My favorite books on this topic.

If you want to know more about what I am learning, you can follow me on Twitter.

Happy Learning!

I am developing a new product. What should I develop first?

We know what we want to build, but we are not sure what to start with.

This is one of the questions I end up discussing very frequently during my consulting engagements.

I am sure we all agree that there is no straightforward answer to this.

Let us take my favorite example – an eCommerce store. Assume that you are developing an eCommerce application that is going to sell furniture.

What are the features one would require?

To sell the furniture, I need a way to
– manage the categories of furniture I am going to sell
– manage the furniture items in each category
– manage the prices for the furniture items
– manage the orders and fulfil them
– have a website where users can browse the furniture, search for an item, view its details, enter a quantity, and place an order.
– ….
– ….

Get Started

When I was a developer, if you had asked me what to start with, I would have said you need a good database design and screens to manage the master data before you can build the eCommerce application.

After burning my hands multiple times with new product development, my current answer would be to build the most important feature in this list, the one without which the eCommerce store is useless.

Take the Groupon example that Eric Ries talks about in his book The Lean Startup. Groupon first built their site as a WordPress blog, maintained those pages on a daily basis, went to market, and took feedback before building the actual website.

The key here is to look at the pieces of the equation that cannot be removed.

I can maintain my master data using scripts if I have to. But without the eCommerce website (it could even be a blog), you will not be able to sell anything. Start building the things that are the core of your system.

Do not waste your time building things that are not highly important. All those subsidiary features can be built later.

IMHO, the most important role in product development is the Product Owner. If you have one who can do the right prioritization, I am sure you will have a successful product.

A Related example:

Start at the epicenter

Happy Learning!!!!


Lean Startup: Useful Pointers to Get Started

I started reading the book “The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses” about a year back. For sure, it is an impressive thought process. From that time onwards, I have been reading, discussing, practicing (in some form), and selling Lean Startup ideas 🙂

The Lean Startup is a business approach coined by Eric Ries that aims to change the way companies are built and new products are launched. The Lean Startup relies on validated learning, scientific experimentation, and iterative product releases to shorten product development cycles, measure progress, and gain valuable customer feedback. In this way, companies, especially startups, can design their products or services to meet the demands of their customer base without requiring large amounts of initial funding or expensive product launches.

Via : http://en.wikipedia.org/wiki/Lean_Startup

I found an interesting place where you can test your knowledge of lean startup:
http://www.veri.com/t/lean-startup/18
I scored 195 points 🙂

If you are new to the concept of lean startup, here is a list of useful pointers that can help you get started:

Lean Startup – Book Summary

Principles of Lean Startups

Lean Startup: How to Learn fast about Customers, their Problems and Solutions

How to Lean Startup? A Flowchart

Using Lean Startup Principles

Lean Startup Cycle

Customer Development Engineering

Combining agile development with customer development

How Development looks different in a lean startup

Lean Startup (PHP World)

Implementing Minimum Viable Changes as Part of a Lean Startup for change approach

Iterative funding of start ups- an entrepreneur’s perspective (Check the Image)

Contrasts between Agile and Lean Startup

Continuous Value Delivery

Happy Learning!!!

Considerations on re-engineering legacy applications to newer technology platforms

Software applications have a lifetime. Every software product passes through a lifecycle, and at the end of the cycle there is a need for a rewrite/migration. Typically this is done to:
1. Reduce Maintenance Cost (Technology keeps changing every year and it’s very hard to find resources who are working on a very old technology. Even if you find one, the cost might be very high).
a. Support for the existing technology is coming to an end
b. No developers available in the existing technology
c. No one understands the existing code, and changes to the existing code base are becoming a real problem.
2. Target a larger customer base, new geography with additional functionalities.
3. Flexible Product Offerings (Support Multiple Verticals, Customizations at Vertical/Customer level).
4. Bring freshness to the product.
5. Address the pain points of the existing application.
6. Introduce Integration capabilities (to and from).

Re-engineering

If your existing product is a desktop product, moving to the web is the last option one should consider. Yeah… you can say that maintaining multiple versions is always a problem in the desktop world, but IMHO a desktop user will never accept a browser-based solution (irrespective of the number of features, flexibility, blah, blah, blah…). It’s a strict NO.

How do we start?
Step 0: Define your goals for the newer technology migration
Step 0: Decide your budget
Step 0: Do your math (ROI, IRR and NPV calculations)
You should do the math on when you will get your returns and what your returns will be after the migration. In my opinion, most companies fail to do this math at the initial stages, and the realization happens only after making quite a bit of investment (though people pretend that this math has been done). This math has to be realistic and conservative.
Step 0: Decide the plans to migrate your existing customers and bring in new customers
Step 1: Choose a Technology Platform for the Migration/Reengineering effort.
Step 2: Decide the architecture (whether to move to a newer architecture or live with the existing one)
Step 3: Go ahead with the Migration
1. Rewrite to a newer technology platform
a. Migrate using a Tool
b. Hand code
c. Use MDD to generate code instead of hand coding
d. Develop a framework and build application on top
2. Partial Rewrite (and eventually rewrite the whole application)
3. Live with your legacy application

Typical challenges involved in any platform reengineering efforts:
1. If your existing clients are used to doing their work a certain way, it’s next to impossible to change them. When designing the new application, it’s very important to keep the positions of the old UI elements and shortcuts intact.
2. Assume that the reengineering effort is scheduled for the next 1 year. In the meantime, your existing code base will go through changes (releases, patches), as no one will be in a position to stop the business for the new product. By the time the new product (technology) is ready, there is still more effort pending, because new releases have gone out since you started.
It’s very easy to say we will communicate/coordinate this between the teams in such a way that the changes are always communicated and taken care of (blah, blah, blah), but in reality it will never happen. Either the newer additions will never be added, or you will never complete the newer version (think about this in the case of offshoring).
If you do not have the updated features (the snapshot of the last year or so), you will release a product with fewer features than your current product, and it will be considered a subset. It is very difficult to sell in that case.
3. Performance of the newer version should at least match your existing product’s performance. Working with smaller customers (using ISAM databases) will present real challenges. Performance of a .NET application versus a native Win32 application will also be a challenge.
4. People tend to become more aggressive with the newer product’s features (in terms of flexibility, re-usability, etc.). Performance cannot be compromised for flexibility and extensibility, as your users are never bothered about what goes on behind the scenes.
5. It’s not just performance but also other aspects of the product (features, usability, etc.); if the product is not on par, people are not going to buy it for sure.
6. Technology changes very fast. If you choose something in the alpha or beta stage for your product development (bleeding-edge technologies), obviously you are going to bleed. If you choose a technology that is already in the market, a newer version will be out by the time you release your product. It’s a very tricky situation. Read this interesting post from Sendhil.

Conclusion
In my experience, it’s a very tricky situation, and unless it’s planned carefully you will have every reason to fail. I personally feel that going the route of adding features inside the existing product is a better option than an upfront rewrite (as you will have better control); of course, this depends on your existing technology, which has to support this model. Products evolve over a period of time. Rewriting one in a shorter period may not work, as it’s very difficult to track all the features, bugs, etc. of your existing application.

Happy Migrations!!!