Data Infrastructure, Data Pipeline and Analytics – Reading List – Sep 27, 2016

Splunk vs ELK: The Log Management Tools Decision Making Guide
Much like promises made by politicians during an election campaign, production environments produce massive files filled with endless lines of text in the form of log files. Unlike election periods, they’re doing it all year around, with multiple GBs of unstructured plain text data generated each day.
http://blog.takipi.com/splunk-vs-elk-the-log-management-tools-decision-making-guide/

Building a Modern Bank Backend
https://monzo.com/blog/2016/09/19/building-a-modern-bank-backend/

An awesome list of Micro Services Architecture related principles and technologies.
https://github.com/mfornos/awesome-microservices#api-gateways--edge-services

Stream-based Architecture
Part of the Streaming Architecture book. An excellent overview of the topic.
https://www.mapr.com/ebooks/streaming-architecture/chapter-02-stream-based-architecture.html

The Hardest Part About Microservices: Your Data
Of the reasons we attempt a microservices architecture, chief among them is allowing your teams to be able to work on different parts of the system at different speeds with minimal impact across teams. So we want teams to be autonomous, capable of making decisions about how to best implement and operate their services, and free to make changes as quickly as the business may desire. If we have our teams organized to do this, then the reflection in our systems architecture will begin to evolve into something that looks like microservices.
http://blog.christianposta.com/microservices/the-hardest-part-about-microservices-data/

New Ways to Discover and Use Alexa Skills
Alexa, Amazon’s cloud-based voice service, powers voice experiences on millions of devices, including Amazon Echo and Echo Dot, Amazon Tap, Amazon Fire TV devices, and devices like Triby that use the Alexa Voice Service. One year ago, Amazon opened up Alexa to developers, enabling you to build Alexa skills with the Alexa Skills Kit and integrate Alexa into your own products with the Alexa Voice Service.
http://www.allthingsdistributed.com/2016/06/new-ways-to-discover-and-use-alexa-skills.html

Happy Learning!

Data Infrastructure, Data Pipeline and Analytics – Reading List – Sep 20, 2016

Hadoop architectural overview
An excellent series of posts covering Hadoop and related components, and the key metrics to monitor in production.
https://www.datadoghq.com/blog/hadoop-architecture-overview/
Surviving and Thriving in a Hybrid Data Management World
The vast majority of our customers who are moving to cloud applications also have a significant current investment in on premise operational applications and on premise capabilities around data warehousing, business intelligence and analytics. That means that most of them will be working with a hybrid cloud/on premise data management environment for the foreseeable future.
http://blogs.informatica.com/2016/08/19/surviving-thriving-hybrid-data-management-world/#fbid=dlbfZB7A1Sd
Data Compression in Hadoop
File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.
http://comphadoop.weebly.com/
What is “Just-Enough” Governance for the Data Lake?
Just-enough governance is similar to the Lean Startup methodology concept of building a Minimum Viable Product (MVP). From an enterprise perspective, just-enough governance means building only the process and control necessary to solve a particular business problem.
https://infocus.emc.com/rachel_haines/just-enough-governance-data-lake/
Mind map on SAP HANA
https://www.mindmeister.com/353051849/sap-hana-platform
Should I use SQL or NoSQL?
Every application needs persistent storage — data that persists across program restarts. This includes usernames, passwords, account balances, and high scores. Deciding how to store your application’s important data is one of the first and most important architectural decisions to be made.
https://www.databaselabs.io/blog/Should-I-use-SQL-or-NoSQL
Happy Learning!

Developing a Robust Data Platform : Key Considerations

Developing a robust data platform definitely requires more than HDFS, Hive, Sqoop and Pig. Today there is a real need to bring data and compute as close together as possible. More and more requirements are forcing us to deal with high-throughput/low-latency scenarios. Thanks to in-memory solutions, this definitely seems possible right now.

One of the lessons I have learnt in the last few years is that it is hard to resist developing your own technology infrastructure while developing a platform. It is always important to remind ourselves that we are here to build solutions and not technology infrastructure.

Some of the key questions that need to be considered while embarking on such a journey are:

  1. How do we handle the ever growing volume of data (Data Repository)?
  2. How do we deal with the growing variety of data (Polyglot Persistence)?
  3. How do we ingest large volumes of data as we start growing (Ingestion Pipelines/Write Efficient)?
  4. How do we scale in terms of faster data retrieval so that the Analytics engine can provide something meaningful at a decent pace?
  5. How do we deal with the need for Interactive Analytics with a large dataset?
  6. How do we keep our cost per terabyte low while taking care of our platform growth?
  7. How do we move data securely between on-premise infrastructure and cloud infrastructure?
  8. How do we handle data governance, data lineage, data quality?
  9. What kind of monitoring infrastructure would be required to support distributed processing?
  10. How do we model metadata so that we can address domain specific problems?
  11. How do we test this infrastructure? What kind of automation is required?
  12. How do we create a service delivery platform for build and deployment?

One of the challenges I am seeing right now is the urge to use multiple technologies to solve similar problems. Though this gives my developers the edge to do things differently and efficiently, from a platform perspective it increases the total cost of operations. That raises a few more questions:

  1. How do we support our customers in production?
  2. How can we make the life of our operations teams better?
  3. How do we take care of reliability, durability, scalability, extensibility and maintainability of this platform?

I will talk about the data repository and possible choices in the next post.

Happy Learning!

Data Infrastructure, Data Pipeline and Analytics – Reading List – Sep 12, 2016

Three incremental, manageable steps to building a “data first” data lake
Applications have always dictated the data. That has made sense historically, and to some extent, continues to be the case. But an “applications first” approach creates data silos that are causing operational problems and preventing organizations from getting the full value from their business intelligence initiatives.
http://www.networkworld.com/article/3098937/analytics/three-incremental-manageable-steps-to-building-a-data-first-data-lake.html

Azure SQL Data Warehouse: Introduction
Azure SQL Data Warehouse is a fully-managed and scalable cloud service.
https://www.simple-talk.com/cloud/azure-sql-data-warehouse/

The Informed Data Lake: Beyond Metadata
Historically, the volume and extent of data that an enterprise could store, assemble, analyze and act upon exceeded the capacity of their computing resources and was too expensive. The solution was to model some extract of a portion of the available data into a data model or schema, presupposing what was “important,” and then fit the incoming data into that structure.
https://hiredbrains.wordpress.com/2016/05/13/the-informed-data-lake-beyond-metadata/

Real Time Streaming with Spring XD, Apache Geode (Gemfire), and Greenplum
Spring XD is a unified, distributed, and extensible service for data ingestion, real time analytics, batch processing, and data export.
http://zdatainc.com/2016/01/real-time-streaming-with-spring-xd-apache-geode-gemfire-and-greenplum-earthquake-data-demo/

Data Orchestration using Hortonworks DataFlow (HDF)
Hortonworks DataFlow (HDF), powered by Apache NiFi, is the first integrated platform that solves the real time complexity and challenges of collecting and transporting data from a multitude of sources, be they big or small, fast or slow, always connected or intermittently available.
http://zdatainc.com/2016/02/hello-nifi-data-orchestration-using-hortonworks-dataflow-hdf/

Happy Learning!

Data Infrastructure, Data Pipeline and Analytics – Useful Links – September, 2016

Some of the interesting links I have read in the last couple of weeks on data, in-memory databases, pipeline development and analytics.
In Search of Database Nirvana
An excellent post providing an in-depth look at the possibilities and the challenges for companies that long for a single query engine to rule them all.
https://www.oreilly.com/ideas/in-search-of-database-nirvana
http://www.slideshare.net/RohitJain0813/in-search-of-database-nirvana-the-challenges-of-delivering-hybrid-transactionanalytical-processing
Aerospike Vs Cassandra Comparison
A comparison of Aerospike with Apache Cassandra. Cassandra is a columnar NoSQL database that is great for ingesting and analyzing hundreds of terabytes of data stored on rotational disks. Aerospike is an in-memory, NoSQL database, a key-value store that can run purely in RAM and is also optimized for storing data in Flash (SSDs).
http://www.aerospike.com/when-to-use-aerospike-vs-cassandra/
An overview of Apache Streaming Technologies
A very good comparison of simple event processors, stream processors, and complex event processors.
https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/
Flow based Programming
Flow-Based Programming defines applications using the metaphor of a “data factory”. It views an application not as a single, sequential process, which starts at a point in time, and then does one thing at a time until it is finished, but as a network of asynchronous processes communicating by means of streams of structured data chunks, called “information packets” (IPs).
http://www.jpaulmorrison.com/fbp/introduction.html
Hadoop Deployment Cheat Sheet
If you are using, or planning to use the Hadoop framework for big data and Business Intelligence (BI) this document can help you navigate some of the technology and terminology, and guide you in setting up and configuring the system.
http://jethro.io/hadoop-deployment-cheat-sheet/
Amazon Redshift for Custom Analytics
An experience summary on building custom analytics on top of Redshift.
https://www.alooma.com/blog/custom-analytics-amazon-redshift
Building Analytics at 500px
An experience summary on how 500px built its analytics ecosystem.
https://medium.com/@samson_hu/building-analytics-at-500px-92e9a7005c83#.pdkk7xrui
How Artificial Intelligence Will Kickstart the Internet of Things?
IoT will produce a tsunami of big data. As the rapid expansion of devices and sensors connected to the Internet of Things continues, the sheer volume of data being created by them will increase to an astronomical level. This data will hold extremely valuable insights into what’s working well or what’s not.
https://datafloq.com/read/Artificial-Intelligence-Kickstart-Internet-Things/1776

Happy Learning!

Scaling data operations with in-memory OLTP

Data has become the center of our universe in the modern digital world. Applications are designed to collect and store more and more data. Companies are looking to integrate and analyse that data to generate insights and take action.

Data is a precious thing and will last longer than the systems themselves ~ Tim Berners-Lee

Can an existing relational database scale to high ingestion rates and improved read performance?

In-Memory OLTP seems to be the way forward if you want to build on your existing technology investments. Of course, if the company is open to changing technology, there would be more options.

I found a couple of very good articles on SQL Server In-Memory OLTP. It looks like SQL Server 2016 fixes most of the issues with In-Memory OLTP.

I just think it is an amazing technology, and if we use it in the right way, it will definitely yield great results for our customers.

Introducing SQL Server In-Memory OLTP
https://msdn.microsoft.com/en-in/library/dn133186.aspx
https://www.simple-talk.com/sql/learn-sql-server/introducing-sql-server-in-memory-oltp/
http://blog.sqlauthority.com/2014/08/08/sql-server-introduction-to-sql-server-2014-in-memory-oltp/

The Use Cases for SQL Server 2014 In-Memory OLTP
http://sqlturbo.com/the-use-cases-for-sql-server-2014-in-memory-oltp/

SQL Server In-Memory OLTP Internals Overview
https://msdn.microsoft.com/en-us/library/dn720242.aspx

The Promise – and the Pitfalls – of In-Memory OLTP
https://www.simple-talk.com/sql/performance/the-promise---and-the-pitfalls---of-in-memory-oltp/
https://msdn.microsoft.com/en-us/library/dn246937.aspx

SQL Server 2016 : In-Memory OLTP Enhancements
http://sqlperformance.com/2015/11/sql-server-2016/in-memory-oltp-enhancements

Speeding up Business Analytics Using In-Memory Technology
https://blogs.technet.microsoft.com/dataplatforminsider/2015/12/08/speeding-up-business-analytics-using-in-memory-technology/

Dynamic Data Masking in SQL Server 2016
http://www.codeproject.com/Articles/1084808/Dynamic-Data-Masking-in-SQL-Server
https://blogs.technet.microsoft.com/dataplatforminsider/2016/01/25/use-dynamic-data-masking-to-obfuscate-your-sensitive-data/

Happy Learning!

“Data is long-term, Applications are temporary.”

Think data first. Data is long-term, applications are temporary. I recently happened to read this in a blog post, and I couldn’t agree more. Data remains one of the most strategic initiatives for most companies.

Every fifth person you talk to, every other start-up you come across and every other job posting has something to say about data, analytics and so on. But when I speak to the people I come across in my ecosystem, a lot of them think it is only about doing cool stuff in R.

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.

If someone has been an application developer for the last 10 years, can he/she suddenly become an expert in statistics and algorithms? Can you suddenly start calling yourself a Data Scientist? Maybe… Nothing is impossible. But if that were your passion, you wouldn’t have been an application developer for the last 10 years. Right?

Is there anything else one can learn and contribute in the data world? I thought of sharing a couple of valuable links that give a very good idea of the various aspects and where one can fit in.

#1 Will Balkanization of Data Science lead to one Empire or many Republics? via http://www.kdnuggets.com/2015/11/balkanization-data-science.html
#2 Becoming a Data Scientist via http://nirvacana.com/thoughts/becoming-a-data-scientist/
#3 Difference between Data Engineering and Data Science via http://www.galvanize.com/blog/difference-between-data-engineering-and-data-science/
#4 The world of data science: Who does what in the data world? via http://cloudtweaks.com/2015/11/booming-world-data-science/matrix-1013612_640

Data is one of the hottest stacks right now and it is growing at a crazy speed. It would be extremely difficult for any single individual to cope with this change unless one’s basics are right.

Once you have the basics right, it is about Meta learning and evolving from there.

Having worked on various large scale data projects for the last 15 months, the following is my high level list of items one needs to know to have a reasonable understanding of data (big or small). This list is in no specific order.

General
  • A basic overview of Descriptive, Diagnostic, Predictive, Prescriptive and Cognitive Analytics: understand the concepts and the differences between them.
Data Warehouses
  • OLAP VS OLTP
  • Dimensional Modelling (Star Schemas, Snowflake Schemas)
  • Difference between Multi-Dimensional, Relational, Hybrid
  • In-Memory OLAP
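To make the star schema idea from the list above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (fact_sales, dim_product, dim_date) and the sample rows are hypothetical, chosen only for illustration; a real warehouse would have far more dimensions and attributes.

```python
import sqlite3

def build_star_schema(conn):
    """Create one fact table surrounded by two dimension tables (hypothetical names)."""
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id INTEGER REFERENCES dim_date(date_id),
            amount REAL
        );
    """)
    conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                     [(1, "Laptop", "Hardware"), (2, "Mouse", "Hardware"), (3, "Editor", "Software")])
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                     [(1, 2015, 12), (2, 2016, 1)])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     [(1, 1, 1000.0), (2, 1, 20.0), (3, 2, 50.0), (1, 2, 1200.0)])

def sales_by_category_and_year(conn):
    """A typical OLAP-style question: join the fact to its dimensions and aggregate."""
    return conn.execute("""
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """).fetchall()

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    for row in sales_by_category_and_year(connection):
        print(row)
```

The facts hold only keys and measures, while the descriptive attributes live in the dimensions. A snowflake schema would go one step further and normalise the dimensions themselves, for example by moving category into its own table.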
NoSQL Databases
  • CAP Theorem
  • If you are from application development, this is where the most important change would be. So far, you would have dealt primarily with key-value stores and document stores. For analytics purposes (write efficient), it is important to start understanding column databases (e.g. Cassandra) and graph databases (e.g. Neo4j). This is again a big shift from what you would have done as an application developer. Spend some time on it.
  • In-Memory databases in general.
  • Apart from Cassandra and Neo4j, get an understanding of what MemSQL offers. Yes, it is MemSQL and not MySQL :) It seems very impressive.
Outside EDWs
  • MPPs/PDWs – Difference between traditional EDWs and MPPs?
  • DWH on cloud: Amazon Redshift, Azure SQL Data Warehouse
Data Mining
  • What does it mean?
  • Data Mining Algorithms
Hadoop
  • Hadoop and Various Hadoop Components
  • When to use Hadoop?
  • Parallelization and Map Reduce Fundamentals
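The MapReduce fundamentals mentioned above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases using the classic word count, not Hadoop's actual API; the function names are my own.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    print(word_count(["big data", "Big deal"]))
```

The point to take away is that each call to map_phase and each call to reduce_phase is independent of the others, which is what lets a cluster run them in parallel on different nodes.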
Outside Hadoop
  • Difference between Hadoop, Spark and Storm (I personally prefer Spark. RDDs give me the same comfort I had with ADO.NET.)
  • When to use Hadoop/Spark/Storm over MPP?
ETL
  • Data Munging/Wrangling
  • Scrubbing
  • Transforming
  • Reading and Loading Data
  • Exception Handling
  • Jobs/Tasks
Real time Analytics
Working with streams: real time analytics is something everyone talks about. But without understanding what stream processing means, you will never be able to figure it out.
From an application background

  • Reactive Architecture (Responsive, Resilient, Elastic and Message driven)
  • Understand the difference between an Event and a Transaction.
  • Event Processing (CQRS, Actor Model [Akka], Complex Event Processing)

If you don’t understand the above, it will be difficult to move forward. Spend time on these before moving on to the other items.
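As a rough illustration of the event versus transaction distinction and of CQRS, here is a minimal sketch. A transaction updates state in place; an event is an immutable fact appended to a log, and state is derived by replaying that log. The Event, EventStore and handler names are hypothetical, and this is a toy in-memory version rather than any particular framework's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    """An immutable fact about something that already happened."""
    account: str
    kind: str      # "deposited" or "withdrawn"
    amount: int

class EventStore:
    """Append-only log; events are never updated or deleted."""
    def __init__(self):
        self._log = []

    def append(self, event):
        self._log.append(event)

    def events(self):
        return tuple(self._log)

def balance_of(events, account):
    """Query side: derive the current state by replaying the events."""
    balance = 0
    for event in events:
        if event.account != account:
            continue
        balance += event.amount if event.kind == "deposited" else -event.amount
    return balance

def handle_deposit(store, account, amount):
    """Command side: validate, then record what happened."""
    store.append(Event(account, "deposited", amount))

def handle_withdraw(store, account, amount):
    if balance_of(store.events(), account) < amount:
        raise ValueError("insufficient funds")
    store.append(Event(account, "withdrawn", amount))

if __name__ == "__main__":
    store = EventStore()
    handle_deposit(store, "acc-1", 100)
    handle_withdraw(store, "acc-1", 30)
    print(balance_of(store.events(), "acc-1"))
```

Commands (the handle_ functions) and queries (balance_of) are kept apart, which is the core of CQRS. In a real system the query side would usually maintain its own read model instead of replaying the full log on every call.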
Messaging/Data bus

  • Kafka

Processing Streams

  • Spark/Storm

Lambda Architecture
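A minimal sketch of the idea behind the Lambda Architecture, assuming a simple page-view counting use case: a batch layer periodically recomputes a view from the full master dataset, a speed layer keeps an incremental view of events that arrived since the last batch run, and the serving layer merges the two at query time. All names here are hypothetical.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recompute the view from scratch over all historical events."""
    return Counter(event["page"] for event in master_dataset)

class SpeedLayer:
    """Speed layer: incrementally update a view for events the batch has not seen yet."""
    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        self.realtime_view[event["page"]] += 1

def query(page, batch, speed):
    """Serving layer: merge the batch view and the real-time view."""
    return batch[page] + speed.realtime_view[page]

if __name__ == "__main__":
    history = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
    batch = batch_view(history)
    speed = SpeedLayer()
    speed.on_event({"page": "/home"})
    print(query("/home", batch, speed))
```

In practice the batch layer would be something like Hadoop or Spark, the speed layer something like Storm or Spark Streaming, and the real-time view would be discarded once the next batch run has absorbed its events.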

Machine Learning

  • Difference between Data Mining and Machine Learning
  • ML Algorithms

A couple of very good posts to read on this:
Machine Learning for Programmers: Leap from developer to machine learning practitioner via http://machinelearningmastery.com/machine-learning-for-programmers/
What Every Manager Should Know About Machine Learning via https://hbr.org/2015/07/what-every-manager-should-know-about-machine-learning
Most of what we are doing can be achieved at some level using Excel Analytics Data Pack. In fact, I would say Excel is the most powerful tool out there.

Recommendation Engines
  • Collaborative Filtering
  • Content-based Filtering
  • Hybrid

Once you are clear on the concepts, start implementing using Apache Mahout.
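Before reaching for a library, it can help to see how small the core of user-based collaborative filtering is. The sketch below predicts a rating as a similarity-weighted average of other users' ratings, using cosine similarity. The ratings data and function names are made up for illustration, and this is not how Mahout exposes it.

```python
from math import sqrt

# Hypothetical user -> {item: rating} data.
RATINGS = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 5, "B": 3, "D": 5},
    "carol": {"A": 1, "B": 5, "D": 1},
}

def cosine_similarity(u, v):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item):
    """Similarity-weighted average of the ratings other users gave to the item."""
    weighted_sum = 0.0
    similarity_sum = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        similarity = cosine_similarity(ratings[user], their_ratings)
        weighted_sum += similarity * their_ratings[item]
        similarity_sum += similarity
    if similarity_sum == 0:
        return None  # nobody comparable has rated this item
    return weighted_sum / similarity_sum

if __name__ == "__main__":
    print(predict(RATINGS, "alice", "D"))
```

Content-based filtering would instead compare item attributes with a profile of what the user liked, and a hybrid approach combines both signals.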

Communication Protocols
  • JSON, Avro, Protocol Buffers, and Thrift: If you are from application development, you would have used JSON extensively. It is time to understand the other ones as well. I keep arguing about this with my friend Sendhil (IMO, Avro seems to be the way to go where things are evolving and there is a need for self-documentation; cowboy friendly).
Time Series
  • Modelling
  • Databases (OpenTSDB)
  • Forecasting
  • Trend Analysis
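Two of the simplest building blocks for the forecasting and trend analysis items above are a moving-average forecast and a least-squares trend line. The sketch below shows both in plain Python on a made-up series; real work would use a proper statistics library and account for seasonality.

```python
def moving_average_forecast(series, window):
    """Forecast the next point as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return sum(series[-window:]) / window

def linear_trend(series):
    """Least-squares slope of the series against its index (change per time step)."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance

if __name__ == "__main__":
    readings = [10, 12, 14, 16]
    print(moving_average_forecast(readings, window=2))
    print(linear_trend(readings))
```

Note that the moving average lags behind a trending series (it predicts 15 here although the series is climbing by 2 per step), which is exactly why trend analysis is listed as its own item.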
Modern day HOLAP Engines
  • Apache Kylin (My favourite at this point)
Data Visualization
Self-service is the mantra here. Read this article: Data Scientists Should be Good Storytellers

“Most of the people in an organization cannot understand the outcome of analytics, however they do need the proof of analysis and data. Data storytellers incorporate data and analytics in a compelling way as their stories involve real people and organizations” via https://dzone.com/articles/data-scientists-should-be-good-storytellers

  • How to represent data (Graphs/Charts)?
  • Excel Power Pivot / Power BI (Polybase)
  • Lumira
  • D3.js
Deep Learning
Though it may or may not be important at this point, try to understand what deep learning is. Read this: Deep Learning in a Nutshell: Core Concepts via http://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/
Data Lake
One of my favorite topics, and something I learnt about only after burning my hands, is the data lake.

  • Understand what a data lake means. Why do you need one? How do you build a data lake on your own?
  • Extract Load and Transform (ELT)
  • ELT vs ETL

Read this: https://azure.microsoft.com/en-in/solutions/data-lake/
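The ELT vs ETL difference is mostly about where the transformation runs. The sketch below uses an in-memory SQLite database as a stand-in for the target store: the ETL path cleans the rows in application code before loading, while the ELT path loads the raw rows first and then transforms them with SQL inside the target. Table names and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw input: untrimmed names and amounts as text, one of them invalid.
RAW_ROWS = [("  Alice ", "10"), ("BOB", "20"), ("carol", "n/a")]

def etl(conn, rows):
    """Extract, Transform (in Python), then Load only the cleaned rows."""
    conn.execute("CREATE TABLE clean_etl (name TEXT, amount INTEGER)")
    cleaned = [(name.strip().lower(), int(amount))
               for name, amount in rows if amount.isdigit()]
    conn.executemany("INSERT INTO clean_etl VALUES (?, ?)", cleaned)

def elt(conn, rows):
    """Extract, Load the raw rows as-is, then Transform with SQL in the target."""
    conn.execute("CREATE TABLE raw_events (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?)", rows)
    conn.execute("""
        CREATE TABLE clean_elt AS
        SELECT lower(trim(name)) AS name, CAST(amount AS INTEGER) AS amount
        FROM raw_events
        WHERE amount <> '' AND amount NOT GLOB '*[^0-9]*'
    """)

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    etl(connection, RAW_ROWS)
    elt(connection, RAW_ROWS)
    print(connection.execute("SELECT * FROM clean_etl ORDER BY name").fetchall())
    print(connection.execute("SELECT * FROM clean_elt ORDER BY name").fetchall())
```

Both paths end up with the same clean table, but only the ELT path still has the raw rows available afterwards. Keeping the raw data around so it can be re-transformed later is the property a data lake relies on.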

Language
Though there is a bunch of things you can do with Python, R, Java etc., my choice is Scala (I love the way the language allows you to express yourself. Wish someone could afford me as a developer again :) )

If you have a good grasp of the above, then it is time for you to figure out when to use what (creating solutions).

“If all you have is a hammer, everything looks like a nail”

Read this: The Ethics of Wielding an Analytical Hammer via http://sloanreview.mit.edu/article/the-ethics-of-wielding-an-analytical-hammer/

Data is having an impact on business models and profitability. It’s hard to find a non-trivial application that doesn’t use data in a significant manner ~ Ben Lorica, O’Reilly Media

Ok, this looks like a large list. Where do I start?

  1. Focus on the basics. Get a good overview of the ecosystem.
  2. Decide your area of specialization.
  3. Focus on your specialization and build skills.
  4. Iterate and change course as required.
  • If you have more than 10 years of experience, understand the business situation and figure out when to use what. Maybe pick 1 or 2 items and start implementing them in your environment.
  • If you have less than 10 years of experience, pick a scenario, try to implement it and see if it makes any business sense.

What have I not covered in the list? I haven’t gone into the details of:

  1. Hadoop Ecosystem and components (Pig/Hive etc.)
  2. Algorithms
    1. Nearest Neighbour
    2. K-Means Clustering
    3. Linear Regression
    4. Decision Trees etc.
  3. R in detail
  4. Infrastructure
    1. Env Setup
    2. Zookeeper, Yarn, Mesos
    3. Replication
  5. Vertical Industry Solutions
  6. Operational Systems (like Splunk)
  7. Data Governance

I keep hearing/seeing people who have never seen more than 1 GB of data saying that they do Big Data Analytics. Don’t learn or do something for the sake of doing it.

There is no short cut to a place worth going.

If you want to know more about what I am learning, you can follow me on Twitter.

Happy Learning!