Data Platform : Exploratory Data Analysis

Today, everyone talks about storing data in its raw format so that it can be analyzed later to generate insights. Fantastic idea. Data lakes deliver exactly that promise. However, the complexity of data is increasing day by day, and new data sources are being added on a regular basis.

You do not always get to work with data sets you are familiar with. Given the new types of data being added, you will most likely end up dealing with data sets outside your comfort zone.
Data science teams spend most of their time exploring and understanding data.

  • If you must deliver quick insights on a data set, will you go through it manually, or is there something within the data lake that you can use?
  • What would be an easy way for the data science team to understand the data set quickly, including its patterns and relationships, so that they can generate hypotheses?

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to

  1. Maximize insight into a data set;
  2. Uncover underlying structure;
  3. Extract important variables;
  4. Detect outliers and anomalies;
  5. Test underlying assumptions;
  6. Develop parsimonious models; and
  7. Determine optimal factor settings.

Via : http://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
Exploratory Data Analysis : http://www.statgraphics.com/exploratory-data-analysis

If you search for exploratory data analysis, you will find tons of material about doing EDA using R or Python.
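To give a feel for what that hand-written EDA code looks like, here is a minimal sketch in plain Python. It computes a few summary statistics and flags outliers using Tukey's 1.5 × IQR rule. The function names and the sample data are made up for illustration; a real analysis would more likely use pandas or R.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": median,
        "q1": q1,
        "q3": q3,
    }


def iqr_outliers(values, k=1.5):
    """Flag values outside [q1 - k*IQR, q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample: daily order counts with one suspicious spike.
orders = [12, 15, 14, 13, 16, 15, 14, 120, 13, 15]
print(summarize(orders))
print(iqr_outliers(orders))  # prints [120]
```

Even this small example takes some thought to write and check, and that effort adds up across every new data set.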

Given shortened timelines, if you expect your data science teams to write code just to understand the data, you may not be able to deliver value at the speed your business expects.

There are a couple of tools that may help you understand your data faster. Search for "data profiling" and you will get plenty of results on this topic. My favorite tools in this area right now are

  1. Trifacta Wrangler
  2. Exploratory.io

Both are easy to use, with a simple user interface, and you can use the free version to get started. If you have an automated data pipeline using Spark, you can also generate profile statistics about the incoming data and store them as part of your catalog.
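As a rough idea of what "profile statistics" can mean here, the sketch below computes a per-column profile (row count, nulls, distinct values, min, max, most common value) for a small batch of records. It is plain Python rather than Spark so that it runs anywhere; in a Spark pipeline you would presumably compute the same numbers with DataFrame aggregations. The function names, field names and sample records are all hypothetical.

```python
from collections import Counter


def profile_column(name, values):
    """Compute simple profile statistics for one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    nulls = len(values) - len(present)
    return {
        "column": name,
        "rows": len(values),
        "nulls": nulls,
        "null_ratio": nulls / len(values) if values else 0.0,
        "distinct": len(counts),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "top_value": counts.most_common(1)[0][0] if counts else None,
    }


def profile_rows(rows):
    """Profile every column found in a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column(col, [row.get(col) for row in rows]) for col in columns}


# Hypothetical incoming batch.
rows = [
    {"customer_id": 1, "country": "US", "amount": 25.0},
    {"customer_id": 2, "country": "IN", "amount": None},
    {"customer_id": 3, "country": "US", "amount": 40.5},
    {"customer_id": 4, "country": None, "amount": 12.0},
]
profile = profile_rows(rows)
for column, stats in profile.items():
    print(column, stats)
```

Each resulting dictionary is small enough to store next to the file's entry in the catalog.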

I really like this presentation on the topic: Data Profiling and Pipeline Processing with Spark.

Once you do this with Spark, you may want to update the data profile information and store it as part of your catalog. If you index your catalog with Elasticsearch, you may be able to provide an API for your data science teams to search for files with certain quality attributes.
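To make the search idea concrete, here is a small sketch of the kind of query such an API could answer. It uses an in-memory list as a stand-in for the index, so it shows only the filtering logic, not how Elasticsearch itself is queried. The `CatalogEntry` fields, the `search_catalog` function and the file paths are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """One file in the (hypothetical) data lake catalog with its profile summary."""
    path: str
    rows: int
    null_ratio: float
    columns: tuple


def search_catalog(entries, max_null_ratio=1.0, min_rows=0, required_columns=()):
    """Return entries meeting the quality thresholds, cleanest (lowest null ratio) first."""
    matches = [
        e for e in entries
        if e.null_ratio <= max_null_ratio
        and e.rows >= min_rows
        and set(required_columns) <= set(e.columns)
    ]
    return sorted(matches, key=lambda e: e.null_ratio)


# Hypothetical catalog contents.
catalog = [
    CatalogEntry("raw/orders/2016-09-01.csv", 10000, 0.02, ("order_id", "amount")),
    CatalogEntry("raw/orders/2016-09-02.csv", 9500, 0.31, ("order_id", "amount")),
    CatalogEntry("raw/clicks/2016-09-01.json", 250000, 0.05, ("user_id", "url")),
]

clean_orders = search_catalog(catalog, max_null_ratio=0.1, required_columns=("order_id",))
print([e.path for e in clean_orders])  # prints ['raw/orders/2016-09-01.csv']
```

With a real index behind it, the same thresholds would become filters in the search request instead of a list comprehension.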

The above tools will help you get a quick understanding of your data. But what if you want pointers on where to start your analysis? A profiler alone will not help in this case. You may want to explore this product from IBM (yeah… you heard it right… it’s from IBM, and I am using it daily). Check it out here: IBM Watson Analytics

Watson Analytics is a smart discovery service, and it is super smart. It is available for $80 per user per month. For the value you get out of it, $80 per month is really nothing.
You can use it for data exploration and predictive analytics, and it is effortless. A free one-month subscription is available for you to play with.

I have looked at various products and couldn’t find anything that comes close to what Watson offers. If I have to mention a drawback, it doesn’t provide connectivity to S3. You may have to connect to PostgreSQL or Redshift to extract data.

If you can integrate it in your platform and use it effectively, you will be able to add value to your customers in literally no time.

Happy Learning!

Data Infrastructure, Data Pipeline and Analytics – Reading List – Sep 27, 2016

Splunk vs ELK: The Log Management Tools Decision Making Guide
Much like promises made by politicians during an election campaign, production environments produce massive files filled with endless lines of text in the form of log files. Unlike election periods, they’re doing it all year around, with multiple GBs of unstructured plain text data generated each day.
http://blog.takipi.com/splunk-vs-elk-the-log-management-tools-decision-making-guide/

Building a Modern Bank Backend
https://monzo.com/blog/2016/09/19/building-a-modern-bank-backend/

An awesome list of Micro Services Architecture related principles and technologies.
https://github.com/mfornos/awesome-microservices#api-gateways--edge-services

Stream-based Architecture
Part of the Stream Architecture Book. An excellent overview on the topic.
https://www.mapr.com/ebooks/streaming-architecture/chapter-02-stream-based-architecture.html

The Hardest Part About Microservices: Your Data
Of the reasons we attempt a microservices architecture, chief among them is allowing your teams to be able to work on different parts of the system at different speeds with minimal impact across teams. So we want teams to be autonomous, capable of making decisions about how to best implement and operate their services, and free to make changes as quickly as the business may desire. If we have our teams organized to do this, then the reflection in our systems architecture will begin to evolve into something that looks like microservices.
http://blog.christianposta.com/microservices/the-hardest-part-about-microservices-data/

New Ways to Discover and Use Alexa Skills
Alexa, Amazon’s cloud-based voice service, powers voice experiences on millions of devices, including Amazon Echo and Echo Dot, Amazon Tap, Amazon Fire TV devices, and devices like Triby that use the Alexa Voice Service. One year ago, Amazon opened up Alexa to developers, enabling you to build Alexa skills with the Alexa Skills Kit and integrate Alexa into your own products with the Alexa Voice Service.
http://www.allthingsdistributed.com/2016/06/new-ways-to-discover-and-use-alexa-skills.html

Happy Learning!

Data Infrastructure, Data Pipeline and Analytics – Reading List – Sep 20, 2016

Hadoop architectural overview
An excellent series of posts about Hadoop and related components, and the key metrics to monitor in production.
https://www.datadoghq.com/blog/hadoop-architecture-overview/

Surviving and Thriving in a Hybrid Data Management World
The vast majority of our customers who are moving to cloud applications also have a significant current investment in on premise operational applications and on premise capabilities around data warehousing, business intelligence and analytics. That means that most of them will be working with a hybrid cloud/on premise data management environment for the foreseeable future.
http://blogs.informatica.com/2016/08/19/surviving-thriving-hybrid-data-management-world/#fbid=dlbfZB7A1Sd

Data Compression in Hadoop
File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.
http://comphadoop.weebly.com/

What is “Just-Enough” Governance for the Data Lake?
Just-enough governance is similar to the Lean Startup methodology concept of building a Minimum Viable Product (MVP). From an enterprise perspective, just-enough governance means building only the process and control necessary to solve a particular business problem.
https://infocus.emc.com/rachel_haines/just-enough-governance-data-lake/

Mind map on SAP HANA
https://www.mindmeister.com/353051849/sap-hana-platform

Should I use SQL or NoSQL?
Every application needs persistent storage — data that persists across program restarts. This includes usernames, passwords, account balances, and high scores. Deciding how to store your application’s important data is one of the first and most important architectural decisions to be made.
https://www.databaselabs.io/blog/Should-I-use-SQL-or-NoSQL

Happy Learning!

Data Infrastructure, Data Pipeline and Analytics – Reading List – Sep 12, 2016

Three incremental, manageable steps to building a “data first” data lake
Applications have always dictated the data. That has made sense historically, and to some extent, continues to be the case. But an “applications first” approach creates data silos that are causing operational problems and preventing organizations from getting the full value from their business intelligence initiatives.
http://www.networkworld.com/article/3098937/analytics/three-incremental-manageable-steps-to-building-a-data-first-data-lake.html

Azure SQL Data Warehouse: Introduction
Azure SQL Data Warehouse is a fully-managed and scalable cloud service.
https://www.simple-talk.com/cloud/azure-sql-data-warehouse/

The Informed Data Lake: Beyond Metadata
Historically, the volume and extent of data that an enterprise could store, assemble, analyze and act upon exceeded the capacity of their computing resources and was too expensive. The solution was to model some extract of a portion of the available data into a data model or schema, presupposing what was “important,” and then fit the incoming data into that structure.
https://hiredbrains.wordpress.com/2016/05/13/the-informed-data-lake-beyond-metadata/

Real Time Streaming with Spring XD, Apache Geode (GemFire), and Greenplum
Spring XD is a unified, distributed, and extensible service for data ingestion, real time analytics, batch processing, and data export.
http://zdatainc.com/2016/01/real-time-streaming-with-spring-xd-apache-geode-gemfire-and-greenplum-earthquake-data-demo/

Data Orchestration using Hortonworks DataFlow (HDF)
Hortonworks DataFlow (HDF), powered by Apache NiFi, is the first integrated platform that solves the real time complexity and challenges of collecting and transporting data from a multitude of sources, be they big or small, fast or slow, always connected or intermittently available.
http://zdatainc.com/2016/02/hello-nifi-data-orchestration-using-hortonworks-dataflow-hdf/

Happy Learning!