Today, everyone talks about storing data in its raw format so that it can be analyzed to generate insights later. It is a fantastic idea, and data lakes deliver exactly that promise. However, the complexity of data is increasing day by day, and new data sources are being added on a regular basis.
You do not deal with familiar data sets every day. Given the variety of new data being added, you will most likely end up working with data sets outside your comfort zone.
Data science teams spend most of their time exploring and understanding data.
- If you must deliver quick insights on a data set, will you go through it manually, or is there something within the data lake that can help?
- What would be an easy way for the data science team to understand a data set quickly, including its patterns and relationships, so that we can generate hypotheses?
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
- Maximize insight into a data set;
- Uncover underlying structure;
- Extract important variables;
- Detect outliers and anomalies;
- Test underlying assumptions;
- Develop parsimonious models; and
- Determine optimal factor settings.
Via: http://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
Exploratory Data Analysis: http://www.statgraphics.com/exploratory-data-analysis
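To make a couple of the techniques above concrete, here is a minimal, hand-rolled sketch of two EDA steps: summarizing a numeric variable and flagging outliers with the common 1.5×IQR rule. The data and the `order_totals` name are made up for illustration; in practice you would do this with R or Python libraries (pandas, for example) rather than by hand.

```python
from statistics import mean, median, quantiles

def summarize(values):
    """Return basic summary statistics for a numeric variable."""
    q1, q2, q3 = quantiles(values, n=4)  # quartile cut points
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "iqr": q3 - q1,
    }

def iqr_outliers(values):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# A made-up variable with one suspicious value.
order_totals = [42, 38, 45, 40, 41, 39, 44, 43, 400]
print(summarize(order_totals))
print(iqr_outliers(order_totals))  # the 400 stands out as an outlier
```

Even this tiny amount of structure (a summary plus an outlier check) is often enough to form a first hypothesis about a data set.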
If you google exploratory data analysis, you will find plenty of material about doing EDA with R or Python.
Given the shortened timelines, if you expect your data science teams to write code just to understand the data, you may not be able to deliver value at the speed your business expects.
There are a couple of tools that may help you understand your data faster. Search for data profiling and you will find plenty of results on the topic. My favorite tools in this space right now are
Both are easy to use, with a simple user interface, and you can use the free version to get started. If you have an automated data pipeline built on Spark, you can also generate profile statistics for incoming data and store them as part of your catalog.
I really like this presentation on the topic: Data Profiling and Pipeline Processing with Spark.
Once you do this with Spark, you may want to store the updated data profile information as part of your catalog. If you index your catalog with Elasticsearch, you can provide an API for your data science teams to search for files with certain quality attributes.
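As a sketch of the idea, here is the kind of per-column profile such a pipeline could compute before pushing the result into a catalog. The record layout and field names are hypothetical, and a real pipeline would use Spark DataFrame aggregations rather than plain Python; this just shows the shape of the statistics involved.

```python
from collections import Counter

def profile_column(rows, column):
    """Compute simple quality statistics for one column of a data set."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "column": column,
        "row_count": len(values),
        "null_count": len(values) - len(present),
        "distinct_count": len(set(present)),
        "most_common": Counter(present).most_common(1),
    }

# Hypothetical incoming records; in practice these would be rows of a
# Spark DataFrame, and the resulting profile would be stored in the data
# catalog (and indexed in Elasticsearch so teams can search for files by
# quality attributes such as null counts).
records = [
    {"country": "US", "amount": 10},
    {"country": "US", "amount": None},
    {"country": "DE", "amount": 7},
]
print(profile_column(records, "amount"))
```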
The above tools will help you get a quick understanding of your data. But what if you want pointers on where to start your analysis? A profiler alone will not help in this case. You may want to explore this product from IBM (yes, you heard it right, it's from IBM and I am using it daily). Check it out here: IBM Watson Analytics.
Watson Analytics is a smart discovery service, and it is super smart. It is available for $80/user/month; for the value you get out of it, that is really nothing.
You can use it for data exploration and predictive analytics, and it is effortless. A free one-month subscription is available for you to play with.
I have looked at various products and couldn't find anything close to what Watson offers. If I have to mention one drawback, it does not provide connectivity to S3; you may have to connect to PostgreSQL or Redshift to extract data.
If you can integrate it into your platform and use it effectively, you will be able to add value for your customers in literally no time.
Serverless architecture is relatively new. I've been exploring it for our new platform architecture of late. Though it is very interesting, there is obviously a reasonable learning curve, and I don't see a lot of best practices out there yet.
The grass always looks greener on the other side; we will learn as we move forward.
Since we use AWS as our cloud provider, most of the examples you will see relate to AWS Lambda.
Specific Reasons for exploring Serverless Architecture
- No operating systems to choose, secure, patch, or manage.
- No servers to right size, monitor, or scale out.
- No risk to your cost by over-provisioning.
- No risk to your performance by under-provisioning.
One thing I learned in the last few years about developing distributed applications is that it is not about learning new things; it is always about unlearning what you have done in the past.
If you are concerned about vendor lock-in, then this may not be a choice for you at all.
Following is my reading list on Serverless Architecture.
What is Serverless?
What is Serverless Computing and Why is it Important?
Serverless Architecture in short
Is “Serverless” architecture just a finely-grained rebranding of PaaS?
Serverless Delivery: Architecture
Principles of Serverless Architectures
There are five principles of serverless architecture that describe how an ideal serverless system should be built. Use these principles to guide your decisions when you create a serverless architecture.
1. Use a compute service to execute code on demand (no servers)
2. Write single-purpose stateless functions
3. Design push-based, event-driven pipelines
4. Create thicker, more powerful front ends
5. Embrace third-party services
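As a tiny illustration of principles 1 and 2 above, here is the shape of a single-purpose, stateless AWS Lambda handler in Python. The event fields and the validation rules are made up for the sketch; the point is that all state arrives in the event, so the compute service can run as many copies as demand requires.

```python
import json

def handler(event, context):
    """A single-purpose, stateless function: validate one order event.

    Everything the function needs arrives in `event`; nothing is kept
    between invocations, which is what lets a compute service like
    AWS Lambda scale it out on demand.
    """
    order = event.get("order", {})
    errors = []
    if not order.get("id"):
        errors.append("missing order id")
    if order.get("total", 0) <= 0:
        errors.append("total must be positive")
    status = 200 if not errors else 400
    return {"statusCode": status, "body": json.dumps({"errors": errors})}

# Locally you can invoke it directly; in production an event source
# such as API Gateway would call it for you (principle 3).
print(handler({"order": {"id": "A-1", "total": 25}}, None))
```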
Serverless Architectures – Building a Serverless system to solve a problem
Serverless architecture: Driving toward autonomous operations
The essential guide to Serverless technologies and architectures
Using AWS Lambda and API Gateway to create a serverless schedule
Five Reasons to Consider Amazon API Gateway for Your Next Microservices Project
AWS Lambda and the Evolution of the Cloud
SquirrelBin: A Serverless Microservice Using AWS Lambda
A Crash Course in Amazon Serverless Architecture
AWS Lambda and Endless Serverless Possibilities
Awesome Serverless – A Curated List
Splunk vs ELK: The Log Management Tools Decision Making Guide
Much like promises made by politicians during an election campaign, production environments produce massive files filled with endless lines of text in the form of log files. Unlike election periods, they do it all year round, with multiple GBs of unstructured plain-text data generated each day.
Building a Modern Bank Backend
An awesome list of microservices architecture related principles and technologies.
Part of the Streaming Architecture book. An excellent overview of the topic.
The Hardest Part About Microservices: Your Data
Of the reasons we attempt a microservices architecture, chief among them is allowing your teams to work on different parts of the system at different speeds with minimal impact across teams. We want teams to be autonomous, capable of making decisions about how best to implement and operate their services, and free to make changes as quickly as the business may desire. If we have our teams organized to do this, then the reflection in our systems architecture will begin to evolve into something that looks like microservices.
New Ways to Discover and Use Alexa Skills
Alexa, Amazon’s cloud-based voice service, powers voice experiences on millions of devices, including Amazon Echo and Echo Dot, Amazon Tap, Amazon Fire TV devices, and devices like Triby that use the Alexa Voice Service. One year ago, Amazon opened up Alexa to developers, enabling you to build Alexa skills with the Alexa Skills Kit and integrate Alexa into your own products with the Alexa Voice Service.
Hadoop architectural overview
An excellent series of posts discussing Hadoop and related components, and the key metrics to monitor in production.
Surviving and Thriving in a Hybrid Data Management World
The vast majority of our customers who are moving to cloud applications also have a significant current investment in on-premises operational applications and on-premises capabilities around data warehousing, business intelligence, and analytics. That means most of them will be working with a hybrid cloud/on-premises data management environment for the foreseeable future.
Data Compression in Hadoop
File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.
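The space savings are easy to see even outside Hadoop. This small sketch uses plain Python gzip as a stand-in for a Hadoop codec such as gzip or Snappy; the log line is made up, but repetitive data like this is exactly where compression pays off.

```python
import gzip

# Repetitive data, as log files and columnar records often are.
raw = b"2016-08-01 INFO request served in 12ms\n" * 1000
compressed = gzip.compress(raw)

ratio = len(compressed) / len(raw)
print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes, "
      f"ratio: {ratio:.2%}")

# The trade-off: compression costs CPU on write and read, and in Hadoop
# you must also consider whether the codec is splittable so a large file
# can be processed block-by-block in parallel (gzip is not splittable;
# bzip2 is).
```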
What is “Just-Enough” Governance for the Data Lake?
Just-enough governance is similar to the Lean Startup methodology concept of building a Minimum Viable Product (MVP). From an enterprise perspective, just-enough governance means building only the process and control necessary to solve a particular business problem.
Mind map on SAP HANA
Should I use SQL or NoSQL?
Every application needs persistent storage — data that persists across program restarts. This includes usernames, passwords, account balances, and high scores. Deciding how to store your application's important data is one of the first and most important architectural decisions to be made.
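For the relational side of that decision, here is a minimal sketch using Python's built-in sqlite3 to store the kind of data mentioned above (usernames and high scores). The table and column names are illustrative; a real application would likely use a file-backed database or a server engine such as PostgreSQL.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE players (
        username   TEXT PRIMARY KEY,
        high_score INTEGER NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO players (username, high_score) VALUES (?, ?)",
    [("ada", 4200), ("lin", 3100)],
)
conn.commit()

# A fixed schema and declarative queries are the classic SQL strengths;
# a NoSQL store trades them for flexible, schema-on-read data.
top = conn.execute(
    "SELECT username FROM players ORDER BY high_score DESC LIMIT 1"
).fetchone()
print(top)  # ('ada',)
```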
Developing a robust data platform definitely requires more than HDFS, Hive, Sqoop, and Pig. Today there is a real need to bring data and compute as close together as possible, and more and more requirements are forcing us to deal with high-throughput/low-latency scenarios. Thanks to in-memory solutions, such things definitely seem possible now.
One of the lessons I have learned in the last few years is that it is hard to resist developing your own technology infrastructure while building a platform. It is always important to remind ourselves that we are here to build solutions, not technology infrastructure.
Some of the key questions that need to be considered while embarking on such a journey are:
- How do we handle the ever growing volume of data (Data Repository)?
- How do we deal with the growing variety of data (Polyglot Persistence)?
- How do we ingest large volumes of data as we start growing (Ingestion Pipelines/Write Efficient)?
- How do we scale in terms of faster data retrieval so that the analytics engine can deliver something meaningful at a decent pace?
- How do we deal with the need for Interactive Analytics with a large dataset?
- How do we keep our cost per terabyte low while taking care of our platform growth?
- How do we move data securely between on premise infrastructure to cloud infrastructure?
- How do we handle data governance, data lineage, data quality?
- What kind of monitoring infrastructure would be required to support distributed processing?
- How do we model metadata so that we can address domain specific problems?
- How do we test this infrastructure? What kind of automation is required?
- How do we create a service delivery platform for build and deployment?
One of the challenges I am seeing right now is the urge to use multiple technologies to solve similar problems. Though this gives developers the edge to do things differently and efficiently, from a platform perspective it increases the total cost of operations.
- How do we support our customers in production?
- How can we make the lives of our operations teams better?
- How do we take care of the reliability, durability, scalability, extensibility, and maintainability of this platform?
I will talk about the data repository and possible choices in the next post.
Eight years back, when I started my consulting journey, there were two major technology ecosystems to deal with. Depending on the problem, I evaluated them against business, application, development, and operational considerations, and was able to provide a solution that worked for most of my customers.
Four years back, things started changing. I was putting together the technology roadmap for the SBU I was working for, and there were more than two technology ecosystems to deal with. One of the major questions I had to answer at that point was: if I choose a technology today and start developing, can it survive the test of time? Can we avoid a rewrite for at least the next five years?
Fast forward to 2016: technology is changing so fast that five years seems like a very long time. If you are starting to develop a product, framework, or platform, you need to be prepared for at least one or more of your components to change as you develop, and your ecosystem should be ready for that. This requires a different mindset across business teams, architecture groups, development teams, and ops teams. As an architecture consultant, it is very important to communicate and bring everyone to the same page. Setting expectations right and being prepared will avoid frustration in the later part of the journey!