Hadoop architectural overview
An excellent series of posts about Hadoop and related components, and the key metrics to monitor in production.
Surviving and Thriving in a Hybrid Data Management World
The vast majority of our customers who are moving to cloud applications also have a significant current investment in on-premises operational applications and on-premises capabilities around data warehousing, business intelligence and analytics. That means that most of them will be working with a hybrid cloud/on-premises data management environment for the foreseeable future.
Data Compression in Hadoop
File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.
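To make the space saving concrete, here is a minimal sketch that compresses a synthetic, log-like payload with gzip and bzip2, two formats Hadoop can also read and write. It uses Python's standard library rather than Hadoop's own codec classes, and the sample data and the `compression_report` helper are illustrative assumptions, not taken from the linked post.

```python
import bz2
import gzip


def compression_report(data: bytes) -> dict:
    """Return the size in bytes of the raw input and of each compressed form."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
    }


# Hypothetical sample: repetitive log-like lines, which compress well.
sample = b"".join(
    b"2016-01-01 00:00:%02d INFO request served in %d ms\n" % (i % 60, i % 250)
    for i in range(20000)
)

report = compression_report(sample)
for codec, size in report.items():
    # Show each size and its share of the uncompressed input.
    print(f"{codec:>6}: {size:>8} bytes ({size / report['raw']:.1%} of raw)")
```

Actual ratios depend heavily on the data. In Hadoop the choice also depends on whether the format can be split across map tasks: bzip2 files are splittable, while plain gzip files are not.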
What is “Just-Enough” Governance for the Data Lake?
Just-enough governance is similar to the Lean Startup methodology concept of building a Minimum Viable Product (MVP). From an enterprise perspective, just-enough governance means building only the process and control necessary to solve a particular business problem.
Mind map on SAP HANA
Should I use SQL or NoSQL?
Every application needs persistent storage: data that persists across program restarts. This includes usernames, passwords, account balances, and high scores. Deciding how to store your application’s important data is one of the first and most important architectural decisions to be made.
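As a rough illustration of the choice, the sketch below stores the same player record two ways: in a relational table with a fixed schema, and as a schemaless JSON document in a key-value map. The table name, fields, and the in-memory `store` dictionary are hypothetical stand-ins; a real NoSQL system would replace the dictionary.

```python
import json
import sqlite3

# Relational style: a fixed schema, queried with SQL.
# ":memory:" keeps the sketch self-contained; pass a file path to persist data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players (username TEXT PRIMARY KEY, balance REAL, high_score INTEGER)"
)
conn.execute("INSERT INTO players VALUES (?, ?, ?)", ("alice", 12.5, 9001))
row = conn.execute(
    "SELECT high_score FROM players WHERE username = ?", ("alice",)
).fetchone()

# Document style: each record is a self-describing JSON document looked up by
# key, so a new field such as "badges" needs no schema change.
store = {}  # hypothetical stand-in for a document or key-value database
store["alice"] = json.dumps(
    {"username": "alice", "balance": 12.5, "high_score": 9001, "badges": ["early-adopter"]}
)
doc = json.loads(store["alice"])

print(row[0], doc["high_score"])
```

The relational version enforces structure up front and supports ad hoc queries; the document version trades that for flexibility in what each record may contain.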
Developing a robust data platform definitely requires more than HDFS, Hive, Sqoop and Pig. Today there is a real need to bring data and compute as close together as possible. More and more requirements are forcing us to deal with high-throughput/low-latency scenarios. Thanks to in-memory solutions, this definitely seems possible right now.
One of the lessons I have learned in the last few years is that it is hard to resist developing your own technology infrastructure while developing a platform infrastructure. It is always important to remind ourselves that we are here to build solutions and not technology infrastructure.
Some of the key questions that need to be considered while embarking on such a journey are:
- How do we handle the ever growing volume of data (Data Repository)?
- How do we deal with the growing variety of data (Polyglot Persistence)?
- How do we ingest large volumes of data as we start growing (Ingestion Pipelines/Write Efficient)?
- How do we scale in terms of faster data retrieval so that the Analytics engine can provide something meaningful at a decent pace?
- How do we deal with the need for Interactive Analytics with a large dataset?
- How do we keep our cost per terabyte low while taking care of our platform growth?
- How do we move data securely between on-premises infrastructure and cloud infrastructure?
- How do we handle data governance, data lineage, data quality?
- What kind of monitoring infrastructure would be required to support distributed processing?
- How do we model metadata so that we can address domain specific problems?
- How do we test this infrastructure? What kind of automation is required?
- How do we create a service delivery platform for build and deployment?
One of the challenges I am seeing right now is the urge to use multiple technologies to solve similar problems. Though this gives my developers the freedom to do things differently and more efficiently, from a platform perspective it increases the total cost of operations.
- How do we support our customers in production?
- How can we make life better for our operations teams?
- How do we take care of reliability, durability, scalability, extensibility and maintainability of this platform?
I will talk about the data repository and possible choices in the next post.
Three incremental, manageable steps to building a “data first” data lake
Applications have always dictated the data. That has made sense historically, and to some extent, continues to be the case. But an “applications first” approach creates data silos that are causing operational problems and preventing organizations from getting the full value from their business intelligence initiatives.
Azure SQL Data Warehouse: Introduction
Azure SQL Data Warehouse is a fully-managed and scalable cloud service.
The Informed Data Lake: Beyond Metadata
Historically, the volume and extent of data that an enterprise could store, assemble, analyze and act upon exceeded the capacity of their computing resources and was too expensive. The solution was to model some extract of a portion of the available data into a data model or schema, presupposing what was “important,” and then fit the incoming data into that structure.
Real Time Streaming with Spring XD, Apache Geode (GemFire), and Greenplum
Spring XD is a unified, distributed, and extensible service for data ingestion, real time analytics, batch processing, and data export.
Data Orchestration using Hortonworks DataFlow (HDF)
Hortonworks DataFlow (HDF), powered by Apache NiFi, is the first integrated platform that solves the real time complexity and challenges of collecting and transporting data from a multitude of sources, be they big or small, fast or slow, always connected or intermittently available.
Some of the interesting links I have read in the last couple of weeks around data, in-memory databases, pipeline development and analytics.
In Search of Database Nirvana
An excellent post providing an in-depth look at the possibilities and the challenges for companies that long for a single query engine to rule them all.
Aerospike Vs Cassandra Comparison
A comparison of Aerospike with Apache Cassandra. Cassandra is a wide-column NoSQL database that is great for ingesting and analyzing hundreds of terabytes of data stored on rotational disks. Aerospike is an in-memory NoSQL database, a key-value store that can run purely in RAM and is also optimized for storing data in Flash (SSDs).
An overview of Apache Streaming Technologies
A very good comparison of technologies around simple event processors, stream processors, and complex event processors.
Flow based Programming
Flow-Based Programming defines applications using the metaphor of a “data factory”. It views an application not as a single, sequential process, which starts at a point in time, and then does one thing at a time until it is finished, but as a network of asynchronous processes communicating by means of streams of structured data chunks, called “information packets” (IPs).
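The idea can be sketched in a few lines: three independent processes connected by bounded queues, each passing information packets downstream. This is only an illustration using Python threads and `queue.Queue`; the process names (`generate`, `transform`, `collect`), the `DONE` sentinel, and the sample data are assumptions made for the example, not part of any Flow-Based Programming library.

```python
import queue
import threading

DONE = object()  # sentinel packet marking the end of a stream


def generate(out_q, items):
    """Source process: emit each item as an information packet."""
    for item in items:
        out_q.put(item)
    out_q.put(DONE)


def transform(in_q, out_q, fn):
    """Middle process: apply fn to each packet and pass it downstream."""
    while True:
        ip = in_q.get()
        if ip is DONE:
            out_q.put(DONE)
            return
        out_q.put(fn(ip))


def collect(in_q, results):
    """Sink process: gather packets until the stream ends."""
    while True:
        ip = in_q.get()
        if ip is DONE:
            return
        results.append(ip)


def run_network(items):
    """Wire the three processes together with bounded connections and run them."""
    a, b = queue.Queue(maxsize=4), queue.Queue(maxsize=4)
    results = []
    procs = [
        threading.Thread(target=generate, args=(a, items)),
        threading.Thread(target=transform, args=(a, b, str.upper)),
        threading.Thread(target=collect, args=(b, results)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return results


print(run_network(["order", "invoice", "receipt"]))  # ['ORDER', 'INVOICE', 'RECEIPT']
```

Each process knows only its input and output connections, so the same components can be rewired into a different network without changing their code.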
Hadoop Deployment Cheat Sheet
If you are using, or planning to use, the Hadoop framework for big data and Business Intelligence (BI), this document can help you navigate some of the technology and terminology, and guide you in setting up and configuring the system.
Amazon Redshift for Custom Analytics
An experience summary on building custom analytics on top of Redshift.
Building Analytics at 500px
An experience summary of how 500px built its analytics ecosystem.
How Artificial Intelligence Will Kickstart the Internet of Things?
IoT will produce a tsunami of big data. As the rapid expansion of devices and sensors connected to the Internet of Things continues, the sheer volume of data being created by them will increase to an astronomical level. This data will hold extremely valuable insights into what’s working well or what’s not.