Developing a robust data platform definitely requires more than HDFS, Hive, Sqoop and Pig. Today there is a real need to bring data and compute as close together as possible. More and more requirements are forcing us to deal with high-throughput, low-latency scenarios. Thanks to in-memory solutions, this now seems achievable.
One of the lessons I have learnt in the last few years is that it is hard to resist developing your own technology infrastructure while developing a platform infrastructure. It is always important to remind ourselves that we are here to build solutions and not technology infrastructure.
Some of the key questions that need to be considered while embarking on such a journey are:
- How do we handle the ever-growing volume of data (Data Repository)?
- How do we deal with the growing variety of data (Polyglot Persistence)?
- How do we ingest large volumes of data as we start growing (Ingestion Pipelines/Write Efficient)?
- How do we scale in terms of faster data retrieval so that the Analytics engine can provide something meaningful at a decent pace?
- How do we deal with the need for Interactive Analytics with a large dataset?
- How do we keep our cost per terabyte low while taking care of our platform growth?
- How do we move data securely between on-premise infrastructure and cloud infrastructure?
- How do we handle data governance, data lineage and data quality?
- What kind of monitoring infrastructure would be required to support distributed processing?
- How do we model metadata so that we can address domain-specific problems?
- How do we test this infrastructure? What kind of automation is required?
- How do we create a service delivery platform for build and deployment?
One of the challenges I am seeing right now is the urge to use multiple technologies to solve similar problems. Though this gives my developers the freedom to do things differently or more efficiently, from a platform perspective it increases the total cost of operations.
- How do we support our customers in production?
- How can we make the lives of our operations teams better?
- How do we take care of the reliability, durability, scalability, extensibility and maintainability of this platform?
I will talk about the data repository and the possible choices in the next post.