
Big Data Technologies for Data Professionals - Introduction

Introduction

 

Since last year, I have been focusing my efforts on understanding Big Data technologies and how they can impact and improve my work around Data Warehousing and Business Intelligence.

 

With innovation on multiple fronts like Cloud Infrastructure Services, Open Source technologies and free Hadoop distributions (more on this later), it's now possible for data geeks to play with these big data technologies without huge capital investments.

 

In this introductory post, I will provide some reasons why it's important for data professionals to take note of these emerging trends. I will try to explain the concepts in a technology-agnostic manner.

In later posts, I would like to show how we can set up a Big Data platform cluster which can be used for various projects.

 

Note: In future, I also hope to post a separate series on Data Science, as it is a distinct domain in its own right but also aligns very well with the Big Data trend.

 

Inception

 

Why focus on Big Data technologies? To understand that, we have to understand some basic characteristics of any dataset and how it is used in the real world.

 

  • 1) Data travels through a series of phases, analogous to an oil refinery or a product manufacturing line. It is not greatly useful for analytics in its original raw form.

 

[Figure: Data value chain]

 

  • 2) Recent trends show that the data generated in the world is growing exponentially. Not only that, but with the advent of web APIs, mobile and the Internet of Things, a wide variety of data formats is available like never before. Speed of access is also now considered a basic feature of any technology.

 

[Figure: The 3 Vs of big data - Volume, Variety and Velocity]

 

  • 3) For a typical Business Analytics solution, there is a wide range of operational work-streams that have to be taken care of.

[Figure: Solution overview]

 

In essence, for a data project we are looking at:

 

[Figure: All architectures]

 

For the remainder of this post, I will focus on the Technical and Hardware Architecture.

 

Scalability

 

Scalability can take one of two strategies: we can either scale up (buying bigger machines) or scale out (adding small machines incrementally). Scaling up requires a lot of operational overhead in terms of migration, and costs rise steeply for specialised hardware.

 

In the scale-out model, nodes are added as required without system downtime. Cost is incremental and fairly low, as these solutions typically target commodity hardware.

 

[Figures: Vertical scale vs. horizontal scale]
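To make the scale-out idea concrete, here is a minimal Python sketch of how records can be spread across a cluster by hashing their keys. The node names and record keys are invented purely for illustration:

import hashlib

# Hypothetical cluster of commodity nodes; the names are made up.
nodes = ["node-01", "node-02", "node-03"]

def node_for_key(key):
    # Hash the record key so records spread evenly across the cluster.
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return nodes[digest % len(nodes)]

# Each record lands on a deterministic node.
for customer_id in ["C1001", "C1002", "C1003", "C1004"]:
    print(customer_id, "->", node_for_key(customer_id))

Note that with this plain modulo placement, adding a node re-maps most keys; real scale-out systems typically use consistent hashing so that only a small fraction of keys move when the cluster grows.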

 

With the increasing pace of data availability across businesses and on the web, the scale-out strategy is best placed to handle big data. In terms of technology, we have good foundations for setting up scale-out infrastructure and the relevant communication protocols between machines, but traditional RDBMS technologies cannot offer higher-level orchestration between workloads. Traditional RDBMS vendors do provide federation support, which essentially means breaking up components of the RDBMS platform and dedicating machines to a specific type of work on certain nodes. However, it has a few limitations:

 

  • Co-ordination of work across machines is left to architects and developers. The solution is heavily customised to a client/project and complex to maintain.

 

  • Adding and removing nodes from the cluster depends on developers reconfiguring parts of the system to make sure it works correctly for end users.

 

  • These technologies typically have very limited support for parallel processing across machines. The parallel execution optimisations that all RDBMS vendors provide are a different thing: they work within a single machine.

 

  • Developers cannot interface with the system in a way that packages up a unit of work and instructs a set of nodes to do the same processing locally on their own data.

 

Distributed Systems

 

So, as we understand from the above, we need a higher-level technology platform which is distributed in nature and can work in parallel to cope with the scalability problem of big data. But what exactly do we need?

[Figure: Distributed analytics]

 

Distributed File System - Luckily, we don't require a specialised operating system for this technology; that would be asking too much. Traditional operating systems are good enough for their part of keeping the machine running on the network. What we do need is a solution that breaks the barrier of a single machine's address space.

 

The default File System (FS) can only understand address locations within its own machine. A Distributed File System (DFS), on the other hand, spans all machines in a cluster and can interact with address space across them. This enables higher-level components sitting on top to see the DFS as a single server's storage disk.

 

Side note on Windows vs Open Source operating systems: Because of the closed nature of Windows, there are virtually no DFS components available for it. There was some early work from Microsoft Research in this area, but it was either absorbed into commercial products or simply abandoned. Linux, on the other hand, is hugely popular in the Open Source world; consequently, many DFS modules are available for Linux, and today's big data technologies build on this. Of course, Linux distribution vendors standardise these pluggable modules in their distros, and you would certainly want to pick a vendor for enterprise-grade support. I typically go for Ubuntu from Canonical, because it has all the goodness of Linux but with a nice GUI feel that Windows users like me will be comfortable with.

 

FS Interface - In terms of big data technology, the FS interface acts as a generic API that works across multiple DFS/FS components.
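As a rough illustration of that idea (the class and method names below are my own, not from any particular product), the interface can be modelled as an abstraction that client code programs against, with one concrete implementation per underlying file system:

from abc import ABC, abstractmethod

class FileSystem(ABC):
    """Hypothetical generic FS interface; one subclass per backend."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class LocalFileSystem(FileSystem):
    """Backed by the machine's own disk via the standard library."""

    def read(self, path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()

    def write(self, path: str, data: bytes) -> None:
        with open(path, "wb") as f:
            f.write(data)

# A DistributedFileSystem class would implement the same two methods,
# so higher-level components never care which backend they talk to.

Higher-level components depend only on the generic interface, so swapping the local disk for a distributed backend requires no change to client code.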

 

The idea of the Object Storage layer is to provide a methodology for breaking up large files into chunks and storing them across several machines. With this we get advantages like replication redundancy, parallel processing options on local data, and so on. A higher-level abstraction of metadata exists to reassemble the entire file for end users/client programs.
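A minimal sketch of that methodology follows; the 64 MB chunk size, replica count, node names and file name are arbitrary choices for illustration:

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, a common but arbitrary choice
REPLICAS = 2                    # each chunk stored on 2 different nodes
nodes = ["node-01", "node-02", "node-03"]

def chunk_plan(file_size, file_name):
    """Return metadata mapping each chunk to the nodes that hold it."""
    plan = []
    num_chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    for i in range(num_chunks):
        # Round-robin placement; real systems also consider rack awareness.
        holders = [nodes[(i + r) % len(nodes)] for r in range(REPLICAS)]
        plan.append({"chunk": f"{file_name}#{i}", "nodes": holders})
    return plan

# A 200 MB file becomes 4 chunks, each living on 2 nodes.
for entry in chunk_plan(200 * 1024 * 1024, "sales.csv"):
    print(entry)

The returned plan is exactly the kind of higher-level metadata described above: enough to rebuild the file, or to ship work to the nodes that already hold the data.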

 

Now, these pieces of files sitting on many machines are not very useful in their raw binary form. We need a method to understand the data stored in them, and that's where the Data Storage layer comes in. It needs mechanisms to store and modify metadata and relationships, and it also needs to provide APIs so that end clients can query it. Again, we need distributed-system features here to support the horizontal scalability characteristics of big data.

 

Computation Layer - New distributed systems demand new ways of thinking about computation, which in turn affects how developers program their applications. Big data technologies currently default to an application programming model called Map-Reduce, but in future there will be more pluggable models for specific problem domains, e.g. graph analysis. In fact, Microsoft Research was working on a project called Dryad (now inactive), which supports more general dataflow graphs than Map-Reduce.

 

If the Data Storage layer is the heart of the big data platform, Map-Reduce is the brain. The Application Manager determines how tasks are to be run (in parallel) and merged, while the Resource Scheduler co-ordinates which computing resources are allocated to these jobs.
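To give a feel for the programming model, here is a minimal single-machine word-count sketch of Map-Reduce in Python. A real engine would run the map calls on the nodes holding each data chunk and shuffle the intermediate pairs between machines, but the shape of the program is the same:

from collections import defaultdict
from multiprocessing import Pool   # stands in for cluster parallelism

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(mapped):
    """Reduce: merge all emitted pairs, summing the counts per word."""
    totals = defaultdict(int)
    for pairs in mapped:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

if __name__ == "__main__":
    chunks = ["big data needs big machines", "data beats big machines"]
    with Pool(2) as pool:            # each chunk mapped in parallel
        mapped = pool.map(map_phase, chunks)
    print(reduce_phase(mapped))      # {'big': 3, 'data': 2, ...}

The point is that the developer only writes the map and reduce functions; the platform decides where and how many copies of them run, which is exactly the unit-of-work packaging that federated RDBMS setups lack.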

 

Today, a lot of ecosystem technologies have been developed to work with these core layers of the big data engine. In the next post, I will touch upon some of the technologies that are part of this big data platform.

 

 

 
