Introduction
Since last year, I have been focusing my efforts on
understanding Big Data technologies and how they can impact and
improve my work around data warehousing and Business
Intelligence.
With innovation on multiple fronts, such as Cloud Infrastructure
Services, Open Source technologies and free Hadoop distributions
(more on this later), it's now possible for data geeks to play with
these big data technologies without huge capital investments.
In this introductory post, I will provide some reasons why it's
important for data professionals to take note of these emerging
trends. I will try to explain concepts in a technology-agnostic
manner.
In later posts, I would like to show how we can set up a Big
Data platform cluster which can be used for various projects.
Note: In future, I also hope to post a separate series on Data
Science, as it is a distinct domain in its own right but also aligns
very well with the Big Data trend.
Inception
Why focus on Big Data technologies? To understand that, we have
to understand some basic characteristics of any dataset and how it's
used in the real world.
- 1) Data travels through a series of phases, analogous
to an oil refinery or a product manufacturing line. It's not greatly
useful for analytics in its original raw form.

- 2) Recent trends show that the data generated in the world is
growing exponentially. Not only that, but with the advent of web
APIs, mobile and the Internet of Things, data is available in more
formats than ever before. Speed of access is also considered a basic
feature of the technology.

- 3) For a typical Business Analytics solution, there is a wide
range of operational work-streams that have to be taken
care of.

In essence for a data project we are looking at
For the remainder of this post, I will focus on Technical and
Hardware Architecture.
Scalability
Scalability can take one of two strategies. We can either scale
up (buying bigger machines) or scale out (adding small
machines incrementally). Scaling up requires a lot of operational
overhead in terms of migration. Also, cost
increases exponentially for specialised hardware.
In the scale-out model, nodes are added as required, which does
not require system downtime. Cost is incremental and fairly low,
as these solutions typically target
commodity hardware.
With the increasing pace of data availability across businesses
and on the web, a scale-out strategy is best placed to handle big
data. In terms of technology, though, while we have good foundations
to set up scale-out infrastructure and the relevant communication
protocols between machines, traditional RDBMS technologies cannot
offer higher-level orchestration of workloads. Traditional
RDBMS vendors do provide federation support, which essentially means
breaking up components of the RDBMS platform
and dedicating certain nodes to a specific type of work.
However, it has a few limitations:
- Co-ordination of work across machines is left to
architects and developers. The solution is heavily customised to a
client/project and complex to maintain.
- Adding and removing nodes from the cluster depends
on developers reconfiguring some parts to make sure systems
work correctly for end users.
- These technologies typically have very limited
support for parallel processing across machines. The parallel
execution optimisations which all RDBMS vendors provide are a
different thing.
- Developers cannot interface with the system in a way that lets
them package up a unit of work and instruct a set of nodes to do the
same processing, each local to its own data.
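
To make that last point concrete, here is a minimal sketch in Python of what "package up a unit of work and run it local to the data" could look like. Everything in it is an assumption for illustration: the node names, the sample numbers and the helper run_on_all_nodes are made up, and threads stand in for real machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each "node" holds only its own slice of the data.
NODE_DATA = {
    "node-a": [120, 80, 45],
    "node-b": [200, 15],
    "node-c": [60, 60, 60, 10],
}

def unit_of_work(local_rows):
    """The packaged work: runs where the data lives and returns a small summary."""
    return sum(local_rows), len(local_rows)

def run_on_all_nodes(work, node_data):
    """Send the same function to every node and collect the partial results."""
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        futures = {name: pool.submit(work, rows) for name, rows in node_data.items()}
        return {name: future.result() for name, future in futures.items()}

partials = run_on_all_nodes(unit_of_work, NODE_DATA)

# Only the small summaries travel back; the raw rows never leave their node.
total = sum(s for s, _ in partials.values())
count = sum(c for _, c in partials.values())
print(partials)
print("overall average:", total / count)
```

The design point is that the function moves to the data rather than the data moving to the function, which is what a federated RDBMS setup does not let developers express directly.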
Distributed Systems
So, as we understand from the above, we need a higher-level
technology platform which is highly distributed in nature and can
work in parallel to cope with the scalability problem of big data.
But what exactly do we need?
Distributed File System - Luckily, we don't
require a specialised operating system for this technology. That
would be asking too much. Traditional operating systems are good
enough for their part in keeping the machine running on the network.
What we do need is a solution to break the barrier of the
single-machine address space.
The default file system (FS) can only
understand an address location issued from within its own
machine. A distributed file system (DFS), on the other hand, spans
all machines in a cluster and can interact with address space
across machines. This enables higher-level components sitting on
top to see the DFS as a single server storage disk.
Side note on Windows
OS vs Open Source OS: Because of the closed nature of
Windows OS, there are virtually no DFS components available for
Windows. There was some early work from Microsoft Research
on this, but it was either absorbed into their commercial products
or simply abandoned. On the other hand, Linux is very
popular in the Open Source world. Consequently, many DFS modules are
available for Linux, and today's big data technologies build on
this. Of course, Linux distribution vendors standardise
these pluggable modules in their distros, and you would certainly
want to pick a vendor for enterprise-grade support. I
typically go for Ubuntu from Canonical, because it has
all the goodness of Linux but also a nice GUI feel which
Windows users like me will be comfortable with.
FS Interface - In terms of big data technology, the
FS Interface acts as a generic API to work across multiple
DFS/FS components.
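
As a rough illustration of such a generic API, the Python sketch below defines one small interface and two in-memory stand-ins behind it. The class names (FileSystem, InMemoryLocalFS, InMemoryShardedFS) and the copy_file helper are hypothetical and do not belong to any real big data product; the "sharded" class only imitates a DFS by spreading files over a few dictionaries.

```python
from abc import ABC, abstractmethod

class FileSystem(ABC):
    """The generic interface that higher layers program against."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...

class InMemoryLocalFS(FileSystem):
    """Stand-in for a single-machine file system."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        return self._files[path]

class InMemoryShardedFS(FileSystem):
    """Stand-in for a DFS: each file lands on one of several 'machines'."""

    def __init__(self, machines=3):
        self._machines = [{} for _ in range(machines)]

    def _pick(self, path):
        # Deterministic placement based on the bytes of the path.
        return self._machines[sum(path.encode()) % len(self._machines)]

    def write(self, path, data):
        self._pick(path)[path] = data

    def read(self, path):
        return self._pick(path)[path]

def copy_file(src: FileSystem, dst: FileSystem, path: str) -> None:
    """Client code only sees the interface, never the concrete backend."""
    dst.write(path, src.read(path))

local = InMemoryLocalFS()
cluster = InMemoryShardedFS()
local.write("/logs/a.txt", b"hello")
copy_file(local, cluster, "/logs/a.txt")
print(cluster.read("/logs/a.txt"))
```

Because copy_file depends only on the interface, a different FS or DFS component could be swapped in without touching the client code.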
The idea of the Object Storage layer is
to provide a methodology to break up large files into chunks
and store them across a few machines. With this we get advantages
like replication for redundancy and options for parallel processing
on local data. A higher-level abstraction of meta-data exists to
rebuild the entire file for end users/client programs.
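
The chunking idea can be sketched in a few lines of Python. This is only a toy under stated assumptions: the chunk size, the replica count, the round-robin placement and the names store_file and read_file are all invented for the example, and real systems make these choices very differently.

```python
CHUNK_SIZE = 4   # bytes; tiny on purpose so the example is easy to follow
REPLICAS = 2     # number of copies kept of every chunk

def store_file(name, data, machines, metadata):
    """Split data into chunks, copy each chunk to REPLICAS machines, record where."""
    names = sorted(machines)
    placements = []
    for offset in range(0, len(data), CHUNK_SIZE):
        index = offset // CHUNK_SIZE
        chunk_id = f"{name}#{index}"
        # Round-robin placement: a primary machine plus the next one(s) along.
        targets = [names[(index + r) % len(names)] for r in range(REPLICAS)]
        for target in targets:
            machines[target][chunk_id] = data[offset:offset + CHUNK_SIZE]
        placements.append((chunk_id, targets))
    metadata[name] = placements

def read_file(name, machines, metadata, down=()):
    """Rebuild the file from the meta-data, skipping machines that are down."""
    parts = []
    for chunk_id, targets in metadata[name]:
        alive = [t for t in targets if t not in down]
        if not alive:
            raise IOError(f"all replicas of {chunk_id} are unavailable")
        parts.append(machines[alive[0]][chunk_id])
    return b"".join(parts)

machines = {"m1": {}, "m2": {}, "m3": {}}
metadata = {}
store_file("sales.csv", b"id,amount\n1,250\n", machines, metadata)

print(read_file("sales.csv", machines, metadata))               # all machines up
print(read_file("sales.csv", machines, metadata, down={"m1"}))  # one machine lost
```

The client only ever asks for "sales.csv"; the meta-data is what knows which chunks exist and where their copies live.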
Now, these pieces of files sitting on many machines are
not very useful as such in their binary form. We need a method to
understand the data stored in them, and that's where the Data Storage
layer comes in. It needs mechanisms to store and modify
meta-data and relationships. It also needs to provide APIs so that
end clients can query it. Here again we need distributed system
features to support the horizontal scalability characteristics of big
data.
Computation Layer - New distributed systems
demand new ways of thinking about computation, which in turn will
affect how developers program their applications. Big data
technologies currently default to an application programming
model called Map-Reduce.
But in future there will be more pluggable models for specific
problem domains, e.g. graph analysis. In fact, I
have seen that Microsoft
Research was working on a project called Dryad
(now inactive), which is much more complex than Map-Reduce.
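
For readers who have not met Map-Reduce before, here is a minimal single-process sketch of the programming model in Python, using the usual word-count example. It is not how any particular framework implements it; the function names map_phase, shuffle and reduce_phase are chosen for this illustration, and on a real cluster the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)

def map_reduce(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = map_reduce(["big data is big", "data is everywhere"])
print(counts)
```

The developer writes only the map and reduce functions; splitting the input, grouping by key and running the pieces in parallel is the platform's job.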
If the Data Storage layer is the heart of the big data platform,
Map-Reduce is the brains. The Application Manager determines how
tasks are to be run (in parallel) and merged, while the Resource
Scheduler co-ordinates which computing resources are to be allocated
for these jobs.
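
The split of responsibilities between the two components can be sketched roughly as below. This is a sequential toy in Python, not a description of any real scheduler: the slot counts, the "most free slots first" rule and the method names are assumptions made for the example, and tasks run one after another here where a real platform would run each wave in parallel.

```python
from collections import deque

class ResourceScheduler:
    """Knows how many free task slots each node has and hands them out."""

    def __init__(self, slots):
        self.free = dict(slots)

    def allocate(self):
        # Pick the node with the most free slots; None means the cluster is full.
        node = max(self.free, key=self.free.get)
        if self.free[node] == 0:
            return None
        self.free[node] -= 1
        return node

    def release(self, node):
        self.free[node] += 1

class ApplicationManager:
    """Decides which tasks to run, in what waves, and how to merge the results."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def run(self, chunks, task, merge):
        pending = deque(chunks)
        results, assignments = [], []
        while pending:
            wave = []
            # Fill every slot the scheduler is willing to give us.
            while pending:
                node = self.scheduler.allocate()
                if node is None:
                    break
                wave.append((node, pending.popleft()))
            if not wave:
                raise RuntimeError("no computing resources available")
            for node, chunk in wave:
                results.append(task(chunk))
                assignments.append(node)
                self.scheduler.release(node)
        return merge(results), assignments

scheduler = ResourceScheduler({"n1": 2, "n2": 1})
manager = ApplicationManager(scheduler)
total, assignments = manager.run([[1, 2], [3, 4], [5], [6, 7, 8]], task=sum, merge=sum)
print(total, assignments)
```

The Application Manager never looks at slot counts, and the Resource Scheduler never looks at the data, which mirrors the division described above.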
Today, a lot of ecosystem technologies are developed to work
with these core layers of the big data engine. In the next post I
will touch upon some technologies which are part of this big data
platform.