Big Data, Hadoop and the preparation for the future


The backdrop and history:

Until very recently, the general approach to data processing was to manage data through pre-defined structures, i.e. to understand data and handle them through metadata. We defined the metadata first, transformed the data into those structures, stored the data in those formats, processed them and brought them out in structures too. Data were therefore defined by, and shackled to, the metadata. This was the age where data generation was a well-defined activity, and it served well for as long as data remained manageable.

The present world revolves around data: data are created in every transaction and in business, social and indeed any human activity. Any transaction is data driven and data oriented. Any activity involves at least two individuals or institutions, and anything they do generates not one element but a set of data elements. Typically, every action generates many sets of data of varying nature, format and content. The data are generated quickly, at paces far beyond the best processing speeds, and they follow no fixed or pre-conceived format.

Data from guided, regular processing are structured and can be handled through one structure or another. Such repetitive data could easily be treated as rows with the same columns, where only the values of the columns differed: the rows and columns stayed the same and repeated, and merely the values changed.

A whole new technology for handling such data developed and dominated the field for at least four decades. Taking its cue from set theory and relational algebra, this data-processing technology worked wonders, and the standard came to be what we know as the relational, row-based database management system.

The problem:

Then data poured in with the utmost chaos and irregularity. It came in hordes, it came in different formats and natures, and it came in changing patterns. The biggest challenge of data processing emerged when, from moment to moment, data followed no regular or even hidden rules or formats. Data arrive in the most fractured and most unruly forms. The structures are random and varied; this is the nightmare. It is difficult to figure out what structure the next minute's data will arrive with. The dynamic nature and structure of data make it humanly impossible either to impose a pre-conceived data structure, however smart, or to dig out and fathom the structure from within the data pool. The latter would have been possible if the innate structure were fixed, but in reality it is not.

Sources:

Data come from business transactions, and these are mostly structured, or semi-structured at worst. Data also come from social network sites: behavioural information from pleasantry exchanges, non-business interactions, and data from beyond the primary or direct interlocutors. A person is known by the friends he keeps, and also by the second level of friends, that is, the friends of a friend, and so forth. As an example, a practitioner or a social activist may have a friend circle of modest reputation, but those “friends” may count renowned people among their own friends.

The social position of such a person is elevated through this secondary level of friendship. Businesses would like to identify popular people who have a strong and broad friend circle, and also those whose friend circle is especially influential. The messages transmitted through these interactions also give a picture of the personality, so the subject matter becomes important. Booksellers and other commodity sales outfits use this knowledge to offer individuals very focussed and persuasive suggestions.
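
As a rough illustration of the friend-of-a-friend idea, the sketch below walks one step beyond a person's direct circle to find the renowned contacts reachable through otherwise modest friends. The graph, the names and the whole scenario are hypothetical; a real system would compute this over a massive social graph rather than an in-memory map.

    import java.util.*;

    public class FriendOfFriend {

        // Hypothetical friendship graph: each person maps to his or her direct friends.
        static final Map<String, Set<String>> GRAPH = Map.of(
                "activist",   Set.of("friendA", "friendB"),
                "friendA",    Set.of("activist", "celebrity1"),
                "friendB",    Set.of("activist", "celebrity2", "celebrity3"),
                "celebrity1", Set.of("friendA"),
                "celebrity2", Set.of("friendB"),
                "celebrity3", Set.of("friendB"));

        // Friends of friends: one step beyond the direct circle,
        // excluding the person and the direct friends themselves.
        static Set<String> secondLevelCircle(String person) {
            Set<String> direct = GRAPH.getOrDefault(person, Set.of());
            Set<String> second = new HashSet<>();
            for (String friend : direct) {
                second.addAll(GRAPH.getOrDefault(friend, Set.of()));
            }
            second.remove(person);
            second.removeAll(direct);
            return second;
        }

        public static void main(String[] args) {
            // The "modest" activist turns out to reach several renowned people.
            System.out.println(secondLevelCircle("activist"));
        }
    }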

Perusing unstructured, non-business data means sifting through a huge base of the most unstructured data there is. The known social network applications vary so much in orientation and in the data they hold that no single application or tool can bring out the same type of information from all of them; one has to use different methods to figure out the results.

Big Data:

Big Data is now a generic name for dealing with widely varying data of varying nature and format, and in very big proportions. Until now we were actually using application integration and then processing the final data from one single massive data pool. The world has changed under the pressure of this massiveness. Big Data is a technique of storing and retrieving data that are kept in distributed clusters.

The data search is parallel, and the sending of data to the central processor is also made parallel; the request/response protocol has to be handled in massively parallel threads. Handling huge databases through clustering, processing data in batch and historically, using non-row-based technology such as NoSQL for multi-structured data in real-time and web-based applications, complemented by massively parallel analytic databases complete with what-if perturbations and predictive models: these are the areas where Big Data is treated. This is the place that is helped by frameworks and infrastructure platforms like Hadoop.
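
As a simplified, hypothetical illustration of treating requests in parallel threads, the sketch below fires the same request at several made-up data sources at once and gathers the responses; the source names and the query method are stand-ins, not the API of any particular Big Data product.

    import java.util.*;
    import java.util.concurrent.*;

    public class ParallelFanOut {

        // Stand-in for querying one data source; a real system would hit HDFS, a NoSQL store, etc.
        static String query(String source, String request) {
            return source + " -> result for '" + request + "'";
        }

        public static void main(String[] args) throws Exception {
            List<String> sources = List.of("transactions", "clickstream", "social-feed");
            ExecutorService pool = Executors.newFixedThreadPool(sources.size());

            // Fire the same request at every source in parallel.
            List<Future<String>> futures = new ArrayList<>();
            for (String source : sources) {
                futures.add(pool.submit(() -> query(source, "customer 42")));
            }

            // Gather the partial answers and merge them into one response.
            for (Future<String> f : futures) {
                System.out.println(f.get());
            }
            pool.shutdown();
        }
    }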

Hadoop as a special aid:

Hadoop is a composite structure of applications and technologies that includes:

  1. HDFS: a special distributed file system used to manage data across the cluster (see the sketch after this list).
  2. NameNode: the node within a cluster that keeps track of which nodes hold which pieces of related data.
  3. Secondary NameNode: a helper to the NameNode; it periodically merges and stores the NameNode's metadata so that the namespace can be recovered if the primary NameNode fails.
  4. JobTracker: the node that initiates and coordinates the processing of a job.
  5. Slave nodes: the nodes that store the data and carry out the processing the JobTracker directs to them.
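
Below is a minimal sketch, in Java, of how a client touches HDFS through Hadoop's FileSystem API. It assumes a running cluster whose NameNode address is supplied via the fs.defaultFS setting (in core-site.xml or in the Configuration), and the path used is purely illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumes fs.defaultFS (e.g. hdfs://namenode:9000) is configured for the cluster.
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/tmp/hello.txt");   // illustrative path

            // Write a small file; HDFS splits it into blocks and replicates them across DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello from HDFS");
            }

            // Read it back; the NameNode tells the client which DataNodes hold the blocks.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }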

A technique called MapReduce is used to navigate data within Hadoop through data mapping and data distribution. For example, an unstructured query can be fired at all possible data types across differing data sources. Such parallel firing of requests brings in data from various sources in differing formats; these data are then processed through the map and reduce steps, and the optimized final result is returned as the possible answer to the query.
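
To make this concrete, here is a minimal sketch of the classic word-count job written against Hadoop's MapReduce Java API; the input and output paths are illustrative and assume the data already sit in HDFS.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: each mapper reads the split stored near it and emits (word, 1) pairs.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    ctx.write(word, ONE);
                }
            }
        }

        // Reduce phase: the framework groups pairs by word and the reducer sums the counts.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/in"));     // illustrative HDFS input
            FileOutputFormat.setOutputPath(job, new Path("/data/out"));  // illustrative HDFS output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }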

Through many such simulations, data scientists can continually refine and test queries, reaching deeper and evolving insights. The downside, however, is that this is still work in progress: Hadoop technology is not fully mature and is going through a myriad of new innovations and stabilization.

Skills that need to be developed:

Hadoop is not intuitive, and neither is Big Data handling. Skilled practitioners are few and far between, just as the organizations that need to process data at the petabyte-to-exabyte scale are still few. Data processing with Hadoop is something that needs not only initiation and training but also practice and experience in handling massive databases across various applications.

This is difficult. It is always prudent to learn it from experts like Wingnity, who teach Hadoop and Big Data handling to Java professionals. This is one company that scales up to and competes with big-ticket companies at much more affordable rates. Experts are the better repositories of this kind of advanced technology, and it is always prudent to learn from hands-on experts who need more people to assist in their venture to tackle the data needs of the business world.

The post Big Data, Hadoop and the preparation for the future appeared first on Wingnity Blog.

