Thursday, 11 April 2013

COMPARISON WITH OTHER SYSTEMS | Hadoop Tutorial

Comparison with RDBMS
Unless you are dealing with very large volumes of unstructured data (hundreds of gigabytes, terabytes, or petabytes) and have a large number of machines available, you will likely find the performance of Hadoop running a MapReduce query much slower than a comparable SQL query on a relational database. Hadoop uses a brute-force access method, whereas RDBMSs have optimization methods for accessing data, such as indexes and read-ahead.
The benefits really only come into play once massive parallelism is achieved, or when the data is unstructured to the point where no RDBMS optimizations can be applied to help the performance of queries.
As with all benchmarks, everything has to be taken into consideration. For example, if the data starts life in a text file in the file system (e.g. a log file), the cost of extracting that data from the text file, structuring it into a standard schema, and loading it into the RDBMS has to be considered. If you have to do that for 1,000 or 10,000 log files, it may take minutes, hours, or days (with Hadoop, you still have to copy the files to its file system). For some environments it may be practically impossible to load such data into an RDBMS at all, because data is generated at a volume that a load process cannot keep up with. So while your query time with Hadoop may be slower (speed improves with more nodes in the cluster), your access time to the data may be improved.
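To see how light that ingestion step is, here is a minimal sketch using Hadoop's FileSystem API to copy a local directory of log files into HDFS unchanged. The paths and the namenode URI are illustrative assumptions, not fixed values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// A minimal sketch: copying raw log files into HDFS as-is, with no
// schema design or ETL step. The paths and cluster URI are illustrative.
public class LogIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS would normally come from core-site.xml;
        // set here (with an assumed hostname) for clarity.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local log directory into HDFS unchanged. No parsing,
        // no schema, no indexes -- structure is applied later, at
        // MapReduce (query) time.
        fs.copyFromLocalFile(new Path("/var/log/app/"),
                             new Path("/data/raw-logs/"));
        fs.close();
    }
}

Contrast this with an RDBMS load, which requires parsing each line, mapping fields onto a predefined schema, and maintaining indexes as rows arrive.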
Also, as there aren't any mainstream RDBMSs that scale to thousands of nodes, at some point the sheer mass of brute-force processing power will outperform the optimized, but scale-restricted, relational access method. In our current RDBMS-dependent web stacks, scalability problems tend to hit hardest at the database level.
For applications with just a handful of common use cases that access a lot of the same data, distributed in-memory caches such as memcached provide some relief. However, for interactive applications that hope to reliably scale and support vast amounts of I/O, the traditional RDBMS setup isn't going to cut it. Unlike small applications that can fit their most active data into memory, applications that sit on top of massive stores of shared content require a distributed solution if they hope to survive the long-tail usage pattern commonly found on content-rich sites.

We also can't use databases with lots of disks for large-scale batch analysis, because seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk's head to a particular place on the disk to read or write data; it characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than to stream through it, which operates at the transfer rate (a rough calculation follows below).

On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses sort/merge to rebuild the database.
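To make the seek-versus-transfer trade-off concrete, here is a back-of-the-envelope calculation. The drive figures (10 ms average seek, 100 MB/s sustained transfer) and the 100 KB record size are assumed round numbers, not measurements.

// Back-of-the-envelope: streaming vs. seek-dominated access for ~1 TB.
// All drive characteristics below are assumed round numbers.
public class SeekVsStream {
    public static void main(String[] args) {
        double seekMs = 10.0;            // assumed average seek time (ms)
        double transferMBps = 100.0;     // assumed sustained transfer rate (MB/s)
        double datasetMB = 1_000_000.0;  // ~1 TB dataset
        double recordKB = 100.0;         // assumed record size

        // Streaming through the whole dataset at the transfer rate:
        double streamHours = (datasetMB / transferMBps) / 3600.0;

        // Reaching every record with an individual seek (transfer time
        // for each record ignored -- seeks dominate):
        double records = datasetMB * 1024.0 / recordKB;
        double seekHours = records * seekMs / 1000.0 / 3600.0;

        System.out.printf("Full streaming read: %.1f hours%n", streamHours);
        System.out.printf("Seek per record:     %.1f hours%n", seekHours);
    }
}

Under these assumptions, streaming the full terabyte takes under three hours, while seeking to every record individually takes over a day, before the per-record transfer time is even counted.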
Another difference between MapReduce and an RDBMS is the amount of structure in the datasets they operate on. Structured data is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema; this is the realm of the RDBMS. Semi-structured data is looser: though there may be a schema, it is often ignored, so it serves only as a guide to the structure of the data. A spreadsheet is an example, in which the structure is the grid of cells even though the cells themselves may hold any form of data. Unstructured data has no particular internal structure: for example, plain text or image data. MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data; they are chosen by the person analyzing the data (see the mapper sketch below).

Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for MapReduce, because it makes reading a record a non-local operation, and one of the central assumptions MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes.
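The mapper sketch below illustrates that point: the input is just raw lines of text, and the analyst imposes the interpretation. The whitespace-separated log layout, with the hostname as the first field, is a hypothetical format assumed for this example.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The schema is imposed at read time: each raw line is split, and the
// analyst decides what the key and value mean. The assumed log layout
// (hostname as the first whitespace-separated field) is hypothetical.
public class HostCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text host = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\s+");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            // Key = hostname, value = 1; neither is an intrinsic
            // property of the file -- both are analyst choices.
            host.set(fields[0]);
            context.write(host, ONE);
        }
    }
}

Paired with a reducer that sums the counts, this is roughly the MapReduce equivalent of a SQL GROUP BY over data that was never loaded into a schema.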
                Traditional RDBMS             MapReduce
Data size       Gigabytes                     Petabytes
Access          Interactive and batch         Batch
Updates         Read and write many times     Write once, read many times
Structure       Static schema                 Dynamic schema
Integrity       High                          Low
Scaling         Non-linear                    Linear
But Hadoop hasn't reached that level of popularity yet. MySQL and other RDBMSs have vastly more market share than Hadoop, but, like any investment, it's the future you should be considering. The industry is trending toward distributed systems, and Hadoop is a major player.
