Is your solution suitable?

It’s important to understand what your solution was designed for in the first place. Then you can make sure it is a good fit for the job you actually want it to do. Read more below about what Hadoop is, what it isn’t, and how Big Data and Hadoop have been hyped up.

Inside Analysis 

What Hadoop Is. What Hadoop Isn’t. 

Mark Madsen
December 7, 2012 

The Hadoop stack is a data processing platform. It combines elements of databases, data integration tools and parallel coding environments into a new and interesting mix. The problem with the IT market today is that it distorts the view of Hadoop by looking at it as a replacement for one of these technologies. Database vendors see it as a database and challenge it on those grounds. Data integration vendors see it as an ETL tool and challenge it on those grounds. Analytics vendors see it as a replacement for their engines and challenge it through that view. In doing so, each vendor community overestimates Hadoop’s potential for displacement of their product, while simultaneously underestimating the impact that it will have on the environment and architecture they operate in.

Read the full article at Inside Analysis »

……………………………………………………………………………………………………………….

Computerworld

Moving beyond Hadoop for big data needs

Jaikumar Vijayan
October 29, 2012

Hadoop isn’t enough anymore for enterprises that need new and faster ways to extract business value from massive datasets.

Hadoop and MapReduce have long been mainstays of the big data movement, but some companies now need new and faster ways to extract business value from massive — and constantly growing — datasets.

While many large organizations are still turning to the open source Hadoop big data framework, Google, whose internal technologies inspired it, and others have already moved on to newer technologies.

The Apache Hadoop platform is an open source implementation of the Google File System and Google MapReduce designs, which the search engine giant developed to manage and process huge volumes of data on commodity hardware.

Those technologies have been a core part of the processing stack Google uses to crawl and index the Web.

Hundreds of enterprises have adopted Hadoop over the past three or so years to manage fast-growing volumes of structured, semi-structured and unstructured data.

The open source technology has proved to be a cheaper option than traditional enterprise data warehousing technologies for applications such as log and event data analysis, security event management, social media analytics and other workloads involving petabyte-scale data sets.

Analysts note that some enterprises have started looking beyond Hadoop not because of limitations in the technology, but because of the purposes for which it was designed.

Hadoop is built for handling batch-processing jobs where data is collected and processed in batches. Data in a Hadoop environment is broken up and stored in a cluster of highly distributed commodity servers or nodes.

To get a report from the data, users first have to write a job, submit it and wait for it to be distributed across all of the nodes and processed.
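
To make the batch model concrete, here is a minimal sketch of the kind of job a user has to write and submit, using Hadoop Streaming with Python. The tab-separated log format (a region followed by a response time) and the file and path names are assumptions made for this illustration, not details from the article.

#!/usr/bin/env python3
# Hypothetical Hadoop Streaming job: count requests per region from tab-separated
# log lines of the form "<region>\t<response_time_seconds>". The log format is an
# assumption made for this sketch.
import sys

def mapper():
    # Emit "region<TAB>1" for every well-formed log line arriving on stdin.
    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            print(f"{parts[0]}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so identical regions arrive together;
    # sum the counts for each run of identical keys.
    current_region, count = None, 0
    for line in sys.stdin:
        region, value = line.rstrip("\n").split("\t")
        if region != current_region:
            if current_region is not None:
                print(f"{current_region}\t{count}")
            current_region, count = region, 0
        count += int(value)
    if current_region is not None:
        print(f"{current_region}\t{count}")

if __name__ == "__main__":
    # Hadoop Streaming runs this script once per task; the argument picks the phase, e.g.:
    #   hadoop jar hadoop-streaming.jar -input /logs -output /report \
    #     -mapper "job.py map" -reducer "job.py reduce" -file job.py
    if sys.argv[1:] == ["map"]:
        mapper()
    else:
        reducer()

Even a report this simple has to be packaged, submitted, scheduled onto the nodes and collected back, which is the turnaround the rest of the article contrasts with interactive querying.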

While the Hadoop platform performs well, it’s not fast enough for some key applications, said Curt Monash, a database and analytics expert and principal at Monash Research. For instance, Hadoop does not fare well at running interactive, ad hoc queries against large datasets, he said.

“What Hadoop has trouble with is interactive responses,” Monash said. “If you can stand latencies of a few seconds, Hadoop is fine. But Hadoop MapReduce is never going to be useful for sub-second latencies.”

Companies needing such capabilities are already looking beyond Hadoop for their big data analytics needs.

Google, in fact, started using an internally developed technology called Dremel some five years ago to interactively analyze or “query” massive amounts of log data generated by its thousands of servers around the world.

Google says the Dremel technology supports “interactive analysis of very large datasets over shared clusters of commodity machines.”

Google uses Dremel in conjunction with MapReduce, said Google’s Ju-kay Kwek. Hadoop MapReduce is used to prepare, clean, transform and stage massive amounts of server log data, and Dremel is then used to analyze the data.

“Dremel allows us to go into the system and start to interrogate those logs with speculative queries,” Kwek said. A Google engineer could say, “show me all the response times that were above 10 seconds. Now show it to me by region,” Kwek said. Dremel allows engineers to very quickly pinpoint where the slowdown was occurring, Kwek said.

“Dremel distributes data across many, many machines and it distributes the query to all of the servers and asks each one ‘do you have my answer?’ It then aggregates it and gets back the answer in literally seconds.”
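
The scatter-gather behavior Kwek describes can be sketched in a few lines of Python. The code below only illustrates that general pattern, not Google’s implementation; the shard contents are invented, while the 10-second threshold and the per-region grouping mirror the example in the quotes above.

# Illustrative scatter-gather: each "shard" filters its own slice of the log data
# and returns partial per-region counts; the coordinator merges the partials.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: (region, response_time_seconds) records spread over machines.
SHARDS = [
    [("us-east", 12.4), ("us-east", 0.3), ("europe", 11.0)],
    [("asia", 0.8), ("asia", 15.2), ("us-east", 10.5)],
]

def query_shard(records, threshold=10.0):
    # "Do you have my answer?" -- each shard answers only for its own slice of data.
    return Counter(region for region, latency in records if latency > threshold)

def slow_responses_by_region(shards):
    # Fan the query out to every shard in parallel, then aggregate the partial results.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(query_shard, shards)
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

print(slow_responses_by_region(SHARDS))  # {'us-east': 2, 'europe': 1, 'asia': 1}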

Using Hadoop and MapReduce for the same task would take longer because it requires writing a job, launching it and waiting for it to spread across the cluster before the information can be sent back to a user. “You can do it, but it’s messy. It’s like trying to use a cup to slice bread,” Kwek said.

Major vendors of business intelligence products, including SAS Institute, SAP, Oracle, Teradata and Hewlett-Packard Co., have been rushing to deliver tools with improved data analytics capabilities. Like Google, most of these vendors see the Hadoop platform mainly as a massive data store for preparing and staging multi-structured data for analysis by other tools.

“Dremel was architected from the ground up to be an analytical data store,” said Michael Driscoll, CEO of analytics firm Metamarkets. Its column-oriented, parallelized, in-memory design makes it several orders of magnitude faster than a traditional data store, he said.
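
Why a column-oriented layout favors this kind of analysis can be shown with a toy comparison; the sketch below uses made-up records and says nothing about Dremel’s actual storage format.

# Row-oriented vs. column-oriented layouts for an analytical scan. An aggregate
# such as "average response time" touches one field; in a columnar layout that
# field is already contiguous, so the scan reads only the bytes it needs.
import array

# Row-oriented: every record carries every field, so the scan visits them all.
rows = [
    {"timestamp": 1, "region": "us-east", "response_time": 12.4},
    {"timestamp": 2, "region": "europe", "response_time": 0.3},
    {"timestamp": 3, "region": "asia", "response_time": 11.0},
]
avg_from_rows = sum(r["response_time"] for r in rows) / len(rows)

# Column-oriented: each field is its own contiguous array, so the same aggregate
# scans a single compact vector of floats (and compresses and parallelizes per column).
response_time_column = array.array("d", [12.4, 0.3, 11.0])
avg_from_columns = sum(response_time_column) / len(response_time_column)

assert abs(avg_from_rows - avg_from_columns) < 1e-9
print(round(avg_from_rows, 2))  # 7.9

Reading only the one column an aggregate needs, rather than whole rows, is the design property Driscoll is pointing to when he calls Dremel a purpose-built analytical data store.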

Hadoop, he said, is simply too slow for companies that need sub-millisecond query response times. Analytics technologies such as those being offered by the traditional enterprise vendors are faster than Hadoop but still don’t scale as well as a Dremel or a Druid, Driscoll said.

Read the article at Computerworld »
