I want to understand whether Netezza or Hadoop is the right choice for the following purposes:
- Pull feed files from several online sources, of considerable size, at times more than a GB.
- Clean, filter, transform, and compute further information from the feeds.
- Generate metrics on different dimensions, akin to how data warehouse cubes do it.
- Let web apps access the final data/metrics quickly via SQL or other standard mechanisms.
How it works:
As the data is loaded into the Appliance, it intelligently separates each table across the 108 SPUs.
Typically, the hard disk is the slowest part of a computer. Imagine 108 of these spinning up at once, each loading a small piece of the table. This is how Netezza achieves a 500-gigabyte-per-hour load rate.
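The distribution step described above can be sketched as hash partitioning. This is a simplified model: the 108-SPU count comes from the text, but the hash function (`zlib.crc32`), the row format, and the function names are all illustrative assumptions.

```python
# Sketch of spreading a table across SPUs by hashing a distribution key.
# The 108-SPU count comes from the text; zlib.crc32 as the hash and the
# row format are illustrative assumptions.
import zlib

NUM_SPUS = 108

def spu_for(key: str, num_spus: int = NUM_SPUS) -> int:
    """Map a distribution-key value to one SPU."""
    return zlib.crc32(key.encode()) % num_spus

def distribute(rows, key_field, num_spus=NUM_SPUS):
    """Partition rows into per-SPU buckets."""
    buckets = {i: [] for i in range(num_spus)}
    for row in rows:
        buckets[spu_for(row[key_field], num_spus)].append(row)
    return buckets

rows = [{"customer_id": f"C{i}", "amount": i * 10} for i in range(1000)]
buckets = distribute(rows, "customer_id")

# Every row lands on exactly one SPU, and the same key always maps to the
# same SPU -- which is what makes the local joins described later possible.
assert sum(len(b) for b in buckets.values()) == len(rows)
```

Because each SPU loads only its own bucket, the 108 disks work in parallel rather than serially.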
After a piece of the table is loaded and stored on each SPU (a computer on an integrated circuit card), each column is analyzed to gather descriptive statistics such as minimum and maximum values. These statistics are stored on each of the 108 SPUs instead of indexes, which take time to create and update and take up unnecessary space.
Imagine your environment without the need to create indexes.
When it is time to query the data, a master computer inside the Appliance queries the SPUs to see which ones contain the required data. Only the SPUs that contain appropriate data return information, so less information moves across the network to the Business Intelligence/Analytics server.
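The min/max statistics and the SPU pruning described above can be sketched together: each SPU keeps per-column ranges instead of an index, and the master asks only the SPUs whose range could contain the predicate value. Table and column names here are illustrative.

```python
# Sketch of zone-map-style pruning: each SPU keeps min/max statistics per
# column instead of an index; the master queries only SPUs whose range
# could contain the predicate value. All names are illustrative.

def column_stats(rows, column):
    """Descriptive statistics for one column of a local table slice."""
    vals = [r[column] for r in rows]
    return {"min": min(vals), "max": max(vals)}

def spus_to_query(stats_per_spu, column, value):
    """Return the SPU ids whose min/max range could contain `value`."""
    return [
        spu for spu, stats in stats_per_spu.items()
        if stats[column]["min"] <= value <= stats[column]["max"]
    ]

# Three SPUs, each holding a slice of an orders table.
spu_rows = {
    0: [{"order_id": 5}, {"order_id": 17}],
    1: [{"order_id": 120}, {"order_id": 260}],
    2: [{"order_id": 300}, {"order_id": 980}],
}
stats = {spu: {"order_id": column_stats(rows, "order_id")}
         for spu, rows in spu_rows.items()}

# A lookup for order_id = 250 only touches SPU 1; the others are skipped.
assert spus_to_query(stats, "order_id", 250) == [1]
```

The statistics are cheap to maintain on load, which is why they can replace indexes without the create/update cost the excerpt mentions.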
For joining data, it gets even better.
The Appliance distributes data in multiple tables across multiple SPUs
by a key. Each SPU contains partial data for multiple tables. It joins the parts of each table locally on each SPU, returning only the local result. All of the 'local results' are assembled internally in the cabinet and then returned to the Business Intelligence/Analytics server as the query result. This methodology also contributes to the speed story.
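The co-located join above can be sketched as follows. Because both tables were distributed by the same key, matching rows are guaranteed to be on the same SPU, so each SPU joins its local slices and the cabinet merely concatenates the partial results. The table and field names are illustrative.

```python
# Sketch of a co-located join: both tables were distributed by the same
# key (customer_id), so each SPU joins its local slices and returns only
# the partial result. Names are illustrative.

def local_join(customers, orders):
    """Join the local slices of two tables on customer_id."""
    by_id = {c["customer_id"]: c for c in customers}
    return [
        {**by_id[o["customer_id"]], **o}
        for o in orders if o["customer_id"] in by_id
    ]

# Two SPUs; a given customer_id never appears on more than one SPU.
spu_data = {
    0: {"customers": [{"customer_id": "C1", "name": "Ada"}],
        "orders": [{"customer_id": "C1", "amount": 40}]},
    1: {"customers": [{"customer_id": "C2", "name": "Bob"}],
        "orders": [{"customer_id": "C2", "amount": 75}]},
}

# Each SPU joins locally; the cabinet assembles the partial results.
result = [row for d in spu_data.values()
          for row in local_join(d["customers"], d["orders"])]
assert len(result) == 2
```

No SPU ever needs another SPU's rows, which is the 'less movement of data' property the excerpt keeps returning to.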
The key to all of this is 'less movement of data across the network'. The Appliance returns only the required data back to the Business Intelligence/Analytics server across the organization's 1000/100 MB network.
This is very different from traditional processing, where the Business Intelligence/Analytics software typically extracts most of the data from the database and does its processing on its own server. Here, the database does the work of determining the data needed, returning a smaller subset result to the Business Intelligence/Analytics server.
Backup And Redundancy
To understand how the data and system are set up for almost 100% uptime, it is important to understand the internal design. The Appliance uses the outer, fastest third of each 400-gigabyte disk for data storage and retrieval. One third of the disk stores descriptive statistics, and the remaining third stores a hot backup of other SPUs' data. Each Appliance cabinet also contains 4 additional SPUs for automatic failover of any of the 108 SPUs.
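The mirroring scheme above (each SPU holding a hot backup of another SPU's data) can be sketched as a placement map. The ring layout used here, where SPU i's backup lives on SPU i+1, is an assumption for illustration only; the excerpt does not say how Netezza pairs SPUs with their mirrors.

```python
# Sketch of hot-backup placement: each SPU stores a mirror of another
# SPU's data slice, so losing one SPU leaves its slice readable on the
# mirror. The ring layout (SPU i mirrored on SPU i+1) is an assumption
# for illustration; the source does not specify the pairing.

NUM_SPUS = 108

def mirror_of(spu: int, num_spus: int = NUM_SPUS) -> int:
    """SPU that holds the hot backup of `spu`'s primary data slice."""
    return (spu + 1) % num_spus

def readable_spus(failed, num_spus=NUM_SPUS):
    """After the SPUs in `failed` go down, map each data slice to an
    SPU that can still serve it."""
    placement = {}
    for spu in range(num_spus):
        if spu not in failed:
            placement[spu] = spu             # primary copy still up
        elif mirror_of(spu) not in failed:
            placement[spu] = mirror_of(spu)  # served from the mirror
    return placement

# Losing SPU 5 alone loses no data: SPU 6 serves slice 5 from its backup,
# and a spare SPU can then be promoted to restore redundancy.
placement = readable_spus(failed={5})
assert placement[5] == 6 and len(placement) == NUM_SPUS
```

The 4 spare SPUs per cabinet then take over the failed unit's role, restoring the mirror so a second failure is survivable.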
Taken from http://www2.sas.com