Top "Bigdata" questions

Big data is a concept that deals with data sets of extreme volumes.

What is Apache Zeppelin?

We often hear about Apache Zeppelin, so a few questions come to mind: What is Apache Zeppelin? What …

apache-spark bigdata apache-zeppelin
Apache Spark-SQL vs Sqoop benchmarking while transferring data from RDBMS to HDFS

I am working on a use case where I have to transfer data from RDBMS to HDFS. We have done …

hadoop apache-spark-sql sqoop bigdata
How to do subqueries in BigQuery?

I'm trying to play with the Reddit data on BigQuery and I want to see comments and replies in one …

sql subquery google-bigquery reddit bigdata
Spark Scala: understanding reduceByKey(_ + _)

I can't understand reduceByKey(_ + _) in the first example of Spark with Scala: object WordCount { def main(args: Array[String]): Unit = { …

scala apache-spark word-count bigdata
PostgreSQL - performance of using arrays in a big database

Let's say we have a table with 6 million records. There are 16 integer columns and a few text columns. It is read-only …

arrays performance postgresql join bigdata
Apache Drill vs Spark

I have some experience with Apache Spark and Spark-SQL. Recently I've found the Apache Drill project. Could you describe to me what …

hadoop apache-spark bigdata apache-drill
Load data into Hive with custom delimiter

I'm trying to create an internal (managed) table in Hive that can store my incremental log data. The table goes …

hadoop hive loaddata bigdata
How to speed up GLM estimation?

I am using RStudio 0.97.320 (R 2.15.3) on Amazon EC2. My data frame has 200k rows and 12 columns. I am trying to …

performance r bigdata
Why is 'mapred-site.xml' not included in the latest Hadoop 2.2.0?

The latest build of Hadoop provides mapred-site.xml.template. Do we need to create a new mapred-site.xml file using this? …

apache hadoop mapreduce bigdata
Finding the minimum Hamming distance of a set of strings in Python

I have a set of n (~1000000) strings (DNA sequences) stored in a list trans. I have to find the minimum …

python algorithm bigdata hamming-distance