Preparing the NCDC Weather Data for Hadoop

I’m exploring Hadoop with the book Hadoop: The Definitive Guide. Appendix A shows how to download the NCDC weather data from S3 and put it into Hadoop. I didn’t want to download from S3 or load the entire dataset, so here’s what I did instead.

Here’s a little bash script I used to download the data. You might want to do this if you want more up-to-date data, or if you only want to work with a subset. If you only want data for a certain year, just append that year to the URL in $source_url.


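# $download_to and $source_url must be set before this point
if [ ! -d "$download_to" ]; then
    mkdir -p "$download_to"
fi

# Recursive, resumable download that never climbs above the source directory
wget -r -c --progress=bar --no-parent -P "$download_to" "$source_url"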
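
As a rough sketch of the "append the year" tip, the helper below builds a year-specific URL. The function name year_url and the base URL are placeholders of mine, not part of the original script; substitute the real value of $source_url. It keeps a trailing slash on the result because wget's --no-parent option treats a URL without one as a file rather than a directory.

```shell
#!/usr/bin/env bash

# Hypothetical helper: append a year to a base URL without doubling the slash.
year_url() {
    local base="${1%/}"   # drop a single trailing slash, if present
    local year="$2"
    printf '%s/%s/\n' "$base" "$year"
}

# Placeholder base URL (an assumption); use your own source here.
source_url="$(year_url "https://example.invalid/noaa/" 2012)"
echo "$source_url"
```

The resulting $source_url can then be passed to the wget command above unchanged.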

I’ve modified the script from the Hadoop book to work with local files. I’m just working with files from 2012. Modify the path in $target if you want something different.

#!/usr/bin/env bash 

# NCDC Weather file to load into hadoop 

# Un-gzip each station file and concat into one file 
echo "reporter:status:Un-gzipping $target" >&2 
for file in "$target"/*; do
    gunzip -c "$file" >> "$target.all"
    echo "reporter:status:Processed $file" >&2
done

# Put gzipped version into HDFS 
echo "reporter:status:Gzipping $target and putting in HDFS" >&2 
gzip -c $target.all | $HADOOP_INSTALL/bin/hadoop fs -put - gz/$target.gz
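
If you want to try the un-gzip-and-concatenate step without touching Hadoop, here is a minimal sketch that runs the same kind of loop on throwaway files. The function name combine_gz and the sample file names are mine, purely for illustration. Unlike the script above, which appends to $target.all, it empties the output file first so that a re-run doesn't duplicate data.

```shell
#!/usr/bin/env bash

# Concatenate the decompressed contents of every .gz file in a directory.
combine_gz() {
    local dir="$1" out="$2"
    : > "$out"                       # start from an empty output file
    for file in "$dir"/*.gz; do
        [ -e "$file" ] || continue   # skip if the glob matched nothing
        gunzip -c "$file" >> "$out"
    done
}

# Demo on two tiny, made-up "station" files in a temporary directory
demo_dir="$(mktemp -d)"
printf 'station-a line\n' | gzip -c > "$demo_dir/a.gz"
printf 'station-b line\n' | gzip -c > "$demo_dir/b.gz"

combine_gz "$demo_dir" "$demo_dir/combined.txt"
cat "$demo_dir/combined.txt"
```

Files are processed in the shell's glob order, so the combined file follows the sorted file names.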

The script unzips all the files and combines them into one. While it runs, you should see output similar to this:

reporter:status:Processed /home/rhys/ncdc_data/ 
reporter:status:Processed /home/rhys/ncdc_data/ 
reporter:status:Processed /home/rhys/ncdc_data/

When it’s finished combining all the files, it gzips the result and stores it in Hadoop.

reporter:status:Gzipping /home/rhys/ncdc_data/ and putting in HDFS
13/01/11 21:37:52 INFO util.NativeCodeLoader: Loaded the native-hadoop library

Once the process has completed, you should be able to confirm that your data is stored in Hadoop with the following command:

rhys@linux-g1rx:~/hadoop_scripts> hadoop fs -ls gz/home/rhys/ncdc_data/
Found 1 items
-rwxrwxrwx 1 rhys users 4870924294 2013-01-11 23:11 /home/rhys/hadoop_scripts/gz/home/rhys/ncdc_data/

Now that I have data in Hadoop, it’s time to start writing MapReduce jobs!


  1. Anand says:


    The web URL returns "no data/directory found". Could you kindly provide an HTTP URL where the data is stored? I have been struggling for days to get the data directly from the website, but cannot locate it. I would be thankful if you can help.



  2. Rhys says:

    It’s been removed. I guess starting here could be a good bet…

  3. Rhys says:

    Thanks for sharing.

