Functions & sargable queries

Using functions improperly in your where clauses always prevents index usage right? I’ve been reviewing some queries generated by linq and I’ve found out this isn’t always the case. A quick demo…

Create a temp tables and insert some data

CREATE TABLE #test
(
	id INTEGER NOT NULL PRIMARY KEY CLUSTERED,
	[name] VARCHAR(100) NOT NULL
);
 
INSERT INTO #test
(
	id,
	[name]
)
VALUES
(
	1, 'Rhys Campbell'
),
(
	2, 'Rhys J Campbell'
),
(
	3, 'R J Campbell' 
),
(
	4, 'R Campbell'
),
(
	5, 'Mr J Campbell'
);

Create an index on name.

CREATE INDEX UIX_name 
ON #test
(
	[name]
);

You can see this query here uses an index seek despite the use of the the CONVERT function.

SELECT *
FROM #test
WHERE CONVERT(INT, id) = 4;
execution plan1 Functions & sargable queries

SQL Server Execution Plan

But here you can see the optimiser has been force to perform an index scan to locate our row.

SELECT *
FROM #test
WHERE CAST(id AS CHAR(1)) = '4';
execution plan2 Functions & sargable queries

SQL Server Execution Plan

You can see this query using an index seek.

SELECT *
FROM #test
WHERE [name] = 'Rhys Campbell';
execution plan3 Functions & sargable queries

SQL Server Execution Plan

The next two queries produce the same execution plan. Note the index scan despite the conversion matching the underlying data type in query #1.

SELECT *
FROM #test
WHERE CAST([name] AS VARCHAR(100)) = 'Rhys Campbell';
 
SELECT *
FROM #test
WHERE CONVERT(CHAR(13), [name]) = 'Rhys Campbell';
execution plan4 Functions & sargable queries

SQL Server Execution Plan

So the Query Optimiser can do some magic with integers but not so much with string data type columns. It seems the function / sargable rule is not absolute and perhaps we will see improvements to this in the future. Even so, I’d like to see this unnecessary conversions removed from queries. At the very least it makes the SQL more readable but it should help give the optimiser a better chance of producing a good plan.


[ERROR] Native table ‘performance_schema’.'table name’ has the wrong structure

After I upgraded an instance to MySQL 5.7 I noted the following errors in the log;

2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'events_statements_current' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'events_statements_history' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'events_statements_history_long' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'events_statements_summary_by_thread_by_event_name' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'events_statements_summary_by_account_by_event_name' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'events_statements_summary_by_user_by_event_name' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'events_statements_summary_by_host_by_event_name' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'events_statements_summary_global_by_event_name' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'events_statements_summary_by_digest' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'users' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'accounts' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'hosts' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'socket_instances' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'socket_summary_by_instance' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'socket_summary_by_event_name' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'session_connect_attrs' has the wrong structure
2013-05-23 11:40:24 30199 [ERROR] Native table 'performance_schema'.'session_account_connect_attrs' has the wrong structure

You can fix this with mysql_upgrade. Be aware of what the tool does before running this. Most importantly bear this in mind;

Because mysql_upgrade invokes mysqlcheck with the –all-databases option, it processes all tables in all databases, which might take a long time to complete. Each table is locked and therefore unavailable to other sessions while it is being processed. Check and repair operations can be time-consuming, particularly for large tables.

mysql_upgrade -h localhost -u root -p

After you have restarted the instance the errors will no longer be reported.


Are you checking for possible integer overflow?

I realized I wasn’t! We run a couple of systems that I know stick a mass of records through on a daily basis. Better start doing this then or I might end up doing a whoopsie!

Here’s a script I’ve quickly knocked up to make checking this simple. This script will check the specified tables TINYINT, SMALLINT, INT & BIGINT values against the allowable maximum.

A few things to note about this script;

  • Don’t run on a live system.
  • To check all tables in a database you can remove the line specifying TABLE_NAME in the cursor. Be aware this may take a long time and stress your server.
  • % are expressed as a value between 0 and 1 i.e. 0.5 = 50%.
  • % may be slightly off but should not matter considering the scales involved.
  • % also ignores negative numbers since we typically start IDENTITY columns at 1.
  • Change @used_warn_level_percent to a sensible value according to your system.
  • If you observe “Warning: Null value is eliminated by an aggregate or other SET operation.” then your table has a column only containing nulls.
DECLARE @table_schema VARCHAR(50),
		@table_name VARCHAR(50),
		@column_name VARCHAR(50),
		@data_type VARCHAR(30),
		@max_id BIGINT,
		@used_warn_level_percent DECIMAL(4,2);
 
SET @used_warn_level_percent = 49.0;
 
SET NOCOUNT ON;
 
DECLARE myCursor CURSOR LOCAL FAST_FORWARD FOR  SELECT TOP 10 TABLE_SCHEMA, 
													   TABLE_NAME, 
													   COLUMN_NAME,
													   DATA_TYPE
												FROM INFORMATION_SCHEMA.COLUMNS
												WHERE DATA_TYPE IN ('tinyint', 'smallint', 'int', 'bigint')
												AND TABLE_CATALOG = DB_NAME()
												AND TABLE_NAME = 'test_numbers';
 
OPEN myCursor;
FETCH NEXT FROM myCursor INTO @table_schema,
							  @table_name,
							  @column_name,
							  @data_type;
 
DECLARE @max_id_table TABLE
(
	id BIGINT
);
 
WHILE (@@FETCH_STATUS = 0)
BEGIN
 
	INSERT INTO @max_id_table
	EXEC('SELECT MAX(' + @column_name + ') FROM ' + @table_schema + '.' + @table_name);
 
	SET @max_id = (SELECT id FROM @max_id_table);
 
	IF(@data_type = 'bigint')
	BEGIN
		IF(@max_id > FLOOR(((9223372036854775807 / 100.0) * @used_warn_level_percent)))
		BEGIN
			PRINT @column_name + '.' + @table_schema + '.' + @table_name + ' max id is greater than warning level: ' + CONVERT(VARCHAR(30), @max_id);
		END
		PRINT @column_name + '.' + @table_schema + '.' + @table_name + ': ' + CONVERT(VARCHAR(30), CAST(@max_id AS FLOAT) / CAST(9223372036854775807 AS FLOAT)) + ' int space used.';
	END
	ELSE IF (@data_type = 'int')
	BEGIN
		IF(@max_id > FLOOR(((2147483647 / 100.0) * @used_warn_level_percent)))
		BEGIN
			PRINT @column_name + '.' + @table_schema + '.' + @table_name + ' max id is greater than warning level: ' + CONVERT(VARCHAR(30), @max_id);
		END	
		PRINT @column_name + '.' + @table_schema + '.' + @table_name + ': ' + CONVERT(VARCHAR(30), CAST(@max_id AS FLOAT) / CAST(2147483647 AS FLOAT)) + ' int space used.';		
	END
	ELSE IF (@data_type = 'smallint')
	BEGIN
		IF(@max_id > FLOOR(((32767 / 100.0) * @used_warn_level_percent)))
		BEGIN
			PRINT @column_name + '.' + @table_schema + '.' + @table_name + ' max id is greater than warning level: ' + CONVERT(VARCHAR(30), @max_id);
		END
		PRINT @column_name + '.' + @table_schema + '.' + @table_name + ': ' + CONVERT(VARCHAR(30), CAST(@max_id AS FLOAT) / CAST(32767 AS FLOAT)) + ' int space used.';					
	END
	ELSE IF (@data_type = 'tinyint')
	BEGIN
		IF(@max_id > FLOOR(((255 / 100.0) * @used_warn_level_percent)))
		BEGIN
			PRINT @column_name + '.' + @table_schema + '.' + @table_name + ' max id is greater than warning level: ' + CONVERT(VARCHAR(30), @max_id);
		END
		PRINT @column_name + '.' + @table_schema + '.' + @table_name + ': ' + CONVERT(VARCHAR(30), CAST(@max_id AS FLOAT) / CAST(255 AS FLOAT)) + ' int space used.';	
	END
 
	FETCH NEXT FROM myCursor INTO @table_schema,
								  @table_name,
								  @column_name,
								  @data_type;
 
	DELETE FROM @max_id_table;
 
END;
 
CLOSE myCursor;
DEALLOCATE myCursor;

Now we’re ready for testing!

CREATE TABLE test_numbers
(
	t_int TINYINT,
	s_int SMALLINT,
	i_int INT,
	b_int BIGINT
);
-- 100% usage
INSERT INTO test_numbers VALUES (255, 32767, 2147483647, 9223372036854775807);

Run the script and we should be warned!

t_int.dbo.test_numbers max id is greater than warning level: 255
t_int.dbo.test_numbers: 1 int space used.
s_int.dbo.test_numbers max id is greater than warning level: 32767
s_int.dbo.test_numbers: 1 int space used.
i_int.dbo.test_numbers max id is greater than warning level: 2147483647
i_int.dbo.test_numbers: 1 int space used.
b_int.dbo.test_numbers max id is greater than warning level: 9223372036854775807
b_int.dbo.test_numbers: 1 int space used.
TRUNCATE TABLE test_numbers;
-- 99% (ish)
INSERT INTO test_numbers VALUES (252, 32439, 2126008810, 9131138316486228048);
t_int.dbo.test_numbers max id is greater than warning level: 252
t_int.dbo.test_numbers: 0.988235 int space used.
s_int.dbo.test_numbers max id is greater than warning level: 32439
s_int.dbo.test_numbers: 0.98999 int space used.
i_int.dbo.test_numbers max id is greater than warning level: 2126008810
i_int.dbo.test_numbers: 0.99 int space used.
b_int.dbo.test_numbers max id is greater than warning level: 9131138316486228048
b_int.dbo.test_numbers: 0.99 int space used.

And finally;

TRUNCATE TABLE test_numbers ;
-- 50% (ish)
INSERT INTO test_numbers VALUES (127, 16383, 1073741823, 4611686018427387903);
t_int.dbo.test_numbers max id is greater than warning level: 127
t_int.dbo.test_numbers: 0.498039 int space used.
s_int.dbo.test_numbers max id is greater than warning level: 16383
s_int.dbo.test_numbers: 0.499985 int space used.
i_int.dbo.test_numbers max id is greater than warning level: 1073741823
i_int.dbo.test_numbers: 0.5 int space used.
b_int.dbo.test_numbers max id is greater than warning level: 4611686018427387903
b_int.dbo.test_numbers: 0.5 int space used.
-- clean up
DROP TABLE test_numbers;

Compare AD Group Memberships with Powershell

Here’s a quick Powershell script I knocked up to help me check AD Group Memberships between two user accounts. Just set the $user1 and $user2 variables and you’re good to go.

?View Code POWERSHELL
Import-Module ActiveDirectory;
 
$user1 = "username1";
$user2 = "username2";
 
$groups1 = Get-ADPrincipalGroupMembership –Identity $user1 | Select-Object -Property Name;
$groups2 = Get-ADPrincipalGroupMembership –Identity $user2 | Select-Object -Property Name;
 
if($groups1.Count -ne $groups2.Count)
{
	Write-Host "The two accounts contain a different number of groups.";
}
 
foreach($group in $groups1)
{
	echo "1 $group";
	if($groups2 -match $group)
	{
		Write-Host "$user1 & $users2 are both members of $group.";
	}
	else
	{
		Write-host "$user2 is not a member of $group";
	}
}
 
# Reverse check
foreach($group in $groups2)
{
	if($groups1 -match $group)
	{
		# No need to reoutput message;
	}
	else
	{
		Write-host "$user1 is not a member of $group";
	}
}

Output will resemble below;

The two accounts contain a different number of groups.
username1 &  are both members of @{Name=Domain Users}.
username2 is not a member of @{Name=AD Group One}
username1 &  are both members of @{Name=VPN Group}.
username2 is not a member of @{Name=Development AD Group}
username2 is not a member of @{Name=SQL Admins Group}
username2 is not a member of @{Name=AD Group Four}

Failed to configure Node and Disk Majority quorum with ‘[Disk Group Name]‘.

I was changing the drive used as a disk quorum today and received the following error at the end of the wizard in Failover Cluster Manager;

Failed to configure Node and Disk Majority quorum with '[Disk Group Name]'.

I then noticed the drive I was trying to add was in the “SQL Server (MSSQLSERVER)” group and not in the “Available Storage” group. To resolve this I went to the “Server Server (MSSQLSERVER)” node under “Services and applications”. Under the “Disk Drive” section I right clicked the drive I wanted to use for the Quorum and selected “Remove from SQL Server (MSSQLSERVER)”.

This made the drive appear in the “Available Storage” group. I was then able to successfully complete to wizard to reconfigure the quorum.


Powershell to get Windows Startup & Shutdown times

Here’s a quick Powershell snippet to get the startup and shutdown times for a windows system after a specific point.

Get-EventLog -LogName System -ComputerName myHost -After 12/03/2013 -Source "Microsoft-Windows-Kernel-General" | Where-Object { $_.EventId -eq 12 -or $_.EventId -eq 13; } | Select-Object EventId, TimeGenerated, UserName, Source | Sort-Object TimeGenerated | Format-Table -Autosize;

Id 12 indicates a startup event while 13 a shutdown event.

EventID TimeGenerated       UserName            Source
------- -------------       --------            ------
     13 12/03/2013 07:41:58                     Microsoft-Windows-Kernel-General
     12 12/03/2013 07:44:06 NT AUTHORITY\SYSTEM Microsoft-Windows-Kernel-General

Migrate users between MySQL Servers with pt-show-grants

If you use MySQL but don’t use Percona Toolkit you’re really missing a trick. It contains a whole host of useful tools including pt-show-grants which I use to migrate users between servers easily.

pt-show-grants --host mysqlservername --user username --password secret | mysql -h localhost -u username -p

If you want to filter out any specific users, or databases, that’s easy enough with grep;

pt-show-grants --host mysqlservername --user username --password secret | grep -v exclude_database_name | mysql -h localhost -u username -p

Compiling Hadoop example MaxTemperature.java

I’m working through some of the examples in this Hadoop book. I’m a little rusty on compiling java programs and had a little trouble with this one so I’m documenting it here for anyone else how might be having issues.

Firstly, I tried compiling the examples like this;

javac MaxTemperature.java

That wasn’t too successful;

MaxTemperature.java:3: error: package org.apache.hadoop.fs does not exist
import org.apache.hadoop.fs.Path;
                           ^
MaxTemperature.java:4: error: package org.apache.hadoop.io does not exist
import org.apache.hadoop.io.IntWritable;
                           ^
MaxTemperature.java:5: error: package org.apache.hadoop.io does not exist
import org.apache.hadoop.io.Text;
                           ^
MaxTemperature.java:6: error: package org.apache.hadoop.mapreduce does not exist
import org.apache.hadoop.mapreduce.Job;
                                  ^
MaxTemperature.java:7: error: package org.apache.hadoop.mapreduce.lib.input does not exist
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
                                            ^
MaxTemperature.java:8: error: package org.apache.hadoop.mapreduce.lib.output does not exist
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
                                             ^
MaxTemperature.java:18: error: cannot find symbol
    Job job = new Job();
    ^
  symbol:   class Job
  location: class MaxTemperature
MaxTemperature.java:18: error: cannot find symbol
    Job job = new Job();
                  ^
  symbol:   class Job
  location: class MaxTemperature
MaxTemperature.java:22: error: cannot find symbol
    FileInputFormat.addInputPath(job, new Path(args[0]));
                                          ^
  symbol:   class Path
  location: class MaxTemperature
MaxTemperature.java:22: error: cannot find symbol
    FileInputFormat.addInputPath(job, new Path(args[0]));
    ^
  symbol:   variable FileInputFormat
  location: class MaxTemperature
MaxTemperature.java:23: error: cannot find symbol
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
                                            ^
  symbol:   class Path
  location: class MaxTemperature
MaxTemperature.java:23: error: cannot find symbol
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    ^
  symbol:   variable FileOutputFormat
  location: class MaxTemperature
MaxTemperature.java:28: error: cannot find symbol
    job.setOutputKeyClass(Text.class);
                          ^
  symbol:   class Text
  location: class MaxTemperature
MaxTemperature.java:29: error: cannot find symbol
    job.setOutputValueClass(IntWritable.class);
                            ^
  symbol:   class IntWritable
  location: class MaxTemperature
14 errors

After a little messing about I found the correct procedure. When executing these commands you must be in the MaxTemperature project directory. First compile the MaxTemperatureMapper.java file. The classpath should contain the path to the hadoop-core-1.0.4.jar file.

javac -verbose -classpath /home/rhys/hadoop-1.0.4/hadoop-core-1.0.4.jar MaxTemperatureMapper.java

Next we can compile the MaxTemperature.java file. This time the classpath contain the path to the hadoop-core-1.0.4.jar file as well as the MaxTemperatire project directory where we compiled MaxTemperatureMapper.java

javac -classpath /home/rhys/hadoop-1.0.4/hadoop-core-1.0.4.jar:/home/rhys/Downloads/hadoop-book-master/ch02/src/main/java  MaxTemperature.java

That should compile, if so we can then run the job with the provided sample data;

hadoop MaxTemperature ../../../../input/ncdc/sample.txt output

You should see output similar to below;

13/01/27 15:08:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/01/27 15:08:16 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/01/27 15:08:16 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/01/27 15:08:16 INFO input.FileInputFormat: Total input paths to process : 1
13/01/27 15:08:16 WARN snappy.LoadSnappy: Snappy native library not loaded
13/01/27 15:08:17 INFO mapred.JobClient: Running job: job_local_0001
13/01/27 15:08:18 INFO util.ProcessTree: setsid exited with exit code 0
13/01/27 15:08:18 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@71780051
13/01/27 15:08:18 INFO mapred.MapTask: io.sort.mb = 100
13/01/27 15:08:19 INFO mapred.JobClient:  map 0% reduce 0%
13/01/27 15:08:20 INFO mapred.MapTask: data buffer = 79691776/99614720
13/01/27 15:08:20 INFO mapred.MapTask: record buffer = 262144/327680
13/01/27 15:08:20 INFO mapred.MapTask: Starting flush of map output
13/01/27 15:08:20 INFO mapred.MapTask: Finished spill 0
13/01/27 15:08:20 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/01/27 15:08:21 INFO mapred.LocalJobRunner: 
13/01/27 15:08:21 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/01/27 15:08:21 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@114f6322
13/01/27 15:08:21 INFO mapred.LocalJobRunner: 
13/01/27 15:08:21 INFO mapred.Merger: Merging 1 sorted segments
13/01/27 15:08:21 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 57 bytes
13/01/27 15:08:21 INFO mapred.LocalJobRunner: 
13/01/27 15:08:21 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/01/27 15:08:21 INFO mapred.LocalJobRunner: 
13/01/27 15:08:21 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
13/01/27 15:08:21 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to output
13/01/27 15:08:22 INFO mapred.JobClient:  map 100% reduce 0%
13/01/27 15:08:24 INFO mapred.LocalJobRunner: reduce > reduce
13/01/27 15:08:24 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
13/01/27 15:08:25 INFO mapred.JobClient:  map 100% reduce 100%
13/01/27 15:08:25 INFO mapred.JobClient: Job complete: job_local_0001
13/01/27 15:08:25 INFO mapred.JobClient: Counters: 20
13/01/27 15:08:25 INFO mapred.JobClient:   File Output Format Counters 
13/01/27 15:08:25 INFO mapred.JobClient:     Bytes Written=29
13/01/27 15:08:25 INFO mapred.JobClient:   FileSystemCounters
13/01/27 15:08:25 INFO mapred.JobClient:     FILE_BYTES_READ=1493
13/01/27 15:08:25 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=63627
13/01/27 15:08:25 INFO mapred.JobClient:   File Input Format Counters 
13/01/27 15:08:25 INFO mapred.JobClient:     Bytes Read=529
13/01/27 15:08:25 INFO mapred.JobClient:   Map-Reduce Framework
13/01/27 15:08:25 INFO mapred.JobClient:     Reduce input groups=2
13/01/27 15:08:25 INFO mapred.JobClient:     Map output materialized bytes=61
13/01/27 15:08:25 INFO mapred.JobClient:     Combine output records=0
13/01/27 15:08:25 INFO mapred.JobClient:     Map input records=5
13/01/27 15:08:25 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/01/27 15:08:25 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
13/01/27 15:08:25 INFO mapred.JobClient:     Reduce output records=2
13/01/27 15:08:25 INFO mapred.JobClient:     Spilled Records=10
13/01/27 15:08:25 INFO mapred.JobClient:     Map output bytes=45
13/01/27 15:08:25 INFO mapred.JobClient:     CPU time spent (ms)=0
13/01/27 15:08:25 INFO mapred.JobClient:     Total committed heap usage (bytes)=230694912
13/01/27 15:08:25 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
13/01/27 15:08:25 INFO mapred.JobClient:     Combine input records=0
13/01/27 15:08:25 INFO mapred.JobClient:     Map output records=5
13/01/27 15:08:25 INFO mapred.JobClient:     SPLIT_RAW_BYTES=131
13/01/27 15:08:25 INFO mapred.JobClient:     Reduce input records=5

Hadoop VersionInfo Issue on OpenSuSE 12

I was getting the following error when attempting to run hadoop version.

The java class is not found: org.apache.hadoop.util.VersionInfo
Unable to determine Hadoop version information.
'hadoop version' returned:
The java class is not found: org.apache.hadoop.util.VersionInfo

This was due to having the OpenJDK installed rather than the one from Sun/Oracle. To resolve this simply uninstall the openjdk packages…

sudo zypper remove java

Download one of the Sun/Oracle java packages for your platform. I installed it with…

sudo rpm -iv jdk-7u11-linux-x64.rpm

I also had to set my java home to /usr. Remember to set this is your .profile or somewhere else appropriate. Now when I run hadoop version I get…

Hadoop 2.0.2-alpha
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.0.2-alpha/hadoop-common-project/hadoop-common -r 1392682
Compiled by hortonmu on Tue Oct  2 00:44:10 UTC 2012
From source with checksum efbdb59af73bfc103f1945d65dbf3071

Happy Hadoop’ing!


Preparing the NCDC Weather Data for Hadoop

I’m exploring Hadoop with the book Hadoop: The Definitive Guide. Appendix A shows how to download NCDC Weather data from S3 and put it into Hadoop. I didn’t want to download from S3 or load the entire dataset so here’s what I did instead.

Here’s a little bash script I used to download the data. You might want to do this if you want more up-to-date data, or if you only want to work with a subset. If you only want data for a certain year just append that year to the url in $source_url.
#!/bin/bash 
 
source_url="ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/"; 
download_to="~/ncdc_data"; 
 
if [ ! -d "$download_to" ]; then 
    mkdir "$download_to"; 
fi 
 
wget -r -c --progress=bar --no-parent -P "$download_to" "$source_url";

I’ve modified the script from the Hadoop book to work with local files. I’m just working with files from 2012. Modify the url in target if you want something different.

#!/usr/bin/env bash 
 
# NCDC Weather file to load into hadoop 
target="/home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012"; 
 
# Un-gzip each station file and concat into one file 
echo "reporter:status:Un-gzipping $target" >&2 
for file in $target/* do 
    gunzip -c $file >> $target.all 
    echo "reporter:status:Processed $file" >&2 
done 
 
# Put gzipped version into HDFS 
echo "reporter:status:Gzipping $target and putting in HDFS" >&2 
gzip -c $target.all | $HADOOP_INSTALL/bin/hadoop fs -put - gz/$target.gz

The script will unzip all the files, combine them, you should see output similar to this.

reporter:status:Processed /home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012/999999-94996-2012.gz 
reporter:status:Processed /home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012/999999-96404-2012.gz 
reporter:status:Processed /home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012/999999-99999-2012.gz

When it’s finished combining all the files it will store the data in Hadoop.

 reporter:status:Gzipping /home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012 and putting in HDFS 13/01/11 21:37:52 
INFO util.NativeCodeLoader: Loaded the native-hadoop library

Once the process has completed you should be able to confirm the storage of your data in Hadoop with the following command;

 rhys@linux-g1rx:~/hadoop_scripts> hadoop fs -ls gz/home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012.gz
 Found 1 items 
-rwxrwxrwx 1 rhys users 4870924294 2013-01-11 23:11 /home/rhys/hadoop_scripts/gz/home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012.gz

Now I have data in Hadoop it’s time to start writing MapReduce jobs!