Comparing accented strings with TSQL

Today a colleague was running into some difficulties matching football team names containing accented characters. For example Olympique Alès and Olympique Ales were not matching when he wanted them to. The issue here is all to do with collations. We can use the Latin1_General_CI_AI collation in our queries to force the comparison to ignore accents (AI stands for Accent Insensitive).

DECLARE @string1 VARCHAR(30), @string2 VARCHAR(30)
 
SET @string1 = 'Olympique Alès'; -- With accented e
SET @string2 = 'Olympique Ales'; -- Without accented e
 
IF @string1 = @string2
BEGIN
	PRINT 'Strings do not match!'; -- This will not print
END
 
IF @string1 COLLATE Latin1_General_CI_AI = @string2 COLLATE Latin1_General_CI_AI
BEGIN
	PRINT 'Strings match if we use COLLATE Latin1_General_CI_AI!'; -- This will print
END

Here’s how it works with GROUP BY.

CREATE TABLE dbo.Places
(
	Id INTEGER IDENTITY(1,1) NOT NULL PRIMARY KEY,
	Place VARCHAR(30) NOT NULL
);
 
-- Insert some test data
INSERT INTO dbo.Places 
(
	Place
)
VALUES
(
	'Olympique Alès'
),
(
	'Olympique Ales'
),
(
	'Gazélec Ajaccio'
),
(
	'Gazelec Ajaccio'
);
 
 
SELECT Place, COUNT(*)
FROM dbo.Places
GROUP BY Place;

group by no collate

SELECT Place COLLATE Latin1_General_CI_AI, COUNT(*)
FROM dbo.Places
GROUP BY Place COLLATE Latin1_General_CI_AI;

group by with collate

You can also use this with JOINS.

-- No COLLATE
SELECT *
FROM dbo.Places t1
INNER JOIN dbo.Places t2
	ON t1.Place = t2.Place;

join no collate

-- With COLLATE
SELECT *
FROM dbo.Places t1
INNER JOIN dbo.Places t2
	ON t1.Place COLLATE Latin1_General_CI_AI = t2.Place COLLATE Latin1_General_CI_AI;

join with collate

So that’s how you can force strings to match if they contain accented characters. I was using this for a quick data mapping task but be wary of using the COLLATE clause, in any applications or processes, where performance may become an issue.

If you use the COLLATE clause then this will mean the database engine cannot use any index on the referenced column. You could replace the accented characters before inserting them into your database but here’s a solution I prefer using persisted computed columns.

-- Alter Places table. Note we specify the collation
ALTER TABLE dbo.Places ADD CleanPlaceName AS Place COLLATE Latin1_General_CI_AI PERSISTED;
-- Index this column!
CREATE INDEX idx_CleanPlaceName ON dbo.Places (CleanPlaceName);
 
-- Note accents in CleanPlaceName are preserved
SELECT *
FROM dbo.Places
 
-- No COLLATE needed!
SELECT CleanPlaceName, COUNT(*)
FROM dbo.Places
GROUP BY CleanPlaceName;

group by no collate needed

SELECT *
FROM dbo.Places t1
INNER JOIN dbo.Places t2
	ON t1.CleanPlaceName = t2.CleanPlaceName;

join no collate needed

This solution removes the need to include the COLLATE clause in your queries and keeps the possibility of using indices open!


Leave a Reply