Path Normalization

Here is part two of my discussion on Windows paths: the normalization of paths. See the first post, Path Format Overview.

Important: Some things I discuss aren't explicitly or centrally documented. While you can discover these details through thorough searching and experimentation, be very cautious about second-guessing Windows APIs. If you need functionality that a Windows API provides, use it; don't attempt to replicate it.

Overview

Almost all paths that are passed to Windows APIs get normalized. Normalization does a few main things:

  • Canonicalizes component/directory separators
  • Applies current directories to partially qualified (relative) paths
  • Evaluates relative directory components (current . and parent ..)
  • Trims certain characters

This normalization happens implicitly, but you can explicitly do it by calling GetFullPathName().

Identifying the Path

Identifying the type of path is the first order of business. Paths fall into one of a few categories:

  1. Begins with a single component separator \ (current drive root relative)
  2. Begins with two separators and a question mark or period \\?, \\. (device path)
  3. Begins with two separators and not a question mark or period (UNC)
  4. Begins with a drive letter, a volume separator, and a component separator C:\ (fully qualified)
  5. Begins with a drive letter, a volume separator, and no component separator C: (specified drive current directory relative)
  6. Is a legacy device CON, LPT1, etc. (device)
  7. Begins with anything else (current directory relative)

The type of the path determines whether or not a current directory is applied in some way. It also determines what the "root" of the path is. The root will never be eaten into by parent (..) directory segments.
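To make the order of those checks concrete, here is a rough sketch in Python. It only illustrates the categories listed above and is not how Windows does it. The function name classify, the labels it returns, and the LEGACY_DEVICES set are my own assumptions for illustration, and the real legacy device matching has more cases than a simple name lookup.

```python
import re

# Assumption: a commonly cited set of legacy DOS device names.
# The real matching rules in Windows cover more cases than this lookup.
LEGACY_DEVICES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{n}" for n in range(1, 10)}
    | {f"LPT{n}" for n in range(1, 10)}
)


def classify(path):
    """Return a label for the path category (hypothetical helper)."""
    # The direction of the slashes does not matter when identifying the type.
    p = path.replace("/", "\\")
    if p.upper() in LEGACY_DEVICES:
        return "legacy device"
    if p.startswith("\\\\"):
        # Two separators followed by ? or . is a device path, otherwise UNC.
        return "device path" if p[2:3] in ("?", ".") else "UNC"
    if p.startswith("\\"):
        return "current drive root relative"
    if re.match(r"[A-Za-z]:", p):
        if p[2:3] == "\\":
            return "fully qualified"
        return "specified drive current directory relative"
    return "current directory relative"
```

Note that the legacy device check has to come first, otherwise CON would be reported as relative to the current directory.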

Legacy Devices

If the path is a legacy DOS device such as CON, COM1, LPT1 it is converted into a device path by prepending \\.\ and returned.

Applying the Current Directory

If a path isn't fully qualified it will need the current directory applied. This involves cases 1, 5, and 7 described above. UNCs and device paths do not have the current directory applied. Neither does a full drive with separator C:\.

If the path starts with a single component separator (1), the drive from the current directory is applied. If you pass \bar and the current directory is C:\foo\, you would get C:\bar.

If the path starts with a drive letter, volume separator, and no component separator (5) the last current directory set from the command shell for the specified drive is applied, or the drive alone if none is set. If you pass D:bar and the current directory is C:\foo\ and the last current directory on D: was D:\bar\ you would get D:\bar\bar. These "Drive Relative" paths are a common source of program and script logic errors. Assuming that a path beginning with a letter and a colon isn't relative is obviously not correct.

The last case is a path that starts with something other than a separator (7). If you pass bar and the current directory is C:\foo\, you would get C:\foo\bar.
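The three cases can be sketched in a few lines of Python. This is a simplified model, not the Windows logic: apply_current_directory and its drive_directories argument (a mapping from drive to the last current directory on that drive) are names I made up, and the sketch assumes backslash separators, a current directory on a drive, and no legacy device names.

```python
def apply_current_directory(path, current_directory, drive_directories=None):
    """Apply a current directory to cases 1, 5 and 7 (hypothetical helper)."""
    current_drive = current_directory[:2].upper()
    # Per-drive current directories; the current drive always maps to
    # the current directory itself.
    known = {**(drive_directories or {}), current_drive: current_directory}

    if path.startswith("\\\\"):
        return path  # UNC or device path: nothing is applied
    if path.startswith("\\"):
        return current_drive + path  # case 1: only the drive is applied
    if len(path) >= 2 and path[0].isalpha() and path[1] == ":":
        if path[2:3] == "\\":
            return path  # fully qualified: nothing is applied
        drive = path[:2].upper()
        base = known.get(drive, drive + "\\")  # case 5: drive alone if unset
        return base.rstrip("\\") + "\\" + path[2:]
    # case 7: relative to the current directory
    return current_directory.rstrip("\\") + "\\" + path
```

With a current directory of C:\foo\ and a last current directory of D:\bar\ on D:, passing D:bar gives D:\bar\bar, which matches the example above.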

Note that relative paths are dangerous in multithreaded programs (e.g. most programs) as the current directory is a process-wide setting. Any thread can change the current directory at any time. I'll discuss ways to deal with this in a future post.

Canonicalizing Separators

All forward slashes (/) are converted into the standard Windows separator, the backslash (\). Runs of separators are collapsed into a single separator, except that the first two separators of the path, if present, are left as they are.

When identifying paths for normalization purposes, the initial direction of the slash does not matter. It is important to recognize, however, that forward slashes are not supported in Windows outside of this normalization step. This is critically important when it comes to skipping normalization, which we'll discuss shortly.
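As a rough sketch of that rule in Python (canonicalize_separators is a name I picked, and this models only the description above, not the actual Windows code):

```python
import re


def canonicalize_separators(path):
    """Convert / to \\ and collapse runs of separators (hypothetical helper)."""
    converted = path.replace("/", "\\")
    # A leading pair of separators is kept as is; runs are only
    # collapsed in whatever follows it.
    keep = 2 if converted.startswith("\\\\") else 0
    head, tail = converted[:keep], converted[keep:]
    return head + re.sub(r"\\{2,}", r"\\", tail)
```

This is why three or four leading slashes both end up as three backslashes: the first two are kept, and whatever run follows them collapses to one.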

Evaluating Relative Components

As the path is processed, any components/segments that consist of a single or double period are evaluated. For a single period (.) the current segment is removed (as it means current directory). For a double period (..) the current segment and the parent segment are removed (as it means parent directory).

Parent directories are only removed if they aren't past the "root" of the path. The root of the path depends on the type of path. It is the drive (C:\) for DOS paths, the server/share for UNCs (\\Server\Share), and the device path prefix for device paths (\\?\ or \\.\).
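Here is a small Python sketch of that evaluation. The helper resolve_relative is hypothetical, and it assumes the root has already been identified and split off from the rest of the path. It also drops empty segments and does not try to model when a trailing separator is preserved.

```python
def resolve_relative(root, rest):
    """Evaluate . and .. segments in rest without eating into root."""
    kept = []
    for segment in rest.split("\\"):
        if segment in ("", "."):
            continue  # current directory (or an empty segment): drop it
        if segment == "..":
            if kept:
                kept.pop()  # remove the parent segment, if there is one
            continue  # at the root there is nothing left to remove
        kept.append(segment)
    return root + "\\".join(kept)
```

The root differs by path type, so the same segments give different results: with a root of C:\ the segments Foo\..\.. resolve to C:\, while with a root of \\?\ the segments C:\Foo\..\.. resolve to \\?\.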

Trimming Characters

Beyond runs of separators and relative segments, some other characters will be removed.

If a segment ends in a single period, that period will be removed. A segment of a single or double period falls under the relative component rule above. A segment of three periods (or more) doesn't hit any of these rules and is actually a valid file/directory name.

If the path doesn't end in a separator, all trailing periods and spaces (character code 32 only) will be removed. If the last segment is simply a single or double period it falls under the relative components rule above. This rule leads to the possibly surprising ability to create a directory with a trailing space. You simply need to add a trailing separator to do so.
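The two trimming rules can be sketched per segment in Python. The helper trim_segment and its is_last flag are my own naming; is_last means the segment is the final one in a path that does not end in a separator. This follows the description above and is not the Windows implementation.

```python
def trim_segment(segment, is_last):
    """Trim trailing characters from one path segment (hypothetical helper)."""
    if segment in (".", ".."):
        return segment  # handled by the relative component rule instead
    if is_last:
        # Final segment with no trailing separator: strip all trailing
        # periods and spaces (character code 32 only).
        return segment.rstrip(". ")
    if segment.endswith(".") and not segment.endswith(".."):
        return segment[:-1]  # a single trailing period is removed
    return segment
```

For example, the segment "a." becomes "a" in the middle of a path, while "a. " and "a.." are left alone there. As the last segment, "foo." becomes "foo".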

Skipping Normalization

Normally any path passed to a Windows API is (effectively) passed to GetFullPathName() and normalized. There is one important exception: a device path that begins with a question mark instead of a period. It must use the canonical backslash; if the path does not start with exactly \\?\ it will be normalized.
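The check itself is just an exact prefix match. As a tiny Python sketch (skips_normalization is a hypothetical name, not an API):

```python
def skips_normalization(path):
    """True only when the path starts with exactly \\\\?\\ (hypothetical helper)."""
    # Forward slashes or the \\.\ prefix do not qualify.
    return path.startswith("\\\\?\\")
```

So \\?\C:\Sample\foo. is passed through untouched, while //?/C:/Sample/foo. and \\.\C:\Sample\foo. are still normalized.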

Why would you want to skip normalization? One reason is to get access to paths that are normally unavailable, but legal in NTFS/FAT/etc. A file or directory called "foo." for example, is impossible to access any other way. You also get to avoid some cycles by skipping normalization if you've already normalized.

The last reason is that the MAX_PATH check for path length is skipped as well, allowing for paths that are greater than 259 characters long. Most APIs will allow this, with some notable exceptions, such as Get/SetCurrentDirectory.

Skipping normalization and max path checks is the only difference between the two device path syntaxes- they are otherwise identical. Tread carefully with skipping normalization as you can easily create paths that are difficult for "normal" applications to deal with.

Paths that start with \\?\ are normalized if you explicitly pass them to GetFullPathName(). Don't forget, however, that rooting is different with device syntax (C:\.. does not normalize the same as \\?\C:\..). Note that you can pass paths longer than MAX_PATH to GetFullPathName() without \\?\. It supports arbitrary length paths (well, currently up to the maximum string size that Windows can handle, see UNICODE_STRING).

Up Next

The journey of a path from DOS to NT format. See how everything maps together...

Stupid DOS Tricks

C:\Sample>md "bar "

04/21/2016  03:06 PM                   bar

C:\Sample>md "bar \"

04/21/2016  03:06 PM                   bar
04/21/2016  03:06 PM                   bar 

C:\Sample>echo foo > "foo."

04/21/2016  03:06 PM                   bar
04/21/2016  03:06 PM                   bar 
04/21/2016  03:31 PM                 6 foo

C:\Sample>echo foo > "\\?\c:\sample\foo."

04/21/2016  03:06 PM                   bar
04/21/2016  03:06 PM                   bar 
04/21/2016  03:31 PM                 6 foo
04/21/2016  03:32 PM                 6 foo.

Experimenting in Code

Output when run from the "D:" drive.

Path 'C:\Foo\ ' becomes 'C:\Foo\'
Path 'C:\Foo\ . . .' becomes 'C:\Foo\'
Path 'C:\Foo\.' becomes 'C:\Foo'
Path 'C:\Foo\ \' becomes 'C:\Foo\ \'
Path 'C:\Foo\ . . .\' becomes 'C:\Foo\ . . \'
Path 'C:\Foo\a.\' becomes 'C:\Foo\a\'
Path 'C:\Foo\a. \' becomes 'C:\Foo\a. \'
Path 'C:\Foo\a..\' becomes 'C:\Foo\a..\'
Path 'C:\Foo\.\' becomes 'C:\Foo\'
Path 'C:\Foo\..\' becomes 'C:\'
Path '\\LOCALHOST\Share\Foo\..\' becomes '\\LOCALHOST\Share\'
Path 'C:\Foo\..\..\' becomes 'C:\'
Path '\\LOCALHOST\Share\Foo\..\..\' becomes '\\LOCALHOST\Share\'
Path 'C:\..' becomes 'C:\'
Path '\\LOCALHOST\Share\..' becomes '\\LOCALHOST\Share'
Path '\\?\C:\Foo\..\..' becomes '\\?\'
Path '\\.\C:\Foo\..\..' becomes '\\.\'
Path 'CON' becomes '\\.\CON'
Path 'LPT1' becomes '\\.\LPT1'
Path '\Foo' becomes 'D:\Foo'
Path 'Foo' becomes 'D:\projects\GetFullPathNameSample\GetFullPathNameSample\bin\Debug\Foo'
Path 'C:Foo\Bar' becomes 'C:\Program Files\Foo\Bar'
Path 'C:/Foo/' becomes 'C:\Foo\'
Path 'C://\Foo/' becomes 'C:\Foo\'
Path '//.' becomes '\\.\'
Path '//?' becomes '\\?\'
Path '//LOCALHOST/Share/..' becomes '\\LOCALHOST\Share'
Path '///LOCALHOST/Share/..' becomes '\\\LOCALHOST'
Path '////LOCALHOST/Share/..' becomes '\\\LOCALHOST'
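
If you want to try this yourself, output in the same shape as the listing above could be produced by calling GetFullPathName() directly. Here is a minimal sketch in Python using ctypes. The wrapper name get_full_path_name is my own, it only works on Windows, and any results for relative paths depend on the current directory of the process you run it from.

```python
import ctypes
import sys


def get_full_path_name(path):
    """Call the Win32 GetFullPathNameW API for path (Windows only)."""
    if sys.platform != "win32":
        raise OSError("GetFullPathNameW is only available on Windows")

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    get_full = kernel32.GetFullPathNameW
    # DWORD GetFullPathNameW(LPCWSTR, DWORD, LPWSTR, LPWSTR*)
    get_full.argtypes = [
        ctypes.c_wchar_p,
        ctypes.c_uint32,
        ctypes.c_wchar_p,
        ctypes.c_void_p,
    ]
    get_full.restype = ctypes.c_uint32

    # First call with an empty buffer to learn the required size,
    # which includes the terminating null character.
    needed = get_full(path, 0, None, None)
    if needed == 0:
        raise ctypes.WinError(ctypes.get_last_error())

    buffer = ctypes.create_unicode_buffer(needed)
    if get_full(path, needed, buffer, None) == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return buffer.value


if __name__ == "__main__" and sys.platform == "win32":
    for sample in ("C:\\Foo\\.", "C:\\Foo\\a.\\", "CON", "\\Foo", "C:/Foo/"):
        print(f"Path '{sample}' becomes '{get_full_path_name(sample)}'")
```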

Comments

  • Anonymous
    October 18, 2017
    The comment has been removed
    • Anonymous
      October 23, 2017
      I personally lean towards not putting trailing slashes on unless they are absolutely needed. Technically trailing slashes are only ever valid in NT if you're talking about the root directory on a drive (e.g. C:\). Many tools and .NET will helpfully trim the final slash before passing the path to Win32, as the paths will eventually get kicked back.
  • Anonymous
    December 19, 2017
    For completeness, since this memo is so comprehensive on the topic, you may want to discuss issues related to paths which use volume GUID syntax, i.e.: \\?\Volume{efab0df5-8438-4657-919f-ec4d6177d9e2}... Perhaps mention any special issues with these forms, whether they differ from--or how they're distinguished from--UNC paths, and add some relevant examples to your useful demonstration code? Thanks.
    • Anonymous
      January 23, 2018
      Thanks for the suggestion. How the aliases are generated and how volumes are managed is a pretty meaty, but interesting topic.