Rework graphics API to Vulkan - EnTT for ECS + 3D model loading via assimp + SDL3 for input events and windowing + mesh, texture, camera, transform working + note: the new assets are not all committed and there is hard-coded test code in scene addentity + global restructuring

This commit is contained in:
Tom Ray
2026-03-14 20:24:17 +01:00
parent 7c352bc280
commit 6695d46bcd
672 changed files with 238656 additions and 1821 deletions

lib/All/slang/LICENSE Normal file

@@ -0,0 +1,29 @@
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
LLVM Exceptions to the Apache 2.0 License
As an exception, if, as a result of your compiling your source code, portions
of this Software are embedded into an Object form of such source code, you
may redistribute such embedded portions in such Object form without complying
with the conditions of Sections 4(a), 4(b) and 4(d) of the License.
In addition, if you combine or link compiled forms of this Software with
software that is licensed under the GPLv2 ("Combined Software") and if a
court of competent jurisdiction determines that the patent provision (Section
3), the indemnity provision (Section 9) or other Section of the License
conflicts with the conditions of the GPLv2, you may retroactively and
prospectively choose to deem waived or otherwise exclude such Section(s) of
the License, but only in their entirety and only with respect to the Combined
Software.

lib/All/slang/README.md Normal file

@@ -0,0 +1,156 @@
Slang
=====
![CI Status](https://github.com/shader-slang/slang/actions/workflows/ci.yml/badge.svg?branch=master)
![CTS Status](https://github.com/shader-slang/slang/actions/workflows/vk-gl-cts-nightly.yml/badge.svg)
Slang is a shading language that makes it easier to build and maintain large shader codebases in a modular and extensible fashion, while also maintaining the highest possible performance on modern GPUs and graphics APIs.
Slang is based on years of collaboration between researchers at NVIDIA, Carnegie Mellon University, Stanford, MIT, UCSD and the University of Washington.
Why Slang?
---------------
The Slang shading language is designed to enable real-time graphics developers to work with large-scale, high-performance shader code.
### Write Shaders Once, Run Anywhere
The Slang compiler can generate code for a wide variety of targets: D3D12, Vulkan, Metal, D3D11, CUDA, and even generate code to run on a CPU. For textual targets, such as Metal Shading Language (MSL) and CUDA, Slang produces readable code that preserves original identifier names, as well as the type and call structure, making it easier to debug.
### Access the Latest GPU Features
Slang code is highly portable, but can still leverage unique platform capabilities, including the latest features in Direct3D and Vulkan. For example, developers can make full use of [pointers](https://shader-slang.com/slang/user-guide/convenience-features.html#pointers-limited) when generating SPIR-V.
Slang's [capability system](https://shader-slang.com/slang/user-guide/capabilities.html) helps applications manage feature set differences across target platforms by ensuring code only uses available features during the type-checking step, before generating final code. Additionally, Slang provides [flexible interop](https://shader-slang.com/slang/user-guide/a1-04-interop.html) features to enable directly embedding target code or SPIR-V into generated shaders.
### Leverage Neural Graphics with Automatic Differentiation
Slang can [automatically generate both forward and backward derivative propagation code](https://shader-slang.com/slang/user-guide/autodiff.html) for complex functions that involve arbitrary control flow and dynamic dispatch. This allows existing rendering codebases to easily become differentiable, or for Slang to serve as the kernel language in a PyTorch-driven machine learning framework via [`slangtorch`](https://shader-slang.com/slang/user-guide/a1-02-slangpy.html).
### Scalable Software Development with Modules
Slang provides a [module system](https://shader-slang.com/slang/user-guide/modules.html) that enables logical organization of code for separate compilation. Slang modules can be independently compiled offline to a custom IR (with optional obfuscation) and then linked at runtime to generate code in formats such as DXIL or SPIR-V.
### Code Specialization that Works with Modules
Slang supports [generics and interfaces](https://shader-slang.com/slang/user-guide/interfaces-generics.html) (a.k.a. type traits/protocols), allowing for clear expression of shader specialization without the need for preprocessor techniques or string-pasting. Unlike C++ templates, Slang's generics are pre-checked and don't produce cascading error messages that are difficult to diagnose. The same generic shader can be specialized for a variety of different types to produce specialized code ahead of time, or on the fly, entirely under application control.
### Easy On-ramp for HLSL and GLSL Codebases
Slang's syntax is similar to HLSL, and most existing HLSL code can be compiled with the Slang compiler out-of-the-box, or with just minor modifications. This allows existing shader codebases to immediately benefit from Slang without requiring a complete rewrite or port.
Slang provides a compatibility module that enables the use of most GLSL intrinsic functions and GLSL's parameter binding syntax.
### Comprehensive Tooling Support
Slang comes with full support of IntelliSense editing features in Visual Studio Code and Visual Studio through the Language Server Protocol.
Full debugging capabilities are also available through RenderDoc and SPIR-V based tools.
Getting Started
---------------
The fastest way to get started using Slang in your own development is to use a pre-built binary package, available through GitHub [releases](https://github.com/shader-slang/slang/releases).
Slang binaries are also included in the [Vulkan SDK](https://vulkan.lunarg.com/sdk/home) since version 1.3.296.0.
There are packages built for x86_64 and aarch64 Windows, Linux and macOS.
Each binary release includes the command-line `slangc` compiler, a shared library for the compiler, and the `slang.h` header.
See the user-guide for info on using the `slangc` command-line tool: [Slang Command Line Usage](
https://shader-slang.com/slang/user-guide/compiling.html#command-line-compilation-with-slangc).
If you want to try out the Slang language without installing anything, a fast and simple way is to use the [Slang Playground](https://shader-slang.com/slang-playground). The playground allows you to compile Slang code to a variety of targets, and even run some simple shaders directly within the browser. The playground loads the Slang compiler into your browser and runs all compilation locally. No data is sent to any servers.
If you would like to build Slang from source, please consult the [build instructions](docs/building.md).
Documentation
-------------
The Slang project provides a variety of different [documentation](docs/), but most users would be well served starting with the [User's Guide](https://shader-slang.github.io/slang/user-guide/).
For developers writing Slang code, the [Slang Core Module Reference](https://shader-slang.com/stdlib-reference/) provides detailed documentation on Slang's built-in types and functions.
We also provide a few [examples](examples/) of how to integrate Slang into a rendering application.
These examples use a graphics layer that we include with Slang called "GFX" which is an abstraction library of various graphics APIs (D3D11, D3D12, OpenGL, Vulkan, CUDA, and the CPU) to support cross-platform applications using GPU graphics and compute capabilities.
GFX is being deprecated in favor of [slang-rhi](https://github.com/shader-slang/slang-rhi).
Additionally, we recommend checking out [Vulkan Mini Examples](https://github.com/nvpro-samples/vk_mini_samples/) for more examples of using Slang's language features available on Vulkan, such as pointers and the ray tracing intrinsics.
Contributing
------------
If you'd like to contribute to the project, we are excited to have your input.
The following guidelines should be observed by contributors:
* Please follow the contributor [Code of Conduct](CODE_OF_CONDUCT.md).
* Bug reports and feature requests should go through the GitHub issue tracker
* Changes should ideally come in as small pull requests on top of `master`, coming from your own personal fork of the project
* Large features that will involve multiple contributors or a long development time should be discussed in issues, and broken down into smaller pieces that can be implemented and checked in in stages
The [Contribution guide](CONTRIBUTING.md) describes the workflow for contributors in more detail.
Limitations and Support
-----------------------
### Platform support
The Slang compiler and libraries can be built on the following platforms:
| Windows | Linux | macOS | WebAssembly |
|:---------:|:---------:|:---------:|:------------:|
| supported | supported | supported | experimental |
Both `x86_64` and `aarch64` architectures are supported on Windows, Linux and macOS platforms.
### Target support
Slang can compile shader code to the following targets:
| Target | Status | Output Formats |
|:-----------:|:-------------------------------------------------------------------------------------:|:----------------------------------------------------------------:|
| Direct3D 11 | [supported](https://shader-slang.com/slang/user-guide/targets.html#direct3d-11) | HLSL |
| Direct3D 12 | [supported](https://shader-slang.com/slang/user-guide/targets.html#direct3d-12) | HLSL |
| Vulkan | [supported](https://shader-slang.com/slang/user-guide/targets.html#vulkan) | SPIRV, GLSL |
| Metal | [experimental*](https://shader-slang.com/slang/user-guide/targets.html#metal) | Metal Shading Language |
| WebGPU | experimental** | WGSL |
| CUDA | [supported](https://shader-slang.com/slang/user-guide/targets.html#cuda-and-optix) | C++ (compute only) |
| Optix | [experimental](https://shader-slang.com/slang/user-guide/targets.html#cuda-and-optix) | C++ (WIP) |
| CPU | [experimental](https://shader-slang.com/slang/user-guide/targets.html#cpu-compute) | C++ (kernel), C++ (host), standalone executable, dynamic library |
> *Slang currently supports generating vertex, fragment, compute, task and mesh
> shaders for Metal.
> **WGSL support is still a work in progress.
For greater detail, see the [Supported Compilation
Targets](https://shader-slang.com/slang/user-guide/targets.html) section of the
[User Guide](https://shader-slang.github.io/slang/user-guide/).
The Slang project has been used for production applications and large shader
codebases, but it is still under active development. Support is currently
focused on the platforms (Windows, Linux) and target APIs (Direct3D 12, Vulkan)
where Slang is used most heavily. Users who are looking for support on other
platforms or APIs should coordinate with the development team via the issue
tracker to make sure that their use cases can be supported.
License
-------
The Slang code itself is under the Apache 2.0 with LLVM Exception license (see [LICENSE](LICENSE)).
Builds of the core Slang tools depend on the following projects, either automatically or optionally, which may have their own licenses:
* [`glslang`](https://github.com/KhronosGroup/glslang) (BSD)
* [`lz4`](https://github.com/lz4/lz4) (BSD)
* [`miniz`](https://github.com/richgel999/miniz) (MIT)
* [`spirv-headers`](https://github.com/KhronosGroup/SPIRV-Headers) (Modified MIT)
* [`spirv-tools`](https://github.com/KhronosGroup/SPIRV-Tools) (Apache 2.0)
* [`ankerl::unordered_dense::{map, set}`](https://github.com/martinus/unordered_dense) (MIT)
Slang releases may include [LLVM](https://github.com/llvm/llvm-project) under the license:
* [`llvm`](https://llvm.org/docs/DeveloperPolicy.html#new-llvm-project-license-framework) (Apache 2.0 License with LLVM exceptions)
Some of the tests and example programs that build with Slang use the following projects, which may have their own licenses:
* [`glm`](https://github.com/g-truc/glm) (MIT)
* `stb_image` and `stb_image_write` from the [`stb`](https://github.com/nothings/stb) collection of single-file libraries (Public Domain)
* [`tinyobjloader`](https://github.com/tinyobjloader/tinyobjloader) (MIT)

lib/All/slang/bin/gfx.slang Normal file

File diff suppressed because it is too large.


@@ -0,0 +1,446 @@
public namespace slang
{
public typedef int32_t Result;
public typedef uint64_t Size;
public typedef int64_t Int;
public typedef uint64_t UInt;
/*!
@brief Severity of a diagnostic generated by the compiler.
Values come from the enum below, with higher values representing more severe
conditions, and all values >= SLANG_SEVERITY_ERROR indicating compilation
failure.
*/
public enum SlangSeverity
{
SLANG_SEVERITY_DISABLED = 0, /**< A message that is disabled, filtered out. */
SLANG_SEVERITY_NOTE, /**< An informative message. */
SLANG_SEVERITY_WARNING, /**< A warning, which indicates a possible problem. */
SLANG_SEVERITY_ERROR, /**< An error, indicating that compilation failed. */
SLANG_SEVERITY_FATAL, /**< An unrecoverable error, which forced compilation to abort. */
SLANG_SEVERITY_INTERNAL, /**< An internal error, indicating a logic error in the compiler. */
};
public enum SlangDiagnosticFlags
{
SLANG_DIAGNOSTIC_FLAG_VERBOSE_PATHS = 0x01,
SLANG_DIAGNOSTIC_FLAG_TREAT_WARNINGS_AS_ERRORS = 0x02
};
public enum SlangBindableResourceType
{
SLANG_NON_BINDABLE = 0,
SLANG_TEXTURE,
SLANG_SAMPLER,
SLANG_UNIFORM_BUFFER,
SLANG_STORAGE_BUFFER,
};
public enum SlangCompileTarget
{
SLANG_TARGET_UNKNOWN,
SLANG_TARGET_NONE,
SLANG_GLSL,
SLANG_GLSL_VULKAN, //< deprecated: just use `SLANG_GLSL`
SLANG_GLSL_VULKAN_ONE_DESC, //< deprecated
SLANG_HLSL,
SLANG_SPIRV,
SLANG_SPIRV_ASM,
SLANG_DXBC,
SLANG_DXBC_ASM,
SLANG_DXIL,
SLANG_DXIL_ASM,
SLANG_C_SOURCE, ///< The C language
SLANG_CPP_SOURCE, ///< C++ code for shader kernels.
SLANG_CPP_PYTORCH_BINDING,
SLANG_HOST_EXECUTABLE, ///< Standalone binary executable (for hosting CPU/OS)
SLANG_SHADER_SHARED_LIBRARY, ///< A shared library/Dll for shader kernels (for hosting CPU/OS)
SLANG_SHADER_HOST_CALLABLE, ///< A CPU target that makes the compiled shader code available to be run immediately
SLANG_CUDA_SOURCE, ///< Cuda source
SLANG_PTX, ///< PTX
SLANG_OBJECT_CODE, ///< Object code that can be used for later linking
SLANG_HOST_CPP_SOURCE, ///< C++ code for host library or executable.
SLANG_HOST_HOST_CALLABLE, ///<
SLANG_CPP_HEADER, ///< C++ header for shader kernels.
SLANG_CUDA_HEADER, ///< Cuda header
SLANG_TARGET_COUNT_OF,
};
/* A "container format" describes the way that the outputs
for multiple files, entry points, targets, etc. should be
combined into a single artifact for output. */
public enum SlangContainerFormat
{
/* Don't generate a container. */
SLANG_CONTAINER_FORMAT_NONE,
/* Generate a container in the `.slang-module` format,
which includes reflection information, compiled kernels, etc. */
SLANG_CONTAINER_FORMAT_SLANG_MODULE,
};
public enum SlangPassThrough : int
{
SLANG_PASS_THROUGH_NONE,
SLANG_PASS_THROUGH_FXC,
SLANG_PASS_THROUGH_DXC,
SLANG_PASS_THROUGH_GLSLANG,
SLANG_PASS_THROUGH_SPIRV_DIS,
SLANG_PASS_THROUGH_CLANG, ///< Clang C/C++ compiler
SLANG_PASS_THROUGH_VISUAL_STUDIO, ///< Visual studio C/C++ compiler
SLANG_PASS_THROUGH_GCC, ///< GCC C/C++ compiler
SLANG_PASS_THROUGH_GENERIC_C_CPP, ///< Generic C or C++ compiler, which is decided by the source type
SLANG_PASS_THROUGH_NVRTC, ///< NVRTC Cuda compiler
SLANG_PASS_THROUGH_LLVM, ///< LLVM 'compiler' - includes LLVM and Clang
SLANG_PASS_THROUGH_SPIRV_OPT,
SLANG_PASS_THROUGH_COUNT_OF,
};
/* Defines an archive type used to hold a 'file system' style structure. */
public enum SlangArchiveType : int
{
SLANG_ARCHIVE_TYPE_UNDEFINED,
SLANG_ARCHIVE_TYPE_ZIP,
SLANG_ARCHIVE_TYPE_RIFF, ///< Riff container with no compression
SLANG_ARCHIVE_TYPE_RIFF_DEFLATE,
SLANG_ARCHIVE_TYPE_RIFF_LZ4,
SLANG_ARCHIVE_TYPE_COUNT_OF,
};
/*!
Flags to control compilation behavior.
*/
public enum SlangCompileFlags
{
/* Do as little mangling of names as possible, to try to preserve original names */
SLANG_COMPILE_FLAG_NO_MANGLING = 1 << 3,
/* Skip code generation step, just check the code and generate layout */
SLANG_COMPILE_FLAG_NO_CODEGEN = 1 << 4,
/* Obfuscate shader names on release products */
SLANG_COMPILE_FLAG_OBFUSCATE = 1 << 5,
/* Deprecated flags: kept around to allow existing applications to
compile. Note that the relevant features will still be left in
their default state. */
SLANG_COMPILE_FLAG_NO_CHECKING = 0,
SLANG_COMPILE_FLAG_SPLIT_MIXED_TYPES = 0,
};
/*!
@brief Flags to control code generation behavior of a compilation target */
public enum SlangTargetFlags
{
None = 0,
/* When compiling for a D3D Shader Model 5.1 or higher target, allocate
distinct register spaces for parameter blocks.
@deprecated This behavior is now enabled unconditionally.
*/
SLANG_TARGET_FLAG_PARAMETER_BLOCKS_USE_REGISTER_SPACES = 1 << 4,
/* When set, will generate target code that contains all entrypoints defined
in the input source or specified via the `spAddEntryPoint` function in a
single output module (library/source file).
*/
SLANG_TARGET_FLAG_GENERATE_WHOLE_PROGRAM = 1 << 8,
/* When set, will dump out the IR between intermediate compilation steps.*/
SLANG_TARGET_FLAG_DUMP_IR = 1 << 9,
/* When set, will generate SPIRV directly instead of going through glslang. */
SLANG_TARGET_FLAG_GENERATE_SPIRV_DIRECTLY = 1 << 10,
};
/*!
@brief Options to control floating-point precision guarantees for a target.
*/
public enum SlangFloatingPointMode
{
SLANG_FLOATING_POINT_MODE_DEFAULT = 0,
SLANG_FLOATING_POINT_MODE_FAST,
SLANG_FLOATING_POINT_MODE_PRECISE,
};
/*!
@brief Options to control emission of `#line` directives
*/
public enum SlangLineDirectiveMode
{
SLANG_LINE_DIRECTIVE_MODE_DEFAULT = 0, /**< Default behavior: pick behavior based on target. */
SLANG_LINE_DIRECTIVE_MODE_NONE, /**< Don't emit line directives at all. */
SLANG_LINE_DIRECTIVE_MODE_STANDARD, /**< Emit standard C-style `#line` directives. */
SLANG_LINE_DIRECTIVE_MODE_GLSL, /**< Emit GLSL-style directives with file *number* instead of name */
};
public enum SlangSourceLanguage : int
{
SLANG_SOURCE_LANGUAGE_UNKNOWN,
SLANG_SOURCE_LANGUAGE_SLANG,
SLANG_SOURCE_LANGUAGE_HLSL,
SLANG_SOURCE_LANGUAGE_GLSL,
SLANG_SOURCE_LANGUAGE_C,
SLANG_SOURCE_LANGUAGE_CPP,
SLANG_SOURCE_LANGUAGE_CUDA,
SLANG_SOURCE_LANGUAGE_COUNT_OF,
};
public enum SlangProfileID
{
SLANG_PROFILE_UNKNOWN,
};
public enum SlangCapabilityID
{
SLANG_CAPABILITY_UNKNOWN = 0,
};
public enum SlangMatrixLayoutMode
{
SLANG_MATRIX_LAYOUT_MODE_UNKNOWN = 0,
SLANG_MATRIX_LAYOUT_ROW_MAJOR,
SLANG_MATRIX_LAYOUT_COLUMN_MAJOR,
};
public enum SlangStage
{
SLANG_STAGE_NONE,
SLANG_STAGE_VERTEX,
SLANG_STAGE_HULL,
SLANG_STAGE_DOMAIN,
SLANG_STAGE_GEOMETRY,
SLANG_STAGE_FRAGMENT,
SLANG_STAGE_COMPUTE,
SLANG_STAGE_RAY_GENERATION,
SLANG_STAGE_INTERSECTION,
SLANG_STAGE_ANY_HIT,
SLANG_STAGE_CLOSEST_HIT,
SLANG_STAGE_MISS,
SLANG_STAGE_CALLABLE,
SLANG_STAGE_MESH,
SLANG_STAGE_AMPLIFICATION,
};
public enum SlangDebugInfoLevel
{
SLANG_DEBUG_INFO_LEVEL_NONE = 0, /**< Don't emit debug information at all. */
SLANG_DEBUG_INFO_LEVEL_MINIMAL, /**< Emit as little debug information as possible, while still supporting stack traces. */
SLANG_DEBUG_INFO_LEVEL_STANDARD, /**< Emit whatever is the standard level of debug information for each target. */
SLANG_DEBUG_INFO_LEVEL_MAXIMAL, /**< Emit as much debug information as possible for each target. */
};
public enum SlangOptimizationLevel
{
SLANG_OPTIMIZATION_LEVEL_NONE = 0, /**< Don't optimize at all. */
SLANG_OPTIMIZATION_LEVEL_DEFAULT, /**< Default optimization level: balance code quality and compilation time. */
SLANG_OPTIMIZATION_LEVEL_HIGH, /**< Optimize aggressively. */
SLANG_OPTIMIZATION_LEVEL_MAXIMAL, /**< Include optimizations that may take a very long time, or may involve severe space-vs-speed tradeoffs */
};
public enum SlangTypeKind
{
NONE,
STRUCT,
ARRAY,
MATRIX,
VECTOR,
SCALAR,
CONSTANT_BUFFER,
RESOURCE,
SAMPLER_STATE,
TEXTURE_BUFFER,
SHADER_STORAGE_BUFFER,
PARAMETER_BLOCK,
GENERIC_TYPE_PARAMETER,
INTERFACE,
OUTPUT_STREAM,
SPECIALIZED,
FEEDBACK,
COUNT,
};
public enum SlangScalarType
{
NONE,
VOID,
BOOL,
INT32,
UINT32,
INT64,
UINT64,
FLOAT16,
FLOAT32,
FLOAT64,
INT8,
UINT8,
INT16,
UINT16,
};
public struct TypeReflection
{
};
public enum CompileStdLibFlags
{
WriteDocumentation = 0x1,
};
[COM("8BA5FB08-5195-40e2-AC58-0D-98-9C-3A-01-02")]
public interface ISlangBlob
{
public void *getBufferPointer();
public Size getBufferSize();
};
/** Description of a code generation target.
*/
public struct TargetDesc
{
/** The size of this structure, in bytes.
*/
public Size structureSize = 40;
/** The target format to generate code for (e.g., SPIR-V, DXIL, etc.)
*/
public SlangCompileTarget format = SlangCompileTarget.SLANG_TARGET_UNKNOWN;
/** The compilation profile supported by the target (e.g., "Shader Model 5.1")
*/
public SlangProfileID profile = SlangProfileID.SLANG_PROFILE_UNKNOWN;
/** Flags for the code generation target. Currently unused. */
public SlangTargetFlags flags = SlangTargetFlags.None;
/** Default mode to use for floating-point operations on the target.
*/
public SlangFloatingPointMode floatingPointMode = SlangFloatingPointMode.SLANG_FLOATING_POINT_MODE_DEFAULT;
/** Optimization level to use for the target.
*/
public SlangOptimizationLevel optimizationLevel = SlangOptimizationLevel.SLANG_OPTIMIZATION_LEVEL_DEFAULT;
/** The line directive mode for output source code.
*/
public SlangLineDirectiveMode lineDirectiveMode = SlangLineDirectiveMode.SLANG_LINE_DIRECTIVE_MODE_DEFAULT;
/** Whether to force `scalar` layout for glsl shader storage buffers.
*/
public bool forceGLSLScalarBufferLayout = false;
};
public enum SessionFlags
{
kSessionFlags_None = 0
};
public struct PreprocessorMacroDesc
{
public NativeString name;
public NativeString value;
};
public struct SessionDesc
{
/** The size of this structure, in bytes.
*/
public Size structureSize = 72;
/** Code generation targets to include in the session.
*/
public TargetDesc *targets = nullptr;
public Int targetCount = 0;
/** Flags to configure the session.
*/
public SessionFlags flags = SessionFlags.kSessionFlags_None;
/** Default layout to assume for variables with matrix types.
*/
public SlangMatrixLayoutMode defaultMatrixLayoutMode = SlangMatrixLayoutMode.SLANG_MATRIX_LAYOUT_ROW_MAJOR;
/** Paths to use when searching for `#include`d or `import`ed files.
*/
public NativeString *searchPaths = nullptr;
public Int searchPathCount = 0;
public PreprocessorMacroDesc *preprocessorMacros = nullptr;
public Int preprocessorMacroCount = 0;
public void *fileSystem = nullptr;
};
/** A global session for interaction with the Slang library.
An application may create and re-use a single global session across
multiple sessions, in order to amortize startup costs (in current
Slang this is mostly the cost of loading the Slang standard library).
The global session is currently *not* thread-safe and objects created from
a single global session should only be used from a single thread at
a time.
*/
[COM("c140b5fd-0c78-452e-ba7c-1a-1e-70-c7-f7-1c")]
public interface IGlobalSession
{
};
public enum class ContainerType
{
None, UnsizedArray, StructuredBuffer, ConstantBuffer, ParameterBlock
};
/** A session provides a scope for code that is loaded.
A session can be used to load modules of Slang source code,
and to request target-specific compiled binaries and layout
information.
In order to be able to load code, the session owns a set
of active "search paths" for resolving `#include` directives
and `import` declarations, as well as a set of global
preprocessor definitions that will be used for all code
that gets `import`ed in the session.
If multiple user shaders are loaded in the same session,
and import the same module (e.g., two source files do `import X`)
then there will only be one copy of `X` loaded within the session.
In order to be able to generate target code, the session
owns a list of available compilation targets, which specify
code generation options.
Code loaded and compiled within a session is owned by the session
and will remain resident in memory until the session is released.
Applications wishing to control the memory usage for compiled
and loaded code should use multiple sessions.
*/
[COM("67618701-d116-468f-ab3b-47-4b-ed-ce-0e-3d")]
public interface ISession
{
};
[COM("5bc42be8-5c50-4929-9e5e-d15e7c24015f")]
public interface IComponentType
{
}
public struct TypeLayoutReflection { }
/** The kind of specialization argument. */
public enum class SpecializationArgKind : int32_t
{
Unknown, /**< An invalid specialization argument. */
Type, /**< Specialize to a type. */
};
public struct SpecializationArg
{
public SpecializationArgKind kind;
/** A type specialization argument, used for `Kind::Type`. */
public TypeReflection *type;
}
}

lib/All/slang/bin/slangc Executable file

Binary file not shown.

lib/All/slang/bin/slangd Executable file

Binary file not shown.

lib/All/slang/bin/slangi Executable file

Binary file not shown.


@@ -0,0 +1,191 @@
#ifndef SLANG_COM_HELPER_H
#define SLANG_COM_HELPER_H
/** \file slang-com-helper.h
*/
#include "slang.h"
#include <algorithm>
#include <atomic>
#include <iterator>
/* !!!!!!!!!!!!!!!!!!!!! Macros to help checking SlangResult !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!*/
/*! Set SLANG_HANDLE_RESULT_FAIL(x) to code to be executed whenever a failure is detected
 * by one of the macros */
#ifndef SLANG_HANDLE_RESULT_FAIL
#define SLANG_HANDLE_RESULT_FAIL(x)
#endif
//! Helper macro that makes it easy to add result checking to calls in functions/methods that
//! themselves return Result.
#define SLANG_RETURN_ON_FAIL(x) \
{ \
SlangResult _res = (x); \
if (SLANG_FAILED(_res)) \
{ \
SLANG_HANDLE_RESULT_FAIL(_res); \
return _res; \
} \
}
//! Helper macro that can be used to test the return value from a call, and will return in a void
//! method/function
#define SLANG_RETURN_VOID_ON_FAIL(x) \
{ \
SlangResult _res = (x); \
if (SLANG_FAILED(_res)) \
{ \
SLANG_HANDLE_RESULT_FAIL(_res); \
return; \
} \
}
//! Helper macro that will return false on failure.
#define SLANG_RETURN_FALSE_ON_FAIL(x) \
{ \
SlangResult _res = (x); \
if (SLANG_FAILED(_res)) \
{ \
SLANG_HANDLE_RESULT_FAIL(_res); \
return false; \
} \
}
//! Helper macro that will return nullptr on failure.
#define SLANG_RETURN_NULL_ON_FAIL(x) \
{ \
SlangResult _res = (x); \
if (SLANG_FAILED(_res)) \
{ \
SLANG_HANDLE_RESULT_FAIL(_res); \
return nullptr; \
} \
}
//! Helper macro that will assert if the return code from a call is failure, also returns the
//! failure.
#define SLANG_ASSERT_ON_FAIL(x) \
{ \
SlangResult _res = (x); \
if (SLANG_FAILED(_res)) \
{ \
assert(false); \
return _res; \
} \
}
//! Helper macro that will assert if the result from a call is a failure, also returns.
#define SLANG_ASSERT_VOID_ON_FAIL(x) \
{ \
SlangResult _res = (x); \
if (SLANG_FAILED(_res)) \
{ \
assert(false); \
return; \
} \
}
/* !!!!!!!!!!!!!!!!!!!!!!! C++ helpers !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!*/
#if defined(__cplusplus)
namespace Slang
{
// Alias SlangResult to Slang::Result
typedef SlangResult Result;
// Alias SlangUUID to Slang::Guid
typedef SlangUUID Guid;
} // namespace Slang
// Operator == and != for Guid/SlangUUID
SLANG_FORCE_INLINE bool operator==(const Slang::Guid& aIn, const Slang::Guid& bIn)
{
return aIn.data1 == bIn.data1 && aIn.data2 == bIn.data2 && aIn.data3 == bIn.data3 &&
std::equal(aIn.data4, aIn.data4 + std::size(aIn.data4), bIn.data4);
}
SLANG_FORCE_INLINE bool operator!=(const Slang::Guid& a, const Slang::Guid& b)
{
return !(a == b);
}
/* !!!!!!!! Macros to simplify implementing COM interfaces !!!!!!!!!!!!!!!!!!!!!!!!!!!! */
/* Assumes underlying implementation has a member m_refCount that is initialized to 0 and can
have ++ and -- operate on it. For SLANG_IUNKNOWN_QUERY_INTERFACE to work - must have a method
'getInterface' that returns valid pointers for the Guid, or nullptr if not found. */
#define SLANG_IUNKNOWN_QUERY_INTERFACE \
SLANG_NO_THROW SlangResult SLANG_MCALL queryInterface( \
SlangUUID const& uuid, \
void** outObject) SLANG_OVERRIDE \
{ \
ISlangUnknown* intf = getInterface(uuid); \
if (intf) \
{ \
addRef(); \
*outObject = intf; \
return SLANG_OK; \
} \
return SLANG_E_NO_INTERFACE; \
}
#define SLANG_IUNKNOWN_ADD_REF \
SLANG_NO_THROW uint32_t SLANG_MCALL addRef() SLANG_OVERRIDE \
{ \
return ++m_refCount; \
}
#define SLANG_IUNKNOWN_RELEASE \
SLANG_NO_THROW uint32_t SLANG_MCALL release() SLANG_OVERRIDE \
{ \
--m_refCount; \
if (m_refCount == 0) \
{ \
delete this; \
return 0; \
} \
return m_refCount; \
}
#define SLANG_IUNKNOWN_ALL \
SLANG_IUNKNOWN_QUERY_INTERFACE \
SLANG_IUNKNOWN_ADD_REF \
SLANG_IUNKNOWN_RELEASE
// ------------------------ RefObject IUnknown -----------------------------
#define SLANG_REF_OBJECT_IUNKNOWN_QUERY_INTERFACE \
SLANG_NO_THROW SlangResult SLANG_MCALL queryInterface( \
SlangUUID const& uuid, \
void** outObject) SLANG_OVERRIDE \
{ \
void* intf = getInterface(uuid); \
if (intf) \
{ \
addReference(); \
*outObject = intf; \
return SLANG_OK; \
} \
return SLANG_E_NO_INTERFACE; \
}
#define SLANG_REF_OBJECT_IUNKNOWN_ADD_REF \
SLANG_NO_THROW uint32_t SLANG_MCALL addRef() SLANG_OVERRIDE \
{ \
return (uint32_t)addReference(); \
}
#define SLANG_REF_OBJECT_IUNKNOWN_RELEASE \
SLANG_NO_THROW uint32_t SLANG_MCALL release() SLANG_OVERRIDE \
{ \
return (uint32_t)releaseReference(); \
}
#define SLANG_REF_OBJECT_IUNKNOWN_ALL \
SLANG_REF_OBJECT_IUNKNOWN_QUERY_INTERFACE \
SLANG_REF_OBJECT_IUNKNOWN_ADD_REF \
SLANG_REF_OBJECT_IUNKNOWN_RELEASE
#endif // defined(__cplusplus)
#endif


@@ -0,0 +1,210 @@
#ifndef SLANG_COM_PTR_H
#define SLANG_COM_PTR_H
#include "slang-com-helper.h"
#include <assert.h>
#include <cstddef>
namespace Slang
{
/*! \brief ComPtr is a simple smart pointer that manages types which implement COM based interfaces.
\details A class that implements a COM interface must derive from the IUnknown interface or a type
that matches its layout exactly (such as ISlangUnknown). Trying to use this template with a class
that doesn't follow these rules will lead to undefined behavior. This is a 'strong' pointer type, and
will AddRef when a non null pointer is set and Release when the pointer leaves scope. Using 'detach'
allows a pointer to be removed from the management of the ComPtr. To set the smart pointer to null,
there is the method setNull, or alternatively just assign SLANG_NULL/nullptr.
One edge case using the template is that sometimes you want access as a pointer to a pointer.
Sometimes this is to write into the smart pointer, other times to pass as an array. To handle these
different behaviors there are the methods readRef and writeRef, which are used instead of the &
(ref) operator. For example
\code
Void doSomething(ID3D12Resource** resources, IndexT numResources);
// ...
ComPtr<ID3D12Resource> resources[3];
doSomething(resources[0].readRef(), SLANG_COUNT_OF(resources));
\endcode
A more common scenario is writing to the pointer:
\code
IUnknown* unk = ...;
ComPtr<ID3D12Resource> resource;
Result res = unk->QueryInterface(resource.writeRef());
\endcode
*/
// Enum to force initializing as an attach (without adding a reference)
enum InitAttach
{
INIT_ATTACH
};
template<class T>
class ComPtr
{
public:
typedef T Type;
typedef ComPtr ThisType;
typedef ISlangUnknown* Ptr;
/// Constructors
/// Default Ctor. Sets to nullptr
SLANG_FORCE_INLINE ComPtr()
: m_ptr(nullptr)
{
}
SLANG_FORCE_INLINE ComPtr(std::nullptr_t)
: m_ptr(nullptr)
{
}
/// Sets, and ref counts.
SLANG_FORCE_INLINE explicit ComPtr(T* ptr)
: m_ptr(ptr)
{
if (ptr)
((Ptr)ptr)->addRef();
}
/// The copy ctor
SLANG_FORCE_INLINE ComPtr(const ThisType& rhs)
: m_ptr(rhs.m_ptr)
{
if (m_ptr)
((Ptr)m_ptr)->addRef();
}
/// Ctor without adding to ref count.
SLANG_FORCE_INLINE explicit ComPtr(InitAttach, T* ptr)
: m_ptr(ptr)
{
}
/// Ctor without adding to ref count
SLANG_FORCE_INLINE ComPtr(InitAttach, const ThisType& rhs)
: m_ptr(rhs.m_ptr)
{
}
#ifdef SLANG_HAS_MOVE_SEMANTICS
/// Move Ctor
SLANG_FORCE_INLINE ComPtr(ThisType&& rhs)
: m_ptr(rhs.m_ptr)
{
rhs.m_ptr = nullptr;
}
/// Move assign
SLANG_FORCE_INLINE ComPtr& operator=(ThisType&& rhs)
{
T* swap = m_ptr;
m_ptr = rhs.m_ptr;
rhs.m_ptr = swap;
return *this;
}
#endif
/// Destructor releases the pointer, assuming it is set
SLANG_FORCE_INLINE ~ComPtr()
{
if (m_ptr)
((Ptr)m_ptr)->release();
}
// !!! Operators !!!
/// Returns the dumb pointer
SLANG_FORCE_INLINE operator T*() const { return m_ptr; }
SLANG_FORCE_INLINE T& operator*() { return *m_ptr; }
/// For making method invocations through the smart pointer work through the dumb pointer
SLANG_FORCE_INLINE T* operator->() const { return m_ptr; }
/// Assign
SLANG_FORCE_INLINE const ThisType& operator=(const ThisType& rhs);
/// Assign from dumb ptr
SLANG_FORCE_INLINE T* operator=(T* in);
/// Get the pointer and don't ref
SLANG_FORCE_INLINE T* get() const { return m_ptr; }
/// Releases the contained pointer if set, and sets it to nullptr
SLANG_FORCE_INLINE void setNull();
/// Detach
SLANG_FORCE_INLINE T* detach()
{
T* ptr = m_ptr;
m_ptr = nullptr;
return ptr;
}
/// Set to a pointer without changing the ref count
SLANG_FORCE_INLINE void attach(T* in) { m_ptr = in; }
/// Get ready for writing (nulls contents)
SLANG_FORCE_INLINE T** writeRef()
{
setNull();
return &m_ptr;
}
/// Get for read access
SLANG_FORCE_INLINE T* const* readRef() const { return &m_ptr; }
/// Swap
void swap(ThisType& rhs);
protected:
/// Gets the address of the dumb pointer.
// Disabled: use writeRef and readRef to get a reference based on usage.
#ifndef SLANG_COM_PTR_ENABLE_REF_OPERATOR
SLANG_FORCE_INLINE T** operator&() = delete;
#endif
T* m_ptr;
};
//----------------------------------------------------------------------------
template<typename T>
void ComPtr<T>::setNull()
{
if (m_ptr)
{
((Ptr)m_ptr)->release();
m_ptr = nullptr;
}
}
//----------------------------------------------------------------------------
template<typename T>
const ComPtr<T>& ComPtr<T>::operator=(const ThisType& rhs)
{
if (rhs.m_ptr)
((Ptr)rhs.m_ptr)->addRef();
if (m_ptr)
((Ptr)m_ptr)->release();
m_ptr = rhs.m_ptr;
return *this;
}
//----------------------------------------------------------------------------
template<typename T>
T* ComPtr<T>::operator=(T* ptr)
{
if (ptr)
((Ptr)ptr)->addRef();
if (m_ptr)
((Ptr)m_ptr)->release();
m_ptr = ptr;
return m_ptr;
}
//----------------------------------------------------------------------------
template<typename T>
void ComPtr<T>::swap(ThisType& rhs)
{
T* tmp = m_ptr;
m_ptr = rhs.m_ptr;
rhs.m_ptr = tmp;
}
} // namespace Slang
#endif // SLANG_COM_PTR_H


@@ -0,0 +1,65 @@
#ifndef SLANG_CPP_HOST_PRELUDE_H
#define SLANG_CPP_HOST_PRELUDE_H
#include <cmath>
#include <cstdio>
#include <cstring>
#define SLANG_COM_PTR_ENABLE_REF_OPERATOR 1
#include "../source/slang-rt/slang-rt.h"
#include "slang-com-ptr.h"
#include "slang-cpp-types.h"
#ifdef SLANG_LLVM
#include "slang-llvm.h"
#else // SLANG_LLVM
#if SLANG_GCC_FAMILY && __GNUC__ < 6
#include <cmath>
#define SLANG_PRELUDE_STD std::
#else
#include <math.h>
#define SLANG_PRELUDE_STD
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#endif // SLANG_LLVM
// Is intptr_t a distinct type from the fixed-width integer type of the same size?
#if defined(__APPLE__)
#define SLANG_INTPTR_TYPE_IS_DISTINCT 1
#else
#define SLANG_INTPTR_TYPE_IS_DISTINCT 0
#endif
#if defined(_MSC_VER)
#define SLANG_PRELUDE_SHARED_LIB_EXPORT __declspec(dllexport)
#else
#define SLANG_PRELUDE_SHARED_LIB_EXPORT __attribute__((__visibility__("default")))
// # define SLANG_PRELUDE_SHARED_LIB_EXPORT __attribute__ ((dllexport))
// __attribute__((__visibility__("default")))
#endif
#ifdef __cplusplus
#define SLANG_PRELUDE_EXTERN_C extern "C"
#define SLANG_PRELUDE_EXTERN_C_START \
extern "C" \
{
#define SLANG_PRELUDE_EXTERN_C_END }
#else
#define SLANG_PRELUDE_EXTERN_C
#define SLANG_PRELUDE_EXTERN_C_START
#define SLANG_PRELUDE_EXTERN_C_END
#endif
#include "slang-cpp-scalar-intrinsics.h"
using namespace Slang;
template<typename TResult, typename... Args>
using Slang_FuncType = TResult(SLANG_MCALL*)(Args...);
#endif


@@ -0,0 +1,333 @@
#ifndef SLANG_CPP_PRELUDE_H
#define SLANG_CPP_PRELUDE_H
// Because the signatures of isnan, isfinite, and isinf changed in C++, we use the macro
// to use the version in the std namespace.
// https://stackoverflow.com/questions/39130040/cmath-hides-isnan-in-math-h-in-c14-c11
#ifdef SLANG_LLVM
#include "slang-llvm.h"
#else // SLANG_LLVM
#if SLANG_GCC_FAMILY && __GNUC__ < 6
#include <cmath>
#define SLANG_PRELUDE_STD std::
#else
#include <math.h>
#define SLANG_PRELUDE_STD
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#endif // SLANG_LLVM
// Is intptr_t a distinct type from the fixed-width integer type of the same size?
#if defined(__APPLE__)
#define SLANG_INTPTR_TYPE_IS_DISTINCT 1
#else
#define SLANG_INTPTR_TYPE_IS_DISTINCT 0
#endif
#if defined(_MSC_VER)
#define SLANG_PRELUDE_SHARED_LIB_EXPORT __declspec(dllexport)
#else
#define SLANG_PRELUDE_SHARED_LIB_EXPORT __attribute__((__visibility__("default")))
// # define SLANG_PRELUDE_SHARED_LIB_EXPORT __attribute__ ((dllexport))
// __attribute__((__visibility__("default")))
#endif
#ifdef __cplusplus
#define SLANG_PRELUDE_EXTERN_C extern "C"
#define SLANG_PRELUDE_EXTERN_C_START \
extern "C" \
{
#define SLANG_PRELUDE_EXTERN_C_END }
#else
#define SLANG_PRELUDE_EXTERN_C
#define SLANG_PRELUDE_EXTERN_C_START
#define SLANG_PRELUDE_EXTERN_C_END
#endif
#define SLANG_PRELUDE_EXPORT SLANG_PRELUDE_EXTERN_C SLANG_PRELUDE_SHARED_LIB_EXPORT
#define SLANG_PRELUDE_EXPORT_START SLANG_PRELUDE_EXTERN_C_START SLANG_PRELUDE_SHARED_LIB_EXPORT
#define SLANG_PRELUDE_EXPORT_END SLANG_PRELUDE_EXTERN_C_END
#ifndef INFINITY
// Must overflow for double
#define INFINITY float(1e+300 * 1e+300)
#endif
#ifndef SLANG_INFINITY
#define SLANG_INFINITY INFINITY
#endif
// Detect the compiler type
#ifndef SLANG_COMPILER
#define SLANG_COMPILER
/*
Compiler defines, see http://sourceforge.net/p/predef/wiki/Compilers/
NOTE that SLANG_VC holds the compiler version - not just 1 or 0
*/
#if defined(_MSC_VER)
#if _MSC_VER >= 1900
#define SLANG_VC 14
#elif _MSC_VER >= 1800
#define SLANG_VC 12
#elif _MSC_VER >= 1700
#define SLANG_VC 11
#elif _MSC_VER >= 1600
#define SLANG_VC 10
#elif _MSC_VER >= 1500
#define SLANG_VC 9
#else
#error "unknown version of Visual C++ compiler"
#endif
#elif defined(__clang__)
#define SLANG_CLANG 1
#elif defined(__SNC__)
#define SLANG_SNC 1
#elif defined(__ghs__)
#define SLANG_GHS 1
#elif defined(__GNUC__) /* note: __clang__, __SNC__, or __ghs__ imply __GNUC__ */
#define SLANG_GCC 1
#else
#error "unknown compiler"
#endif
/*
Any compilers not detected by the above logic are now explicitly zeroed out.
*/
#ifndef SLANG_VC
#define SLANG_VC 0
#endif
#ifndef SLANG_CLANG
#define SLANG_CLANG 0
#endif
#ifndef SLANG_SNC
#define SLANG_SNC 0
#endif
#ifndef SLANG_GHS
#define SLANG_GHS 0
#endif
#ifndef SLANG_GCC
#define SLANG_GCC 0
#endif
#endif /* SLANG_COMPILER */
/*
The following section attempts to detect the target platform being compiled for.
If an application defines `SLANG_PLATFORM` before including this header,
they take responsibility for setting any compiler-dependent macros
used later in the file.
Most applications should not need to touch this section.
*/
#ifndef SLANG_PLATFORM
#define SLANG_PLATFORM
/**
Operating system defines, see http://sourceforge.net/p/predef/wiki/OperatingSystems/
*/
#if defined(WINAPI_FAMILY) && WINAPI_FAMILY == WINAPI_PARTITION_APP
#define SLANG_WINRT 1 /* Windows Runtime, either on Windows RT or Windows 8 */
#elif defined(XBOXONE)
#define SLANG_XBOXONE 1
#elif defined(_WIN64) /* note: XBOXONE implies _WIN64 */
#define SLANG_WIN64 1
#elif defined(_M_PPC)
#define SLANG_X360 1
#elif defined(_WIN32) /* note: _M_PPC implies _WIN32 */
#define SLANG_WIN32 1
#elif defined(__ANDROID__)
#define SLANG_ANDROID 1
#elif defined(__linux__) || defined(__CYGWIN__) /* note: __ANDROID__ implies __linux__ */
#define SLANG_LINUX 1
#elif defined(__APPLE__) && !defined(SLANG_LLVM)
#include "TargetConditionals.h"
#if TARGET_OS_MAC
#define SLANG_OSX 1
#else
#define SLANG_IOS 1
#endif
#elif defined(__APPLE__)
// On `slang-llvm` we can't include "TargetConditionals.h" in general, so for now assume it's
// OSX.
#define SLANG_OSX 1
#elif defined(__CELLOS_LV2__)
#define SLANG_PS3 1
#elif defined(__ORBIS__)
#define SLANG_PS4 1
#elif defined(__SNC__) && defined(__arm__)
#define SLANG_PSP2 1
#elif defined(__ghs__)
#define SLANG_WIIU 1
#else
#error "unknown target platform"
#endif
/*
Any platforms not detected by the above logic are now explicitly zeroed out.
*/
#ifndef SLANG_WINRT
#define SLANG_WINRT 0
#endif
#ifndef SLANG_XBOXONE
#define SLANG_XBOXONE 0
#endif
#ifndef SLANG_WIN64
#define SLANG_WIN64 0
#endif
#ifndef SLANG_X360
#define SLANG_X360 0
#endif
#ifndef SLANG_WIN32
#define SLANG_WIN32 0
#endif
#ifndef SLANG_ANDROID
#define SLANG_ANDROID 0
#endif
#ifndef SLANG_LINUX
#define SLANG_LINUX 0
#endif
#ifndef SLANG_IOS
#define SLANG_IOS 0
#endif
#ifndef SLANG_OSX
#define SLANG_OSX 0
#endif
#ifndef SLANG_PS3
#define SLANG_PS3 0
#endif
#ifndef SLANG_PS4
#define SLANG_PS4 0
#endif
#ifndef SLANG_PSP2
#define SLANG_PSP2 0
#endif
#ifndef SLANG_WIIU
#define SLANG_WIIU 0
#endif
#endif /* SLANG_PLATFORM */
/* Shorthands for "families" of compilers/platforms */
#define SLANG_GCC_FAMILY (SLANG_CLANG || SLANG_SNC || SLANG_GHS || SLANG_GCC)
#define SLANG_WINDOWS_FAMILY (SLANG_WINRT || SLANG_WIN32 || SLANG_WIN64)
#define SLANG_MICROSOFT_FAMILY (SLANG_XBOXONE || SLANG_X360 || SLANG_WINDOWS_FAMILY)
#define SLANG_LINUX_FAMILY (SLANG_LINUX || SLANG_ANDROID)
#define SLANG_APPLE_FAMILY (SLANG_IOS || SLANG_OSX) /* equivalent to #if __APPLE__ */
#define SLANG_UNIX_FAMILY \
(SLANG_LINUX_FAMILY || SLANG_APPLE_FAMILY) /* shortcut for unix/posix platforms */
// GCC Specific
#if SLANG_GCC_FAMILY
#if INTPTR_MAX == INT64_MAX
#define SLANG_64BIT 1
#else
#define SLANG_64BIT 0
#endif
#define SLANG_BREAKPOINT(id) __builtin_trap()
// Use this macro instead of offsetof, because gcc produces a warning if offsetof is used on a
// non-POD type, even though it produces the correct result
#define SLANG_OFFSET_OF(T, ELEMENT) (size_t(&((T*)1)->ELEMENT) - 1)
#endif // SLANG_GCC_FAMILY
// Microsoft VC specific
#if SLANG_VC
#define SLANG_BREAKPOINT(id) __debugbreak();
#endif // SLANG_VC
// Default impls
#ifndef SLANG_OFFSET_OF
#define SLANG_OFFSET_OF(X, Y) offsetof(X, Y)
#endif
#ifndef SLANG_BREAKPOINT
// Make it crash with a write to 0!
#define SLANG_BREAKPOINT(id) (*((int*)0) = int(id));
#endif
// If slang.h has been included we don't need any of these definitions
#ifndef SLANG_H
/* Macro for declaring if a method is no throw. Should be set before the return parameter. */
#ifndef SLANG_NO_THROW
#if SLANG_WINDOWS_FAMILY && !defined(SLANG_DISABLE_EXCEPTIONS)
#define SLANG_NO_THROW __declspec(nothrow)
#endif
#endif
#ifndef SLANG_NO_THROW
#define SLANG_NO_THROW
#endif
/* The `SLANG_STDCALL` and `SLANG_MCALL` defines are used to set the calling
convention for interface methods.
*/
#ifndef SLANG_STDCALL
#if SLANG_MICROSOFT_FAMILY
#define SLANG_STDCALL __stdcall
#else
#define SLANG_STDCALL
#endif
#endif
#ifndef SLANG_MCALL
#define SLANG_MCALL SLANG_STDCALL
#endif
#ifndef SLANG_FORCE_INLINE
#define SLANG_FORCE_INLINE inline
#endif
// TODO(JS): Should these be in slang-cpp-types.h?
// They are more likely to clash with slang.h
struct SlangUUID
{
uint32_t data1;
uint16_t data2;
uint16_t data3;
uint8_t data4[8];
};
typedef int32_t SlangResult;
struct ISlangUnknown
{
virtual SLANG_NO_THROW SlangResult SLANG_MCALL
queryInterface(SlangUUID const& uuid, void** outObject) = 0;
virtual SLANG_NO_THROW uint32_t SLANG_MCALL addRef() = 0;
virtual SLANG_NO_THROW uint32_t SLANG_MCALL release() = 0;
};
#define SLANG_COM_INTERFACE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7) \
public: \
SLANG_FORCE_INLINE static const SlangUUID& getTypeGuid() \
{ \
static const SlangUUID guid = {a, b, c, d0, d1, d2, d3, d4, d5, d6, d7}; \
return guid; \
}
#endif // SLANG_H
// Includes
#include "slang-cpp-scalar-intrinsics.h"
#include "slang-cpp-types.h"
// TODO(JS): Hack! Output C++ code from slang can copy uninitialized variables.
#if defined(_MSC_VER)
#pragma warning(disable : 4700)
#endif
#ifndef SLANG_UNROLL
#define SLANG_UNROLL
#endif
#endif

File diff suppressed because it is too large


@@ -0,0 +1,712 @@
#ifndef SLANG_PRELUDE_CPP_TYPES_CORE_H
#define SLANG_PRELUDE_CPP_TYPES_CORE_H
#ifndef SLANG_PRELUDE_ASSERT
#ifdef SLANG_PRELUDE_ENABLE_ASSERT
#define SLANG_PRELUDE_ASSERT(VALUE) assert(VALUE)
#else
#define SLANG_PRELUDE_ASSERT(VALUE)
#endif
#endif
// Since we are using unsigned arithmetic, care is needed in this comparison.
// It is *assumed* that sizeInBytes >= elemSize, which means (sizeInBytes - elemSize) >= 0,
// so only a single test is needed.
// Asserts for bounds checking.
// It is assumed index/count are unsigned types.
#define SLANG_BOUND_ASSERT(index, count) SLANG_PRELUDE_ASSERT(index < count);
#define SLANG_BOUND_ASSERT_BYTE_ADDRESS(index, elemSize, sizeInBytes) \
SLANG_PRELUDE_ASSERT(index <= (sizeInBytes - elemSize) && (index & 3) == 0);
// Macros to zero index if an access is out of range
#define SLANG_BOUND_ZERO_INDEX(index, count) index = (index < count) ? index : 0;
#define SLANG_BOUND_ZERO_INDEX_BYTE_ADDRESS(index, elemSize, sizeInBytes) \
index = (index <= (sizeInBytes - elemSize)) ? index : 0;
// The 'FIX' macro define how the index is fixed. The default is to do nothing. If
// SLANG_ENABLE_BOUND_ZERO_INDEX the fix macro will zero the index, if out of range
#ifdef SLANG_ENABLE_BOUND_ZERO_INDEX
#define SLANG_BOUND_FIX(index, count) SLANG_BOUND_ZERO_INDEX(index, count)
#define SLANG_BOUND_FIX_BYTE_ADDRESS(index, elemSize, sizeInBytes) \
SLANG_BOUND_ZERO_INDEX_BYTE_ADDRESS(index, elemSize, sizeInBytes)
#define SLANG_BOUND_FIX_FIXED_ARRAY(index, count) SLANG_BOUND_ZERO_INDEX(index, count)
#else
#define SLANG_BOUND_FIX(index, count)
#define SLANG_BOUND_FIX_BYTE_ADDRESS(index, elemSize, sizeInBytes)
#define SLANG_BOUND_FIX_FIXED_ARRAY(index, count)
#endif
#ifndef SLANG_BOUND_CHECK
#define SLANG_BOUND_CHECK(index, count) \
SLANG_BOUND_ASSERT(index, count) SLANG_BOUND_FIX(index, count)
#endif
#ifndef SLANG_BOUND_CHECK_BYTE_ADDRESS
#define SLANG_BOUND_CHECK_BYTE_ADDRESS(index, elemSize, sizeInBytes) \
SLANG_BOUND_ASSERT_BYTE_ADDRESS(index, elemSize, sizeInBytes) \
SLANG_BOUND_FIX_BYTE_ADDRESS(index, elemSize, sizeInBytes)
#endif
#ifndef SLANG_BOUND_CHECK_FIXED_ARRAY
#define SLANG_BOUND_CHECK_FIXED_ARRAY(index, count) \
SLANG_BOUND_ASSERT(index, count) SLANG_BOUND_FIX_FIXED_ARRAY(index, count)
#endif
struct TypeInfo
{
size_t typeSize;
};
template<typename T, size_t SIZE>
struct FixedArray
{
const T& operator[](size_t index) const
{
SLANG_BOUND_CHECK_FIXED_ARRAY(index, SIZE);
return m_data[index];
}
T& operator[](size_t index)
{
SLANG_BOUND_CHECK_FIXED_ARRAY(index, SIZE);
return m_data[index];
}
T m_data[SIZE];
};
// An array that has no specified size becomes an 'Array'. This stores the size so it can
// potentially do bounds checking.
template<typename T>
struct Array
{
const T& operator[](size_t index) const
{
SLANG_BOUND_CHECK(index, count);
return data[index];
}
T& operator[](size_t index)
{
SLANG_BOUND_CHECK(index, count);
return data[index];
}
T* data;
size_t count;
};
/* Constant buffers become a pointer to the contained type, so ConstantBuffer<T> becomes T* in C++
* code.
*/
template<typename T, int COUNT>
struct Vector;
template<typename T>
struct Vector<T, 1>
{
T x;
const T& operator[](size_t /*index*/) const { return x; }
T& operator[](size_t /*index*/) { return x; }
operator T() const { return x; }
Vector() = default;
Vector(T scalar) { x = scalar; }
template<typename U>
Vector(Vector<U, 1> other)
{
x = (T)other.x;
}
template<typename U, int otherSize>
Vector(Vector<U, otherSize> other)
{
int minSize = 1;
if (otherSize < minSize)
minSize = otherSize;
for (int i = 0; i < minSize; i++)
(*this)[i] = (T)other[i];
}
};
template<typename T>
struct Vector<T, 2>
{
T x, y;
const T& operator[](size_t index) const { return index == 0 ? x : y; }
T& operator[](size_t index) { return index == 0 ? x : y; }
Vector() = default;
Vector(T scalar) { x = y = scalar; }
Vector(T _x, T _y)
{
x = _x;
y = _y;
}
template<typename U>
Vector(Vector<U, 2> other)
{
x = (T)other.x;
y = (T)other.y;
}
template<typename U, int otherSize>
Vector(Vector<U, otherSize> other)
{
int minSize = 2;
if (otherSize < minSize)
minSize = otherSize;
for (int i = 0; i < minSize; i++)
(*this)[i] = (T)other[i];
}
};
template<typename T>
struct Vector<T, 3>
{
T x, y, z;
const T& operator[](size_t index) const { return *((T*)(this) + index); }
T& operator[](size_t index) { return *((T*)(this) + index); }
Vector() = default;
Vector(T scalar) { x = y = z = scalar; }
Vector(T _x, T _y, T _z)
{
x = _x;
y = _y;
z = _z;
}
template<typename U>
Vector(Vector<U, 3> other)
{
x = (T)other.x;
y = (T)other.y;
z = (T)other.z;
}
template<typename U, int otherSize>
Vector(Vector<U, otherSize> other)
{
int minSize = 3;
if (otherSize < minSize)
minSize = otherSize;
for (int i = 0; i < minSize; i++)
(*this)[i] = (T)other[i];
}
};
template<typename T>
struct Vector<T, 4>
{
T x, y, z, w;
const T& operator[](size_t index) const { return *((T*)(this) + index); }
T& operator[](size_t index) { return *((T*)(this) + index); }
Vector() = default;
Vector(T scalar) { x = y = z = w = scalar; }
Vector(T _x, T _y, T _z, T _w)
{
x = _x;
y = _y;
z = _z;
w = _w;
}
template<typename U, int otherSize>
Vector(Vector<U, otherSize> other)
{
int minSize = 4;
if (otherSize < minSize)
minSize = otherSize;
for (int i = 0; i < minSize; i++)
(*this)[i] = (T)other[i];
}
};
template<typename T, int N>
SLANG_FORCE_INLINE Vector<T, N> _slang_select(
Vector<bool, N> condition,
Vector<T, N> v0,
Vector<T, N> v1)
{
Vector<T, N> result;
for (int i = 0; i < N; i++)
{
result[i] = condition[i] ? v0[i] : v1[i];
}
return result;
}
template<typename T>
SLANG_FORCE_INLINE T _slang_select(bool condition, T v0, T v1)
{
return condition ? v0 : v1;
}
template<typename T, int N>
SLANG_FORCE_INLINE T _slang_vector_get_element(Vector<T, N> x, int index)
{
return x[index];
}
template<typename T, int N>
SLANG_FORCE_INLINE const T* _slang_vector_get_element_ptr(const Vector<T, N>* x, int index)
{
return &((*const_cast<Vector<T, N>*>(x))[index]);
}
template<typename T, int N>
SLANG_FORCE_INLINE T* _slang_vector_get_element_ptr(Vector<T, N>* x, int index)
{
return &((*x)[index]);
}
template<typename T, int n, typename OtherT, int m>
SLANG_FORCE_INLINE Vector<T, n> _slang_vector_reshape(const Vector<OtherT, m> other)
{
Vector<T, n> result;
for (int i = 0; i < n; i++)
{
OtherT otherElement = OtherT(0);
if (i < m)
otherElement = _slang_vector_get_element(other, i);
*_slang_vector_get_element_ptr(&result, i) = (T)otherElement;
}
return result;
}
typedef uint32_t uint;
#define SLANG_VECTOR_BINARY_OP(T, op) \
template<int n> \
SLANG_FORCE_INLINE Vector<T, n> operator op( \
const Vector<T, n>& thisVal, \
const Vector<T, n>& other) \
{ \
Vector<T, n> result; \
for (int i = 0; i < n; i++) \
result[i] = thisVal[i] op other[i]; \
return result; \
}
#define SLANG_VECTOR_BINARY_COMPARE_OP(T, op) \
template<int n> \
SLANG_FORCE_INLINE Vector<bool, n> operator op( \
const Vector<T, n>& thisVal, \
const Vector<T, n>& other) \
{ \
Vector<bool, n> result; \
for (int i = 0; i < n; i++) \
result[i] = thisVal[i] op other[i]; \
return result; \
}
#define SLANG_VECTOR_UNARY_OP(T, op) \
template<int n> \
SLANG_FORCE_INLINE Vector<T, n> operator op(const Vector<T, n>& thisVal) \
{ \
Vector<T, n> result; \
for (int i = 0; i < n; i++) \
result[i] = op thisVal[i]; \
return result; \
}
#define SLANG_INT_VECTOR_OPS(T) \
SLANG_VECTOR_BINARY_OP(T, +) \
SLANG_VECTOR_BINARY_OP(T, -) \
SLANG_VECTOR_BINARY_OP(T, *) \
SLANG_VECTOR_BINARY_OP(T, /) \
SLANG_VECTOR_BINARY_OP(T, &) \
SLANG_VECTOR_BINARY_OP(T, |) \
SLANG_VECTOR_BINARY_OP(T, &&) \
SLANG_VECTOR_BINARY_OP(T, ||) \
SLANG_VECTOR_BINARY_OP(T, ^) \
SLANG_VECTOR_BINARY_OP(T, %) \
SLANG_VECTOR_BINARY_OP(T, >>) \
SLANG_VECTOR_BINARY_OP(T, <<) \
SLANG_VECTOR_BINARY_COMPARE_OP(T, >) \
SLANG_VECTOR_BINARY_COMPARE_OP(T, <) \
SLANG_VECTOR_BINARY_COMPARE_OP(T, >=) \
SLANG_VECTOR_BINARY_COMPARE_OP(T, <=) \
SLANG_VECTOR_BINARY_COMPARE_OP(T, ==) \
SLANG_VECTOR_BINARY_COMPARE_OP(T, !=) \
SLANG_VECTOR_UNARY_OP(T, !) \
SLANG_VECTOR_UNARY_OP(T, ~)
#define SLANG_FLOAT_VECTOR_OPS(T) \
SLANG_VECTOR_BINARY_OP(T, +) \
SLANG_VECTOR_BINARY_OP(T, -) \
SLANG_VECTOR_BINARY_OP(T, *) \
SLANG_VECTOR_BINARY_OP(T, /) \
SLANG_VECTOR_UNARY_OP(T, -) \
SLANG_VECTOR_BINARY_COMPARE_OP(T, >) \
SLANG_VECTOR_BINARY_COMPARE_OP(T, <) \
SLANG_VECTOR_BINARY_COMPARE_OP(T, >=) \
SLANG_VECTOR_BINARY_COMPARE_OP(T, <=) \
SLANG_VECTOR_BINARY_COMPARE_OP(T, ==) \
SLANG_VECTOR_BINARY_COMPARE_OP(T, !=)
SLANG_INT_VECTOR_OPS(bool)
SLANG_INT_VECTOR_OPS(int)
SLANG_INT_VECTOR_OPS(int8_t)
SLANG_INT_VECTOR_OPS(int16_t)
SLANG_INT_VECTOR_OPS(int64_t)
SLANG_INT_VECTOR_OPS(uint)
SLANG_INT_VECTOR_OPS(uint8_t)
SLANG_INT_VECTOR_OPS(uint16_t)
SLANG_INT_VECTOR_OPS(uint64_t)
#if SLANG_INTPTR_TYPE_IS_DISTINCT
SLANG_INT_VECTOR_OPS(intptr_t)
SLANG_INT_VECTOR_OPS(uintptr_t)
#endif
SLANG_FLOAT_VECTOR_OPS(float)
SLANG_FLOAT_VECTOR_OPS(double)
#define SLANG_VECTOR_INT_NEG_OP(T) \
template<int N> \
Vector<T, N> operator-(const Vector<T, N>& thisVal) \
{ \
Vector<T, N> result; \
for (int i = 0; i < N; i++) \
result[i] = 0 - thisVal[i]; \
return result; \
}
SLANG_VECTOR_INT_NEG_OP(int)
SLANG_VECTOR_INT_NEG_OP(int8_t)
SLANG_VECTOR_INT_NEG_OP(int16_t)
SLANG_VECTOR_INT_NEG_OP(int64_t)
SLANG_VECTOR_INT_NEG_OP(uint)
SLANG_VECTOR_INT_NEG_OP(uint8_t)
SLANG_VECTOR_INT_NEG_OP(uint16_t)
SLANG_VECTOR_INT_NEG_OP(uint64_t)
#if SLANG_INTPTR_TYPE_IS_DISTINCT
SLANG_VECTOR_INT_NEG_OP(intptr_t)
SLANG_VECTOR_INT_NEG_OP(uintptr_t)
#endif
#define SLANG_FLOAT_VECTOR_MOD(T) \
template<int N> \
Vector<T, N> operator%(const Vector<T, N>& left, const Vector<T, N>& right) \
{ \
Vector<T, N> result; \
for (int i = 0; i < N; i++) \
result[i] = _slang_fmod(left[i], right[i]); \
return result; \
}
SLANG_FLOAT_VECTOR_MOD(float)
SLANG_FLOAT_VECTOR_MOD(double)
#undef SLANG_FLOAT_VECTOR_MOD
#undef SLANG_VECTOR_BINARY_OP
#undef SLANG_VECTOR_UNARY_OP
#undef SLANG_INT_VECTOR_OPS
#undef SLANG_FLOAT_VECTOR_OPS
#undef SLANG_VECTOR_INT_NEG_OP
#undef SLANG_FLOAT_VECTOR_MOD
template<typename T, int ROWS, int COLS>
struct Matrix
{
Vector<T, COLS> rows[ROWS];
const Vector<T, COLS>& operator[](size_t index) const { return rows[index]; }
Vector<T, COLS>& operator[](size_t index) { return rows[index]; }
Matrix() = default;
Matrix(T scalar)
{
for (int i = 0; i < ROWS; i++)
rows[i] = Vector<T, COLS>(scalar);
}
Matrix(const Vector<T, COLS>& row0) { rows[0] = row0; }
Matrix(const Vector<T, COLS>& row0, const Vector<T, COLS>& row1)
{
rows[0] = row0;
rows[1] = row1;
}
Matrix(const Vector<T, COLS>& row0, const Vector<T, COLS>& row1, const Vector<T, COLS>& row2)
{
rows[0] = row0;
rows[1] = row1;
rows[2] = row2;
}
Matrix(
const Vector<T, COLS>& row0,
const Vector<T, COLS>& row1,
const Vector<T, COLS>& row2,
const Vector<T, COLS>& row3)
{
rows[0] = row0;
rows[1] = row1;
rows[2] = row2;
rows[3] = row3;
}
template<typename U, int otherRow, int otherCol>
Matrix(const Matrix<U, otherRow, otherCol>& other)
{
int minRow = ROWS;
int minCol = COLS;
if (minRow > otherRow)
minRow = otherRow;
if (minCol > otherCol)
minCol = otherCol;
for (int i = 0; i < minRow; i++)
for (int j = 0; j < minCol; j++)
rows[i][j] = (T)other.rows[i][j];
}
Matrix(T v0, T v1, T v2, T v3)
{
rows[0][0] = v0;
rows[0][1] = v1;
rows[1][0] = v2;
rows[1][1] = v3;
}
Matrix(T v0, T v1, T v2, T v3, T v4, T v5)
{
if (COLS == 3)
{
rows[0][0] = v0;
rows[0][1] = v1;
rows[0][2] = v2;
rows[1][0] = v3;
rows[1][1] = v4;
rows[1][2] = v5;
}
else
{
rows[0][0] = v0;
rows[0][1] = v1;
rows[1][0] = v2;
rows[1][1] = v3;
rows[2][0] = v4;
rows[2][1] = v5;
}
}
Matrix(T v0, T v1, T v2, T v3, T v4, T v5, T v6, T v7)
{
if (COLS == 4)
{
rows[0][0] = v0;
rows[0][1] = v1;
rows[0][2] = v2;
rows[0][3] = v3;
rows[1][0] = v4;
rows[1][1] = v5;
rows[1][2] = v6;
rows[1][3] = v7;
}
else
{
rows[0][0] = v0;
rows[0][1] = v1;
rows[1][0] = v2;
rows[1][1] = v3;
rows[2][0] = v4;
rows[2][1] = v5;
rows[3][0] = v6;
rows[3][1] = v7;
}
}
Matrix(T v0, T v1, T v2, T v3, T v4, T v5, T v6, T v7, T v8)
{
rows[0][0] = v0;
rows[0][1] = v1;
rows[0][2] = v2;
rows[1][0] = v3;
rows[1][1] = v4;
rows[1][2] = v5;
rows[2][0] = v6;
rows[2][1] = v7;
rows[2][2] = v8;
}
Matrix(T v0, T v1, T v2, T v3, T v4, T v5, T v6, T v7, T v8, T v9, T v10, T v11)
{
if (COLS == 4)
{
rows[0][0] = v0;
rows[0][1] = v1;
rows[0][2] = v2;
rows[0][3] = v3;
rows[1][0] = v4;
rows[1][1] = v5;
rows[1][2] = v6;
rows[1][3] = v7;
rows[2][0] = v8;
rows[2][1] = v9;
rows[2][2] = v10;
rows[2][3] = v11;
}
else
{
rows[0][0] = v0;
rows[0][1] = v1;
rows[0][2] = v2;
rows[1][0] = v3;
rows[1][1] = v4;
rows[1][2] = v5;
rows[2][0] = v6;
rows[2][1] = v7;
rows[2][2] = v8;
rows[3][0] = v9;
rows[3][1] = v10;
rows[3][2] = v11;
}
}
Matrix(
T v0,
T v1,
T v2,
T v3,
T v4,
T v5,
T v6,
T v7,
T v8,
T v9,
T v10,
T v11,
T v12,
T v13,
T v14,
T v15)
{
rows[0][0] = v0;
rows[0][1] = v1;
rows[0][2] = v2;
rows[0][3] = v3;
rows[1][0] = v4;
rows[1][1] = v5;
rows[1][2] = v6;
rows[1][3] = v7;
rows[2][0] = v8;
rows[2][1] = v9;
rows[2][2] = v10;
rows[2][3] = v11;
rows[3][0] = v12;
rows[3][1] = v13;
rows[3][2] = v14;
rows[3][3] = v15;
}
};
#define SLANG_MATRIX_BINARY_OP(T, op) \
template<int R, int C> \
Matrix<T, R, C> operator op(const Matrix<T, R, C>& thisVal, const Matrix<T, R, C>& other) \
{ \
Matrix<T, R, C> result; \
for (int i = 0; i < R; i++) \
for (int j = 0; j < C; j++) \
result.rows[i][j] = thisVal.rows[i][j] op other.rows[i][j]; \
return result; \
}
#define SLANG_MATRIX_BINARY_COMPARE_OP(T, op) \
template<int R, int C> \
Matrix<bool, R, C> operator op(const Matrix<T, R, C>& thisVal, const Matrix<T, R, C>& other) \
{ \
Matrix<bool, R, C> result; \
for (int i = 0; i < R; i++) \
for (int j = 0; j < C; j++) \
result.rows[i][j] = thisVal.rows[i][j] op other.rows[i][j]; \
return result; \
}
#define SLANG_MATRIX_UNARY_OP(T, op) \
template<int R, int C> \
Matrix<T, R, C> operator op(const Matrix<T, R, C>& thisVal) \
{ \
Matrix<T, R, C> result; \
for (int i = 0; i < R; i++) \
for (int j = 0; j < C; j++) \
result.rows[i][j] = op thisVal.rows[i][j]; \
return result; \
}
#define SLANG_INT_MATRIX_OPS(T) \
SLANG_MATRIX_BINARY_OP(T, +) \
SLANG_MATRIX_BINARY_OP(T, -) \
SLANG_MATRIX_BINARY_OP(T, *) \
SLANG_MATRIX_BINARY_OP(T, /) \
SLANG_MATRIX_BINARY_OP(T, &) \
SLANG_MATRIX_BINARY_OP(T, |) \
SLANG_MATRIX_BINARY_OP(T, &&) \
SLANG_MATRIX_BINARY_OP(T, ||) \
SLANG_MATRIX_BINARY_OP(T, ^) \
SLANG_MATRIX_BINARY_OP(T, %) \
SLANG_MATRIX_BINARY_COMPARE_OP(T, >) \
SLANG_MATRIX_BINARY_COMPARE_OP(T, <) \
SLANG_MATRIX_BINARY_COMPARE_OP(T, >=) \
SLANG_MATRIX_BINARY_COMPARE_OP(T, <=) \
SLANG_MATRIX_BINARY_COMPARE_OP(T, ==) \
SLANG_MATRIX_BINARY_COMPARE_OP(T, !=) \
SLANG_MATRIX_UNARY_OP(T, !) \
SLANG_MATRIX_UNARY_OP(T, ~)
#define SLANG_FLOAT_MATRIX_OPS(T) \
SLANG_MATRIX_BINARY_OP(T, +) \
SLANG_MATRIX_BINARY_OP(T, -) \
SLANG_MATRIX_BINARY_OP(T, *) \
SLANG_MATRIX_BINARY_OP(T, /) \
SLANG_MATRIX_UNARY_OP(T, -) \
SLANG_MATRIX_BINARY_COMPARE_OP(T, >) \
SLANG_MATRIX_BINARY_COMPARE_OP(T, <) \
SLANG_MATRIX_BINARY_COMPARE_OP(T, >=) \
SLANG_MATRIX_BINARY_COMPARE_OP(T, <=) \
SLANG_MATRIX_BINARY_COMPARE_OP(T, ==) \
SLANG_MATRIX_BINARY_COMPARE_OP(T, !=)
SLANG_INT_MATRIX_OPS(int)
SLANG_INT_MATRIX_OPS(int8_t)
SLANG_INT_MATRIX_OPS(int16_t)
SLANG_INT_MATRIX_OPS(int64_t)
SLANG_INT_MATRIX_OPS(uint)
SLANG_INT_MATRIX_OPS(uint8_t)
SLANG_INT_MATRIX_OPS(uint16_t)
SLANG_INT_MATRIX_OPS(uint64_t)
#if SLANG_INTPTR_TYPE_IS_DISTINCT
SLANG_INT_MATRIX_OPS(intptr_t)
SLANG_INT_MATRIX_OPS(uintptr_t)
#endif
SLANG_FLOAT_MATRIX_OPS(float)
SLANG_FLOAT_MATRIX_OPS(double)
#define SLANG_MATRIX_INT_NEG_OP(T) \
template<int R, int C> \
SLANG_FORCE_INLINE Matrix<T, R, C> operator-(Matrix<T, R, C> thisVal) \
{ \
Matrix<T, R, C> result; \
for (int i = 0; i < R; i++) \
for (int j = 0; j < C; j++) \
result.rows[i][j] = 0 - thisVal.rows[i][j]; \
return result; \
}
SLANG_MATRIX_INT_NEG_OP(int)
SLANG_MATRIX_INT_NEG_OP(int8_t)
SLANG_MATRIX_INT_NEG_OP(int16_t)
SLANG_MATRIX_INT_NEG_OP(int64_t)
SLANG_MATRIX_INT_NEG_OP(uint)
SLANG_MATRIX_INT_NEG_OP(uint8_t)
SLANG_MATRIX_INT_NEG_OP(uint16_t)
SLANG_MATRIX_INT_NEG_OP(uint64_t)
#if SLANG_INTPTR_TYPE_IS_DISTINCT
SLANG_MATRIX_INT_NEG_OP(intptr_t)
SLANG_MATRIX_INT_NEG_OP(uintptr_t)
#endif
#define SLANG_FLOAT_MATRIX_MOD(T) \
template<int R, int C> \
SLANG_FORCE_INLINE Matrix<T, R, C> operator%(Matrix<T, R, C> left, Matrix<T, R, C> right) \
{ \
Matrix<T, R, C> result; \
for (int i = 0; i < R; i++) \
for (int j = 0; j < C; j++) \
result.rows[i][j] = _slang_fmod(left.rows[i][j], right.rows[i][j]); \
return result; \
}
SLANG_FLOAT_MATRIX_MOD(float)
SLANG_FLOAT_MATRIX_MOD(double)
#undef SLANG_FLOAT_MATRIX_MOD
#undef SLANG_MATRIX_BINARY_OP
#undef SLANG_MATRIX_UNARY_OP
#undef SLANG_INT_MATRIX_OPS
#undef SLANG_FLOAT_MATRIX_OPS
#undef SLANG_MATRIX_INT_NEG_OP
#undef SLANG_FLOAT_MATRIX_MOD
template<typename TResult, typename TInput>
TResult slang_bit_cast(TInput val)
{
return *(TResult*)(&val);
}
#endif

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -0,0 +1,8 @@
#ifdef SLANG_HLSL_ENABLE_NVAPI
#include "nvHLSLExtns.h"
#endif
#ifndef __DXC_VERSION_MAJOR
// warning X3557: loop doesn't seem to do anything, forcing loop to unroll
#pragma warning(disable : 3557)
#endif


@@ -0,0 +1,50 @@
// slang-image-format-defs.h
#ifndef SLANG_FORMAT
#error Must define SLANG_FORMAT macro before including image-format-defs.h
#endif
SLANG_FORMAT(unknown, (NONE, 0, 0))
SLANG_FORMAT(rgba32f, (FLOAT32, 4, sizeof(float) * 4))
SLANG_FORMAT(rgba16f, (FLOAT16, 4, sizeof(uint16_t) * 4))
SLANG_FORMAT(rg32f, (FLOAT32, 2, sizeof(float) * 2))
SLANG_FORMAT(rg16f, (FLOAT16, 2, sizeof(uint16_t) * 2))
SLANG_FORMAT(r11f_g11f_b10f, (NONE, 3, sizeof(uint32_t)))
SLANG_FORMAT(r32f, (FLOAT32, 1, sizeof(float)))
SLANG_FORMAT(r16f, (FLOAT16, 1, sizeof(uint16_t)))
SLANG_FORMAT(rgba16, (UINT16, 4, sizeof(uint16_t) * 4))
SLANG_FORMAT(rgb10_a2, (NONE, 4, sizeof(uint32_t)))
SLANG_FORMAT(rgba8, (UINT8, 4, sizeof(uint32_t)))
SLANG_FORMAT(rg16, (UINT16, 2, sizeof(uint16_t) * 2))
SLANG_FORMAT(rg8, (UINT8, 2, sizeof(char) * 2))
SLANG_FORMAT(r16, (UINT16, 1, sizeof(uint16_t)))
SLANG_FORMAT(r8, (UINT8, 1, sizeof(uint8_t)))
SLANG_FORMAT(rgba16_snorm, (UINT16, 4, sizeof(uint16_t) * 4))
SLANG_FORMAT(rgba8_snorm, (UINT8, 4, sizeof(uint8_t) * 4))
SLANG_FORMAT(rg16_snorm, (UINT16, 2, sizeof(uint16_t) * 2))
SLANG_FORMAT(rg8_snorm, (UINT8, 2, sizeof(uint8_t) * 2))
SLANG_FORMAT(r16_snorm, (UINT16, 1, sizeof(uint16_t)))
SLANG_FORMAT(r8_snorm, (UINT8, 1, sizeof(uint8_t)))
SLANG_FORMAT(rgba32i, (INT32, 4, sizeof(int32_t) * 4))
SLANG_FORMAT(rgba16i, (INT16, 4, sizeof(int16_t) * 4))
SLANG_FORMAT(rgba8i, (INT8, 4, sizeof(int8_t) * 4))
SLANG_FORMAT(rg32i, (INT32, 2, sizeof(int32_t) * 2))
SLANG_FORMAT(rg16i, (INT16, 2, sizeof(int16_t) * 2))
SLANG_FORMAT(rg8i, (INT8, 2, sizeof(int8_t) * 2))
SLANG_FORMAT(r32i, (INT32, 1, sizeof(int32_t)))
SLANG_FORMAT(r16i, (INT16, 1, sizeof(int16_t)))
SLANG_FORMAT(r8i, (INT8, 1, sizeof(int8_t)))
SLANG_FORMAT(rgba32ui, (UINT32, 4, sizeof(uint32_t) * 4))
SLANG_FORMAT(rgba16ui, (UINT16, 4, sizeof(uint16_t) * 4))
SLANG_FORMAT(rgb10_a2ui, (NONE, 4, sizeof(uint32_t)))
SLANG_FORMAT(rgba8ui, (UINT8, 4, sizeof(uint8_t) * 4))
SLANG_FORMAT(rg32ui, (UINT32, 2, sizeof(uint32_t) * 2))
SLANG_FORMAT(rg16ui, (UINT16, 2, sizeof(uint16_t) * 2))
SLANG_FORMAT(rg8ui, (UINT8, 2, sizeof(uint8_t) * 2))
SLANG_FORMAT(r32ui, (UINT32, 1, sizeof(uint32_t)))
SLANG_FORMAT(r16ui, (UINT16, 1, sizeof(uint16_t)))
SLANG_FORMAT(r8ui, (UINT8, 1, sizeof(uint8_t)))
SLANG_FORMAT(r64ui, (UINT64, 1, sizeof(uint64_t)))
SLANG_FORMAT(r64i, (INT64, 1, sizeof(int64_t)))
SLANG_FORMAT(bgra8, (UINT8, 4, sizeof(uint32_t)))
#undef SLANG_FORMAT
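The header above is an X-macro table: it deliberately refuses to compile unless the consumer defines `SLANG_FORMAT(name, info)` first, then includes it to stamp out enums, lookup tables, or switch cases. A minimal self-contained sketch of the pattern — using a hypothetical two-entry copy of the table rather than the real header:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical miniature copy of the table, inlined so the sketch is
// self-contained; real code would instead do:
//   #define SLANG_FORMAT(name, info) ...
//   #include "slang-image-format-defs.h"
#define MINI_FORMAT_TABLE(F)                      \
    F(rgba32f, (FLOAT32, 4, sizeof(float) * 4))   \
    F(r8ui,    (UINT8,   1, sizeof(uint8_t)))

// First expansion: an enum of format names.
#define DECLARE_ENUM(name, info) Format_##name,
enum ImageFormat { MINI_FORMAT_TABLE(DECLARE_ENUM) Format_Count };
#undef DECLARE_ENUM

// Second expansion: a byte-size lookup. The info tuple is
// (baseType, channelCount, byteSize); UNPACK3 peels off the
// parentheses and THIRD selects the byte size.
#define THIRD(a, b, c) (c)
#define UNPACK3(tuple) THIRD tuple
#define SIZE_ENTRY(name, info) UNPACK3(info),
static const std::size_t kFormatSize[] = { MINI_FORMAT_TABLE(SIZE_ENTRY) };
#undef SIZE_ENTRY
```

Each new consumer of the table (name strings, channel counts, etc.) is just another short macro plus one expansion, and adding a format to the table updates every consumer at once.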

View File

@@ -0,0 +1,404 @@
#ifndef SLANG_LLVM_H
#define SLANG_LLVM_H
// TODO(JS):
// Disable exception declspecs, as not supported on LLVM without some extra options.
// We could enable with `-fms-extensions`
#define SLANG_DISABLE_EXCEPTIONS 1
#ifndef SLANG_PRELUDE_ASSERT
#ifdef SLANG_PRELUDE_ENABLE_ASSERT
extern "C" void assertFailure(const char* msg);
#define SLANG_PRELUDE_EXPECT(VALUE, MSG) \
if (VALUE) \
{ \
} \
else \
assertFailure("assertion failed: '" MSG "'")
#define SLANG_PRELUDE_ASSERT(VALUE) SLANG_PRELUDE_EXPECT(VALUE, #VALUE)
#else // SLANG_PRELUDE_ENABLE_ASSERT
#define SLANG_PRELUDE_EXPECT(VALUE, MSG)
#define SLANG_PRELUDE_ASSERT(x)
#endif // SLANG_PRELUDE_ENABLE_ASSERT
#endif
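The `if (VALUE) { } else assertFailure(...)` shape of `SLANG_PRELUDE_EXPECT` looks odd, but it is the standard trick for making a statement-like macro safe next to a user-written `else`. A small self-contained sketch (the `NAIVE_EXPECT`/`SAFE_EXPECT` macros here are hypothetical, not part of the prelude) showing what the naive form gets wrong:

```cpp
#include <cassert>

static int failures = 0;
static void fail() { ++failures; }

// Naive form: expands to a bare `if`, so a trailing `else` in user
// code binds to the macro's hidden `if`, not the user's.
#define NAIVE_EXPECT(VALUE) if (!(VALUE)) fail()

// Safe form, mirroring SLANG_PRELUDE_EXPECT: the `if { } else ...`
// shape leaves no open `if` for a user `else` to attach to.
#define SAFE_EXPECT(VALUE) \
    if (VALUE)             \
    {                      \
    }                      \
    else                   \
        fail()

int naiveBranch(bool cond)
{
    // Intended: "if cond, run the check; otherwise return 1".
    // The user's `else` actually binds to the macro's inner `if`,
    // inverting the control flow.
    if (cond)
        NAIVE_EXPECT(true);
    else
        return 1;
    return 0;
}

int safeBranch(bool cond)
{
    // The macro's inner `if` already owns an `else`, so the user's
    // `else` binds to the outer `if`, as intended.
    if (cond)
        SAFE_EXPECT(true);
    else
        return 1;
    return 0;
}
```

With the naive macro, `naiveBranch(true)` returns 1 and `naiveBranch(false)` returns 0 — the opposite of the intent — while `safeBranch` behaves correctly.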
/*
Taken from stddef.h
*/
typedef __PTRDIFF_TYPE__ ptrdiff_t;
typedef __SIZE_TYPE__ size_t;
typedef __SIZE_TYPE__ rsize_t;
// typedef __WCHAR_TYPE__ wchar_t;
#if defined(__need_NULL)
#undef NULL
#ifdef __cplusplus
#if !defined(__MINGW32__) && !defined(_MSC_VER)
#define NULL __null
#else
#define NULL 0
#endif
#else
#define NULL ((void*)0)
#endif
#ifdef __cplusplus
#if defined(_MSC_EXTENSIONS) && defined(_NATIVE_NULLPTR_SUPPORTED)
namespace std
{
typedef decltype(nullptr) nullptr_t;
}
using ::std::nullptr_t;
#endif
#endif
#undef __need_NULL
#endif /* defined(__need_NULL) */
/*
The following are taken verbatim from stdint.h from Clang in LLVM. Only 8/16/32/64 types are needed.
*/
// LLVM/Clang types such that we can use LLVM/Clang without headers for C++ output from Slang
#ifdef __INT64_TYPE__
#ifndef __int8_t_defined /* glibc sys/types.h also defines int64_t*/
typedef __INT64_TYPE__ int64_t;
#endif /* __int8_t_defined */
typedef __UINT64_TYPE__ uint64_t;
#define __int_least64_t int64_t
#define __uint_least64_t uint64_t
#endif /* __INT64_TYPE__ */
#ifdef __int_least64_t
typedef __int_least64_t int_least64_t;
typedef __uint_least64_t uint_least64_t;
typedef __int_least64_t int_fast64_t;
typedef __uint_least64_t uint_fast64_t;
#endif /* __int_least64_t */
#ifdef __INT32_TYPE__
#ifndef __int8_t_defined /* glibc sys/types.h also defines int32_t*/
typedef __INT32_TYPE__ int32_t;
#endif /* __int8_t_defined */
#ifndef __uint32_t_defined /* more glibc compatibility */
#define __uint32_t_defined
typedef __UINT32_TYPE__ uint32_t;
#endif /* __uint32_t_defined */
#define __int_least32_t int32_t
#define __uint_least32_t uint32_t
#endif /* __INT32_TYPE__ */
#ifdef __int_least32_t
typedef __int_least32_t int_least32_t;
typedef __uint_least32_t uint_least32_t;
typedef __int_least32_t int_fast32_t;
typedef __uint_least32_t uint_fast32_t;
#endif /* __int_least32_t */
#ifdef __INT16_TYPE__
#ifndef __int8_t_defined /* glibc sys/types.h also defines int16_t*/
typedef __INT16_TYPE__ int16_t;
#endif /* __int8_t_defined */
typedef __UINT16_TYPE__ uint16_t;
#define __int_least16_t int16_t
#define __uint_least16_t uint16_t
#endif /* __INT16_TYPE__ */
#ifdef __int_least16_t
typedef __int_least16_t int_least16_t;
typedef __uint_least16_t uint_least16_t;
typedef __int_least16_t int_fast16_t;
typedef __uint_least16_t uint_fast16_t;
#endif /* __int_least16_t */
#ifdef __INT8_TYPE__
#ifndef __int8_t_defined /* glibc sys/types.h also defines int8_t*/
typedef __INT8_TYPE__ int8_t;
#endif /* __int8_t_defined */
typedef __UINT8_TYPE__ uint8_t;
#define __int_least8_t int8_t
#define __uint_least8_t uint8_t
#endif /* __INT8_TYPE__ */
#ifdef __int_least8_t
typedef __int_least8_t int_least8_t;
typedef __uint_least8_t uint_least8_t;
typedef __int_least8_t int_fast8_t;
typedef __uint_least8_t uint_fast8_t;
#endif /* __int_least8_t */
/* prevent glibc sys/types.h from defining conflicting types */
#ifndef __int8_t_defined
#define __int8_t_defined
#endif /* __int8_t_defined */
/* C99 7.18.1.4 Integer types capable of holding object pointers.
*/
#define __stdint_join3(a, b, c) a##b##c
#ifndef _INTPTR_T
#ifndef __intptr_t_defined
typedef __INTPTR_TYPE__ intptr_t;
#define __intptr_t_defined
#define _INTPTR_T
#endif
#endif
#ifndef _UINTPTR_T
typedef __UINTPTR_TYPE__ uintptr_t;
#define _UINTPTR_T
#endif
/* C99 7.18.1.5 Greatest-width integer types.
*/
typedef __INTMAX_TYPE__ intmax_t;
typedef __UINTMAX_TYPE__ uintmax_t;
/* C99 7.18.4 Macros for minimum-width integer constants.
*
* The standard requires that integer constant macros be defined for all the
* minimum-width types defined above. As 8-, 16-, 32-, and 64-bit minimum-width
* types are required, the corresponding integer constant macros are defined
* here. This implementation also defines minimum-width types for every other
* integer width that the target implements, so corresponding macros are
* defined below, too.
*
* These macros are defined using the same successive-shrinking approach as
* the type definitions above. It is likewise important that macros are defined
 * in order of descending width.
*
* Note that C++ should not check __STDC_CONSTANT_MACROS here, contrary to the
* claims of the C standard (see C++ 18.3.1p2, [cstdint.syn]).
*/
#define __int_c_join(a, b) a##b
#define __int_c(v, suffix) __int_c_join(v, suffix)
#define __uint_c(v, suffix) __int_c_join(v##U, suffix)
#ifdef __INT64_TYPE__
#ifdef __INT64_C_SUFFIX__
#define __int64_c_suffix __INT64_C_SUFFIX__
#else
#undef __int64_c_suffix
#endif /* __INT64_C_SUFFIX__ */
#endif /* __INT64_TYPE__ */
#ifdef __int_least64_t
#ifdef __int64_c_suffix
#define INT64_C(v) __int_c(v, __int64_c_suffix)
#define UINT64_C(v) __uint_c(v, __int64_c_suffix)
#else
#define INT64_C(v) v
#define UINT64_C(v) v##U
#endif /* __int64_c_suffix */
#endif /* __int_least64_t */
#ifdef __INT32_TYPE__
#ifdef __INT32_C_SUFFIX__
#define __int32_c_suffix __INT32_C_SUFFIX__
#else
#undef __int32_c_suffix
#endif /* __INT32_C_SUFFIX__ */
#endif /* __INT32_TYPE__ */
#ifdef __int_least32_t
#ifdef __int32_c_suffix
#define INT32_C(v) __int_c(v, __int32_c_suffix)
#define UINT32_C(v) __uint_c(v, __int32_c_suffix)
#else
#define INT32_C(v) v
#define UINT32_C(v) v##U
#endif /* __int32_c_suffix */
#endif /* __int_least32_t */
#ifdef __INT16_TYPE__
#ifdef __INT16_C_SUFFIX__
#define __int16_c_suffix __INT16_C_SUFFIX__
#else
#undef __int16_c_suffix
#endif /* __INT16_C_SUFFIX__ */
#endif /* __INT16_TYPE__ */
#ifdef __int_least16_t
#ifdef __int16_c_suffix
#define INT16_C(v) __int_c(v, __int16_c_suffix)
#define UINT16_C(v) __uint_c(v, __int16_c_suffix)
#else
#define INT16_C(v) v
#define UINT16_C(v) v##U
#endif /* __int16_c_suffix */
#endif /* __int_least16_t */
#ifdef __INT8_TYPE__
#ifdef __INT8_C_SUFFIX__
#define __int8_c_suffix __INT8_C_SUFFIX__
#else
#undef __int8_c_suffix
#endif /* __INT8_C_SUFFIX__ */
#endif /* __INT8_TYPE__ */
#ifdef __int_least8_t
#ifdef __int8_c_suffix
#define INT8_C(v) __int_c(v, __int8_c_suffix)
#define UINT8_C(v) __uint_c(v, __int8_c_suffix)
#else
#define INT8_C(v) v
#define UINT8_C(v) v##U
#endif /* __int8_c_suffix */
#endif /* __int_least8_t */
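The `__int_c_join` / `__int_c` pair above is the usual two-level token paste: the outer macro expands its arguments (so `__int64_c_suffix` becomes the real suffix token) before the inner `##` glues them together. A standalone sketch of the same mechanism, with hypothetical `MY_*` names:

```cpp
#include <cassert>
#include <cstdint>

// Inner macro does the paste; arguments next to ## are NOT expanded.
#define MY_INT_C_JOIN(a, b) a##b
// Outer macro has no ##, so its arguments ARE fully expanded first.
#define MY_INT_C(v, suffix) MY_INT_C_JOIN(v, suffix)

// Hypothetical stand-in for __INT64_C_SUFFIX__ on a typical target.
#define MY_INT64_SUFFIX LL
#define MY_INT64_C(v) MY_INT_C(v, MY_INT64_SUFFIX)
// MY_INT64_C(1) -> MY_INT_C(1, MY_INT64_SUFFIX)
//               -> MY_INT_C_JOIN(1, LL) -> 1LL
```

Without the outer level, `MY_INT_C_JOIN(1, MY_INT64_SUFFIX)` would paste the literal token `MY_INT64_SUFFIX` instead of `LL` and fail to form a valid integer literal.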
/* C99 7.18.2.1 Limits of exact-width integer types.
* C99 7.18.2.2 Limits of minimum-width integer types.
* C99 7.18.2.3 Limits of fastest minimum-width integer types.
*
* The presence of limit macros are completely optional in C99. This
* implementation defines limits for all of the types (exact- and
* minimum-width) that it defines above, using the limits of the minimum-width
* type for any types that do not have exact-width representations.
*
* As in the type definitions, this section takes an approach of
* successive-shrinking to determine which limits to use for the standard (8,
* 16, 32, 64) bit widths when they don't have exact representations. It is
 * therefore important that the definitions be kept in order of descending
* widths.
*
* Note that C++ should not check __STDC_LIMIT_MACROS here, contrary to the
* claims of the C standard (see C++ 18.3.1p2, [cstdint.syn]).
*/
#ifdef __INT64_TYPE__
#define INT64_MAX INT64_C(9223372036854775807)
#define INT64_MIN (-INT64_C(9223372036854775807) - 1)
#define UINT64_MAX UINT64_C(18446744073709551615)
#define __INT_LEAST64_MIN INT64_MIN
#define __INT_LEAST64_MAX INT64_MAX
#define __UINT_LEAST64_MAX UINT64_MAX
#endif /* __INT64_TYPE__ */
#ifdef __INT_LEAST64_MIN
#define INT_LEAST64_MIN __INT_LEAST64_MIN
#define INT_LEAST64_MAX __INT_LEAST64_MAX
#define UINT_LEAST64_MAX __UINT_LEAST64_MAX
#define INT_FAST64_MIN __INT_LEAST64_MIN
#define INT_FAST64_MAX __INT_LEAST64_MAX
#define UINT_FAST64_MAX __UINT_LEAST64_MAX
#endif /* __INT_LEAST64_MIN */
#ifdef __INT32_TYPE__
#define INT32_MAX INT32_C(2147483647)
#define INT32_MIN (-INT32_C(2147483647) - 1)
#define UINT32_MAX UINT32_C(4294967295)
#define __INT_LEAST32_MIN INT32_MIN
#define __INT_LEAST32_MAX INT32_MAX
#define __UINT_LEAST32_MAX UINT32_MAX
#endif /* __INT32_TYPE__ */
#ifdef __INT_LEAST32_MIN
#define INT_LEAST32_MIN __INT_LEAST32_MIN
#define INT_LEAST32_MAX __INT_LEAST32_MAX
#define UINT_LEAST32_MAX __UINT_LEAST32_MAX
#define INT_FAST32_MIN __INT_LEAST32_MIN
#define INT_FAST32_MAX __INT_LEAST32_MAX
#define UINT_FAST32_MAX __UINT_LEAST32_MAX
#endif /* __INT_LEAST32_MIN */
#ifdef __INT16_TYPE__
#define INT16_MAX INT16_C(32767)
#define INT16_MIN (-INT16_C(32767) - 1)
#define UINT16_MAX UINT16_C(65535)
#define __INT_LEAST16_MIN INT16_MIN
#define __INT_LEAST16_MAX INT16_MAX
#define __UINT_LEAST16_MAX UINT16_MAX
#endif /* __INT16_TYPE__ */
#ifdef __INT_LEAST16_MIN
#define INT_LEAST16_MIN __INT_LEAST16_MIN
#define INT_LEAST16_MAX __INT_LEAST16_MAX
#define UINT_LEAST16_MAX __UINT_LEAST16_MAX
#define INT_FAST16_MIN __INT_LEAST16_MIN
#define INT_FAST16_MAX __INT_LEAST16_MAX
#define UINT_FAST16_MAX __UINT_LEAST16_MAX
#endif /* __INT_LEAST16_MIN */
#ifdef __INT8_TYPE__
#define INT8_MAX INT8_C(127)
#define INT8_MIN (-INT8_C(127) - 1)
#define UINT8_MAX UINT8_C(255)
#define __INT_LEAST8_MIN INT8_MIN
#define __INT_LEAST8_MAX INT8_MAX
#define __UINT_LEAST8_MAX UINT8_MAX
#endif /* __INT8_TYPE__ */
#ifdef __INT_LEAST8_MIN
#define INT_LEAST8_MIN __INT_LEAST8_MIN
#define INT_LEAST8_MAX __INT_LEAST8_MAX
#define UINT_LEAST8_MAX __UINT_LEAST8_MAX
#define INT_FAST8_MIN __INT_LEAST8_MIN
#define INT_FAST8_MAX __INT_LEAST8_MAX
#define UINT_FAST8_MAX __UINT_LEAST8_MAX
#endif /* __INT_LEAST8_MIN */
/* Some utility macros */
#define __INTN_MIN(n) __stdint_join3(INT, n, _MIN)
#define __INTN_MAX(n) __stdint_join3(INT, n, _MAX)
#define __UINTN_MAX(n) __stdint_join3(UINT, n, _MAX)
#define __INTN_C(n, v) __stdint_join3(INT, n, _C(v))
#define __UINTN_C(n, v) __stdint_join3(UINT, n, _C(v))
/* C99 7.18.2.4 Limits of integer types capable of holding object pointers. */
/* C99 7.18.3 Limits of other integer types. */
#define INTPTR_MIN (-__INTPTR_MAX__ - 1)
#define INTPTR_MAX __INTPTR_MAX__
#define UINTPTR_MAX __UINTPTR_MAX__
#define PTRDIFF_MIN (-__PTRDIFF_MAX__ - 1)
#define PTRDIFF_MAX __PTRDIFF_MAX__
#define SIZE_MAX __SIZE_MAX__
/* ISO9899:2011 7.20 (C11 Annex K): Define RSIZE_MAX if __STDC_WANT_LIB_EXT1__
* is enabled. */
#if defined(__STDC_WANT_LIB_EXT1__) && __STDC_WANT_LIB_EXT1__ >= 1
#define RSIZE_MAX (SIZE_MAX >> 1)
#endif
/* C99 7.18.2.5 Limits of greatest-width integer types. */
#define INTMAX_MIN (-__INTMAX_MAX__ - 1)
#define INTMAX_MAX __INTMAX_MAX__
#define UINTMAX_MAX __UINTMAX_MAX__
/* C99 7.18.3 Limits of other integer types. */
#define SIG_ATOMIC_MIN __INTN_MIN(__SIG_ATOMIC_WIDTH__)
#define SIG_ATOMIC_MAX __INTN_MAX(__SIG_ATOMIC_WIDTH__)
#ifdef __WINT_UNSIGNED__
#define WINT_MIN __UINTN_C(__WINT_WIDTH__, 0)
#define WINT_MAX __UINTN_MAX(__WINT_WIDTH__)
#else
#define WINT_MIN __INTN_MIN(__WINT_WIDTH__)
#define WINT_MAX __INTN_MAX(__WINT_WIDTH__)
#endif
#ifndef WCHAR_MAX
#define WCHAR_MAX __WCHAR_MAX__
#endif
#ifndef WCHAR_MIN
#if __WCHAR_MAX__ == __INTN_MAX(__WCHAR_WIDTH__)
#define WCHAR_MIN __INTN_MIN(__WCHAR_WIDTH__)
#else
#define WCHAR_MIN __UINTN_C(__WCHAR_WIDTH__, 0)
#endif
#endif
/* 7.18.4.2 Macros for greatest-width integer constants. */
#define INTMAX_C(v) __int_c(v, __INTMAX_C_SUFFIX__)
#define UINTMAX_C(v) __int_c(v, __UINTMAX_C_SUFFIX__)
#endif // SLANG_LLVM_H

View File

@@ -0,0 +1,2 @@
#define SLANG_TAG_VERSION "2026.3.1"
#define SLANG_VERSION_NUMERIC "2026.3.1"

View File

@@ -0,0 +1,183 @@
// Prelude for PyTorch cpp binding.
// clang-format off
#include <torch/extension.h>
// clang-format on
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/CUDAUtils.h>
#include <stdexcept>
#include <string>
#include <vector>
#ifdef SLANG_LLVM
#include "slang-llvm.h"
#else // SLANG_LLVM
#if SLANG_GCC_FAMILY && __GNUC__ < 6
#include <cmath>
#define SLANG_PRELUDE_STD std::
#else
#include <math.h>
#define SLANG_PRELUDE_STD
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#endif // SLANG_LLVM
#include "../source/core/slang-string.h"
#if defined(_MSC_VER)
#define SLANG_PRELUDE_SHARED_LIB_EXPORT __declspec(dllexport)
#else
#define SLANG_PRELUDE_SHARED_LIB_EXPORT __attribute__((__visibility__("default")))
// # define SLANG_PRELUDE_SHARED_LIB_EXPORT __attribute__ ((dllexport))
// __attribute__((__visibility__("default")))
#endif
#ifdef __cplusplus
#define SLANG_PRELUDE_EXTERN_C extern "C"
#define SLANG_PRELUDE_EXTERN_C_START \
extern "C" \
{
#define SLANG_PRELUDE_EXTERN_C_END }
#else
#define SLANG_PRELUDE_EXTERN_C
#define SLANG_PRELUDE_EXTERN_C_START
#define SLANG_PRELUDE_EXTERN_C_END
#endif
#define SLANG_PRELUDE_NAMESPACE
#ifndef SLANG_NO_THROW
#define SLANG_NO_THROW
#endif
#ifndef SLANG_STDCALL
#define SLANG_STDCALL
#endif
#ifndef SLANG_MCALL
#define SLANG_MCALL SLANG_STDCALL
#endif
#ifndef SLANG_FORCE_INLINE
#define SLANG_FORCE_INLINE inline
#endif
#include "slang-cpp-scalar-intrinsics.h"
#include "slang-cpp-types-core.h"
static const int kSlangTorchTensorMaxDim = 5;
// NOTE: If you change this struct's layout, also update the hard-coded size/alignment
// in _createTypeLayout() in slang-type-layout.cpp.
struct TensorView
{
uint8_t* data;
uint32_t strides[kSlangTorchTensorMaxDim];
uint32_t sizes[kSlangTorchTensorMaxDim];
uint32_t dimensionCount;
};
TensorView make_tensor_view(
torch::Tensor val,
const char* name,
torch::ScalarType targetScalarType,
bool requireContiguous)
{
// We're currently not trying to implicitly cast or transfer to device for two reasons:
// 1. There appears to be a bug with .to() where successive calls after the first one fail.
// 2. Silent casts like this can cause large memory allocations & unexpected overheads.
// It's better to be explicit.
// Expect tensors to be on CUDA device
if (!val.device().is_cuda())
throw std::runtime_error(
std::string(name).append(": tensor is not on CUDA device.").c_str());
// Expect tensors to be the right type.
if (val.dtype() != targetScalarType)
throw std::runtime_error(
std::string(name).append(": tensor is not of the expected type.").c_str());
// Check that the tensor is contiguous
if (requireContiguous && !val.is_contiguous())
throw std::runtime_error(std::string(name).append(": tensor is not contiguous.").c_str());
TensorView res = {};
res.dimensionCount = val.dim();
res.data = nullptr;
size_t elementSize = 4;
switch (val.scalar_type())
{
case torch::kInt8:
case torch::kUInt8:
elementSize = 1;
res.data = (uint8_t*)val.data_ptr<uint8_t>();
break;
case torch::kBFloat16:
elementSize = 2;
res.data = (uint8_t*)val.data_ptr<torch::BFloat16>();
break;
case torch::kFloat16:
elementSize = 2;
res.data = (uint8_t*)val.data_ptr<at::Half>();
break;
case torch::kInt16:
elementSize = 2;
res.data = (uint8_t*)val.data_ptr<int16_t>();
break;
case torch::kFloat32:
elementSize = 4;
res.data = (uint8_t*)val.data_ptr<float>();
break;
case torch::kInt32:
elementSize = 4;
res.data = (uint8_t*)val.data_ptr<int32_t>();
break;
case torch::kFloat64:
elementSize = 8;
res.data = (uint8_t*)val.data_ptr<double>();
break;
case torch::kInt64:
elementSize = 8;
res.data = (uint8_t*)val.data_ptr<int64_t>();
break;
case torch::kBool:
elementSize = 1;
res.data = (uint8_t*)val.data_ptr<bool>();
break;
}
if (val.dim() > kSlangTorchTensorMaxDim)
throw std::runtime_error(std::string(name)
.append(": number of dimensions exceeds limit (")
.append(std::to_string(kSlangTorchTensorMaxDim))
.append(")")
.c_str());
bool isEmpty = true;
for (int i = 0; i < val.dim(); ++i)
{
res.strides[i] = val.stride(i) * elementSize;
if (res.strides[i] == 0)
throw std::runtime_error(
std::string(name)
.append(": tensors with broadcasted dimensions are not supported (use "
"tensor.contiguous() to make tensor whole)")
.c_str());
res.sizes[i] = val.size(i);
if (res.sizes[i] > 0)
isEmpty = false;
}
if (!res.data && !isEmpty)
throw std::runtime_error(std::string(name).append(": data pointer is invalid.").c_str());
return res;
}
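Note that `make_tensor_view` stores strides in bytes, not elements: PyTorch's `Tensor::stride()` reports element strides, so each one is scaled by `elementSize` before being handed to the Slang kernel. That conversion can be sketched without torch (the `MiniTensorView`/`makeView` names below are hypothetical, for illustration only):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kMaxDim = 5;

// Mirrors the layout fields of the TensorView filled in above.
struct MiniTensorView
{
    uint32_t strides[kMaxDim]; // byte strides
    uint32_t sizes[kMaxDim];
    uint32_t dimensionCount;
};

// elementStrides/sizes describe the tensor the way PyTorch does,
// e.g. a contiguous 2x3 tensor has strides {3, 1} (in elements).
MiniTensorView makeView(const int64_t* elementStrides, const int64_t* sizes,
                        int dim, std::size_t elementSize)
{
    MiniTensorView v = {};
    v.dimensionCount = (uint32_t)dim;
    for (int i = 0; i < dim; ++i)
    {
        // Scale element strides to byte strides, as make_tensor_view does.
        v.strides[i] = (uint32_t)(elementStrides[i] * elementSize);
        v.sizes[i] = (uint32_t)sizes[i];
    }
    return v;
}
```

For a contiguous 2x3 `float32` tensor this yields byte strides {12, 4} — which is also why a broadcasted dimension (element stride 0) is rejected above: a zero byte stride would alias every row onto the same memory.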
#define SLANG_PRELUDE_EXPORT

File diff suppressed because it is too large

View File

@@ -0,0 +1,44 @@
####### Expanded from @PACKAGE_INIT@ by configure_package_config_file() #######
####### Any changes to this file will be overwritten by the next CMake run ####
####### The input file was SlangConfig.cmake.in ########
get_filename_component(PACKAGE_PREFIX_DIR "${CMAKE_CURRENT_LIST_DIR}/../../../" ABSOLUTE)
macro(set_and_check _var _file)
set(${_var} "${_file}")
if(NOT EXISTS "${_file}")
message(FATAL_ERROR "File or directory ${_file} referenced by variable ${_var} does not exist !")
endif()
endmacro()
macro(check_required_components _NAME)
foreach(comp ${${_NAME}_FIND_COMPONENTS})
if(NOT ${_NAME}_${comp}_FOUND)
if(${_NAME}_FIND_REQUIRED_${comp})
set(${_NAME}_FOUND FALSE)
endif()
endif()
endforeach()
endmacro()
####################################################################################
if (NOT CMAKE_SYSTEM_NAME STREQUAL "Emscripten")
include("${CMAKE_CURRENT_LIST_DIR}/slangTargets.cmake")
check_required_components("slang")
endif()
if(ON)
find_program(SLANGC_EXECUTABLE "slangc" HINTS "${PACKAGE_PREFIX_DIR}/bin" ENV PATH)
if (NOT SLANGC_EXECUTABLE)
message(STATUS "slangc executable not found; ensure it is available in your PATH.")
endif()
set(SLANG_EXECUTABLE ${SLANGC_EXECUTABLE} CACHE STRING "Path to the slangc executable")
endif()

View File

@@ -0,0 +1,65 @@
# This is a basic version file for the Config-mode of find_package().
# It is used by write_basic_package_version_file() as input file for configure_file()
# to create a version-file which can be installed along a config.cmake file.
#
# The created file sets PACKAGE_VERSION_EXACT if the current version string and
# the requested version string are exactly the same and it sets
# PACKAGE_VERSION_COMPATIBLE if the current version is >= requested version,
# but only if the requested major version is the same as the current one.
# The variable CVF_VERSION must be set before calling configure_file().
set(PACKAGE_VERSION "2026.3.1")
if(PACKAGE_VERSION VERSION_LESS PACKAGE_FIND_VERSION)
set(PACKAGE_VERSION_COMPATIBLE FALSE)
else()
if("2026.3.1" MATCHES "^([0-9]+)\\.")
set(CVF_VERSION_MAJOR "${CMAKE_MATCH_1}")
if(NOT CVF_VERSION_MAJOR VERSION_EQUAL 0)
string(REGEX REPLACE "^0+" "" CVF_VERSION_MAJOR "${CVF_VERSION_MAJOR}")
endif()
else()
set(CVF_VERSION_MAJOR "2026.3.1")
endif()
if(PACKAGE_FIND_VERSION_RANGE)
# both endpoints of the range must have the expected major version
math (EXPR CVF_VERSION_MAJOR_NEXT "${CVF_VERSION_MAJOR} + 1")
if (NOT PACKAGE_FIND_VERSION_MIN_MAJOR STREQUAL CVF_VERSION_MAJOR
OR ((PACKAGE_FIND_VERSION_RANGE_MAX STREQUAL "INCLUDE" AND NOT PACKAGE_FIND_VERSION_MAX_MAJOR STREQUAL CVF_VERSION_MAJOR)
OR (PACKAGE_FIND_VERSION_RANGE_MAX STREQUAL "EXCLUDE" AND NOT PACKAGE_FIND_VERSION_MAX VERSION_LESS_EQUAL CVF_VERSION_MAJOR_NEXT)))
set(PACKAGE_VERSION_COMPATIBLE FALSE)
elseif(PACKAGE_FIND_VERSION_MIN_MAJOR STREQUAL CVF_VERSION_MAJOR
AND ((PACKAGE_FIND_VERSION_RANGE_MAX STREQUAL "INCLUDE" AND PACKAGE_VERSION VERSION_LESS_EQUAL PACKAGE_FIND_VERSION_MAX)
OR (PACKAGE_FIND_VERSION_RANGE_MAX STREQUAL "EXCLUDE" AND PACKAGE_VERSION VERSION_LESS PACKAGE_FIND_VERSION_MAX)))
set(PACKAGE_VERSION_COMPATIBLE TRUE)
else()
set(PACKAGE_VERSION_COMPATIBLE FALSE)
endif()
else()
if(PACKAGE_FIND_VERSION_MAJOR STREQUAL CVF_VERSION_MAJOR)
set(PACKAGE_VERSION_COMPATIBLE TRUE)
else()
set(PACKAGE_VERSION_COMPATIBLE FALSE)
endif()
if(PACKAGE_FIND_VERSION STREQUAL PACKAGE_VERSION)
set(PACKAGE_VERSION_EXACT TRUE)
endif()
endif()
endif()
# if the installed or the using project don't have CMAKE_SIZEOF_VOID_P set, ignore it:
if("${CMAKE_SIZEOF_VOID_P}" STREQUAL "" OR "8" STREQUAL "")
return()
endif()
# check that the installed version has the same 32/64bit-ness as the one which is currently searching:
if(NOT CMAKE_SIZEOF_VOID_P STREQUAL "8")
math(EXPR installedBits "8 * 8")
set(PACKAGE_VERSION "${PACKAGE_VERSION} (${installedBits}bit)")
set(PACKAGE_VERSION_UNSUITABLE TRUE)
endif()

View File

@@ -0,0 +1,90 @@
#----------------------------------------------------------------
# Generated CMake target import file for configuration "Release".
#----------------------------------------------------------------
# Commands may need to know the format version.
set(CMAKE_IMPORT_FILE_VERSION 1)
# Import target "slang::slang-llvm" for configuration "Release"
set_property(TARGET slang::slang-llvm APPEND PROPERTY IMPORTED_CONFIGURATIONS RELEASE)
set_target_properties(slang::slang-llvm PROPERTIES
IMPORTED_COMMON_LANGUAGE_RUNTIME_RELEASE ""
IMPORTED_LOCATION_RELEASE "${_IMPORT_PREFIX}/lib/libslang-llvm.so"
IMPORTED_NO_SONAME_RELEASE "TRUE"
)
list(APPEND _cmake_import_check_targets slang::slang-llvm )
list(APPEND _cmake_import_check_files_for_slang::slang-llvm "${_IMPORT_PREFIX}/lib/libslang-llvm.so" )
# Import target "slang::slang-glslang" for configuration "Release"
set_property(TARGET slang::slang-glslang APPEND PROPERTY IMPORTED_CONFIGURATIONS RELEASE)
set_target_properties(slang::slang-glslang PROPERTIES
IMPORTED_COMMON_LANGUAGE_RUNTIME_RELEASE ""
IMPORTED_LOCATION_RELEASE "${_IMPORT_PREFIX}/lib/libslang-glslang-2026.3.1.so"
IMPORTED_NO_SONAME_RELEASE "TRUE"
)
list(APPEND _cmake_import_check_targets slang::slang-glslang )
list(APPEND _cmake_import_check_files_for_slang::slang-glslang "${_IMPORT_PREFIX}/lib/libslang-glslang-2026.3.1.so" )
# Import target "slang::slangd" for configuration "Release"
set_property(TARGET slang::slangd APPEND PROPERTY IMPORTED_CONFIGURATIONS RELEASE)
set_target_properties(slang::slangd PROPERTIES
IMPORTED_LOCATION_RELEASE "${_IMPORT_PREFIX}/bin/slangd"
)
list(APPEND _cmake_import_check_targets slang::slangd )
list(APPEND _cmake_import_check_files_for_slang::slangd "${_IMPORT_PREFIX}/bin/slangd" )
# Import target "slang::slangi" for configuration "Release"
set_property(TARGET slang::slangi APPEND PROPERTY IMPORTED_CONFIGURATIONS RELEASE)
set_target_properties(slang::slangi PROPERTIES
IMPORTED_LOCATION_RELEASE "${_IMPORT_PREFIX}/bin/slangi"
)
list(APPEND _cmake_import_check_targets slang::slangi )
list(APPEND _cmake_import_check_files_for_slang::slangi "${_IMPORT_PREFIX}/bin/slangi" )
# Import target "slang::gfx" for configuration "Release"
set_property(TARGET slang::gfx APPEND PROPERTY IMPORTED_CONFIGURATIONS RELEASE)
set_target_properties(slang::gfx PROPERTIES
IMPORTED_LINK_DEPENDENT_LIBRARIES_RELEASE "slang::slang"
IMPORTED_LOCATION_RELEASE "${_IMPORT_PREFIX}/lib/libgfx.so.0.2026.3.1"
IMPORTED_SONAME_RELEASE "libgfx.so.0.2026.3.1"
)
list(APPEND _cmake_import_check_targets slang::gfx )
list(APPEND _cmake_import_check_files_for_slang::gfx "${_IMPORT_PREFIX}/lib/libgfx.so.0.2026.3.1" )
# Import target "slang::slang-glsl-module" for configuration "Release"
set_property(TARGET slang::slang-glsl-module APPEND PROPERTY IMPORTED_CONFIGURATIONS RELEASE)
set_target_properties(slang::slang-glsl-module PROPERTIES
IMPORTED_COMMON_LANGUAGE_RUNTIME_RELEASE ""
IMPORTED_LOCATION_RELEASE "${_IMPORT_PREFIX}/lib/libslang-glsl-module-2026.3.1.so"
IMPORTED_NO_SONAME_RELEASE "TRUE"
)
list(APPEND _cmake_import_check_targets slang::slang-glsl-module )
list(APPEND _cmake_import_check_files_for_slang::slang-glsl-module "${_IMPORT_PREFIX}/lib/libslang-glsl-module-2026.3.1.so" )
# Import target "slang::slang" for configuration "Release"
set_property(TARGET slang::slang APPEND PROPERTY IMPORTED_CONFIGURATIONS RELEASE)
set_target_properties(slang::slang PROPERTIES
IMPORTED_LOCATION_RELEASE "${_IMPORT_PREFIX}/lib/libslang-compiler.so.0.2026.3.1"
IMPORTED_SONAME_RELEASE "libslang-compiler.so.0.2026.3.1"
)
list(APPEND _cmake_import_check_targets slang::slang )
list(APPEND _cmake_import_check_files_for_slang::slang "${_IMPORT_PREFIX}/lib/libslang-compiler.so.0.2026.3.1" )
# Import target "slang::slangc" for configuration "Release"
set_property(TARGET slang::slangc APPEND PROPERTY IMPORTED_CONFIGURATIONS RELEASE)
set_target_properties(slang::slangc PROPERTIES
IMPORTED_LOCATION_RELEASE "${_IMPORT_PREFIX}/bin/slangc"
)
list(APPEND _cmake_import_check_targets slang::slangc )
list(APPEND _cmake_import_check_files_for_slang::slangc "${_IMPORT_PREFIX}/bin/slangc" )
# Commands beyond this point should not need to know the version.
set(CMAKE_IMPORT_FILE_VERSION)

View File

@@ -0,0 +1,137 @@
# Generated by CMake
if("${CMAKE_MAJOR_VERSION}.${CMAKE_MINOR_VERSION}" LESS 2.8)
message(FATAL_ERROR "CMake >= 2.8.3 required")
endif()
if(CMAKE_VERSION VERSION_LESS "2.8.3")
message(FATAL_ERROR "CMake >= 2.8.3 required")
endif()
cmake_policy(PUSH)
cmake_policy(VERSION 2.8.3...3.29)
#----------------------------------------------------------------
# Generated CMake target import file.
#----------------------------------------------------------------
# Commands may need to know the format version.
set(CMAKE_IMPORT_FILE_VERSION 1)
# Protect against multiple inclusion, which would fail when already imported targets are added once more.
set(_cmake_targets_defined "")
set(_cmake_targets_not_defined "")
set(_cmake_expected_targets "")
foreach(_cmake_expected_target IN ITEMS slang::slang-llvm slang::slang-glslang slang::slangd slang::slangi slang::gfx slang::slang-glsl-module slang::slang slang::slangc)
list(APPEND _cmake_expected_targets "${_cmake_expected_target}")
if(TARGET "${_cmake_expected_target}")
list(APPEND _cmake_targets_defined "${_cmake_expected_target}")
else()
list(APPEND _cmake_targets_not_defined "${_cmake_expected_target}")
endif()
endforeach()
unset(_cmake_expected_target)
if(_cmake_targets_defined STREQUAL _cmake_expected_targets)
unset(_cmake_targets_defined)
unset(_cmake_targets_not_defined)
unset(_cmake_expected_targets)
unset(CMAKE_IMPORT_FILE_VERSION)
cmake_policy(POP)
return()
endif()
if(NOT _cmake_targets_defined STREQUAL "")
string(REPLACE ";" ", " _cmake_targets_defined_text "${_cmake_targets_defined}")
string(REPLACE ";" ", " _cmake_targets_not_defined_text "${_cmake_targets_not_defined}")
message(FATAL_ERROR "Some (but not all) targets in this export set were already defined.\nTargets Defined: ${_cmake_targets_defined_text}\nTargets not yet defined: ${_cmake_targets_not_defined_text}\n")
endif()
unset(_cmake_targets_defined)
unset(_cmake_targets_not_defined)
unset(_cmake_expected_targets)
# Compute the installation prefix relative to this file.
get_filename_component(_IMPORT_PREFIX "${CMAKE_CURRENT_LIST_FILE}" PATH)
get_filename_component(_IMPORT_PREFIX "${_IMPORT_PREFIX}" PATH)
get_filename_component(_IMPORT_PREFIX "${_IMPORT_PREFIX}" PATH)
get_filename_component(_IMPORT_PREFIX "${_IMPORT_PREFIX}" PATH)
if(_IMPORT_PREFIX STREQUAL "/")
set(_IMPORT_PREFIX "")
endif()
# Create imported target slang::slang-llvm
add_library(slang::slang-llvm MODULE IMPORTED)
set_target_properties(slang::slang-llvm PROPERTIES
INTERFACE_COMPILE_DEFINITIONS "SLANG_DYNAMIC"
)
# Create imported target slang::slang-glslang
add_library(slang::slang-glslang MODULE IMPORTED)
# Create imported target slang::slangd
add_executable(slang::slangd IMPORTED)
# Create imported target slang::slangi
add_executable(slang::slangi IMPORTED)
# Create imported target slang::gfx
add_library(slang::gfx SHARED IMPORTED)
set_target_properties(slang::gfx PROPERTIES
INTERFACE_COMPILE_DEFINITIONS "SLANG_GFX_DYNAMIC"
INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
)
# Create imported target slang::slang-glsl-module
add_library(slang::slang-glsl-module MODULE IMPORTED)
# Create imported target slang::slang
add_library(slang::slang SHARED IMPORTED)
set_target_properties(slang::slang PROPERTIES
INTERFACE_COMPILE_DEFINITIONS "SLANG_DYNAMIC"
INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
)
# Create imported target slang::slangc
add_executable(slang::slangc IMPORTED)
# Load information for each installed configuration.
file(GLOB _cmake_config_files "${CMAKE_CURRENT_LIST_DIR}/slangTargets-*.cmake")
foreach(_cmake_config_file IN LISTS _cmake_config_files)
include("${_cmake_config_file}")
endforeach()
unset(_cmake_config_file)
unset(_cmake_config_files)
# Cleanup temporary variables.
set(_IMPORT_PREFIX)
# Loop over all imported files and verify that they actually exist
foreach(_cmake_target IN LISTS _cmake_import_check_targets)
if(CMAKE_VERSION VERSION_LESS "3.28"
OR NOT DEFINED _cmake_import_check_xcframework_for_${_cmake_target}
OR NOT IS_DIRECTORY "${_cmake_import_check_xcframework_for_${_cmake_target}}")
foreach(_cmake_file IN LISTS "_cmake_import_check_files_for_${_cmake_target}")
if(NOT EXISTS "${_cmake_file}")
message(FATAL_ERROR "The imported target \"${_cmake_target}\" references the file
\"${_cmake_file}\"
but this file does not exist. Possible reasons include:
* The file was deleted, renamed, or moved to another location.
* An install or uninstall procedure did not complete successfully.
* The installation package was faulty and contained
\"${CMAKE_CURRENT_LIST_FILE}\"
but not all the files it references.
")
endif()
endforeach()
endif()
unset(_cmake_file)
unset("_cmake_import_check_files_for_${_cmake_target}")
endforeach()
unset(_cmake_target)
unset(_cmake_import_check_targets)
# This file does not depend on other imported targets which have
# been exported from the same project but in a separate export set.
# Commands beyond this point should not need to know the version.
set(CMAKE_IMPORT_FILE_VERSION)
cmake_policy(POP)

1 lib/All/slang/lib/libgfx.so Executable file
View File

@@ -0,0 +1 @@
libgfx.so.0.2026.3.1

Binary file not shown.

View File

@@ -0,0 +1 @@
libslang-compiler.so.0.2026.3.1

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@@ -0,0 +1 @@
libslang-rt.so.0.2026.3.1

Binary file not shown.

lib/All/slang/lib/libslang.so Executable file

@@ -0,0 +1 @@
libslang-compiler.so.0.2026.3.1

File diff suppressed because it is too large


@@ -0,0 +1,220 @@
implementing neural;
__include iactivation;
/**
Identity activation: returns input unchanged.
*/
public struct IdentityActivation<T> : IActivation<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
public __init() {}
[NoDiffThis]
[Differentiable]
public Vector eval<Vector>(Vector input)
where Vector : IVector<T>
{
return input;
}
}
/**
ReLU activation: max(x, 0).
*/
public struct ReLU<T> : IActivation<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
public __init() {}
[NoDiffThis]
[Differentiable]
public Vector eval<Vector>(Vector input)
where Vector : IVector<T>
{
Vector output = Vector();
[ForceUnroll]
for (int i = 0; i < Vector.Size; i++)
output[i] = max(input[i], T(0));
return output;
}
}
/**
LeakyReLU activation: x < 0 ? alpha*x : x
Construct with the leak coefficient alpha (typically 0.01).
*/
public struct LeakyReLU<T> : IActivation<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
/// Leak coefficient for negative inputs.
public T alpha;
/// Constructor with optional alpha value (defaults to 0.01).
public __init(T alpha = T(0.01))
{
this.alpha = alpha;
}
[NoDiffThis]
[Differentiable]
public Vector eval<Vector>(Vector input)
where Vector : IVector<T>
{
Vector output = Vector();
[ForceUnroll]
for (int i = 0; i < Vector.Size; i++)
{
let x = input[i];
output[i] = (x < T(0)) ? alpha * x : x;
}
return output;
}
}
/**
Sigmoid activation: 1 / (1 + exp(-x))
*/
public struct Sigmoid<T> : IActivation<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
public __init() {}
[NoDiffThis]
[Differentiable]
public Vector eval<Vector>(Vector input)
where Vector : IVector<T>
{
Vector output = Vector();
[ForceUnroll]
for (int i = 0; i < Vector.Size; i++)
{
let x = input[i];
output[i] = T(1) / (T(1) + exp(-x));
}
return output;
}
}
/**
Tanh activation.
*/
public struct TanhActivation<T> : IActivation<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
public __init() {}
[NoDiffThis]
[Differentiable]
public Vector eval<Vector>(Vector input)
where Vector : IVector<T>
{
Vector output = Vector();
[ForceUnroll]
for (int i = 0; i < Vector.Size; i++)
output[i] = tanh(input[i]);
return output;
}
}
/**
Exp activation: exp(x)
*/
public struct ExpActivation<T> : IActivation<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
public __init() {}
[NoDiffThis]
[Differentiable]
public Vector eval<Vector>(Vector input)
where Vector : IVector<T>
{
Vector output = Vector();
[ForceUnroll]
for (int i = 0; i < Vector.Size; i++)
output[i] = exp(input[i]);
return output;
}
}
/**
Sine activation: sin(x)
*/
public struct SineActivation<T> : IActivation<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
public __init() {}
[NoDiffThis]
[Differentiable]
public Vector eval<Vector>(Vector input)
where Vector : IVector<T>
{
Vector output = Vector();
[ForceUnroll]
for (int i = 0; i < Vector.Size; i++)
output[i] = sin(input[i]);
return output;
}
}
/**
SiLU (Sigmoid Linear Unit) activation, also known as Swish: x * sigmoid(x)
*/
public struct SiLU<T> : IActivation<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
public __init() {}
[NoDiffThis]
[Differentiable]
public Vector eval<Vector>(Vector input)
where Vector : IVector<T>
{
Vector output = Vector();
[ForceUnroll]
for (int i = 0; i < Vector.Size; i++)
{
let x = input[i];
output[i] = x / (T(1) + exp(-x)); // x * sigmoid(x)
}
return output;
}
}
/**
QuickGELU activation: x * sigmoid(1.702 * x)
A fast approximation of GELU (Gaussian Error Linear Unit).
*/
public struct QuickGELU<T> : IActivation<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
public __init() {}
[NoDiffThis]
[Differentiable]
public Vector eval<Vector>(Vector input)
where Vector : IVector<T>
{
Vector output = Vector();
[ForceUnroll]
for (int i = 0; i < Vector.Size; i++)
{
let x = input[i];
output[i] = x / (T(1) + exp(T(-1.702) * x)); // x * sigmoid(1.702 * x)
}
return output;
}
}
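The activation formulas documented above are simple closed forms (e.g. SiLU = x · sigmoid(x), QuickGELU = x · sigmoid(1.702 · x)). A host-side Python sketch of the same element-wise math, useful as a hypothetical reference when validating shader output (not part of the module itself):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(x, 0.0)

def leaky_relu(x, alpha=0.01):
    return alpha * x if x < 0.0 else x

def silu(x):                      # a.k.a. Swish: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def quick_gelu(x):                # fast GELU approximation
    return x / (1.0 + math.exp(-1.702 * x))

def apply(f, v):                  # element-wise, mirroring eval() over a vector
    return [f(x) for x in v]
```

SiLU and QuickGELU are written as `x / (1 + exp(-x))`, matching the fused form used in the shader code rather than a separate sigmoid multiply.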


@@ -0,0 +1,284 @@
implementing neural;
/**
Bindless address type with pointer-like semantics.
Wraps a buffer handle and base index to provide array-like access.
*/
public struct BindlessAddress<T> : IPointerLikeAddress<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
public typealias Differential = BindlessAddress<T.Differential>;
internal RWStructuredBuffer<T>.Handle handle;
internal uint baseIndex;
public __init(RWStructuredBuffer<T>.Handle handle)
{
this.handle = handle;
this.baseIndex = 0;
}
public __subscript(uint index)->T
{
[nonmutating]
get { return handle[baseIndex + index]; }
[mutating]
set { handle[baseIndex + index] = newValue; }
}
[ForceInline]
[require(hlsl, sm_6_6)]
internal void atomicAddForHLSL(uint index, T value)
{
T compareValue;
bool success = false;
do
{
compareValue = handle[baseIndex + index];
T newValue = compareValue + value;
success = __atomic_compare_exchange(handle[baseIndex + index], compareValue, newValue) == compareValue;
} while (!success);
}
[ForceInline]
[require(cuda_glsl_hlsl_metal_spirv, sm_6_6)]
public void atomicAdd(uint index, T value)
{
__target_switch
{
case hlsl:
atomicAddForHLSL(index, value);
default:
__atomic_reduce_add(handle[baseIndex + index], value);
}
}
[ForceInline]
public void atomicAdd(uint index, vector<T, 2> value)
{
__target_switch
{
case cuda:
// On CUDA, use packed vector atomic for types that support it (half2, bfloat16x2).
let scalarPtr = &handle[baseIndex + index];
let vecPtr = reinterpret<vector<T, 2>*>(scalarPtr);
__atomic_reduce_add(vecPtr[0], value);
default:
// On other targets, fall back to two scalar atomic adds.
atomicAdd(index, value[0]);
atomicAdd(index + 1, value[1]);
}
}
public This getOffset(int elements)
{
uint newBaseIndex = baseIndex + elements;
This address = This(handle);
address.baseIndex = newBaseIndex;
return address;
}
[ForceInline]
internal uint4 readUint4<DstType, bool IsAligned, uint ActualBoundary>(int offsetIndex)
where DstType : __BuiltinFloatingPointType
where DstType.Differential == DstType
{
uint4 value;
accessUint4<AccessOp.READ, DstType, T, RWStructuredBuffer<T>.Handle, IsAligned, ActualBoundary>(
handle, int(baseIndex), int(baseIndex) + offsetIndex, value);
return value;
}
[ForceInline]
internal void writeUint4Atomic<SrcType, bool IsAligned, uint ActualBoundary>(int offsetIndex, uint4 value)
where SrcType : __BuiltinFloatingPointType
where SrcType.Differential == SrcType
{
accessUint4<AccessOp.ATOMIC_ADD, SrcType, T, RWStructuredBuffer<T>.Handle, IsAligned, ActualBoundary>(
handle, int(baseIndex), int(baseIndex) + offsetIndex, value);
}
}
public struct PointerAddress<T> : IPointerLikeAddress<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
public typealias Differential = PointerAddress<T.Differential>;
T* ptr;
public __init(T* ptr)
{
this.ptr = ptr;
}
public __subscript(uint index)->T
{
[nonmutating]
get { return ptr[index]; }
[mutating]
set { ptr[index] = newValue; }
}
public This getOffset(int elements)
{
return This(ptr + elements);
}
[ForceInline]
[require(hlsl, sm_6_6)]
internal void atomicAddForHLSL(uint index, T value)
{
T compareValue;
bool success = false;
do
{
compareValue = ptr[index];
T newValue = compareValue + value;
success = __atomic_compare_exchange(ptr[index], compareValue, newValue) == compareValue;
} while (!success);
}
[ForceInline]
[require(cuda_glsl_hlsl_metal_spirv, sm_6_6)]
public void atomicAdd(uint index, T value)
{
__target_switch
{
case hlsl:
atomicAddForHLSL(index, value);
default:
__atomic_reduce_add(ptr[index], value);
}
}
[ForceInline]
public void atomicAdd(uint index, vector<T, 2> value)
{
__target_switch
{
case cuda:
// On CUDA, use packed vector atomic for types that support it (half2, bfloat16x2).
let vecPtr = reinterpret<vector<T, 2>*>(ptr + index);
__atomic_reduce_add(vecPtr[0], value);
default:
// On other targets, fall back to two scalar atomic adds.
atomicAdd(index, value[0]);
atomicAdd(index + 1, value[1]);
}
}
[ForceInline]
internal uint4 readUint4<DstType, bool IsAligned, uint ActualBoundary>(int offsetIndex)
where DstType : __BuiltinFloatingPointType
where DstType.Differential == DstType
{
uint4 value;
accessUint4<AccessOp.READ, DstType, T, T*, IsAligned, ActualBoundary>(
ptr, 0, offsetIndex, value);
return value;
}
[ForceInline]
internal void writeUint4Atomic<SrcType, bool IsAligned, uint ActualBoundary>(int offsetIndex, uint4 value)
where SrcType : __BuiltinFloatingPointType
where SrcType.Differential == SrcType
{
accessUint4<AccessOp.ATOMIC_ADD, SrcType, T, T*, IsAligned, ActualBoundary>(
ptr, 0, offsetIndex, value);
}
}
// We currently don't support UserPointer as an `IDifferentiablePtrType`; the issue is tracked at
// https://github.com/shader-slang/slang/issues/8834.
// For now we define an internal extension; once the issue is resolved, we can make it public.
internal extension<T> Ptr<T, Access.ReadWrite, AddressSpace.Device> : IPointerLikeAddress<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
internal typealias Differential = Ptr<T.Differential, Access.ReadWrite, AddressSpace.Device>;
internal __init(Ptr<T, Access.ReadWrite, AddressSpace.Device> ptr)
{
this = ptr;
}
internal __subscript(uint index)->T
{
[nonmutating]
get { return this[index]; }
[mutating]
set { this[index] = newValue; }
}
internal This getOffset(int elements)
{
return This(this + elements);
}
[require(hlsl, sm_6_6)]
internal void atomicAddForHLSL(uint index, T value)
{
T compareValue;
bool success = false;
do
{
compareValue = this[index];
T newValue = compareValue + value;
success = __atomic_compare_exchange(this[index], compareValue, newValue) == compareValue;
} while (!success);
}
[ForceInline]
[require(cuda_glsl_hlsl_metal_spirv, sm_6_6)]
internal void atomicAdd(uint index, T value)
{
__target_switch
{
case hlsl:
atomicAddForHLSL(index, value);
default:
__atomic_reduce_add(this[index], value);
}
}
[ForceInline]
internal void atomicAdd(uint index, vector<T, 2> value)
{
__target_switch
{
case cuda:
let vecPtr = reinterpret<vector<T, 2>*>(this + index);
__atomic_reduce_add(vecPtr[0], value);
default:
atomicAdd(index, value[0]);
atomicAdd(index + 1, value[1]);
}
}
[ForceInline]
internal uint4 readUint4<DstType, bool IsAligned, uint ActualBoundary>(int offsetIndex)
where DstType : __BuiltinFloatingPointType
where DstType.Differential == DstType
{
uint4 value;
accessUint4<AccessOp.READ, DstType, T, This, IsAligned, ActualBoundary>(
this, 0, offsetIndex, value);
return value;
}
[ForceInline]
internal void writeUint4Atomic<SrcType, bool IsAligned, uint ActualBoundary>(int offsetIndex, uint4 value)
where SrcType : __BuiltinFloatingPointType
where SrcType.Differential == SrcType
{
accessUint4<AccessOp.ATOMIC_ADD, SrcType, T, This, IsAligned, ActualBoundary>(
this, 0, offsetIndex, value);
}
}
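On HLSL, `atomicAddForHLSL` above emulates a floating-point atomic add with a compare-exchange retry loop. A single-threaded Python sketch of that loop's logic (hypothetical illustration; real GPU execution is concurrent):

```python
def compare_exchange(buf, i, expected, new_value):
    """Mimic __atomic_compare_exchange: return the value seen,
    storing new_value only if the slot still holds `expected`."""
    seen = buf[i]
    if seen == expected:
        buf[i] = new_value
    return seen

def atomic_add(buf, i, value):
    # Retry until no other writer changed the slot between the read and the CAS.
    while True:
        old = buf[i]
        if compare_exchange(buf, i, old, old + value) == old:
            return

buf = [0.0, 1.5]
atomic_add(buf, 1, 2.5)   # buf becomes [0.0, 4.0]
```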


@@ -0,0 +1,7 @@
// Common definitions for the neural module
#ifdef UNIT_TEST
#define VISIBILITY_LEVEL public
#else
#define VISIBILITY_LEVEL internal
#endif


@@ -0,0 +1,152 @@
implementing neural;
#include "common-def.slang"
VISIBILITY_LEVEL typealias uvec<int Dim> = Array<uint32_t, Dim>;
VISIBILITY_LEVEL enum HashType : uint32_t
{
Prime,
CoherentPrime,
ReversedPrime,
Rng,
BaseConvert,
}
VISIBILITY_LEVEL struct Pcg32
{
static const uint64_t DefaultState = 0x853c49e6748fea9bULL;
static const uint64_t DefaultStream = 0xda3e39cb94b95bdbULL;
static const uint64_t Mult = 0x5851f42d4c957f2dULL;
uint64_t state;
uint64_t inc;
VISIBILITY_LEVEL __init(uint64_t initstate, uint64_t initseq = 1)
{
state = 0;
inc = (initseq << 1) | 1;
nextUint();
state += initstate;
nextUint();
}
[mutating]
VISIBILITY_LEVEL uint32_t nextUint()
{
uint64_t oldstate = state;
state = oldstate * Mult + inc;
uint32_t xorshifted = uint32_t(((oldstate >> 18) ^ oldstate) >> 27);
uint32_t rot = uint32_t(oldstate >> 59);
return (xorshifted >> rot) | (xorshifted << ((~rot + 1) & 31));
}
[mutating]
VISIBILITY_LEVEL void advance(int64_t delta)
{
uint64_t curMult = Mult;
uint64_t curPlus = inc;
uint64_t accMult = 1;
uint64_t accPlus = 0;
uint64_t d = uint64_t(delta);
while (d > 0)
{
if ((d & 1) != 0)
{
accMult *= curMult;
accPlus = accPlus * curMult + curPlus;
}
curPlus = (curMult + 1) * curPlus;
curMult *= curMult;
d /= 2;
}
state = accMult * state + accPlus;
}
}
// LCG hash: XORs position values multiplied by prime factors
[ForceInline]
VISIBILITY_LEVEL uint32_t lcgHash<int Dimensions, int PrimeCount>(uvec<Dimensions> posGrid, uint32_t primes[PrimeCount])
{
uint32_t result = 0;
[ForceUnroll]
for (int i = 0; i < Dimensions; ++i)
{
result ^= posGrid[i] * primes[i];
}
return result;
}
[ForceInline]
VISIBILITY_LEVEL uint32_t primeHash<int Dimensions>(uvec<Dimensions> posGrid)
{
uint32_t factors[7] = { 1958374283u, 2654435761u, 805459861u, 3674653429u, 2097192037u, 1434869437u, 2165219737u };
return lcgHash<Dimensions, 7>(posGrid, factors);
}
[ForceInline]
VISIBILITY_LEVEL uint32_t coherentPrimeHash<int Dimensions>(uvec<Dimensions> posGrid)
{
uint32_t factors[7] = { 1u, 2654435761u, 805459861u, 3674653429u, 2097192037u, 1434869437u, 2165219737u };
return lcgHash<Dimensions, 7>(posGrid, factors);
}
[ForceInline]
VISIBILITY_LEVEL uint32_t reversedPrimeHash<int Dimensions>(uvec<Dimensions> posGrid)
{
uint32_t factors[7] = { 2165219737u, 1434869437u, 2097192037u, 3674653429u, 805459861u, 2654435761u, 1958374283u };
return lcgHash<Dimensions, 7>(posGrid, factors);
}
// Base conversion hash (used in permuto-encoding)
[ForceInline]
VISIBILITY_LEVEL uint32_t baseConvertHash<int Dimensions>(uvec<Dimensions> posGrid)
{
uint32_t k = 0;
[ForceUnroll]
for (int dim = 0; dim < Dimensions; ++dim)
{
k += posGrid[dim];
k *= 2531011u;
}
return k;
}
// RNG hash using PCG32
[ForceInline]
VISIBILITY_LEVEL uint32_t rngHash<int Dimensions>(uvec<Dimensions> posGrid, uint32_t seed = 1337)
{
static const int BitsPerDim = 64 / Dimensions;
uint64_t step = 0;
[ForceUnroll]
for (int i = 0; i < Dimensions; ++i)
{
step ^= uint64_t(posGrid[i]) << (i * BitsPerDim);
}
Pcg32 rng = Pcg32(seed);
rng.advance(int64_t(step));
return rng.nextUint();
}
[ForceInline]
VISIBILITY_LEVEL uint32_t gridHash<HashType Type, int Dimensions>(uvec<Dimensions> positionGrid)
{
switch (Type)
{
case HashType.Prime:
return primeHash<Dimensions>(positionGrid);
case HashType.CoherentPrime:
return coherentPrimeHash<Dimensions>(positionGrid);
case HashType.ReversedPrime:
return reversedPrimeHash<Dimensions>(positionGrid);
case HashType.Rng:
return rngHash<Dimensions>(positionGrid);
case HashType.BaseConvert:
return baseConvertHash<Dimensions>(positionGrid);
default:
return 0;
}
}
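The `Pcg32` struct above is the standard PCG32 generator, and `advance` is its O(log n) skip-ahead. A direct Python port (64-bit wrap-around emulated with masks; a hypothetical reference, not shipped with the module) confirms that advancing by n matches n sequential draws:

```python
MASK64 = (1 << 64) - 1
MULT = 0x5851F42D4C957F2D

class Pcg32:
    def __init__(self, initstate, initseq=1):
        self.state = 0
        self.inc = ((initseq << 1) | 1) & MASK64
        self.next_uint()
        self.state = (self.state + initstate) & MASK64
        self.next_uint()

    def next_uint(self):
        old = self.state
        self.state = (old * MULT + self.inc) & MASK64
        xorshifted = (((old >> 18) ^ old) >> 27) & 0xFFFFFFFF
        rot = old >> 59
        return ((xorshifted >> rot) | (xorshifted << (-rot & 31))) & 0xFFFFFFFF

    def advance(self, delta):
        # Fast skip-ahead: square-and-multiply on the LCG transition.
        cur_mult, cur_plus = MULT, self.inc
        acc_mult, acc_plus = 1, 0
        d = delta & MASK64
        while d > 0:
            if d & 1:
                acc_mult = (acc_mult * cur_mult) & MASK64
                acc_plus = (acc_plus * cur_mult + cur_plus) & MASK64
            cur_plus = ((cur_mult + 1) * cur_plus) & MASK64
            cur_mult = (cur_mult * cur_mult) & MASK64
            d >>= 1
        self.state = (acc_mult * self.state + acc_plus) & MASK64

a, b = Pcg32(42), Pcg32(42)
for _ in range(5):
    a.next_uint()
b.advance(5)              # skip ahead five draws without generating them
```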


@@ -0,0 +1,28 @@
implementing neural;
/**
Activation function interface for neural network operations.
Defines a differentiable mapping from an input vector to an output vector of the same shape.
Activations that require parameters (e.g., LeakyReLU's alpha) store them as member fields.
Parameterless activations (e.g., ReLU, Sigmoid) can be used with the default constructor.
@param T Scalar element type (float/half/double).
@category neural
*/
public interface IActivation<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
/// Default constructor. Required so that activations are explicitly default-constructed
/// rather than zero-initialized when used with `Activation()` in layer constructors.
__init();
/// Apply activation function element-wise.
/// @param input Input vector.
/// @return Output vector with activation applied.
[NoDiffThis]
[Differentiable]
public Vector eval<Vector>(Vector input)
where Vector : IVector<T>;
}


@@ -0,0 +1,36 @@
implementing neural;
#include "common-def.slang"
public interface IStaticEncoder<T,
InArray,
OutArray
>
where T : __BuiltinFloatingPointType
where T.Differential == T
where InArray : IArrayAccessor<T>
where OutArray : IArrayAccessor<T>
{
public associatedtype HyperParameters;
[Differentiable]
public OutArray encode(in InArray input);
}
public interface ITrainableEncoder<T,
InArray,
OutArray
>
where T : __BuiltinFloatingPointType
where T.Differential == T
where InArray : IArrayAccessor<T>
where InArray : IDifferentiable
where OutArray : IArrayAccessor<T>
where OutArray : IDifferentiable
{
public associatedtype HyperParameters;
[Differentiable]
public OutArray encode<Address>(in InArray input, Address parametersAddress)
where Address : IPointerLikeAddress<T>
where Address.Differential : IPointerLikeAddress<T.Differential>;
}


@@ -0,0 +1,28 @@
implementing neural;
/**
Layer interface (compile-time, GPU-friendly).
This interface is intended to be used as a *generic constraint* (no existential storage),
so it does not imply dynamic dispatch.
@category neural
*/
public interface ILayer<T, InputVector, OutputVector, Layout, Activation>
where T : __BuiltinFloatingPointType
where T.Differential == T
where Layout : IStorageLayout
where InputVector : IVector<T>
where OutputVector : IVector<T>
where Activation : IActivation<T>
{
/// Forward evaluation: y = f(x).
/// Addresses are passed as parameters to enable autodiff gradient routing.
/// @param input Input vector.
/// @param weightAddress Weight address (pointer-like).
/// @param biasAddress Bias address (pointer-like). Pass `none` if no bias.
/// @return Output vector.
[Differentiable]
public OutputVector eval<A>(InputVector input, A weightAddress, Optional<A> biasAddress = none)
where A : IPointerLikeAddress<T>
where A.Differential : IPointerLikeAddress<T.Differential>;
}


@@ -0,0 +1,194 @@
implementing neural;
/**
Concrete implementation of IVector storing elements inline (on stack/registers).
InlineVector stores all elements in a fixed-size array, making it suitable for
small vectors that can fit in registers or stack memory. Supports automatic differentiation
for gradient computation in neural networks.
@param T The element type
@param N The vector size (compile-time constant).
@remarks Type constraints:
- `T` must conform to `__BuiltinFloatingPointType` (float, double, half, etc.)
- `T.Differential` must conform to `__BuiltinFloatingPointType` for automatic differentiation
@category neural
*/
public struct InlineVector<T, int N> : IVector<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
/// The differential type for automatic differentiation.
public typealias Differential = InlineVector<T.Differential, N>;
/// The compile-time size of the vector.
public static const int Size = N;
public int getCount() {return N;}
/**
Internal storage for vector elements.
@remarks Marked as derivative member to enable automatic differentiation.
*/
[DerivativeMember(Differential.data)]
internal T[N] data;
/// Default constructor - initializes all elements to zero.
public __init() { data = {}; }
/**
Scalar broadcast constructor - fills all elements with the same value.
@param[in] value The value to broadcast to all elements.
*/
public __init(T value) {
[ForceUnroll]
for (int i = 0; i < N; i++)
data[i] = value;
}
/**
Array constructor - initializes from an array.
@param[in] data Array of N elements to initialize the vector.
*/
public __init(T[Size] data) { this.data = data; }
/**
Copy constructor.
@param[in] other The vector to copy from.
*/
public __init(This other) { this.data = other.data; }
public __init<InputArray : IArray<T>>(InputArray data)
{
static_assert(data.getCount() >= N, "The input array must have at least as many elements as the vector");
[ForceUnroll]
for (int i = 0; i < N; i++)
this.data[i] = data[i];
}
/**
Element access operator.
@param[in] index The element index (0-based).
@return Reference to the element at the given index.
*/
public __subscript(int index) -> T
{
[ForceInline]
[Differentiable]
get() { return this.data[index]; }
[ForceInline]
[Differentiable]
set() { this.data[index] = newValue; }
}
// Linear transformation without bias
[BackwardDerivative(linearTransformBwd)]
public OutputVector linearTransform<Address, Layout, OutputVector>(
Address weightAddress)
where Address : IPointerLikeAddress<T>
where Address.Differential : IPointerLikeAddress<T.Differential>
where Layout : IStorageLayout
where OutputVector : IVector<T>
{
var output = OutputVector();
// output = W * input
[MaxIters(OutputVector.Size)]
for (int row = 0; row < OutputVector.Size; row++)
{
// get the address of each row of the weight matrix
let rowAddr = weightAddress.getOffset(row * N);
[ForceUnroll]
for (int col = 0; col < N; col++)
{
output[row] += data[col] * rowAddr[col];
}
}
return output;
}
// Linear transformation with bias (Bindless storage)
[BackwardDerivative(linearTransformBwd)]
public OutputVector linearTransform<Address, Layout, OutputVector>(
Address weightAddress,
Address biasAddress)
where Address : IPointerLikeAddress<T>
where Address.Differential : IPointerLikeAddress<T.Differential>
where Layout : IStorageLayout
where OutputVector : IVector<T>
{
// Reuse the unbiased matmul method
OutputVector output = this.linearTransform<Address, Layout, OutputVector>(weightAddress);
[ForceUnroll]
for (int i = 0; i < OutputVector.Size; i++)
output[i] = output[i] + biasAddress[i];
return output;
}
// Backward of linear transformation without bias (Bindless storage)
static public void linearTransformBwd<Address, Layout, OutputVector>(
inout DifferentialPair<This> dthis,
DifferentialPtrPair<Address> dparameters,
OutputVector.Differential doutput)
where Address : IPointerLikeAddress<T>
where Address.Differential : IPointerLikeAddress<T.Differential>
where Layout : IStorageLayout
where OutputVector : IVector<T>
where OutputVector.Differential : IVector<T.Differential>
{
// dInput = dW^T * dOutput
This.Differential d = {};
[MaxIters(OutputVector.Size)]
for (int j = 0; j < OutputVector.Size; j++)
{
let dy = doutput[j];
[ForceUnroll]
for (int i = 0; i < N; i++)
{
T.Differential prod = T.Differential.dmul(dparameters.p[i + j * N], dy);
d[i] = T.Differential.dadd(d[i], prod);
}
}
// Derivative of the weights is the outer product of the input and the output differential
// dW = dOutput * Input^T
[MaxIters(OutputVector.Size)]
for (int row = 0; row < OutputVector.Size; row++)
{
let rowAddr = dparameters.d.getOffset(row * N);
T.Differential dy = doutput[row];
[ForceUnroll]
for (int col = 0; col < N; col++)
{
let x = dthis.p[col];
T.Differential prod = T.Differential.dmul(x, dy);
rowAddr.atomicAdd(col, prod);
}
}
dthis = DifferentialPair<This>(dthis.p, d);
}
// Backward of linear transformation with bias (Bindless storage)
static public void linearTransformBwd<Address, Layout, OutputVector>(
inout DifferentialPair<This> dthis,
DifferentialPtrPair<Address> dWeightAddress,
DifferentialPtrPair<Address> dBiasAddress,
OutputVector.Differential doutput)
where Address : IPointerLikeAddress<T>
where Address.Differential : IPointerLikeAddress<T.Differential>
where Layout : IStorageLayout
where OutputVector : IVector<T>
{
// Reuse the unbiased backward method
linearTransformBwd<Address, Layout, OutputVector>(dthis, dWeightAddress, doutput);
let biasOffset = dBiasAddress.d.getOffset(0);
// dBias = dOutput
[ForceUnroll]
for (int i = 0; i < OutputVector.Size; i++)
{
biasOffset.atomicAdd(i, doutput[i]);
}
}
}
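`linearTransform` and `linearTransformBwd` above implement y = W·x, dInput = Wᵀ·dy, and dW = dy·xᵀ (accumulated with atomic adds). The same math over a flat row-major weight block, as a plain-Python sketch (hypothetical helper names, useful for checking the index arithmetic):

```python
def linear_transform(W, x, out_size):
    # W is flat and row-major: W[row * len(x) + col], as in the Slang code.
    n = len(x)
    return [sum(W[r * n + c] * x[c] for c in range(n)) for r in range(out_size)]

def linear_transform_bwd(W, x, dy):
    n = len(x)
    # dInput = W^T * dy
    d_input = [sum(W[r * n + c] * dy[r] for r in range(len(dy))) for c in range(n)]
    # dW = outer(dy, x), flattened row-major like W
    d_W = [dy[r] * x[c] for r in range(len(dy)) for c in range(n)]
    return d_input, d_W

W = [1.0, 2.0,
     3.0, 4.0]                       # 2x2, row-major
x = [1.0, 1.0]
y = linear_transform(W, x, 2)        # [3.0, 7.0]
dx, dW = linear_transform_bwd(W, x, [1.0, 0.0])
```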


@@ -0,0 +1,77 @@
implementing neural;
public enum LayoutType : uint32_t
{
Linear = 0,
}
internal interface IStorageLayout
{
internal static const LayoutType Layout;
}
public struct LinearLayout : IStorageLayout
{
internal static const LayoutType Layout = LayoutType.Linear;
}
/**
Interface for pointer-like addressing with direct subscript access.
Provides array-like access patterns for storage backends that support pointer arithmetic.
@param T The element type.
@remarks Type constraints:
- `T` must conform to `__BuiltinFloatingPointType` (float, double, half, etc.)
- `T.Differential` must conform to `__BuiltinFloatingPointType` for automatic differentiation
@see `BindlessBufferStorage.BindlessAddress`
@category neural
*/
public interface IPointerLikeAddress<T> : IDifferentiablePtrType
where T : __BuiltinFloatingPointType
where T.Differential == T
{
/**
Array-style element access.
@param[in] index The element index.
@return Reference to the element at the given index.
*/
public __subscript(uint index) -> T { get; set; }
/**
Computes an offset pointer.
@param[in] elements Number of elements to offset by.
@return The offset pointer.
*/
public This getOffset(int elements);
/**
Atomically adds a value at an index.
@param[in] index The element index.
@param[in] value The value to add.
*/
[require(cuda_glsl_hlsl_metal_spirv, sm_6_6)]
public void atomicAdd(uint index, T value);
/**
Atomically adds a vector of 2 values using packed atomic operations where available.
@param[in] index The element index (must be aligned to 2 elements).
@param[in] value The vector of 2 values to add.
*/
public void atomicAdd(uint index, vector<T, 2> value);
/**
Reads sequential elements starting from the given offset address, packed into a uint4.
*/
[ForceInline]
internal uint4 readUint4<DstType, bool IsAligned, uint ActualBoundary>(int offsetIndex)
where DstType : __BuiltinFloatingPointType
where DstType.Differential == DstType;
/**
Atomically writes unpacked uint4 values back to storage.
*/
[ForceInline]
internal void writeUint4Atomic<SrcType, bool IsAligned, uint ActualBoundary>(int offsetIndex, uint4 value)
where SrcType : __BuiltinFloatingPointType
where SrcType.Differential == SrcType;
}


@@ -0,0 +1,97 @@
implementing neural;
/**
Generic vector interface for neural network operations.
Provides a differentiable vector abstraction supporting automatic differentiation
and linear algebra operations for neural network computations.
@param T The element type (must be a floating-point type).
@param N The vector size (compile-time constant).
@remarks Type constraints:
- `T` must conform to `__BuiltinFloatingPointType` (float, double, half, etc.)
- `T.Differential` must conform to `__BuiltinFloatingPointType` for automatic differentiation
@see `InlineVector`
@category neural
*/
public interface IVector<T> : IDifferentiable, IArrayAccessor<T>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
/// The compile-time size of the vector.
public static const int Size;
/// The differential type for automatic differentiation.
/// @remarks Ensures the differential is also a vector with the same structure.
public associatedtype Differential : IVector<T.Differential>;
/// Default constructor - initializes vector to zero.
public __init();
/**
Scalar broadcast constructor - fills all elements with the same value.
@param[in] value The value to broadcast to all elements.
*/
public __init(T value);
/**
Array constructor - initializes from an array.
@param[in] data Array of N elements to initialize the vector.
*/
public __init(T[This.Size] data);
public __init<InputArray : IArray<T>>(InputArray data);
/**
Copy constructor.
@param[in] other The vector to copy from.
*/
public __init(This other);
/**
Evaluates a linear transformation: output = W * this.
Uses pointer-like addressing for direct access to weight parameters.
@param Address The address type with pointer-like access.
@param Layout The storage layout type.
@param OutputVector The output vector type.
@param[in] weightAddress The address of the weight matrix.
The weight matrix is stored in a contiguous block of memory with `OutputVector.Size` rows and `N` columns.
@return The result of the linear transformation, `result = W * this`, whose size is `OutputVector.Size`.
@remarks Type constraints:
- `Address` must conform to `IPointerLikeAddress<T>`
- `Address.Differential` must conform to `IPointerLikeAddress<T.Differential>`
- `OutputVector` must conform to `IVector<T>`
*/
[Differentiable]
public OutputVector linearTransform<Address, Layout, OutputVector>(Address weightAddress)
where Address : IPointerLikeAddress<T>
where Address.Differential : IPointerLikeAddress<T.Differential>
where Layout : IStorageLayout
where OutputVector : IVector<T>;
/**
Evaluates a linear transformation: output = W * this + bias.
Performs matrix-vector multiplication with bias addition.
Uses pointer-like addressing for direct access to weight and bias parameters.
@param Address The address type with pointer-like access.
@param Layout The storage layout type.
@param OutputVector The output vector type.
@param[in] weightAddress The address of the weight matrix.
The weight matrix is stored in a contiguous block of memory with `OutputVector.Size` rows and `N` columns.
@param[in] biasAddress The address of the bias vector.
The bias vector is stored in a contiguous block of memory, the size
of the bias vector is `OutputVector.Size`.
@return The result of the linear transformation, `result = W * this + bias`, whose size is `OutputVector.Size`.
@remarks Type constraints:
- `Address` must conform to `IPointerLikeAddress<T>`
- `Address.Differential` must conform to `IPointerLikeAddress<T.Differential>`
- `OutputVector` must conform to `IVector<T>`
*/
[Differentiable]
public OutputVector linearTransform<Address, Layout, OutputVector>(
Address weightAddress, Address biasAddress)
where Address : IPointerLikeAddress<T>
where Address.Differential : IPointerLikeAddress<T.Differential>
where Layout : IStorageLayout
where OutputVector : IVector<T>;
}


@@ -0,0 +1,90 @@
implementing neural;
/**
A fully-connected (feed-forward) neural network layer that computes `y = Activation(W*x + b)`.
`FFLayer` represents a single linear transformation followed by an activation function,
suitable for building multi-layer perceptrons (MLPs) and similar architectures.
## Usage
1. **Construction:** Create a layer:
```
let layer = FFLayer<float, Vec4, Vec2, LinearLayout, ReLU<float>>();
```
2. **Forward pass:** Call `eval()` with address and input:
```
let output = layer.eval<Address>(input, weightAddr, biasAddr);
```
3. **Training (backward pass):** Use autodiff with `DifferentialPtrPair`:
```
var addrPair = DifferentialPtrPair<Address>(addr, gradAddr);
bwd_diff(computeOutput)(addrPair, inputPair, layer, dOutput);
```
## Parameter Layout
Parameters are packed as a contiguous block in storage:
- **weights:** `Out * In` scalars, row-major by output row: `W[row * In + col]`
- **bias (optional):** `Out` scalars immediately following weights
*/
public struct FFLayer<
T,
InputVector,
OutputVector,
Layout,
Activation,
let HasBias : bool = true
>
: ILayer<T, InputVector, OutputVector, Layout, Activation>
where T : __BuiltinFloatingPointType
where T.Differential == T
where Layout : IStorageLayout
where InputVector : IVector<T>
where OutputVector : IVector<T>
where Activation : IActivation<T>
{
public static const int ParameterCount =
OutputVector.Size * InputVector.Size + (HasBias ? OutputVector.Size : 0);
/// Activation function instance (stores any activation-specific parameters).
internal Activation activation;
/// Constructor.
/// @param act Activation function instance (defaults to default-constructed Activation).
public __init(Activation act = Activation())
{
activation = act;
}
public static int nextOffset(int baseOffset)
{
return baseOffset + ParameterCount;
}
/// Forward evaluation: y = Activation(W*x + b).
/// @param input Input vector.
/// @param weightAddr Weight address (pointer-like).
/// @param biasAddr Bias address (pointer-like). Pass `none` if no bias.
/// @return Output vector after linear transform and activation.
[Differentiable]
[ForceInline]
public OutputVector eval<A>(InputVector input, A weightAddr, Optional<A> biasAddr = none)
where A : IPointerLikeAddress<T>
where A.Differential : IPointerLikeAddress<T.Differential>
{
OutputVector y;
if(HasBias)
{
y = input.linearTransform<A, Layout, OutputVector>(weightAddr, biasAddr.value);
}
else
{
y = input.linearTransform<A, Layout, OutputVector>(weightAddr);
}
return activation.eval<OutputVector>(y);
}
}
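The parameter layout documented above packs `Out * In` row-major weights followed by `Out` bias scalars. A Python sketch of `eval` over such a flat block, using ReLU as the activation (hypothetical reference code, not part of the module):

```python
def ff_layer_eval(params, x, out_size, has_bias=True):
    n = len(x)
    weights = params[:out_size * n]          # Out*In scalars, row-major
    bias = params[out_size * n:]             # Out scalars, immediately after
    y = []
    for row in range(out_size):
        acc = sum(weights[row * n + col] * x[col] for col in range(n))
        if has_bias:
            acc += bias[row]
        y.append(max(acc, 0.0))              # ReLU activation
    return y

# 2-in / 2-out layer: ParameterCount = 2*2 + 2 = 6
params = [1.0, -1.0,
          0.5,  0.5,
          0.0, -10.0]
out = ff_layer_eval(params, [2.0, 1.0], 2)   # [1.0, 0.0]
```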


@@ -0,0 +1,38 @@
/**
Neural network primitives for Slang.
This module provides differentiable primitive data structures and operations for implementing
inline MLP (Multilayer Perceptron) in Slang shaders. It includes vector types, storage abstractions,
activation functions, optimizers, and automatic differentiation support for training small inline
neural networks on the GPU.
@remarks EXPERIMENTAL: This module is under active design and may change significantly
or be removed in future versions. DO NOT USE IN PRODUCTION.
@category neural
Features:
- Differentiable vector types (IVector, InlineVector)
- Pointer-like address abstraction (IPointerLikeAddress)
- Automatic differentiation support
- Linear transformations with optional bias
- Atomic operations for gradient accumulation
- Permutohedral lattice encoding (PermutoEncoder)
*/
[ExperimentalModule]
module neural;
__include "ivector";
__include "inline-vector";
__include "istorages";
__include "bindless-storage";
__include "accelerate-vector-coopmat";
__include "vectorized-reader";
__include "shared-memory-pool";
__include "hash-function";
__include "permuto-encoder";
__include "iencoder";
// Frontend APIs built on the primitives above
__include "iactivation";
__include "activations";
__include "ilayer";
__include "layers";


@@ -0,0 +1,372 @@
implementing neural;
#include "common-def.slang"
// =============================================================================
// Permuto Encoder
// =============================================================================
public struct PermutoEncoder<T,
int Dimensions,
int FeatureDimensionPerEntry,
uint Log2HashmapSize,
uint MaxLevels,
InArray,
OutArray
> : ITrainableEncoder<T, InArray, OutArray>
where T : __BuiltinFloatingPointType
where T.Differential == T
where InArray : IArrayAccessor<T>, IDifferentiable
where OutArray : IArrayAccessor<T>, IDifferentiable
{
/// Precomputed per-level information.
public struct PerLevelInfo
{
public uint currentLevelIndex;
public uint offsetOfFeatureTable; // Offset into feature table for this level (in number of features)
public uint featureTableSize; // Hashmap size for this level
public float scale; // Scale for this level
public float scalesPerDim[Dimensions];
public float shiftsPerDim[Dimensions];
}
VISIBILITY_LEVEL struct Params
{
VISIBILITY_LEVEL float maxLevel; // It's maxLevelRatio * MaxLevels, where maxLevelRatio is a user-defined ratio
VISIBILITY_LEVEL PerLevelInfo levelInfo; // Contains scale, offset, hashmap size, and per-dim scales/shifts
}
public struct HyperParameters
{
public float maxLevel;
public PerLevelInfo[MaxLevels] levelInfo;
};
internal HyperParameters params;
public __init(HyperParameters params)
{
this.params = params;
}
// =========================================================================
// Permutohedral Lattice Helper Functions
// =========================================================================
/// Computes the permutohedral lattice index using base conversion hash.
/// @param key The lattice coordinate.
/// @param hashmapSize The size of the hashmap.
/// @return The index into the parameter array.
static uint permutoIndex(uvec<Dimensions> key, uint hashmapSize)
{
return baseConvertHash<Dimensions>(key) % hashmapSize;
}
/// Elevates a D-dimension vector to (D+1)-dimension homogeneous vector on hyperplane H_d.
/// The sum of the components of `elevated` is zero, ensuring it's within hyperplane H_d.
/// The magnitudes of the components of `elevated` are similar to each other.
/// @param pos Input position [Dimensions].
/// @param scalesPerDim Per-dimension scaling factors [Dimensions].
/// @param shiftsPerDim Per-dimension shifts [Dimensions].
/// @param elevated Output elevated coordinates [Dimensions+1].
[Differentiable]
static void permutoElevate<InArray>(
InArray pos,
no_diff in float scalesPerDim[Dimensions],
no_diff in float shiftsPerDim[Dimensions],
out float elevated[Dimensions + 1])
where InArray : IArrayAccessor<T>
where InArray : IDifferentiable
{
float sum = 0.0f;
[ForceUnroll]
for (int dim = Dimensions - 1; dim >= 0; --dim)
{
float cf = (__realCast<float>(pos[dim]) + shiftsPerDim[dim]) * scalesPerDim[dim];
elevated[dim + 1] = sum - float(dim + 1) * cf;
sum += cf;
}
elevated[0] = sum;
}
/// Finds the closest remainder-0 point and computes the rank ordering.
/// @param elevated The elevated coordinates [Dimensions+1].
/// @param rem0 Output: coordinates of remainder-0 point [Dimensions+1].
/// @param rank Output: rank ordering [Dimensions+1].
/// Note: Not differentiable since outputs are int (no gradient needed).
static void permutoFindRem0(
float elevated[Dimensions + 1],
out int rem0[Dimensions + 1],
out int rank[Dimensions + 1])
{
rank = {};
// Find the closest remainder-0 point through rounding
int sum = 0;
[ForceUnroll]
for (uint dim = 0; dim <= Dimensions; ++dim)
{
// Using xxx*(1.0f/N) is faster than xxx/N
float v = elevated[dim] * (1.0f / float(Dimensions + 1));
float up = ceil(v) * float(Dimensions + 1);
float down = floor(v) * float(Dimensions + 1);
if (up - elevated[dim] < elevated[dim] - down)
{
rem0[dim] = int(up);
}
else
{
rem0[dim] = int(down);
}
sum += rem0[dim];
}
sum /= int(Dimensions + 1);
// Find the simplex we are in and store it in rank
// (where rank describes what position coordinate i has in the sorted order)
[ForceUnroll]
for (uint dim = 0; dim < Dimensions; ++dim)
{
float di = elevated[dim] - float(rem0[dim]);
[MaxIters(Dimensions)]
for (uint otherDim = dim + 1; otherDim <= Dimensions; ++otherDim)
{
if (di < elevated[otherDim] - float(rem0[otherDim]))
{
rank[dim]++;
}
else
{
rank[otherDim]++;
}
}
}
// If the point doesn't lie on the plane (sum != 0) bring it back
[ForceUnroll]
for (uint dim = 0; dim <= Dimensions; ++dim)
{
rank[dim] += sum;
if (rank[dim] < 0)
{
rank[dim] += int(Dimensions + 1);
rem0[dim] += int(Dimensions + 1);
}
else if (rank[dim] > int(Dimensions))
{
rank[dim] -= int(Dimensions + 1);
rem0[dim] -= int(Dimensions + 1);
}
}
}
/// Computes the barycentric coordinates for permutohedral interpolation.
/// See p.10 in [Adams et al. 2010].
/// @param elevated The elevated coordinates [Dimensions+1].
/// @param rem0 The remainder-0 point coordinates [Dimensions+1].
/// @param rank The rank ordering [Dimensions+1].
/// @param barycentric Output: barycentric coordinates [Dimensions+2].
[Differentiable]
static void permutoBarycentric(
float elevated[Dimensions + 1],
int rem0[Dimensions + 1],
int rank[Dimensions + 1],
out float barycentric[Dimensions + 2])
{
// Compute the barycentric coordinates
barycentric = {};
[ForceUnroll]
for (uint dim = 0; dim <= Dimensions; ++dim)
{
float delta = (elevated[dim] - float(rem0[dim])) * (1.0f / float(Dimensions + 1));
int idxPlus = int(Dimensions) - rank[dim];
int idxMinus = int(Dimensions + 1) - rank[dim];
barycentric[idxPlus] += delta;
barycentric[idxMinus] -= delta;
}
// Wrap around
barycentric[0] += 1.0f + barycentric[Dimensions + 1];
}
// =========================================================================
// Feature Reading and Encoding
// =========================================================================
/// Reads feature values at a given lattice position from the feature table.
/// @param levelBaseOffset Base offset into the feature table for this level (in number of elements).
[BackwardDerivative(readFeatureValueBwd)]
static void readFeatureValue<Address>(
Address featureTableAddress,
uint levelBaseOffset,
uint hashmapSize,
uvec<Dimensions> key,
out T result[FeatureDimensionPerEntry])
where Address : IPointerLikeAddress<T>
where Address.Differential : IPointerLikeAddress<T.Differential>
{
uint index = levelBaseOffset + permutoIndex(key, hashmapSize) * FeatureDimensionPerEntry;
[ForceUnroll]
for (uint f = 0; f < FeatureDimensionPerEntry; ++f)
{
result[f] = featureTableAddress[index + f];
}
}
static void readFeatureValueBwd<Address>(
DifferentialPtrPair<Address> featureTableAddress,
uint levelBaseOffset,
uint hashmapSize,
uvec<Dimensions> key,
T.Differential[FeatureDimensionPerEntry] result)
where Address : IPointerLikeAddress<T>
where Address.Differential : IPointerLikeAddress<T.Differential>
{
uint index = levelBaseOffset + permutoIndex(key, hashmapSize) * FeatureDimensionPerEntry;
[ForceUnroll]
for (uint f = 0; f < FeatureDimensionPerEntry; ++f)
{
featureTableAddress.d.atomicAdd(index + f, result[f]);
}
}
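For intuition, the gather/scatter-add pair above can be sketched in Python (hypothetical names; the hashing and addressing stand in for `permutoIndex` and the pointer-like address): the forward pass reads `FeatureDimensionPerEntry` contiguous values at a hashed slot, and the backward pass accumulates the incoming gradient into the same slots, which on the GPU must be an atomic add since many threads may hit one slot.

```python
# Illustrative sketch of readFeatureValue / readFeatureValueBwd.
# slot_hash stands in for the permutohedral base-conversion hash.

def read_feature(table, base, hashmap_size, slot_hash, feature_dim):
    """Forward: gather feature_dim contiguous values at the hashed slot."""
    index = base + (slot_hash % hashmap_size) * feature_dim
    return table[index : index + feature_dim]

def read_feature_bwd(grad_table, base, hashmap_size, slot_hash, feature_dim, d_result):
    """Backward: scatter-add the output gradient into the same slots
    (atomicAdd on the GPU, since slots are shared across threads)."""
    index = base + (slot_hash % hashmap_size) * feature_dim
    for f in range(feature_dim):
        grad_table[index + f] += d_result[f]
```

Two backward calls hitting the same slot simply accumulate, which is why the Slang version routes through `atomicAdd` on the differential address.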
/// Main forward pass: encodes positions using the permutohedral lattice for a single level.
/// Writes results to the appropriate offset in the full output array.
///
/// @param params Encoding parameters containing:
/// - maxLevel: Maximum level to use (maxLevelRatio * MaxLevels)
/// - levelInfo: Per-level info (currentLevelIndex, offsetOfFeatureTable, featureTableSize, scale, scalesPerDim, shiftsPerDim)
/// @param position Input position (Dimensions floats)
/// @param featureTableAddress Pointer-like address to the feature table for all levels
/// @param encodedFeatures Output array of size MaxLevels * FeatureDimensionPerEntry. Results are written at offset currentLevelIndex * FeatureDimensionPerEntry.
[Differentiable]
VISIBILITY_LEVEL static void encodePerLevel<Address>(
no_diff in Params params,
InArray position,
Address featureTableAddress,
inout OutArray encodedFeatures)
where Address : IPointerLikeAddress<T>
where Address.Differential : IPointerLikeAddress<T.Differential>
{
// Compute offset into output array for this level
uint outputOffset = params.levelInfo.currentLevelIndex * FeatureDimensionPerEntry;
// Initialize this level's output to zero
[ForceUnroll]
for (uint f = 0; f < FeatureDimensionPerEntry; ++f)
{
encodedFeatures[outputOffset + f] = T(0);
}
// If level is greater than maxLevel, output zero padding (already initialized above)
if (float(params.levelInfo.currentLevelIndex) >= params.maxLevel + 1e-3f)
{
return;
}
// Use precomputed level-specific parameters
uint levelOffset = params.levelInfo.offsetOfFeatureTable;
uint hashmapSize = params.levelInfo.featureTableSize;
// Compute the base offset for this level's features (as integer, not address offset).
// We pass this as an integer to readFeatureValue so that the DifferentialPtrPair
// stays at the base address — both .p and .d remain correctly paired.
uint levelBaseOffset = levelOffset * FeatureDimensionPerEntry;
// Elevate D-dimension vector to (D+1)-dimension homogeneous vector on hyperplane H_d
float elevated[Dimensions + 1];
permutoElevate(position, params.levelInfo.scalesPerDim, params.levelInfo.shiftsPerDim, elevated);
// Find the closest remainder-0 and rank
int rem0[Dimensions + 1];
int rank[Dimensions + 1];
permutoFindRem0(elevated, rem0, rank);
// Compute the barycentric coordinates
float barycentric[Dimensions + 2];
permutoBarycentric(elevated, rem0, rank, barycentric);
// Interpolate the values using barycentric weights
uvec<Dimensions> key;
[ForceUnroll]
for (uint k = 0; k <= Dimensions; ++k) // For each remainder-k vertex
{
// Compute the coordinates of the remainder-k vertex
[ForceUnroll]
for (uint dim = 0; dim < Dimensions; ++dim)
{
key[dim] = uint(rem0[dim] + int(k));
if (rank[dim] > int(Dimensions - k))
{
key[dim] -= uint(Dimensions + 1);
}
}
// Read feature value at this vertex
T featureVal[FeatureDimensionPerEntry];
readFeatureValue(featureTableAddress, levelBaseOffset, hashmapSize, key, featureVal);
// Accumulate with barycentric weight
float weight = barycentric[k];
[ForceUnroll]
for (uint f = 0; f < FeatureDimensionPerEntry; ++f)
{
encodedFeatures[outputOffset + f] += T(weight) * featureVal[f];
}
}
}
[Differentiable]
public OutArray encode<Address>(in InArray input, Address parametersAddress)
where Address : IPointerLikeAddress<T>
where Address.Differential : IPointerLikeAddress<T.Differential>
{
OutArray encodedFeatures = OutArray();
Params levelParams;
levelParams.maxLevel = params.maxLevel;
[ForceUnroll]
for (uint level = 0; level < MaxLevels; ++level)
{
levelParams.levelInfo = params.levelInfo[level];
encodePerLevel<Address>(levelParams, input, parametersAddress, encodedFeatures);
}
return encodedFeatures;
}
/// Helper function to compute the PerLevelInfo (offset, hashmapSize, etc.) for a given level.
/// The hashmap size is 2^Log2HashmapSize for all levels.
/// The offset for each level is currentLevel * hashmapSize.
///
/// @param currentLevel The current level being processed
/// @param baseScale Base scale for the encoding
/// @param log2PerLevelScale Log2 of the per-level scale factor
/// @param seed Seed for the random shift generation
/// @param levelInfo Output: the computed level info with all parameters
VISIBILITY_LEVEL static void prepareLevelInfo(
in uint currentLevel,
in float baseScale,
in float log2PerLevelScale,
in uint seed,
out PerLevelInfo levelInfo)
{
static const uint HashmapSize = 1u << Log2HashmapSize;
levelInfo.currentLevelIndex = currentLevel;
levelInfo.featureTableSize = HashmapSize;
levelInfo.offsetOfFeatureTable = currentLevel * HashmapSize;
levelInfo.scale = baseScale * exp2(float(currentLevel) * log2PerLevelScale);
// Initialize RNG for random shifts (different levels should draw differently)
Pcg32 rng = Pcg32(seed);
rng.advance(int64_t(currentLevel * Dimensions));
[ForceUnroll]
for (uint dim = 0; dim < Dimensions; ++dim)
{
levelInfo.scalesPerDim[dim] = levelInfo.scale * rsqrt(float((dim + 1) * (dim + 2)));
// Convert uint32 to float in [0, 1) then scale to [-5, 5)
float randFloat = float(rng.nextUint()) * (1.0f / 4294967296.0f);
levelInfo.shiftsPerDim[dim] = randFloat * 10.0f - 5.0f;
}
}
}
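The lattice math above has two useful invariants: the elevated coordinates sum to zero (the point lies on hyperplane H_d), and the barycentric weights `barycentric[0..D]` sum to one. A hedged Python sketch of `permutoElevate` / `permutoFindRem0` / `permutoBarycentric` (illustrative names; integer division is exact here because every `rem0` entry is a multiple of D+1, so C truncation and Python floor agree):

```python
# Illustrative sketch of the permutohedral lattice helpers for D dimensions.

def elevate(pos, scales, shifts):
    D = len(pos)
    elevated = [0.0] * (D + 1)
    s = 0.0
    for dim in range(D - 1, -1, -1):
        cf = (pos[dim] + shifts[dim]) * scales[dim]
        elevated[dim + 1] = s - (dim + 1) * cf
        s += cf
    elevated[0] = s
    return elevated  # components sum to zero: point lies on hyperplane H_d

def find_rem0(elevated):
    D = len(elevated) - 1
    # nearest multiple of (D+1) per component
    rem0 = [int(round(v / (D + 1))) * (D + 1) for v in elevated]
    rank = [0] * (D + 1)
    for i in range(D):
        di = elevated[i] - rem0[i]
        for j in range(i + 1, D + 1):
            if di < elevated[j] - rem0[j]:
                rank[i] += 1
            else:
                rank[j] += 1
    s = sum(rem0) // (D + 1)   # exact: sum(rem0) is a multiple of D+1
    for i in range(D + 1):
        rank[i] += s
        if rank[i] < 0:
            rank[i] += D + 1
            rem0[i] += D + 1
        elif rank[i] > D:
            rank[i] -= D + 1
            rem0[i] -= D + 1
    return rem0, rank

def barycentric(elevated, rem0, rank):
    D = len(elevated) - 1
    bary = [0.0] * (D + 2)
    for i in range(D + 1):
        delta = (elevated[i] - rem0[i]) / (D + 1)
        bary[D - rank[i]] += delta
        bary[D + 1 - rank[i]] -= delta
    bary[0] += 1.0 + bary[D + 1]   # wrap around
    return bary  # bary[0..D] are the interpolation weights; they sum to 1
```

With unit scales and zero shifts, `rank` ends up a permutation of `0..D` and the D+1 weights are non-negative and sum to one, which is exactly what the `encodePerLevel` interpolation loop relies on.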


@@ -0,0 +1,274 @@
// Unit test mode is used for unit testing the tiled MMA implementation,
// so this single file can be tested by passing -DUNIT_TEST to the compiler.
implementing neural;
#include "common-def.slang"
#define Max(A, B) ((A) > (B) ? (A) : (B))
internal typealias SPtr<T> = Ptr<T, Access::ReadWrite, AddressSpace::GroupShared>;
internal interface ISharedMemoryPool
{
internal static SPtr<uint4> getPointer();
}
public interface ISharedMemorySize
{
public static const uint Bytes;
}
public struct SharedMemoryPool<ShMemSize: ISharedMemorySize> : ISharedMemoryPool
{
public static const uint sizeInBytes = ShMemSize.Bytes;
internal static groupshared uint4 data[sizeInBytes / sizeof(uint4)];
VISIBILITY_LEVEL static SPtr<uint4> getPointer()
{
return __getAddress(data[0]);
}
}
internal struct SharedMemoryUsage<T, TargetEnum Target, ExecutionMode ExeMode, int InputSize, int OutputSize, int SubgroupSize>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
static const bool IsTraining = ExeMode == ExecutionMode.Training;
typealias TileInfoNormal = TileInfo<T, OutputSize, SubgroupSize, InputSize, Target, false>;
typealias TileInfoTransposed = TileInfo<T, OutputSize, SubgroupSize, InputSize, Target, true>;
typealias CMShape = CoopMatShape<T, Target>;
// Shared memory A is used to load Tile A. The size of Tile A is determined by the height of matrix A and the width of CoopMatA.
// The possible shapes of matrix A can be:
// 1. M x K in A * B -> inference (TransposeA = false)
// 2. K x M in A^T * B -> training (TransposeA = true)
// 3. M x N in outer product of dOut and input. -> training (TransposeA = false)
// In the inference mode, tile A size is always [M x CoopMatA_Width].
// In the training mode, tile A size is either [M x CoopMatA_Width] or [K x CoopMatA_Width], so we need to choose the max value
static const int SharedMemSizeInVectorMatA = !IsTraining ?
(TileInfoNormal.HeightInElementsTileA * CMShape.COLUMN_A) / CMShape.ElementCountPerVector :
(Max(TileInfoNormal.HeightInElementsTileA, TileInfoTransposed.HeightInElementsTileA) * CMShape.COLUMN_A) / CMShape.ElementCountPerVector;
// Shared memory B is used to load Tile B. The size of Tile B is determined by the height of CoopMatB and the width of Tile B.
// The possible shapes of matrix B in inference mode can be:
// 1. K x N in A * B -> inference
// 2. M x N in A^T * B -> training
// 3. N x K in outer product of dOut and input. -> training
// In the inference mode, tile B size is always [CoopMatB_Height x N].
// In the training mode, tile B size is either [CoopMatB_Height x N] or [CoopMatB_Height x K], so we need to choose the max value.
// InputSize is K.
static const int TileBWidthForOuterProduct = ((InputSize + CMShape.COLUMN_B - 1) / CMShape.COLUMN_B) * CMShape.COLUMN_B;
static const int SharedMemSizeInVectorMatB = !IsTraining ?
((TileInfoNormal.WidthInElementsTileB * CMShape.ROW_B) / CMShape.ElementCountPerVector) :
((Max(TileInfoNormal.WidthInElementsTileB, TileBWidthForOuterProduct) * CMShape.ROW_B) / CMShape.ElementCountPerVector);
// Shared memory C is used to store the result of CoopMatC. The size is determined by the height of CoopMatC and the width of Tile C.
// The possible shapes of matrix C can only be:
// 1. M x N in A * B
// 2. K x N in A^T * B
// 3. M x K in outer product of dOut and input.
// Therefore the Tile C size is the same as the Tile B size. However, the data type of Tile B can only be half, while Tile C can be
// both float and half, so we need to take that into account.
static const int SharedMemSizeInVectorMatC = SharedMemSizeInVectorMatB * sizeof(T) / sizeof(half);
}
public struct SharedMemorySize0<T, TargetEnum Target, ExecutionMode ExeMode, int SubgroupSize, int SubgroupCount, int InputSize, int OutputSize>
: ISharedMemorySize
where T : __BuiltinFloatingPointType
where T.Differential == T
{
typealias ShMemInfo = SharedMemoryUsage<T, Target, ExeMode, InputSize, OutputSize, SubgroupSize>;
// Notice that in the actual implementation, we always reuse shared memory for Tile B and Tile C because they are always used at
// different stages of the computation, and they have the same size.
public static const uint Bytes =
(ShMemInfo.SharedMemSizeInVectorMatA + (ShMemInfo.SharedMemSizeInVectorMatC) * SubgroupCount) * sizeof(uint4);
}
// The following code is a macro-based implementation of the shared memory size calculation.
// It is used to calculate the shared memory size for a given number of hidden layers.
// The challenge here is that the shared memory size has to be a compile-time constant; however,
// Slang doesn't have constexpr functions, so the only way to get a compile-time constant
// is to use metaprogramming to generate code that can be evaluated at compile time.
// The algorithm is simple: we use divide and conquer to calculate the shared memory size.
// First, we define the base case `SharedMemorySize0`, which, given the input and output sizes of a layer,
// calculates the shared memory size for that layer.
// Then we handle larger numbers of layers using the divide-and-conquer strategy. The reason to use a macro here
// is just to reduce the amount of code we need to write; under the hood, the macros expand to
// `SharedMemorySize1` through `SharedMemorySize15`.
// Take an example:
// ```
// DEFINE_SHMEM_SIZE(3, 1, 1, PARAM_3, ARG_3_L, ARG_3_R)
// ```
// This will be expanded to:
// ```
// public struct SharedMemorySize3<T, TargetEnum Target, ExecutionMode ExeMode, int SubgroupSize, int SubgroupCount, uint S0, uint S1, uint S2, uint S3, uint S4> SHMEM_WHERE {
// internal static const uint a = SharedMemorySize1<T, Target, ExeMode, SubgroupSize, SubgroupCount, S0, S1, S2>.Bytes;
// internal static const uint b = SharedMemorySize1<T, Target, ExeMode, SubgroupSize, SubgroupCount, S2, S3, S4>.Bytes;
// public static const uint Bytes = Max(a, b);
// }
// ```
// Here `LN` and `RN` determine how we divide the input sequence of layers into two parts.
// TODO: We shouldn't need such sophisticated metaprogramming to achieve this once we have constexpr functions,
// or more advanced variadic generic parameter support such as First(...)/Rest(...), so that
// we can define SharedMemorySize as a variadic generic struct instead of these pre-defined generics.
//
// We note that this implementation is not the most efficient way to calculate the shared memory size: we could
// first find the max layer size and then do the remaining calculation. But since this computation is
// not done at run time, we don't need to worry about performance, and we can reuse data structures we
// already have, so this is the easiest way to implement it.
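The recursion the macros generate can be sketched in a few lines of Python (illustrative names; the split points differ from the exact `LN`/`RN` pairs below, but the result, the max over all per-layer sizes, is the same):

```python
# Illustrative sketch of the divide-and-conquer recursion behind DEFINE_SHMEM_SIZE.

def layer_bytes(s_in, s_out):
    # Placeholder for the per-layer SharedMemorySize0 computation.
    return max(s_in, s_out) * 16  # e.g. tiles measured in uint4 (16-byte) units

def shmem_bytes(sizes):
    """Max shared memory over all layers; sizes = [S0, S1, ..., SN]."""
    if len(sizes) == 2:                      # base case: a single layer
        return layer_bytes(sizes[0], sizes[1])
    mid = len(sizes) // 2                    # divide and conquer, like LN/RN
    left = shmem_bytes(sizes[:mid + 1])      # note: the middle size is shared,
    right = shmem_bytes(sizes[mid:])         # just like S2 in ARG_3_L/ARG_3_R
    return max(left, right)
```

However the sequence is split, the recursion bottoms out on every adjacent `(S_i, S_{i+1})` pair exactly once, so the result equals the max of `layer_bytes` over consecutive pairs.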
#define UNPACK(...) __VA_ARGS__
// Helper macros shared by all expansions
#define SHMEM_WHERE where T : __BuiltinFloatingPointType where T.Differential == T
#define SHMEM_BASE T, Target, ExeMode, SubgroupSize, SubgroupCount
// The core macro
#define DEFINE_SHMEM_SIZE(N, LN, RN, ARGS, L_VALS, R_VALS) \
public struct SharedMemorySize##N<T, TargetEnum Target, ExecutionMode ExeMode, int SubgroupSize, int SubgroupCount, UNPACK ARGS> \
: ISharedMemorySize \
SHMEM_WHERE \
{ \
internal static const uint a = SharedMemorySize##LN<SHMEM_BASE, UNPACK L_VALS>.Bytes; \
internal static const uint b = SharedMemorySize##RN<SHMEM_BASE, UNPACK R_VALS>.Bytes; \
public static const uint Bytes = Max(a, b); \
}
#define PARAM_1 (uint S0, uint S1, uint S2)
#define ARG_1 (S0, S1, S2)
#define ARG_1_L (S0, S1)
#define ARG_1_R (S1, S2)
#define PARAM_2 (UNPACK PARAM_1, uint S3)
#define ARG_2 (UNPACK ARG_1, S3)
#define ARG_2_L (S0, S1, S2)
#define ARG_2_R (S2, S3)
#define PARAM_3 (UNPACK PARAM_2, uint S4)
#define ARG_3 (UNPACK ARG_2, S4)
#define ARG_3_L (S0, S1, S2)
#define ARG_3_R (S2, S3, S4)
DEFINE_SHMEM_SIZE(1, 0, 0, PARAM_1, ARG_1_L, ARG_1_R)
DEFINE_SHMEM_SIZE(2, 1, 0, PARAM_2, ARG_2_L, ARG_2_R)
DEFINE_SHMEM_SIZE(3, 1, 1, PARAM_3, ARG_3_L, ARG_3_R)
// from 4 to 7
#define PARAM_4 (UNPACK PARAM_3, uint S5)
#define ARG_4 (UNPACK ARG_3, S5)
#define ARG_4_R (S4, S5)
#define PARAM_5 (UNPACK PARAM_4, uint S6)
#define ARG_5 (UNPACK ARG_4, S6)
#define ARG_5_R (UNPACK ARG_4_R, S6)
#define PARAM_6 (UNPACK PARAM_5, uint S7)
#define ARG_6 (S0, S1, S2, S3, S4, S5, S6, S7)
#define ARG_6_R (UNPACK ARG_5_R, S7)
#define PARAM_7 (UNPACK PARAM_6, uint S8)
#define ARG_7 (UNPACK ARG_6, S8)
#define ARG_7_R (UNPACK ARG_6_R, S8)
DEFINE_SHMEM_SIZE(4, 3, 0, PARAM_4, ARG_3, ARG_4_R)
DEFINE_SHMEM_SIZE(5, 3, 1, PARAM_5, ARG_3, ARG_5_R)
DEFINE_SHMEM_SIZE(6, 3, 2, PARAM_6, ARG_3, ARG_6_R)
DEFINE_SHMEM_SIZE(7, 3, 3, PARAM_7, ARG_3, ARG_7_R)
// from 8 to 15
#define PARAM_8 (UNPACK PARAM_7, uint S9)
#define ARG_8 (UNPACK ARG_7, S9)
#define ARG_8_R (S8, S9)
#define PARAM_9 (UNPACK PARAM_8, uint S10)
#define ARG_9 (UNPACK ARG_8, S10)
#define ARG_9_R (UNPACK ARG_8_R, S10)
#define PARAM_10 (UNPACK PARAM_9, uint S11)
#define ARG_10 (UNPACK ARG_9, S11)
#define ARG_10_R (UNPACK ARG_9_R, S11)
#define PARAM_11 (UNPACK PARAM_10, uint S12)
#define ARG_11 (UNPACK ARG_10, S12)
#define ARG_11_R (UNPACK ARG_10_R, S12)
#define PARAM_12 (UNPACK PARAM_11, uint S13)
#define ARG_12 (UNPACK ARG_11, S13)
#define ARG_12_R (UNPACK ARG_11_R, S13)
#define PARAM_13 (UNPACK PARAM_12, uint S14)
#define ARG_13 (UNPACK ARG_12, S14)
#define ARG_13_R (UNPACK ARG_12_R, S14)
#define PARAM_14 (UNPACK PARAM_13, uint S15)
#define ARG_14 (UNPACK ARG_13, S15)
#define ARG_14_R (UNPACK ARG_13_R, S15)
#define PARAM_15 (UNPACK PARAM_14, uint S16)
#define ARG_15 (UNPACK ARG_14, S16)
#define ARG_15_R (UNPACK ARG_14_R, S16)
DEFINE_SHMEM_SIZE(8, 7, 0, PARAM_8, ARG_7, ARG_8_R)
DEFINE_SHMEM_SIZE(9, 7, 1, PARAM_9, ARG_7, ARG_9_R)
DEFINE_SHMEM_SIZE(10, 7, 2, PARAM_10, ARG_7, ARG_10_R)
DEFINE_SHMEM_SIZE(11, 7, 3, PARAM_11, ARG_7, ARG_11_R)
DEFINE_SHMEM_SIZE(12, 7, 4, PARAM_12, ARG_7, ARG_12_R)
DEFINE_SHMEM_SIZE(13, 7, 5, PARAM_13, ARG_7, ARG_13_R)
DEFINE_SHMEM_SIZE(14, 7, 6, PARAM_14, ARG_7, ARG_14_R)
DEFINE_SHMEM_SIZE(15, 7, 7, PARAM_15, ARG_7, ARG_15_R)
// Slang doesn't support generic overloading, so we cannot provide a single generic with a varying number of parameters.
public struct SharedMemorySize<T, TargetEnum Target, ExecutionMode ExeMode, int SubgroupSize, int SubgroupCount>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
#define STRIP_PARENS(x) STRIP_PARENS_I x
#define STRIP_PARENS_I(...) __VA_ARGS__
public typealias OfLayer1<uint S0, uint S1> = SharedMemorySize0<T, Target, ExeMode, SubgroupSize, SubgroupCount, S0, S1>;
public typealias OfLayer2<STRIP_PARENS(PARAM_1)> = SharedMemorySize1<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_1)>;
public typealias OfLayer3<STRIP_PARENS(PARAM_2)> = SharedMemorySize2<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_2)>;
public typealias OfLayer4<STRIP_PARENS(PARAM_3)> = SharedMemorySize3<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_3)>;
public typealias OfLayer5<STRIP_PARENS(PARAM_4)> = SharedMemorySize4<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_4)>;
public typealias OfLayer6<STRIP_PARENS(PARAM_5)> = SharedMemorySize5<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_5)>;
public typealias OfLayer7<STRIP_PARENS(PARAM_6)> = SharedMemorySize6<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_6)>;
public typealias OfLayer8<STRIP_PARENS(PARAM_7)> = SharedMemorySize7<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_7)>;
public typealias OfLayer9<STRIP_PARENS(PARAM_8)> = SharedMemorySize8<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_8)>;
public typealias OfLayer10<STRIP_PARENS(PARAM_9)> = SharedMemorySize9<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_9)>;
public typealias OfLayer11<STRIP_PARENS(PARAM_10)> = SharedMemorySize10<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_10)>;
public typealias OfLayer12<STRIP_PARENS(PARAM_11)> = SharedMemorySize11<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_11)>;
public typealias OfLayer13<STRIP_PARENS(PARAM_12)> = SharedMemorySize12<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_12)>;
public typealias OfLayer14<STRIP_PARENS(PARAM_13)> = SharedMemorySize13<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_13)>;
public typealias OfLayer15<STRIP_PARENS(PARAM_14)> = SharedMemorySize14<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_14)>;
public typealias OfLayer16<STRIP_PARENS(PARAM_15)> = SharedMemorySize15<T, Target, ExeMode, SubgroupSize, SubgroupCount, STRIP_PARENS(ARG_15)>;
}
#if 0
// We should implement First/Rest syntax to enable something like this.
interface IVal {
static const int Value;
}
SharedMemorySize<int In, each Val:IVal HiddenSize, int OutputSize>
{
static const uint SharedMemSizeInBytes =
max(SharedMemorySize<In, First HiddenSize>, SharedMemorySize<each Rest HiddenSize, OutputSize>);
}
// Slang doesn't support generic overloading, therefore we cannot provide pre-defined generics that add a different number of HiddenSize parameters.
public struct SharedMemorySize<T, TargetEnum Target, ExecutionMode ExeMode, int SubgroupSize, int SubgroupCount, int InputSize, int HiddenSize, int OutputSize>
where T : __BuiltinFloatingPointType
where T.Differential == T
{
typealias ShMemInfo = SharedMemoryUsage<T, Target, ExeMode, InputSize, OutputSize, SubgroupSize>;
// Notice that in the actual implementation, we always reuse shared memory for Tile B and Tile C because they are always used at
// different stages of the computation, and they have the same size.
static const uint SharedMemSizeInBytes =
(ShMemInfo.SharedMemSizeInVectorMatA + (ShMemInfo.SharedMemSizeInVectorMatB) * SubgroupCount) * sizeof(uint4);
}
#endif


@@ -0,0 +1,231 @@
implementing neural;
#include "common-def.slang"
internal interface IArrayAccessor<T>
{
internal void atomicAdd(int index, T value)
{
static_assert(false, "atomicAdd is not supported for IArrayAccessor");
}
__subscript(int index)->T
{
get;
set;
}
}
internal extension<T> RWStructuredBuffer<T>.Handle : IArrayAccessor<T>
{
[ForceInline]
override internal void atomicAdd(int index, T value)
{
__atomic_reduce_add(this[index], value);
}
}
internal extension<T> Ptr<T> : IArrayAccessor<T>
{
internal __subscript(int index) -> T
{
[ForceInline]
get { return this[index]; }
[ForceInline]
set { this[index] = newValue; }
}
[ForceInline]
override internal void atomicAdd(int index, T value)
{
__atomic_reduce_add(this[index], value);
}
}
internal extension<T, int N> Array<T, N> : IArrayAccessor<T>
{
internal __subscript(int index) -> T
{
[ForceInline]
get { return this[index]; }
[ForceInline]
set { this[index] = newValue; }
}
}
VISIBILITY_LEVEL enum AccessOp : uint32_t
{
READ,
WRITE,
ACCUMULATE,
ATOMIC_ADD,
}
#define COMMON_TYPE_CONSTRAINTS \
where T : __BuiltinFloatingPointType \
where U : __BuiltinFloatingPointType \
where BufferType : IArrayAccessor<U>
[ForceInline]
internal static void readOneElement<T, U, BufferType, int NBytes, int BitsShiftPerRead>(BufferType buffer, int bufferIdx, int elementIdx, inout uint result)
COMMON_TYPE_CONSTRAINTS
{
const uint shift = BitsShiftPerRead * elementIdx;
T convertedValue;
convertedValue = __realCast<T>(buffer[bufferIdx]);
switch (NBytes)
{
case 1:
result |= uint(bit_cast<uint8_t>(convertedValue)) << shift;
break;
case 2:
result |= uint(bit_cast<uint16_t>(convertedValue)) << shift;
break;
case 4:
result |= uint(bit_cast<uint>(convertedValue)) << shift;
break;
default:
static_assert(false, "Unsupported data type T");
}
}
[ForceInline]
internal static void writeOneElement<T, U, BufferType, int NBytes, int BitsShiftPerWrite, AccessOp Op>(inout BufferType buffer, int bufferIdx, int elementIdx, uint value)
COMMON_TYPE_CONSTRAINTS
{
const uint shift = BitsShiftPerWrite * elementIdx;
U convertedValue;
switch (NBytes)
{
case 1:
convertedValue = __realCast<U>(bit_cast<T>((uint8_t)(value >> shift)));
break;
case 2:
convertedValue = __realCast<U>(bit_cast<T>((uint16_t)(value >> shift)));
break;
case 4:
convertedValue = __realCast<U>(bit_cast<T>((uint)(value >> shift)));
break;
default:
static_assert(false, "Unsupported data type T");
}
switch (Op)
{
case AccessOp.WRITE:
buffer[bufferIdx] = convertedValue;
break;
case AccessOp.ACCUMULATE:
buffer[bufferIdx] = buffer[bufferIdx] + convertedValue;
break;
case AccessOp.ATOMIC_ADD:
buffer.atomicAdd(bufferIdx, convertedValue);
break;
default:
static_assert(false, "Unsupported access operation");
}
}
[ForceInline]
internal static void accessUint4Aligned<AccessOp Op, T, U, BufferType>(inout BufferType buffer, int startIndex, inout uint4 value)
COMMON_TYPE_CONSTRAINTS
{
const int nBytes = sizeof(T);
const int WritePerElement = 4 / nBytes;
const int BitsShiftPerWrite = 32 / WritePerElement;
if (Op == AccessOp.READ)
value = uint4(0, 0, 0, 0);
[ForceUnroll]
for (int i = 0; i < 4; i++)
{
[ForceUnroll]
for (int j = 0; j < WritePerElement; j++)
{
int index = startIndex + i * WritePerElement + j;
switch (Op)
{
case AccessOp.READ:
readOneElement<T, U, BufferType, nBytes, BitsShiftPerWrite>(buffer, index, j, value[i]);
break;
case AccessOp.WRITE:
case AccessOp.ACCUMULATE:
case AccessOp.ATOMIC_ADD:
writeOneElement<T, U, BufferType, nBytes, BitsShiftPerWrite, Op>(buffer, index, j, value[i]);
break;
default:
static_assert(false, "Unsupported access operation");
}
}
}
}
[ForceInline]
internal void accessUint4<AccessOp Op, T, U, BufferType, bool IsAligned, int Stride>(BufferType buffer, int baseIndex, int startIndex, inout uint4 value)
COMMON_TYPE_CONSTRAINTS
{
if (IsAligned)
{
// Call the aligned version of readUint4 which is branchless.
accessUint4Aligned<Op, T, U, BufferType>(buffer, startIndex, value);
return;
}
if (Op == AccessOp.READ)
value = uint4(0, 0, 0, 0);
// T is the source (read) or destination (write) data type. We always pack a few elements into a uint4,
// so T determines how many elements we can pack into a uint4.
// If U is different from T, we first convert from U to T (in a read operation) or from T to U (in a write operation).
// But U does not determine how many elements we can read or write; only T does.
const int nBytes = sizeof(T);
const int ReadPerElement = 4 / nBytes;
const int BitsShiftPerRead = 32 / ReadPerElement;
const int x = (startIndex - baseIndex) % Stride;
// end address of this read [address+length-1]
const int endAddress = (x + 4 * ReadPerElement - 1);
// this is the same as: paddingCount = endAddress < Stride ? 0 : endAddress - Stride + 1
const int paddingCount = max<int>(0, endAddress - Stride + 1);
const int elementsToRead = (4 * ReadPerElement) - paddingCount;
[ForceUnroll]
for (int i = 0; i < 4; i++)
{
int offset = i * ReadPerElement;
[ForceUnroll]
for (int j = 0; j < ReadPerElement; j++)
{
            // 4 * ReadPerElement is the total number of elements we can read from the buffer.
            // paddingCount is the number of elements we need to pad.
            // e.g. if ReadPerElement is 2 and paddingCount is 4, then (4 * 2 - 4 == 4), so we can
            // just stop reading once offset is bigger than 3.
offset += j;
if (offset >= elementsToRead)
{
return;
}
int index = (startIndex + offset);
switch (Op)
{
case AccessOp.READ:
readOneElement<T, U, BufferType, nBytes, BitsShiftPerRead>(buffer, index, j, value[i]);
break;
case AccessOp.WRITE:
case AccessOp.ACCUMULATE:
case AccessOp.ATOMIC_ADD:
writeOneElement<T, U, BufferType, nBytes, BitsShiftPerRead, Op>(buffer, index, j, value[i]);
break;
default:
static_assert(false, "Unsupported access operation");
}
}
}
}


@@ -0,0 +1,156 @@
Slang 64-bit Type Support
=========================
## Summary
* Not all targets support 64 bit types, or all 64 bit types
* 64 bit integers generally require later APIs/shader models
* When specifying 64 bit floating-point literals *always* use the type suffixes (i.e. `L`)
* An integer literal will be interpreted as 64 bits if it cannot fit in a 32 bit value.
* GPU targets generally do not support all double intrinsics
  * Typically missing are the transcendentals (`sin`, `cos`, etc.) and the logarithm and exponential functions
  * CUDA is the exception, supporting nearly all double intrinsics
* D3D
* D3D targets *appear* to support double intrinsics (like sin, cos, log etc), but behind the scenes they are actually being converted to float
* When using D3D12, it is best to use DXIL if you use double because there are some serious issues around double and DXBC
* VK will produce an error in validation if a double intrinsic it does not support is used (which is most of them)
* Vector and Matrix types have even spottier intrinsic support across targets than scalar types
Overview
========
The Slang language supports 64 bit built-in types, such as:
* `double`
* `uint64_t`
* `int64_t`
This also applies to vector and matrix versions of these types.
Unfortunately, whether a specific target supports a given 64 bit type, or the typical HLSL intrinsic functions on it (such as `sin`/`cos`/`max`/`min`), depends very much on the target.
Special attention has to be paid to 64 bit literals. By default, float literals without an explicit suffix are assumed to be 32 bit. There are a variety of reasons for this design choice - the main one being that the default behavior should deliver good performance. The suffixes required for 64 bit types are as follows:
```
// double - 'l' or 'L'
double a = 1.34e-200L;
// WRONG!: This is the same as b = double(float(1.34e-200)) which will be 0. Will produce a warning.
double b = 1.34e-200;
// int64_t - 'll' or 'LL' (or combination of upper/lower)
int64_t c = -5436365345345234ll;
int64_t e = ~0LL; // Same as 0xffffffffffffffff
// uint64_t - 'ull' or 'ULL' (or combination of upper/lower)
uint64_t g = 0x8000000000000000ull;
uint64_t i = ~0ull; // Same as 0xffffffffffffffff
uint64_t j = ~0; // Equivalent to 'i' because uint64_t(int64_t(~int32_t(0)));
```
These issues are discussed more on issue [#1185](https://github.com/shader-slang/slang/issues/1185)
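The `uint64_t j = ~0` case above relies on two's-complement sign extension through the chain `uint64_t(int64_t(~int32_t(0)))`. A quick host-side check of that conversion chain (a Python sketch for illustration only; `sign_extend_32_to_u64` is a made-up name, not part of Slang):

```python
def sign_extend_32_to_u64(x32: int) -> int:
    """Model uint64_t(int64_t(int32_t(x32))) with two's-complement semantics."""
    x32 &= 0xFFFFFFFF                      # truncate to the 32-bit pattern
    if x32 & 0x80000000:                   # negative as int32?
        x64 = x32 - (1 << 32)              # int32 -> int64 sign extension
    else:
        x64 = x32
    return x64 & 0xFFFFFFFFFFFFFFFF        # reinterpret int64 as uint64

# ~int32_t(0) == -1, which sign-extends to all 64 bits set
print(hex(sign_extend_32_to_u64(~0)))      # 0xffffffffffffffff
```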
The type of a decimal non-suffixed integer literal is the first integer type from the list [`int`, `int64_t`]
which can represent the specified literal value. If the value cannot fit, the literal is represented as a `uint64_t`
and a warning is given.
The type of a hexadecimal non-suffixed integer literal is the first type from the list [`int`, `uint`, `int64_t`, `uint64_t`]
that can represent the specified literal value. A non-suffixed integer literal will be 64 bit if it cannot fit in 32 bits.
```
// Same as int64_t a = int(1), the value can fit into a 32 bit integer.
int64_t a = 1;
// Same as int64_t b = int64_t(2147483648), the value cannot fit into a 32 bit integer.
int64_t b = 2147483648;
// Same as uint64_t c = uint64_t(18446744073709551615), the value is larger than the maximum value of a signed 64 bit
// integer, and is interpreted as an unsigned 64 bit integer. A warning is given.
uint64_t c = 18446744073709551615;
// Same as uint64_t d = int(0x7FFFFFFF), the value can fit into a 32 bit integer.
uint64_t d = 0x7FFFFFFF;
// Same as uint64_t e = int64_t(0x7FFFFFFFFFFFFFFF), the value cannot fit into an unsigned 32 bit integer but
// can fit into a signed 64 bit integer.
uint64_t e = 0x7FFFFFFFFFFFFFFF;
// Same as uint64_t f = uint64_t(0xFFFFFFFFFFFFFFFF), the value cannot fit into a signed 64 bit integer, and
// is interpreted as an unsigned 64 bit integer.
uint64_t f = 0xFFFFFFFFFFFFFFFF;
```
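The selection rules above can be restated as a small host-side sketch (Python, purely illustrative; `literal_type` is a hypothetical helper, not part of Slang):

```python
def literal_type(value: int, is_hex: bool) -> str:
    """Pick the type of a non-suffixed integer literal per the rules above."""
    if is_hex:
        # Hexadecimal literals consider unsigned types too.
        candidates = [("int", 2**31 - 1), ("uint", 2**32 - 1),
                      ("int64_t", 2**63 - 1), ("uint64_t", 2**64 - 1)]
    else:
        # Decimal literals only consider the signed types...
        candidates = [("int", 2**31 - 1), ("int64_t", 2**63 - 1)]
    for name, max_value in candidates:
        if value <= max_value:
            return name
    # ...and fall back to uint64_t (with a warning) if neither fits.
    return "uint64_t"
```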
Double support
==============
Target | Compiler/Binary | Double Type | Intrinsics | Notes
---------|------------------|----------------|-----------------------|-----------
CPU | | Yes | Yes | 1
CUDA     | NVRTC/PTX        | Yes            | Yes                   | 1
D3D12 | DXC/DXIL | Yes | Small Subset | 4
Vulkan   | glslang/SPIR-V   | Yes            | Partial               | 2
D3D11 | FXC/DXBC | Yes | Small Subset | 4
D3D12 | FXC/DXBC | Yes | Small Subset | 3, 4
1) CUDA and CPU support most intrinsics, currently with the notable exception of matrix inversion
2) In terms of lack of general intrinsic support, the restriction is described in https://www.khronos.org/registry/spir-v/specs/1.0/GLSL.std.450.html
The following intrinsics are available for Vulkan
`fmod` (as %), `rcp`, `sign`, `saturate`, `sqrt`, `rsqrt`, `frac`, `ceil`, `floor`, `trunc`, `abs`, `min`, `max`, `smoothstep`, `lerp`, `clamp`, `step` and `asuint`.
These are tested in the test `tests/hlsl-intrinsic/scalar-double-vk-intrinsic.slang`.
What is missing are the transcendentals and the `exp`/`log` family of functions.
Note that glslang does produce SPIR-V containing double intrinsic calls for the missing intrinsics; the failure happens when validating the SPIR-V:
```
Validation: error 0: [ UNASSIGNED-CoreValidation-Shader-InconsistentSpirv ] Object: VK_NULL_HANDLE (Type = 0) | SPIR-V module not valid: GLSL.std.450 Sin: expected Result Type to be a 16 or 32-bit scalar or vector float type
%57 = OpExtInst %double %1 Sin %56
```
3) If a `RWStructuredBuffer<double>` is used on D3D12 with DXBC and a double is written, it can lead to incorrect behavior. It is therefore recommended not to use double with DXBC, but to use DXIL to keep things simple. A test showing this problem is `tests/bugs/dxbc-double-problem.slang`. The test `tests/hlsl-intrinsic/scalar-double-simple.slang` shows that when a double resource is not used, doubles do appear to work on D3D12 DXBC.
4) If you compile code using double and intrinsics through Slang, at first blush it will seem to work. Assuming there are no errors in your code, it will even typically appear to run correctly. Unfortunately what is really happening is that the backend compiler (fxc or dxc) is narrowing double to float and then using float intrinsics. It typically generates a warning when this happens, but unless there is an error in your code you will not see these warnings, because dxc does not appear to have a mechanism to return warnings when there is no error. This is why everything appears to work - but any intrinsic call is actually losing precision silently.
Note that Slang disables warnings on dxc by default - warnings need to be enabled to see the narrowing warnings.
There is another exception around the use of `%` - if you use it with double it will return an error saying only float is supported.
It appears that no intrinsics are available for double with fxc.
On dxc the following intrinsics are available with double:
`rcp`, `sign`, `saturate`, `abs`, `min`, `max`, `clamp`, `asuint`.
These are tested in the test `tests/hlsl-intrinsic/scalar-double-d3d-intrinsic.slang`.
There is no support for transcendentals (`sin`, `cos` etc) or `log`/`exp`. More surprising is that `sqrt`, `rsqrt`, `frac`, `ceil`, `floor`, `trunc`, `step`, `lerp`, `smoothstep` are also not supported.
uint64_t and int64_t Support
============================
Target | Compiler/Binary | u/int64_t Type | Intrinsic support | Notes
---------|------------------|----------------|--------------------|--------
CPU | | Yes | Yes |
CUDA     | NVRTC/PTX        | Yes            | Yes                |
Vulkan   | glslang/SPIR-V   | Yes            | Yes                |
D3D12 | DXC/DXIL | Yes | Yes | 1
D3D11 | FXC/DXBC | No | No | 2
D3D12 | FXC/DXBC | No | No | 2
1) The [sm6.0 docs](https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/hlsl-shader-model-6-0-features-for-direct3d-12) describe only supporting uint64_t, but dxc says int64_t is supported in [HLSL 2016](https://github.com/Microsoft/DirectXShaderCompiler/wiki/Language-Versions). Tests show that this is indeed the case.
2) uint64_t support requires [shader model 6.0](https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/hlsl-shader-model-6-0-features-for-direct3d-12), so DXBC cannot support it.
The intrinsics available on the `uint64_t` and `int64_t` types are `abs`, `min`, `max`, `clamp` and `countbits`.
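`countbits` in particular only becomes meaningful over the full 64-bit width; a quick host-side sketch of the difference versus truncating to 32 bits first (Python, illustrative only; the function names are made up):

```python
def countbits64(x: int) -> int:
    """Population count over the full 64-bit value, as countbits(uint64_t) sees it."""
    return bin(x & 0xFFFFFFFFFFFFFFFF).count("1")

def countbits32(x: int) -> int:
    """The same value truncated to 32 bits before counting."""
    return bin(x & 0xFFFFFFFF).count("1")

print(countbits64(0xFFFFFFFFFFFFFFFF))  # 64
print(countbits32(0xFFFFFFFFFFFFFFFF))  # 32 - the high bits are lost
```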
GLSL
====
GLSL/SPIR-V based targets do not support 'generated' intrinsics on matrix types. For example `sin(mat)` will not work on GLSL/SPIR-V.


@@ -0,0 +1,35 @@
Slang Documentation
===================
This directory contains documentation for the Slang system.
Some of the documentation is intended for users of the language and compiler, while other documentation is intended for developers contributing to the project.
Getting Started
---------------
The Slang [User's Guide](https://shader-slang.github.io/slang/user-guide/) provides an introduction to the Slang language and its major features, as well as the compilation and reflection API.
There is also documentation specific to using the [slangc](https://shader-slang.github.io/slang/user-guide/compiling.html#command-line-compilation-with-slangc) command-line tool.
Advanced Users
--------------
For the benefit of advanced users we provide detailed documentation on how Slang compiles code for specific platforms.
The [target compatibility guide](target-compatibility.md) gives an overview of feature compatibility for targets.
The [CPU target guide](cpu-target.md) gives information on compiling Slang or C++ source into shared libraries/executables or functions that can be directly executed. It also covers how to generate C++ code from Slang source.
The [CUDA target guide](cuda-target.md) provides information on compiling Slang/HLSL or CUDA source. Slang can compile to equivalent CUDA source, as well as to PTX via the nvrtc CUDA compiler.
Contributors
------------
For contributors to the Slang project, the information under the [`design/`](design/) directory may help explain the rationale behind certain design decisions and help when ramping up in the codebase.
Research
--------
The Slang project is based on a long history of research work. While understanding this research is not necessary for working with Slang, it may be instructive for understanding the big-picture goals of the language, as well as why certain critical decisions were made.
A [paper](http://graphics.cs.cmu.edu/projects/slang/) on the Slang system was accepted into SIGGRAPH 2018, and it provides an overview of the language and the compiler implementation.
Yong He's [dissertation](http://graphics.cs.cmu.edu/projects/renderergenerator/yong_he_thesis.pdf) provided more detailed discussion of the design of the Slang system.


@@ -0,0 +1 @@
theme: jekyll-theme-tactile


@@ -0,0 +1,137 @@
{% capture headingsWorkspace %}
{% comment %}
Copyright (c) 2018 Vladimir "allejo" Jimenez
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
{% endcomment %}
{% comment %}
Version 1.0.9
https://github.com/allejo/jekyll-anchor-headings
"Be the pull request you wish to see in the world." ~Ben Balter
Usage:
{% include anchor_headings.html html=content anchorBody="#" %}
Parameters:
* html (string) - the HTML of compiled markdown generated by kramdown in Jekyll
Optional Parameters:
* beforeHeading (bool) : false - Set to true if the anchor should be placed _before_ the heading's content
* headerAttrs (string) : '' - Any custom HTML attributes that will be added to the heading tag; you may NOT use `id`;
the `%heading%` and `%html_id%` placeholders are available
* anchorAttrs (string) : '' - Any custom HTML attributes that will be added to the `<a>` tag; you may NOT use `href`,
`class` or `title`;
the `%heading%` and `%html_id%` placeholders are available
* anchorBody (string) : '' - The content that will be placed inside the anchor; the `%heading%` placeholder is
available
* anchorClass (string) : '' - The class(es) that will be used for each anchor. Separate multiple classes with a
space
* anchorTitle (string) : '' - The `title` attribute that will be used for anchors
* h_min (int) : 1 - The minimum header level to build an anchor for; any header lower than this value will be
ignored
* h_max (int) : 6 - The maximum header level to build an anchor for; any header greater than this value will be
ignored
* bodyPrefix (string) : '' - Anything that should be inserted inside of the heading tag _before_ its anchor and
content
* bodySuffix (string) : '' - Anything that should be inserted inside of the heading tag _after_ its anchor and
content
Output:
The original HTML with the addition of anchors inside of all of the h1-h6 headings.
{% endcomment %}
{% assign minHeader = include.h_min | default: 1 %}
{% assign maxHeader = include.h_max | default: 2 %}
{% assign beforeHeading = include.beforeHeading %}
{% assign nodes = include.html | split: '<h' %}
{% capture edited_headings %}{% endcapture %}
{% for _node in nodes %}
    {% capture node %}{{ _node | strip }}{% endcapture %}
    {% if node == "" %}
        {% continue %}
    {% endif %}
    {% assign nextChar = node | replace: '"', '' | strip | slice: 0, 1 %}
    {% assign headerLevel = nextChar | times: 1 %}
    <!-- If the level is cast to 0, it means it's not a h1-h6 tag, so let's see if we need to fix it -->
    {% if headerLevel == 0 %}
        <!-- Split up the node based on closing angle brackets and get the first one. -->
        {% assign firstChunk = node | split: '>' | first %}
        <!-- If the first chunk does NOT contain a '<', that means we've broken another HTML tag that starts with 'h' -->
        {% unless firstChunk contains '<' %}
            {% capture node %}<h{{ node }}{% endcapture %}
        {% endunless %}
        {% capture edited_headings %}{{ edited_headings }}{{ node }}{% endcapture %}
        {% continue %}
    {% endif %}
    {% capture _closingTag %}</h{{ headerLevel }}>{% endcapture %}
{% assign _workspace = node | split: _closingTag %}
{% assign _idWorkspace = _workspace[0] | split: 'id="' %}
{% assign _idWorkspace = _idWorkspace[1] | split: '"' %}
{% assign html_id = _idWorkspace[0] %}
{% capture _hAttrToStrip %}{{ _workspace[0] | split: '>' | first }}>{% endcapture %}
{% assign header = _workspace[0] | replace: _hAttrToStrip, '' %}
<!-- Build the anchor to inject for our heading -->
{% capture anchor %}{% endcapture %}
    {% if html_id and headerLevel >= minHeader and headerLevel <= maxHeader %}
        {% assign escaped_header = header | strip_html %}
        {% if include.headerAttrs %}
            {% capture _hAttrToStrip %}{{ _hAttrToStrip | split: '>' | first }} {{ include.headerAttrs | replace: '%heading%', escaped_header | replace: '%html_id%', html_id }}>{% endcapture %}
{% endif %}
{% capture anchor %}href="#{{ html_id }}"{% endcapture %}
{% if include.anchorClass %}
{% capture anchor %}{{ anchor }} class="{{ include.anchorClass }}"{% endcapture %}
{% endif %}
{% if include.anchorTitle %}
            {% capture anchor %}{{ anchor }} title="{{ include.anchorTitle | replace: '%heading%', escaped_header }}"{% endcapture %}
{% endif %}
{% if include.anchorAttrs %}
            {% capture anchor %}{{ anchor }} {{ include.anchorAttrs | replace: '%heading%', escaped_header | replace: '%html_id%', html_id }}{% endcapture %}
{% endif %}
        {% capture anchor %}<a {{ anchor }}>{{ include.anchorBody | replace: '%heading%', escaped_header | default: '' }}</a>{% endcapture %}
<!-- In order to prevent adding extra space after a heading, we'll let the 'anchor' value contain it -->
{% if beforeHeading %}
{% capture anchor %}{{ anchor }} {% endcapture %}
{% else %}
{% capture anchor %} {{ anchor }}{% endcapture %}
{% endif %}
{% endif %}
{% capture new_heading %}
        <h{{ _hAttrToStrip }} {{ include.bodyPrefix }} {% if beforeHeading %} {{ anchor }}{{ header }} {% else %} {{ header }}{{ anchor }} {% endif %} {{ include.bodySuffix }} </h{{ headerLevel }}>
{% endcapture %}
<!--
        If we have content after the `</hX>` tag, then we'll want to append that here so we don't lose any content.
-->
{% assign chunkCount = _workspace | size %}
{% if chunkCount > 1 %}
{% capture new_heading %}{{ new_heading }}{{ _workspace | last }}{% endcapture %}
{% endif %}
{% capture edited_headings %}{{ edited_headings }}{{ new_heading }}{% endcapture %}
{% endfor %}
{% endcapture %}{% assign headingsWorkspace = '' %}{{ edited_headings | strip }}


@@ -0,0 +1,225 @@
<!DOCTYPE html>
<html lang="{{ site.lang | default: "en-US" }}">
<head>
<meta charset='utf-8'>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<link rel="stylesheet" href="{{ '/assets/css/style.css?v=' | append: site.github.build_revision | relative_url }}">
<link rel="stylesheet" type="text/css" href="{{ '/assets/css/print.css' | relative_url }}" media="print">
<script async src="https://www.googletagmanager.com/gtag/js?id=G-TMTZVLLMBP"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-TMTZVLLMBP');
</script>
<!--[if lt IE 9]>
<script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<style>
#centeringDiv {
margin: auto;
max-width: 1200px;
}
#navDiv
{
display: block;
box-sizing: border-box;
padding-top: 5px;
padding-bottom: 5px;
border-bottom-width: 3px;
border-bottom-style: solid;
border-bottom-color: #F0F0F0;
}
#navDiv nav
{
float:left;
}
#navDiv::after {
content: "";
clear: both;
display: table;
}
#navDiv nav li::after
{
content: "/";
padding-left: 10px;
padding-right: 0px;
color: #808080;
}
#navDiv nav li
{
display:inline;
padding-left: 10px;
padding-right: 0px;
}
#tocColumn {
width: 350px;
position: fixed;
overflow-y: auto;
box-sizing: border-box;
display: block;
}
#tocInner {
padding: 20px;
}
#rightColumn {
padding-left: 390px;
padding-right: 40px;
padding-top: 20px;
}
.toc_root_list {
list-style-type: none;
list-style-position: outside;
background-color: initial;
padding-left: 0px;
}
.toc_list {
padding-left: 16px;
background-color: initial;
list-style-type: none;
margin-bottom: 0px;
}
.toc_item {
cursor: pointer;
user-select: none;
list-style-type: none;
padding-left: 0px;
padding-top: 5px;
}
.toc_item_expanded::before {
content: "\25be";
cursor: pointer;
}
.toc_item_collapsed::before {
content: "\25b8";
cursor: pointer;
}
.toc_item_leaf {
padding-left: 14px;
cursor: pointer;
list-style-type: none;
}
.toc_span:hover
{
color: #d5000d;
}
.tocIcon
{
vertical-align: -2.5px;
}
.editButton
{
float: right;
margin-right: 10px;
color:#808080;
}
.editIcon
{
fill: currentColor;
vertical-align: text-top;
}
#btnToggleTOC {
display: none;
width: fit-content;
margin-left: 10px;
margin-top: 10px;
padding: 10px;
border-style: solid;
border-color: #808080;
border-width: 1px;
background-color: #E8E8E8;
}
#btnToggleTOC:hover {
background-color: #F0F0E8;
}
#btnToggleTOC:active {
background-color: #D4D4D4;
}
@media screen and (max-width: 900px) {
#tocColumn {
width: 300px;
display: block;
box-sizing: border-box;
}
#rightColumn {
padding-left: 320px;
padding-right: 20px;
}
}
@media screen and (max-width: 700px) {
#tocColumn {
width: 100%;
position: static;
display: none;
border-right-style: none;
box-sizing: content-box;
}
#tocInner {
padding: 10px;
}
#rightColumn {
padding-left: 10px;
padding-right: 10px;
}
#centeringDiv {
padding-left: 0px;
}
#btnToggleTOC {
display: block;
}
}
</style>
{% seo %}
</head>
<body>
<div id="centeringDiv">
<div id="navDiv">
<a class="editButton" title="Edit this page" href="https://github.com/{{ site.github.repository_nwo }}/edit/master/docs/{{ page.path }}">
<svg class="editIcon" height="16" viewBox="0 0 16 16" version="1.1" width="16" aria-hidden="true">
<path fill-rule="evenodd"
d="M11.013 1.427a1.75 1.75 0 012.474 0l1.086 1.086a1.75 1.75 0 010 2.474l-8.61 8.61c-.21.21-.47.364-.756.445l-3.251.93a.75.75 0 01-.927-.928l.929-3.25a1.75 1.75 0 01.445-.758l8.61-8.61zm1.414 1.06a.25.25 0 00-.354 0L10.811 3.75l1.439 1.44 1.263-1.263a.25.25 0 000-.354l-1.086-1.086zM11.189 6.25L9.75 4.81l-6.286 6.287a.25.25 0 00-.064.108l-.558 1.953 1.953-.558a.249.249 0 00.108-.064l6.286-6.286z">
</path>
</svg>
</a>
</div>
<div id="rightColumn">
<section id="main_content">
{% include anchor_headings.html html=content anchorBody="" %}
</section>
<a href="javascript:;" id="_content_end_"></a>
<footer>
{% if site.github.is_project_page %}
{{ site.title | default: site.github.repository_name }} is maintained by <a
href="{{ site.github.owner_url }}">{{ site.github.owner_name }}</a><br>
{% endif %}
This page was generated by <a href="https://pages.github.com">GitHub Pages</a>.
</footer>
</div>
</div>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ['$$','$$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\(","\\)"] ],
},
TeX: {
Macros: {
bra: ["\\langle{#1}|", 1],
ket: ["|{#1}\\rangle", 1],
braket: ["\\langle{#1}\\rangle", 1],
bk: ["\\langle{#1}|{#2}|{#3}\\rangle", 3]
}
}
});
</script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
</body>
</html>


@@ -0,0 +1,417 @@
<!DOCTYPE html>
<html lang="{{ site.lang | default: "en-US" }}">
<head>
<meta charset='utf-8'>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<link rel="stylesheet" href="{{ '/assets/css/style.css?v=' | append: site.github.build_revision | relative_url }}">
<link rel="stylesheet" type="text/css" href="{{ '/assets/css/print.css' | relative_url }}" media="print">
<script async src="https://www.googletagmanager.com/gtag/js?id=G-TMTZVLLMBP"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-TMTZVLLMBP');
</script>
<!--[if lt IE 9]>
<script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<style>
#centeringDiv {
margin: auto;
max-width: 1200px;
}
#navDiv
{
display: block;
box-sizing: border-box;
padding-top: 5px;
padding-bottom: 5px;
border-bottom-width: 3px;
border-bottom-style: solid;
border-bottom-color: #F0F0F0;
}
#navDiv nav
{
float:left;
}
#navDiv::after {
content: "";
clear: both;
display: table;
}
#navDiv nav li::after
{
content: "/";
padding-left: 10px;
padding-right: 0px;
color: #808080;
}
#navDiv nav li
{
display:inline;
padding-left: 10px;
padding-right: 0px;
}
#tocColumn {
width: 350px;
position: fixed;
overflow-y: auto;
box-sizing: border-box;
display: block;
}
#tocInner {
padding: 20px;
}
#rightColumn {
padding-left: 390px;
padding-right: 40px;
padding-top: 20px;
}
.toc_root_list {
list-style-type: none;
list-style-position: outside;
background-color: initial;
padding-left: 0px;
}
.toc_list {
padding-left: 16px;
background-color: initial;
list-style-type: none;
margin-bottom: 0px;
}
.toc_item {
cursor: pointer;
user-select: none;
list-style-type: none;
padding-left: 0px;
padding-top: 5px;
}
.toc_item_expanded::before {
content: "\25be";
cursor: pointer;
}
.toc_item_collapsed::before {
content: "\25b8";
cursor: pointer;
}
.toc_item_leaf {
padding-left: 14px;
cursor: pointer;
list-style-type: none;
}
.toc_span:hover
{
color: #d5000d;
}
.tocIcon
{
vertical-align: -2.5px;
}
.editButton
{
float: right;
margin-right: 10px;
color:#808080;
}
.editIcon
{
fill: currentColor;
vertical-align: text-top;
}
#btnToggleTOC {
display: none;
width: fit-content;
margin-left: 10px;
margin-top: 10px;
padding: 10px;
border-style: solid;
border-color: #808080;
border-width: 1px;
background-color: #E8E8E8;
}
#btnToggleTOC:hover {
background-color: #F0F0E8;
}
#btnToggleTOC:active {
background-color: #D4D4D4;
}
@media screen and (max-width: 900px) {
#tocColumn {
width: 300px;
display: block;
box-sizing: border-box;
}
#rightColumn {
padding-left: 320px;
padding-right: 20px;
}
}
@media screen and (max-width: 700px) {
#tocColumn {
width: 100%;
position: static;
display: none;
border-right-style: none;
box-sizing: content-box;
}
#tocInner {
padding: 10px;
}
#rightColumn {
padding-left: 10px;
padding-right: 10px;
}
#centeringDiv {
padding-left: 0px;
}
#btnToggleTOC {
display: block;
}
}
</style>
{% seo %}
</head>
<body>
<div id="centeringDiv">
<div id="navDiv">
{% include_relative nav.html %}
<a class="editButton" title="Edit this page" href="https://github.com/{{ site.github.repository_nwo }}/edit/master/docs/{{ page.path }}">
<svg class="editIcon" height="16" viewBox="0 0 16 16" version="1.1" width="16" aria-hidden="true">
<path fill-rule="evenodd"
d="M11.013 1.427a1.75 1.75 0 012.474 0l1.086 1.086a1.75 1.75 0 010 2.474l-8.61 8.61c-.21.21-.47.364-.756.445l-3.251.93a.75.75 0 01-.927-.928l.929-3.25a1.75 1.75 0 01.445-.758l8.61-8.61zm1.414 1.06a.25.25 0 00-.354 0L10.811 3.75l1.439 1.44 1.263-1.263a.25.25 0 000-.354l-1.086-1.086zM11.189 6.25L9.75 4.81l-6.286 6.287a.25.25 0 00-.064.108l-.558 1.953 1.953-.558a.249.249 0 00.108-.064l6.286-6.286z">
</path>
</svg>
</a>
</div>
<button id="btnToggleTOC" onclick="toggleTOC()">
<svg height="16" class="tocIcon" viewBox="0 0 16 16" version="1.1" width="16" aria-hidden="true">
<path fill-rule="evenodd"
d="M2 4a1 1 0 100-2 1 1 0 000 2zm3.75-1.5a.75.75 0 000 1.5h8.5a.75.75 0 000-1.5h-8.5zm0 5a.75.75 0 000 1.5h8.5a.75.75 0 000-1.5h-8.5zm0 5a.75.75 0 000 1.5h8.5a.75.75 0 000-1.5h-8.5zM3 8a1 1 0 11-2 0 1 1 0 012 0zm-1 6a1 1 0 100-2 1 1 0 000 2z">
</path>
</svg>
Table of Contents</button>
<div id="tocColumn">
<div id="tocInner">
{% include_relative toc.html %}
</div>
</div>
<div id="rightColumn">
<section id="main_content">
{% include anchor_headings.html html=content anchorBody="" %}
</section>
<a href="javascript:;" id="_content_end_"></a>
<footer>
{% if site.github.is_project_page %}
{{ site.title | default: site.github.repository_name }} is maintained by <a
href="{{ site.github.owner_url }}">{{ site.github.owner_name }}</a><br>
{% endif %}
This page was generated by <a href="https://pages.github.com">GitHub Pages</a>.
</footer>
</div>
</div>
<script>
// Fix for IE. Make sure String has `startsWith` method.
if (!String.prototype.startsWith)
{
String.prototype.startsWith = function (searchString, position) {
position = position || 0;
return this.indexOf(searchString, position) === position;
};
}
var tocColumn = document.getElementById("tocColumn");
var rightColumn = document.getElementById("rightColumn");
function updateScroll()
{
if (window.innerWidth < 700)
{
tocColumn.style.height = "";
return;
}
var top = Math.max(0, rightColumn.getBoundingClientRect().top);
tocColumn.style.top = top + "px";
tocColumn.style.height = (window.innerHeight-top) + "px";
}
function updatePosition()
{
if (window.innerWidth > 700)
tocColumn.style.display = "";
tocColumn.style.left = rightColumn.getBoundingClientRect().left + "px";
updateScroll();
}
window.addEventListener("resize", updatePosition);
updatePosition();
var tocItemsArray = [];
var subSectionItems = [];
var selectedItem = null;
function toggleTOC() {
var tocColumn = document.getElementById("tocColumn");
if (tocColumn.style.display == "block")
tocColumn.style.display = "none";
else
tocColumn.style.display = "block";
event.stopPropagation();
}
function expandItem(e) {
if (e == selectedItem)
e.style["font-weight"] = "bold";
var childList = e.getElementsByClassName("toc_list");
if (childList.length == 0)
return;
childList[0].style.display = "block";
childList[0].style["font-weight"] = "normal";
e.setAttribute("class", "toc_item toc_item_expanded");
}
function collapseItem(e) {
var childList = e.getElementsByClassName("toc_list");
if (childList.length == 0)
return;
childList[0].style.display = "none";
e.setAttribute("class", "toc_item toc_item_collapsed");
}
function tocSpanOnClick(e)
{
if (event.srcElement != null && event.srcElement.parentElement != null)
{
var link = event.srcElement.parentElement.getAttribute("data-link");
if (link != null)
{
var poundIndex = link.indexOf("#");
if (poundIndex == -1)
window.location.href = link + ".html";
else
window.location.href = link.substr(0, poundIndex) + ".html#" + link.substr(poundIndex+1, link.length - poundIndex - 1);
}
}
event.stopPropagation();
}
function tocItemOnClick(e)
{
if (event.srcElement == null) return;
// Toggle expanded/collapsed state.
if (event.srcElement.getAttribute("class").endsWith("toc_item_collapsed"))
expandItem(event.srcElement);
else if (event.srcElement.getAttribute("class").endsWith("toc_item_expanded"))
collapseItem(event.srcElement);
event.stopPropagation();
}
var path = window.location.pathname;
var pageName = path.split("/").pop();
var currentPageID = pageName.substr(0, pageName.lastIndexOf("."));
if (currentPageID.length == 0)
currentPageID = "index";
var tocLists = document.getElementsByClassName("toc_root_list");
for (var i = 0; i < tocLists.length; i++) {
var tocList = tocLists[i];
var items = tocList.getElementsByTagName("li")
for (var j = 0; j < items.length; j++)
tocItemsArray.push(items[j]);
}
for (var i = 0; i < tocItemsArray.length; i++) {
var item = tocItemsArray[i];
if (item.getAttribute("data-link") == currentPageID)
selectedItem = item;
if (item.getElementsByTagName("li").length != 0) {
collapseItem(item);
}
else {
item.setAttribute("class", "toc_item toc_item_leaf");
}
item.addEventListener("click", tocItemOnClick);
var innerSpan = item.getElementsByTagName("span");
if (innerSpan.length != 0)
{
innerSpan[0].addEventListener("click", tocSpanOnClick);
innerSpan[0].setAttribute("class", "toc_span");
}
}
var curItem = selectedItem;
while (curItem != null) {
expandItem(curItem);
curItem = curItem.parentElement;
if (curItem != null && curItem.getAttribute("class") != null &&
curItem.getAttribute("class").startsWith("toc_list"))
curItem = curItem.parentElement;
if (curItem != null && curItem.getAttribute("class") != null &&
curItem.getAttribute("class").startsWith("toc_root_list"))
break;
}
var subItems = selectedItem.getElementsByTagName("li");
var subSectionTitles = [];
var subSectionTitleStrs = [];
for (var i = 0; i < subItems.length; i++)
{
subSectionItems.push(subItems[i]);
var title = subItems[i].getAttribute("data-link");
var pos = title.lastIndexOf("#");
title = title.substr(pos + 1);
var element = document.getElementById(title);
subSectionTitles.push(element);
subSectionTitleStrs.push(title);
}
subSectionTitles.push(document.getElementById("_content_end_"));
function isSectionFullyVisible(id)
{
var titleElement = subSectionTitles[id];
var nextTitleElement = subSectionTitles[id+1];
return (titleElement.getBoundingClientRect().top >= 0 && nextTitleElement.getBoundingClientRect().top <= window.innerHeight);
}
function findCurrentSubsection()
{
var currentSubsectionID = -1;
for (var i = 0; i < subSectionItems.length; i++) {
var titleElement = subSectionTitles[i];
if (titleElement == null)
continue;
if (titleElement.getBoundingClientRect().top < window.innerHeight * 0.12)
currentSubsectionID = i;
}
return currentSubsectionID;
}
function updateCurrentSubsection(currentSubsectionID)
{
for (var i = 0; i < subSectionItems.length; i++)
{
if (i == currentSubsectionID || isSectionFullyVisible(i))
subSectionItems[i].getElementsByTagName("span")[0].style["font-weight"] = 600;
else
subSectionItems[i].getElementsByTagName("span")[0].style["font-weight"] = 400;
}
}
function windowScroll(e)
{
updateCurrentSubsection(findCurrentSubsection());
updateScroll();
}
window.addEventListener("scroll", windowScroll);
updateCurrentSubsection(findCurrentSubsection());
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ['$$','$$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\(","\\)"] ],
},
TeX: {
Macros: {
bra: ["\\langle{#1}|", 1],
ket: ["|{#1}\\rangle", 1],
braket: ["\\langle{#1}\\rangle", 1],
bk: ["\\langle{#1}|{#2}|{#3}\\rangle", 3]
}
}
});
</script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
</body>
</html>

View File

@@ -0,0 +1,203 @@
---
---
@import "{{ site.theme }}";
a:hover {
text-decoration: underline;
}
h3 {
color: #363636;
}
h4 {
color: #363636;
}
blockquote {
background-color: #f2f2f2;
padding-top: 10px;
padding-bottom: 5px;
}
blockquote p {
font-size: 16px;
font-weight: 400;
margin-bottom: 5px;
color: #202020;
}
body {
color: initial;
text-shadow: none;
background: none;
}
#container
{
background:none;
}
.highlight .cm {
color: #148b04;
}
.highlight .cp {
color: #148b04;
}
.highlight .c1 {
color: #148b04;
}
.highlight .cs {
color: #148b04;
}
.highlight .c, .highlight .ch, .highlight .cd, .highlight .cpf {
color: #148b04;
}
.highlight .err {
color: #a61717;
background-color: #e3d2d2;
}
.highlight .gd {
color: #000000;
background-color: #ffdddd;
}
.highlight .ge {
color: #000000;
font-style: italic;
}
.highlight .gr {
color: #aa0000;
}
.highlight .gh {
color: #999999;
}
.highlight .gi {
color: #000000;
background-color: #ddffdd;
}
.highlight .go {
color: #888888;
}
.highlight .gp {
color: #555555;
}
.highlight .gu {
color: #aaaaaa;
}
.highlight .gt {
color: #aa0000;
}
.highlight .kc {
color: #1243d4;
}
.highlight .kd {
color: #1243d4;
}
.highlight .kn {
color: #1243d4;
}
.highlight .kp {
color: #1243d4;
}
.highlight .kr {
color: #1243d4;
}
.highlight .kt {
color: #1243d4;
}
.highlight .k, .highlight .kv {
color: #1243d4;
}
.highlight .m, .highlight .mb, .highlight .mx, .highlight .mi, .highlight .mf {
color: #7211c2;
}
.highlight .sa {
color: #000000;
}
.highlight .sb {
color: #d14;
}
.highlight .sc {
color: #d14;
}
.highlight .sd {
color: #d14;
}
.highlight .s2 {
color: #d14;
}
.highlight .se {
color: #d14;
}
.highlight .sh {
color: #d14;
}
.highlight .si {
color: #d14;
}
.highlight .sx {
color: #d14;
}
.highlight .sr {
color: #009926;
}
.highlight .s1 {
color: #d14;
}
.highlight .ss {
color: #990073;
}
.highlight .s, .highlight .dl {
color: #d14;
}
.highlight .na {
color: #008080;
}
.highlight .bp {
color: #999999;
}
.highlight .n {
color: black;
}
.highlight .nc {
color: #11abb9;
}
.highlight .nt {
color: #11abb9;
}
.highlight .vc {
color: #008080;
}
.highlight .vg {
color: #008080;
}
.highlight .vi {
color: #008080;
}
.highlight .nv, .highlight .vm {
color: #008080;
}
.highlight .ow {
color: #000000;
}
.highlight .o {
color: #000000;
}
.highlight .w {
color: #000000;
}
.highlight .p {color:#000000;}
code
{
background-color: initial;
border:none;
}
pre{
color: #000000;
background: #F8F8F8;
}
pre code {
color: #000000;
background-color: #F8F8F8;
}
.highlight
{
background: #F8F8F8;
}

Binary file not shown.

After

Width:  |  Height:  |  Size: 74 KiB

View File

@@ -0,0 +1,62 @@
# This script uses `slangc` to generate the core module reference documentation and push the updated
# documents to shader-slang/stdlib-reference repository.
# The stdlib-reference repository has github-pages setup so that the markdown files we generate
# in this step will be rendered as html pages by Jekyll upon a commit to the repository.
# So what we need to do here is pull the stdlib-reference repository, regenerate the markdown files
# and push the changes back to the repository.
# The generated markdown files will be located in three folders:
# - ./global-decls
# - ./interfaces
# - ./types
# In addition, slangc will generate a table of content file `toc.html` which will be copied to
# ./_includes/stdlib-reference-toc.html so that Jekyll can consume it correctly.
# If stdlib-reference folder does not exist, clone from github repo
if (-not (Test-Path ".\stdlib-reference")) {
git clone https://github.com/shader-slang/stdlib-reference/
}
else {
# If it already exists, just pull the latest changes.
cd stdlib-reference
git pull
cd ../
}
# Remove the old generated files.
Remove-Item -Path ".\stdlib-reference\global-decls" -Recurse -Force
Remove-Item -Path ".\stdlib-reference\interfaces" -Recurse -Force
Remove-Item -Path ".\stdlib-reference\types" -Recurse -Force
Remove-Item -Path ".\stdlib-reference\attributes" -Recurse -Force
# Use git describe to produce a version string and write it to _includes/version.inc.
# This file will be included by the stdlib-reference Jekyll template.
git describe --tags | Out-File -FilePath ".\stdlib-reference\_includes\version.inc" -Encoding ASCII
cd stdlib-reference
$slangPaths = @(
"../../build/RelWithDebInfo/bin/slangc.exe",
"../../build/Release/bin/slangc.exe",
"../../build/Debug/bin/slangc.exe"
)
$slangExe = $slangPaths | Where-Object { Test-Path $_ } | Select-Object -First 1
if ($slangExe) {
& $slangExe -compile-core-module -doc
Move-Item -Path ".\toc.html" -Destination ".\_includes\stdlib-reference-toc.html" -Force
git config user.email "bot@shader-slang.com"
git config user.name "Stdlib Reference Bot"
git add .
git commit -m "Update the core module reference"
git push
} else {
Write-Error "Could not find slangc executable in RelWithDebInfo or Release directories"
}
cd ../
# For local debugging only.
# Remove-Item -Path "D:\git_repo\stdlib-reference\global-decls" -Recurse -Force
# Remove-Item -Path "D:\git_repo\stdlib-reference\interfaces" -Recurse -Force
# Remove-Item -Path "D:\git_repo\stdlib-reference\types" -Recurse -Force
# Copy-Item -Path .\stdlib-reference\global-decls -Destination D:\git_repo\stdlib-reference\global-decls -Recurse -Force
# Copy-Item -Path .\stdlib-reference\interfaces -Destination D:\git_repo\stdlib-reference\interfaces -Recurse -Force
# Copy-Item -Path .\stdlib-reference\types -Destination D:\git_repo\stdlib-reference\types -Recurse -Force
# Copy-Item -Path .\stdlib-reference\_includes\stdlib-reference-toc.html -Destination D:\git_repo\stdlib-reference\_includes\stdlib-reference-toc.html -Force

View File

@@ -0,0 +1,10 @@
$job = Start-Job -ArgumentList $PSScriptRoot -ScriptBlock {
Set-Location $args[0]
$code = (Get-Content -Raw -Path "scripts/Program.cs").ToString()
$assemblies = ("System.Core", "System.IO", "System.Collections")
Add-Type -ReferencedAssemblies $assemblies -TypeDefinition $code -Language CSharp
$path = Join-Path -Path $args[0] -ChildPath "user-guide"
[toc.Builder]::Run($path);
}
Wait-Job $job
Receive-Job -Job $job

View File

@@ -0,0 +1,127 @@
#!/usr/bin/env bash
set -e
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
project_root="$(dirname "$script_dir")"
check_only=0
show_help() {
me=$(basename "$0")
cat <<EOF
$me: Build table of contents for documentation directories
Usage: $me [--help] [--source <path>] [--check-only]
Options:
--help Show this help message
--source Path to project root directory (defaults to parent of the script directory)
--check-only Check if TOC needs updating, exit 1 if changes needed
EOF
}
while [[ "$#" -gt 0 ]]; do
case $1 in
-h | --help)
show_help
exit 0
;;
--source)
project_root="$2"
shift
;;
--check-only)
check_only=1
;;
*)
echo "unrecognized argument: $1" >&2
show_help >&2
exit 1
;;
esac
shift
done
missing_bin=0
require_bin() {
local name="$1"
if ! command -v "$name" &>/dev/null; then
echo "This script needs $name, but it isn't in \$PATH" >&2
missing_bin=1
return
fi
}
require_bin "mcs"
require_bin "mono"
if [ "$missing_bin" -eq 1 ]; then
exit 1
fi
temp_dir=$(mktemp -d)
trap 'rm -rf "$temp_dir"' EXIT
docs_dir="$project_root/docs"
cat >"$temp_dir/temp_program.cs" <<EOL
$(cat "$script_dir/scripts/Program.cs")
namespace toc
{
class Program
{
static int Main(string[] args)
{
if (args.Length < 1)
{
Console.WriteLine("Please provide a directory path");
return 1;
}
try
{
Builder.Run(args[0]);
return 0;
}
catch (Exception ex)
{
Console.WriteLine(\$"Error: {ex.Message}");
return 1;
}
}
}
}
EOL
if ! mcs -r:System.Core "$temp_dir/temp_program.cs" -out:"$temp_dir/toc-builder.exe"; then
echo "Compilation of $script_dir/scripts/Program.cs failed" >&2
exit 1
fi
for dir in "user-guide"; do
if [ -d "$docs_dir/$dir" ]; then
if [ "$check_only" -eq 1 ]; then
# Ensure working directory is clean
if ! git -C "$project_root" diff --quiet "docs/$dir/toc.html" 2>/dev/null; then
echo "Working directory not clean, cannot check TOC" >&2
exit 1
fi
fi
if ! mono "$temp_dir/toc-builder.exe" "$docs_dir/$dir"; then
echo "TOC generation failed for $dir" >&2
exit 1
fi
if [ "$check_only" -eq 1 ]; then
if ! git -C "$project_root" diff --quiet "docs/$dir/toc.html" 2>/dev/null; then
git -C "$project_root" diff --color "docs/$dir/toc.html"
git -C "$project_root" checkout -- "docs/$dir/toc.html" 2>/dev/null
exit 1
fi
fi
else
echo "Directory $dir not found" >&2
fi
done

View File

@@ -0,0 +1,442 @@
# Building Slang From Source
### TLDR
`cmake --workflow --preset release` to configure, build, and package a release
version of Slang.
## Prerequisites:
Please install:
- CMake (3.26 preferred, but 3.22 works[^1])
- A C++ compiler with support for C++17. GCC, Clang and MSVC are supported
- A CMake compatible backend, for example Visual Studio or Ninja
- Python3 (a dependency for building spirv-tools)
Optional dependencies for tests include
- CUDA
- OptiX
- NVAPI
- Aftermath
- X11
Other dependencies are sourced from submodules in the [./external](./external)
directory.
## Get the Source Code
Clone [this](https://github.com/shader-slang/slang) repository. Make sure to
fetch the submodules also.
```bash
git clone https://github.com/shader-slang/slang --recursive
```
You will need the git tags from this repository, otherwise versioning
information (including the Slang modules directory name and the library
filenames on macOS and Linux) will be incorrect. The above command should fetch
them for you, but if you're fetching from a fork you may need to explicitly
fetch the latest tags from the shader-slang repository with:
```bash
git fetch https://github.com/shader-slang/slang.git 'refs/tags/*:refs/tags/*'
```
## Configure and build
> This section assumes CMake 3.25 or greater; if you're on a lower version,
> please see [building with an older cmake](#building-with-an-older-cmake)
For a Ninja based build system (all platforms) run:
```bash
cmake --preset default
cmake --build --preset releaseWithDebugInfo # or --preset debug, or --preset release
```
For Visual Studio run:
```bash
cmake --preset vs2022 # or 'vs2019' or 'vs2026'
start devenv ./build/slang.sln # to optionally open the project in Visual Studio
cmake --build --preset releaseWithDebugInfo # to build from the CLI, could also use --preset release or --preset debug
```
There are also `*-dev` variants like `vs2022-dev` and `vs2026-dev` which turn on features to aid
debugging.
### WebAssembly build
In order to produce a WebAssembly build of Slang, Slang needs to be compiled with the
[Emscripten SDK](https://github.com/emscripten-core/emsdk). You can find more
information about [Emscripten](https://emscripten.org/).
You need to clone the EMSDK repo, then install and activate the latest version.
```bash
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
```
For non-Windows platforms
```bash
./emsdk install latest
./emsdk activate latest
```
For Windows
```cmd
emsdk.bat install latest
emsdk.bat activate latest
```
After EMSDK is activated, Slang needs to be built in a cross compiling setup:
- build the `generators` target for the build platform
- configure the build with `emcmake` for the host platform
- build for the host platform.
> Note: For more details on cross compiling please refer to the
> [cross-compiling](docs/building.md#cross-compiling) section.
```bash
# Build generators.
cmake --workflow --preset generators --fresh
mkdir generators
cmake --install build --prefix generators --component generators
# Configure the build with emcmake.
# emcmake is available only when emsdk_env setup the environment correctly.
pushd ../emsdk
source ./emsdk_env # For Windows, emsdk_env.bat
popd
emcmake cmake -DSLANG_GENERATORS_PATH=generators/bin --preset emscripten -G "Ninja"
# Build slang-wasm.js and slang-wasm.wasm in build.em/Release/bin
cmake --build --preset emscripten --target slang-wasm
```
> Note: If the last build step fails, try running the command that `emcmake`
> outputs directly.
## Installing
Build targets may be installed using cmake:
```bash
cmake --build . --target install
```
This should install `SlangConfig.cmake`, which allows `find_package` to work.
`SlangConfig.cmake` defines the `SLANG_EXECUTABLE` variable, which points to the `slangc`
executable, and also defines the `slang::slang` target to link against.
For now, `slang::slang` is the only exported target defined in the config.
Example usage
```cmake
find_package(slang REQUIRED PATHS ${your_cmake_install_prefix_path} NO_DEFAULT_PATH)
# slang_FOUND should be automatically set
target_link_libraries(yourLib PUBLIC
slang::slang
)
```
## Testing
```bash
build/Debug/bin/slang-test
```
See the [documentation on testing](../tools/slang-test/README.md) for more information.
## Debugging
See the [documentation on debugging](/docs/debugging.md).
## Distributing
### Versioned Libraries
As of v2025.21, the Slang libraries on **Mac** and **Linux** use versioned
filenames. The public ABI for Slang libraries in general is not currently
stable, so in accordance with semantic versioning conventions, the major
version number for dynamically linkable libraries is currently 0. Due to the
unstable ABI, releases are designed so that downstream users will be linked
against the fully versioned library filenames (e.g.,
`libslang-compiler.so.0.2025.21` instead of `libslang-compiler.so`).
Slang libraries for **Windows** do not have an explicit version in the
library filename, but the same guidance about stability of the ABI applies.
Downstream users of Slang distributing their products as binaries should
therefore **on all platforms, including Windows** redistribute the Slang
libraries they linked against, or otherwise communicate the specific version
dependency to their users. It is *not the case* that a user of your product can
just install any recent Slang release and have an installation of Slang that
works for any given binary.
## More niche topics
### CMake options
| Option | Default | Description |
|-----------------------------------|----------------------------|----------------------------------------------------------------------------------------------|
| `SLANG_VERSION` | Latest `v*` tag | The project version, detected using git if available |
| `SLANG_EMBED_CORE_MODULE` | `TRUE` | Build slang with an embedded version of the core module |
| `SLANG_EMBED_CORE_MODULE_SOURCE` | `TRUE` | Embed the core module source in the binary |
| `SLANG_ENABLE_DXIL` | `TRUE` | Enable generating DXIL using DXC |
| `SLANG_ENABLE_ASAN` | `FALSE` | Enable ASAN (address sanitizer) |
| `SLANG_ENABLE_COVERAGE` | `FALSE` | Enable code coverage instrumentation |
| `SLANG_ENABLE_FULL_IR_VALIDATION` | `FALSE` | Enable full IR validation (SLOW!) |
| `SLANG_ENABLE_IR_BREAK_ALLOC` | `FALSE` | Enable IR BreakAlloc functionality for debugging. |
| `SLANG_ENABLE_GFX` | `TRUE` | Enable gfx targets |
| `SLANG_ENABLE_SLANGD` | `TRUE` | Enable language server target |
| `SLANG_ENABLE_SLANGC` | `TRUE` | Enable standalone compiler target |
| `SLANG_ENABLE_SLANGI` | `TRUE` | Enable Slang interpreter target |
| `SLANG_ENABLE_SLANGRT` | `TRUE` | Enable runtime target |
| `SLANG_ENABLE_SLANG_GLSLANG` | `TRUE` | Enable glslang dependency and slang-glslang wrapper target |
| `SLANG_ENABLE_TESTS` | `TRUE` | Enable test targets, requires SLANG_ENABLE_GFX, SLANG_ENABLE_SLANGD and SLANG_ENABLE_SLANGRT |
| `SLANG_ENABLE_EXAMPLES` | `TRUE` | Enable example targets, requires SLANG_ENABLE_GFX |
| `SLANG_LIB_TYPE` | `SHARED` | How to build the slang library |
| `SLANG_ENABLE_RELEASE_DEBUG_INFO` | `TRUE` | Enable generating debug info for Release configs |
| `SLANG_ENABLE_RELEASE_LTO` | `FALSE` | Enable LTO for Release builds |
| `SLANG_ENABLE_SPLIT_DEBUG_INFO` | `TRUE` | Enable generating split debug info for Debug and RelWithDebInfo configs |
| `SLANG_SLANG_LLVM_FLAVOR` | `FETCH_BINARY_IF_POSSIBLE` | How to set up llvm support |
| `SLANG_SLANG_LLVM_BINARY_URL` | System dependent | URL specifying the location of the slang-llvm prebuilt library |
| `SLANG_GENERATORS_PATH` | `` | Path to an installed `all-generators` target for cross compilation |
The following options relate to optional dependencies for additional backends
and running additional tests. Left unchanged they are auto detected, however
they can be set to `OFF` to prevent their usage, or set to `ON` to make it an
error if they can't be found.
| Option | CMake hints | Notes |
|--------------------------|--------------------------------|----------------------------------------------------------------------------------------------|
| `SLANG_ENABLE_CUDA` | `CUDAToolkit_ROOT` `CUDA_PATH` | Enable running tests with the CUDA backend, doesn't affect the targets Slang itself supports |
| `SLANG_ENABLE_OPTIX` | `Optix_ROOT_DIR` | Requires CUDA |
| `SLANG_ENABLE_NVAPI` | `NVAPI_ROOT_DIR` | Only available for builds targeting Windows |
| `SLANG_ENABLE_AFTERMATH` | `Aftermath_ROOT_DIR` | Enable Aftermath in GFX, and add aftermath crash example to project |
| `SLANG_ENABLE_XLIB` | | |
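For example, a configure step pinning two of these options explicitly might look like the following (a hypothetical invocation; the option names come from the table above, and the `default` preset is the one described earlier in this document):

```bash
cmake --preset default -DSLANG_ENABLE_CUDA=ON -DSLANG_ENABLE_XLIB=OFF
```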
### Advanced options
| Option | Default | Description |
|------------------------------------|---------|--------------------------------------------------------------------------------------------------------------------------------|
| `SLANG_ENABLE_DX_ON_VK`            | `FALSE` | Enable running the DX11 and DX12 tests on non-Windows platforms via vkd3d-proton, requires system-provided d3d headers |
| `SLANG_ENABLE_SLANG_RHI` | `TRUE` | Enable building and using [slang-rhi](https://github.com/shader-slang/slang-rhi) for tests |
| `SLANG_USE_SYSTEM_MINIZ` | `FALSE` | Build using system Miniz library instead of the bundled version in [./external](./external) |
| `SLANG_USE_SYSTEM_LZ4` | `FALSE` | Build using system LZ4 library instead of the bundled version in [./external](./external) |
| `SLANG_USE_SYSTEM_VULKAN_HEADERS` | `FALSE` | Build using system Vulkan headers instead of the bundled version in [./external](./external) |
| `SLANG_USE_SYSTEM_SPIRV_HEADERS` | `FALSE` | Build using system SPIR-V headers instead of the bundled version in [./external](./external) |
| `SLANG_USE_SYSTEM_UNORDERED_DENSE` | `FALSE` | Build using system unordered dense instead of the bundled version in [./external](./external) |
| `SLANG_SPIRV_HEADERS_INCLUDE_DIR` | `` | Use this specific path to SPIR-V headers instead of the bundled version in [./external](./external) |
### LLVM Support
There are several options for getting llvm-support:
- Use a prebuilt binary slang-llvm library:
`-DSLANG_SLANG_LLVM_FLAVOR=FETCH_BINARY` or `-DSLANG_SLANG_LLVM_FLAVOR=FETCH_BINARY_IF_POSSIBLE` (this is the default)
- You can set `SLANG_SLANG_LLVM_BINARY_URL` to point to a local
`libslang-llvm.so/slang-llvm.dll` or set it to a URL of an zip/archive
containing such a file
- If this isn't set then the build system tries to download it from the
release on github matching the current tag. If such a tag doesn't exist
or doesn't have the correct os\*arch combination then the latest release
will be tried.
- If `SLANG_SLANG_LLVM_FLAVOR` is `FETCH_BINARY_IF_POSSIBLE` and a
  prebuilt binary can't be found, then the build will proceed
  as though `DISABLE` was chosen
- Use a system supplied LLVM: `-DSLANG_SLANG_LLVM_FLAVOR=USE_SYSTEM_LLVM`, you
must have llvm-21.1 and a matching libclang installed. It's important that
either:
- You don't end up linking to a dynamic libllvm.so, this will almost
certainly cause multiple versions of LLVM to be loaded at runtime,
leading to errors like `opt: CommandLine Error: Option
'asm-macro-max-nesting-depth' registered more than once!`. Avoid this by
compiling LLVM without the dynamic library.
- Nothing else which may be linked in (for example Mesa) also dynamically
  loads the same LLVM object
- Do not enable LLVM support: `-DSLANG_SLANG_LLVM_FLAVOR=DISABLE`
To build only a standalone slang-llvm, you can run:
```bash
cmake --workflow --preset slang-llvm
```
This will generate `build/dist-release/slang-slang-llvm.zip` containing the
library. This, of course, uses the system LLVM to build slang-llvm, otherwise
it would just be a convoluted way to download a prebuilt binary.
### Cross compiling
Slang generates some code at build time, using generators built from this
codebase. Due to this, for cross compilation one must already have built these
generators for the build platform. Build them with the `generators` preset, and
pass the install path to the cross building CMake invocation using
`SLANG_GENERATORS_PATH`.
Non-Windows platforms:
```bash
# build the generators
cmake --workflow --preset generators --fresh
mkdir build-platform-generators
cmake --install build --config Release --prefix build-platform-generators --component generators
# reconfigure, pointing to these generators
# Here is also where you should set up any cross compiling environment
cmake \
--preset default \
--fresh \
-DSLANG_GENERATORS_PATH=build-platform-generators/bin \
-Dwhatever-other-necessary-options-for-your-cross-build \
# for example \
-DCMAKE_C_COMPILER=my-arch-gcc \
-DCMAKE_CXX_COMPILER=my-arch-g++
# perform the final build
cmake --workflow --preset release
```
Windows
```bash
# build the generators
cmake --workflow --preset generators --fresh
mkdir build-platform-generators
cmake --install build --config Release --prefix build-platform-generators --component generators
# reconfigure, pointing to these generators
# Here is also where you should set up any cross compiling environment
# For example
./vcvarsamd64_arm64.bat
cmake \
--preset default \
--fresh \
-DSLANG_GENERATORS_PATH=build-platform-generators/bin \
-Dwhatever-other-necessary-options-for-your-cross-build
# perform the final build
cmake --workflow --preset release
```
### Example cross compiling with MSVC to windows-aarch64
One option is to build using the ninja generator, which requires providing the
native and cross environments via `vcvarsall.bat`
```bash
vcvarsall.bat
cmake --workflow --preset generators --fresh
mkdir generators
cmake --install build --prefix generators --component generators
vcvarsall.bat x64_arm64
cmake --preset default --fresh -DSLANG_GENERATORS_PATH=generators/bin
cmake --workflow --preset release
```
Another option is to build using the Visual Studio generator which can find
this automatically
```
cmake --preset vs2022 # or --preset vs2019, vs2026
cmake --build --preset generators # to build from the CLI
cmake --install build --prefix generators --component generators
rm -rf build # The Visual Studio generator will complain if this is left over from a previous build
cmake --preset vs2022 --fresh -A arm64 -DSLANG_GENERATORS_PATH=generators/bin
cmake --build --preset release
```
### Nix
This repository contains a [Nix](https://nixos.org/)
[flake](https://wiki.nixos.org/wiki/Flakes) (not officially supported or
tested), which provides the necessary prerequisites for local development. Also,
if you use [direnv](https://direnv.net/), you can run the following commands to
have the Nix environment automatically activate when you enter your clone of
this repository:
```bash
echo 'use flake' > .envrc
direnv allow
```
## Building with an older CMake
Because older CMake versions don't support all the features we want to use in
CMakePresets, you'll have to do without the presets. Something like the following
```bash
cmake -B build -G Ninja
cmake --build build -j
```
## Specific supported compiler versions
<!---
Please keep the exact formatting '_Foo_ xx.yy is tested in CI' as there is a
script which checks that this is still up to date.
-->
_GCC_ 11.4 and 13.3 are tested in CI; 11.4 is the recommended minimum version. GCC 10 is
supported on a best-effort basis, i.e. PRs supporting this version are
encouraged but it isn't a continuously maintained setup.
_MSVC_ 19 is tested in CI and is the recommended minimum version.
_Clang_ 17.0 is tested in CI and is the recommended minimum version.
## Static linking against libslang-compiler
To build statically, set the `SLANG_LIB_TYPE` flag in CMake to `STATIC`.
If linking against a static `libslang-compiler.a` you will need to link against some
dependencies also if you're not already incorporating them into your project.
```
${SLANG_DIR}/build/Release/lib/libslang-compiler.a
${SLANG_DIR}/build/Release/lib/libcompiler-core.a
${SLANG_DIR}/build/Release/lib/libcore.a
${SLANG_DIR}/build/external/miniz/libminiz.a
${SLANG_DIR}/build/external/lz4/build/cmake/liblz4.a
```
## Deprecation of libslang and slang.dll filenames
In Slang v2025.21, the primary library for Slang was renamed, from
`libslang.so` and `slang.dll` to `libslang-compiler.so` and
`slang-compiler.dll`. (A similar change was made for macOS.) The reason behind
this change was to address a conflict on the Linux target, where the S-Lang
library of the same name is commonly preinstalled on Linux distributions. The
same issue affected macOS, to a lesser extent, where the S-Lang library could
be installed via `brew`. To make the Slang library name predictable and
simplify downstream build logic, the Slang library name was changed on all
platforms.
A change like this requires a period of transition, so on a **temporary**
basis: Linux and macOS packages now include symlinks from the old filename to
the new one. For Windows, a proxy library is provided with the old name, that
redirects all functions to the new `slang-compiler.dll`. The rationale here is
that applications with a complex dependency graph may have some components
still temporarily using `slang.dll`, while others have been updated to use
`slang-compiler.dll`. Using a proxy library for `slang.dll` ensures that all
components are using the same library, and avoids any potential state or
heap-related issues from an executable sharing data structures between the two
libraries.
These backwards compatibility affordances, namely the proxy `slang.dll` and
`slang.lib` (for Windows) and the `libslang.so` and `libslang.dylib` symlinks
(for Linux and macOS), **will be removed at the end of 2026**. Until that time,
they will be present in the github release packages for downstream use.
Downstream packaging may or may not choose to distribute them, at their
discretion. **We strongly encourage downstream users of Slang to move to the
new library names as soon as they are able.**
## Notes
[^1]: Below 3.25, CMake lacks the ability to mark directories as being
system directories (https://cmake.org/cmake/help/latest/prop_tgt/SYSTEM.html#prop_tgt:SYSTEM),
this leads to an inability to suppress warnings originating in the
dependencies in `./external`, so be prepared for some additional warnings.
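For reference, a minimal sketch of what CMake 3.25+ enables here (the directory name is hypothetical):

```cmake
# CMake >= 3.25: mark a vendored dependency as SYSTEM so that warnings
# originating in its headers are suppressed in our own targets.
add_subdirectory(external/some-dependency SYSTEM)
```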

View File

@@ -0,0 +1,36 @@
# Our CI
There are github actions for building and testing slang.
## Tests
Most configurations run a restricted set of tests, however on some self hosted
runners we run the full test suite, as well as running Falcor's test suite with
the new slang build.
## Building LLVM
We require a static build of LLVM for building slang-llvm; we build and cache
this in all workflow runs. Since this changes infrequently, the cache is almost
always hit. A cold build takes about an hour on the slowest platform. The
cached output is a few hundred MB, so conceivably if we add many more platforms
we might be caching more than the 10GB github allowance, which would
necessitate a somewhat more complicated approach to building and tracking outputs here.
For slang-llvm, this is handled the same as any other dependency, except on
Windows Debug builds, where we are required by the differences in Debug/Release
standard libraries to always make a release build; this is noted in the CI
action YAML file.
Note that we don't use sccache while building LLVM, as it changes very
infrequently. The caching of LLVM is done by caching the final build product
only.
## sccache
> Due to reliability issues, we are not currently using sccache; this is
> historical/aspirational.
The CI actions use sccache, keyed on compiler and platform; this runs on all
configurations and significantly speeds up small source change builds. This
cache can be safely missed without a large impact on build times.

File diff suppressed because it is too large

View File

@@ -0,0 +1,650 @@
Slang CPU Target Support
========================
Slang has preliminary support for producing CPU source and binaries.
# Features
* Can compile C/C++/Slang source to binaries (executables, shared libraries or [directly executable](#host-callable))
* Does *not* require a C/C++ compiler to be installed if [slang-llvm](#slang-llvm) is available (as distributed with slang binary distributions)
* Can compile Slang source into C++ source code
* Supports compute style shaders
# Limitations
These limitations apply to Slang transpiling to C++.
* Barriers are not supported (making these work would require an ABI change)
* Atomics are not currently supported
* Limited support for [out of bounds](#out-of-bounds) accesses handling
* Entry points cannot be named `main` (this is because downstream C++ compilers expect a regular `main`)
* `float16_t` type is not currently supported
For current C++ source output, the compiler needs to support partial specialization.
# How it works
The initial version works by using a 'downstream' C/C++ compiler. A C++ compiler does *not* in general need to be installed on a system to compile and execute code as long as [slang-llvm](#slang-llvm) is available. A [regular C/C++](#regular-cpp) compiler can also be used, allowing access to tooling, such as profiling and debuggers, as well as being able to use regular host development features such as linking, libraries, shared libraries/dlls and executables.
The C/C++ backend can be directly accessed much like 'dxc', 'fxc' or 'glslang' can, using the pass-through mechanism with the following new backends...
```
SLANG_PASS_THROUGH_CLANG, ///< Clang C/C++ compiler
SLANG_PASS_THROUGH_VISUAL_STUDIO, ///< Visual studio C/C++ compiler
SLANG_PASS_THROUGH_GCC, ///< GCC C/C++ compiler
SLANG_PASS_THROUGH_LLVM, ///< slang-llvm 'compiler' - includes LLVM and Clang
SLANG_PASS_THROUGH_GENERIC_C_CPP, ///< Generic C or C++ compiler, which is decided by the source type
```
Sometimes it is not important which C/C++ compiler is used, and this can be specified via the 'Generic C/C++' option. This will aim to use the compiler that is most likely binary compatible with the compiler that was used to build the Slang binary being used.
To make it possible for Slang to produce CPU code, in this first iteration we convert Slang code into C/C++ which can subsequently be compiled into CPU code. If source is desired instead of a binary this can be specified via the SlangCompileTarget. These can be specified on the `slangc` command line as `-target cpp`.
When using the 'pass through' mode for a CPU based target it is currently necessary to set an entry point, even though it's basically ignored.
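As a hypothetical command-line sketch combining the options above (file and entry-point names are placeholders; check `slangc -h` for the authoritative flag list):

```bash
# Transpile a Slang compute shader to C++ source
slangc kernel.slang -entry computeMain -target cpp -o kernel.cpp

# Or compile straight to a shared library via a downstream C/C++ compiler
slangc kernel.slang -entry computeMain -target sharedlib -o kernel.so
```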
In the API the `SlangCompileTarget`s are
```
SLANG_C_SOURCE ///< The C language
SLANG_CPP_SOURCE ///< The C++ language
SLANG_CPP_HEADER ///< The C++ language (header)
SLANG_HOST_CPP_SOURCE, ///< C++ code for `host` style
```
Using the `-target` command line option
* `C_SOURCE`: c
* `CPP_SOURCE`: cpp,c++,cxx
* `CPP_HEADER`: hpp
* `HOST_CPP_SOURCE`: host-cpp,host-c++,host-cxx
Note! Output of C source is not currently supported.
If a CPU binary is required this can be specified as a `SlangCompileTarget` of
```
SLANG_EXECUTABLE ///< Executable (for hosting CPU/OS)
SLANG_SHADER_SHARED_LIBRARY ///< A shared library/Dll (for hosting CPU/OS)
SLANG_SHADER_HOST_CALLABLE ///< A CPU target that makes `compute kernel` compiled code available to be run immediately
SLANG_HOST_HOST_CALLABLE ///< A CPU target that makes `scalar` compiled code available to be run immediately
SLANG_OBJECT_CODE, ///< Object code that can be used for later linking
```
Using the `-target` command line option
* `EXECUTABLE`: exe, executable
* `SHADER_SHARED_LIBRARY`: sharedlib, sharedlibrary, dll
* `SHADER_HOST_CALLABLE`: callable, host-callable
* `OBJECT_CODE`: object-code
* `HOST_HOST_CALLABLE`: host-host-callable
Using `host-callable` types from the command line does little, other than test that such code compiles and can be loaded for host execution.
For launching [shader like](#compile-style) Slang code on the CPU, there typically needs to be binding of values passed to the entry point function. How this works is described in the [ABI section](#abi). Functions *can* be executed directly, but care must be taken to [export](#visibility) them and to ensure there isn't an issue with [context threading](#context-threading).
If a binary target is requested, the binary contents can be returned in an ISlangBlob just like for other targets. When using a [regular C/C++ compiler](#regular-cpp) the CPU binary typically must be saved as a file and then potentially marked for execution by the OS. It may be possible to load shared libraries or dlls from memory - but doing so is a non-standard feature that requires unusual workarounds. If possible it is typically fastest and easiest to use [slang-llvm](#slang-llvm) to directly execute Slang or C/C++ code.
## <a id="compile-style"/>Compilation Styles
There are currently two styles of *compilation style* supported - `host` and `shader`.
The `shader` style implies
* The code *can* be executed in a GPU-kernel like execution model, launched across multiple threads (as described in the [ABI](#abi))
* Currently no reference counting
* Only functionality from the Slang core module, built in HLSL or anything supplied by a [COM interfaces](#com-interface) is available
* Currently [slang-llvm](#slang-llvm) only supports the `shader` style
The `host` style implies
* Execution style is akin to more regular CPU scalar code
* Typically requires linking with `slang-rt` and use of `slang-rt` types such as `Slang::String`
* Allows use of `new`
* Allows the use of `class` for reference counted types
* COM interfaces are reference counted
The styles as used with [host-callable](#host-callable) are indicated via the API by
```
SLANG_SHADER_HOST_CALLABLE ///< A CPU target that makes `compute kernel` compiled code available to be run immediately
SLANG_HOST_HOST_CALLABLE ///< A CPU target that makes `scalar` compiled code available to be run immediately
```
Or via the `-target` command line options
* For 'shader': `callable`, `host-callable`
* For 'host': `host-host-callable`
For an example of the `host` style please look at "examples/cpu-hello-world".
## <a id="host-callable"/>Host callable
Slang supports `host-callable` compilation targets which allow for the direct execution of the compiled code on the CPU. Currently this style of execution is supported if [slang-llvm](#slang-llvm) or a [regular C/C++ compiler](#regular-cpp) is available.
There are currently two [compilation styles](#compile-style) supported.
In order to call into `host-callable` code after compilation it's necessary to access the result via the `ISlangSharedLibrary` interface.
Please look at the [ABI](#abi) section for more specifics around ABI usage especially for `shader` [compile styles](#compile-style).
```C++
slang::ICompileRequest* request = ...;
const SlangResult compileRes = request->compile();
// Even if there were no errors that forced compilation to fail, the
// compiler may have produced "diagnostic" output such as warnings.
// We will go ahead and print that output here.
//
if(auto diagnostics = request->getDiagnosticOutput())
{
printf("%s", diagnostics);
}
// Get the 'shared library' (note that this doesn't necessarily have to be implemented as a shared library
// it's just an interface to executable code).
ComPtr<ISlangSharedLibrary> sharedLibrary;
SLANG_RETURN_ON_FAIL(request->getTargetHostCallable(0, sharedLibrary.writeRef()));
// We can now find exported functions/variables via findSymbolAddressByName
// For a __global public __extern_cpp int myGlobal;
{
auto myGlobalPtr = (int*)sharedLibrary->findSymbolAddressByName("myGlobal");
if (myGlobalPtr)
{
*myGlobalPtr = 10;
}
}
// To get a function
//
// public __extern_cpp int add(int a, int b);
// Test a free function
{
typedef int (*AddFunc)(int a, int b);
auto func = (AddFunc)sharedLibrary->findFuncByName("add");
if (func)
{
// Let's add!
int c = func(10, 20);
}
}
```
## <a id="slang-llvm"/>slang-llvm
`slang-llvm` is a special Slang version of [LLVM](https://llvm.org/). It's current main purpose is to allow compiling C/C++ such that it is [directly available](#host-callable) for execution using the LLVM JIT feature. If `slang-llvm` is available it is the default downstream compiler for [host-callable](#host-callable). This is because it allows for faster compilation, avoids the file system, and can execute the compiled code directly. [Regular C/C++ compilers](#regular-cpp) can be used for [host-callable](#host-callable) but requires writing source files to the file system and creating/loading shared-libraries/dlls to make the feature work. Additionally using `slang-llvm` avoids the need for a C/C++ compiler installed on a target system.
`slang-llvm` contains the Clang C++ compiler, so it is possible to also compile and execute C/C++ code in the [host-callable](#host-callable) style.
Limitations of using `slang-llvm`
* Can only currently be used for [shader style](#compile-style)
* Cannot produce object files, libraries, OS executables or binaries
* Is *limited* because it is not possible to directly access libraries such as the C or C++ standard libraries (see [COM interface](#com-interface) for a work-around)
* It's not possible to source debug into `slang-llvm` compiled code running on the JIT (see [debugging](#debugging) for a work-around)
* It's not currently possible to return the result as an ISlangBlob representation
You can detect if `slang-llvm` is available via
```C++
slang::IGlobalSession* slangSession = ...;
const bool hasSlangLlvm = SLANG_SUCCEEDED(slangSession->checkPassThroughSupport(SLANG_PASS_THROUGH_LLVM));
```
## <a id="regular-cpp"/>Regular C/C++ compilers
Slang can work with regular C/C++ 'downstream' compilers. It has been tested to work with Visual Studio, Clang and G++/Gcc on Windows and Linux.
Under the covers, when Slang is used to generate a binary via a C/C++ compiler, it must do so through the file system. Currently this means the source (say generated by Slang) and the binary (produced by the C/C++ compiler) must all be files. To make this work Slang uses temporary files. The reasoning for hiding this mechanism, other than simplicity, is that it allows [slang-llvm](#slang-llvm) to be used without any changes.
## <a id="visibility"/>Visibility
In a typical Slang [shader like](#compile-style) scenario, functionality is exposed via entry points. It can be convenient and desirable to be able to call Slang functions directly from application code, and not just via entry points. By default non entry point functions are *removed* if they are not reachable by the specified entry point. Additionally for non entry point functions Slang typically generates function names that differ from the original name.
To work around these two issues the `public` and `__extern_cpp` modifiers can be used.
`public` makes the variable or function visible outside of the module even if it isn't used within the module. So that the function works, any function or variable it accesses is also kept around.
Note! Some care is needed here around [context threading](#context-threading) - if a function or any function a function accesses requires state held in the context, the signature of the function will be altered to include the context as the first parameter.
Making a function or variable `public` does not mean that the name remains the same. To indicate that the name should not be altered use the `__extern_cpp` modifier. For example
```
// myGlobal will be visible to the application (note the __global modifier additionally means it has C++ global behavior)
__global public __extern_cpp int myGlobal;
// myFunc is available to the application
public __extern_cpp int myFunc(int a)
{
return a * a;
}
```
## <a id="com-interface"/>COM interface support
Slang has preliminary support for [Component Object Model (COM)](https://en.wikipedia.org/wiki/Component_Object_Model) interfaces in CPU code.
```
[COM]
interface IDoThings
{
int doThing(int a, int b);
int calcHash(NativeString in);
void printMessage(NativeString nativeString);
}
```
This support provides a way for an application to provide access to functionality in the application runtime - essentially it allows Slang code to call into application code. To do this a COM interface can be created that exposes the desired functionality. The interface/s can be made available through any of the normal mechanisms - such as through a constant buffer variable. Additionally [`__global`](#actual-global) provides a way to make such functionality available to Slang code without the need for [context threading](#context-threading).
The example "examples/cpu-com-example" shows this at work.
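As an illustration, a host-side implementation of an interface like `IDoThings` might look as follows. This is a minimal sketch: the class name and hash function are invented here, and a real implementation would also derive from `ISlangUnknown` and provide `queryInterface`/`addRef`/`release`.

```C++
#include <cassert>
#include <cstdio>

// Hypothetical host-side counterpart of the Slang [COM] interface above.
struct IDoThings
{
    virtual int doThing(int a, int b) = 0;
    virtual int calcHash(const char* in) = 0;
    virtual void printMessage(const char* msg) = 0;
    virtual ~IDoThings() = default;
};

struct DoThingsImpl : IDoThings
{
    int doThing(int a, int b) override { return a + b; }
    int calcHash(const char* in) override
    {
        // Simple FNV-1a style hash, for illustration only
        unsigned hash = 2166136261u;
        for (; *in; ++in) { hash = (hash ^ (unsigned char)*in) * 16777619u; }
        return (int)hash;
    }
    void printMessage(const char* msg) override { printf("%s\n", msg); }
};
```

An instance of such a class could then be bound to the Slang code through a constant buffer variable or a `__global`, as described above.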
## <a id="actual-global"/>Global support
The Slang language is based on the HLSL language. This heritage means that globals have slightly different meaning to typical C/C++ usage.
```
int myGlobal; ///< A constant value stored in a constant buffer
static int staticMyGlobal; ///< A global that cannot be seen by the application
static const int staticConstMyGlobal; ///< A fixed value
```
The variable `myGlobal` will be a member of a constant buffer, meaning its value can only change via bindings and not during execution. For some uses having `myGlobal` in the constant buffer might be appropriate, for example
* Its use is reached from a [shader style](#compile-style) entry point
* Its value is constant across the launch
In Slang a variable can be declared as global in the C/C++ sense via the `__global` modifier. For example
```
__global int myGlobal;
```
Doing so means
* `myGlobal` will not be defined in the constant buffer
* It can be used in functions that do not have access to the [constant buffer](#context-threading)
* It can be modified in the kernel
* Can only be used on CPU targets (currently `__global` is not supported on the GPU targets)
One disadvantage of using `__global` arises in multi-threaded environments: with multiple launches on multiple CPU threads, there is only one global, which will likely cause problems unless the global value is the same across all threads.
It may be useful to set a global directly via host code, without having to write a function to enable the access. This is possible by using [`public`](#visibility) and [`__extern_cpp`](#visibility) modifiers. For example
```
__global public __extern_cpp int myGlobal;
```
The global can now be set from host code via
```C++
slang::ICompileRequest* request = ...;
// Get the 'shared library' (note that this doesn't necessarily have to be implemented as a shared library
// it's just an interface to executable code).
ComPtr<ISlangSharedLibrary> sharedLibrary;
SLANG_RETURN_ON_FAIL(request->getTargetHostCallable(0, sharedLibrary.writeRef()));
// Set myGlobal to 20
{
auto myGlobalPtr = (int*)sharedLibrary->findSymbolAddressByName("myGlobal");
*myGlobalPtr = 20;
}
```
In terms of reflection `__global` variables are not visible.
## NativeString
Slang supports a rich 'String' type when using the [host style](#compile-style), which for C++ targets is implemented as the `Slang::String` C++ type. The type is only available on CPU targets that support `slang-rt`.
Some limited String-like support is available via `NativeString` type which for C/C++ CPU targets is equivalent to `const char*`. For GPU targets this will use the same hash mechanism as normally available.
`NativeString` is supported by all [shader compilation styles](#compile-style) including [slang-llvm](#slang-llvm).
TODO(JS): What happens with String with shader compile style on CPU? Shouldn't it be the same as GPU (and reflected as such in reflection)?
## Debugging
It is currently not possible to step into LLVM-JIT code when using [slang-llvm](#slang-llvm). Fortunately it is possible to step into code compiled via a [regular C/C++ compiler](#regular-cpp).
Below is a code snippet showing how to switch to a [regular C/C++ compiler](#regular-cpp) at runtime.
```C++
SlangPassThrough findRegularCppCompiler(slang::IGlobalSession* slangSession)
{
// Current list of 'regular' C/C++ compilers
const SlangPassThrough cppCompilers[] =
{
SLANG_PASS_THROUGH_VISUAL_STUDIO,
SLANG_PASS_THROUGH_GCC,
SLANG_PASS_THROUGH_CLANG,
};
// Do we have a C++ compiler
for (const auto compiler : cppCompilers)
{
if (SLANG_SUCCEEDED(slangSession->checkPassThroughSupport(compiler)))
{
return compiler;
}
}
return SLANG_PASS_THROUGH_NONE;
}
SlangResult useRegularCppCompiler(slang::IGlobalSession* session)
{
const auto regularCppCompiler = findRegularCppCompiler(session);
if (regularCppCompiler != SLANG_PASS_THROUGH_NONE)
{
session->setDownstreamCompilerForTransition(SLANG_CPP_SOURCE, SLANG_SHADER_HOST_CALLABLE, regularCppCompiler);
session->setDownstreamCompilerForTransition(SLANG_CPP_SOURCE, SLANG_HOST_HOST_CALLABLE, regularCppCompiler);
return SLANG_OK;
}
return SLANG_FAIL;
}
```
It is generally recommended to use [slang-llvm](#slang-llvm) if that is appropriate, but to switch to using a [regular C/C++ compiler](#regular-cpp) when debugging is needed. This should be largely transparent to most code.
Executing CPU Code
==================
In typical Slang operation when code is compiled it produces either source or a binary that can then be loaded by another API such as a rendering API. With CPU code the binary produced could be saved to a file and then executed as an exe or a shared library/dll. In practice though it is common to want to be able to execute compiled code immediately. Having to save off to a file and then load again can be awkward. It is also not necessarily the case that code needs to be saved to a file to be executed.
To support calling code directly, code can be compiled using the [host-callable](#host-callable) targets.
For pass through compilation of C/C++ this mechanism allows any functions marked for export to be directly queried. Marking for export is a C/C++ compiler specific feature. Look at the definition of `SLANG_PRELUDE_EXPORT` in the [C++ prelude](#prelude).
For a complete example on how to execute CPU code using `spGetEntryPointHostCallable`/`getEntryPointHostCallable` look at code in `example/cpu-hello-world`.
<a id="abi"/>Application Binary Interface (ABI)
===
Say we have some Slang source like the following:
```
struct Thing { int a; int b; }
Texture2D<float> tex;
SamplerState sampler;
RWStructuredBuffer<int> outputBuffer;
ConstantBuffer<Thing> thing3;
[numthreads(4, 1, 1)]
void computeMain(
uint3 dispatchThreadID : SV_DispatchThreadID,
uniform Thing thing,
uniform Thing thing2)
{
// ...
}
```
When compiled into a [shader compile style](#compile-style) shared library/dll/host-callable - how is it invoked? An entry point in the Slang source code produces several exported functions. The 'default' exported function has the same name as the entry point in the original source. It has the signature
```
void computeMain(ComputeVaryingInput* varyingInput, UniformEntryPointParams* uniformParams, UniformState* uniformState);
```
NOTE! Using `main` as an entry point name should be avoided if CPU is a target because it typically causes compilation errors due to its normal C/C++ usage.
ComputeVaryingInput is defined in the prelude as
```
struct ComputeVaryingInput
{
uint3 startGroupID;
uint3 endGroupID;
};
```
`ComputeVaryingInput` allows specifying a range of groupIDs to execute - all the ids in a grid from startGroupID up to, but not including, endGroupID. Most compute APIs allow specifying an x,y,z extent on 'dispatch'. This would be equivalent to having startGroupID = { 0, 0, 0 } and endGroupID = { x, y, z }. The exported function accepts a range of groupIDs so that client code can dispatch different parts of the work to different cores. This group range mechanism was chosen as the 'default' mechanism as it is most likely to achieve the best performance.
There are two other functions that consist of the entry point name postfixed with `_Thread` and `_Group`. For the entry point 'computeMain' these functions would be accessible from the shared library interface as `computeMain_Group` and `computeMain_Thread`. `_Group` has the same signature as that listed for computeMain, but it doesn't execute a range, only the single group specified by startGroupID (endGroupID is ignored). That is, all of the threads within the group (as specified by `[numthreads]`) will be executed in a single call.
It may be desirable to have even finer control of how execution takes place down to the level of individual 'thread's and this can be achieved with the `_Thread` style. The signature looks as follows
```
struct ComputeThreadVaryingInput
{
uint3 groupID;
uint3 groupThreadID;
};
void computeMain_Thread(ComputeThreadVaryingInput* varyingInput, UniformEntryPointParams* uniformParams, UniformState* uniformState);
```
When invoking the kernel at the `thread` level it is a question of updating the groupID/groupThreadID to specify which thread of the computation to execute. For the example above we have `[numthreads(4, 1, 1)]`. This means groupThreadID.x can vary from 0-3 and .y and .z must be 0. The groupID.x indicates which 'group of 4' to execute. So groupID.x = 1, with groupThreadID.x=0,1,2,3 runs the 4th, 5th, 6th and 7th 'thread'. Being able to invoke each thread in this way is flexible - any specific thread can be specified and executed. It is not necessarily very efficient, because there is the call overhead and a small amount of extra work that is performed inside the kernel.
Note that the `_Thread` style signature is likely to change to support 'groupshared' variables in the near future.
In terms of performance the 'default' function is probably the most efficient for most common usages. The `_Group` style allows for slightly less loop overhead, but with many invocations this will likely be drowned out by the extra call/setup overhead. The `_Thread` style in most situations will be the slowest, with even more call overhead and fewer options for the C/C++ compiler to use faster paths.
The UniformState and UniformEntryPointParams structs typically vary by shader. UniformState holds 'normal' bindings, whereas UniformEntryPointParams holds the uniform entry point parameters. Where specific bindings or parameters are located can be determined by reflection. The structures for the example above would be something like the following...
```
struct UniformEntryPointParams
{
Thing thing;
Thing thing2;
};
struct UniformState
{
Texture2D<float > tex;
SamplerState sampler;
RWStructuredBuffer<int32_t> outputBuffer;
Thing* thing3;
};
```
Notice that of the entry point parameters, `dispatchThreadID` is not part of UniformEntryPointParams - this is because it is not uniform.
`ConstantBuffer` and `ParameterBlock` will become pointers to the type they hold (as `thing3` is in the above structure).
`StructuredBuffer<T>`,`RWStructuredBuffer<T>` become
```
T* data;
size_t count;
```
`ByteAddressBuffer`, `RWByteAddressBuffer` become
```
uint32_t* data;
size_t sizeInBytes;
```
Resource types become pointers to interfaces that implement their features. For example `Texture2D` becomes a pointer to an `ITexture2D` interface that has to be implemented in client side code. Similarly SamplerState and SamplerComparisonState become `ISamplerState` and `ISamplerComparisonState`.
The actual definitions of the interfaces for resource types are specified in 'slang-cpp-types.h' in the `prelude` directory.
## Unsized arrays
Unsized arrays can be used, which are indicated by an array with no size as in `[]`. For example
```
RWStructuredBuffer<int> arrayOfArrays[];
```
With normal 'sized' arrays, the elements are just stored contiguously within wherever they are defined. With an unsized array they map to `Array<T>` which is...
```
T* data;
size_t count;
```
Note that there is no method in the shader source to get the `count`, even though on the CPU target it is stored and easily available. This is because of the behavior on GPU targets
* That the count has to be stored elsewhere (unlike with CPU)
* On some GPU targets there is no bounds checking - accessing outside the bound values can cause *undefined behavior*
* The elements may be laid out *contiguously* on GPU
In practice this means if you want to access the `count` in shader code it will need to be passed by another mechanism - such as within a constant buffer. It is possible that in the future support may be added to allow direct access of `count` to work across targets transparently.
It is perhaps worth noting that the CPU allows us to have an indirection (a pointer to the unsized array's contents), which has the potential for more flexibility than is possible on GPU targets. GPU targets typically require the elements to be placed 'contiguously' from their location in their `container` - be that registers or in memory. This means on GPU targets there may be other restrictions on where unsized arrays can be placed in a structure, such as only at the end. If code needs to work across targets, these restrictions will need to be followed.
## <a id="context-threading"/>Context Threading
The [shader compile style](#compile-style) brings some extra issues to bear. In the HLSL compute kernel launch model, application visible variables and resources are bound. As described in the [ABI](#abi) section these bindings and additional information identifying a compute thread are passed into the launch as a context. Take for example the code snippet below
```
int myGlobal;
int myFunc(int v)
{
return myGlobal + v;
}
int anotherFunc(int a, int b)
{
return a + b;
}
[numthreads(4, 1, 1)]
void computeMain(uint3 dispatchThreadID : SV_DispatchThreadID)
{
outputBuffer[dispatchThreadID.x] = myFunc(dispatchThreadID.x) + anotherFunc(1, dispatchThreadID.y);
}
```
The function `myFunc` accesses a variable `myGlobal` that is held within a constant buffer. The function cannot be meaningfully executed without access to the context, and the context is available as a parameter passed through `computeMain` entry point at launch. This means the *actual* signature of this function in output code will be something like
```
int32_t myFunc_0(KernelContext_0 * kernelContext_0)
{
return *(&(*(&kernelContext_0->globalParams_0))->myGlobal_0) + int(1);
}
```
The context parameter has been *threaded* into this function. This *threading* will happen to any function that accesses any state that is held in the context. This behavior also happens transitively - if a function *could* call any other function that requires the context, the context will be threaded through to it also.
If application code assumed `myFunc` could be called with no parameters a crash would likely ensue. Note that `anotherFunc` does not have the issue because it doesn't perform an access that needs the context, and so no context threading is added.
If a global is desired in a function that wants to be called from the application, the [`__global`](#actual-global) modifier can be used.
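A simplified illustration of the transformation (not Slang's actual generated code): a function that reads constant buffer state gains a context parameter, while a pure function keeps its signature.

```C++
#include <cassert>

// Illustrative stand-ins for the generated context types; the real
// names and layouts are compiler internals.
struct GlobalParams { int myGlobal; };
struct KernelContext { GlobalParams* globalParams; };

// 'myFunc' accesses context state, so the context is threaded through
// as an extra first parameter:
int myFunc_threaded(KernelContext* context, int v)
{
    return context->globalParams->myGlobal + v;
}

// 'anotherFunc' touches no context state, so no threading is added:
int anotherFunc(int a, int b)
{
    return a + b;
}
```

Application code calling `myFunc_threaded` must therefore supply a valid context, which is why calling such functions directly from the host is problematic.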
## <a id="prelude"/>Prelude
For C++ targets, there is code to support the Slang generated source defined within the 'prelude'. The prelude is inserted text placed before the Slang generated C++ source. For the Slang command line tools as well as the test infrastructure, the prelude functionality is achieved through a `#include` in the prelude text of the `prelude/slang-cpp-prelude.h` specified with an absolute path. Doing so means other files the `slang-cpp-prelude.h` might need can be specified relatively, and include paths for the backend C/C++ compiler do not need to be modified.
The prelude needs to define
* 'Built in' types (vector, matrix, 'object'-like Texture, SamplerState etc)
* Scalar intrinsic function implementations
* Compiler based definitions/tweaks
For the Slang prelude this is split into the following files...
* 'prelude/slang-cpp-prelude.h' - Header that includes all the other requirements & some compiler tweaks
* 'prelude/slang-cpp-scalar-intrinsics.h' - Scalar intrinsic implementations
* 'prelude/slang-cpp-types.h' - The 'built in types'
* 'slang.h' - The Slang header is used for the majority of compiler based definitions
For a client application - as long as the requirements of the generated code are met, the prelude can be implemented by whatever mechanism is appropriate for the client. For example the implementation could be replaced with another implementation, or the prelude could contain all of the required text for compilation. Setting the prelude text can be achieved with the method on the global session...
```
/** Set the 'prelude' for generated code for a 'downstream compiler'.
@param passThrough The downstream compiler for generated code that will have the prelude applied to it.
@param preludeText The text added pre-pended verbatim before the generated source
Note that for pass-through usage, the prelude is not pre-pended; preludes are for code generation only.
*/
virtual SLANG_NO_THROW void SLANG_MCALL setDownstreamCompilerPrelude(
SlangPassThrough passThrough,
const char* preludeText) = 0;
```
It may be useful to be able to include `slang-cpp-types.h` in C++ code to access the types that are used in the generated code. This introduces a problem in that the types used in the generated code might clash with types in client code. To work around this problem, you can wrap all of the types defined in the prelude with a namespace of your choosing. For example
```
#define SLANG_PRELUDE_NAMESPACE CPPPrelude
#include "../../prelude/slang-cpp-types.h"
```
Would wrap all the Slang prelude types in the namespace `CPPPrelude`, such that say a `StructuredBuffer<int32_t>` could be specified in C++ source code as `CPPPrelude::StructuredBuffer<int32_t>`.
The code that sets up the prelude for the test infrastructure and command line usage can be found in `TestToolUtil::setSessionDefaultPrelude`. Essentially this determines the absolute path to `slang-cpp-prelude.h` and then just makes the prelude `#include "the absolute path"`.
The *default* prelude is set to the contents of the files for C++ held in the prelude directory and is held within the Slang shared library. It is therefore typically not necessary to distribute Slang with prelude files.
Language aspects
================
# Arrays passed by Value
Slang follows the HLSL convention that arrays are passed by value. This is in contrast to C/C++, where arrays are passed by reference. To make generated C/C++ follow this convention an array is turned into a 'FixedArray' struct type. Since struct types in C/C++ are passed by value by default, the wrapped array is also.
To get something similar to C/C++ operation the array can be marked `inout` to make it passed by reference.
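A sketch of why the wrapping gives by-value semantics (the `FixedArray` layout here is an assumption, not the exact generated type):

```C++
#include <cassert>

// Wrapping a C array in a struct makes it copyable, so passing it to a
// function copies the elements - matching HLSL's by-value convention.
template <typename T, int N>
struct FixedArray
{
    T values[N];
    T& operator[](int index) { return values[index]; }
};

// The callee receives a copy, so the caller's array is unchanged:
void mutateCopy(FixedArray<int, 3> arr) { arr[0] = 99; }

// A raw C array parameter decays to a pointer, so the caller sees the write:
void mutateReference(int arr[3]) { arr[0] = 99; }
```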
Limitations
===========
# <a id="out-of-bounds"/>Out of bounds access
In HLSL code, if an access is made out of bounds of a StructuredBuffer, execution proceeds. If an out of bounds read is performed, a zeroed value is returned. If an out of bounds write is performed it's effectively a no-op, as the value is discarded. On the CPU target this behavior is *not* supported by default.
For a debug CPU build an out of bounds access will assert; for a release build the behaviour is by default undefined. A limited [zero index](#zero-index) out of bounds mechanism is supported, but must be enabled.
The reason for this is that it is difficult and/or slow to implement the identical GPU behavior for such an access on the CPU. The underlying problem is that `operator[]` typically returns a reference to the contained value. If the access is out of bounds it's not clear what to return, in particular because the value may be read or written, and moreover elements of the type might be written. In practice this means a global zeroed value cannot be returned.
This could be somewhat supported if code generation worked as follows for, say
```
RWStructuredBuffer<float4> values;
values[3].x = 10;
```
Produces
```
template <typename T>
struct RWStructuredBuffer
{
T& at(size_t index, T& defValue) { return index < size ? values[index] : defValue; }
T* values;
size_t size;
};
RWStructuredBuffer<float4> values;
// ...
Vector<float, 4> defValue = {}; // Zero initialize such that read access returns default values
values.at(3, defValue).x = 10;
```
Note that `[]` would be turned into the `at` function, which takes the default value as a parameter provided by the caller. If this is then written to, only defValue is corrupted. Even this mechanism may not be quite right, because if we write and then read again from the out of bounds reference, in HLSL we might expect that 0 is returned, whereas here we get the value that was last written.
## <a id="zero-index"/>Zero index bound checking
If bounds checking is wanted in order to avoid undefined behavior and to limit how memory is accessed, `zero indexed` bounds checking might be appropriate. When enabled, if an access is out of bounds the value at the zero index is returned. This is quite different behavior from typical GPU behavior, but it is fairly efficient and simple to implement. Importantly it means behavior is well defined and always 'in range', assuming there is at least one element.
To enable zero indexing bounds checking pass in the define `SLANG_ENABLE_BOUND_ZERO_INDEX` to a Slang compilation. This define is passed down to C++ and CUDA compilations, and the code in the CUDA and C++ preludes implement the feature. Note that zero indexed bounds checking will slow down accesses that are checked.
The C++ implementation of the feature can be seen by looking at the file "prelude/slang-cpp-types.h". For CUDA "prelude/slang-cuda-prelude.h".
The bounds checking macros are guarded such that it is possible to replace the implementations without directly altering the prelude.
TODO
====
# Main
* groupshared is not yet supported
* Output of header files
* Output multiple entry points
# Internal Slang compiler features
These issues are more internal Slang features/improvements
* Currently only generates C++ code; it would be fairly straightforward to support C (especially if we have 'intrinsic definitions')
* Have 'intrinsic definitions' in the standard library - such that they can be generated where appropriate
+ This will simplify the C/C++ code generation, as it means the Slang language will generate most of the appropriate code
* Currently the 'construct' IR inst is supported as is; we may want to split it out into separate instructions for specific scenarios
* Refactoring around swizzle. Currently in emit it has to check for a variety of scenarios - this could be simplified with an IR pass and perhaps more specific instructions.
Slang CUDA Target Support
=========================
Slang has preliminary support for producing CUDA source, and PTX binaries using [NVRTC](https://docs.nvidia.com/cuda/nvrtc/index.html).
NOTE! NVRTC is only available for 64-bit operating systems. On Windows with Visual Studio, make sure you are compiling for 'x64' and/or use 64-bit Slang binaries.
# Features
* Can compile Slang source into CUDA source code
* Supports compute style shaders
* Supports a 'bindless' CPU like model
* Can compile CUDA source to PTX through the 'pass through' mechanism
# Limitations
These limitations apply to Slang transpiling to CUDA.
* Only supports the 'texture object' style binding (the texture object API is only supported on devices of compute capability 3.0 or higher)
* Samplers are not separate objects in CUDA - they are combined into a single 'TextureObject'. So samplers are effectively ignored on CUDA targets.
* When using TextureArray.Sample (a layered texture in CUDA), the index will be treated as an int, as this is all CUDA allows
* Care must be taken when using the `WaveGetLaneIndex` wave intrinsic - it will only give the right results for appropriate launches
* CUDA 'surfaces' are used for textures which are read/write (aka RWTexture).
The following are a work in progress or not implemented but are planned to be so in the future
* Some resource types remain unsupported, and not all methods on all types are supported
# How it works
For producing PTX binaries Slang uses [NVRTC](https://docs.nvidia.com/cuda/nvrtc/index.html). The NVRTC dll/shared library has to be available to Slang (for example, in the appropriate PATH) for it to be able to produce PTX.
The NVRTC compiler can be accessed directly via the pass through mechanism and is identified by the enum value `SLANG_PASS_THROUGH_NVRTC`.
Much like other targets that use downstream compilers, Slang can compile CUDA source directly to PTX via the pass through mechanism. The Slang command line options will broadly be mapped down to the appropriate options for the NVRTC compilation. In the API the `SlangCompileTarget` for CUDA is `SLANG_CUDA_SOURCE` and for PTX is `SLANG_PTX`. These can also be specified on the Slang command line as `-target cuda` and `-target ptx`.
## Locating NVRTC
Finding NVRTC can require some nuance if a specific version is required. On the command line the `-nvrtc-path` option can be used to set the `path` to NVRTC. Also `spProcessCommandLineArguments`/`processCommandLineArguments` with `-nvrtc-path` or `setDownstreamCompilerPath` with `SLANG_PASS_THROUGH_NVRTC` can be used to set the location and/or name of NVRTC via the API.
Important points of note are
* The name of the shared library should *not* include any extension (such as `.dll`/`.so`/`.dynlib`) or prefix (such as `lib`).
* The path also *doesn't* have to be a full path; it can just be the shared library name. In that case it will be searched for by whatever the default mechanism is on the target.
* If a path and/or name is specified for NVRTC - this will be the *only* version searched for.
If a path/name is *not* specified for NVRTC, Slang will attempt to load a shared library called `nvrtc`. For non Windows targets this should be enough to find and load the latest version.
On Windows, NVRTC dlls have a name that contains the version number, for example `nvrtc64_102_0.dll`. This will cause a load of just `nvrtc` to fail. One approach to fix this is to place the NVRTC dll and associated files in the same directory as `slang-compiler.dll`, and rename the main dll to `nvrtc.dll`. Another approach is to specify the name including the version directly on the command line, as previously discussed. For example
`-nvrtc-path nvrtc64_102_0`
will load NVRTC 10.2 assuming that version of the dll can be found via the normal lookup mechanism.
On Windows if NVRTC is not loadable directly as 'nvrtc' Slang will attempt to search for the newest version of NVRTC on your system. The places searched are...
* The instance directory (where the slang-compiler.dll and/or program exe is)
* The CUDA_PATH environment variable (if set)
* Directories in PATH that look like a CUDA installation.
If a candidate is found via an earlier mechanism, subsequent searches are not performed. If multiple candidates are found, Slang tries the newest version first.
Binding
=======
Say we have some Slang source like the following:
```
struct Thing { int a; int b; }
Texture2D<float> tex;
SamplerState sampler;
RWStructuredBuffer<int> outputBuffer;
ConstantBuffer<Thing> thing3;
[numthreads(4, 1, 1)]
void computeMain(
uint3 dispatchThreadID : SV_DispatchThreadID,
uniform Thing thing,
uniform Thing thing2)
{
// ...
}
```
This will be turned into a CUDA entry point with
```
struct UniformEntryPointParams
{
Thing thing;
Thing thing2;
};
struct UniformState
{
CUtexObject tex; // This is the combination of a texture and a sampler(!)
SamplerState sampler; // This variable exists within the layout, but its value is not used.
RWStructuredBuffer<int32_t> outputBuffer; // This is implemented as a template in the CUDA prelude. It's just a pointer, and a size
Thing* thing3; // Constant buffers map to pointers
};
// [numthreads(4, 1, 1)]
extern "C" __global__ void computeMain(UniformEntryPointParams* params, UniformState* uniformState)
```
With CUDA - the caller specifies how threading is broken up, so `[numthreads]` is available through reflection, and in a comment in output source code but does not produce varying code.
The UniformState and UniformEntryPointParams structs typically vary by shader. UniformState holds 'normal' bindings, whereas UniformEntryPointParams holds the uniform entry point parameters. Where specific bindings or parameters are located can be determined by reflection. The resource types used above map to the following representations...
`StructuredBuffer<T>`,`RWStructuredBuffer<T>` become
```
T* data;
size_t count;
```
`ByteAddressBuffer`, `RWByteAddressBuffer` become
```
uint32_t* data;
size_t sizeInBytes;
```
## Texture
Read only textures will be bound as the opaque CUDA type CUtexObject. This type is the combination of both a texture AND a sampler. This is somewhat different from HLSL, where there can be separate `SamplerState` variables. This allows access of a single texture binding with different types of sampling.
If code relies on this behavior it will be necessary to bind multiple CUtexObjects with different sampler settings, accessing the same texture data.
Slang has some preliminary support for the TextureSampler type - a combined Texture and SamplerState. Using this type makes the combined-sampling semantics explicit in Slang source that targets CUDA as well as other platforms.
Load is only supported for Texture1D, and the mip map selection argument is ignored. This is because CUDA provides tex1Dfetch but no higher-dimensional equivalents. CUDA also only allows such access if the backing array is linear memory - meaning the bound texture cannot have mip maps - making the mip map parameter superfluous anyway. RWTexture does allow Load on other texture types.
## RWTexture
RWTexture types are converted into CUsurfObject type.
In regular CUDA it is not possible to do a format conversion on an access to a CUsurfObject. Slang does add support for hardware write conversions where they are available. To enable the feature it is necessary to attribute your RWTexture with `format`. For example
```
[format("rg16f")]
RWTexture2D<float2> rwt2D_2;
```
The format names used are the same as for [GLSL layout format types](https://www.khronos.org/opengl/wiki/Layout_Qualifier_(GLSL)). If no format is specified Slang will *assume* that the format is the same as the type specified.
Note that the format attribution is on variables/parameters/fields and not part of the type system. This means that if you have a scenario like...
```
[format("rg16f")]
RWTexture2D<float2> g_texture;
float2 getValue(RWTexture2D<float2> t)
{
return t[int2(0, 0)];
}
void doThing()
{
float2 v = getValue(g_texture);
}
```
Here `getValue` will receive `t` *without* the format attribute, and so will access it without the conversion, presumably erroneously. A workaround for this specific scenario would be to attribute the parameter
```
float2 getValue([format("rg16f")] RWTexture2D<float2> t)
{
return t[int2(0, 0)];
}
```
This will only work correctly if `getValue` is called with a `t` that has that format attribute. As it stands no checking is performed on this matching, so no error or warning will be produced if there is a mismatch.
There is limited software support for doing a conversion on reading. Currently this supports only 1D, 2D and 3D RWTexture backed with half1, half2 or half4. For this path to work, NVRTC must have `cuda_fp16.h` and associated files available. Please check the section on `Half Support`.
If hardware read conversions are desired, this can be achieved by having a `Texture<T>` that uses the surface of an `RWTexture<T>`. Using the `Texture<T>` not only allows hardware conversion but also filtering.
It is also worth noting that CUsurfObjects in CUDA are NOT allowed to have mip maps.
By default surface access uses cudaBoundaryModeZero; this can be replaced by defining the macro SLANG_CUDA_BOUNDARY_MODE in the CUDA prelude. For HW format conversions, use the macro SLANG_PTX_BOUNDARY_MODE. These boundary settings are in effect global for the whole of the kernel.
`SLANG_CUDA_BOUNDARY_MODE` can be one of
* cudaBoundaryModeZero - out-of-bounds reads return zero, and out-of-bounds writes are dropped
* cudaBoundaryModeClamp - accesses are clamped to the nearest in-bounds surface location
* cudaBoundaryModeTrap - out-of-bounds accesses cause an execution trap
`SLANG_PTX_BOUNDARY_MODE` can be one of `trap`, `clamp` or `zero`. In general it is recommended to set both to the same kind of value, for example `cudaBoundaryModeZero` and `zero`.
## Sampler
Samplers are in effect ignored in CUDA output. Currently we do output a `SamplerState` variable, but its value is never accessed within the kernel and so can be ignored. There is more discussion of this behavior in the `Texture` section.
## Unsized arrays
Unsized arrays can be used, which are indicated by an array with no size as in `[]`. For example
```
RWStructuredBuffer<int> arrayOfArrays[];
```
With normal 'sized' arrays, the elements are just stored contiguously wherever they are defined. An unsized array maps to `Array<T>`, which is...
```
T* data;
size_t count;
```
Note that there is no way in shader source to get the `count`, even though on the CUDA target it is stored and easily available. This is because of the behavior on GPU targets:
* That the count has to be stored elsewhere (unlike with CUDA)
* On some GPU targets there is no bounds checking - accessing outside the bound values can cause *undefined behavior*
* The elements may be laid out *contiguously* on GPU
In practice this means that if you want to access the `count` in shader code, it will need to be passed by another mechanism - such as within a constant buffer. Support may be added in the future to allow direct access of `count` to work transparently across targets.
## Prelude
For CUDA, the code needed to support the code generated by Slang is partly defined within the 'prelude'. The prelude is text inserted before the generated CUDA source code. For the Slang command line tools as well as the test infrastructure, the prelude is simply a `#include` of `prelude/slang-cuda-prelude.h` specified with an absolute path. Doing so means other files that `slang-cuda-prelude.h` might need can be specified relative to it, and include paths for the downstream compiler do not need to be modified.
The prelude needs to define
* 'Built in' types (vector, matrix, 'object'-like Texture, SamplerState etc)
* Scalar intrinsic function implementations
* Compiler based definitions/tweaks
For a client application - as long as the requirements of the generated code are met, the prelude can be implemented by whatever mechanism is appropriate for the client. For example the implementation could be replaced with another implementation, or the prelude could contain all of the required text for compilation. Setting the prelude text can be achieved with the method on the global session...
```
/** Set the 'prelude' for generated code for a 'downstream compiler'.
@param passThrough The downstream compiler for generated code that will have the prelude applied to it.
@param preludeText The text added pre-pended verbatim before the generated source
Note that for pass-through usage the prelude is not pre-pended; preludes are for code generation only.
*/
void setDownstreamCompilerPrelude(SlangPassThrough passThrough, const char* preludeText);
```
The code that sets up the prelude for the test infrastructure and command line usage can be found in `TestToolUtil::setSessionDefaultPrelude`. Essentially this determines the absolute path to `slang-cpp-prelude.h` and then makes the prelude `#include "the absolute path"`.
Half Support
============
Slang supports the half/float16 types on CUDA. To do so NVRTC must have access to the `cuda_fp16.h` and `cuda_fp16.hpp` files that are typically distributed as part of the CUDA SDK. When Slang detects the use of half in source, it will define `SLANG_CUDA_ENABLE_HALF` when `slang-cuda-prelude.h` is included. This will in turn try to include `cuda_fp16.h` and enable extra functionality within the prelude for half support.
Slang tries several mechanisms to locate `cuda_fp16.h` when NVRTC is initiated. The first mechanism is to look in the include paths that are passed to Slang. If `cuda_fp16.h` can be found in one of these paths, no more searching will be performed.
If this fails, the path where NVRTC is located will be searched; within it, the "include" and "CUDA/include" subdirectories are searched. This is probably most suitable for Windows targets, where the NVRTC dll is placed alongside other binaries. The "CUDA/include" path is used to make clear in this scenario what the contained files are for.
If this fails, Slang will look for the CUDA_PATH environment variable, as is typically set during a CUDA SDK installation.
If this fails - the prelude include of `cuda_fp16.h` will most likely fail on NVRTC invocation.
CUDA has the `__half` and `__half2` types defined in `cuda_fp16.h`. The `__half2` can produce results just as quickly as doing the same operation on `__half` - in essence for some operations `__half2` is [SIMD](https://en.wikipedia.org/wiki/SIMD) like. The half implementation in Slang tries to take advantage of this optimization.
Since Slang supports vectors up to 4 elements wide, it has to build on CUDA's half support. The types `__half3` and `__half4` are implemented in `slang-cuda-prelude.h` for this reason. It is worth noting that `__half3` is made up of a `__half2` and a `__half`. As `__half2` is 4-byte aligned, this means `__half3` is actually 8 bytes, rather than the 6 bytes that might be expected.
One area where this optimization isn't fully used is comparisons - in effect Slang treats all vector/matrix half comparisons as if they were scalar. This could perhaps be improved in the future; doing so would require features that are not directly available in the CUDA headers.
Wave Intrinsics
===============
There is broad support for [HLSL Wave intrinsics](https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/hlsl-shader-model-6-0-features-for-direct3d-12), including support for [SM 6.5 intrinsics](https://microsoft.github.io/DirectX-Specs/d3d/HLSL_ShaderModel6_5.html).
Most Wave intrinsics will work with vector, matrix or scalar types of typical built in types - `uint`, `int`, `float`, `double`, `uint64_t`, `int64_t`.
The support is provided via both the Slang core module and the Slang CUDA prelude found in 'prelude/slang-cuda-prelude.h'. Many Wave intrinsics are not directly applicable within CUDA, which supplies more low-level mechanisms. Most Wave functions perform best when all lanes of a warp are used. If all lanes from index 0 to pow2(n) - 1 are active (which is also true when every lane is active), a binary reduction is typically applied. Otherwise the implementation falls back on a slow path that is linear in the number of active lanes, and so is typically significantly less performant.
For a more concrete example, take
```
int sum = WaveActiveSum(...);
```
When computing the sum, if all lanes are active (32 on CUDA), the computation requires 5 steps to complete (2^5 = 32). If just one lane is inactive, it takes 31 steps (because it is now linear in the number of active lanes) - so disabling a single lane requires roughly 6 times as many steps. If only lanes 0-15 are active, it takes 4 steps (2^4 = 16).
In the future it may be possible to improve the performance of the 'slow' path, but having all lanes 0 to pow2(n) - 1 active will always remain the most efficient case.
It is also worth noting that the performance of lane-communicating intrinsics is affected by the 'size' of the data communicated - at a minimum, the number of built-in scalar values involved in the processing. The CUDA language only allows direct communication of built-in scalar types.
Thus
```
int3 v = ...;
int3 sum = WaveActiveSum(v);
```
Will require 3 times as many steps as the earlier scalar example just using a single int.
## WaveGetLaneIndex
'WaveGetLaneIndex' defaults to `(threadIdx.x & SLANG_CUDA_WARP_MASK)`. Depending on how the kernel is launched, this could be incorrect. There are other ways to get the lane index, for example using inline assembly, though that mechanism is apparently slower than the simple method used here. The asm mechanism can be enabled in the CUDA prelude with the `SLANG_USE_ASM_LANE_ID` preprocessor define.
There is potential to calculate the lane id using the [numthreads] markup in Slang/HLSL, but that also requires some assumptions of how that maps to a lane index.
## Unsupported Intrinsics
* Intrinsics which only work in pixel shaders
+ QuadXXXX intrinsics
OptiX Support
=============
Slang supports OptiX for raytracing. To compile raytracing programs, NVRTC must have access to the `optix.h` and dependent files that are typically distributed as part of the OptiX SDK. When Slang detects the use of raytracing in source, it will define `SLANG_CUDA_ENABLE_OPTIX` when `slang-cuda-prelude.h` is included. This will in turn try to include `optix.h`.
Slang tries several mechanisms to locate `optix.h` when NVRTC is initiated. The first mechanism is to look in the include paths that are passed to Slang. If `optix.h` can be found in one of these paths, no more searching will be performed.
If this fails, the default OptiX SDK install locations are searched. On Windows this is `%PROGRAMDATA%\NVIDIA Corporation\OptiX SDK X.X.X\include`. On Linux this is `${HOME}/NVIDIA-OptiX-SDK-X.X.X-suffix`.
If OptiX headers cannot be found, compilation will fail.
Limitations
===========
Some features are not available because they cannot be mapped with appropriate behavior to the target. Others are unavailable simply because resources have not yet been devoted to more unusual features.
* Not all Wave intrinsics are supported
* There is not complete support for all methods on 'objects' like textures etc.
* Does not currently support combined 'TextureSampler'. A Texture behaves equivalently to a TextureSampler and Samplers are ignored.
* Half type is not currently supported
* GetDimensions is not available on any Texture type currently - as there doesn't appear to be a CUDA equivalent
Language aspects
================
# Arrays passed by Value
Slang follows the HLSL convention that arrays are passed by value. This is in contrast to CUDA, where arrays follow C++ conventions and are passed by reference. To make generated CUDA follow this convention, an array is wrapped in a 'FixedArray' struct type.
To get something closer to CUDA/C++ behavior, the array can be marked `in out` or `inout` so that it is passed by reference.

# Debugging Slang
This document gives examples showing how to run debuggers in the Slang codebase.
Follow the [Building Slang From Source](/docs/building.md) instructions first.
## Visual Studio
This repo includes multiple `*.natvis` files which Visual Studio picks up
automatically; no extra configuration is required.
## LLDB
If you use [LLDB][], we provide a `.lldbinit` file which enables data formatters
for types in the Slang codebase. You can use this with LLDB in your terminal via
the [`--local-lldbinit`][] flag; for example:
```
$ cmake --build --preset debug
$ lldb --local-lldbinit build/Debug/bin/slangc -- tests/byte-code/hello.slang -dump-ir
(lldb) breakpoint set --name dumpIR
(lldb) run
```
LLDB can be used with either GCC or Clang, but Clang seems to behave better
about respecting breakpoint locations and not having missing variables.
### VS Code
If instead you prefer to debug within VS Code, you can run LLDB via the
[CodeLLDB][] extension. For example, to recreate the same debugging session as
above, create a `.vscode/tasks.json` file with these contents:
```json
{
"version": "2.0.0",
"tasks": [
{
"label": "Debug build",
"type": "shell",
"command": "cmake",
"args": ["--build", "--preset", "debug"]
}
]
}
```
Then create a `.vscode/launch.json` file with these contents:
```json
{
"version": "0.2.0",
"configurations": [
{
"name": "LLDB",
"preLaunchTask": "Debug build",
"type": "lldb",
"request": "launch",
"initCommands": ["command source .lldbinit"],
"program": "build/Debug/bin/slangc",
"args": ["tests/byte-code/hello.slang", "-dump-ir"]
}
]
}
```
Finally, place any breakpoints you want, and hit F5.
[`--local-lldbinit`]: https://lldb.llvm.org/man/lldb.html#cmdoption-lldb-local-lldbinit
[codelldb]: https://marketplace.visualstudio.com/items?itemName=vadimcn.vscode-lldb
[lldb]: https://lldb.llvm.org/index.html

---
layout: deprecated
permalink: "docs/user-guide/a1-02-slangpy"
---
Using Slang to Write PyTorch Kernels
=========================================================
> #### Note
> This documentation is about `slang-torch`, a way to use Slang with Python and PyTorch.
> For new projects, we recommend exploring <a href="https://slangpy.shader-slang.org">SlangPy</a> as an alternative.
> We plan to deprecate `slang-torch` in favor of SlangPy in the near future, and we will communicate any plans in advance.
If you are a PyTorch user seeking to write complex, high-performance, and automatically differentiated kernel functions using a per-thread programming model, we invite you to try Slang. Slang is a cutting-edge shading language that provides a straightforward way to define kernel functions that run incredibly fast in graphics applications. With the latest addition of automatic differentiation and PyTorch interop features, Slang offers an efficient solution for developing auto-differentiated kernels that run at lightning speed with a strongly typed, per-thread programming model.
One of the primary advantages of a per-thread programming model in kernel programming is the elimination of concerns regarding maintaining masks for branches. When developing a kernel in Slang, you can use all control flow statements, composite data types (structs, arrays, etc.), and function calls without additional effort. Code created with these language constructs can be automatically differentiated by the compiler without any restrictions. Additionally, Slang is a strongly typed language, which ensures that you will never encounter type errors at runtime. Most code errors can be identified as you type thanks to the [compiler's coding assistance service](https://marketplace.visualstudio.com/items?itemName=shader-slang.slang-language-extension), further streamlining the development process.
In addition, using a per-thread programming model also results in more optimized memory usage. When writing a kernel in Slang, most intermediate results do not need to be written out to global memory and then read back, reducing global memory bandwidth consumption and the delay caused by these memory operations. As a result, a Slang kernel can typically run at higher efficiency compared to the traditional bulk-synchronous programming model.
## Getting Started with SlangTorch
In this tutorial, we will use a simple example to walk through the steps to use Slang in your PyTorch project.
### Installation
`slangtorch` is available via PyPI, so you can install it simply through
```sh
pip install slangtorch
```
Note that `slangtorch` requires `torch` with CUDA support. See the [pytorch](https://pytorch.org/) installation page to find the right version for your platform.
You can check that you have the right installation by running:
```sh
python -c "import torch; print(f'cuda: {torch.cuda.is_available()}')"
```
### Writing Slang kernels for `slangtorch` >= **v1.1.5**
From **v2023.4.0**, Slang supports auto-binding features that make it easier than ever to invoke Slang kernels from python, and interoperate seamlessly with `pytorch` tensors.
Here's a barebones example of a simple squaring kernel written in Slang (`square.slang`):
```csharp
[AutoPyBindCUDA]
[CUDAKernel]
void square(TensorView<float> input, TensorView<float> output)
{
// Get the 'global' index of this thread.
uint3 dispatchIdx = cudaThreadIdx() + cudaBlockIdx() * cudaBlockDim();
// If the thread index is beyond the input size, exit early.
if (dispatchIdx.x >= input.size(0))
return;
output[dispatchIdx.x] = input[dispatchIdx.x] * input[dispatchIdx.x];
}
```
This code follows the standard pattern of a typical CUDA kernel function. It takes as input two tensors, `input` and `output`. It first obtains the global dispatch index of the current thread and performs a range check to make sure we don't read or write out of the bounds of the input and output tensors, then computes the per-element square and stores it at the corresponding location in the `output` tensor.
`slangtorch` works by compiling kernels to CUDA and it identifies the functions to compile by checking for the `[CUDAKernel]` attribute.
The second attribute `[AutoPyBindCUDA]` allows us to call `square` directly from python without having to write any host code. If you would like to write the host code yourself for finer control, see the other version of this example [here](#manually-binding-kernels).
You can now simply invoke this kernel from python:
```python
import torch
import slangtorch
m = slangtorch.loadModule('square.slang')
A = torch.randn((1024,), dtype=torch.float).cuda()
output = torch.zeros_like(A).cuda()
# Number of threads launched = blockSize * gridSize
m.square(input=A, output=output).launchRaw(blockSize=(32, 1, 1), gridSize=(64, 1, 1))
print(output)
```
The python call `slangtorch.loadModule("square.slang")` returns a scope that contains a handle to the `square` kernel.
The kernel can be invoked by
1. calling `square` and binding `torch` tensors as arguments for the kernel, and then
2. launching it using `launchRaw()` by specifying CUDA launch arguments to `blockSize` & `gridSize`. (Refer to the [CUDA documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications) for restrictions around `blockSize`)
Note that, for semantic clarity, calling a kernel requires keyword arguments whose names are lifted from the `.slang` implementation.
### Invoking derivatives of kernels using slangtorch
The `[AutoPyBindCUDA]` attribute can also be used on differentiable functions defined in Slang, and will automatically bind the derivatives. To do this, simply add the `[Differentiable]` attribute.
One key point is that the basic `TensorView<T>` objects are not differentiable. They can be used as buffers for data that does not require derivatives, or even as buffers for the manual accumulation of derivatives.
Instead, use the `DiffTensorView` type for when you need differentiable tensors. Currently, `DiffTensorView` only supports the `float` dtype variety.
Here's a barebones example of a differentiable version of `square`:
```csharp
[AutoPyBindCUDA]
[CUDAKernel]
[Differentiable]
void square(DiffTensorView input, DiffTensorView output)
{
uint3 dispatchIdx = cudaThreadIdx() + cudaBlockIdx() * cudaBlockDim();
if (dispatchIdx.x >= input.size(0))
return;
output[dispatchIdx.x] = input[dispatchIdx.x] * input[dispatchIdx.x];
}
```
Now, `slangtorch.loadModule("square.slang")` returns a scope with three callable handles `square`, `square.fwd` for the forward-mode derivative & `square.bwd` for the reverse-mode derivative.
You can invoke `square()` normally to get the same effect as the previous example, or invoke `square.fwd()` / `square.bwd()` by binding pairs of tensors to compute the derivatives.
```python
import torch
import slangtorch
m = slangtorch.loadModule('square.slang')
input = torch.tensor((0, 1, 2, 3, 4, 5), dtype=torch.float).cuda()
output = torch.zeros_like(input).cuda()
# Invoke normally
m.square(input=input, output=output).launchRaw(blockSize=(6, 1, 1), gridSize=(1, 1, 1))
print(output)
# Invoke reverse-mode autodiff by first allocating tensors to hold the gradients
input = torch.tensor((0, 1, 2, 3, 4, 5), dtype=torch.float).cuda()
input_grad = torch.zeros_like(input).cuda()
output = torch.zeros_like(input)
# Pass in all 1s as the output derivative for our example
output_grad = torch.ones_like(output)
m.square.bwd(
input=(input, input_grad), output=(output, output_grad)
).launchRaw(
blockSize=(6, 1, 1), gridSize=(1, 1, 1))
# Derivatives get propagated to input_grad
print(input_grad)
# Note that the derivatives in output_grad are 'consumed'.
# i.e. all zeros after the call.
print(output_grad)
```
`slangtorch` also binds the forward-mode version of your kernel (propagate derivatives of inputs to the output) which can be invoked the same way using `module.square.fwd()`
You can refer to [this documentation](autodiff) for a detailed reference of Slang's automatic differentiation feature.
### Wrapping your kernels as pytorch functions
`pytorch` offers an easy way to define a custom operation using `torch.autograd.Function`, and defining the `.forward()` and `.backward()` members.
This can be a very helpful way to wrap your Slang kernels as pytorch-compatible operations. Here's an example of the `square` kernel as a differentiable pytorch function.
```python
import torch
import slangtorch
m = slangtorch.loadModule("square.slang")
class MySquareFunc(torch.autograd.Function):
@staticmethod
def forward(ctx, input):
output = torch.zeros_like(input)
kernel_with_args = m.square(input=input, output=output)
kernel_with_args.launchRaw(
blockSize=(32, 32, 1),
gridSize=((input.shape[0] + 31) // 32, (input.shape[1] + 31) // 32, 1))
ctx.save_for_backward(input, output)
return output
@staticmethod
def backward(ctx, grad_output):
(input, output) = ctx.saved_tensors
input_grad = torch.zeros_like(input)
# Note: When using DiffTensorView, grad_output gets 'consumed' during the reverse-mode.
# If grad_output may be reused, consider calling grad_output = grad_output.clone()
#
kernel_with_args = m.square.bwd(input=(input, input_grad), output=(output, grad_output))
kernel_with_args.launchRaw(
blockSize=(32, 32, 1),
gridSize=((input.shape[0] + 31) // 32, (input.shape[1] + 31) // 32, 1))
return input_grad
```
Now we can use the autograd function `MySquareFunc` in our python script:
```python
x = torch.tensor((3.0, 4.0), requires_grad=True, device='cuda')
print(f"X = {x}")
y_pred = MySquareFunc.apply(x)
loss = y_pred.sum()
loss.backward()
print(f"dX = {x.grad.cpu()}")
```
Output:
```
X = tensor([3., 4.],
device='cuda:0', requires_grad=True)
dX = tensor([6., 8.])
```
And that's it! `slangtorch.loadModule` uses JIT compilation to compile your Slang source into CUDA binary.
It may take a little longer the first time you execute the script, but the compiled binaries are cached, and as long as the kernel code is unchanged, future runs will not rebuild the CUDA kernel.
Because the PyTorch JIT system requires `ninja`, you need to make sure `ninja` is installed on your system and discoverable from the current environment. You also need a C++ compiler available on the system; on Windows, this means Visual Studio needs to be installed.
## Specializing shaders using slangtorch
`slangtorch.loadModule` also accepts specialization parameters, since it is often easier to write shaders with placeholder definitions that are substituted at load time.
For instance, here's a sphere tracer that uses a _compile-time_ specialization parameter for its maximum number of steps (`N`):
```csharp
float3 sphereTrace<let N : int>(Ray ray, SDF sdf)
{
var pt = ray.o;
for (int i = 0; i < N; i++)
{
pt += sdf.eval(pt) * ray.d;
}
return pt;
}
float render(Ray ray)
{
// Use N=20 for sphere tracing.
float3 pt = sphereTrace<20>(ray, sdf);
return shade(pt, sdf.normal());
}
```
However, instead of using a fixed `20` steps, the renderer can be configured to use an arbitrary compile-time constant.
```csharp
// Compile-time constant. Expect "MAX_STEPS" to be set by the loadModule call.
static const uint kMaxSteps = MAX_STEPS;
float render(Ray ray)
{
float3 pt = sphereTrace<kMaxSteps>(ray, sdf);
return shade(pt, sdf.normal());
}
```
Then multiple versions of this shader can be compiled from Python using the `defines` argument:
```python
import slangtorch
sdfRenderer20Steps = slangtorch.loadModule('sdf.slang', defines={"MAX_STEPS": 20})
sdfRenderer50Steps = slangtorch.loadModule('sdf.slang', defines={"MAX_STEPS": 50})
...
```
This is often helpful for code re-use, parameter sweeping, comparison/ablation studies, and more, from the convenience of Python.
## Back-propagating Derivatives through Complex Access Patterns
In most common scenarios, a kernel function will access input tensors in a complex pattern instead of mapping
1:1 from an input element to an output element, as in the `square` example shown above. When a kernel
function accesses many different elements of the input tensors to compute an output element,
the derivative of each input element can't be represented directly as a function parameter, like the `x` in `square(x)`.
Consider a 3x3 box filtering kernel that computes, for each pixel in a 2D image, the average value of its
surrounding 3x3 pixel block. We can write a Slang function that computes the value of an output pixel:
```csharp
float computeOutputPixel(TensorView<float> input, uint2 pixelLoc)
{
int width = input.size(0);
int height = input.size(1);
// Track the sum of neighboring pixels and the number
// of pixels currently accumulated.
int count = 0;
float sumValue = 0.0;
// Iterate through the surrounding area.
for (int offsetX = -1; offsetX <= 1; offsetX++)
{
// Skip out of bounds pixels.
int x = pixelLoc.x + offsetX;
if (x < 0 || x >= width) continue;
for (int offsetY = -1; offsetY <= 1; offsetY++)
{
int y = pixelLoc.y + offsetY;
if (y < 0 || y >= height) continue;
sumValue += input[x, y];
count++;
}
}
// Compute the average value.
sumValue /= count;
return sumValue;
}
```
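As a point of reference, the same clamped 3x3 averaging can be sketched in plain Python on nested lists; this is only an illustration of the logic, not code the compiler generates:

```python
def box_filter_pixel(img, x, y):
    """Average of the in-bounds 3x3 neighborhood around (x, y)."""
    w, h = len(img), len(img[0])
    total, count = 0.0, 0
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            # Skip out-of-bounds neighbors, mirroring the Slang version.
            if 0 <= nx < w and 0 <= ny < h:
                total += img[nx][ny]
                count += 1
    return total / count

img = [[1.0, 2.0], [3.0, 4.0]]
# every pixel's clamped neighborhood covers the whole 2x2 image here
assert box_filter_pixel(img, 0, 0) == 2.5
```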
We can define our kernel function to compute the entire output image by calling `computeOutputPixel`:
```csharp
[CudaKernel]
void boxFilter_fwd(TensorView<float> input, TensorView<float> output)
{
uint2 pixelLoc = (cudaBlockIdx() * cudaBlockDim() + cudaThreadIdx()).xy;
int width = input.size(0);
int height = input.size(1);
if (pixelLoc.x >= width) return;
if (pixelLoc.y >= height) return;
float outputValueAtPixel = computeOutputPixel(input, pixelLoc);
// Write to output tensor.
output[pixelLoc] = outputValueAtPixel;
}
```
How do we define the backward derivative propagation kernel? Note that in this example, there
isn't a function like `square` that we can just mark as `[Differentiable]` and
call `bwd_diff(square)` to get back the derivative of an input parameter.
In this example, the input comes from multiple elements in a tensor. How do we propagate the
derivatives to those input elements?
The solution is to wrap tensor access with a custom function:
```csharp
float getInputElement(
TensorView<float> input,
TensorView<float> inputGradToPropagateTo,
uint2 loc)
{
return input[loc];
}
```
Note that the `getInputElement` function simply returns `input[loc]` and is not using the
`inputGradToPropagateTo` parameter. That is intended. The `inputGradToPropagateTo` parameter
is used to hold the backward propagated derivatives of each input element, and is reserved for later use.
Now we can replace all direct accesses to `input` with calls to `getInputElement`.
`computeOutputPixel` can then be implemented as follows:
```csharp
[Differentiable]
float computeOutputPixel(
TensorView<float> input,
TensorView<float> inputGradToPropagateTo,
uint2 pixelLoc)
{
int width = input.size(0);
int height = input.size(1);
// Track the sum of neighboring pixels and the number
// of pixels currently accumulated.
int count = 0;
float sumValue = 0.0;
// Iterate through the surrounding area.
for (int offsetX = -1; offsetX <= 1; offsetX++)
{
// Skip out of bounds pixels.
int x = pixelLoc.x + offsetX;
if (x < 0 || x >= width) continue;
for (int offsetY = -1; offsetY <= 1; offsetY++)
{
int y = pixelLoc.y + offsetY;
if (y < 0 || y >= height) continue;
sumValue += getInputElement(input, inputGradToPropagateTo, uint2(x, y));
count++;
}
}
// Compute the average value.
sumValue /= count;
return sumValue;
}
```
The main changes compared to our original version of `computeOutputPixel` are:
- Added an `inputGradToPropagateTo` parameter.
- Replaced `input[x, y]` with a call to `getInputElement`.
- Added a `[Differentiable]` attribute to the function.
With that, we can define our backward kernel function:
```csharp
[CudaKernel]
void boxFilter_bwd(
TensorView<float> input,
TensorView<float> resultGradToPropagateFrom,
TensorView<float> inputGradToPropagateTo)
{
uint2 pixelLoc = (cudaBlockIdx() * cudaBlockDim() + cudaThreadIdx()).xy;
int width = input.size(0);
int height = input.size(1);
if (pixelLoc.x >= width) return;
if (pixelLoc.y >= height) return;
bwd_diff(computeOutputPixel)(input, inputGradToPropagateTo, pixelLoc, resultGradToPropagateFrom[pixelLoc]);
}
```
The kernel function simply calls `bwd_diff(computeOutputPixel)`, without using its return value
and without writing to any elements of the `inputGradToPropagateTo` tensor itself. So when exactly does the propagated
derivative get written to the output gradient tensor (`inputGradToPropagateTo`)?
And that logic is defined in our final piece of code:
```csharp
[BackwardDerivativeOf(getInputElement)]
void getInputElement_bwd(
TensorView<float> input,
TensorView<float> inputGradToPropagateTo,
uint2 loc,
float derivative)
{
float oldVal;
inputGradToPropagateTo.InterlockedAdd(loc, derivative, oldVal);
}
```
Here, we are providing a custom defined backward propagation function for `getInputElement`.
In this function, we simply add `derivative` to the element in `inputGradToPropagateTo` tensor.
When we call `bwd_diff(computeOutputPixel)` in `boxFilter_bwd`, the Slang compiler will automatically
differentiate all operations and function calls in `computeOutputPixel`. By wrapping the tensor element access
in `getInputElement` and providing a custom backward propagation function for it, we are effectively
telling the compiler what to do when a derivative propagates to an input tensor element: the body
of `getInputElement_bwd` atomically adds the propagated derivative to the corresponding element
of the `inputGradToPropagateTo` tensor. Therefore, after running `boxFilter_bwd`, the `inputGradToPropagateTo` tensor will contain all the
back-propagated derivative values.
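To see what the atomic-add scatter computes, here is a plain-Python emulation of the whole backward pass: every output pixel distributes `grad_out / count` to each input pixel in its clamped window, and contributions accumulate additively just as `InterlockedAdd` does (an illustration, not generated code):

```python
def box_filter_bwd(w, h, grad_out):
    """Accumulate d(output)/d(input) contributions, mimicking InterlockedAdd."""
    grad_in = [[0.0] * h for _ in range(w)]
    for px in range(w):
        for py in range(h):
            # In-bounds 3x3 window around the output pixel (px, py).
            window = [(px + dx, py + dy)
                      for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                      if 0 <= px + dx < w and 0 <= py + dy < h]
            for (x, y) in window:
                grad_in[x][y] += grad_out[px][py] / len(window)
    return grad_in

# with grad_out all ones on a 2x2 image, each input receives 4 * (1/4) = 1.0
g = box_filter_bwd(2, 2, [[1.0, 1.0], [1.0, 1.0]])
assert g == [[1.0, 1.0], [1.0, 1.0]]
```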
Again, to understand all the details of the automatic differentiation system, please refer to the
[Automatic Differentiation](autodiff) chapter for a detailed explanation.
## Manually binding kernels
`[AutoPyBindCUDA]` works for most use cases, but in certain situations, it may be necessary to write the *host* function by hand. The host function can also be written in Slang, and `slangtorch` handles its compilation to C++.
Here's the same `square` example from before:
```csharp
// square.slang
float compute_square(float x)
{
return x * x;
}
[CudaKernel]
void square_kernel(TensorView<float> input, TensorView<float> output)
{
uint3 globalIdx = cudaBlockIdx() * cudaBlockDim() + cudaThreadIdx();
if (globalIdx.x >= input.size(0))
return;
float result = compute_square(input[globalIdx.x]);
output[globalIdx.x] = result;
}
```
To manually invoke this kernel, we then need to write a CPU (host) function that defines how the kernel is dispatched. This can be defined in the same Slang file:
```csharp
[TorchEntryPoint]
TorchTensor<float> square(TorchTensor<float> input)
{
var result = TorchTensor<float>.zerosLike(input);
let blockCount = uint3(1);
let groupSize = uint3(result.size(0), result.size(1), 1);
__dispatch_kernel(square_kernel, blockCount, groupSize)(input, result);
return result;
}
```
Here, we mark the function with the `[TorchEntryPoint]` attribute, so it will be compiled to C++ and exported as a python callable.
Since this is a host function, we can perform tensor allocations. For instance, `square()` calls `TorchTensor<float>.zerosLike` to allocate a 2D tensor with the same shape as the input.
`zerosLike` returns a `TorchTensor<float>` object that represents a CPU handle of a PyTorch tensor.
Then we launch `square_kernel` with the `__dispatch_kernel` syntax. Note that we can directly pass
`TorchTensor<float>` arguments to a `TensorView<float>` parameter and the compiler will automatically convert the type and obtain a view into the tensor that can be accessed by the GPU kernel function.
### Calling a `[TorchEntryPoint]` function from Python
You can use the following code to call `square` from Python:
```python
import torch
import slangtorch
m = slangtorch.loadModule("square.slang")
x = torch.randn(2,2)
print(f"X = {x}")
y = m.square(x)
print(f"Y = {y.cpu()}")
```
Result output:
```
X = tensor([[ 0.1407, 0.6594],
[-0.8978, -1.7230]])
Y = tensor([[0.0198, 0.4349],
[0.8060, 2.9688]])
```
### Manual binding for kernel derivatives
The above example demonstrates how to write a simple kernel function in Slang and call it from Python.
Another major benefit of using Slang is that the Slang compiler supports generating backward derivative
propagation functions automatically.
In the following section, we walk through how to use Slang to generate a backward propagation function
for `square`, and expose it to PyTorch as an autograd function.
First, we need to tell the Slang compiler that `square` should be treated as a differentiable function, so that it can generate a backward derivative propagation function for it:
```csharp
[Differentiable]
float square(float x)
{
return x * x;
}
```
This is done by simply adding a `[Differentiable]` attribute to our `square` function.
With that, we can now define `square_bwd_kernel` that performs backward propagation as:
```csharp
[CudaKernel]
void square_bwd_kernel(TensorView<float> input, TensorView<float> grad_out, TensorView<float> grad_propagated)
{
uint3 globalIdx = cudaBlockIdx() * cudaBlockDim() + cudaThreadIdx();
if (globalIdx.x >= input.size(0) || globalIdx.y >= input.size(1))
return;
DifferentialPair<float> dpInput = diffPair(input[globalIdx.xy]);
var gradInElem = grad_out[globalIdx.xy];
bwd_diff(square)(dpInput, gradInElem);
grad_propagated[globalIdx.xy] = dpInput.d;
}
```
Note that the function follows the same structure as `square_kernel`, with the only difference being that
instead of calling `square` to compute the forward value for each tensor element, we call `bwd_diff(square)`,
the automatically generated backward propagation function of `square`.
`bwd_diff(square)` will have the following signature:
```csharp
void bwd_diff_square(inout DifferentialPair<float> dpInput, float dOut);
```
Here the first parameter, `dpInput`, holds a pair of the original and derivative values for `input`, and the second parameter,
`dOut`, is the incoming derivative with respect to some latent variable that we wish to back-propagate through. The resulting
derivative will be stored in `dpInput.d`. For example:
```csharp
// construct a pair where the primal value is 3, and derivative value is 0.
var dp = diffPair(3.0);
bwd_diff(square)(dp, 1.0);
// dp.d is now 6.0
```
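The same calling convention can be mimicked with a tiny Python pair type; this is only an illustration of the contract, not Slang's generated code:

```python
class DiffPair:
    """Minimal stand-in for DifferentialPair<float>."""
    def __init__(self, p, d=0.0):
        self.p = p  # primal value
        self.d = d  # accumulated derivative

def bwd_diff_square(dp, d_out):
    # d/dx (x*x) = 2x, scaled by the incoming derivative d_out
    dp.d += 2.0 * dp.p * d_out

dp = DiffPair(3.0)
bwd_diff_square(dp, 1.0)
# dp.d is now 6.0
```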
Similar to `square`, we can define the host-side function `square_bwd` as:
```csharp
[TorchEntryPoint]
TorchTensor<float> square_bwd(TorchTensor<float> input, TorchTensor<float> grad_out)
{
var grad_propagated = TorchTensor<float>.zerosLike(input);
let blockCount = uint3(1);
let groupSize = uint3(input.size(0), input.size(1), 1);
__dispatch_kernel(square_bwd_kernel, blockCount, groupSize)(input, grad_out, grad_propagated);
return grad_propagated;
}
```
## Builtin Library Support for PyTorch Interop
As shown in the previous sections, Slang defines the `TorchTensor<T>` and `TensorView<T>` types for interop with PyTorch
tensors. `TorchTensor<T>` represents the CPU view of a tensor and provides methods to allocate new tensor objects.
`TensorView<T>` represents the GPU view of a tensor and provides accessors to read and write tensor data.
Following is a list of built-in methods and attributes for PyTorch interop.
### `TorchTensor` methods
#### `static TorchTensor<T> TorchTensor<T>.alloc(uint x, uint y, ...)`
Allocates a new PyTorch tensor with the given dimensions. If `T` is a vector type, the length of the vector is implicitly included as the last dimension.
For example, `TorchTensor<float3>.alloc(4, 4)` allocates a 3D tensor of size `(4,4,3)`.
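The implied-shape rule can be written out as a tiny helper; `implied_shape` is a hypothetical name used only to illustrate the rule, not part of slangtorch:

```python
def implied_shape(dims, vector_length=None):
    """Shape of TorchTensor<T>.alloc(*dims): vector types append their length."""
    return tuple(dims) + ((vector_length,) if vector_length else ())

# TorchTensor<float3>.alloc(4, 4) -> a (4, 4, 3) tensor
assert implied_shape((4, 4), vector_length=3) == (4, 4, 3)
# TorchTensor<float>.alloc(4, 4) -> a (4, 4) tensor
assert implied_shape((4, 4)) == (4, 4)
```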
#### `static TorchTensor<T> TorchTensor<T>.emptyLike(TorchTensor<T> other)`
Allocates a new PyTorch tensor that has the same dimensions as `other` without initializing it.
#### `static TorchTensor<T> TorchTensor<T>.zerosLike(TorchTensor<T> other)`
Allocates a new PyTorch tensor that has the same dimensions as `other` and initialize it to zero.
#### `uint TorchTensor<T>.dims()`
Returns the tensor's dimension count.
#### `uint TorchTensor<T>.size(int dim)`
Returns the tensor's size (in number of elements) at `dim`.
#### `uint TorchTensor<T>.stride(int dim)`
Returns the tensor's stride (in bytes) at `dim`.
### `TensorView` methods
#### `TensorView<T>.operator[uint x, uint y, ...]`
Provides an accessor to the data content of a tensor.
#### `TensorView<T>.operator[vector<uint, N> index]`
Provides an accessor to the data content of a tensor, indexed by a uint vector:
`tensor[uint3(1,2,3)]` is equivalent to `tensor[1,2,3]`.
#### `uint TensorView<T>.dims()`
Returns the tensor's dimension count.
#### `uint TensorView<T>.size(int dim)`
Returns the tensor's size (in number of elements) at `dim`.
#### `uint TensorView<T>.stride(int dim)`
Returns the tensor's stride (in bytes) at `dim`.
#### `void TensorView<T>.fillZero()`
Fills the tensor with zeros. Modifies the tensor in-place.
#### `void TensorView<T>.fillValue(T value)`
Fills the tensor with the specified value. Modifies the tensor in-place.
#### `T* TensorView<T>.data_ptr_at(vector<uint, N> index)`
Returns a pointer to the element at `index`.
#### `void TensorView<T>.InterlockedAdd(vector<uint, N> index, T val, out T oldVal)`
Atomically adds `val` to the element at `index`, returning the original value in `oldVal`.
#### `void TensorView<T>.InterlockedMin(vector<uint, N> index, T val, out T oldVal)`
Atomically computes the min of `val` and the element at `index`. Available for 32 and 64 bit integer types only.
#### `void TensorView<T>.InterlockedMax(vector<uint, N> index, T val, out T oldVal)`
Atomically computes the max of `val` and the element at `index`. Available for 32 and 64 bit integer types only.
#### `void TensorView<T>.InterlockedAnd(vector<uint, N> index, T val, out T oldVal)`
Atomically computes the bitwise and of `val` and the element at `index`. Available for 32 and 64 bit integer types only.
#### `void TensorView<T>.InterlockedOr(vector<uint, N> index, T val, out T oldVal)`
Atomically computes the bitwise or of `val` and the element at `index`. Available for 32 and 64 bit integer types only.
#### `void TensorView<T>.InterlockedXor(vector<uint, N> index, T val, out T oldVal)`
Atomically computes the bitwise xor of `val` and the element at `index`. Available for 32 and 64 bit integer types only.
#### `void TensorView<T>.InterlockedExchange(vector<uint, N> index, T val, out T oldVal)`
Atomically swaps `val` into the element at `index`. Available for `float` and 32/64 bit integer types only.
#### `void TensorView<T>.InterlockedCompareExchange(vector<uint, N> index, T compare, T val)`
Atomically swaps `val` into the element at `index` if the element equals `compare`. Available for `float` and 32/64 bit integer types only.
### `DiffTensorView` methods
#### `DiffTensorView.operator[uint x, uint y, ...]`
Provides an accessor to the data content of a tensor. This method is **differentiable**, and has the same semantics as using `.load()` to get data and `.store()` to set data.
#### `DiffTensorView.operator[vector<uint, N> index]`
Provides an accessor to the data content of a tensor, indexed by a uint vector: `tensor[uint3(1,2,3)]` is equivalent to `tensor[1,2,3]`. This method is **differentiable**, and has the same semantics as using `.load()` to get data and `.store()` to set data.
#### `float DiffTensorView.load(vector<uint, N> index)`
Loads the 32-bit floating point data at the specified multi-dimensional `index`. This method is **differentiable**, and in reverse-mode will perform an atomic-add.
#### `void DiffTensorView.store(vector<uint, N> index, float val)`
Stores the 32-bit floating point value `val` at the specified multi-dimensional `index`. This method is **differentiable**, and in reverse-mode will perform an *atomic exchange* to retrieve the derivative and replace with 0.
#### `float DiffTensorView.loadOnce(vector<uint, N> index)`
Loads the 32-bit floating point data at the specified multi-dimensional `index`. This method is **differentiable**, and uses a simple `store` for the reverse-mode for faster gradient aggregation, but `loadOnce` **must** be used at most once per index. `loadOnce` is ideal for situations where each thread loads data from a unique index, but will cause incorrect gradients when an index may be accessed multiple times.
#### `void DiffTensorView.storeOnce(vector<uint, N> index, float val)`
Stores the 32-bit floating point value `val` at the specified multi-dimensional `index`. This method is **differentiable**, and uses a simple `load` for the reverse-mode for faster gradient loading, but `storeOnce` **must** be used at most once per index. `storeOnce` is ideal for situations where each thread stores data to a unique index, but will cause incorrect gradient propagation when an index may be accessed multiple times.
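The once-per-index restriction matters when two threads touch the same index. A plain-Python sketch of the two reverse-mode strategies shows why: the atomic add used by `load`/`store` accumulates every contribution, while the plain store used by `loadOnce` keeps only the last one (illustrative only, not the generated CUDA):

```python
def reverse_load(grad, index, d, atomic):
    if atomic:           # load(): atomic add, safe for repeated indices
        grad[index] += d
    else:                # loadOnce(): a plain store, last writer wins
        grad[index] = d

grad_atomic, grad_once = {0: 0.0}, {0: 0.0}
for d in (1.0, 1.0):     # two "threads" reading the same index 0
    reverse_load(grad_atomic, 0, d, atomic=True)
    reverse_load(grad_once, 0, d, atomic=False)

assert grad_atomic[0] == 2.0   # correct accumulated gradient
assert grad_once[0] == 1.0     # one contribution silently lost
```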
#### `uint DiffTensorView.size(int dim)`
Returns the underlying primal tensor's size (in number of elements) at `dim`.
#### `uint DiffTensorView.dims()`
Returns the underlying primal tensor's dimension count.
#### `uint DiffTensorView.stride(uint dim)`
Returns the stride of the underlying primal tensor's `dim` dimension.
### CUDA Support Functions
#### `cudaThreadIdx()`
Returns the `threadIdx` variable in CUDA.
#### `cudaBlockIdx()`
Returns the `blockIdx` variable in CUDA.
#### `cudaBlockDim()`
Returns the `blockDim` variable in CUDA.
#### `syncTorchCudaStream()`
Waits for all pending CUDA kernel executions to complete on host.
### Attributes for PyTorch Interop
#### `[CudaKernel]` attribute
Marks a function as a CUDA kernel (maps to a `__global__` function).
#### `[TorchEntryPoint]` attribute
Marks a function for export to Python. Functions marked with `[TorchEntryPoint]` will be accessible from a loaded module returned by `slangtorch.loadModule`.
#### `[CudaDeviceExport]` attribute
Marks a function as a CUDA device function, and ensures the compiler includes it in the generated CUDA source.
#### `[AutoPyBindCUDA]` attribute
Marks a CUDA kernel for automatic binding generation so that it can be invoked from Python without hand-coding the torch entry point. The marked function **must** also be marked with `[CudaKernel]`. If the function is additionally marked with `[Differentiable]`, bindings for the derivative methods are generated as well.
Restriction: methods marked with `[AutoPyBindCUDA]` will not operate on tensors allocated with non-contiguous memory.
## Type Marshalling Between Slang and Python
### Python-CUDA type marshalling for functions using `[AutoPyBindCUDA]`
When using auto-binding, aggregate types like structs are converted to Python `namedtuples` and are made available when using `slangtorch.loadModule`.
```csharp
// mesh.slang
struct Mesh
{
TensorView<float> vertices;
TensorView<int> indices;
};
[AutoPyBindCUDA]
[CudaKernel]
void processMesh(Mesh mesh)
{
/* ... */
}
```
Here, since `Mesh` is used by `processMesh`, the loaded module will provide `Mesh` as a Python `namedtuple` with named fields.
While the `namedtuple` is the preferred way to pass structured arguments, they can also be passed as a Python `dict` or `tuple`:
```python
m = slangtorch.loadModule('mesh.slang')
vertices = torch.tensor()
indices = torch.tensor()
# use namedtuple to provide structured input.
mesh = m.Mesh(vertices=vertices, indices=indices)
m.processMesh(mesh=mesh).launchRaw(blockSize=(32, 32, 1), gridSize=(1, 1, 1))
# use dict to provide input.
mesh = {'vertices': vertices, 'indices':indices}
m.processMesh(mesh=mesh).launchRaw(blockSize=(32, 32, 1), gridSize=(1, 1, 1))
# use tuple to provide input (warning: user responsible for right order)
mesh = (vertices, indices)
m.processMesh(mesh=mesh).launchRaw(blockSize=(32, 32, 1), gridSize=(1, 1, 1))
```
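The three accepted argument forms can be demonstrated with a plain Python `namedtuple`, independently of slangtorch; the `unpack` helper below is hypothetical and only illustrates that the three forms carry the same fields (how slangtorch normalizes them internally may differ):

```python
from collections import namedtuple

Mesh = namedtuple("Mesh", ["vertices", "indices"])

def unpack(mesh):
    """Normalize the three accepted forms to a (vertices, indices) pair."""
    if isinstance(mesh, dict):
        return (mesh["vertices"], mesh["indices"])
    return tuple(mesh)   # namedtuple or plain tuple: positional field order

v, i = [0.0, 1.0], [0, 1]
assert unpack(Mesh(vertices=v, indices=i)) == (v, i)
assert unpack({"vertices": v, "indices": i}) == (v, i)
assert unpack((v, i)) == (v, i)   # caller responsible for the right order
```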
### Python-CUDA type marshalling for functions using `[TorchEntryPoint]`
The return and parameter types of an exported `[TorchEntryPoint]` function can be basic types (e.g. `float`, `int`, etc.), vector types (e.g. `float3`), `TorchTensor<T>` types, array types, or struct types.
When you use a struct or array type in the function signature, it will be exposed as a Python tuple.
For example,
```csharp
struct MyReturnType
{
TorchTensor<float> tensors[3];
float v;
}
[TorchEntryPoint]
MyReturnType myFunc()
{
...
}
```
Calling `myFunc` from Python will return a nested Python tuple of the form
```
[[tensor, tensor, tensor], float]
```
The same transform rules apply to parameter types.

Slang Design and Implementation Notes
=====================================
This directory contains documents that are primarily intended for developers working on the Slang implementation.
They are not intended to be helpful to Slang users.
These documents can only be trusted to reflect the state of the codebase or the plans of their authors at the time they were written. Changes to the implementation are not expected to always come with matching changes to these documents, so some amount of drift is to be expected.
Developers interested in contributing to Slang might want to start with the [Overview](overview.md) document, which describes the overall compilation pipeline that Slang uses and the purpose of the various steps (both implemented and planned).
The [Coding Conventions](coding-conventions.md) document describes the conventions that should be followed in all code added to the Slang project.
The [Interfaces](interfaces.md) document describes the high-level design plan for Slang's interfaces and generics features.
The [Declaration References](decl-refs.md) document is intended to help out developers who are mystified by the heavily used `DeclRef` type in the compiler implementation.
The [Intermediate Representation (IR)](ir.md) document describes the design of Slang's internal IR.
The [Existential Types](existential-types.md) document goes into some detail about what "existential types" are in the context of the Slang language, and explains how we may go about supporting them.
The [Capabilities](capabilities.md) document explains the proposed model for how Slang will support general notions of profile- or capability-based overloading/dispatch.
The [Casting](casting.md) document explains how casting works in the Slang C++ compiler code base.
The [Experimental API Interfaces](experimental.md) document explains how experimental Slang API changes are to be deployed.

Reverse Mode Autodiff (Out of Date)
==================================
This document serves as a design reference for reverse-mode auto-diff in the Slang compiler.
## Reverse-Mode Passes
Rather than implementing reverse-mode as a separate pass, Slang implements this as a series of independent passes:
If a function needs a reverse-mode version generated:
- *Linearize* the function, and all dependencies.
- *Propagate* differential types through the linearized code.
- *Unzip* by moving primal insts to before differential insts.
- *Transpose* the differential insts.
## Linearization (Forward-mode)
### Overview
(This is an incomplete section. More details are coming soon.)
Consider an arbitrary function `float f(float a, float b, float c, ..., z)` which takes N inputs and produces one output `y`. Linearization aims to generate the first-order Taylor expansion of f about _all_ of its inputs.
Mathematically, the forward derivative `fwd_f` represents `df/da * (a_0 - a) + df/db * (b_0 - b) + ...`, where `a_0` is the value at which the Taylor expansion was produced. The quantity `a_0 - a` is known as the 'differential' (for brevity we'll denote them da, db, dc, etc.), and there is at most one differential per input.
Thus, the new function's signature should be `fwd_f(float a, float da, float b, float db, float c, float dc, ...)`. For simplicity, we'll use *pairs* instead of interleaving the original and differential parameters. We use the intrinsic `DifferentialPair<T>` (or for short: `DP<T>`) to denote this.
The signature we use is then `fwd_f(DP<float> a, DP<float> b, DP<float> c)`
An example of linearization:
```C
float f(float a, float b)
{
if (a > 0)
{
return a + b + 2.0 * a * b;
}
else
{
return sqrt(a);
}
}
```
We'll write out the SSA form of this function.
```C
float f_SSA(float a, float b)
{
bool _b1 = a > 0;
if (_b1)
{
float _t1 = a + b;
float _t2 = 2.0 * a;
float _t3 = _t2 * b;
float _t4 = _t1 + _t3;
return _t4;
}
else
{
float _t1 = sqrt(a);
return _t1;
}
}
// Linearized (forward-mode) version:
DP<float> fwd_f_SSA(DP<float> dpa, DP<float> dpb)
{
bool _b1 = dpa.p > 0;
if (_b1)
{
float _t1 = dpa.p + dpb.p;
float _t1_d = dpa.d + dpb.d;
float _t2 = 2.0 * dpa.p;
float _t2_d = 0.0 * dpa.p + 2.0 * dpa.d;
float _t3 = _t2 * dpb.p;
float _t3_d = _t2_d * dpb.p + _t2 * dpb.d;
float _t4 = _t1 + _t3;
float _t4_d = _t1_d + _t3_d;
return DP<float>(_t4, _t4_d);
}
else
{
DP<float> _t1_dp = sqrt_fwd(dpa);
return DP<float>(_t1_dp.p, _t1_dp.d);
}
}
```
In the result, the primal part of the pair holds the original computation, while the differential part computes the dot product of the differentials with the derivatives of the function's output w.r.t each input.
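The hand-linearized code above can be sanity-checked with a generic dual-number sketch in Python: carrying `(p, d)` pairs through each arithmetic operation reproduces the differential that the forward-mode function computes (illustrative only, not compiler output):

```python
class DP:
    """Minimal dual number: primal p, differential d."""
    def __init__(self, p, d=0.0):
        self.p, self.d = p, d
    def __add__(self, o):
        return DP(self.p + o.p, self.d + o.d)
    def __mul__(self, o):
        # product rule: d(xy) = dx*y + x*dy
        return DP(self.p * o.p, self.d * o.p + self.p * o.d)

def fwd_f(a: DP, b: DP) -> DP:
    # the a > 0 branch of f: a + b + 2*a*b
    return a + b + DP(2.0) * a * b

# derivative w.r.t. a at (a, b) = (3, 4): 1 + 2*b = 9
out = fwd_f(DP(3.0, 1.0), DP(4.0, 0.0))
assert out.d == 9.0
```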
## Propagation
This step takes a linearized function and propagates information about which instructions compute a differential and which are part of the primal (original) computation.
Assuming first-order differentiation only:
The approach is to mark any instruction that extracts the differential from a differential pair as a differential. Then any instruction that uses a differential is itself marked as a differential, and so on. The only exception is the call instruction, which is either non-differentiable (do nothing) or differentiable and returns a pair (follow the same process).
Here's the above example with propagated type information. We use `float.D` to denote intermediaries that have been marked as differential, and expand everything so that each line has a single operation:
```C
DP<float> f_SSA_Proped(DP<float> dpa, DP<float> dpb)
{
bool _b1 = dpa.p > 0;
if (_b1)
{
float _t1 = dpa.p + dpb.p;
float.D _q1_d = dpa.d;
float.D _q2_d = dpb.d;
float.D _t1_d = _q1_d + _q2_d;
float _t2 = 2.0 * dpa.p;
float.D _q2_d = dpa.d;
float.D _q3_d = 2.0 * dpa.d;
float _q4 = dpa.p;
float.D _q4_d = 0.0 * dpa.p;
float.D _t2_d = _q4_d + _q3_d;
float _t3 = _t2 * dpb.p;
float _q5 = dpb.p;
float.D _q6_d = _q5 * _t2_d;
float.D _q7_d = dpb.d;
float.D _q8_d = _t2 * _q7_d;
float.D _t3_d = _q6_d + _q8_d;
float _t4 = _t1 + _t3;
float.D _t4_d = _t1_d + _t3_d;
return DP<float>(_t4, _t4_d);
}
else
{
DP<float> _t1_dp = sqrt_fwd(dpa);
float _q1 = _t1_dp.p;
float.D _q1_d = _t1_dp.d;
return DP<float>(_q1, _q1_d);
}
}
```
## Unzipping
This is a fairly simple process when there is no control flow. We simply move all non-differential instructions to before the first differential instruction.
When there is control flow, we need to be a bit more careful: the key is to *replicate* the control flow graph once for primal and once for the differential.
Here's the previous example unzipped:
```C
DP<float> f_SSA_Proped(DP<float> dpa, DP<float> dpb)
{
bool _b1 = dpa.p > 0;
float _t1, _t2, _q4, _t3, _q5, _t3_d, _t4, _q1;
if (_b1)
{
_t1 = dpa.p + dpb.p;
_t2 = 2.0 * dpa.p;
_q4 = dpa.p;
_t3 = _t2 * dpb.p;
_q5 = dpb.p;
_t4 = _t1 + _t3;
}
else
{
_q1 = sqrt_fwd(DP<float>(dpa.p, 0.0));
}
// Note here that we have to 'store' all the intermediaries
// _t1, _t2, _q4, _t3, _q5, _t3_d, _t4 and _q1. This is fundamentally
// the tradeoff between fwd_mode and rev_mode
if (_b1)
{
float.D _q1_d = dpa.d;
float.D _q2_d = dpb.d;
float.D _t1_d = _q1_d + _q2_d;
float.D _q2_d = dpa.d;
float.D _q3_d = 2.0 * dpa.d;
float.D _q4_d = 0.0 * dpa.p;
float.D _t2_d = _q4_d + _q3_d;
float.D _q6_d = _q5 * _t2_d;
float.D _q7_d = dpb.d;
float.D _q8_d = _t2 * _q7_d;
float.D _t3_d = _q6_d + _q8_d;
float.D _t4_d = _t1_d + _t3_d;
return DP<float>(_t4, _t4_d);
}
else
{
DP<float> _t1_dp = sqrt_fwd(dpa);
float.D _q1_d = _t1_dp.d;
return DP<float>(_q1, _q1_d);
}
}
```
## Transposition
### Overview
This transposition pass _assumes_ that the provided function is linear in its differentials.
It is out of the scope of this project to attempt to enforce that constraint for user-defined differential code.
For transposition, we walk all differential instructions in reverse, starting from the return statement, and apply the rules below.
We'll keep an accumulator dictionary `Dictionary<IRInst, IRInst> accMap` holding assignments for
intermediaries that don't have concrete variables. When we add a pair (A, C) and (A, B) already exists, the entry becomes (A, ADD(C, B)). (ADD will be replaced with a call to `T.dadd` for a generic type T.)
- If `inst` is a `RETURN(A)`, add pair `(A, d_out)` to `accMap`
- If an instruction is `MUL(P, D)` where D is the differential, add pair `(D, MUL(P, accMap[this_inst]))` to `accMap`
- If an instruction is `ADD(D1, D2)`, where both D1 and D2 are differentials (this is the only config that should occur), then add pair `(D1, accMap[this_inst])` to `accMap`
- If an instruction is `CALL(f_fwd, (P1, D1), (P2, D2), ...)`, create variables D1v, D2v, ... for D1, D2, ..., then replace with `CALL(f_rev, (P1, D1v), (P2, D2v), ..., accMap[this_inst])`, and finally add pairs `(D1, LOAD[D1v]), (D2, LOAD[D2v]), ...` to `accMap`
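The `accMap` merge rule can be sketched in Python, with plain float addition standing in for `T.dadd`:

```python
def acc_add(acc_map, inst, contribution):
    """Add (inst, contribution); merge with ADD if inst already has an entry."""
    if inst in acc_map:
        acc_map[inst] = acc_map[inst] + contribution   # stands in for T.dadd
    else:
        acc_map[inst] = contribution

acc = {}
acc_add(acc, "da", 3.0)
acc_add(acc, "da", 4.0)   # second use of the same differential merges
assert acc == {"da": 7.0}
```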
```C
void f_SSA_Rev(inout DP<float> dpa, inout DP<float> dpb, float d_out)
{
bool _b1 = dpa.p > 0;
float _t1, _t2, _q4, _t3, _q5, _t3_d, _t4, _q1;
if (_b1)
{
_t1 = dpa.p + dpb.p;
_t2 = 2.0 * dpa.p;
_q4 = dpa.p;
_t3 = _t2 * dpb.p;
_q5 = dpb.p;
_t4 = _t1 + _t3;
}
else
{
_q1 = sqrt_fwd(DP<float>(dpa.p, 0.0));
}
// Note here that we have to 'store' all the intermediaries
// _t1, _t2, _q4, _t3, _q5, _t3_d, _t4 and _q1. This is fundamentally
// the tradeoff between fwd_mode and rev_mode
if (_b1)
{
float.D _t4_rev = d_out;
float.D _t1_rev = _t4_rev;
float.D _t3_rev = _t4_rev;
float.D _q8_rev = _t3_rev;
float.D _q6_rev = _t3_rev;
float.D _q7_rev = _t2 * _q8_rev;
dpb.d += _q7_rev;
float.D _t2_rev = _q5 * _q6_rev;
float.D _q4_rev = _t2_rev;
float.D _q3_rev = _t2_rev;
dpa.d += 2.0 * _q3_rev;
float.D _q1_rev = _t1_rev;
float.D _q2_rev = _t1_rev;
dpb.d += _q2_rev;
dpa.d += _q1_rev;
}
else
{
float.D _q1_rev = d_out;
DP<float> dpa_copy;
sqrt_rev(dpa_copy, _q1_rev);
dpa.d += dpa_copy.d;
}
}
```
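As a numeric sanity check on the transposed code: for the `a > 0` branch, `f = a + b + 2*a*b`, so `df/da = 1 + 2b` and `df/db = 1 + 2a`. A direct Python translation of the reverse block (hand-written for illustration, not compiler output) reproduces that:

```python
def f_rev(a, b, d_out):
    """Reverse block for the a > 0 branch of f = a + b + 2*a*b."""
    # primal pass (stored intermediaries)
    _t2, _q5 = 2.0 * a, b
    da = db = 0.0
    # transposed differential pass, following the rules above
    _t1_rev = _t3_rev = d_out
    db += _t2 * _t3_rev       # through _t3 = _t2 * b
    _t2_rev = _q5 * _t3_rev   # through _t3 = _t2 * b
    da += 2.0 * _t2_rev       # through _t2 = 2 * a
    da += _t1_rev             # through _t1 = a + b
    db += _t1_rev
    return da, db

da, db = f_rev(3.0, 4.0, 1.0)
assert (da, db) == (1.0 + 2.0 * 4.0, 1.0 + 2.0 * 3.0)  # (9.0, 7.0)
```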

<!--The goal of this set of documents is to describe the design of Slang's automatic differentiation passes, along with the mechanisms & passes used to support various features. -->
This documentation is intended for Slang contributors and is written from a compiler engineering point of view. For Slang users, see the user-guide at this link: [https://shader-slang.com/slang/user-guide/autodiff.html](https://shader-slang.com/slang/user-guide/autodiff.html)
## What is Automatic Differentiation?
Before diving into the design of the automatic differentiation (for brevity, we will call it 'auto-diff') passes, it is important to understand the end goal of what auto-diff tries to achieve.
The overarching goal of Slang's auto-diff is to enable the user to compute derivatives of a given shader program's or function's output w.r.t its input parameters. This critical compiler feature enables users to quickly use their shaders with gradient-based parameter optimization algorithms, which form the backbone of modern machine learning systems. It enables users to train and deploy graphics systems that contain ML primitives (like multi-layer perceptrons, or MLPs) or use their shader programs as differentiable primitives within larger ML pipelines.
### More Resources
Here are some links to resources that talk more about differentiable programming from a more mathematical perspective:
1. UCSD CSE 291 (Spring 2024): https://cseweb.ucsd.edu/~tzli/cse291/sp2024/
2. UW CSE 5990 (Winter 2024): https://sites.google.com/cs.washington.edu/cse-599o-dppl
## Definition of Derivatives
This section is based off of these slides: https://cseweb.ucsd.edu/~tzli/cse291/sp2024/lectures/03_forward_mode.pdf.
Here, we establish the mathematical definition of derivatives, starting with a simple 1D case (function with a single input and output), and extending to the general case of functions mapping multiple inputs to multiple outputs.
To avoid confusion, we will denote mathematical functions using LaTeX italic script ($f$, $g$, etc..) and programs that compute these functions with markdown code (`f`, `g`, etc..)
### Derivatives of scalar (1D) functions
Consider the simplest case: a smooth scalar mathematical function that maps a real number to another real number:
$$f : \mathbb{R} \to \mathbb{R}$$
There are several definitions for a derivative, but we will use the definition that a derivative is the *closest linear approximation* of the output function at a given input location.
Concretely, given a specific input $x$, we can create a linear approximation of the function $f$ around $x$ as follows:
$$ f(x + dx) \approx f(x) + Df(x) \cdot dx $$
<!--// TODO: Add image here.-->
This can also be understood as a geometric 'tangent' to the function at $x$. $Df(x)$ is the slope of $f$ at $x$, i.e. $\frac{\partial f}{\partial x}$, and $dx$ is the perturbation away from $x$. Our approximation is linear as a function of the perturbation $dx$. Note that no matter how non-linear or complex the underlying function $f(x)$ is, the approximation is always linear (this property becomes very important later).
### Forward-mode derivative functions
Now consider a concrete program `f` that computes some function.
```C
// Computes square of x
float f(float x)
{
return x * x;
}
```
What should its derivative program look like? We need the output $f(x)$ and the product of the derivative at $x$, $Df(x)$, with the differential $dx$.
In Slang, we put both of these together into a single function, called the *forward-mode derivative* function, which takes in a pair $(x, dx)$ and returns a pair $(f(x), Df(x)\cdot dx)$. Note that in auto-diff literature, this is also often referred to as the *total derivative* function.
```C
DifferentialPair<float> fwd_f(DifferentialPair<float> dpx)
{
float x = dpx.getPrimal(); // Can also be accessed via property dpx.p
float dx = dpx.getDifferential(); // Can also be accessed via property dpx.d
return makePair(x * x, (2 * x) * dx);
}
```
Note that `(2 * x)` is the multiplier corresponding to $Df(x)$. We refer to $x$ and $f(x)$ as "*primal*" values and the perturbations $dx$ and $Df(x)\cdot dx$ as "*differential*" values. The reason for this separation is that the "*differential*" output values are always linear w.r.t their "*differential*" inputs.
As the name implies, `DifferentialPair<T>` is a special pair type used by Slang to hold values and their corresponding differentials.
### Forward-mode derivatives for higher-dimensional functions
In practice, most functions tend to have multiple inputs and multiple outputs, i.e. $f: \mathbb{R}^N \to \mathbb{R}^M$
The definition above can be extended to higher dimensions, using the closest-linear-approximation idea. The main difference is that the derivative function represents a hyperplane rather than a line.
Effectively, we want our forward-mode derivative to compute the following:
$$ f(\mathbf{x} + \mathbf{dx}) \approx f(\mathbf{x}) + \langle Df(\mathbf{x}),\mathbf{dx}\rangle $$
Here, the input and its differential can be represented as vector quantities $\mathbf{x}, \mathbf{dx} \in \mathbb{R}^N$, the multiplier $Df(\mathbf{x})$ (also known as the *Jacobian* matrix) is an $M \times N$ matrix, and $\langle \cdot,\cdot \rangle$ denotes the inner product (i.e. matrix-vector multiplication)
Here's an example of a Slang function taking in two inputs (N=2) and generating one output (M=1)
```C
// Compute length of hypotenuse.
float f(float x, float y)
{
return sqrt(x * x + y * y);
}
```
and its forward-mode derivative:
```C
// Closest linear approximation at x, y
DifferentialPair<float> fwd_f(DifferentialPair<float> dpx, DifferentialPair<float> dpy)
{
float x = dpx.p;
float y = dpy.p;
float dx = dpx.d;
    float dy = dpy.d;
return DifferentialPair<float>(
sqrt(x * x + y * y), // f(x, y)
        (x * dx + y * dy) / sqrt(x * x + y * y)); // <Df(x,y), dx>
}
```
Important note: the forward-mode function only needs to compute the inner product $\langle Df(\mathbf{x}),\mathbf{dx} \rangle$. The Jacobian matrix itself never needs to be fully materialized. This is a key design element of automatic differentiation, one which allows it to scale to huge input/output counts.
### Building Blocks: Forward-mode derivatives compose in forward order of execution.
In practice, we compute forward-mode derivatives of a complex function by decomposing them into constituent functions (or in compiler-speak: instructions) and composing the forward-mode derivative of each piece in the **same** order.
This is because each forward-mode derivative is a 'right-side' product (i.e. the product of a Jacobian matrix with a vector)
Here's an example of this in action (consider a complex function $h$ composed of $f$ and $g$):
$$ h(\mathbf{x}) = f(g(\mathbf{x})) $$
Its forward-mode derivative is then:
$$ \langle Dh(\mathbf{x}), \mathbf{dx}\rangle = \big\langle Df(g(\mathbf{x})), \langle Dg(\mathbf{x}), \mathbf{dx}\rangle\big\rangle $$
which is the forward-mode derivative of the outer function $f$ evaluated on the result of the forward-mode derivative of the inner function $g$.
An example of this in Slang code:
```C
// Compute square.
float sqr(float x)
{
return x * x;
}
// Compute length of hypotenuse.
float f(float x, float y)
{
float x_sqr = sqr(x);
    float y_sqr = sqr(y);
return sqrt(x_sqr + y_sqr);
}
```
The resulting derivative of `f` can be computed by composition:
```C
// Forward-mode derivative of sqr()
DifferentialPair<float> fwd_sqr(DifferentialPair<float> dpx)
{
float x = dpx.getPrimal();
float dx = dpx.getDifferential();
return DifferentialPair<float>(x * x, 2 * x * dx);
}
// Forward-mode derivative of f()
DifferentialPair<float> fwd_f(DifferentialPair<float> dpx, DifferentialPair<float> dpy)
{
DifferentialPair<float> dp_x_sqr = fwd_sqr(dpx);
DifferentialPair<float> dp_y_sqr = fwd_sqr(dpy);
float x_sqr = dp_x_sqr.getPrimal();
float y_sqr = dp_y_sqr.getPrimal();
float x_sqr_d = dp_x_sqr.getDifferential();
float y_sqr_d = dp_y_sqr.getDifferential();
return DifferentialPair<float>(
sqrt(x_sqr + y_sqr),
        (x_sqr_d + y_sqr_d) / (2.0 * sqrt(x_sqr + y_sqr)));
}
```
### Tip: Extracting partial derivatives from a forward-mode derivative (i.e. a 'total' derivative)
As we discussed above, forward-mode derivatives compute $\langle Df(\mathbf{x}),\mathbf{dx}\rangle$ rather than what you may be used to seeing in a calculus course (e.g. partial derivatives like $\frac{\partial f}{\partial x}$).
In fact, the forward-mode derivative is simply a sum of the partial derivatives w.r.t each input parameter, each multiplied by its differential perturbation: $\frac{\partial f}{\partial x} \cdot dx + \frac{\partial f}{\partial y} \cdot dy$. This is the reason for the alternative name: *total derivative*.
Thus, each partial derivative can be obtained by setting that input's differential to 1 (and 0 for all the others)
Example:
```C
// Compute partial derivative w.r.t x (pass dx=1.0)
float df_dx = fwd_f(DifferentialPair<float>(x, 1.0), DifferentialPair<float>(y, 0.0)).d;
// Compute partial derivative w.r.t y (pass dy=1.0)
float df_dy = fwd_f(DifferentialPair<float>(x, 0.0), DifferentialPair<float>(y, 1.0)).d;
```
### Tip: Testing forward-mode derivatives using the first principles of calculus (i.e. the *finite difference* method)
In Calculus, partial derivatives of a function are often defined in a 'black box' manner using limits, by perturbing a single parameter by an infinitesimal amount:
$$ \frac{\partial f}{\partial x} = \lim_{dx\to 0} \frac{f(x + dx) - f(x - dx)}{2 * dx} $$
At the moment, we cannot leverage programming languages to compute true infinitesimal limits, but we can replace $dx \to 0$ with a sufficiently small $\epsilon$, leading to the following 'test' to check whether derivatives produced by automatic differentiation match their true mathematical expected values.
Here's an example of using this idea to test functions (many autodiff tests were written this way)
```C
// Compute partial derivative w.r.t x analytically
float df_dx_ad = fwd_f(DifferentialPair<float>(x, 1.0), DifferentialPair<float>(y, 0.0)).d;
// Compute partial derivative w.r.t x through the finite difference (FD) method.
float eps = 1e-4;
float df_dx_fd = (f(x + eps, y) - f(x - eps, y)) / (2 * eps);
// If computed correctly, df_dx_ad and df_dx_fd are very close.
```
**Caveats:**
Since the finite difference method only produces a biased estimate of the derivative, the result is only numerically *close* to the auto-diff-based result. Poorly behaved functions (those that change rapidly, or are discontinuous or otherwise non-differentiable) will result in an (expected) mismatch between FD and AD results.
## Reverse-mode derivative functions
This section is based off of these slides: https://cseweb.ucsd.edu/~tzli/cse291/sp2024/lectures/05_reverse_mode.pdf.
### Motivation: Challenges with scaling forward-mode derivatives
A big problem with forward-mode derivatives is their inability to scale to large parameter counts.
Machine learning pipelines often compute derivatives of a large complex pipeline with millions or even billions of input parameters, but a single output value, i.e. the *loss* or *objective* function, frequently denoted by $\mathcal{L}$.
Computing $\frac{\partial \mathcal{L}}{\partial x_i}$ for $N$ inputs $x_i$ using the one-hot vector approach will involve invoking the forward-mode derivative function $N$ times.
The reason for this limitation is that forward-mode derivatives pass derivatives from the inputs through to the outputs by computing the dot-product $\langle Df(\mathbf{x}),\mathbf{dx}\rangle$.
Instead, we employ a different approach called the reverse-mode derivative, which propagates differentials *backwards* from outputs to inputs.
### Key Idea: Generate code to compute $\langle \frac{\partial \mathcal{L}}{\partial f}, Df(\mathbf{x})\rangle$ rather than $\langle Df(\mathbf{x}),\mathbf{dx}\rangle$
The fundamental building block of reverse-mode derivatives is the **left-side inner product**: the product of a vector of derivatives w.r.t outputs $\frac{\partial \mathcal{L}}{\partial f}$ with the Jacobian matrix $Df(\mathbf{x})$.
An important thing to keep in mind is that it does not necessarily matter what the scalar quantity $\mathcal{L}$ is. The goal of this product is to propagate the derivatives of any scalar value $\mathcal{L}$ w.r.t output vector $f(\mathbf{x})$ (i.e., $\frac{\partial \mathcal{L}}{\partial f}$) into derivatives of that same scalar value $\mathcal{L}$ w.r.t the input vector $\mathbf{x}$ (i.e., $\frac{\partial \mathcal{L}}{\partial \mathbf{x}}$).
Here's an example of a Slang function computing the `reverse-mode derivative`.
```C
// Compute length of hypotenuse
float f(float x, float y)
{
return sqrt(x * x + y * y);
}
// Reverse-mode derivative of f. dOutput represents the derivative dL/dOutput of
// some downstream scalar value L w.r.t this function's output.
void rev_f(inout DifferentialPair<float> dpx, inout DifferentialPair<float> dpy, float dOutput)
{
float x = dpx.getPrimal();
float y = dpy.getPrimal();
float t = 1.0 / (sqrt(x * x + y * y));
dpx = DifferentialPair<float>(
x, // The primal part of the return value is *always* copied in from the input as-is.
dOutput * x * t); // The differential part for x is the derivative dL/dx computed as
// (dL/dOutput) * (dOutput/dx), where dOutput/dx = x / sqrt(x*x+y*y).
dpy = DifferentialPair<float>(
y,
dOutput * y * t); // The differential part for y is the derivative dL/dy computed as
// (dL/dOutput) * (dOutput/dy), where dOutput/dy = y / sqrt(x*x+y*y).
}
```
Note that `rev_f` accepts derivatives w.r.t the output value as the input, and returns derivatives w.r.t inputs as its output (through `inout` parameters). `rev_f` still needs the primal values `x` and `y` to compute the derivatives, so those are still passed in as an input through the primal part of the differential pair.
Also note that the reverse-mode derivative function does not have to compute the primal result value (its return is void). The reason for this is a matter of convenience: reverse-mode derivatives are often invoked after all the primal functions, and there is typically no need for these values. We go into more detail on this topic in the checkpointing chapter.
The reverse mode function can be used to compute both `dOutput/dx` and `dOutput/dy` with a single invocation (unlike the forward-mode case where we had to invoke `fwd_f` once for each input)
```C
DifferentialPair<float> dpx = makePair<float>(x, 0.f); // Initialize diff-value to 0 (not necessary)
DifferentialPair<float> dpy = makePair<float>(y, 0.f); // Initialize diff-value to 0 (not necessary)
rev_f(dpx, dpy, 1.0); // Pass 1.0 for dL/dOutput so that the results are (1.0 * dOutput/dx) and (1.0 * dOutput/dy)
float doutput_dx = dpx.getDifferential();
float doutput_dy = dpy.getDifferential();
```
### Extension to multiple outputs
The extension to multiple outputs is fairly natural. Each output gets a separate input for its derivative.
Here is an example:
```C
// Computation involving multiple inputs and outputs.
float2 f_multi_output(float x, float y)
{
return float2(
x * x,
x + y);
}
// Reverse-mode derivative of 'f_multi_output'. The derivative of the outputs is also a vector quantity
// (type follows from return type of f_multi_output)
void rev_f_multi_output(inout DifferentialPair<float> dpx, inout DifferentialPair<float> dpy, float2 dOut)
{
float x = dpx.getPrimal();
float y = dpy.getPrimal();
dpx = DifferentialPair<float>(x, dOut[0] * 2 * x + dOut[1]);
    dpy = DifferentialPair<float>(y, dOut[1]);
}
```
### Jacobian method: Generate forward- and reverse-mode derivatives from first principles.
A simple way to figure out what the generated reverse (or forward) derivative function is supposed to compute is to write down the entire Jacobian matrix. That is, write down the partial derivative of each output w.r.t each input
$$
D\mathbf{f}(\mathbf{x}) = \begin{bmatrix}
\partial f_0 / \partial x & \partial f_0 / \partial y \\
\partial f_1 / \partial x & \partial f_1 / \partial y \\
\end{bmatrix} =
\begin{bmatrix}
2x & 0.0 \\
1.0 & 1.0 \\
\end{bmatrix}
$$
The **reverse-mode derivative**'s outputs should match the left-product of this matrix with the vector of derivatives w.r.t outputs:
$$ \left\langle \frac{\partial \mathcal{L}}{\partial \mathbf{f}}, D\mathbf{f}(\mathbf{x})\right\rangle =
\begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial f_0} & \frac{\partial \mathcal{L}}{\partial f_1}
\end{bmatrix}
\begin{bmatrix}
2x & 0.0 \\
1.0 & 1.0 \\
\end{bmatrix} =
\begin{bmatrix} \left(\frac{\partial \mathcal{L}}{\partial f_0} \cdot 2x + \frac{\partial \mathcal{L}}{\partial f_1}\right) & \frac{\partial \mathcal{L}}{\partial f_1} \end{bmatrix}
$$
and the **forward-mode derivative**'s outputs should match the right-product of this matrix with the vector of differentials of the inputs:
$$ \langle D\mathbf{f}(\mathbf{x}), d\mathbf{x}\rangle =
\begin{bmatrix}
2x & 0.0 \\
1.0 & 1.0 \\
\end{bmatrix}
\begin{bmatrix}
dx \\ dy
\end{bmatrix} =
\begin{bmatrix} 2x \cdot dx \\ dx + dy \end{bmatrix}
$$
Note that when we generate derivative code in practice, we do not materialize the full Jacobian matrix, and instead use the composition property to chain together derivatives at the instruction level.
However, the resulting code is equivalent to the Jacobian method (mathematically), and it is a good, analytical way to confirm that the generated code is indeed correct (or when thinking about what the derivative of a particular instruction/set of instructions should be)
### Building Blocks: Reverse-mode derivatives compose in reverse order of execution.
A consequence of using the 'left-side inner product' is that derivatives of a composite function must be computed in the reverse of the order of primal computation.
Here's an example of a composite function $h$ (similar to the example used in forward-mode building blocks):
$$ h(\mathbf{x}) = f(g(\mathbf{x})) $$
where (for brevity):
$$ \mathbf{y} = g(\mathbf{x}) $$
The reverse-mode derivative function for $h$ can be written as the composition of the reverse-mode derivatives of $f$ and $g$
$$ \left\langle \frac{\partial L}{\partial h}, Dh(\mathbf{x})\right\rangle = \left\langle \left\langle \frac{\partial L}{\partial h}, Df(\mathbf{y})\right\rangle , Dg(\mathbf{x})\right\rangle $$
Note the 'backward' order here. We must first pass the derivatives through the outer function $f$, and then pass the result through the inner function $g$ to compute derivatives w.r.t inner-most inputs $\mathbf{x}$. This process of passing derivatives backwards is often referred to as *backpropagation*.
A more concrete Slang example of the same:
```C
// Compute square
float sqr(float x)
{
return x * x;
}
// Compute length of hypotenuse
float f(float x, float y)
{
return sqrt(sqr(x) + sqr(y));
}
```
The derivative functions are then:
```C
void rev_sqr(inout DifferentialPair<float> dpx, float dOutput)
{
    float x = dpx.getPrimal();
    dpx = DifferentialPair<float>(x, dOutput * 2 * x);
}
void rev_f(inout DifferentialPair<float> dpx, inout DifferentialPair<float> dpy, float dOut)
{
    float x = dpx.getPrimal();
    float y = dpy.getPrimal();
    float t = 0.5f / sqrt(x * x + y * y);
    float d_xsqr = t * dOut; // Calculate derivatives w.r.t output of sqr(x)
    float d_ysqr = t * dOut; // Calculate derivatives w.r.t output of sqr(y)
    rev_sqr(dpx, d_xsqr); // Propagate to x
    rev_sqr(dpy, d_ysqr); // Propagate to y
}
```
When comparing `rev_f`'s implementation to `fwd_f`, note the order of computing derivative w.r.t `sqr` (in `rev_f`, `rev_sqr` is called at the end, while in `fwd_f` it is called at the beginning)

This document details auto-diff-related decorations that are lowered in to the IR to help annotate methods with relevant information.
## `[Differentiable]`
The `[Differentiable]` attribute is used to mark functions as being differentiable. The auto-diff process will only touch functions that are marked explicitly as `[Differentiable]`. All other functions are considered non-differentiable and calls to such functions from a differentiable function are simply copied as-is with no transformation.
Further, only `[Differentiable]` methods are checked during the derivative data-flow pass. This decorator is translated into `BackwardDifferentiableAttribute` (which implies both forward and backward differentiability), and then lowered into the IR as `OpBackwardDifferentiableDecoration`
**Note:** `[Differentiable]` was previously implemented as two separate decorators `[ForwardDifferentiable]` and `[BackwardDifferentiable]` to denote differentiability with each type of auto-diff transformation. However, these are now **deprecated**. The preferred approach is to use only `[Differentiable]`
`fwd_diff` and `bwd_diff` cannot be directly called on methods that don't have the `[Differentiable]` tag (will result in an error). If non-`[Differentiable]` methods are called from within a `[Differentiable]` method, they must be wrapped in `no_diff()` operation (enforced by the [derivative data-flow analysis pass](./types.md#derivative-data-flow-analysis) )
### `[Differentiable]` for `interface` Requirements
The `[Differentiable]` attribute can also be used to decorate interface requirements. In this case, the attribute is handled in a slightly different manner, since we do not have access to the concrete implementations.
The process is roughly as follows:
1. During the semantic checking step, when checking a method that is an interface requirement (in `checkCallableDeclCommon` in `slang-check-decl.cpp`), we check if the method has a `[Differentiable]` attribute
2. If yes, we create a set of new method declarations, one for the forward-mode derivative (`ForwardDerivativeRequirementDecl`) and one for the reverse-mode derivative (`BackwardDerivativeRequirementDecl`), with the appropriate translated function types, and insert them into the same interface.
3. Insert a new member into the original method to reference the new declarations (`DerivativeRequirementReferenceDecl`)
4. When lowering to IR, the `DerivativeRequirementReferenceDecl` member is converted into a custom derivative reference by adding the `OpBackwardDerivativeDecoration(deriv-fn-req-key)` and `OpForwardDerivativeDecoration(deriv-fn-req-key)` decorations on the primal method's requirement key.
Here is an example of what this would look like:
```C
interface IFoo
{
[Differentiable]
float bar(float);
};
// After checking & lowering
interface IFoo_after_checking_and_lowering
{
[BackwardDerivative(bar_bwd)]
[ForwardDerivative(bar_fwd)]
float bar(float);
void bar_bwd(inout DifferentialPair<float>, float);
DifferentialPair<float> bar_fwd(DifferentialPair<float>);
};
```
**Note:** All conforming types must _also_ declare their corresponding implementations as differentiable so that their derivative implementations are synthesized to match the interface signature. In this sense, the `[Differentiable]` attribute is part of the function's signature, so a `[Differentiable]` interface requirement can only be satisfied by a `[Differentiable]` function implementation
### `[TreatAsDifferentiable]`
In large codebases where some interfaces may have several possible implementations, it may not be reasonable to have to mark all possible implementations with `[Differentiable]`, especially if certain implementations use hacks or workarounds that need additional consideration before they can be marked `[Differentiable]`
In such cases, we provide the `[TreatAsDifferentiable]` decoration (AST node: `TreatAsDifferentiableAttribute`, IR: `OpTreatAsDifferentiableDecoration`), which instructs the auto-diff passes to construct an 'empty' function that returns a 0 (or 0-equivalent) for the derivative values. This allows the signature of a `[TreatAsDifferentiable]` function to match a `[Differentiable]` requirement without actually having to produce a derivative.
## Custom derivative decorators
In many cases, it is desirable to manually specify the derivative code for a method rather than let the auto-diff pass synthesize it from the method body. This is usually desirable if:
1. The body of the method is too complex, and there is a simpler, mathematically equivalent way to compute the same value (often the case for intrinsics like `sin(x)`, `arccos(x)`, etc..)
2. The method involves global/shared memory accesses, and synthesized derivative code may cause race conditions or be very slow due to overuse of synchronization. For this reason Slang assumes global memory accesses are non-differentiable by default, and requires that the user (or the core module) define separate accessors with different derivative semantics.
The Slang front-end provides two sets of decorators to facilitate this:
1. To reference a custom derivative function from a primal function: `[ForwardDerivative(fn)]` and `[BackwardDerivative(fn)]` (AST Nodes: `ForwardDerivativeAttribute`/`BackwardDerivativeAttribute`, IR: `OpForwardDerivativeDecoration`/`OpBackwardDerivativeDecoration`), and
2. To reference a primal function from its custom derivative function: `[ForwardDerivativeOf(fn)]` and `[BackwardDerivativeOf(fn)]` (AST Nodes: `ForwardDerivativeAttributeOf`/`BackwardDerivativeAttributeOf`). These attributes are useful to provide custom derivatives for existing methods in a different file without having to edit/change that module. For instance, we use `diff.meta.slang` to provide derivatives for the core module functions in `hlsl.meta.slang`. When lowering to IR, these references are placed on the target (primal function). That way both sets of decorations are lowered on the primal function.
These decorators also work on generically defined methods, as well as struct methods. Similar to how function calls work, these decorators also work on overloaded methods (and reuse the `ResolveInvoke` infrastructure to perform resolution)
### Checking custom derivative signatures
To ensure that the user-provided derivatives agree with the expected signature, as well as resolve the appropriate method when multiple overloads are available, we check the signature of the custom derivative function against the translated version of the primal function. This currently occurs in `checkDerivativeAttribute()`/`checkDerivativeOfAttribute()`.
The checking process re-uses existing infrastructure from `ResolveInvoke`, by constructing a temporary invoke expr to call the user-provided derivative using a set of 'imaginary' arguments according to the translated type of the primal method. If `ResolveInvoke` is successful, the provided derivative signature is considered to be a match. This approach also automatically allows us to resolve overloaded methods, account for generic types and type coercion.
## `[PrimalSubstitute(fn)]` and `[PrimalSubstituteOf(fn)]`
In some cases, we face the opposite problem that inspired custom derivatives. That is, we want the compiler to auto-synthesize the derivative from the function body, but there _is_ no function body to translate.
This frequently occurs with hardware intrinsic operations that are lowered into special op-codes that map to hardware units, such as texture sampling & interpolation operations.
However, these operations do have reference 'software' implementations which can be used to produce the derivative.
To allow user code to use the fast hardware intrinsics for the primal pass, but use synthesized derivatives for the derivative pass, we provide decorators `[PrimalSubstitute(ref-fn)]` and `[PrimalSubstituteOf(orig-fn)]` (AST Node: `PrimalSubstituteAttribute`/`PrimalSubstituteOfAttribute`, IR: `OpPrimalSubstituteDecoration`), that can be used to provide a reference implementation for the auto-diff pass.
Example:
```C
[PrimalSubstitute(sampleTexture_ref)]
float sampleTexture(TexHandle2D tex, float2 uv)
{
// Hardware intrinsics
}
float sampleTexture_ref(TexHandle2D tex, float2 uv)
{
// Reference SW implementation.
}
void sampleTexture_bwd(TexHandle2D tex, inout DifferentialPair<float2> dp_uv, float dOut)
{
    // Backward derivative code synthesized using the reference implementation.
}
```
The implementation of `[PrimalSubstitute(fn)]` is relatively straightforward. When the transcribers are asked to synthesize a derivative of a function, they check for a `OpPrimalSubstituteDecoration`, and swap the current function out for the substitute function before proceeding with derivative synthesis.

This documentation is intended for Slang contributors and is written from a compiler engineering point of view. For Slang users, see the user-guide at this link: [https://shader-slang.com/slang/user-guide/autodiff.html](https://shader-slang.com/slang/user-guide/autodiff.html)
Before diving into this document, please review the document on [Basics](./basics.md) for the fundamentals of automatic differentiation.
# Components of the Type System
Here we detail the main components of the type system: the `IDifferentiable` interface to define differentiable types, the `DifferentialPair<T>` type to carry a primal and corresponding differential in a single type.
We also detail how auto-diff operators are type-checked (the higher-order function checking system), how the `no_diff` decoration can be used to avoid differentiation through attributed types, and the derivative data-flow analysis that warns the user about unintentionally stopping derivative propagation.
## `interface IDifferentiable`
Defined in core.meta.slang, `IDifferentiable` forms the basis for denoting differentiable types, both within the core module, and otherwise.
The definition of `IDifferentiable` is designed to encapsulate the following 4 items:
1. `Differential`: The type of the differential value of the conforming type. This allows custom data-structures to be defined to carry the differential values, which may be optimized for space instead of relying solely on compiler synthesis.
Since the computation of derivatives is inherently linear, we only need access to a few operations. These are:
2. `dadd(Differential, Differential) -> Differential`: Addition of two values of the differential type. Its implementation must be associative and commutative, or the resulting derivative code may be incorrect.
3. `dzero() -> Differential`: Additive identity (i.e. the zero or empty value) that can be used to initialize variables during gradient aggregation
4. `dmul<S:__BuiltinRealType>(S, Differential)`: Scalar multiplication of a real number with the differential type. Its implementation must be distributive over differential addition (`dadd`).
Points 2, 3 & 4 are derived from the concept of vector spaces. The derivative values of any Slang function always form a vector space (https://en.wikipedia.org/wiki/Vector_space).
### Derivative member associations
In certain scenarios, the compiler needs information on how the fields in the original type map to the differential type. Particularly, this is a problem when differentiating the implicit construction of a struct through braces (i.e. `{}`), represented by `kIROp_MakeStruct`. We provide the decorator `[DerivativeMember(DifferentialTypeName.fieldName)]` (ASTNode: DerivativeMemberAttribute, IR: kIROp_DerivativeMemberDecoration) to explicitly mark these associations.
Example
```C
struct MyType : IDifferentiable
{
typealias Differential = MyDiffType;
float a;
[DerivativeMember(MyDiffType.db)]
float b;
/* ... */
};
struct MyDiffType
{
float db;
};
```
### Automatic Synthesis of `IDifferentiable` Conformances for Aggregate Types
It can be tedious to expect users to hand-write the associated `Differential` type, the corresponding mappings and interface methods for every user-defined `struct` type. For aggregate types, these are trivial to construct by analysing which of their components conform to `IDifferentiable`.
The synthesis proceeds in roughly the following fashion:
1. `IDifferentiable`'s components are tagged with a special decorator `__builtin_requirement(unique_integer_id)` which carries an enum value from `BuiltinRequirementKind`.
2. When checking that types conform to their interfaces, if a user-provided definition does not satisfy a requirement with a built-in tag, we perform synthesis by dispatching to `trySynthesizeRequirementWitness`.
3. For _user-defined types_, Differential **types** are synthesized during conformance-checking through `trySynthesizeDifferentialAssociatedTypeRequirementWitness` by checking if each constituent type conforms to `IDifferentiable`, looking up the corresponding `Differential` type, and constructing a new aggregate type from these differential types. Note that since it is possible that a `Differential` type of a constituent member has not yet been synthesized, we have additional logic in the lookup system (`trySynthesizeRequirementWitness`) that synthesizes a temporary empty type with a `ToBeSynthesizedModifier`, so that the fields can be filled in later, when the member type undergoes conformance checking.
4. For _user-defined types_, Differential methods (`dadd`, `dzero` and `dmul`) are synthesized in `trySynthesizeDifferentialMethodRequirementWitness` by utilizing the `Differential` member and its `[DifferentialMember]` decorations to determine which fields need to be considered and the base type to use for each field. There are two synthesis patterns. The fully-inductive pattern is used for `dadd` and `dzero` which works by calling `dadd` and `dzero` respectively on the individual fields of the `Differential` type under consideration.
Example:
```C
// Synthesized from "struct T {FT1 field1; FT2 field2;}"
T.Differential dadd(T.Differential a, T.Differential b)
{
return Differential(
FT1.dadd(a.field1, b.field1),
FT2.dadd(a.field2, b.field2),
)
}
```
On the other hand, `dmul` uses the fixed-first arg pattern since the first argument is a common scalar, and proceeds inductively on all the other args.
Example:
```C
// Synthesized from "struct T {FT1 field1; FT2 field2;}"
T.Differential dmul<S:__BuiltinRealType>(S s, T.Differential a)
{
return Differential(
FT1<S>.dmul(s, a.field1),
FT2<S>.dmul(s, a.field2),
)
}
```
5. During auto-diff, the compiler can sometimes synthesize new aggregate types. The most common case is the intermediate context type (`kIROp_BackwardDerivativeIntermediateContextType`), which is lowered into a standard struct once the auto-diff pass is complete. It is important to synthesize the `IDifferentiable` conformance for such types since they may be further differentiated (through higher-order differentiation). This implementation is contained in `fillDifferentialTypeImplementationForStruct(...)` and is roughly analogous to the AST-side synthesis.
### Differentiable Type Dictionaries
During auto-diff, the IR passes frequently need to perform lookups to check if an `IRType` is differentiable, and retrieve references to the corresponding `IDifferentiable` methods. These lookups also need to work on generic parameters (that are defined inside generic containers), and existential types that are interface-typed parameters.
To accommodate this range of different type systems, Slang uses a type dictionary system that associates a dictionary of relevant types with each function. This works in the following way:
1. When `CheckTerm()` is called on an expression within a function that is marked differentiable (`[Differentiable]`), we check if the resolved type conforms to `IDifferentiable`. If so, we add this type to the dictionary along with the witness to its differentiability. The dictionary is currently located on `DifferentiableAttribute` that corresponds to the `[Differentiable]` modifier.
2. When lowering to IR, we create a `DifferentiableTypeDictionaryDecoration` which holds the IR versions of all the types in the dictionary as well as a reference to their `IDifferentiable` witness tables.
3. When synthesizing the derivative code, all the transcriber passes use `DifferentiableTypeConformanceContext::setFunc()` to load the type dictionary. `DifferentiableTypeConformanceContext` then provides convenience functions to lookup differentiable types, appropriate `IDifferentiable` methods, and construct appropriate `DifferentialPair<T>`s.
### Looking up Differential Info on _Generic_ types
Generically defined types are also lowered into the differentiable type dictionary, but rather than having a concrete witness table, the witness table is itself a parameter. When the auto-diff passes need to find the differential type or place a call to the `IDifferentiable` methods, this is turned into a lookup on the witness table parameter (i.e. `Lookup(<InterfaceRequirementKey>, <WitnessTableParameter>)`). Note that these lookup instructions are inserted into the generic parent container rather than the innermost function.
Example:
```C
T myFunc<T:IDifferentiable>(T a)
{
return a * a;
}
// Reverse-mode differentiated version
void bwd_myFunc<T:IDifferentiable>(
inout DifferentialPair<T> dpa,
T.Differential dOut) // T.Differential is Lookup('Differential', T_Witness_Table)
{
T.Differential da = T.dzero(); // T.dzero is Lookup('dzero', T_Witness_Table)
da = T.dadd(dpa.p * dOut, da); // T.dadd is Lookup('dadd', T_Witness_Table)
da = T.dadd(dpa.p * dOut, da);
dpa = diffPair(dpa.p, da);
}
```
### Looking up Differential Info on _Existential_ types
Existential types are interface-typed values, where there are multiple possible implementations at run-time. The existential type carries information about the concrete type at run-time and is effectively a 'tagged union' of all possible types.
#### Differential type of an Existential
The differential type of an existential type is tricky to define since our type system's only restriction on the `.Differential` type is that it also conforms to `IDifferentiable`. The differential type of any interface `IInterface : IDifferentiable` is therefore the interface type `IDifferentiable`. This is problematic since Slang generally requires a static `anyValueSize` that must be a strict upper bound on the sizes of all conforming types (since this size is used to allocate space for the union). Since `IDifferentiable` is defined in the core module `core.meta.slang` and can be used by the user, it is impossible to define a reliable bound.
We instead provide a new **any-value-size inference** pass (`slang-ir-any-value-inference.h`/`slang-ir-any-value-inference.cpp`) that assembles a list of types that conform to each interface in the final linked IR and determines a relevant upper bound. This allows us to ignore types that conform to `IDifferentiable` but aren't used in the final IR, and generate a tighter upper bound.
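Conceptually, the inference pass reduces to taking a maximum over the sizes of the types that actually conform to the interface in the final linked IR. A minimal C++ sketch, with all names hypothetical and sizes abstracted to plain integers:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of any-value-size inference: instead of requiring a
// user-declared bound, take the max over the sizes of the types that
// actually conform to the interface in the final linked IR.
size_t inferAnyValueSize(const std::vector<size_t>& conformingTypeSizes)
{
    size_t bound = 0;
    for (size_t s : conformingTypeSizes)
        bound = std::max(bound, s);
    return bound;
}
```

Because only types present in the linked IR contribute, unused conforming types no longer inflate the bound.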
**Future work:**
This approach, while functional, creates a locality problem since the size of `IDifferentiable` is the max of _all_ types that conform to `IDifferentiable` in visible modules, even though we only care about the subset of types that appear as `T.Differential` for `T : IInterface`. The reason for this problem is that upon performing an associated type lookup, the Slang IR drops all information about the base interface that the lookup starts from and only considers the constraint interface (in this case `Differential : IDifferentiable`).
There are several ways to resolve this issue, including (i) a static analysis pass that determines the possible set of types at each use location and propagates them to determine a narrower set of types, or (ii) generic (or 'parameterized') interfaces, such as `IDifferentiable<T>` where each version can have a different set of conforming types.
<!--#### IDifferentiable Method lookups on an Existential
All other method lookups are performed using existential-type lookups on the existential parameter. The idea is that existential-typed parameters come with a witness-table component that can be accessed by invoking `kIROp_ExtractExistentialWitnessTable` on them. This allows us to look up the `dadd`/`dzero` methods on this witness table in the same way as we did for generic types.-->
Example:
```C
interface IInterface : IDifferentiable
{
[Differentiable]
This foo(float val);
[Differentiable]
float bar();
};
float myFunc(IInterface obj, float a)
{
IInterface k = obj.foo(a);
return k.bar();
}
// Reverse-mode differentiated version (in pseudo-code corresponding to IR, some of these will get lowered further)
void bwd_myFunc(
inout DifferentialPair<IInterface> dpobj,
inout DifferentialPair<float> dpa,
float.Differential dOut) // T.Differential is Lookup('Differential', T_Witness_Table)
{
// Primal pass..
IInterface obj = dpobj.p;
IInterface k = obj.foo(a);
// .....
// Backward pass
DifferentialPair<IInterface> dpk = diffPair(k);
bwd_bar(dpk, dOut);
IDifferentiable dk = dpk.d; // Differential of `IInterface` is `IDifferentiable`
DifferentialPair<IInterface> dp = diffPair(dpobj.p);
bwd_foo(dpobj, dpa, dk);
}
```
#### Looking up `dadd()` and `dzero()` on Existential Types
There are two distinct cases for lookup on an existential type. The more common case is the closed-box existential type represented simply by an interface. Every value of this type contains a type identifier & a witness table identifier along with the value itself. The less common case is when the function calls are performed directly on the value after being cast to the concrete type.
**`dzero()` for "closed" Existential type: The `NullDifferential` Type**
For concrete and even generic types, we can initialize a derivative accumulator variable by calling the appropriate `Type.dzero()` method. This is unfortunately not possible when initializing an existential differential (which is currently of type `IDifferentiable`), since we must also initialize the type-id of this existential to one of the implementations, but we do not know which one yet since that is a run-time value that only becomes known after the first differential value is generated.
To get around this issue, we declare a special type called `NullDifferential` that acts as a "none type" for any `IDifferentiable` existential object.
**`dadd()` for "closed" Existential types: `__existential_dadd`**
We cannot directly use `dadd()` on two existential differentials of type `IDifferentiable` because we must handle the case where one of them is of type `NullDifferential` and `dadd()` is only defined for differentials of the same type.
We currently handle this by synthesizing a special method called `__existential_dadd` (`getOrCreateExistentialDAddMethod` in `slang-ir-autodiff.cpp`) that performs a run-time type-id check to see if one of the operands is of type `NullDifferential` and returns the other operand if so. If both are non-null, we dispatch to the appropriate `dadd` for the concrete type.
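The run-time check can be pictured with a toy tagged value. The C++ sketch below is illustrative only: the type-id encoding, the `NullDifferential` sentinel, and the one-float payload are assumptions, not Slang's actual representation.

```cpp
#include <cassert>

// Toy existential differential: a type id plus a one-float payload.
// Type id 0 stands in for NullDifferential.
struct AnyDiff { int typeId; float value; };
constexpr int kNullDifferentialId = 0;

AnyDiff existentialDAdd(AnyDiff a, AnyDiff b)
{
    if (a.typeId == kNullDifferentialId) return b;  // null + b == b
    if (b.typeId == kNullDifferentialId) return a;  // a + null == a
    // Both non-null: dispatch to the concrete type's dadd
    // (both operands share a type id at this point).
    return {a.typeId, a.value + b.value};
}
```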
**`dadd()` and `dzero()` for "open" Existential types**
If we are dealing with values of the concrete type (i.e. the opened value obtained through `ExtractExistentialValue(ExistentialParam)`), then we can perform lookups in the same way we do for generic types. All existential parameters come with a witness table. We insert instructions to extract this witness table and perform lookups accordingly. That is, for `dadd()`, we use `Lookup('dadd', ExtractExistentialWitnessTable(ExistentialParam))` and place a call to the result.
## `struct DifferentialPair<T:IDifferentiable>`
The second major component is `DifferentialPair<T:IDifferentiable>` that represents a pair of a primal value and its corresponding differential value.
The differential pair is primarily used for passing & receiving derivatives from the synthesized derivative methods, as well as for block parameters on the IR-side.
Both `fwd_diff(fn)` and `bwd_diff(fn)` act as function-to-function transformations, and so the Slang front-end translates the type of `fn` to its derivative version so the arguments can be type checked.
### Pair type lowering
The differential pair type is a special type throughout the AST and IR passes (AST Node: `DifferentialPairType`, IR: `kIROp_DifferentialPairType`) because of its use in front-end semantic checking and when synthesizing the derivative code for functions. Once the auto-diff passes are complete, the pair types are lowered into simple `struct`s so they can be easily emitted (`DiffPairLoweringPass` in `slang-ir-autodiff-pairs.cpp`).
We also define additional instructions for pair construction (`kIROp_MakeDifferentialPair`) and extraction (`kIROp_DifferentialPairGetDifferential` & `kIROp_DifferentialPairGetPrimal`) which are lowered into struct construction and field accessors, respectively.
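A rough C++ picture of what the lowering produces: a plain two-field struct, with the pair-construction and accessor instructions becoming ordinary construction and field reads. All names here are illustrative stand-ins for the lowered IR, not actual Slang output.

```cpp
#include <cassert>

// Illustrative lowered form of DifferentialPair<T>: a plain struct.
template <typename T, typename D>
struct LoweredDiffPair
{
    T primal;
    D differential;
};

// kIROp_MakeDifferentialPair lowers to struct construction...
template <typename T, typename D>
LoweredDiffPair<T, D> makeDifferentialPair(T p, D d) { return {p, d}; }

// ...and the two extraction ops lower to field accesses.
template <typename T, typename D>
T getPrimal(const LoweredDiffPair<T, D>& p) { return p.primal; }

template <typename T, typename D>
D getDifferential(const LoweredDiffPair<T, D>& p) { return p.differential; }
```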
### "User-code" Differential Pairs
Just as we use special IR codes for differential pairs because they have special handling in the IR passes, sometimes differential pairs should be _treated as_ regular struct types during the auto-diff passes.
This happens primarily during higher-order differentiation when the user wishes to differentiate the same code multiple times.
Slang's auto-diff approaches this by rewriting all the relevant differential pairs into 'irrelevant' differential pairs (`kIROp_DifferentialPairUserCode`) and 'irrelevant' accessors (`kIROp_DifferentialPairGetDifferentialUserCode`, `kIROp_DifferentialPairGetPrimalUserCode`) at the end of **each auto-diff iteration** so that the next iteration treats these as regular differentiable types.
The user-code versions are also lowered into `struct`s in the same way.
## Type Checking of Auto-Diff Calls (and other _higher-order_ functions)
Since `fwd_diff` and `bwd_diff` are represented as higher order functions that take a function as an input and return the derivative function, the front-end semantic checking needs some notion of higher-order functions to be able to check and lower the calls into appropriate IR.
### Higher-order Invocation Base: `HigherOrderInvokeExpr`
All higher order transformations derive from `HigherOrderInvokeExpr`. For auto-diff there are two possible expression classes `ForwardDifferentiateExpr` and `BackwardDifferentiateExpr`, both of which derive from this parent expression.
### Higher-order Function Call Checking: `HigherOrderInvokeExprCheckingActions`
Resolving the concrete method is not a trivial issue in Slang, given its support for overloading, type coercion and more. This becomes more complex with the presence of a function transformation in the chain.
For example, if we have `fwd_diff(f)(DiffPair<float>(...), DiffPair<double>(...))`, we would need to find the correct match for `f` based on its post-transform argument types.
To facilitate this we use the following workflow:
1. The `HigherOrderInvokeExprCheckingActions` base class provides a mechanism for different higher-order expressions to implement their type translation (i.e. what is the type of the transformed function).
2. The checking mechanism passes all detected overloads for `f` through the type translation and assembles a new group out of the results (the new functions are 'temporary').
3. This new group is used by `ResolveInvoke` when performing overload resolution and type coercion using the user-provided argument list.
4. The resolved signature (if there is one) is then replaced with the corresponding function reference and wrapped in the appropriate higher-order invoke.
**Example:**
Let's say we have two functions with the same name `f`: (`int -> float`, `double, double -> float`)
and we want to resolve `fwd_diff(f)(DiffPair<float>(1.0, 0.0), DiffPair<float>(0.0, 1.0))`.
The higher-order checking actions will synthesize the 'temporary' group of translated signatures (`int -> DiffPair<float>`, `DiffPair<double>, DiffPair<double> -> DiffPair<float>`).
Invoke resolution will then narrow this down to a single match (`DiffPair<double>, DiffPair<double> -> DiffPair<float>`) by automatically casting the `float`s to `double`s. Once the resolution is complete,
we return `InvokeExpr(ForwardDifferentiateExpr(f : double, double -> float), casted_args)` by wrapping the resolved function in the appropriate higher-order expression.
## Attributed Types (`no_diff` parameters)
Often, it will be necessary to prevent gradients from propagating through certain parameters, for correctness reasons. For example, values representing random samples are often not differentiated since the result may be mathematically incorrect.
Slang provides the `no_diff` operator to mark parameters as non-differentiable, even if their type conforms to `IDifferentiable`:
```C
float myFunc(float a, no_diff float b)
{
return a * b;
}
// Resulting fwd-mode derivative:
DiffPair<float> myFunc(DiffPair<float> dpa, float b)
{
return diffPair(dpa.p * b, dpa.d * b);
}
```
Slang uses _OpAttributedType_ to denote the IR type of such parameters. For example, the lowered type of `b` in the above example is `OpAttributedType(OpFloat, OpNoDiffAttr)`. In the front-end, this is represented through the `ModifiedType` AST node.
Sometimes, this additional layer can get in the way of things like type equality checks and other mechanisms where the `no_diff` is irrelevant. Thus, we provide the `unwrapAttributedType` helper to remove attributed type layers for such cases.
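The unwrapping helper amounts to peeling attribute layers until the underlying type is reached. A minimal C++ sketch, with a deliberately simplified node representation (the real `unwrapAttributedType` operates on Slang's IR/AST types):

```cpp
#include <cassert>

// Toy type node: either a base type or an attributed wrapper
// (e.g. the no_diff attribute layer) around another type.
struct Type
{
    bool isAttributed;
    Type* baseType;  // non-null only when isAttributed is true
};

// Peel attribute layers so that, e.g., type-equality checks that do not
// care about no_diff can compare the underlying types directly.
Type* unwrapAttributedType(Type* type)
{
    while (type && type->isAttributed)
        type = type->baseType;
    return type;
}
```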
## Derivative Data-Flow Analysis
Slang has a derivative data-flow analysis pass that is performed on a per-function basis immediately after lowering to IR and before the linking step (`slang-ir-check-differentiability.h`/`slang-ir-check-differentiability.cpp`).
The job of this pass is to enforce that instructions of a differentiable type actually propagate derivatives unless the user explicitly drops them through `detach()` or `no_diff`. The reason is that Slang only propagates derivatives through functions decorated with `[Differentiable]`; any other function is considered non-differentiable and effectively produces a zero derivative. This can lead to frustrating situations where a call the user expects to be differentiated silently drops derivatives instead. Example:
```C
float nonDiffFunc(float x)
{
/* ... */
}
float differentiableFunc(float x) // Forgot to annotate with [Differentiable]
{
/* ... */
}
float main(float x)
{
// User doesn't realise that the function that is supposed to be differentiable is not
// getting differentiated, because the types here are all 'float'.
//
return nonDiffFunc(x) * differentiableFunc(x);
}
```
The data-flow analysis step enforces that non-differentiable functions used in a differentiable context should get their derivative dropped explicitly. That way, it is clear to the user whether a call is getting differentiated or dropped.
Same example with `no_diff` enforcement:
```C
float nonDiffFunc(float x)
{
/* ... */
}
[Differentiable]
float differentiableFunc(float x)
{
/* ... */
}
float main(float x)
{
return no_diff(nonDiffFunc(x)) * differentiableFunc(x);
}
```
A `no_diff` can only be used directly on a function call, and turns into a `TreatAsDifferentiableDecoration` that indicates that the function will not produce a derivative.
The derivative data-flow analysis pass works similarly to a standard data-flow pass:
1. We start by assembling a set of instructions that 'produce' derivatives, starting with the parameters of differentiable types (and without an explicit `no_diff`) and propagating through each instruction in the block. An inst carries a derivative if one of its operands carries a derivative and its result type is differentiable.
2. We then assemble a set of instructions that expect a derivative. These are differentiable operands of differentiable functions (unless they have been marked by `no_diff`). We then reverse-propagate this set by adding in all differentiable operands (and repeating this process).
3. During this reverse-propagation, if there is any `OpCall` in the 'expect' set that is not also in the 'produce' set, then we have a situation where the gradient hasn't been explicitly dropped, and we create a user diagnostic.
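Step 3 reduces to a simple set comparison. The snippet below is a toy C++ model of that final check, assuming instructions are plain integer ids and the propagation of the two sets has already been performed; all names are illustrative:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Any call that expects a derivative but does not produce one has a
// silently-dropped gradient and should be reported to the user.
std::vector<int> findUndiagnosedCalls(
    const std::set<int>& expectSet,   // insts expecting a derivative
    const std::set<int>& produceSet,  // insts producing a derivative
    const std::set<int>& callInsts)   // the subset that are OpCalls
{
    std::vector<int> diagnostics;
    for (int inst : callInsts)
        if (expectSet.count(inst) && !produceSet.count(inst))
            diagnostics.push_back(inst);
    return diagnostics;
}
```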
# Design Document: Slang IR Module Backwards Compatibility
## Overview
This document describes the design and implementation of backwards compatibility support for serialized Slang IR modules. The feature enables Slang to load IR modules compiled with different versions of the compiler, providing version information and graceful handling of incompatible modules.
## Motivation
As Slang evolves, the intermediate representation (IR) may change with new instructions being added or existing ones being modified. Without backwards compatibility:
- Users cannot load modules compiled with older versions of Slang
- There's no way to detect version mismatches between modules
- Module compatibility issues are opaque to users
This feature addresses these issues by introducing versioning and stable instruction naming.
## User-Facing Changes
### New Command Line Options
1. **`-get-module-info <module-file>`**
- Prints information about a serialized IR module without loading it
- Output includes:
- Module name
- Module version
- Compiler version that created the module
- Example usage: `slangc -get-module-info mymodule.slang-module`
2. **`-get-supported-module-versions`**
- Prints the range of module versions this compiler supports
- Output includes minimum and maximum supported versions
- Example usage: `slangc -get-supported-module-versions`
### API Changes
New method in `ISession` interface:
```cpp
SlangResult loadModuleInfoFromIRBlob(
slang::IBlob* source,
SlangInt& outModuleVersion,
const char*& outModuleCompilerVersion,
const char*& outModuleName);
```
This allows programmatic inspection of module metadata without full deserialization.
## Technical Design
### Stable Instruction Names
The core mechanism for backwards compatibility is the introduction of stable names for IR instructions:
1. **Stable Name Table** (`slang-ir-insts-stable-names.lua`)
- Maps instruction names to unique integer IDs
- IDs are permanent once assigned
- New instructions get new IDs, never reusing old ones
2. **Runtime Mapping**
- `getOpcodeStableName(IROp)`: Convert runtime opcode to stable ID
- `getStableNameOpcode(UInt)`: Convert stable ID back to runtime opcode
- Unknown stable IDs map to `kIROp_Unrecognized`
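The runtime mapping behaves like a pair of lookup tables with a sentinel for unknown entries. The C++ sketch below is a toy model of this behavior; the table contents, opcode values, and the `kIROp_Unrecognized` sentinel are illustrative, not Slang's actual values:

```cpp
#include <cassert>
#include <unordered_map>

// Illustrative opcodes; runtime values may change between compiler versions.
enum IROp { kIROp_Unrecognized = -1, kIROp_Add = 10, kIROp_Mul = 11 };

// Stable IDs are permanent once assigned, so an old module's IDs keep
// meaning the same instruction even if runtime opcodes are renumbered.
const std::unordered_map<int, IROp> kStableToOpcode = {
    {1, kIROp_Add},
    {2, kIROp_Mul},
};

IROp getStableNameOpcode(int stableId)
{
    auto it = kStableToOpcode.find(stableId);
    // Unknown stable IDs (e.g. emitted by a newer compiler) map to the
    // Unrecognized sentinel instead of a bogus opcode.
    return it == kStableToOpcode.end() ? kIROp_Unrecognized : it->second;
}
```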
### Module Versioning
Two types of versions are tracked:
1. **Module Version** (`IRModule::m_version`)
- Semantic version of the IR instruction set
- Range: `k_minSupportedModuleVersion` to `k_maxSupportedModuleVersion`
- Stored in each serialized module
2. **Serialization Version** (`IRModuleInfo::serializationVersion`)
- Version of the serialization format itself
- Currently version 0
- Allows future changes to serialization structure
### Compiler Version Tracking
Each module stores the exact compiler version (`SLANG_TAG_VERSION`) that created it. This enables version-specific workarounds if needed in the future.
### Validation System
A GitHub Actions workflow (`check-ir-stable-names.yml`) ensures consistency:
1. **Check Mode**: Validates that:
- All IR instructions have stable names
- No duplicate stable IDs exist
- The stable name table is a bijection with current instructions
2. **Update Mode**: Automatically assigns stable IDs to new instructions
The validation is implemented in `check-ir-stable-names.lua` which:
- Loads instruction definitions from `slang-ir-insts.lua`
- Compares against `slang-ir-insts-stable-names.lua`
- Reports missing entries or inconsistencies
## Breaking Changes and Version Management
### When to Update Module Version
The module version must be updated when:
1. **Adding Instructions** (Minor Version Bump)
- Increment `k_maxSupportedModuleVersion`
- Older compilers can still load modules that don't use new instructions
2. **Removing Instructions** (Major Version Bump)
- Increment `k_maxSupportedModuleVersion`
- Update `k_minSupportedModuleVersion` to exclude versions with removed instructions
- This breaks compatibility with older modules using removed instructions
3. **Changing Instruction Semantics**
- Even if the instruction name remains the same
- Requires version bump to prevent incorrect behavior
- To avoid bumping the minimum supported version, one may instead introduce
a new instruction and just bump `k_maxSupportedModuleVersion`
### Serialization Format Changes
Changes to how data is serialized (not what data) require updating `serializationVersion`:
- Changes to the RIFF container structure
- Different encoding for instruction payloads
- Reordering of serialized data
## Implementation Details
### Module Loading Flow
1. **Version Check**
```cpp
if (fossilizedModuleInfo->serializationVersion != IRModuleInfo::kSupportedSerializationVersion)
return SLANG_FAIL;
```
2. **Instruction Deserialization**
- Stable IDs are converted to runtime opcodes
- Unknown IDs become `kIROp_Unrecognized`
3. **Validation Pass**
- After deserialization, check for any `kIROp_Unrecognized` instructions
- Fail loading if any are found
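The three steps above can be condensed into one function. This is a toy C++ sketch of the control flow only; the constants and the integer encoding of opcodes are assumptions, not the real serializer:

```cpp
#include <cassert>
#include <vector>

// Illustrative constants standing in for the real ones.
constexpr int kSupportedSerializationVersion = 0;
constexpr int kUnrecognized = -1;

// Returns true if the module loads successfully.
bool loadModule(int serializationVersion, const std::vector<int>& decodedOpcodes)
{
    // 1. Version check: incompatible formats fail immediately.
    if (serializationVersion != kSupportedSerializationVersion)
        return false;
    // 2. (Deserialization already mapped stable IDs to opcodes, with
    //    unknown IDs becoming kUnrecognized.)
    // 3. Validation pass: any unrecognized instruction fails the load.
    for (int op : decodedOpcodes)
        if (op == kUnrecognized)
            return false;
    return true;
}
```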
### Error Handling
- Incompatible serialization versions: Immediate failure
- Unknown instructions: Mark as unrecognized, fail after full deserialization
(this should be caught by the next check)
- Module version out of range: Fail after deserialization
## Future Considerations
### Potential Enhancements
1. **Graceful Degradation**
- Skip unrecognized instructions if they're not critical
- Provide compatibility shims for removed instructions
2. **Module Migration Tools**
- Utility to upgrade old modules to new formats
- Batch processing for large codebases
### Maintenance Guidelines
1. **Regular CI Validation**
- The GitHub Action ensures stable names stay synchronized
- Catches missing entries before merge
2. **Version Documentation**
- Maintain changelog of what changed in each module version
- Document any version-specific workarounds
3. **Testing**
- Test loading of modules from previous versions
- Verify error messages for incompatible modules
## Conclusion
This backwards compatibility system provides a robust foundation for Slang IR evolution while maintaining compatibility where possible. The combination of stable instruction naming, comprehensive versioning, and automated validation ensures that:
- Users can reliably use modules across Slang versions
- Developers can evolve the IR with clear compatibility boundaries
- Version mismatches are detected and reported clearly
The system is designed to be maintainable and extensible, with clear guidelines for when and how to make breaking changes.
Capabilities (Out of Date)
============
Slang aims to be a portable language for shader programming, which introduces two complementary problems:
1. We need a way to indicate that certain constructs (types, functions, etc.) are only allowed on certain targets, so that a user gets a meaningful error if they try to do something that won't work on one or more of the APIs or platforms they want to target. Similarly, the user expects to get an error if they call a fragment-shader-specific function inside of, say, compute shader code, or vice versa.
2. If the same feature can be implemented across multiple platforms, but the best (or only) implementation path differs across platforms, then we need a way to express the platform specific code and pick the right implementation per-target.
Item (2) is traditionally handled with preprocessor techniques (e.g., `#ifdef`ing the body of a function based on target platform), but that of course requires that the user invoke the Slang front end once for each target platform, and target-specific coding in a library will then "infect" code that uses that library, forcing them to invoke the front-end once per target as well.
We are especially sensitive to this problem in the compiler itself, because we have to author and maintain the Slang standard modules, which need to (1) expose the capabilities of many platforms and (2) work across all those platforms. It would be very unfortunate if we had to build different copies of our standard modules per-target.
The intention in Slang is to solve both of these problems with a system of *capabilities*.
What is a capability?
---------------------
For our purposes a capability is a discrete feature that a compilation target either does or does not support.
We could imagine defining a capability for the presence of texture sampling operations with implicit gradients; this capability would be supported when generating fragment shader kernel code, but not when generating code for other stages.
Let's imagine a language syntax that the standard modules could use to define some *atomic* capabilities:
```
capability implicit_gradient_texture_fetches;
```
We can then imagine using attributes to indicate that a function requires a certain capability:
```
struct Texture2D
{
...
// Implicit-gradient sampling operation.
[availableFor(implicit_gradient_texture_fetches)]
float4 Sample(SamplerState s, float2 uv);
}
```
(Note that the `[availableFor(...)]` syntax is just a straw-man to write up examples, and a better name would be desirable if/when we implement this stuff.)
Given those declarations, we could then check when compiling code if the user is trying to call `Texture2D.Sample` in code compiled for a target that *doesn't* support implicit-gradient texture fetches, and issue an appropriate error.
The details on how to sequence this all in the compiler will be covered later.
Derived Capabilities
--------------------
Once we can define atomic capabilities, the next step is to be able to define *derived* capabilities.
Let's imagine that we extend our `capability` syntax so that we can define a new capability that automatically implies one or more other capabilities:
```
capability fragment : implicit_gradient_texture_fetches;
```
Here we've said that whenever the `fragment` capability is available, we can safely assume that the `implicit_gradient_texture_fetches` capability is available (but not vice versa).
Given even a rudimentary tool like that, we can start to build up capabilities that relate closely to the "profiles" in things like D3D:
```
capability d3d;
capability sm_5_0 : d3d;
capability sm_5_1 : sm_5_0;
capability sm_6_0 : sm_5_1;
...
capability d3d11 : d3d, sm_5_0;
capability d3d12 : d3d, sm_6_0;
capability khronos;
capability glsl_400 : khronos;
capability glsl_410 : glsl_400;
...
capability vulkan : khronos, glsl_450;
capability opengl : khronos;
```
Here we are saying that `sm_5_1` supports everything `sm_5_0` supports, and potentially more. We are saying that `d3d12` supports `sm_6_0` but maybe not, e.g., `sm_6_3`.
We are expressing the fact that having a `glsl_*` capability means you are on some Khronos API target, but it doesn't specify which one.
(The exact details of these declarations obviously aren't the point; getting a good hierarchy of capabilities will take time.)
Capability Composition
----------------------
Sometimes we'll want to give a distinct name to a specific combination of capabilities, but not say that it supports anything new:
```
capability ps_5_1 = sm_5_1 & fragment;
```
Here we are saying that the `ps_5_1` capability is *equivalent* to the combination of `sm_5_1` and `fragment` (that is, if you support both `sm_5_1` and `fragment` then you support `ps_5_1` and vice versa).
Compositions should be allowed in `[availableFor(...)]` attributes (e.g., `[availableFor(vulkan & glsl_450)]`), but pre-defined compositions should be favored when possible.
When composing things with `&` it is safe for the compiler to filter out redundancies based on what it knows so that, e.g., `ps_5_0 & fragment` resolves to just `ps_5_0`.
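That redundancy filtering can be modeled as: a conjunct is redundant if some other conjunct transitively implies it. The C++ sketch below encodes the hierarchy as a map from capability to directly-implied capabilities; the encoding and names are illustrative, not the compiler's representation:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using Caps = std::set<std::string>;
// capability -> capabilities it directly implies
using Hierarchy = std::map<std::string, Caps>;

// Does `from` transitively imply `to` under the hierarchy?
bool implies(const Hierarchy& h, const std::string& from, const std::string& to)
{
    if (from == to) return true;
    auto it = h.find(from);
    if (it == h.end()) return false;
    for (const auto& base : it->second)
        if (implies(h, base, to)) return true;
    return false;
}

// Drop conjuncts already implied by another conjunct,
// e.g. ps_5_0 & fragment -> ps_5_0.
Caps simplifyConjunction(const Hierarchy& h, const Caps& conjuncts)
{
    Caps result;
    for (const auto& c : conjuncts)
    {
        bool redundant = false;
        for (const auto& other : conjuncts)
            if (other != c && implies(h, other, c))
                redundant = true;
        if (!redundant) result.insert(c);
    }
    return result;
}
```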
Once we have an `&` operator for capabilities, it is easy to see that "derived" capabilities are really syntax sugar, so that a derived capability like:
```
capability A : B, C
```
could have been written instead as:
```
capability A_atomic
capability A = A_atomic & B & C
```
Where the `A_atomic` capability guarantees that `A` implies `B` and `C` but not vice versa.
It is also useful to think of an `|` operator on capabilities.
In particular if a function has multiple `[availableFor(...)]` attributes:
```
[availableFor(vulkan & fragment)]
[availableFor(d3d12 & fragment)]
void myFunc();
```
This function should be equivalent to one with just a single `[availableFor((vulkan & fragment) | (d3d12 & fragment))]` which is equivalent to `[availableFor((vulkan | d3d12) & fragment)]`.
Simplification should generally push toward "disjunctive normal form," though, rather than pursue simplifications like that.
Note that we do *not* include negation, so that capabilities are not general Boolean expressions.
Validation
----------
For a given function definition `F`, the front end will scan its body and see what it calls, and compose the capabilities required by the called functions using `&` (simplifying along the way). Call the resulting capability (in disjunctive normal form) `R`.
If `F` doesn't have an `[availableFor(...)]` attribute, then we can derive its *effective* `[availableFor(...)]` capability as `R` (this probably needs to be expressed as an iterative dataflow problem over the call graph, to handle cycles).
If `F` *does* have one or more `[availableFor(...)]` clauses that amount to a declared capability `C` (again in disjunctive normal form), then we can check that `C` implies `R` and error out if it is not the case.
A reasonable implementation would track which calls introduced which requirements, and be able to explain *why* `C` does not capture the stated requirements.
For a shader entry point, we should check it as if it had an `[availableFor(...)]` that is the OR of all the specified target profiles (e.g., `sm_5_0 | glsl_450 | ...`) ANDed with the specified stage (e.g., `fragment`).
Any error here should be reported to the user.
If an entry point has an explicit `[availableFor(...)]` then we should AND that onto the profile computed above, so that the user can restrict certain entry points to certain profiles.
In order to support separate compilation, the functions that are exported from a module should probably either have explicit availability attributes, or else they will be compiled against a kind of "default capability" used for the whole module.
Downstream code that consumes such a module would see declarations with explicit capabilities only.
Picking an appropriate "default capability" to use when compiling modules is an important challenge; it would in practice define the "min spec" to use when compiling.
Capability Overriding
---------------------
It should be possible to define multiple versions of a function, having different `[availableFor(...)]` attributes:
```
[availableFor(vulkan)] void myFunc() { ... }
[availableFor(d3d12)] void myFunc() { ... }
```
For front-end checking, these should be treated as if they were a single definition of `myFunc` with an ORed capability (e.g., `vulkan | d3d12`).
Overload resolution will pick the "best" candidate at a call site based *only* on the signatures of the function (note that this differs greatly from how profile-specific function overloading works in Cg).
The front-end will then generate initial IR code for each definition of `myFunc`.
Each of the IR functions will have the *same* mangled name, but different bodies, and each will have appropriate IR decorations to indicate the capabilities it requires.
The choice of which definition to use is then put off until IR linking for a particular target.
At that point we can look at all the IR functions matching a given mangled name, filter them according to the capabilities of the target, and then select the "best" one.
In general a definition `A` of an IR symbol is better than another definition `B` if the capabilities on `A` imply those on `B` but not vice versa.
(In practice this probably needs to be "the capabilities on `A` intersected with those of the target," and similarly for `B`)
This approach allows us to defer profile-based choices of functions to very late in the process. The one big "gotcha" to be aware of is when functions are overloaded based on pipeline stage, where we would then have to be careful when generating DXIL or SPIR-V modules with multiple entry points (as a single function `f` might need to be specialized twice if it calls a stage-overloaded function `g`).
Capabilities in Other Places
----------------------------
So far I've talked about capabilities on functions, but they should also be allowed on other declarations including:
- Types, to indicate that code using that type needs the given capability
- Interface conformances, to indicate that a type only conforms to the interface when the capabilities are available
- Struct fields, to indicate that the field is only present in the type when the capabilities are present
- Extension declarations, to indicate that everything in them requires the specified capabilities
We should also provide a way to specify that a `register` or other layout modifier is only applicable for specific targets/stages. Such a capability nominally exists in HLSL today, but it would be much more useful if it could be applied to specify target-API-specific bindings.
Only functions should support overloading based on capability. In all other cases there can only be one definition of an entity, and capabilities just decide when it is available.
API Extensions as Capabilities
------------------------------
One clear use case for capabilities is to represent optional extensions, including cases where a feature is "built-in" in D3D but requires an extension in Vulkan:
```
capability KHR_secret_sauce : vulkan;
[availableFor(sm_7_0)] // always available for D3D Shader Model 7.0
[availableFor(KHR_secret_sauce)] // Need the "secret sauce" extension for Vulkan
void improveShadows();
```
When generating code for Vulkan, we should be able to tell the user that the `improveShadows()` function requires the given extension. The user should be able to express compositions of capabilities in their `-profile` option (and similarly for the API):
```
slangc code.slang -profile vulkan+KHR_secret_sauce
```
(Note that for the command line, it is beneficial to use `+` instead of `&` to avoid conflicts with shell interpreters)
An important question is whether the compiler should automatically infer required extensions without them being specified, so that it produces SPIR-V that requires extensions the user didn't ask for.
The argument against such inference is that users should opt in to non-standard capabilities they are using, but it would be unfortunate if this in turn requires verbose command lines when invoking the compiler.
It should be possible to indicate the capabilities that a module or entry point should be compiled to use without command-line complications.
(A related challenge is when a capability can be provided by two different extensions: how should the compiler select the "right" one to use?)
Disjoint Capabilities
---------------------
Certain compositions of capabilities make no sense. If a user declared a function as needing `vulkan & d3d12` they should probably get an error message.
Knowing that certain capabilities are disjoint can also help improve the overall user experience.
If a function requires `(vulkan & extensionA) | (d3d12 & featureB)` and we know we are compiling for `vulkan` we should be able to give the user a pointed error message saying they need to ask for `extensionA`, because adding `featureB` isn't going to do any good.
As a first-pass model we could have a notion of `abstract` capabilities that are used to model the root of hierarchies of disjoint capabilities:
```
abstract capability api;
abstract capability d3d : api;
capability d3d11 : d3d;
capability d3d12 : d3d;
abstract capability khronos : api;
capability vulkan : khronos;
capability opengl : khronos;
```
As a straw man: we could have a rule that to decide if non-abstract capabilities `A` and `B` are disjoint, we look for their common ancestor in the tree of capabilities.
If the common ancestor is abstract, they are disjoint; if it is not, they are not disjoint.
We'd also know that if the user tries to compile for a profile that includes an abstract capability but *not* some concrete capability derived from it, then that is an error (we can't generate code for just `d3d`).
The above is an over-simplification because we don't have a *tree* of capabilities, but a full *graph*, so we'd need an approach that works for the full case.
Interaction with Generics/Interfaces
------------------------------------
It should be possible for an interface requirement to have a capability requirement attached to it.
This would mean that users of the interface can only use the method/type/whatever when the capability is present (just like for any other function):
```
interface ITexture
{
float4 sampleLevel(float2 uv, float lod);
[availableFor(fragment)]
float4 sample(float2 uv); // can only call this from fragment code
}
```
When implementing an interface, any capability constraints we put on a member that satisfies an interface requirement would need to guarantee that either:
- the capabilities on our method are implied by those on the requirement (we don't require more), or
- the capabilities on the method are implied by those on the type itself, or its conformance to the interface (you can't use the conformance without the capabilities), or
- the capabilities are already implied by those the whole module is being compiled for
In each case, you need to be sure that `YourType` can't be passed as a generic argument to some function that uses just the `ITexture` interface above and have them call a method on your type from a profile that doesn't have the required capabilities.
Interaction with Heterogeneity
------------------------------
If Slang eventually supports generating CPU code as well as shaders, it should use capabilities to handle the CPU/GPU split similar to how they can be used to separate out vertex- and fragment-shader functionality.
Something like a `cpu` profile that works as a catch-all for typical host CPU capabilities would be nice, and could be used as a convenient way to mark "host" functions in a file that is otherwise compiled for a "default profile" that assumes GPU capabilities.
Conclusion
----------
Overall, the hope is that in many cases developers will be able to use capability-based partitioning and overloading of APIs to build code that only has to pass through the Slang front-end once, but that can then go through back-end code generation for each target.
In cases where this can't be achieved, the way that capability-based overloading is built into the Slang IR design means that we should be able to merge multiple target-specific definitions into one IR module, so that a module can employ target-specific specializations while still presenting a single API to consumers.
Casting in the Slang Compiler
=============================
The following discussion is about casting within the C++ implementation of the Slang compiler.
C++'s built-in mechanisms for casting (principally 'dynamic_cast') are problematic within the Slang compiler codebase. Code using 'dynamic_cast' requires that RTTI information be available, and that any type it is used on has a vtbl (i.e., at least one virtual member). Some problems with this...
* There are types which we want to 'dynamic_cast' that do not have, and we do not want to have a Vtbl (for example Slang::IRInst).
* There are types for which a 'dynamic_cast' doesn't do quite what we want (for example, casting on Type* derived types typically wants to work on their canonical type)
* We may want to replace use of dynamic_cast in the future for speed/space or other reasons
* It is common in the code base when using a 'smart pointer' type to cast it, but still return a smart pointer
To deal with these issues we need casting within Slang to follow its own methodology. In summary it is as follows...
* Use 'as' free function to do a typical 'dynamic like' cast.
* 'as' doesn't guarantee the returned pointer points to the same object.
* For example with Type* it *actually* does the cast on the canonical type which is often a different object.
* If you want to *literally* do a dynamic cast use 'dynamicCast' free function.
* This guarantees the returned pointer points to the same object (like normal dynamic_cast)
* If you want to return a smart pointer from a cast from a smart pointer use the .as or .dynamicCast *methods*
* If you want to determine if an 'as' cast is possible on a smart pointer use the .is method
* Doing so will produce more efficient code because a new smart pointer does not need to be constructed
These functions will also work with types that do not have Vtbl - like IRInst derived types.
Both 'as' and 'dynamicCast' handle the case if the pointer is a nullptr, by returning a nullptr. If the cast succeeds the cast pointer is returned otherwise nullptr is returned. If a cast is performed with a free function it always returns a raw pointer.
So why have 'as' and 'dynamicCast' - they seem sort of similar? The primary difference is dynamicCast *must* always return a pointer to the same object, whilst 'as' *can* return a pointer to a different object if that is the desired 'normal' casting behavior for the type. This is the case for Type* when using 'as' it may return a different object - the 'canonical type' for the Type*. For a concrete example take 'NamedExpressionType', its canonical type is the type the name relates to. If you use 'as' on it - it will produce a pointer to a different object, an object that will not be castable back into a NamedExpressionType.
Also keep in mind that 'as' behavior is based on the pointer type being cast from. For any pointer to a type derived from Type it will cast the canonical type. **BUT** if the pointer is pointing to a Type derived *object*, but the pointer type is *not* derived from Type (like say RefObject*), then 'as' will behave like dynamicCast.
All this being said 'as' in usage is seen as the 'default' way to do a 'dynamic like' cast with these special behaviour appropriate for the type when necessary.
By having the free function and method versions of 'as' and 'dynamicCast', you can choose if you want a 'raw' or 'smart' pointer type returned from the cast. If you just want to test if something is a certain type, then using as/dynamicCast free functions is the faster way to do it. If you *know* that a raw pointer is ok, because the object will remain in scope, then again using the free function is better because it does less work. But as the examples following show, care is needed because if you get it wrong the object might go out of scope and leave the raw pointer pointing to a deleted object. When in doubt the safe choice is to typically use .as (or .dynamicCast if appropriate) methods.
Following example shows the different types of casting...
```C++
void someFunction(Decl* decl, Type* type)
{
RefPtr<Decl> declRefPtr(decl);
RefPtr<Type> typeRefPtr(type);
// Use of as
{
// Casting with as on a free function returns a raw pointer
GenericDecl* genericDeclRaw0 = as<GenericDecl>(decl);
// Free function again returns a raw pointer
GenericDecl* genericDeclRaw1 = as<GenericDecl>(declRefPtr);
// Using the as *method* returns a smart pointer holding the cast result
RefPtr<GenericDecl> genericDeclRefPtr0 = declRefPtr.as<GenericDecl>();
// Of course you can use auto with either
auto genericDeclRefPtr1 = declRefPtr.as<GenericDecl>();
auto genericDeclRaw2 = as<GenericDecl>(declRefPtr);
}
// Currently using as on anything not cast *from* Type is the same as dynamicCast.
// But on Type* sometimes you may want to control the cast
{
// With a NamedExpressionType sometimes you don't want 'as' behaviour - say we want to see the information about the name,
// not the thing it relates to (the canonical type)
NamedExpressionType* namedExpressionRawPtr = dynamicCast<NamedExpressionType>(type);
// Returns the smart pointer
auto namedExpressionRefPtr = typeRefPtr.as<NamedExpressionType>();
}
}
```
It is important to be aware of what style of cast you use where. Take for example the following function ...
```C++
RefPtr<Expr> substitute(RefPtr<Expr> expr) const
{
return DeclRefBase::Substitute(expr);
}
```
If you want to do a cast on it, you need to be careful especially about scope, for example...
```C++
RefPtr<Expr> expr = ...;
{
// Whoops! This is a problem. When using the free function, the cast is to a *raw* pointer, so obj
// receives a raw pointer. When the RefPtr returned from Substitute goes out of scope (when the statement is left)
// the ref will be removed and if the ref count was 1 destroyed. Now obj points to a freed object and so a crash is
// likely to follow in the future!
auto obj = as<RefObject>(substitute(expr));
}
// So how do we avoid this? Well it depends what the function is returning and the scope. If it's returning a smart pointer,
// you could use the .as method
{
// This can only compile if it is a smart pointer (raw pointers don't have an as method)
auto obj = substitute(expr).as<RefObject>();
}
// Another option is to put the created thing in a smart pointer so you know it's in scope
{
RefPtr<Expr> sub = substitute(expr);
// Ok as long as sub is in scope
auto obj = as<RefObject>(sub);
}
// More awkwardly you could use free function, but assign to a smart pointer, thus maintaining scope
{
RefPtr<RefObject> obj = as<RefObject>(substitute(expr));
}
```
The following code shows that the behavior of 'as' is based on the source *pointer* type **NOT** the *object* type...
```C++
// Derives from Type
NamedExpressionType* exprType = ...;
// Will be the Type* of the *canonical* type, because the pointer is Type derived and we are using as!
Type* type0 = as<Type>(exprType);
// It's going to be pointing to a different object, because type0 is the cast of the *canonical* type, because exprType derives from Type
SLANG_ASSERT(type0 != exprType);
// If I do a dynamicCast the result is either nullptr or a pointer that *must* point to the same object
Type* type1 = dynamicCast<Type>(exprType);
SLANG_ASSERT(type1 == exprType);
// Here, the pointer is pointing to a NamedExpressionType derived object, which derives from Type. BUT our pointer type here does *not* derive from Type.
RefObject* refObj = exprType;
// 'as' just looks at the from type, and since it doesn't derive from Type (it's just RefObject), it does a regular cast, same as dynamicCast
Type* type2 = as<Type>(refObj);
SLANG_ASSERT(type2 == exprType);
// Finally...
// Is true even though exprType is a NamedExpressionType, because the cast is performed on the canonical type
SLANG_ASSERT(as<NamedExpressionType>(exprType) == nullptr);
// dynamicCast always returns the same object, so the cast must succeed
SLANG_ASSERT(dynamicCast<NamedExpressionType>(exprType) == exprType);
```
Slang Project Coding Conventions
================================
Principles
----------
This document attempts to establish conventions to be used in the Slang codebase.
We have two goals for this convention.
The first goal is to make the code look relatively consistent so that it is easy to navigate and understand for contributors.
Having varying styles across different modules, files, functions, or lines of code makes the overall design and intention of the codebase harder to follow.
The second goal is to minimize the scope complexity of diffs when multiple maintainers work together on the codebase.
In the absence of an enforced style, developers tend to "clean up" code they encounter to match their personal preferences, and in so doing create additional diffs that increase the chances of merge conflicts and pain down the line.
Because the Slang codebase has passed through many hands and evolved without a pre-existing convention, these two goals can come into conflict.
We encourage developers to err on the side of leaving well enough alone (favoring the second goal).
Don't rewrite or refactor code to match these conventions unless you were already going to have to touch all of those lines of code anyway.
Note that external code that is incorporated into the project is excluded from all of these conventions.
Languages
---------
### C++
Most code in the Slang project is implemented in C++.
We currently assume support for some C++11 idioms, but have explicitly avoided adding dependencies on later versions.
As a general rule, be skeptical of "modern C++" ideas unless they are clearly better than simpler alternatives.
We are not quite in the realm of "Orthodox C++", but some of the same guidelines apply:
* Don't use exceptions for non-fatal errors (and even then support a build flag to opt out of exceptions)
* Don't use the built-in C++ RTTI system (home-grown is okay)
* Don't use the C++ variants of C headers (e.g., use `<stdio.h>` rather than `<cstdio>`)
* Don't use the STL containers
* Don't use iostreams
The compiler implementation does not follow some of these guidelines at present; that should not be taken as an excuse to further the proliferation of stuff like `dynamic_cast`.
Do as we say, not as we do.
Some relatively recent C++ features that are okay to use:
* Rvalue references for "move semantics," but only if you are implementing performance-critical containers or other code where this really matters.
* `auto` on local variables, if the expected type is clear in context
* Lambdas are allowed, but think carefully about whether just declaring a subroutine would also work.
* Using `>>` to close multiple levels of templates, instead of `> >` (but did you really need all those templates?)
* `nullptr`
* `enum class`
* Range-based `for` loops
* `override`
* Default member initializers in `class`/`struct` bodies
Templates are suitable in cases where they improve clarity and type safety.
As a general rule, it is best when templated code is kept minimal, and forwards to a non-templated function that does the real work, to avoid code bloat.
Any use of template metaprogramming would need to prove itself exceptionally useful to pay for the increase in cognitive complexity.
We don't want to be in the business of maintaining "clever" code.
As a general rule, `const` should be used sparingly and only with things that are logically "value types."
If you find yourself having to `const`-qualify a lot of member functions in a type that you expect to be used as a heap-allocated object, then something has probably gone wrong.
As a general rule, default to making the implementation of a type `public`, and only encapsulate state or operations with `private` when you find that there are complex semantics or invariants that can't be provided without a heavier hand.
### Slang
The Slang project codebase also includes `.slang` files implementing the Slang core module, as well as various test cases and examples.
The conventions described here are thus the "official" recommendations for how users should format Slang code.
To the extent possible, we will try to apply the same basic conventions to both C++ and Slang.
In places where we decide that the two languages merit different rules, we will point it out.
Files and Includes
------------------
### File Names
All files and directories that are added to codebase should have names that contain only ASCII lower-case letters, digits, dots (`.`) and dashes (`-`).
Operating systems still vary greatly in their handling of case sensitivity for file names, and non-ASCII code points are handled with even less consistency; sticking to a restricted subset of ASCII helps avoid some messy interactions between case-insensitive file systems and case-sensitive source-control systems like Git.
As with all these conventions, files from external projects are exempted from these restrictions.
### Naming of Source and Header Files
In general the C++ codebase should be organized around logical features/modules/subsystem, each of which has a single `.h` file and zero or more `.cpp` files to implement it.
If there is a single `.cpp` file, its name should match the header: e.g., `parser.h` and `parser.cpp`.
If there is more than one `.cpp` file, their names should start with the header name: e.g., `parser.h` and `parser-decls.cpp` and `parser-exprs.cpp`.
If there are declarations that need to be shared by the `.cpp` files, but that shouldn't appear in the public interface, then they can go in a `*-impl.h` header (e.g., `parser-impl.h`).
Use best judgement when deciding what counts as a "feature." One class per file is almost always overkill, but the codebase currently leans too far in the other direction, with some oversized source files.
### Headers
Every header file should have an include guard.
Within the implementation we can use `#pragma once`, but exported API headers (`slang.h`) should use traditional `#ifdef` style guards (and they should be consumable as both C and C++).
A header should include or forward-declare everything it needs in order to compile.
It is *not* up to the programmer who `#include`s a header to sort out the dependencies.
Avoid umbrella or "catch-all" headers.
### Source Files
Every source file should start by including the header for its feature/module, before any other includes (this helps ensure that the header correctly includes its dependencies).
Functions that are only needed within that one source file can be marked `static`, but we should avoid using the same name for functions in different files (in order to support lumped/unified builds).
### Includes
In general, includes should be grouped as follows:
* First, the corresponding feature/module header, if we are in a source file
* Next, any `<>`-enclosed includes for system/OS headers
* Next, any `""`-enclosed includes for external/third-party code that is stored in the project repository
* Finally, any includes for other features in the project
Within each group, includes should be sorted alphabetically.
If this breaks because of ordering issues for system/OS/third-party headers (e.g., `<windows.h>` must be included before `<GL/GL.h>`), then ideally those includes should be mediated by a Slang-project-internal header that features can include.
Namespaces
----------
Favor fewer namespaces when possible.
Small programs may not need any.
All standard module code that a Slang user might link against should go in the `Slang` namespace for now, to avoid any possibility of clashes in a static linking scenario.
The public C API is obviously an exception to this.
Code Formatting
------------------------------
- For C++ files, please format using `clang-format`; `.clang-format` files in
the source tree define the style.
- For CMake files, please format using `gersemi`
- For shell scripts, please format using `shfmt`
- For YAML files, please use `prettier`
The formatting for the codebase is overall specified by the
[`extras/formatting.sh`](./extras/formatting.sh) script.
If you open a pull request and the formatting is incorrect, you can comment
`/format` and a bot will format your code for you.
Naming
------
### Casing
Types should in general use `UpperCamelCase`. This includes `struct`s, `class`es, `enum`s and `typedef`s.
Values should in general use `lowerCamelCase`. This includes functions, methods, local variables, global variables, parameters, fields, etc.
Macros should in general use `SCREAMING_SNAKE_CASE`.
It is important to prefix all macros (e.g., with `SLANG_`) to avoid collisions, since `namespace`s don't affect macros.
In names using camel case, acronyms and initialisms should appear entirely in either upper or lower case (e.g., `D3DThing d3dThing`) and not be capitalized as if they were ordinary words (e.g., `D3dThing d3dThing`).
Note that this also applies to uses of "ID" as an abbreviation for "identifier" (e.g., use `nodeID` instead of `nodeId`).
### Prefixes
Prefixes based on types (e.g., `p` for pointers) should never be used.
Global variables should have a `g` prefix, e.g. `gCounter`.
Non-`const` `static` class members can have an `s` prefix if that suits your fancy.
Of course, both of these should be avoided, so this shouldn't come up often.
Constant data (in the sense of `static const`) should have a `k` prefix.
In contexts where "information hiding" is relevant/important, such as when a type has both `public` and `private` members, or just has certain operations/fields that are considered "implementation details" that most clients should not be using, an `m_` prefix on member variables and a `_` prefix on member functions is allowed (but not required).
In function parameter lists, an `in`, `out`, or `io` prefix can be added to a parameter name to indicate whether a pointer/reference/buffer is intended to be used for input, output, or both input and output.
For example:
```c++
void copyData(void* outBuffer, void const* inBuffer, size_t size);
Result lookupThing(Key k, Thing& outThing);
void maybeAppendExtraNames(std::vector<Name>& ioNames);
```
Public C APIs will prefix all symbol names while following the casing convention (e.g. `SlangModule`, `slangLoadModule`, etc.).
### Enums
C-style `enum` should use the following convention:
```c++
enum Color
{
kColor_Red,
kColor_Green,
kColor_Blue,
kColorCount,
};
```
When using `enum class`, drop the `k` and type name as prefix, but retain the `UpperCamelCase` tag names:
```c++
enum class Color
{
Red,
Green,
Blue,
Count,
};
```
When defining a set of flags, separate the type definition from the `enum`:
```c++
typedef unsigned int Axes;
enum
{
kAxes_None = 0,
kAxis_X = 1 << 0,
kAxis_Y = 1 << 1,
kAxis_Z = 1 << 2,
kAxes_All = kAxis_X | kAxis_Y | kAxis_Z,
};
```
Note that the type name reflects the plural case, while the cases that represent individual bits are named with a singular prefix.
In public APIs, all `enum`s should use the style of separating the type definition from the `enum`, and all cases should use `SCREAMING_SNAKE_CASE`:
```c++
typedef unsigned int SlangAxes;
enum
{
SLANG_AXES_NONE = 0,
SLANG_AXIS_X = 1 << 0,
SLANG_AXIS_Y = 1 << 1,
SLANG_AXIS_Z = 1 << 2,
SLANG_AXES_ALL = SLANG_AXIS_X | SLANG_AXIS_Y | SLANG_AXIS_Z,
};
```
### General
Names should default to the English language and US spellings, to match the dominant conventions of contemporary open-source projects.
Function names should either be named with action verbs (`get`, `set`, `create`, `emit`, `parse`, etc.) or read as questions (`isEnabled`, `shouldEmit`, etc.).
Whenever possible, compiler concepts should be named using the most widely-understood term available: e.g., we use `Token` over `Lexeme`, and `Lexer` over `Scanner` simply because they appear to be the more common names.
Avoid abbreviations and initialisms unless they are already widely established across the codebase; a longer name may be cumbersome to write in the moment, but the code will probably be read many more times than it is written, so clarity should be preferred.
An important exception to this is common compiler concepts or techniques which may have laboriously long names: e.g., Static Single Assignment (SSA), Sparse Conditional Copy Propagation (SCCP), etc.
One gotcha particular to compiler front-ends is that almost every synonym for "type" has some kind of established technical meaning; most notably the term "kind" has a precise meaning that is relevant in our domain.
It is common practice in C and C++ to define tagged union types with a selector field called a "type" or "kind," which does not usually match this technical definition.
If a developer wants to avoid confusion, they are encouraged to use the term "flavor" instead of "type" or "kind" since this term (while a bit silly) is less commonly used in the literature.
Comments and Documentation
--------------------------
You probably know the drill: comments are good, but an out-of-date comment can be worse than no comment at all.
Try to write comments that explain the "why" of your code more than the "what."
When implementing a textbook algorithm or technique, it may help to imagine giving the reviewer of your code a brief tutorial on the topic.
In cases where comments would benefit from formatting, use Markdown syntax.
We do not currently have a setup for extracting documentation from comments, but if we add one we will ensure that it works with Markdown.
When writing comments, please be aware that your words could be read by many people, from a variety of cultures and backgrounds.
Default to a plain-spoken and professional tone and avoid using slang, idiom, profanity, etc.
Understanding Declaration References (Out of Date)
====================================
This document is intended as a reference for developers working on the Slang compiler implementation.
As you work on the code, you'll probably notice a lot of places where we use the `DeclRef<T>` type:
* Expressions like `VarExpr` and `MemberExpr` are subclasses of `DeclRefExpr`, which holds a `DeclRef<Decl>`.
* The most common subclass of `Type` is `DeclRefType`, which holds a `DeclRef<Decl>` for the type declaration.
* Named types (references to `typedef`s) hold a `DeclRef<TypedefDecl>`
* The name lookup process relies a lot on `DeclRef<ContainerDecl>`
So what in the world is a `DeclRef`?
The short answer is that a `DeclRef` packages up two things:
1. A pointer to a `Decl` in the parsed program AST
2. A set of "substitutions" to be applied to that decl
Why do we need `DeclRef`s?
--------------------------
In a compiler for a simple language, we might represent a reference to a declaration as simply a pointer to the AST node for the declaration, or some kind of handle/ID that references that AST node.
A representation like that will work in simple cases, for example:
```hlsl
struct Cell { int value };
Cell a = { 3 };
int b = a.value + 4;
```
In this case, the expression node for `a.value` can directly reference the declaration of the field `Cell::value`, and from that we can conclude that the type of the field (and hence the expression) is `int`.
In contrast, things get more complicated as soon as we have a language with generics:
```hlsl
struct Cell<T> { T value; };
// ...
Cell<int> a = { 3 };
int b = a.value + 4;
```
In this case, if we try to have the expression `a.value` only reference `Cell::value`, then the best we can do is conclude that the field has type `T`.
In order to correctly type the `a.value` expression, we need enough additional context to know that it references `Cell<int>::value`, and from that to be able to conclude that a reference to `T` in that context is equivalent to `int`.
We can represent that information as a substitution which maps `T` to `int`:
```
[ Cell::T => int ]
```
Then we can encode a reference to `Cell<int>::value` as a reference to the single declaration `Cell::value` with such a substitution applied:
```
Cell::value [Cell::T => int]
```
If we then want to query the type of this field, we can first look up the type stored on the AST (which will be a reference to `Cell::T`) and apply the substitutions from our field reference to get:
```
Cell::T [Cell::T => int]
```
Of course, we can then simplify the reference by applying the substitutions, to get:
```
int
```
How is this implemented?
------------------------
At the highest level, a `DeclRef` consists of a pointer to a declaration (a `Decl*`) plus a singly linked list of `Substitution`s.
These substitutions fill in the missing information for any declarations on the ancestor chain for the declaration.
Each ancestor of a declaration can introduce an expected substitution along the chain:
* Most declarations don't introduce any substitutions: e.g., when referencing a non-generic `struct` we don't need any additional information.
* A surrounding generic declaration requires a `GenericSubstitution` which specifies the type argument to be plugged in for each type parameter of the declaration.
* A surrounding `interface` declaration usually requires a `ThisTypeSubstitution` that identifies the specific type on which an interface member has been looked up.
All of the expected substitutions should be in place in the general case, even when we might not have additional information. E.g., within a generic declaration like this:
```hlsl
struct Cell<T>
{
void a();
void b() { a(); }
}
```
The reference to `a` in the body of `b` will be represented as a declaration reference to `Cell::a` with a substitution that maps `[Cell::T => Cell::T]`. This might seem superfluous, but it makes it clear that we are "applying" the generic to arguments (even if they are in some sense placeholder arguments), and not trying to refer to an unspecialized generic.
There are a few places in the compiler where we might currently bend these rules, but experience has shown that failing to include appropriate substitutions is more often than not a source of bugs.
What in the world is a "this type" substitution?
------------------------------------------------
When using interface-constrained generics, we need a way to invoke methods of the interface on instances of a generic parameter type.
For example, consider this code:
```hlsl
interface IVehicle
{
associatedtype Driver;
Driver getDriver();
}
void ticketDriver<V : IVehicle>(V vehicle)
{
V.Driver driver = vehicle.getDriver();
sendTicketTo(driver);
}
```
In the expression `vehicle.getDriver`, we are referencing the declaration of `IVehicle::getDriver`, and so a naive reading tells us that the return type of the call is `IVehicle.Driver`, but that is an associated type and not a concrete type. It is clear in context that the expression `vehicle.getDriver()` should result in a `V.Driver`.
The way the compiler encodes that is that we treat the expression `v.getDriver` as first "up-casting" the value `v` (of type `V`) to the interface `IVehicle`. We know this is valid because of the generic constraint `V : IVehicle`. The result of the up-cast operation is an expression with a type that references `IVehicle`, but with a substitution to track the fact that the underlying implementation type is `V`. This amounts to something like:
```
IVehicle [IVehicle.This => V]
```
where `IVehicle.This` is a way to refer to "the concrete type that is implementing `IVehicle`".
Looking up the `getDriver` method on this up-cast expression yields a reference to:
```
IVehicle::getDriver [IVehicle.This => V]
```
And extracting the return type of that method gives us a reference to the type:
```
IVehicle::Driver [IVehicle.This => V]
```
which turns out to be exactly what the front end produces when it evaluates the type reference `V.Driver`.
As this example shows, a "this type" substitution allows us to refer to interface members while retaining knowledge of the specific type on which those members were looked up, so that we can compute correct references to things like associated types.
What does any of this mean for me?
----------------------------------
When working in the Slang compiler code, try to be aware of whether you should be working with a plain `Decl*` or a full `DeclRef`.
There are many queries like "what is the return type of this function?" that typically only make sense if you are applying them to a `DeclRef`.
The `syntax.h` file defines helpers for most of the existing declaration AST nodes for querying their properties in a way that respects substitutions (the type of a variable, the return type of a function, etc.).
If you are writing code that is working with a `DeclRef`, try to use these accessors and avoid being tempted to extract the bare declaration and start querying it.
Some things like `Modifier`s aren't (currently) affected by substitutions, so it can make sense to query them on a bare declaration instead of a `DeclRef`.
Conclusion
----------
Working with `DeclRef`s can be a bit obtuse at first, but they are the most elegant solution we've found to the problems that arise when dealing with generics and interfaces in the compiler front-end. Hopefully this document gives you enough context to see why they are important, and hints at how their representation in the compiler helps us implement some cases that would be tricky otherwise.


@@ -0,0 +1,252 @@
Existential Types
=================
This document attempts to provide some background on "existential types" as they pertain to the design and implementation of Slang.
The features described here are *not* reflected in the current implementation, so this is mostly a sketch of where we can go with the language and compiler.
Background: Generics and Universal Quantification
-------------------------------------------------
Currently Slang supports using interfaces as generic constraints. Let's use a contrived example:
```hlsl
interface IImage { float4 getValue(float2 uv); }
float4 offsetImage<T : IImage>(T image, float2 uv)
{
float2 offset = ...;
return image.getValue(uv + offset);
}
```
Generics like this are a form of "universal quantification" in the terminology of type theory.
This makes sense, because *for all* types `T` that satisfy the constraints, `offsetImage` provides an implementation of its functionality.
When we think of translating `offsetImage` to code, we might at first only think about how we can specialize it once we have a particular type `T` in mind.
However, we can also imagine trying to generate one body of code that can implement `offsetImage` for *any* type `T`, given some kind of runtime representation of types.
For example, we might generate C++ code like:
```c++
struct IImageWitnessTable { float4 (*getValue)(void* obj, float2 uv); };
float4 offsetImage(Type* T, IImageWitnessTable* W, void* image, float2 uv)
{
float2 offset = ...;
return W->getValue(image, uv + offset);
}
```
This translation takes the generic parameters and turns them into ordinary runtime parameters: the type `T` becomes a pointer to a run-time type representation, while the constraint that `T : IImage` becomes a "witness table" of function pointers that, we assume, implements the `IImage` interface for `T`. So, the syntax of generics is *not* tied to static specialization, and can admit a purely runtime implementation as well.
Readers who are familiar with how languages like C++ are implemented might see the "witness table" above and realize that it is kind of like a virtual function table, just being passed alongside the object, rather than stored in its first word.
Using Interfaces Like Types
---------------------------
It is natural for a user to want to write code like the following:
```hlsl
float4 modulateImage(IImage image, float2 uv)
{
float4 factor = ...;
return factor * image.getValue(uv);
}
```
Unlike `offsetImage`, `modulateImage` is trying to use the `IImage` interface as a *type* and not just a constraint.
This code appears to be asking for a dynamic implementation rather than specialization (we'll get back to that...) and so we should be able to implement it similarly to our translation of `offsetImage` to C++.
Something like the following makes a lot of sense:
```c++
struct IImage { Type* T; IImageWitnessTable* W; void* obj; };
float4 modulateImage(IImage image, float2 uv)
{
float4 factor = ...;
return factor * image.W->getValue(image.obj, uv);
}
```
Similar to the earlier example, there is a one-to-one mapping of the parameters of the Slang function the user wrote to the parameters of the generated C++ function.
To make this work, we had to bundle up the information that used to be separate parameters to the generic as a single value of type `IImage`.
Existential Types
-----------------
It turns out that when we use `IImage` as a type, it is what we'd call an *existential* type.
That is because if I give you a value `img` of type `IImage` in our C++ model, then you know that *there exists* some type `img.T`, a witness table `img.W` proving the type implements `IImage`, and a value `img.obj` of that type.
Existential types are the bread and butter of object-oriented programming.
If I give you an `ID3D11Texture2D*` you don't know what its concrete type is, and you just trust me that some concrete type *exists* and that it implements the interface.
A C++ class or COM component can implement an existential type, with the caveat that the set of interfaces a given type can support is limited by the way that virtual function tables are intrusively included inside the memory of the object, rather than externalized.
Many modern languages (e.g., Go) support adapting existing types to new interfaces, so that a "pointer" of interface type is actually a fat pointer: one for the object, and one for the interface dispatch table.
Our examples so far have assumed that the type `T` needs to be passed around separately from the witness table `W`, but that isn't strictly required in some implementations.
In type theory, the most important operation you can do with an existential type is to "open" it, which means to have a limited scope in which you can refer to the constituent pieces of a "bundled up" value of a type like `IImage`.
We could imagine "opening" an existential as something like:
```
void doSomethingCool<T : IImage>(T val);
void myFunc(IImage img)
{
open img as obj:T in
{
// In this scope we know that `T` is a type conforming to `IImage`,
// and `obj` is a value of type `T`.
//
doSomethingCool<T>(obj);
}
}
```
Self-Conformance
----------------
The above code with `doSomethingCool` and `myFunc` invites a much simpler solution:
```
void doSomethingCool<T : IImage>(T val);
void myFunc(IImage img)
{
doSomethingCool(img);
}
```
This seems like an appealing thing for a language to support, but there are some subtle reasons why this isn't possible to support in general.
If we think about what `doSomethingCool(img)` is asking for, it seems to be trying to invoke the function `doSomethingCool<IImage>`.
That function only accepts type parameters that implement the `IImage` interface, so we have to ask ourselves:
Does the (existential) type `IImage` implement the `IImage` interface?
Knowing the implementation strategy outlined above, we can re-phrase this question as: can we construct a witness table that implements the `IImage` interface for values of type `IImage`?
For simple interfaces this is sometimes possible, but in the general case there are other desirable language features that get in the way:
* When an interface has associated types, there is no type that can be chosen as the associated type for the interface's existential type. The "obvious" approach of using the constraints on the associated type can lead to unsound logic when interface methods take associated types as parameters.
* When an interface uses the "this type" (e.g., an `IComparable` interface with a `compareTo(ThisType other)` method), it isn't correct to simplify the this type to the interface type (just because you have two `IComparable` values doesn't mean you can compare them: they have to be of the same concrete type!)
* If we allow for `static` methods on interfaces, then what implementation would we use for these methods on the interface's existential type?
Encoding Existentials in the IR
-------------------------------
Existentials are encoded in the Slang IR quite simply. We have an operation `makeExistential(T, obj, W)` that takes a type `T`, a value `obj` that must have type `T`, and a witness table `W` that shows how `T` conforms to some interface `I`. The result of the `makeExistential` operation is then a value of the type `I`.
Rather than include an IR operation to "open" an existential, we can instead just provide accessors for the pieces of information in an existential: one to extract the type field, one to extract the value, and one to extract the witness table. These would idiomatically be used like:
```
let e : ISomeInterface = /* some existential */
let T : Type = extractExistentialType(e);
let W : WitnessTable = extractExistentialWitnessTable(e);
let obj : T = extractExistentialValue(e);
```
Note how the operation to extract `obj` gets its result type from the previously-executed extraction of the type.
Simplifying Code Using Existentials
-----------------------------------
It might seem like IR code generated using existentials can only be implemented using dynamic dispatch.
However, within a local scope it is clear that we can simplify expressions whenever `makeExistential` and `extractExistential*` operations are paired.
For example:
```
let e : ISomeInterface = makeExistential(A, a, X);
...
let B = extractExistentialType(e);
let b : B = extractExistentialValue(e);
let Y = extractExistentialWitnessTable(e);
```
It should be clear in context that we can replace `B` with `A`, `b` with `a`, and `Y` with `X`, after which all of the `extract*` operations and the `makeExistential` operation are dead and can be eliminated.
This kind of simplification works within a single function, as long as there is no conditional logic involving existentials.
We require further transformation passes to allow specialization in more general cases:
* Copy propagation, redundancy elimination and other dataflow optimizations are needed to simplify use of existentials within functions
* Type legalization passes, including some amount of scalarization, are needed to "expose" existential-type fields that are otherwise buried in a type
* Function specialization is needed so that a function with existential parameters is specialized based on the actual types used at call sites
Transformations just like these are already required when working with resource types (textures/samplers) on targets that don't support first-class computation on resources, so it is possible to share some of the same logic.
Similarly, any effort we put into validation (to ensure that code is written in a way that *can* be simplified) can hopefully be shared between existentials and resources.
Compositions
------------
So far I've only talked about existential types based on a single interface, but if you look at the encoding as a tuple `(obj, T, W)` there is no real reason that can't be generalized to hold multiple witness tables: `(obj, T, W0, ... WN)`. Interface compositions could be expressed at the language level using the `&` operator on interface (or existential) types.
The IR encoding doesn't need to change much to support compositions: we just need to allow multiple witness tables on `makeExistential` and have an index operand on `extractExistentialWitnessTable` to get at the right one.
The hardest part of supporting composition of interfaces is actually in how to linearize the set of interfaces in a way that is stable, so that changing a function from using `IA & IB` to `IB & IA` doesn't change the order in which witness tables get packed into an existential value.
Why are we passing along the type?
----------------------------------
I'm glossing over something pretty significant here, which is why anybody would pass around the type as part of the existential value, when none of our examples so far have made use of it.
This sort of thing isn't very important for languages where interface polymorphism is limited to heap-allocated "reference" types (or values that have been "boxed" into reference types), because the dynamic type of an object can almost always be read out of the object itself.
When dealing with a value type, though, we have to deal with things like making *copies*:
```
interface IWritable { [mutating] void write(int val); }
struct Cell : IWritable { int data; void write(int val) { data = val; } }
T copyAndClobber<T : IWritable>(T obj)
{
T copy = obj;
obj.write(9999);
return copy;
}
void test()
{
Cell cell = { 0 };
Cell result = copyAndClobber(cell);
// what is in `result.data`?
}
```
If we call `copyAndClobber` on a `Cell` value, then does the line `obj.write` overwrite the data in the explicit `copy` that was made?
It seems clear that a user would expect `copy` to be unaffected in the case where `T` is a value type.
How does that get implemented in our runtime version of things? Let's imagine some C++ translation:
```
void copyAndClobber(Type* T, IWritableWitnessTable* W, void* obj, void* _returnVal)
{
void* copy = alloca(T->sizeInBytes);
T->copyConstruct(copy, obj);
W->write(obj, 9999);
T->moveConstruct(_returnVal, copy);
}
```
```
Because this function returns a value of type `T` and we don't know how big that is, let's assume the caller is passing in a pointer to the storage where we should write the result.
Now, in order to have a local `copy` of the `obj` value that was passed in, we need to allocate some scratch storage, and only the type `T` can know how many bytes we need.
Furthermore, when copying `obj` into that storage, or subsequently copying the `copy` variable into the function result, we need the copy/move semantics of type `T` to be provided by somebody.
This is the reason for passing through the type `T` as part of an existential value.
If we only wanted to deal with reference types, this would all be greatly simplified, because the `sizeInBytes` and the copy/move semantics would be fixed: everything is a single pointer.
All of the same issues arise if we're making copies of existential values:
```
IWritable copyAndClobberExistential(IWritable obj)
{
IWritable copy = obj;
obj.write(9999);
return copy;
}
```
If we want to stay consistent and say that `copy` is an actual copy of `obj` when the underlying type is a value rather than a reference type, then we need the copy/move operations for `IWritable` to handle invoking the copy/move operations of the underlying encapsulated type.
Aside: it should be clear from these examples that implementing generics and existential types with dynamic dispatch has a lot of complexity when we have to deal with value types (because copying requires memory allocation).
It is likely that a first implementation of dynamic dispatch support for Slang would restrict it to reference types (and would thus add a `class` keyword for defining reference types).


@@ -0,0 +1,74 @@
Deploying Experimental API Additions
====================================
This page intends to provide guidance to Slang developers when extending the Slang API, particularly when working on experimental features.
It applies to the "COM-lite" Slang API, rather than the deprecated C Slang API (sp* functions).
* Note: This guidance relates to Slang API changes, not to language changes. That is, what Slang does with shader source code across releases is not discussed here.
The goal is to maintain binary compatibility as much as possible between Slang releases, and to aid applications in dealing with changes to Slang.
Slang is distributed as a dynamic library, and there is an expectation from Slang API users that upgrading by installing an updated slang-compiler.dll or libslang-compiler.so will not break their application unnecessarily.
ABI compatibility within the Slang API can be preserved between releases if some rules are followed by developers.
The Slang API uses a "COM-lite" structure wherein functionality is exposed through interfaces on objects. If the interfaces never change, ABI compatibility is preserved; but changes do happen. When adding or changing interfaces, please observe the following:
1. It is preferred to create *new* COM interfaces when adding new functionality.
* This maintains ABI compatibility.
* Applications must acquire access to the new functionality using QueryInterface(), which will gracefully fail if the slang-compiler.dll/libslang-compiler.so does not implement the functionality.
2. Changes to existing virtual methods in COM interfaces should be avoided, as that is an ABI breakage.
* If a change is required though, change the interface's UUID.
3. New virtual methods _may_ be added (only) to the end of existing COM interface structs.
* This does not disturb the ABI compatibility of the associated vtable. Old apps can remain unaware of the new function pointers appended to the end of the vtable.
* A UUID change is not necessary.
* Note that in the event that a Slang application which uses the added feature is run with an old slang-compiler.dll/libslang-compiler.so, the experience for the user is not as clean as if the added method belongs to a new interface.
Adding Experimental Interfaces
==============================
When the above recommendations cannot be followed, as with features that are expected to be iterated on or are regarded as temporary, there are additional recommendations.
Interfaces that are expected to change must be marked `_Experimental` in their class name and in their UUID name.
For example,
```csharp
/* Experimental interface for doing something cool. This interface is susceptible to ABI breakage. */
struct ICoolNewFeature_Experimental : public ISlangUnknown
{
SLANG_COM_INTERFACE(0x8e12e8e3, 0x5fcd, 0x433e, { 0xaf, 0xcb, 0x13, 0xa0, 0x88, 0xbc, 0x5e, 0xe5 })
virtual SLANG_NO_THROW SlangResult SLANG_MCALL coolMethod() = 0;
};
#define SLANG_UUID_ICoolNewFeature_Experimental ICoolNewFeature_Experimental::getTypeGuid()
```
Note: Use `uuidgen` to generate IIDs for new interfaces.
Removing Experimental Interfaces
================================
By the nature of being marked "Experimental", users have been warned that these interfaces are not officially supported and may be removed. You may simply delete the class and UUID; e.g., the "ICoolNewFeature_Experimental" struct may be deleted from slang.h along with the definition of SLANG_UUID_ICoolNewFeature_Experimental.
This will show up in applications as QueryInterface failures.
It is nice, but not required, to retain the interface declarations for some time after removing internal support before deleting them from slang.h, so that applications have time to remove their dependence on the unsupported feature while still being able to compile in the interim.
Changing Experimental Interfaces
================================
Backwards incompatible changes to Slang COM interfaces should be accompanied with a UUID change.
In the event that an old application runs with a new slang library, applications are more capable of gracefully handling an unavailable interface than a changed one. The former may still be functional, or include a helpful error message, whereas the latter is most likely a crash of some sort.
Promoting Experimental Interfaces
=================================
The class name and the UUID name should be changed in slang.h and in the Slang source code, e.g., rename "ICoolNewFeature_Experimental" to just "ICoolNewFeature".
The SLANG_UUID for the interface should be renamed to omit "_Experimental", but its value should remain the same. This is because, if no backwards incompatible changes accompany the promotion from experimental to permanent, applications written against the experimental version can continue working against Slang libraries where the interface was promoted to permanent.


@@ -0,0 +1,486 @@
Interfaces Design
=================
This document intends to lay out the proposed design for a few inter-related features in Slang:
- Interfaces
- Associated Types
- Generics
Introduction
------------
The basic problem here is not unique to shader programming: you want to write code that accomplishes one task, while abstracting over how to accomplish another task.
As an example, we might want to write code to integrate incident radiance over a list of lights, while not concerning ourselves with how to evaluate a reflectance function at each of those lights.
If we were doing this task on a CPU, and performance wasn't critical, we could probably handle this with higher-order functions or an equivalent mechanism like function pointers:
float4 integrateLighting(
Light[] lights,
float4 (*brdf)(float3 wi, float3 wo, void const* userData),
void const* brdfUserData)
{
float4 result = 0;
for(/* ... */) {
// ...
result += brdf(wi, wo, brdfUserData);
}
return result;
}
Depending on the scenario, we might be able to generate statically specialized code by using templates instead:
template<typename BRDF>
float4 integrateLighting(Light[] lights, BRDF const& brdf)
{
// ...
result += brdf(wi, wo);
// ...
}
Current shading languages support neither higher-order functions nor templates/generics, so neither of these options is viable.
Instead practitioners typically use preprocessor techniques to either stitch together the final code, or to substitute in different function/type definitions to make a definition like `integrateLighting` reusable.
These ad hoc approaches actually work well in practice; we aren't proposing to replace them *just* to make code abstractly "cleaner."
Rather, we've found that the ad hoc approaches end up interacting poorly with the resource binding model in modern APIs, so that *something* less ad hoc is required to achieve our performance goals.
At that point, we might as well ensure that the mechanism we introduce is also a good fit for the problem.
Overview
--------
The basic idea for our approach is as follows:
- Start with the general *semantics* of a generic-based ("template") approach
- Use the accumulated experience of the programming language community to ensure that our generics are humane (in other words: not like C++)
- Explore the possibility of syntactic sugar to let people use more traditional OOP-style syntax when it can reduce verbosity without harming understanding
In general, our conceptual model is being ripped off wholesale from Rust and Swift.
The basic design principle is "when in doubt, do what Swift does."
Interfaces
----------
An **interface** in Slang is akin to a `protocol` in Swift or a `trait` in Rust.
The choice of the `interface` keyword is to highlight the overlap with the conceptually similar construct that appeared in Cg, and then later in HLSL.
### Declaring an interface
An interface is a named collection of **requirements**; any type that **implements** the interface must provide definitions that satisfy those requirements.
Here is a simple interface, with one requirement:
interface Light
{
float3 illuminate(float3 P_world);
}
The `Light` interface requires a (member) function called `illuminate` with the given signature.
### Declaring that a type implements an interface
A user-defined `struct` type can declare that it implements an interface, by using conventional "inheritance" syntax:
struct PointLight : Light
{
float3 P_light;
float3 illuminate(float3 P_world)
{
float distance = length(P_light - P_world);
// ...
}
}
It is a static error if a type declares that it implements an interface, but it does not provide all of the requirements:
struct BadLight : Light
{
// ERROR: type 'BadLight' cannot implement 'Light'
// because it does not provide the required 'illuminate' function
}
### Interface Inheritance
While this document does not propose general notions of inheritance be added to Slang, it does make sense to allow an interface to inherit from zero or more other interfaces:
interface InfinitessimalLight : Light
{
float3 getDirection(float3 P_world);
}
In this case the `InfinitessimalLight` interface inherits from `Light`, and declares one new requirement.
In order to check that a type implements `InfinitessimalLight`, the compiler will need to check both that it implements `Light` and that it provides the new "direct" requirements in `InfinitessimalLight`.
Declaring that a type implements an interface also implicitly declares that it implements all the interfaces that interface transitively inherits from:
struct DirectionalLight : InfinitessimalLight
{
float3 L;
float3 dir;
float3 getDirection(float3 P_world) { return dir; }
float3 illuminate(float3 P_world)
{
// Okay, this is the point where I recognize
// that this function definition is not
// actually reasonable for a light...
}
}
### Interfaces and Extensions
It probably needs its own design document, but Slang currently has very basic support for `extension` declarations that can add members to an existing type.
These blocks correspond to `extension` blocks in Swift, or `impl` blocks in Rust.
This can be used to declare that a type implements an interface retroactively:
extension PointLight : InfinitessimalLight
{
float3 getDirection(float3 P_world)
{
return normalize(P_light - P_world);
}
}
In this case we've used an extension to declare the `PointLight` also implements `InfinitessimalLight`. For the extension to type-check we need to provide the new required function (the compiler must recognize that the implementation of `Light` was already provided by the original type definition).
There are some subtleties around using extensions to add interface implementations:
- If the type already provides a method that matches a requirement, can the extension "see" it to satisfy new requirements?
- When can one extension "see" members (or interface implementations) added by another?
A first implementation can probably ignore the issue of interface implementations added by extensions, and only support them directly on type definitions.
Generics
--------
All of the above discussion around interfaces neglected to show how to actually *use* the fact that, e.g., `PointLight` implements the `Light` interface.
That is intentional, because at the most basic level, interfaces are designed to be used in the context of **generics**.
### Generic Declarations
The Slang compiler currently has some ad hoc support for generic declarations that it uses to implement the HLSL standard module (which has a few generic types).
The syntax for those is currently very bad, and it makes sense to converge on the style for generic declarations used by C# and Swift:
float myGenericFunc<T>(T someValue);
Types can also be generic:
struct MyStruct<T> { float a; T b; }
Ideally we should also allow interfaces and interface requirements to be generic, but there will probably be some limits due to implementation complexity.
### Type Constraints
Unlike C++, Slang needs to be able to type-check the body of a generic function ahead of time, so it can't rely on `T` having particular members:
// This generic is okay, because it doesn't assume anything about `T`
// (other than the fact that it can be passed as input/output)
T okayGeneric<T>(T a) { return a; }
// This generic is not okay, because it assumes that `T` supports
// certain operators, and we have no way of knowing if this is true:
T notOkayGeneric<T>(T a) { return a + a; }
In order to rely on non-trivial operations in a generic parameter type like `T`, the user must **constrain** the type parameter using an interface:
float3 mySurfaceShader<L : Light>(L aLight)
{
return aLight.illuminate(...);
}
In this example, we have constrained the type parameter `L` so that it must implement the interface `Light`.
As a result, in the body of the function, the compiler can recognize that `aLight`, which is of type `L`, must implement `Light` and thus have a member `illuminate`.
When calling a function with a constrained type parameter, the compiler must check that the actual type argument (whether provided explicitly or inferred) implements the interface given in the constraint:
mySurfaceShader<PointLight>(myPointLight); // OK
mySurfaceShader(myPointLight); // equivalent to previous
mySurfaceShader(3.0f); // ERROR: `float` does not implement `Light`
Note that in the erroneous case, the error is reported at the call site, rather than in the body of the callee (as it would be for C++ templates).
For cases where we must constrain a type parameter to implement multiple interfaces, we can join the interface types with `&`:
interface Foo { void foo(); }
interface Bar { void bar(); }
void myFunc<T : Foo & Bar>(T val)
{
val.foo();
val.bar();
}
If we end up with very complicated type constraints, then it makes sense to support a "`where` clause" that allows requirements to be stated outside of the generic parameter list:
void myFunc<T>(T val)
where T : Foo,
T : Bar
{}
Both the use of `&` and `where` are advanced features that we might cut due to implementation complexity.
### Value Parameters
Because HLSL has generics like `vector<float,3>` that already take non-type parameters, the language will need *some* degree of support for generic parameters that aren't types (at least integers need to be supported).
We need syntax for this that doesn't bloat the common case.
In this case, I think that what I've used in the current Slang implementation is reasonable, where a value parameter needs a `let` prefix:
void someFunc<
T, // type parameter
T : X, // type parameter with constraint
T = Y, // type parameter with default
T : X = Y, // type parameter with constraint and default
let N : int, // value parameter (type must be explicit)
let N : int = 3> // value parameter with default
()
{ ... }
We should also extend the `where` clauses to support inequality constraints on (integer) value parameters to enforce rules about what ranges of integers are valid.
The front-end should issue error messages if it can statically determine these constraints are violated, but it should probably defer full checking until the IR (maybe... we need to think about how much of a dependent type system we are willing to have).
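As a speculative sketch of what extending `where` to value parameters might look like (this syntax is hypothetical and not part of the current implementation):

```
// Hypothetical: constrain the valid range of a value parameter
void unrollLoop<let N : int>(...)
    where N >= 1,
          N <= 8
{ ... }
```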
Associated Types
----------------
While the syntax is a bit different, the above mechanisms have approximately the same capabilities as Cg interfaces.
What the above approach can't handle (and neither can Cg) is a reusable definition of a surface material "pattern" that might blend multiple material layers to derive parameters for a specific BRDF.
That is, suppose we have two BRDFs: one with two parameters, and one with six.
Different surface patterns may want to target different BRDFs.
So if we write a `Material` interface like:
interface Material
{
BRDFParams evaluatePattern(float2 uv);
}
Then what should `BRDFParams` be? The two-parameter or six-parameter case?
An **associated type** is a concept that solves exactly this problem.
We don't care *what* the concrete type of `BRDFParams` is, so long as *every* implementation of `Material` has one.
The exact `BRDFParams` type can be different for each implementation of `Material`; the type is *associated* with a particular implementation.
We will crib our syntax for this entirely from Swift, where it is verbose but explicit:
interface Material
{
associatedtype BRDFParams;
BRDFParams evaluatePattern(float2 uv);
float3 evaluateBRDF(BRDFParams param, float3 wi, float3 wo);
}
In this example we've added an associated type requirement so that every implementation of `Material` must supply a type named `BRDFParams` as a member.
We've also added a requirement that is a function to evaluate the BRDF given its parameters and incoming/outgoing directions.
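For instance, a concrete type might satisfy the associated type requirement with a `typealias` (the type and member names here are purely illustrative, not taken from any actual library):

```
struct BlinnPhongParams { float3 specularColor; float shininess; }

struct BlinnPhongMaterial : Material
{
    typealias BRDFParams = BlinnPhongParams;

    BRDFParams evaluatePattern(float2 uv) { /* ... */ }
    float3 evaluateBRDF(BRDFParams param, float3 wi, float3 wo) { /* ... */ }
}
```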
Using this declaration one can now define a generic function that works on any material:
float3 evaluateSurface<M : Material, L : Light>(
M material,
L[] lights,
float3 P_world,
float2 uv)
{
M.BRDFParams brdfParams = material.evaluatePattern(uv);
for(...)
{
L light = lights[i];
// ...
float3 reflectance = material.evaluateBRDF(brdfParams, ...);
}
}
Some quick notes:
- The use of `associatedtype` (for associated types) and `typealias` (for `typedef`-like definitions) as distinct keywords in Swift was well motivated by their experience (they used to use `typealias` for both). I would avoid having the two cases be syntactically identical.
- Swift has a pretty involved inference system where a type doesn't actually need to explicitly provide a type member with the chosen name. Instead, if you have a required method that takes or returns the associated type, then the compiler can infer what the type is by looking at the signature of the methods that meet other requirements. This is a complex and magical feature, and we shouldn't try to duplicate it.
- Both Rust and Swift call this an "associated type." They are related to "virtual types" in things like Scala (which are in turn related to virtual classes in beta/gbeta). There are similar ideas that arise in Haskell-like languages with type classes (IIRC, the term "functional dependencies" is relevant).
### Alternatives
I want to point out a few alternatives to the `Material` design above, just to show that associated types seem to be an elegant solution compared to the alternatives.
First, note that we could break `Material` into two interfaces, so long as we are allowed to place type constraints on associated types:
interface BRDF
{
float3 evaluate(float3 wi, float3 wo);
}
interface Material
{
associatedtype B : BRDF;
B evaluatePattern(float2 uv);
}
This refactoring might be cleaner if we imagine that a shader library would have a family of reflectance functions (implementing `BRDF`) and then a large library of material patterns (implementing `Material`) - we wouldn't want each and every material to have to implement a dummy `evaluateBRDF` that just forwards to a BRDF instance nested in it.
Looking at that type `B` there, we might start to wonder if we could just replace this with a generic type parameter on the interface:
interface Material< B : BRDF >
{
B evaluatePattern(float2 uv);
}
This would change any type that implements `Material`:
// old:
struct MyMaterial : Material
{
typealias B = GGX;
GGX evaluatePattern(...) { ... }
}
// new:
struct MyMaterial : Material<GGX>
{
GGX evaluatePattern(...) { ... }
}
That doesn't seem so bad, but it ignores the complexity that arises at any use sites, e.g.:
float3 evaluateSurface<B : BRDF, M : Material<B>, L : Light>(
M material,
L[] lights,
float3 P_world,
float2 uv)
{ ... }
The type `B` which is logically an implementation detail of `M` now surfaces to the generic parameter list of any function that wants to traffic in materials.
This reduces the signal/noise ratio for anybody reading the code, and also means that any top-level code that is supposed to be specializing this function (suppose this was a fragment entry point) now needs to understand how to pick apart the `Material` it has on the host side to get the right type parameters.
This kind of issue has existed in the PL community at least as far back as the ML module system (it is tough to search for by name, but the concepts of "parameterization" vs. "fibration" are relevant here), and the Scala researchers made a clear argument (I think it was in the paper on "un-types") that there is a categorical distinction between the types that are logically the *inputs* to an abstraction, and the types that are logically the *outputs*. Generic type parameters and associated types handle these two distinct roles.
Returning an Interface
----------------------
The revised `Material` definition:
interface BRDF
{
float3 evaluate(float3 wi, float3 wo);
}
interface Material
{
associatedtype B : BRDF;
B evaluatePattern(float2 uv);
}
has a function `evaluatePattern` that returns a type that implements an interface.
In the case where the return type is concrete, this isn't a problem (and the nature of associated types means that `B` will be concrete in any actual concrete implementation of `Material`).
There is an open question of whether it is ever necessary (or even helpful) to have a function that returns a value of *some* type known to implement an interface, without having to state that type in the function signature.
This is a point that has [come up](https://github.com/rust-lang/rfcs/blob/master/text/1951-expand-impl-trait.md) in the Rust world, where they have discussed using a keyword like `some` to indicate the existential nature of the result type:
// A function that returns *some* implementation of `Light`
func foo<T>() -> some Light;
The Rust proposal linked above has them trying to work toward `impl` as the keyword, and allowing it in both argument and result positions (to cover both universal and existential quantification).
In general, such a feature would need to have many constraints:
- The concrete return type must be fixed (even if clients of the function should be insulated from the choice), given the actual generic arguments provided.
- If the existential is really going to be sealed, then the caller shouldn't be allowed to assume anything *except* that two calls to the same function with identical generic arguments should yield results of identical type.
Under those constraints, it is pretty easy to see that an existential-returning method like:
interface Foo<T>
{
func foo<U>() -> some Bar;
}
can in principle be desugared into:
interface Foo<T>
{
associatedtype B<U> : Bar;
func foo<U>() -> B<U>;
}
with no particular loss in what can be expressed.
The same desugaring approach should apply to global-scope functions that want to return an existential type (just with a global `typealias` instead of an `associatedtype`).
It might be inconvenient for the user to have to explicitly write the type-level expression that yields the result type (consider cases where C++ template metaprogrammers would use `auto` as a result type), but there is really no added power.
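Sketching that global-scope case with the hypothetical `some` syntax from above (all names here are illustrative):

```
// Before: a function returning *some* implementation of `Light`
func makeDefaultLight() -> some Light;

// After desugaring: the concrete choice is exposed as a global alias,
// while callers still only rely on the `Light` interface
typealias DefaultLightType = PointLight;
func makeDefaultLight() -> DefaultLightType;
```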
Object-Oriented Sugar
---------------------
Having to explicitly write out generic parameter lists is tedious, especially in the (common) case where we will have exactly one parameter corresponding to each generic type parameter:
// Why am I repeating myself?!
//
void foo<L : Light, M : Material, C : Camera>(
L light, M material, C camera);
The intent seems to be clear if we instead write:
void foo(Light light, Material material, Camera camera);
We could consider the latter to be sugar for the former, and allow users to write in a familiar syntax akin to what was already supported in Cg.
We'd have to be careful with such sugar, though, because there is a real and meaningful difference between saying:
- "`material` has type `Material` which is an interface type"
- "`material` has type `M` where `M` implements `Material`"
In particular, if we start to work with associated types:
let b = material.evaluatePattern(...);
It makes sense to say that `b` has type `M.BRDF`.
It does **not** make sense to say that `b` has type `Material.BRDF`, because there is no such concrete type.
(A third option is to say that `b` has type `material.BRDF`, which is basically the point where you have "virtual types" because we are now saying the type is a member of the *instance* and not of an enclosing *type*)
Note that the issue of having or not having object-oriented sugar is technically orthogonal from whether we allow "existential return types."
However, allowing the user to think of interfaces in traditional OOP terms makes it more likely that they will try to declare:
- functions that return an interface type
- local variables of interface type (which they might even assign to!)
- fields of interface type in their `struct`s
All of these complicate the desugaring step, because we would de facto have types/functions that mix up two stages of evaluation: a compile-time type-level step and a run-time value-level step.
Ultimately, we'd probably need to express these by having a multi-stage IR (with two stages) which we optimize in the staged setting before stage-splitting to get separate type-level and value-level operations (akin to the desugaring for existential return types I described above).
My sense is that a certain amount of multi-stage programming may already be needed to deal with certain HLSL/GLSL idioms. In particular:
- GLSL supports passing unsized arrays (e.g., `int[] a`) to a function, and then having the function use the size of the array (`a.length`) to do loops, etc. These would need to be lowered to distinct SPIR-V code for every array size used (if I understand the restrictions correctly), and so the feature is perhaps best thought of as passing both a compile-time integer parameter and a run-time array parameter (where the size comes from that parameter)
- HLSL and GLSL both have built-in functions where certain parameters are required to be compile-time constants. A feature-complete front-end must detect when calls to these functions are valid, and report errors to the user. In order to make the errors easier to explain to the user, it would be helpful to have an explicit notion of constant-rate computation, and require that the user express explicit constant-rate parameters/expressions.
All of this ties into the question of whether we need/want to support more general kinds of compile-time evaluation for specialization (e.g., statically-determine `if` statements or loops).
Other Languages
---------------
It is worth double-checking whether implementing all of this from scratch in Slang is a good idea, or if there is somewhere else we can achieve similar results more quickly:
- The Metal shading language has much of what we'd want. It is based on C++ templates, which are maybe not the ideal mechanism, and the compiler is closed-source so we can't easily add functionality. Still, it should be possible to prototype a lot of what we want on top of Metal 2.
- The open-source HLSL compiler doesn't support any of the new ideas here, but it may be that adding them to `dxc` would be faster than adding them to the Slang project code. Using `dxc` is a no-go for some of the other Slang requirements (that come from our users on the Falcor project).
- Swift already supports almost everything on our list of requirements, but as it stands today there is no easy path to using it for low-level GPU code generation. It also fails to meet our goals for incremental adoption, high-level source output, etc.
In the long run, however, the Swift compiler seems like an attractive intercept for this work, because their long-term roadmap seems like it will close a lot of the gap with what we've done so far.
Conclusion
----------
This document has described the basic syntax and semantics for three related features -- interfaces, generics, and associated types -- along with some commentary on longer-term directions.
My expectation is that we will use the syntax as laid down here, unless we have a very good reason to depart from it, and we will prioritize implementation work as needed to get interesting shader library functionality up and running.
# Slang IR Instruction Management and Versioning
This document explains how Slang's intermediate representation (IR) instructions are defined, generated, and versioned. It covers the workflow for adding or modifying instructions and the mechanisms that ensure backwards compatibility for serialized IR modules.
## High-Level Concepts
The Slang IR uses a code generation approach where instruction definitions are centralized in a Lua file (`slang-ir-insts.lua`), and various C++ headers and source files are generated from this single source of truth. This ensures consistency across the codebase and enables sophisticated features like backwards compatibility through stable instruction naming.
### Key Components
- **Instruction Definitions** (`slang-ir-insts.lua`): The canonical source for all IR instruction definitions
- **Stable Names** (`slang-ir-insts-stable-names.lua`): Maps instruction names to permanent integer IDs for backwards compatibility
- **Code Generation** (via Fiddle): Generates C++ enums, structs, and tables from the Lua definitions
- **Module Versioning**: Tracks compatibility ranges for serialized IR modules
## The Instruction Definition System
### Source of Truth: `slang-ir-insts.lua`
All IR instructions are defined in `source/slang/slang-ir-insts.lua`. This file contains a hierarchical table structure that defines:
- Instruction names and their organization into categories
- Struct names for the C++ representation (if different from the default)
- Flags like `hoistable`, `parent`, `global`, etc.
- (Optionally) Minimum operand counts
- (Optionally) The operands themselves
- Parent-child relationships in the instruction hierarchy
Here's a simplified example of how instructions are defined:
```lua
local insts = {
{ nop = {} },
{
Type = {
{
BasicType = {
hoistable = true,
{ Void = { struct_name = "VoidType" } },
{ Bool = { struct_name = "BoolType" } },
{ Int = { struct_name = "IntType" } },
-- ... more basic types
},
},
-- ... more type categories
},
},
-- ... more instruction categories
}
```
The hierarchy is important: instructions inherit properties from their parent categories. For example, all `BasicType` instructions inherit the `hoistable = true` flag.
### Code Generation Flow
The Fiddle tool processes `slang-ir-insts.lua` and generates several outputs:
1. **Enum Definitions** (`slang-ir-insts-enum.h`):
- `IROp` enum with values like `kIROp_Void`, `kIROp_Bool`, etc.
- Range markers like `kIROp_FirstBasicType` and `kIROp_LastBasicType`
2. **Struct Definitions** (`slang-ir-insts.h`):
- C++ struct definitions for instruction types not manually defined
- `leafInst()` and `baseInst()` macros for RTTI support
- If an instruction's operands are specified in `slang-ir-insts.lua` in the format `{ { "operand1_name", "operand1_type" }, {"operand2_name"} }` and so on, Fiddle will generate a getter for each operand as part of the instruction's struct. Note that the order in which the operands are listed matters, and the operand type is optional, defaulting to `IRInst` when not specified.
3. **Instruction Info Table** (`slang-ir-insts-info.cpp`):
- Maps opcodes to their string names, operand counts, and flags
- Used for debugging, printing, and validation
4. **Stable Name Mappings** (`slang-ir-insts-stable-names.cpp`):
- Bidirectional mapping between opcodes and stable IDs
- Critical for backwards compatibility
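As a sketch of the operand-list format described in item 2 above (the instruction name and exact table keys are illustrative assumptions, not actual entries from `slang-ir-insts.lua`):

```lua
-- Hypothetical instruction entry with named operands:
{ MyBinaryOp = {
    operands = { { "left", "IRInst" }, { "right" } },
} },
```

From an entry like this, Fiddle would generate getters along the lines of `getLeft()` and `getRight()` on the instruction's struct, with `right` defaulting to type `IRInst` since no type was given.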
## Adding or Modifying Instructions
### Adding a New Instruction
To add a new IR instruction:
1. **Edit `slang-ir-insts.lua`**: Add your instruction in the appropriate category:
```lua
{ MyNewInst = { min_operands = 2 } },
```
2. **Run the build**: The build system will automatically regenerate the C++ files.
3. **Update the stable names**: Either
- Run the validation script:
**Note**: Skip the make command if Lua is already built.
```bash
make -C external/lua MYCFLAGS="-DLUA_USE_POSIX" MYLIBS=""
./external/lua/lua extras/check-ir-stable-names.lua update
```
- Or add a new ID to the mapping in `source/slang/slang-ir-insts-stable-names.lua`; this is checked for consistency in CI, so it's safe to add manually.
This assigns a permanent ID to your new instruction.
4. **Implement the instruction logic**: Add handling in relevant files like:
- `slang-ir-insts.h` (if you need a custom struct definition)
- `slang-emit-*.cpp` files for code generation
- `slang-ir-lower-*.cpp` files for transformations
5. **Update the module version**: In `slang-ir.h`, increment `k_maxSupportedModuleVersion`:
```cpp
const static UInt k_maxSupportedModuleVersion = 1; // was 0
```
### Modifying an Existing Instruction
Modifications require more care:
- **Adding operands or changing semantics**: This is a breaking change. You must:
1. Increment both `k_minSupportedModuleVersion` and `k_maxSupportedModuleVersion`
2. Document the change in the version history
- **Renaming**: Don't rename instructions directly. Instead:
1. Add the new instruction
2. Mark the old one as deprecated
3. Eventually remove it in a major version bump
## The Stable Name System
### Purpose
When Slang serializes IR modules, it needs to handle the case where the compiler version that reads a module is different from the one that wrote it. Instructions might have been added, removed, or reordered in the `IROp` enum.
The stable name system solves this by assigning permanent integer IDs to each instruction. These IDs never change once assigned.
### How It Works
1. **Assignment**: When a new instruction is added, the `check-ir-stable-names.lua` script assigns it the next available ID.
2. **Serialization**: When writing a module, opcodes are converted to stable IDs:
```cpp
auto stableName = getOpcodeStableName(value);
```
3. **Deserialization**: When reading, stable IDs are converted back:
```cpp
value = getStableNameOpcode(stableName);
```
4. **Validation**: The CI system ensures the stable name table stays synchronized with the instruction definitions.
### Maintenance
The stable name table is validated in CI:
```bash
./extras/check-ir-stable-names-gh-actions.sh
```
This script:
- Verifies all instructions have stable names
- Checks for duplicate IDs
- Ensures the mapping is bijective
- Can automatically fix missing entries
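The bijection that this script enforces can be modeled as a pair of tables kept in sync. The sketch below is an illustrative model, not the actual generated code, and the opcodes and IDs are made up:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical opcodes; the real values live in the generated enum.
enum IROp : uint32_t { kIROp_Nop = 0, kIROp_Void = 1, kIROp_Bool = 2 };

// Stable IDs are assigned once and never reused, so the two tables
// together form a bijection between live opcodes and stable names.
struct StableNameTable
{
    std::unordered_map<uint32_t, uint32_t> opToStable;
    std::unordered_map<uint32_t, uint32_t> stableToOp;

    void add(IROp op, uint32_t stableId)
    {
        // A duplicate in either direction would break the bijection.
        assert(opToStable.find(op) == opToStable.end());
        assert(stableToOp.find(stableId) == stableToOp.end());
        opToStable[op] = stableId;
        stableToOp[stableId] = op;
    }

    uint32_t getOpcodeStableName(IROp op) const { return opToStable.at(op); }

    IROp getStableNameOpcode(uint32_t stableId) const
    {
        return IROp(stableToOp.at(stableId));
    }
};
```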
## Module Versioning
### Version Types
Slang tracks two version numbers:
1. **Module Version** (`IRModule::m_version`): The semantic version of the IR instruction set
- Range: `k_minSupportedModuleVersion` to `k_maxSupportedModuleVersion`
- Stored in each serialized module
2. **Serialization Version** (`IRModuleInfo::serializationVersion`): The format version
- Allows changes to how data is encoded
### When to Update Versions
**Minor Version Bump** (increment `k_maxSupportedModuleVersion` only):
- Adding new instructions
- Adding new instruction flags that don't affect existing code
- Adding new optional operands
**Major Version Bump** (increment both min and max):
- Removing instructions
- Changing instruction semantics
- Modifying minimum operand counts or types
- Any change that breaks compatibility
### Version Checking
During deserialization:
```cpp
if (fossilizedModuleInfo->serializationVersion != IRModuleInfo::kSupportedSerializationVersion)
return SLANG_FAIL;
// Later, after loading instructions:
if (hasUnrecognizedInsts)
return SLANG_FAIL;
```
## Serialization Details
### The Flat Representation
For efficiency, IR modules are serialized as a "flat" representation:
```cpp
struct FlatInstTable
{
List<InstAllocInfo> instAllocInfo; // Op + operand count
List<Int64> childCounts; // Children per instruction
List<SourceLoc> sourceLocs; // Source locations
List<Int64> operandIndices; // Flattened operand references
List<Int64> stringLengths; // For string/blob constants
List<uint8_t> stringChars; // Concatenated string data
List<UInt64> literals; // Integer/float constant values
};
```
This representation:
- Minimizes pointer chasing during deserialization
- Groups similar data together for better cache performance
- Enables efficient bulk operations
### Traversal Order
Instructions are serialized in a specific order for performance:
```cpp
traverseInstsInSerializationOrder(moduleInst, [&](IRInst* inst) {
// Process instruction
});
```
The traversal:
1. Visits instructions in preorder (parent before children)
2. Optionally reorders module-level instructions to group constants together
3. Maintains deterministic ordering for reproducible builds
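The preorder part of the traversal can be modeled with a small sketch (a toy stand-in, not the actual `IRInst` API; the module-level reordering step is omitted):

```cpp
#include <functional>
#include <vector>

// A toy stand-in for IRInst: just an opcode and a list of children.
struct ToyInst
{
    int op;
    std::vector<ToyInst*> children;
};

// Preorder walk: visit the parent, then each child subtree in order.
static void traversePreorder(ToyInst* inst, const std::function<void(ToyInst*)>& f)
{
    f(inst);
    for (auto* child : inst->children)
        traversePreorder(child, f);
}
```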
## Debugging and Validation
### Available Tools
1. **Module Info Inspection**:
```bash
slangc -get-module-info module.slang-module
```
Shows module name, version, and compiler version.
2. **Version Query**:
```bash
slangc -get-supported-module-versions
```
Reports the supported version range.
3. **IR Dumping**:
```bash
slangc -dump-ir module.slang
```
Shows the IR in human-readable form.
### Common Issues
**"Unrecognized instruction" errors**: The module contains instructions unknown to this compiler version. Update Slang or recompile the module.
**Stable name validation failures**: Run the update script and commit the changes:
**Note**: Skip the make command if Lua is already built.
```bash
make -C external/lua MYCFLAGS="-DLUA_USE_POSIX" MYLIBS=""
./external/lua/lua extras/check-ir-stable-names.lua update
```
**Version mismatch**: The module was compiled with an incompatible Slang version. Check the version ranges and recompile if necessary.
## Best Practices
1. **Always update stable names**: After adding instructions, run the validation script before committing.
2. **Document version changes**: When bumping module versions, add a comment explaining what changed.
3. **Prefer addition over modification**: When possible, add new instructions rather than changing existing ones.
4. **Group related changes**: If making multiple breaking changes, do them together in a single version bump.
The Design of Slang's Intermediate Representation (IR)
======================================================
This document details some of the important design choices for Slang's IR.
Goals and Non-Goals
-------------------
The IR needs to balance many goals which can sometimes come into conflict.
We will start by enumerating these goals (and related non-goals) explicitly so that we can better motivate specific design choices.
* Obviously it must be simple to lower any source code in Slang code to the IR. It is however a non-goal for the lowering process to be lossless; we do not need to recover source-level program structure from the IR.
* The IR must be amenable to standard dataflow analyses and optimizations. It should be possible to read a paper on a compiler algorithm or technique and apply it to our IR in a straightforward manner, and with the expected asymptotic efficiency.
* As a particular case of analysis and optimization, it should be possible to validate flow-dependent properties of an input function/program (e.g., whether an `[unroll]` loop is actually unrollable) using the IR, and emit meaningful error messages that reference the AST-level names/locations of constructs involved in an error.
* It should be possible to compile modules to the IR separately and then "link" them in a way that depends only on IR-level (not AST-level) constructs. We want to allow changing implementation details of a module without forcing a re-compile of IR code using that module (what counts as "implementation details" is negotiable).
* There should be a way to serialize IR modules in a round-trip fashion, preserving all of their structure. As a long-term goal, the serialized format should provide stability across compiler versions (working more as an IL than an IR).
* The IR must be able to encode "generic" (type-parameterized) constructs explicitly, and to express transformations from generic to specialized (or dynamic-dispatch) code in the IR. In particular, it must be possible for a module to make use of a generic defined in another (separately-compiled) module, with validation performed before linking, and specialization performed after.
* The IR must be able to express code that is close to the level of abstraction of shader intermediate languages (ILs) like SPIR-V and DXIL, so that we can minimize the amount of work required (and the number of issues that can arise) when translating the IR to these targets. This can involve lowering and legalization passes to match the constraints of those ILs, but it should not require too much work to be done outside of the IR.
* It should be possible to translate code in the IR back into high-level-language code, including things like structured control-flow constructs.
* Whenever possible, invariants required by the IR should be built into its structure so that they are easier to maintain.
* We should strive to make the IR encoding, both in memory and when serialized, as compact as is practically possible.
Inspirations
------------
The IR design we currently use takes inspiration from three main sources:
* The LLVM project provides the basic inspiration for the approach to SSA, such as using a typed IR, the decision to use the same object to represent an instruction and the SSA value it produces, and the push to have an extremely simple `replaceAllUsesWith` primitive. It is easy to forget that it is possible to design a compiler with different design decisions; the LLVM ones just happen to both be well-motivated and well-known.
* The Swift IL (SIL) provides the inspiration for our approach for encoding SSA "phi nodes" (blocks with arguments), and also informs some of how we have approached encoding generics and related features like existential types.
* The SPIR-V IL provides the inspiration for the choice to uniformly represent types as instructions, for how to encode "join points" for structured control flow, and for the concept of "decorations" for encoding additional metadata on instructions.
Key Design Decisions
--------------------
### Everything is an Instruction
The Slang IR strives for an extremely high degree of uniformity, so almost every concept in the IR is ultimately just an instruction:
* Ordinary add/sub/mul/etc. operations are instructions, as are function calls, branches, function parameters, etc.
* Basic blocks in functions, as well as functions themselves are "parent instructions" that can have other instructions as children
* Constant values (e.g., even `true` and `false`) are instructions
* Types are instructions too, and can have operands (e.g., a vector type is the `VectorType` instruction applied to operands for the element type and count)
* Generics are encoded entirely using ordinary instructions: a generic is encoded like a function that just happens to do computation at the type level
* It isn't true right now, but eventually decorations will also be instructions, so that they can have operands like any other instruction
* An overall IR module is itself an instruction so that there is a single tree that owns everything
This uniformity greatly simplifies the task of supporting generics, and also means that operations that need to work over all instructions, such as cloning and serialization, can work with a single uniform representation and avoid special-casing particular opcodes.
The decision to use an extremely uniform design, even going as far to treat types as "ordinary" instructions, is similar to SPIR-V, although we do not enforce many of the constraints SPIR-V does on how type and value instructions can be mixed.
### Instructions Have a Uniform Structure
Every instruction has:
* An opcode
* A type (the top-level module is the only place where this can be null)
* Zero or more operands
* Zero or more decorations
* Zero or more children
Instructions are not allowed to have any semantically-relevant information that is not in the above list.
The only exception to this rule is instructions that represent literal constants, which store additional data to represent their value.
The in-memory encoding places a few more restrictions on top of this so that, e.g., currently an instruction can have either operands or children, but not both.
Because everything that could be used as an operand is also an instruction, the operands of an instruction are stored in a highly uniform way as a contiguous array of `IRUse` values (even the type is contiguous with this array, so that it can be treated as an additional operand when required).
The `IRUse` type maintains explicit links for use-def information, currently in a slightly bloated fashion (there are well-known techniques for reducing the size of this information).
### A Class Hierarchy Mirrored in Opcodes
There is a logical "class hierarchy" for instructions, and we support (but do not mandate) declaring a C++ `struct` type to expose an instruction or group of instructions.
These `struct` types can be helpful to encode the fact that the program knows an instruction must/should have a particular type (e.g., having a function parameter of type `IRFunction*` prevents users from accidentally passing in an arbitrary `IRInst*` without checking that it is a function first), and can also provide convenience accessors for operands/children.
To make "dynamic cast" operations on this class hierarchy efficient, we arrange for the instruction opcodes for the in-memory IR to guarantee that all the descendants of a particular "base class" occupy a contiguous range of opcodes. Checking that an instruction is in that range is then a constant-time operation that only looks at its opcode field.
There are some subtleties to how the opcodes are ordered to deal with the fact that some opcodes have a kind of "multiple inheritance" thing going on, but that is a design wart that we should probably remove over time, rather than something we are proud of.
### A Simpler Encoding of SSA
The traditional encoding of SSA form involves placing "phi" instructions at the start of blocks that represent control-flow join points where a variable will take on different values depending on the incoming edge that is taken.
There are of course benefits to sticking with tradition, but phi instructions also have a few downsides:
- The operands to phi instructions are the one case where the "def dominates use" constraint of SSA appears to be violated. I say "appears" because officially the action of a phi occurs on the incoming edge (not in the target block) and that edge will of course be dominated by the predecessor block. It still creates a special case that programmers need to be careful about. This also complicates serialization in that there is no order in which the blocks/instructions of a function can be emitted that guarantees that every instruction always precedes all of its uses in the stream.
- All of the phi instructions at the start of the block must effectively operate in parallel, so that they all "read" from the correct operand before "writing" to the target variable. Like the above special case, this is only a problem for a phi related to a loop back-edge. It is of course possible to always remember the special interpretation of phi instructions (that they don't actually execute sequentially like every other instruction in a block), but it's another special case.
- The order of operands to a phi instruction needs to be related back to the predecessor blocks, so that one can determine which value is to be used for which incoming edge. Any transformation that modifies the CFG of a function needs to be careful to rewrite phi instructions to match the order in which predecessors are listed, or else the compiler must maintain a side data structure that remembers the mapping (and update it instead).
- Directly interpreting/executing code in an SSA IR with phi instructions is made more difficult because when branching to a block we need to immediately execute any phi instructions based on the block from which we just came. The above issues around phis needing to be executed in parallel, and needing to track how phi operands relate to predecessor blocks also add complexity to an interpreter.
Slang ditches traditional phi functions in favor of an alternative that matches the Swift IL (SIL).
The idea doesn't really start in Swift, but rather in the existing observation that SSA form IR and a continuation-passing style (CPS) IR are semantically equivalent; one can encode SSA blocks as continuation functions, where the arguments of the continuation stand in for the phi instructions, and a branch to the block becomes just a call.
Like Swift, we do not use an explicit CPS representation, but instead find a middle ground of a traditional SSA IR where instead of phi instructions basic blocks have parameters.
The first N instructions in a Slang basic block are its parameters, each of which is an `IRParam` instruction.
A block that would have had N phi instructions now has N parameters, but the parameters do not have operands.
Instead, a branch instruction that targets that block will have N *arguments* to match the parameters, representing the values to be assigned to the parameters when this control-flow edge is taken.
This encoding is equivalent in what it represents to traditional phi instructions, but nicely solves the problems outlined above:
- The phi operands in the successor block are now arguments in the *predecessor* block, so that the "def dominates use" property can be enforced without any special cases.
- The "assignment" of the argument values to parameters is now encoded with a single instruction, so that the simultaneity of all the assignments is more clear. We still need to be careful when leaving SSA form to obey those semantics, but there are no tricky issues when looking at the IR itself.
- There is no special work required to track which phi operands come from which predecessor block, since the operands are attached to the terminator instruction of the predecessor block itself. There is no need to update phi instructions after a CFG change that might affect the predecessor list of a block. The trade-off is that any change in the *number* of parameters of a block now requires changes to the terminator of each predecessor, but that is a less common change (isolated to passes that can introduce or eliminate block parameters/phis).
- It is much clearer how to give an operational semantics to a "branch with arguments" than to phi instructions: compute the target block, copy the arguments to temporary storage (because of the simultaneity requirement), and then copy the temporaries over the parameters of the target block.
The main caveat of this representation is that it requires branch instructions to have room for arguments to the target block. For an ordinary unconditional branch this is pretty easy: we just put a variable number of arguments after the operand for the target block. For branch instructions like a two-way conditional, we might need to encode two argument lists - one for each target block - and an N-way `switch` branch only gets more complicated.
The Slang IR avoids the problem of needing to store arguments on every branch instruction by banning *critical edges* in IR functions that are using SSA phis/parameters. A critical edge is any edge from a block with multiple successors (meaning it ends in a conditional branch) to one with multiple predecessors (meaning it is a "join point" in the CFG).
Phi instructions/parameters are only ever needed at join points, and so block arguments are only needed on branches to a join point.
By ruling out conditional branches that target join points, we avoid the need to encode arguments on conditional branch instructions.
This constraint could be lifted at some point, but it is important to note that every program can be represented by a CFG without critical edges.
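For instance, in illustrative IR-style pseudocode (the block names and opcodes here are made up), a simple counting loop carries its induction variable as a block parameter, and each branch to the loop header supplies the matching argument; note that the conditional branch needs no arguments because neither of its targets is a join point:

```
block %entry:
    branch %loop(0)             // the argument initializes parameter %i
block %loop(%i):                // %i is a block parameter, not a phi
    %c = cmpLT %i, %n
    condBranch %c, %body, %exit // no arguments: neither target is a join point
block %body:
    %i2 = add %i, 1
    branch %loop(%i2)           // the back-edge passes the updated value
block %exit:
    return
```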
### A Simple Encoding of the CFG
A traditional SSA IR represents a function as a bunch of basic blocks of instructions, where each block ends in a *terminator* instruction.
Terminators are instructions that can branch to another block, and are only allowed at the end of a block.
The potential targets of a terminator determine the *successors* of the block where it appears, and contribute to the *predecessors* of any target block.
The successor-to-predecessor edges form a graph over the basic blocks called the control-flow graph (CFG).
A simple representation of a function would store the CFG explicitly as a graph data structure, but in that case the data structure would need to be updated whenever a change is made to the terminator instruction of a block in a way that might change the successor/predecessor relationship.
The Slang IR avoids this maintenance problem by noting an important property.
If block `P`, with terminator `t`, is a predecessor of `S`, then `t` must have an operand that references `S`.
In turn, that means that the list of uses of `S` must include `t`.
We can thus scan through the list of predecessors or successors of a block with a reasonably simple algorithm:
* To find the successors of `P`, find its terminator `t`, identify the operands of `t` that represent successor blocks, and iterate over them. This is O(N) in the number of outgoing CFG edges.
* To find the predecessors of `S`, scan through its uses and identify users that are terminator instructions. For each such user if this use is at an operand position that represents a successor, then include the block containing the terminator in the output. This is O(N) in the number of *uses* of a block, but we expect that to be on the same order as the number of predecessors in practice.
Each of these actually iterates over the outgoing/incoming CFG *edges* of a block (which might contain duplicates if one block jumps to another in, e.g., multiple cases of a `switch`).
Sometimes the edges are exactly what is wanted, or repeats don't matter, but when duplicates must be avoided the user needs to build a set to deduplicate the lists.
The clear benefit of this approach is that the predecessor/successor lists arise naturally from the existing encoding of control-flow instructions. It creates a bit of subtle logic when walking the predecessor/successor lists, but that code only needs to be revisited if we make changes to the terminator instructions that have successors.
### Explicit Encoding of Control-Flow Join Points
In order to allow reconstruction of high-level-language source code from a lower-level CFG, we need to encode something about the expected "join point" for a structured branch.
This is the logical place where control flow is said to "reconverge" after a branch, e.g.:
```hlsl
if(someCondition) // join point is "D"
{
A;
}
else
{
B;
if(C) return;
}
D;
```
Note that (unlike what some programming models would say) a join point is *not* necessarily a postdominator of the conditional branch. In the example above the block with `D` does not postdominate the block with `someCondition` nor the one with `B`. It is even possible to construct cases where the high-level join point of a control-flow construct is unreachable (e.g., the block after an infinite loop).
The Slang IR encodes structured control flow by making the join point be an explicit operand of a structured conditional branch operation. Note that a join-point operand is *not* used when computing the successor list of a block, since it does not represent a control-flow edge.
This is slightly different from SPIR-V where join points ("merge points" in SPIR-V) are encoded using a metadata instruction that precedes a branch. Keeping the information on the instruction itself avoids cases where we move one but not the other of the instructions, or where we might accidentally insert code between the metadata instruction and the terminator it modifies.
In the future we might consider using a decoration to represent join points.
When using a loop instruction, the join point is also the `break` label. The SPIR-V `OpLoopMerge` includes not only the join point (`break` target) but also a `continue` target. We do not currently represent structured information for `continue` blocks.
The reason for this is that while we could keep structured information about `continue` blocks, we might not be able to leverage it when generating high-level code, because the syntactic form of a `for` loop (the only construct in C-like languages where `continue` can go somewhere other than the top of the loop body) only allows an *expression* for the continue clause and not a general *statement*, but we cannot guarantee that after optimization the code in an IR-level "continue clause" would constitute a single expression.
The approach we use today means that the code in "continue clause" might end up being emitted more than once in final code; this is deemed acceptable because it is what `fxc` already does.
When it comes time to re-form higher-level structured control flow from Slang IR, we use the structuring information in the IR to form single-entry "regions" of code that map to existing high-level control-flow constructs (things like `if` statements, loops, `break` or `continue` statements, etc.).
The current approach we use requires the structuring information to be maintained by all IR transformations, and also currently relies on some invariants about what optimizations are allowed to do (e.g., we had better not introduce multi-level `break`s into the IR).
In the future, it would be good to investigate adapting the "Relooper" algorithm used in Emscripten so that we can recover valid structured control flow from an arbitrary CFG; for now we put off that work.
If we had a more powerful restructuring algorithm at hand, we could start to support things like multi-level `break`, and also ensure that `continue` clauses don't lead to code duplication any more.
## IR Global and Hoistable Value Deduplication
Types, constants, and certain operations on constants are considered "global values" in the Slang IR. Some other insts, like `Specialize()` and `Ptr(x)`, are considered "hoistable" insts, in that they will be defined at the outermost scope where their operands are available. For example, `Ptr(int)` will always be defined at global scope (as a direct child of `IRModuleInst`) because its only operand, `int`, is defined at global scope. However, if we have `Ptr(T)` where `T` is a generic parameter, then this `Ptr(T)` inst will always be defined in the block of the generic. Global and hoistable values are always deduplicated, and we can always assume two hoistable values with different pointer addresses are distinct values.
The `IRBuilder` class is responsible for ensuring the uniqueness of global/hoistable values. If you call any `IRBuilder` method that creates a new hoistable instruction, e.g. `IRBuilder::createIntrinsicInst`, `IRBuilder::emitXXX` or `IRBuilder::getType`, `IRBuilder` will check if an equivalent value already exists, and if so it returns the existing inst instead of creating a new one.
The trickier part is maintaining this uniqueness when we modify the IR. When we update an operand of an inst from a non-hoistable value to a hoistable value, we may need to hoist the inst itself as a result. For example, consider the following code:
```
%1 = IntType
%p = Ptr(%1)
%2 = func {
%x = ...;
%3 = Ptr(%x);
%4 = ArrayType(%3);
%5 = Var (type: %4);
...
}
```
Now consider the scenario where we need to replace the operand in `Ptr(x)` to `int` (where `x` is some non-constant value), we will get a `Ptr(int)` which is now a global value and should be deduplicated:
```
%1 = IntType
%p = Ptr(%1)
%2 = func {
%x = ...;
//%3 now becomes %p.
%4 = ArrayType(%p);
%5 = Var (type: %4);
...
}
```
Note this code now breaks the invariant that hoistable insts are always defined at the outermost possible scope: `%4` no longer depends on any local insts in the function, and should be hoisted to the global scope after replacing `%3` with `%p`. This means that we need to continue to perform hoisting on `%4`, resulting in this final code:
```
%1 = IntType
%p = Ptr(%1)
%4 = ArrayType(%p); // hoisted to global scope
%2 = func {
%x = ...;
%5 = Var (type: %4);
...
}
```
As illustrated above, because we need to maintain the invariants of global/hoistable values, replacing an operand of an inst can have a widespread effect on the IR.
To help ensure these invariants, we introduce the `IRBuilder::replaceOperand(inst, operandIndex, newOperand)` method to perform all the cascading modifications after replacing an operand. However, `IRInst::setOperand(idx, newOperand)` will not perform the cascading modifications, and using `setOperand` to modify an operand of a hoistable inst will trigger a runtime assertion error.
Similarly, `inst->replaceUsesWith` will also perform any cascading modifications to ensure the uniqueness of hoistable values. Because of this, we need to be particularly careful when using a loop to iterate the IR linked list or def-use linked list and call `replaceUsesWith` or `replaceOperand` inside the loop.
Consider the following code:
```
IRInst* nextInst = nullptr;
for (auto inst = func->getFirstChild(); inst; inst = nextInst)
{
nextInst = inst->getNextInst(); // save a copy of the next inst
// ...
inst->replaceUsesWith(someNewInst); // Warning: this may be unsafe, because nextInst could have been moved to a different parent!
}
```
Now imagine this code running on the `func` defined above, and suppose we are at `inst == %3` and want to replace `inst` with `Ptr(int)`. Before calling `replaceUsesWith`, we stored `inst->nextInst` into `nextInst`, so `nextInst` is now `%4` (the array type). After we call `replaceUsesWith`, `%4` is hoisted to the global scope, so on the next iteration we will start to process `%4`, follow its `next` pointer to `%2`, and end up processing `func` instead of continuing to walk the child list!
Because of this, we should never call `replaceOperand` or `replaceUsesWith` while we are walking the IR linked list. If we need to do so, we must create a temporary work list and add all the insts to it before making any modifications. The `IRInst::getModifiableChildren` utility function returns a temporary work list for safe iteration over the children. The same applies to the def-use linked list: the `traverseUses` and `traverseUsers` utility functions defined in `slang-ir.h` help with walking the def-use list safely.
Another detail to keep in mind is that any local references to an inst may become out-of-date after a call to `replaceOperand` or `replaceUsesWith`. Consider the following code:
```
IRBuilder builder;
auto x = builder.emitXXX(); // x is some non-hoistable value.
auto ptr = builder.getPtrType(x); // create ptr(x).
x->replaceUsesWith(intType); // this renders `ptr` obsolete!!
auto var = builder.emitVar(ptr); // use the obsolete inst to create another inst.
```
In this example, calling `replaceUsesWith` will cause `ptr` to represent `Ptr(int)`, which may already exist in the global scope. After this call, all uses of `ptr` should be replaced with the global `Ptr(int)` inst instead. `IRBuilder` provides a mechanism to track all the insts that are removed due to deduplication, and to map those removed-but-not-yet-deleted insts to the existing insts. When using `ptr` to create a new inst, `IRBuilder` will first check if `ptr` should map to some existing hoistable inst in the global deduplication map, and replace it if possible. This means that after the call to `builder.emitVar`, `var->type` is not equal to `ptr`.
### Best Practices
In summary, the best practices when modifying the IR are:
- Never call `replaceUsesWith` or `replaceOperand` when walking raw linked lists in the IR. Always create a work list and iterate on the work list instead. Use `IRInst::getModifiableChildren` and `traverseUses` when you need to modify the IR while iterating.
- Never assume any local reference to an inst is up-to-date after a call to `replaceUsesWith` or `replaceOperand`. It is OK to continue using them as operands/types to create a new inst, but do not assume the created inst will reference the same inst passed in as an argument.
An overview of the Slang Compiler
=================================
This document will attempt to walk through the overall flow of the Slang compiler, as an aid to developers who are trying to get familiar with the codebase and its design.
More emphasis will be given to places where the compiler design is nontraditional, or might surprise newcomers; things that are straightforward won't get much detail.
High-Level Concepts
-------------------
Compilation is always performed in the context of a *compile request*, which bundles together the options, input files, and request for code generation.
Inside the code, there is a type `CompileRequest` to represent this.
The user specifies some number of *translation units* (represented in the code as a `TranslationUnitRequest`) which comprise some number of *sources* (files or strings).
HLSL follows the traditional C model where a "translation unit" is more or less synonymous with a source file, so when compiling HLSL code the command-line `slangc` will treat each source file as its own translation unit.
For Slang code, the command-line tool will by default put all source files into a single translation unit (so that they represent a shared namespace).
The user can also specify some number of *entry points* in each translation unit (`EntryPointRequest`), which combines the name of a function to compile with the pipeline stage to compile for.
In a single compile request, we can generate code for zero or more *targets* (represented with `TargetRequest`); a target defines both the format for output code (e.g., DXIL or SPIR-V) and a *profile* that specifies the capability level to assume (e.g., "Shader Model 5.1").
It might not be immediately clear why we have such fine-grained concepts as this, but it ends up being quite important to decide which pieces of the compiler are allowed to depend on which pieces of information (e.g., whether or not a phase of compilation gets to depend on the chosen target).
The "Front End"
---------------
The job of the Slang front-end is to turn textual source code into a combination of code in our custom intermediate representation (IR) plus layout and binding information for shader parameters.
### Lexing
The first step in the compiler (after a source file has been loaded into memory) is to *lex* it.
The `Lexer` type is implemented in `lexer.{h,cpp}` and produces `Token`s that represent the contents of the file on-demand, as requested by the next phase of compilation.
Each token stores a `TokenCode` that indicates the kind of token, the raw text of the token, and the location in the source code where it is located.
Source locations use a somewhat clever encoding to avoid being bloated (they are a single integer rather than separate file, line, and column fields).
We don't make any attempt in the lexer to extract the actual value of integer and floating-point literals; we just store the raw text.
We also don't try to distinguish keywords from identifiers; keywords show up as ordinary identifier tokens.
Much of the complexity (and inefficiency) in the current lexer is derived from the need to support C-isms like backslash line continuation, and special-case rules like allowing `<>` to delimit a file name string after a `#include`.
### Preprocessing
The preprocessor (`Preprocessor`) in `preprocessor.{h,cpp}` deals with `#include` constructs, macro expansions, etc.
It pulls tokens from the lexer as needed (making sure to set flags to control the lexer behavior when required) and uses a limited lookahead to decide what to do with each token.
The preprocessor maintains a stack of input streams, with the original source file at the bottom, and pushes entries for `#include`d files, macros to expand etc.
Macro definitions store a sequence of already-lexed tokens, and expansion simply "replays" these tokens.
Expansion keeps a notion of an "environment" for looking up identifiers and mapping them to macro definitions.
Calling through to a function-style macro creates a fresh environment that maps the macro parameter names to pseudo-macros for the arguments.
We still tokenize code in inactive preprocessor conditionals, but don't evaluate preprocessor directives inside inactive blocks (except those that may change the active/inactive state).
Preprocessor directives are each handled as a callback on the preprocessor state and are looked up by name; adding a new directive (if we ever had a reason to) is a fairly simple task.
One important detail of the preprocessor is that it runs over a full source file at once and produces a flat array of `Token`s, so that there is no direct interaction between the parser and preprocessor.
### Parsing
The parser (`Parser` in `parser.{h,cpp}`) is mostly a straightforward recursive-descent parser.
Because the input is already tokenized before we start, we can use arbitrary lookahead, although we seldom look ahead more than one token.
Traditionally, parsing of C-like languages requires context-sensitive parsing techniques to distinguish types from values, and deal with stuff like the C++ "most vexing parse."
Slang instead uses heuristic approaches: for example, when we encounter an `<` after an identifier, we first try parsing a generic argument list with a closing `>` and then look at the next token to determine if this looks like a generic application (in which case we continue from there) or not (in which case we backtrack).
There are still some cases where we use lookup in the current environment to see if something is a type or a value, but officially we strive to support out-of-order declarations like most modern languages.
In order to achieve that goal we will eventually move to a model where we parse the bodies of declarations and functions in a later pass, after we have resolved names in the global scope.
One important choice in the parser is that we strive to avoid hard-coding keywords as much as possible.
We already track an environment for C-like parsing, and we simply extend that so that we also look up declaration and statement keywords in the environment.
This means that most of the language "keywords" in Slang aren't keywords at all, and instead are just identifiers that happen to be bound to syntax in the default environment.
Syntax declarations are associated with a callback that is invoked to parse the construct they name.
The design of treating syntax as ordinary declarations has a long-term motivation (we'd like to support a flexible macro system) but it also has short-term practical benefits.
It is easy for us to add new modifier keywords to the language without touching the lexer or parser (just adding them to the core module), and we also don't have to worry about any of Slang's extended constructs (e.g., `import`) breaking existing HLSL code that just happens to use one of those new keywords as a local variable name.
What the parser produces is an abstract syntax tree (AST).
The AST currently uses a strongly-typed C++ class hierarchy with a "visitor" API generated via some ugly macro magic.
Dynamic casting using C++ RTTI is used in many places to check the class of an AST node; we aren't happy with this but also haven't had time to implement a better/faster solution.
In the parsed AST, both types and expressions use the same representation (because in an expression like `A(B)` it is possible that `A` will resolve to a type, or to a function, and we don't know which yet).
One slightly odd design choice in the parser is that it attaches lexical scoping information to the syntax nodes for identifiers, and to any other AST node that needs access to the scope/environment where it was defined. This is a choice we will probably change at some point, but it is deeply ingrained right now.
### Semantic Checking
The semantic checking step (`check.{h,cpp}`) is, not surprisingly, the most complicated and messiest bit of the compiler today.
The basic premise is simple: recursively walk the entire AST and apply semantic checking to each construct.
Semantic checking applies to one translation unit at a time.
It has access to the list of entry points for the translation unit (so it can validate them), but it is *not* allowed to depend on the compilation target(s) the user might have selected.
Semantic checking of an expression or type term can yield the same AST node with type information added, or it can return newly constructed AST nodes (e.g., when an implicit cast needs to be inserted).
Unchecked identifiers or member references are always resolved to have a pointer to the exact declaration node they are referencing.
Types are represented with a distinct class hierarchy from AST nodes, which is also used for a general notion of compile-time values which can be used to instantiate generic types/functions/etc.
An expression that ends up referring to a type will have a `TypeType` as its type, which will hold the actual type that the expression represents.
The most complicated thing about semantic checking is that we strive to support out-of-order declarations, which means we may need to check a function declaration later in the file before checking a function body early in the file.
In turn, that function declaration might depend on a reference to a nested type declared somewhere else, etc.
We currently solve this issue by doing some amount of on-demand checking; when we have a reference to a function declaration and we need to know its type, we will first check if the function has been through semantic checking yet, and if not we will go ahead and recursively type check that function before we proceed.
This kind of unbounded recursion can lead to real problems (especially when the user might write code with circular dependencies), so we have made some attempts to more strictly "phase" the semantic checking, but those efforts have not yet been applied systematically.
When code involves generics and/or interfaces, the semantic checking phase is responsible for ensuring that when a type claims to implement an interface it provides all of the requirements of that interface, and it records the mapping from requirements to their implementations for later use. Similarly, the body of a generic is checked to make sure it uses type parameters in ways that are consistent with their constraints, and the AST is amended to make it explicit when an interface requirement is being employed.
### Lowering and Mandatory Optimizations
The lowering step (`lower-to-ir.{h,cpp}`) is responsible for converting semantically valid ASTs into an intermediate representation that is more suitable for specialization, optimization, and code generation.
The main thing that happens at this step is that a lot of the "sugar" in a high-level language gets baked out. For example:
- A "member function" in a type will turn into an ordinary function that takes an initial `this` parameter
- A `struct` type nested in another `struct` will turn into an ordinary top-level `struct`
- Compound expressions will turn into sequences of instructions that bake the order of evaluation
- High-level control-flow statements will get resolved to a control-flow graph (CFG) of basic blocks
The lowering step is done once for each translation unit, and like semantic checking it does *not* depend on any particular compilation target.
During this step we attach "mangled" names to any imported or exported symbols, so that each function overload, etc. has a unique name.
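A minimal sketch of overload-disambiguating mangling (a hypothetical scheme for illustration; Slang's actual mangling differs in its details): by folding the parameter types into the symbol name, each overload of a function gets a distinct, stable name.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical mangling: combine module name, declaration name, and
// parameter types so that distinct overloads get distinct symbols.
std::string mangle(const std::string& module,
                   const std::string& name,
                   const std::vector<std::string>& paramTypes) {
    std::string result = "_S" + module + "_" + name;
    for (const auto& t : paramTypes)
        result += "_p" + t;  // each parameter type contributes to the name
    return result;
}
```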
After IR code has been generated for a translation unit (now called a "module"), we next perform a set of "mandatory" optimizations, including SSA promotion, simple copy propagation, and elimination of dead control-flow paths.
These optimizations are not primarily motivated by a desire to speed up code, but rather to ensure that certain "obvious" simplifications have been performed before the next step of validation.
After the IR has been "optimized" we perform certain validation/checking tasks that would have been difficult or impossible to perform on the AST.
For example, we can validate that control flow never reaches the end of a non-`void` function, and issue an error otherwise.
There are other validation tasks that can/should be performed at this step, although not all of them are currently implemented:
- We should check that any `[unroll]` loops can actually be unrolled, by ensuring that their termination conditions can be resolved to a compile-time constant (even if we don't know the constant yet)
- We should check that any resource types are being used in ways that can be statically resolved (e.g., that the code never conditionally computes a resource to reference), since this is a requirement for all our current targets
- We should check that the operands to any operation that requires a compile-time constant (e.g., the texel offset argument to certain `Sample()` calls) are passed values that end up being compile-time constants
The goal is to eliminate any possible sources of failure in low-level code generation, without needing to have a global view of all the code in a program.
Any error conditions we have to push off until later start to limit the value of our separate compilation support.
### Parameter Binding and Type Layout
The next phase of parameter binding (`parameter-binding.{h,cpp}`) is independent of IR generation, and proceeds based on the AST that came out of semantic checking.
Parameter binding is the task of figuring out what locations/bindings/offsets should be given to all shader parameters referenced by the user's code.
Parameter binding is done once for each target (because, e.g., Vulkan may bind parameters differently than D3D12), and it is done for the whole compile request (all translation units) rather than one at a time.
This is because when users compile something like HLSL vertex and fragment shaders in distinct translation units, they will often share the "same" parameter via a header, and we need to ensure that it gets just one location.
At a high level, parameter binding starts by computing the *type layout* of each shader parameter.
A type layout describes the amount of registers/bindings/bytes/etc. that a type consumes, and also encodes the information needed to compute offsets/registers for individual `struct` fields or array elements.
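For instance, some targets impose rules like "a field may not straddle a 16-byte register boundary" (similar in spirit to D3D constant-buffer packing). A simplified sketch of offset computation under that assumed rule (not Slang's actual implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct FieldLayout { size_t offset; size_t size; };

// Assign each field an offset, bumping a field to the start of the next
// 16-byte register if it would otherwise straddle a register boundary.
std::vector<FieldLayout> layoutFields(const std::vector<size_t>& fieldSizes) {
    std::vector<FieldLayout> result;
    size_t cursor = 0;
    for (size_t size : fieldSizes) {
        size_t remainingInRegister = 16 - (cursor % 16);
        if (size > remainingInRegister && size <= 16)
            cursor += remainingInRegister;  // skip to next register
        result.push_back({cursor, size});
        cursor += size;
    }
    return result;
}
```

Under this rule a `float` followed by a `float3` packs into one register, while two `float3` fields land in separate registers.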
Once we know how much space each parameter consumes, we then inspect any explicit binding information (e.g., `register` modifiers) that is relevant for the target, and build a data structure recording which binding ranges are already consumed.
Finally, we go through any parameters without explicit binding information and assign them the next available range of the appropriate size (in a first-fit fashion).
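The first-fit assignment can be sketched as follows (hypothetical code, simplified to a single register space):

```cpp
#include <cassert>
#include <vector>

struct Range { int first; int count; };

bool overlaps(const Range& a, const Range& b) {
    return a.first < b.first + b.count && b.first < a.first + a.count;
}

// Find the lowest start index at which `count` consecutive registers are
// free, given the ranges already claimed by explicit bindings.
int allocateFirstFit(const std::vector<Range>& used, int count) {
    int start = 0;
    for (bool moved = true; moved;) {
        moved = false;
        for (const Range& r : used) {
            if (overlaps({start, count}, r)) {
                start = r.first + r.count;  // skip past the conflict
                moved = true;
            }
        }
    }
    return start;
}
```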
The parameter binding/layout information is what the Slang reflection API exposes. It is layered directly over the Slang AST so that it accurately reflects the program as the user wrote it, and not the result of lowering that program to our IR.
This document describes parameter binding as a "front end" activity, but in practice it is something that could be done in the front-end, the back-end or both.
When shader code involves generic type parameters, complete layout information cannot be generated until the values of these parameters are fully known, and in practice that might not happen until the back end.
### Serialization
It is not yet fully implemented, but our intention is that the last thing the front-end does is to serialize the following information:
- A stripped-down version of the checked AST for each translation unit including declarations/types, but not function bodies
- The IR code for each translation unit
- The binding/layout information for each target
The above information is enough to type-check a subsequent module that `import`s code compiled by the front-end, to link against its IR code, or to load and reflect type and binding information.
The "Back End"
--------------
The Slang back end logically starts with the user specifying:
- An IR module, plus any necessary modules to link in and provide its dependencies
- An entry point in that module, plus arguments for any generic parameters that entry point needs
- A compilation target (e.g., SPIR-V for Vulkan)
- Parameter binding/layout information for that module and entry point, computed for the chosen target
We eventually want to support compiling multiple entry points in one pass of the back end, but for now it assumes a single entry point at a time.
### Linking and Target Specialization
The first step we perform is to copy the chosen entry point and anything it depends on, recursively into a "fresh" IR module.
We make a copy of things so that any optimization/transformation passes we do for one target don't alter the code the front-end produced in ways that affect other targets.
While copying IR code into the fresh module, we have cases where there might be multiple definitions of the same function or other entity.
In those cases, we apply "target specialization" to pick the definition that is the best for the chosen target.
This step is where we can select between, say, a built-in definition of the `saturate` function for D3D targets, vs. a hand-written one in a Slang standard module to use for GLSL-based targets.
### API Legalization
If we are targeting a GLSL-based platform, we need to translate "varying" shader entry point parameters into global variables used for cross-stage data passing.
We also need to translate any "system value" semantics into uses of the special built-in `gl_*` variables.
We currently handle this kind of API-specific legalization quite early in the process, performing it right after linking.
### Generic Specialization
Once the concrete values for generic parameters are known, we can set about specializing code to those known types.
We do this by cloning a function/type/whatever and substituting in the concrete arguments for the parameters.
This process can be continued as specializing one function may reveal opportunities to specialize others.
During this step we also specialize away lookup of interface requirements through their witness tables, once generic witness-table parameters have been replaced with concrete witness tables.
At the end of specialization, we should have code that makes no use of user-defined generics or interfaces.
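Conceptually, specialization replaces dispatch through a witness-table parameter with a direct reference once the table is a known constant. A hypothetical C++ analogy using function pointers:

```cpp
#include <cassert>
#include <map>
#include <string>

using Requirement = int (*)(int);

// A witness table maps interface requirement names to implementations.
struct WitnessTable {
    std::map<std::string, Requirement> entries;
};

int square(int x) { return x * x; }

// Generic form: look the requirement up through the table parameter.
int callThroughTable(const WitnessTable& table, int arg) {
    return table.entries.at("transform")(arg);
}

// Specialized form, after substituting the known table: the lookup has
// been folded away into a direct call.
int callSpecialized(int arg) { return square(arg); }
```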
### Type Legalization
While HLSL and Slang allow a single `struct` type to contain both "ordinary" data like a `float3` and "resources" like a `Texture2D`, the rules for GLSL and SPIR-V are more restrictive.
There are some additional wrinkles that arise for such "mixed" types, so we prefer to always "legalize" the types in the user's code by replacing an aggregate type like:
```hlsl
struct Material { float4 baseColor; Texture2D detailMap; };
Material gMaterial;
```
with separate declarations for ordinary and resource fields:
```hlsl
struct Material { float4 baseColor; };
Material gMaterial;
Texture2D gMaterial_detailMap;
```
Changing the "shape" of a type like this (so that a single variable becomes more than one) needs to be done consistently across all declarations/functions in the program (hence why we do it after specialization, so that all concrete types are known).
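The splitting itself can be sketched over a toy data model (hypothetical types, not Slang's IR): partition a struct's fields into an "ordinary data" part and hoisted standalone resource declarations.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Field { std::string name; bool isResource; };

struct LegalizedType {
    std::vector<Field> ordinaryFields;     // stays in the struct
    std::vector<std::string> resourceVars; // hoisted to separate variables
};

// Split the fields of a variable's type: resource fields become separate
// top-level variables named after the original variable and field.
LegalizedType legalize(const std::string& varName,
                       const std::vector<Field>& fields) {
    LegalizedType result;
    for (const Field& f : fields) {
        if (f.isResource)
            result.resourceVars.push_back(varName + "_" + f.name);
        else
            result.ordinaryFields.push_back(f);
    }
    return result;
}
```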
### Other Optimizations
We don't currently apply many other optimizations on the IR code in the back-end, under the assumption that the lower-level compilers below Slang will do some of the "heavy lifting."
That said, there are certain optimizations that Slang must do eventually, for semantic completeness. One of the most important examples of these is implementing the semantics of the `[unroll]` attribute, since we can't always rely on downstream compilers to have a capable unrolling implementation.
We expect that over time it will be valuable for Slang to support a wider array of optimization passes, as long as they are ones that are considered "safe" to do above the driver interface, because they won't interfere with downstream optimization opportunities.
### Emission
Once we have transformed the IR code into something that should be legal for the chosen target, we emit code in the appropriate format for the target. This can be high-level source code (such as HLSL, GLSL, Metal, WGSL, C++, or CUDA) or binary formats (such as SPIR-V, DXIL, PTX, or MetalLib) depending on the compilation target.
The emit logic is mostly just a scan over the IR code to emit a high-level declaration for each item: an IR structure type becomes a `struct` declaration, an IR function becomes a function definition, etc.
In order to make the generated code a bit more readable, the Slang compiler currently does *not* emit declarations using their mangled names and instead tries to emit everything using a name based on how it was originally declared.
To improve the readability of function bodies, the emit logic tries to find consecutive sequences of IR instructions that it can emit as a single high-level language expression. This reduces the number of temporaries in the output code, but we need to be careful about inserting parentheses to respect operator precedence, and also to not accidentally change the order of evaluation of code.
When emitting a function body, we need to get from the low-level control flow graph (CFG) to high-level structured control-flow statements like `if`s and loops. We currently do this on a per-function basis during code emission, using an ad hoc algorithm based on control-flow structured information we stored in the IR.
A future version of the compiler might implement something more complete like the "Relooper" algorithm used by Emscripten.
### Downstream Compiler Execution
For certain targets and compilation paths, we invoke downstream compilers to generate binary code (and optionally to disassemble that code for console output). For example:
- DXIL and DXBC targets use dxc and fxc respectively
- SPIR-V, although generated directly from the Slang IR by default, can instead use glslang if the `-emit-spirv-via-glsl` option is specified for `slangc`. If that option is used, GLSL is emitted from the Slang IR to pass to glslang for SPIR-V generation
- PTX generation uses NVRTC
- MetalLib and MetalLibAssembly targets use the Metal compiler (MetalC)
Targets that have output emitted directly from the Slang IR without the use of downstream compilers include high-level source formats like HLSL, GLSL, Metal, WGSL, C++, and CUDA source, as well as the default SPIR-V binary generation path.
The Slang compiler also supports a "pass through" mode where it skips most of the steps outlined so far and just passes text along to downstream compilers directly. This is primarily intended as a debugging aid for developers working on Slang, since it lets you use the same command-line arguments to invoke both Slang compilation and compilation with these other compilers.
Conclusion
----------
Hopefully this whirlwind introduction to the flow of the Slang compiler gives some idea of how the project fits together, and makes it easier to dive into the code and start being productive.
# Resolving Ambiguity in Slang's Parser
A typical text-book style compiler front-end features explicit stages: tokenization, parsing, and semantic checking. Slang's original design follows this pattern, but the design has a drawback: it cannot effectively disambiguate certain syntax, because no semantic info is available during parsing.
For example, without knowing what `X` is, it is impossible to tell whether `X<a&&b>(5)` means calling a generic function `X` with argument `5`, or computing the logical `AND` between condition `X < a` and `b > 5`.
Slang initially addresses this problem with a heuristic: if the compiler sees `IDENTIFIER` followed by `<`, it will try to parse the expression as a generic specialization first, and if that succeeds, it checks whether the token after the closing `>` is one of the possible "generic specialization followers". In this example, the next token is `(`, which is a "generic specialization follower", so the compiler determines that the expression being parsed is very likely a generic function call, and it will parse the expression as such. For reference, the full set of "generic specialization followers" is: `::`, `.`, `(`, `)`, `[`, `]`, `:`, `,`, `?`, `;`, `==`, `!=`, `>` and `>>`.
This simplistic heuristic originated in the C# compiler, and it works well there because C# doesn't allow generic value arguments, so things like `X<a&&b>...` or `X<a<y>...` can never be valid generic specializations. That isn't the case for Slang, where generic arguments can be int or boolean values, so `a&&b` and `a<y` are valid generic arguments. Although the same heuristic works most of the time, it still causes a lot of confusion for users when it fails.
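The follower check can be sketched directly from the token set listed above (the function name is hypothetical): after a speculative parse of `X<...>` succeeds, the result is accepted as a generic specialization only if the next token is in this set.

```cpp
#include <cassert>
#include <set>
#include <string>

// Token set taken from the description above: tokens that may legally
// follow a generic specialization.
bool isGenericSpecializationFollower(const std::string& token) {
    static const std::set<std::string> followers = {
        "::", ".", "(", ")", "[", "]", ":", ",",
        "?", ";", "==", "!=", ">", ">>"
    };
    return followers.count(token) != 0;
}
```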
The ambiguity problem can be systematically solved if the parser has access to semantic info. If the parser knows that `X` is / isn't a generic, then it can parse the expression accordingly without any guess work. The key challenge is to make such semantic info available while we are still parsing.
## Two-stage Parsing
Slang solves this problem by breaking parsing into two stages: the decl parsing stage, and body parsing stage. Initially, we will parse the user source in the decl parsing stage. In this stage, we parse all decls, such as `struct`s, variables, functions etc. as usual, except that when we are about to parse the body of a function, we will just collect all tokens enclosed by `{` and `}` and store them in a raw list as a `UnparsedStmt` AST node. By deferring the parsing of function bodies, we no longer need to guess whether a `<` token inside a function body means generic specialization or less-than comparison.
After the decl parsing stage, we have an AST that represents the decl structure but not the function bodies. With this initial AST, we can start semantic checking. Once we reach an `UnparsedStmt` node, the semantic visitor will spawn a new `Parser` and start to parse the tokens stored in the `UnparsedStmt` node. When we spawn the parser in a semantic visitor, we initialize the parser to be in the `Body` parsing stage, and pass a pointer to the semantic visitor to the parser. This way, we trigger the second parsing stage from the semantic visitor.
During the second parsing stage, whenever we see a `<` and need to disambiguate, we use the semantic visitor to check the expression that has been parsed so far before the `<`. If we are able to type-check the expression and find it to be a `DeclRefExpr` referencing a generic decl, or an `OverloadedExpr` where one of the candidates is a generic decl, then we know the `<` should be parsed as a generic specialization instead of `operator <`. If the expression before the `<` checks to a reference to a variable or a property, we parse it as the comparison operator. The reason we still parse `<` as a generic specialization when the expression before it is a non-generic function or type is to allow us to provide better error messages instead of just a "syntax error" somewhere down the line: in this case the user is most likely treating the non-generic type or function as a generic one by mistake, so we should diagnose as such. In the case that we are unable to properly check the preceding expression, or it checks to something else we don't know how to handle, the compiler falls back to the heuristic-based method for disambiguation.
Note that in the second stage, parsing and semantic checking are interleaved organically. We no longer have a clean boundary between parsing and checking. However, the checking that happens in the second stage is on-demand and checks only the parts of the code necessary to determine the type of the expression preceding the `<` token. Any other code irrelevant for disambiguation purposes is left unchecked. Once the function body is fully parsed, the semantic visitor working on the function will make sure every node of the parsed AST is visited.
This two-stage parsing technique works well to correctly disambiguate code inside a function body. However, the current implementation is not 100% bulletproof. Expressions at decl level, such as default values for struct members or function parameters, are still fully parsed in the first stage using the heuristic-based method. This should be a lesser problem in practice, because default values are typically simple expressions, and the chance of running into a wrongly disambiguated case is much lower than in function bodies.
## Scope of Local Variables
Another issue linked with parsing is to correctly support the scope of local variables. A local variable should only be visible to code after its declaration within the same `{}` block. Consider this example:
```cpp
static int input = 100;
int f()
{
input = 2; // global `input` is now 2
int input = input + 1; // local `input` is now 3
input = input + 2; // local `input` is now 5
return input; // returns 5.
}
```
In Slang's implementation, we create a `ScopeDecl` container node for each `BlockStatement`, and variable declarations inside the block are added to the same `ScopeDecl`. This creates a problem for two-stage parsing: to allow any expression to be checked during disambiguation, we need to insert variables into the scope as soon as they are parsed, but this means that when we do the "full checking" after the entire body is parsed, all variables are already registered in scope and discoverable while we are checking the earlier statements in the block. As a result, the compiler cannot report an error if the user attempts to use a variable that is defined later in the block. In the example above, when we check the first statement `input = 2`, the lookup logic for `input` will find the local variable instead of the global variable, thus generating the wrong code.
One way to solve this problem is, instead of registering all local variables in the same scope owned by the containing `BlockStmt`, to make each local variable declaration own its own scope that ends at the end of the owning block. This way, all statements following the local variable declaration become children of the local variable's `DeclStmt`, effectively parsing the above example as:
```cpp
static int input = 100;
int f()
{
input = 2; // global `input` is now 2
{
int input = input + 1; // local `input` is now 3
input = input + 2; // local `input` is now 5
return input; // returns 5.
}
}
```
This will ensure the scope data-structure matches the semantic scope of the variable, and allow the compiler to produce the correct diagnostics.
However, expressing scope this way creates long nested chains in the AST, and leads to inefficient lookup and deep ASTs that risk overflowing the stack. Instead, Slang keeps the design that registers all variables in a block to the same `ScopeDecl`, but uses a separate state on each `VarDecl`, called `hiddenFromLookup`, to track whether or not the decl should be visible to lookup. During parsing, all decls are visible by default, so they can be used for disambiguation purposes. Once parsing is fully done and we are about to check a `BlockStmt`, we first visit all `DeclStmt`s in the block and mark their decls as hidden, then continue checking the child statements. When checking encounters a `DeclStmt`, it marks the decl as visible again, allowing it to be found by lookup logic for code after the declaration site. This solution allows us to respect the semantic scope of local variables without actually forming a long chain of scopes for a sequence of statements.
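A minimal sketch of this hide-then-reveal scheme (hypothetical data structures; the real logic operates on Slang's AST):

```cpp
#include <cassert>
#include <string>
#include <vector>

struct VarDecl { std::string name; bool hiddenFromLookup; };

// Lookup only finds decls that are currently visible.
VarDecl* lookup(std::vector<VarDecl>& scope, const std::string& name) {
    for (VarDecl& d : scope)
        if (!d.hiddenFromLookup && d.name == name)
            return &d;
    return nullptr;
}

// Before checking a block: hide every local decl in its scope.
void beginCheckingBlock(std::vector<VarDecl>& scope) {
    for (VarDecl& d : scope)
        d.hiddenFromLookup = true;
}

// Checking a DeclStmt reveals its decl, so only code after the
// declaration site can find it.
void checkDeclStmt(VarDecl& decl) {
    decl.hiddenFromLookup = false;
}
```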
## Future Work: Extend Staged Parsing to Decl Scopes
We can further extend this approach to properly support expressions in global/decl scopes, such as default value expressions for struct members, or the type expressions for functions and global/member variables. To do so, we will use a different strategy for parsing expressions in the first parsing stage. Instead of parsing an expression directly, we should identify the token boundary of the expression without a detailed understanding of the syntax. We will parse all expressions into `UnparsedExpr` nodes, which contain the unparsed tokens for each expression. By doing so, the first parsing stage will give us an AST that is detailed enough to identify the names of types and functions, and whether or not they are generic. Then we can perform semantic checking on the initial AST, and use the semantic checking to drive the parsing and checking of any `UnparsedExpr`s and `UnparsedStmt`s.
## Future Work: ScopeRef
We can get rid of the `hiddenFromLookup` flag and use a more immutable representation of AST nodes if we introduce the concept of a `ScopeRef` that is a `Scope*` + `endIndex` to mark the boundary of the referenced scope. This way, different statements in a block can have different `ScopeRef` to the same scope but different ending member index. If we are looking up through a `ScopeRef` and find a variable in the scope that has an index greater than `endIndex`, we should treat the variable as invisible and report an error. This is cleaner, allowing better error messages, and avoids having to maintain mutable state flags on Decls.
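A sketch of the proposed `ScopeRef` (hypothetical, since this is future work): lookup through a `ScopeRef` ignores any decl whose index is at or beyond `endIndex`, so visibility follows from the reference rather than from mutable flags.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Scope { std::vector<std::string> decls; };

// A ScopeRef is a scope pointer plus an index bounding how much of the
// scope is visible from a given statement.
struct ScopeRef {
    const Scope* scope;
    size_t endIndex;

    bool isVisible(const std::string& name) const {
        for (size_t i = 0; i < endIndex && i < scope->decls.size(); i++)
            if (scope->decls[i] == name)
                return true;
        return false;
    }
};
```

Different statements in the same block would then carry `ScopeRef`s to the same `Scope` with different `endIndex` values.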
Semantic Checking
=================
The semantic checking logic in the Slang compiler is located in `source/slang/slang-check*`.
Semantic checking is applied in the front end after parsing, and before lowering of code to the IR.
The main job of the semantic checking stage is to detect and forbid code that has errors in it.
The errors and other diagnostics reported are intended to be of benefit to the user, but semantic checking is also important for the overall function of the compiler.
Stages of compilation after semantic checking (e.g., lowering to the IR) are allowed to *assume* that the code they operate on is semantically valid, and may assert-fail or even crash on invalid code.
Semantic checking is thus not an optional step, and there is no meaningful way to turn it off.
Semantic Checking can be broken into three main kinds of work, and we will discuss how each is implemented in the following sections:
* Checking of "terms" which include expressions and type expressions
* Checking of statements
* Checking of declarations
Checking Terms
--------------
### Some Terminology for Terms
We use the word "term" to refer generically to something that can be evaluated to produce a result, but where we do not yet know if the result will be a type or a value. For example, `Texture2D` might be a term that results in a type, while `main` might be a term that results in a value (of function type), but both start out as a `NameExpr` in the AST. Thus the AST uses the class hierarchy under `Expr` to represent terms, whether they evaluate to values or types.
There is also the `Type` hierarchy, but it is important to understand that `Type` represents types as their logical immutable selves, while `Expr`s that evaluate to types are *type expressions* which can be concretely pointed to in the user's code. Type expressions have source locations, because they represent something the user wrote in their code, while `Type`s don't have singular locations by default.
The codebase uses the notion of a `TypeRepr` for those `Expr`s that should only ever evaluate to types, and there is also a `TypeExp` type that is meant to package up a `Type` with an optional `Expr` for a type expression that produced it. The names of these implementation types aren't great, and should probably not be spread further.
A value-bearing `Expr` will eventually be given a `Type` that describes the type of value it produces.
An `Expr` that evaluates to a type will eventually be given a `Type` that uses the `TypeType` subclass to indicate the specific type it evaluated to.
The `TypeType` idea is kind of kludge to represent "kinds" (the "types of types") in our system.
More correctly, we should say that every `Expr` gets a *classifier*, with the classifiers for value expressions being `Type`s and the classifiers for type expressions being kinds, but we haven't had time or inclination to fix the model yet.
### The Big Picture
Checking of terms is largely done as an ad hoc postorder traversal of the AST.
That is, in order to check a compound expression like `f(a)` we first need to check `f` and `a` before we can check the function call.
When checking an expression there are four main things that have to be done:
1. Recursively check all sub-expressions.
2. Detect and diagnose any errors (or warnings) in the current expression.
3. Optionally construct a new expression to replace the current expression (or one of its sub-expressions) in cases where the syntactic form of the input doesn't match the desired semantics (e.g., make an implicit type conversion explicit in the AST).
4. Determine the correct type for the result expression, and store it so that it can be used by subsequent checking.
Those steps may end up being interleaved in practice.
### Handling Errors Gracefully
If an error is detected in a sub-expression, then there are a few issues that need to be dealt with:
* We need to ensure that an erroneous sub-expression can't crash the compiler when it goes on to check a parent expression. For example, leaving the type of an expression as null when it has errors is asking for trouble.
* We ideally want to continue to diagnose other unrelated errors in the same expression, statement, function, or file. That means that we shouldn't just bail out of semantic checking entirely (e.g., by throwing an exception).
* We don't want to produce "cascading" errors where, e.g., an error in `a` causes us to also report an error in `a + b` because no suitable operator overload was found.
We tackle all of these problems by introducing the `ErrorType` and `ErrorExpr` classes.
If we can't determine a correct type for an expression (say, because it has an error) then we will assign it the type `ErrorType`.
If we can't reasonably form an expression to return *at all* then we will return an `ErrorExpr` (which has type `ErrorType`).
These classes are designed to make sure that subsequent code won't crash on them (since we have non-null objects), but to help avoid cascading errors.
Some semantic checking logic will detect `ErrorType`s on sub-expressions and skip its own checking logic (e.g., this happens for function overload resolution), producing an `ErrorType` further up.
In other cases, expressions with `ErrorType` can be silently consumed.
For example, an erroneous expression is implicitly convertible to *any* type, which means that assignment of an error expression to a local variable will always succeed, regardless of the variable's type.
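A toy model of this propagation behavior (hypothetical types; the real logic is spread across the checking code): an operand that already carries the error type is consumed without emitting a second, cascading diagnostic.

```cpp
#include <cassert>
#include <string>

struct Type {
    std::string name;
    bool isError() const { return name == "Error"; }
};

int diagnosticCount = 0;

// Check a binary operation: propagate existing errors silently, but
// diagnose (once) any genuinely new type mismatch.
Type checkBinaryOp(const Type& left, const Type& right) {
    if (left.isError() || right.isError())
        return {"Error"};  // no cascading diagnostic
    if (left.name != right.name) {
        diagnosticCount++;  // a new error: report it
        return {"Error"};
    }
    return left;
}
```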
### Overload Resolution
One of the most involved parts of expression checking is overload resolution, which occurs when there is an expression of the form `f(...)` where `f` could refer to multiple function declarations.
Our basic approach to overload resolution is to iterate over all the candidates and add them to an `OverloadResolveContext`.
The context is responsible for keeping track of the "best" candidate(s) seen so far.
Traditionally a language defines rules for which overloads are "better" than others that focus only on candidates that actually apply to the call site.
This is the right way to define language semantics, but it can produce sub-optimal diagnostics when *no* candidate was actually applicable.
For example, suppose the user wrote `f(a,b)` and there are 100 functions named `f`, but none works for the argument types of `a` and `b`.
A naive approach might just say "no overload applicable to arguments with such-and-such types."
A more advanced compiler might try to list all 100 candidates, but that wouldn't be helpful.
If it turns out that of the 100 candidates, only 10 of them have two parameters, then it might be much more helpful to list only the 10 candidates that were even remotely applicable at the call site.
The Slang compiler strives to provide better diagnostics on overload resolution by breaking the checking of a candidate callee into multiple phases, and recording the earliest phase at which a problem was detected (if any).
Candidates that made it through more phases of checking without errors are considered "better" than other candidates, even if they ultimately aren't applicable.
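The ranking idea can be sketched as follows (hypothetical `Phase` values; the real phases are more fine-grained): each candidate records how far it got before failing, and diagnostics only list candidates that reached the furthest phase.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical checking phases, ordered from earliest to latest.
enum Phase { ArityChecked = 1, TypesChecked = 2, Applicable = 3 };

struct Candidate { int id; Phase reached; };

// Keep only the candidates that survived the most phases of checking;
// these are the ones worth listing in a diagnostic.
std::vector<Candidate> bestCandidates(const std::vector<Candidate>& all) {
    Phase best = ArityChecked;
    for (const Candidate& c : all)
        best = std::max(best, c.reached);
    std::vector<Candidate> result;
    for (const Candidate& c : all)
        if (c.reached == best)
            result.push_back(c);
    return result;
}
```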
### Type Conversions
Conversion of values from one type to another can occur both explicitly (e.g., `(int) foo`) and implicitly (e.g., `while(foo)` implicitly converts `foo` to a `bool`).
Type conversion is also tied into overload resolution, since some conversions get ranked as "better" than others when deciding between candidates (e.g., converting an `int` to a `float` is preferred over converting it to a `double`).
We try to bottleneck all kinds of type conversion through a single code path so that the various kinds of conversion can be handled equivalently.
### L-Values
An *l-value* is an expression that can be used as the destination of an assignment, or for read-modify-write operations.
We track the l-value-ness of expressions using `QualType` which basically represents a `Type` plus a bit to note whether something is an l-value or not.
(This type could eventually be compressed down to a single pointer, but we haven't gotten to that yet.)
We do not currently have a concept like the `const` qualifier in C/C++, that would be visible to the language user.
Propagation of l-value-ness is handled in an ad hoc fashion in the small number of expression cases that can ever produce l-values.
The default behavior is that expressions are not l-values and the implicit conversion from `Type` to `QualType` reflects this.
Checking Statements
-------------------
Checking of statements is considerably simpler than checking of expressions.
Statements do not produce values, so they don't get assigned types/classifiers.
We do not currently have cases where a statement needs to be transformed into an elaborated form as part of checking (e.g., to make implicit behavior explicit), so statement checking operates "in place" rather than optionally producing new AST nodes.
The most interesting part of statement checking is that it requires information about the lexical context.
Checking a `return` statement requires knowing the surrounding function and its declared result type.
Checking a `break` statement requires knowing about any surrounding loop or `switch` statements.
We represent the surrounding function explicitly on the `SemanticsStmtVisitor` type, and also use a linked list of `OuterStmtInfo` threaded up through the stack to track lexically enclosing statements.
Note that semantic checking of statements at the AST level does *not* encompass certain flow-sensitive checks.
For example, the logic in `slang-check-stmt.cpp` does not check for or diagnose any of:
* Functions that fail to `return` a value along some control flow paths
* Unreachable code
* Variables used without being initialized first
All of the above are instead intended to be handled at the IR level (where dataflow analysis is easier) during the "mandatory" optimization passes that follow IR lowering.
Checking Declarations
---------------------
Checking of declarations is the most complicated and involved part of semantic checking.
### The Problem
Simple approaches to semantic checking of declarations fall into two camps:
1. One can define a total ordering on declarations (usually textual order in the source file) and only allow dependencies to follow that order, so that checking can follow the same order. This is the style of C/C++, which is inherited from the legacy of traditional single-pass compilers.
2. One can define a total ordering on *phases* of semantic checking, so that every declaration in the file is checked at phase N before any is checked at phase N+1. E.g., the types of all variables and functions must be determined before any expressions that use those variables/functions can be checked. This is the style of, e.g., Java and C#, which put a premium on defining context-free languages that don't dictate order of declaration.
Slang tries to bridge these two worlds: it has inherited features from HLSL that were inspired by C/C++, while it also strives to support out-of-order declarations like Java/C#.
Unsurprisingly, this leads to unique challenges.
Supporting out-of-order declarations means that there is no simple total order on declarations (we can have mutually recursive function or type declarations), and supporting generics with value parameters means there is no simple total order on phases.
For that last part observe that:
* Resolving an overloaded function call requires knowing the types of the parameters for candidate functions.
* Determining the type of a parameter requires checking type expressions.
* Type expressions may contain value arguments to generics, so checking type expressions requires checking value expressions.
* Value expressions can include function calls (e.g., operator invocations), which then require overload resolution to type-check.
### The Solution
Our declaration checking logic takes the idea of phase-based checking as a starting point, but instead of a global ordering on phases we use a per-declaration order.
Each declaration in the Slang AST will have a `DeclCheckState` that represents "how checked" that declaration is.
We can apply semantic checking logic to a declaration `D` to raise its state to some desired state `S`.
By default, the logic in `slang-check-decl.cpp` will do a kind of "breadth-first" checking strategy where it will try to raise all declarations to one state before moving on to the next.
In many cases this will reproduce the behavior of a Java or C#-style compiler with strict phases.
The main difference for Slang is that whenever, during the checking of some declaration `D`, we discover that we need information from some other declaration `E` that would depend on `E` being in state `S`, we manually call a routine `ensureDecl(E,S)` whose job is to ensure that `E` has been checked enough for us to proceed.
The `ensureDecl` operation will often be a no-op, if the declaration has already been checked previously, but in cases where the declaration *hasn't* been checked yet it will cause the compiler to recursively re-enter semantic checking and try to check `E` until it reaches the desired state.
In pathological cases, this method can result in unbounded recursion in the type checker. The breadth-first strategy helps to make such cases less likely, and introducing more phases to semantic checking can also help reduce problems.
In the long run we may need to investigate options that don't rely on unbounded recursion.
### The Rules
As a programmer contributing to the semantic checking infrastructure, the declaration-checking strategy requires following a few rules:
* If a piece of code is about to rely on some property of a declaration that might be null/absent/wrong if checking hasn't been applied, it should use `ensureDecl` to make sure the declaration in question has been checked enough for that property to be available.
* If adding some `ensureDecl`s leads to an internal compiler error because of circularity in semantic checking, then either the `ensureDecl`s were misplaced, or they were too strong (you asked for more checking than was necessary), or in the worst case we need to add more phases (more `DeclCheckState`s) to separate out the checking steps and break the apparent cycle.
* In very rare cases, semantic checking for a declaration may want to use `SetCheckState` to update the state of the declaration itself before recursively `ensureDecl`ing its child declarations, but this must be done carefully, because it means you are claiming that the declaration is in some state `S` while not having completed the checking that is associated with state `S`.
* It should *never* be necessary to modify `checkModuleDecl` so that it performs certain kinds of semantic analysis on certain declarations before others (e.g., iterate over all the `AggTypeDecl`s before all the `FuncDecl`s). If you find yourself tempted to modify it in such a way, then add more `DeclCheckState`s to reflect the desired ordering. It is okay to have phases of checking that only apply to a subset of declarations.
* Every statement and expression/term should be checked once and only once. If something is being checked twice and leading to failures, the right thing is to fix the source of the problem in declaration checking, rather than make the expression/statement checking be defensive against this case.
Name Lookup
-----------
Lookup is the process of resolving the contextual meaning of names, either in a lexical scope (e.g., the user wrote `foo` in a function body - what does it refer to?) or in the scope of some type (e.g., the user wrote `obj.foo` for some value `obj` of type `T` - what does it refer to?).
Lookup can be tied to semantic analysis quite deeply.
In order to know what a member reference like `obj.foo` refers to, we not only need to know the type of `obj`, but we may also need to know what interfaces that type conforms to (e.g., it might be a type parameter `T` with a constraint `T : IFoo`).
In order to support lookup in the presence of our declaration-checking strategy described above, the lookup logic may be passed a `SemanticsVisitor` that it can use to `ensureDecl()` declarations before it relies on their properties.
However, lookup also currently gets used during parsing, and in those cases it may need to be applied without access to the semantics-checking infrastructure (since we currently separate parsing and semantic analysis).
In those cases a null `SemanticsVisitor` is passed in, and the lookup process will avoid using lookup approaches that rely on derived semantic information.
This is fine in practice because the main things that get looked up during parsing are names of `SyntaxDecl`s (which are all global) and also global type/function/variable names.
Known Issues
------------
The largest known issue for the semantic checking logic is that there are currently dependencies between parsing and semantic checking.
Just like a C/C++ parser, the Slang parser sometimes needs to disambiguate whether an identifier refers to a type or value to make forward progress, and that would in general require semantic analysis.
Ideally the way forward is some combination of the following two strategies:
* We should strive to make parsing at the "global scope" fully context-insensitive (e.g., by using similar lookahead heuristics to C#). We are already close to this goal today, but will need to be careful that we do not introduce regressions compared to the old parser (perhaps a "compatibility" mode for legacy HLSL code is needed?)
* We should delay the parsing of nested scopes (both function and type bodies bracketed with `{}`) until later steps of the compiler. Ideally, parsing of function bodies can be done in a context-sensitive manner that interleaves with semantic checking, closer to the traditional C/C++ model (since we don't care about out-of-order declarations in function bodies).
Serialization
=============
Slang's infrastructure for serialization is currently in flux, so there exist a mixture of different subsystems, using a mixture of different techniques.
This document is currently minimal, and primarily serves to provide a replacement for an older draft that no longer reflects the state of the codebase.
The Fossil Format
=================
The "fossil" format is a memory-mappable binary format for general-purpose serialization.
Goals
-----
The main goals of the fossil format are:
* Data can be read from memory as-is.
* Basic types are stored at offsets that are naturally aligned (e.g., a 4-byte integer is 4-byte aligned).
* Pointers are encoded as relative offsets, and can be traversed without any "relocation" step after data is loaded.
* Supports general-purpose data, including complicated object graphs.
* Data can include embedded layout information, allowing code to traverse it without statically knowing the structure.
* Embedded layout information should support versioning; new code should be able to load old data by noticing what has/hasn't been encoded.
* Layout information is *optional*, and data can be traversed with minimal overhead by code that knows/assumes the layout.
Top-Level Structure
-------------------
A serialized blob in fossil format starts with a header (see `Slang::Fossil::Header`), which in turn points to the *root value*.
All other data in the blob should be reachable from the root value, and an application can choose to make the root value whatever type they want (an array, structure, etc.).
Encoding
--------
### Endian
All data is read/written in the endianness of the host machine.
There is currently no automatic support for encoding endianness as part of the format; a byte-order mark should be added if we ever need to support big-endian platforms.
### Fixed-Size Types
#### Basic Types
Basic types like fixed-width integers and floating-point numbers are encoded as-is.
That is, an N-byte value is stored directly as N bytes of data with N-byte alignment.
A Boolean value is encoded as an 8-bit unsigned integer holding either zero or one.
#### Pointers
A pointer is encoded as a 4-byte signed integer, representing a relative offset.
If the relative offset value is zero, then the pointer is null.
Otherwise, the relative offset value should be added to the offset of the pointer itself, to get the offset of the target.
#### Optionals
An optional value of some type `T` (e.g., the equivalent of a `std::optional<T>`) is encoded as a pointer to a `T`.
If the pointer is null, the optional has no value; otherwise the value is stored at the offset being pointed to.
Note that when encoding a pointer to an optional (`std::optional<T> *`) or an optional pointer (`std::optional<T*>`), there will be two indirections.
#### Records
Things that are conceptually like a `struct` or tuple are encoded as *records*, which are simply a sequence of *fields*.
The alignment of a record is the maximum alignment of its fields.
Fields in a record are laid out sequentially, where each field gets the next suitably-aligned offset after the preceding field.
No effort is made to fill in "gaps" left by preceding fields.
Note: currently the size of a record is *not* rounded up to be a multiple of its alignment, so it is possible for one field to be laid out in the "tail padding" of the field before it.
This behavior should probably be changed, so that the fossilized layout better matches what C/C++ compilers tend to do.
### Variable-Size Types
Types where different instances may consume a different number of bytes may be encoded either *inline* or *indirectly*.
If a variable-size type `V` is being referred to by a pointer or optional (e.g., `V*` or `std::optional<V>`), then it will be encoded inline as the target address of that pointer/optional.
In all other contexts, including when a `V` is used as a field of a record, it will be encoded indirectly (conceptually, as if the field was actually a `V*`).
When a variable-size type is encoded indirectly, a null pointer should be interpreted as an empty instance of the type `V`.
#### Arrays
An array of `T` is encoded as a sequence of `T` values, separated by the *stride* of `T` (the size of `T` rounded up to the alignment of `T`).
The offset of the array is the offset of its first element.
The number of elements in the array is encoded as a 4-byte unsigned integer stored immediately *before* the offset of the array itself.
#### Strings
A string is encoded in the same way that an array of 8-bit bytes would be (including the count stored before the first element).
The only additional detail is that the serialized data *must* include an additional nul byte after the last element of the string.
The data of a string is assumed to be in UTF-8 encoding, but there is nothing about the format that validates or enforces this.
#### Dictionaries
A dictionary with keys of type `K` and values of type `V` is encoded in the same way as an array of `P`, where `P` is a two-element tuple of a `K` and a `V`.
There is currently no provision made for efficient lookup of elements of a fossilized dictionary.
#### Variants
A *variant* is a fossilized value that can describe its own layout.
The content of a variant holding a value of type `T` is encoded exactly as a record with one field of type `T` would be, starting at the offset of the variant itself.
The four bytes immediately preceding a variant store a relative pointer to the fossilized layout for the type `T` of the content.
### Layouts
Every layout starts with a 4-byte unsigned integer that holds a tag representing the kind of layout (see `Slang::FossilizedValKind`).
The value of the tag determines what, if any, information appears after the tag.
In any place where a relative pointer to a layout is expected, a null pointer may be used to indicate that the relevant layout information is either unknown, or was elided from the fossilized data.
#### Pointer-Like Types
For pointers (`T*`) and optionals (`Optional<T>`), the tag is followed by a relative pointer to a layout for `T`.
#### Container Types
For arrays and dictionaries, the tag is followed by:
* A relative pointer to a layout for the element type
* A 4-byte unsigned integer holding the stride between elements
#### Record Types
For records, the tag is followed by:
* A 4-byte unsigned integer holding the number of fields, `N`
* `N` 8-byte values representing the fields, each comprising:
* A relative pointer to the type of the field
* A 4-byte unsigned integer holding the offset of that field within the record
The RIFF Support Code
=====================
There is code in `source/core/slang-riff.{h,cpp}` that implements abstractions for reading and writing RIFF-structured files.
The current RIFF implementation is trying to be "correct" for the RIFF format as used elsewhere (e.g., for `.wav` files), but it is unclear if this choice is actually helping us rather than hurting us.
It is likely that we will want to customize the format if we keep using it (e.g., at the very least, increase the minimum alignment of chunks).
RIFF is a simple chunk-based file format that is used by things like WAV files, and has inspired many similar container formats used in media/games.
The RIFF structures are currently being used for a few things:
* The top-level structure of serialized files for Slang modules and "module libraries". This choice is made so that the compiler can navigate the relevant structures and extract just the parts it needs (e.g., just the digest of a module, but not the AST or IR).
* Repro files are using a top-level RIFF container, but it is just to encapsulate a single blob of raw data (with internal offset-based pointers)
* The structure of the IR and `SourceLoc` serialization formats uses RIFF chunks for their top-level structure, but doesn't really make use of the ability to navigate them in memory or perform random access.
* The actual serialized AST format is currently a deep hierarchy of RIFF chunks.
* There is also code for a RIFF-based hierarchical virtual file-system format, and that format is being used for the serialized core module (seemingly just because it includes support for LZ4; the actual "file system" that gets serialized seems to only have a single file in it).
General-Purpose Hierarchical Data Serialization
===============================================
The code in `source/slang/slang-serialize.{h,cpp}` implements a framework for serialization that is intended to be lightweight for users to adopt, while also scaling to more complicated cases like our AST serialization.
In the simplest cases, all a programmer needs to know is that if they have declared a type like:
struct MyThing
{
float f;
List<OtherThing> others;
SomeObject* obj;
};
then they can add serialization support for their type by writing a function like:
void serialize(Serializer const& serializer, MyThing& value)
{
SLANG_SCOPED_SERIALIZER_STRUCT(serializer);
serialize(serializer, value.f);
serialize(serializer, value.others);
serialize(serializer, value.obj);
}
If the `OtherThing` and `SomeObject` types were already set up with their own serialization support, then that should be all that's needed.
Of course there's a lot more to it once you get into the details and the difficult cases.
For now, looking at `source/slang/slang-serialize.h` is probably the best way to learn more about the approach.
One key goal of this serialization system is that it allows the serialized format to be swapped in and out without affecting the per-type `serialize` functions.
Currently there are only a small number of implementations.
RIFF Serialization
------------------
The files `slang-serialize-riff.{h,cpp}` provide an implementation of the general-purpose serialization framework that reads/writes RIFF files with a particular kind of structure, based on what had previously been hard-coded for use in serializing the AST to RIFF.
In practice this representation is kind of like an encoding of JSON as RIFF chunks, with leaf/data chunks for what would be leaf values in JSON, and container chunks for arrays and dictionaries (plus other aggregates that would translate into arrays or dictionaries in JSON).
Fossil Serialization
--------------------
The files `slang-serialize-fossil.{h,cpp}` provide an implementation of the general-purpose serialization framework that reads/writes the "fossil" format, which is described earlier in this document.
AST Serialization
=================
AST serialization is implemented as an application of the general-purpose framework described above.
There is an `ASTSerializer` type that expands on `Serializer` to include the additional context that is needed for handling AST-related types like `SourceLoc`, `Name`, and the `NodeBase` hierarchy.
The Old Serialization System
============================
The old serialization system has largely been removed, but some vestiges of it are still noticeable.
There was an older serialization system in place that made use of an extensive RTTI system that types had to be registered with, plus a set of boilerplate macros for interfacing with that system that were generated from the C++ declarations of the AST node types.
That system was also predicated on the idea that to serialize a user C++ type `Foo`, one would also hand-author a matching C++ type `SerialFooData`, and then write code to translate a `Foo` to/from a `SerialFooData` plus code to read/write a `SerialFooData` from the actual serialized data format.
The IR and `SourceLoc` serialization approaches are currently still heavily influenced by the old serialization system, and there are still vestiges of the RTTI infrastructure that was introduced to support it.
The hope is that as more subsystems are ported to use newer approaches to serialization, this code can all be eliminated.
The following sections are older text that describes some of the formats that have not yet been revisited.
IR Serialization
----------------
This mechanism is *much* simpler than general serialization, because by design the IR types are very homogeneous in style. There are a few special cases, but in general an instruction consists of
* Its type
* A SourceLoc
* 0 or more operands.
* 0 or more children.
Within the IR, instructions hold pointers to `IRInst`-derived types. As previously discussed, serializing pointers directly is generally not a good idea; to work around this, the pointers are turned into 32-bit indices. Additionally, we know that an instruction can belong to at most one other instruction.
When serializing out, special handling is applied to child instructions: their indices are made to be a contiguous range covering all instructions that belong to each parent. The indices are ordered in the same order as the children are held in the parent. With this mechanism it is not necessary to directly save off the indices that belong to a parent, only the range of indices.
The actual serialization mechanism is similar to the generalized mechanism - referenced objects are saved off in order of their indices. What is different is that the encoding fixes the size of each instruction in `IRSerialData`. This can hold up to two operands; if the instruction has more than two operands, then one of the UInt32s holds the operand count and the other an offset to a list of operands. It probably makes sense to alter this in the future to stream the instruction's payload directly.
IR serialization allows a simple compression mechanism, which works because much of the serialized IR data consists of UInt32 values that can use a variable-byte encoding.
SourceLoc Serialization
-----------------------
SourceLoc serialization presents several problems. Firstly, we have two distinct serialization mechanisms that need to use it - IR serialization and generalized serialization. That being the case, it cannot be saved directly in either, even though it may be referenced by both.
To keep things simple for now we build up SourceLoc information for both IR and general serialization via their writers adding their information into a SerialSourceLocWriter. Then we can save this information into a RIFF section, that can be loaded before either general or IR deserialization is used.
When reading, the SourceLoc information has to be located and deserialized before any AST or IR deserialization. The SourceLoc data can then be turned into a `SerialSourceLocReader`, which is then either set on the `SerialReader`'s `SerialExtraObjects`, or passed to the `IRSerialReader`.
Core Module Intrinsics
======================
The following document aims to cover a variety of systems used to add target-specific features. They are most extensively used in the Slang core module.
**NOTE!** These features should *not* be considered stable! They can be used in regular Slang code to add features, but they risk breaking with any Slang version change. Additionally, a feature's implementation can be very particular to what is required for a specific feature set, so it might not work as expected in all scenarios.
As these features are in flux, it is quite possible this document is behind the current features available within the Slang code base.
If you want to add support for a feature for a target to Slang, implementing it as part of the Slang standard modules is typically a good way to progress. Depending on the extension/feature it may not be possible to add support exclusively via changes to the standard modules alone. That said, most support for target-specific extensions and features involves at least some changes to the Slang standard modules, including the core module, typically using the mechanisms described here.
## Core Module
The main place these features are used is within the Slang core module. This is implemented with a set of Slang files within the Slang project:
* core.meta.slang
* hlsl.meta.slang
* diff.meta.slang
Looking at these files will demonstrate the features in use.
Most of the intrinsics and attributes have names that indicate that they are not for normal use. This is typically via a `__` prefix.
The `.meta.slang` files look largely like Slang source files, but their contents can also be generated programmatically with C++ code. A section of code can drop into `C++` code if it is preceded by `${{{{`. The C++ section is closed with a closing `}}}}`. This mechanism is typically used to generate different versions of a similar code sequence. Values from the C++ code can be accessed via `$()`, where the contents of the brackets specify something that can be calculated within the C++ code.
As an example, to access a value computed in C++ from the Slang code, we could write...
```slang
// Slang code
${{{{
// C++ code, calling out to a C++ function getTime, the result is held in variable time
int cppTime = getTime();
}}}}
// Back to Slang code; we can access the C++ variable previously defined, cppTime, via $().
// The code inside the $() is executed on the C++ side, so it can do calculations. In practice it would be easier
// to just call $(getTime() + 1), but this demonstrates that variables are accessible.
int slangTime = $(cppTime + 1);
```
# Attributes
## [__readNone]
A `[__readNone]` indicates a function that computes its results strictly based on argument values, without reading or writing through any pointer arguments, or any other state that could be observed by a caller.
## [__NoSideEffect]
Specifies a function declaration has no observable side effects.
## [__unsafeForceInlineEarly]
Inlines the contained code, but does so at a very early stage. Being earlier allows some kinds of inlining transformations to work that wouldn't work with regular inlining. It also means it must be used with *care*, because it may produce unexpected results in more complex scenarios.
## [__NonCopyableType]
Marks a type to be non-copyable, causing the SSA pass to skip turning variables of the type into SSA values.
## [__AlwaysFoldIntoUseSiteAttribute]
A call to the decorated function should always be folded into its use site.
## [KnownBuiltin("name")]
A `[KnownBuiltin("name")]` attribute allows the compiler to identify this declaration during compilation, despite obfuscation or linkage-removing optimizations.
# Intrinsics
<a id="target-intrinsic"></a>
## __target_intrinsic(target, expansion)
This is a widely used and somewhat complicated intrinsic. Placed on a declaration, it describes how the declaration should be emitted for a target. The complexity is that `expansion` is applied via a variety of rules. `target` is a "target capability"; commonly it's just the emit target for the intrinsic, so one of...
* hlsl
* glsl
* cuda - CUDA
* cpp - C++ output (used for exe, shared-library or host-callable)
* spirv - Used for Slang's SPIR-V direct mechanism
A function definition can have a `target_intrinsic` *and* a body. In that case, the body will be used for targets where the `target_intrinsic` isn't defined.
If the intrinsic can be emitted as-is, the expansion need not be specified. If only the *name* needs to change (params can be passed as-is), only the name to expand to needs to be specified, *without* `()`. In this scenario it is not necessary to specify it as a string in quotes; the bare identifier name can be used.
Currently `HLSL` has special handling, in that if a declaration exists it is *assumed* that it can be emitted verbatim as HLSL.
The target can also be a capability atom. The atoms are listed in "slang-capability-defs.h".
What is perhaps of importance here is that some features on a specific target can have multiple ways of achieving the same effect - for example "GL_NV_ray_tracing" and "GL_EXT_ray_tracing" are two different ray tracing extensions available for Vulkan through GLSL. The `-profile` option can disambiguate which extension is actually desired, and the capability with that name on the `target_intrinsic` specifies how to implement that feature for that specific extension.
The expansion mechanism is implemented in "slang-intrinsic-expand.cpp" which will be most up to date.
The `expansion` value can be a string or an identifier. If it is an identifier, it will just be emitted as is replacing the name of the declaration the intrinsics is associated with.
Sections of the `expansion` string that are to be replaced are prefixed by the `$` sigil.
* $0-9 - Indicates the parameter at that index. For a method call $0 is `this`.
* $T0-9 - The type for the param at the index. If the type is a texture resource derived type, returns the *element* type.
* $TR - The return type
* $G0-9 - Replaced by the type/value at that index of specialization
* $S0-9 - The scalar type of the generic at the index.
* $p - Used on texturing operations. Produces the combined texture sampler arguments as needed for GLSL.
* $C - The $C intrinsic is a mechanism to change the name of an invocation depending on whether there is a format conversion required between the type associated with the resource and the backing ImageFormat. Currently this is only implemented on CUDA, where there are specialized versions of the RWTexture writes that will do a format conversion.
* $E - Sometimes accesses need to be scaled. For example in CUDA the x coordinate for surface access is byte addressed. $E will return the byte size of the *backing element*.
* $c - When doing texture access in GLSL the result may need to be cast. In particular, if the underlying texture is 'half' based, GLSL only accesses (read/write) as float. So we need to cast to a half type on output. When storing into a texture it is still the case that the value written must be half - but we don't need to do any casting there, as half is coerced to float without a problem.
* $z - If we are calling a D3D texturing operation in the form t.Foo(s, ...), where `t` is a Texture&lt;T&gt;, then this is the step where we try to properly swizzle the output of the equivalent GLSL call into the right shape.
* $N0-9 - Extract the element count from a vector argument so that we can use it in the constructed expression.
* $V0-9 - Take an argument of some scalar/vector type and pad it out to a 4-vector with the same element type (this is the inverse of `$z`).
* $a - We have an operation that needs to lower to either `atomic*` or `imageAtomic*` for GLSL, depending on whether its first operand is a subscript into an array. This `$a` is the first `a` in `atomic`, so we will replace it accordingly.
* $A - We have an operand that represents the destination of an atomic operation in GLSL, and it should be lowered based on whether it is an ordinary l-value, or an image subscript. In the image subscript case this operand will turn into multiple arguments to the `imageAtomic*` function.
* $XP - Ray tracing ray payload
* $XC - Ray tracing callable payload
* $XH - Ray tracing hit object attribute
* $P - Type-based prefix as used for CUDA and C++ targets (I8 for int8_t, F32 - float etc)
## __attributeTarget(astClassName)
For an attribute, specifies the AST class (and derived classes) the attribute can be applied to.
## __builtin
Identifies the declaration as being "builtin".
## __builtin_requirement(requirementKind)
A modifier that indicates a built-in associated type requirement (e.g., `Differential`). The requirement is one of `BuiltinRequirementKind`.
The requirement value can just be specified via the `$()` mechanism.
## __builtin_type(tag)
Specifies a builtin type - the tag is the integer value of one of the `BaseType` enumeration cases.
## __magic_type(clsName, tag)
Used before a type declaration. The clsName is the name of the class that is used to represent the type in the AST in Slang *C++* code. The tag is an optional integer value that provides additional information, meaningful in the context of that class type.
## __intrinsic_type(op)
Used to specify the IR opcode associated with a type. The IR opcode is listed as something like `$(kIROp_HLSLByteAddressBufferType)`, which will expand to the integer value of the opcode (because the opcode value is an enum value that is visible from C++). It is possible to just write the opcode number, but that is generally inadvisable as the ids for ops are not stable. If a code change in Slang C++ adds or removes an opcode the number is likely to be incorrect.
As an example from the core module
```slang
__magic_type(HLSLByteAddressBufferType)
__intrinsic_type($(kIROp_HLSLByteAddressBufferType))
struct ByteAddressBuffer
{
// ...
};
```
# General
## __generic<>
Is an alternate syntax for specifying a declaration that is generic. The more commonly used form is to list the generic parameters in `<>` after the name of the declaration.
## attribute_syntax
Attribute syntax provides a mechanism to introduce an attribute type in Slang.
Right now the basic form is:
```
attribute_syntax [name(parmName: paramType, ...)] : syntaxClass;
```
There can be 0 or more params associated with the attribute; if there are none, the parentheses are not needed.
* `name` gives the name of the attribute to define.
* `paramName` is the name of a param that is specified at the attribute use site
* `paramType` is the type of the value associated with the param
* `syntaxClass` is the name of an AST node class that we expect this attribute to create when checked.
For example
```
__attributeTarget(FuncDecl)
attribute_syntax [CudaDeviceExport] : CudaDeviceExportAttribute;
```
Defines an attribute `CudaDeviceExport` which can only be applied to FuncDecl or derived AST types. Once semantically checked will be turned into a `CudaDeviceExportAttribute` attribute in the AST.
With a parameter
```
__attributeTarget(InterfaceDecl)
attribute_syntax [anyValueSize(size:int)] : AnyValueSizeAttribute;
```
Defines an attribute `anyValueSize` that can be applied to `InterfaceDecl` and derived types. It takes a single parameter called `size` of type `int`.
## Ref<T>
Allows returning or passing a value "by reference".
# GLSL/Vulkan specific
## __glsl_version(version)
Used to specify the GLSL version number that is required for the subsequent declaration. When Slang emits GLSL source, the version directive at the start of the file will be the largest version required by any of the emitted code.
For example
```slang
__glsl_version(430)
```
## __glsl_extension
Specifies the GLSL extension that is required for the declaration to work. When a declaration with this intrinsic is output to GLSL, the corresponding `#extension` directive will additionally be added to the GLSL or SPIR-V output.
Multiple extensions can be applied to a declaration where applicable, for example if there are multiple ways of implementing it that can be emitted in the same manner (see the section on [target](#target-intrinsic) for more details).
## __spirv_version
When a declaration with this modifier is used for a SPIR-V target, the highest value seen is taken to be the required SPIR-V version. For compilation through GLSLANG, the value is passed down to GLSLANG to specify which SPIR-V version is being targeted.
Example
```
__spirv_version(1.3)
```
## vk::spirv_instruction
Provides a way to use a limited subset of the `GL_EXT_spirv_intrinsics` extension.
```
vk::spirv_instruction(op, set)
```
`op` is the integer *value* for the op. The `set` is an optional string which specifies the instruction set the op is associated with.
For example
```
__specialized_for_target(glsl)
[[vk::spirv_instruction(1, "NonSemantic.DebugBreak")]]
void debugBreak();
```
# CUDA specific
## __cuda_sm_version
When a declaration with this intrinsic is used for a CUDA target, the highest shader model seen will be passed down to the downstream CUDA compiler (NVRTC).
# NVAPI
## [__requiresNVAPI]
If the declaration is reached during a compilation for an applicable target (D3D11/12), this indicates that [NVAPI support](../nvapi-support.md) is required for the declaration to work.
# Slang Compiler Diagnostic Guidelines
## Overview
The Slang compiler aims to provide clear, actionable, and user-friendly diagnostics that help developers quickly understand and fix issues in their code. These guidelines draw from best practices established by Rust, Clang, and Swift compilers while adapting them for Slang's specific needs.
## Diagnostic Structure
A complete diagnostic in Slang consists of:
```
error[E0000]: main error message
--> file.slang:LL:CC
|
LL | <code>
| ^^^^ primary label
|
LL | <related code>
| -------------- secondary label
|
= note: additional context without a span
= help: suggestion for fixing the issue
```
### Core Components
- **Level**: `error`, `warning`, `lint`, `remark` (plus attached `note`, `help`)
- **Error Code**: Optional identifier (e.g., `E0308`) for detailed documentation lookup
- **Message**: Concise description of the problem
- **Source Location**: File path, line, and column information
- **Code Snippet**: The affected code with visual indicators
- **Labels**: Primary and secondary spans with explanatory text
- **Sub-diagnostics**: Additional notes and suggestions
- **Documentation Links**: References to relevant language guide chapters
## Diagnostic Levels
### Error
Emitted when the compiler cannot proceed with compilation:
- Syntax errors
- Type mismatches that prevent code generation
- Unresolved symbols
- Constraint violations
- Missing interface implementations
### Warning
Emitted for problematic but compilable code:
- Deprecated feature usage
- Unused variables or imports
- Potentially incorrect but syntactically valid code
- Code that may behave unexpectedly
- Can be turned into errors with `-werror`
### Lint
Off-by-default style or clarity guidelines:
- Extraneous parentheses
- Style violations
- Code clarity improvements
### Note
Provides additional context for errors and warnings:
- Related code locations
- Explanations of why something failed
- References to relevant language rules
### Help
Offers actionable suggestions:
- How to fix the problem
- Alternative approaches
- Links to documentation
### Remark
Off-by-default informational messages:
- Optimization hints
- Compilation progress information
- Performance suggestions
- Code generation notes
## Writing Style Guidelines
### Message Content
1. **Be concise and precise**
- ❌ "The compiler failed to find a matching type"
- ✅ "type mismatch: expected `int`, found `string`"
2. **Use plain language**
- Avoid compiler jargon when possible
- Define technical terms when necessary
- Write for developers who may be new to the language
3. **Include relevant context**
```
error[E0277]: interface `IAddable` is not implemented for type `String`
--> file.slang:7:22
|
4 | interface IAddable { This add(This other); }
| ---------------------- required by this interface
5 | String s1 = "hello";
6 | String s2 = "world";
7 | String result = add(s1, s2);
| ^^^ `add` requires `IAddable` interface
```
### Grammar and Formatting
1. **No ending punctuation** for single-sentence messages
- ✅ ``cannot find type `Foo` in this scope``
- ❌ ``cannot find type `Foo` in this scope.``
2. **Use backticks** for code elements
- Types: `` `float4` ``, `` `Texture2D<float4>` ``
- Identifiers: `` `myVariable` ``
- Keywords: `` `interface` ``, `` `struct` ``
3. **Lowercase start** for messages
   - ✅ `missing semicolon`
   - ❌ `Missing semicolon`
4. **Active voice** when describing problems
- ✅ ``function `foo` takes 2 arguments but 3 were provided``
- ❌ ``3 arguments were provided but function `foo` takes 2``
5. **Use Oxford comma** in lists
- ✅ `` expected one of `int`, `float`, or `double` ``
- ❌ `` expected one of `int`, `float` or `double` ``
6. **Use correct articles** (a vs. an)
   - ✅ `an interface`
   - ✅ `a struct`
   - ✅ ``an `IFoo` implementation``
   - ❌ `a interface`
### Type Aliases and Underlying Types
When type aliases are involved, show the underlying type when it helps clarify the error:
```
error[E0308]: type mismatch
--> file.slang:10:23
|
10 | ColorRGBA color = 0.5;
| ^^^ expected `ColorRGBA` (aka `float4`), found `float`
```
Display options for controlling type alias expansion:
- `-show-type-aliases=always`: Always show "aka" annotations
- `-show-type-aliases=helpful`: Show only when it clarifies (default)
- `-show-type-aliases=never`: Never expand type aliases
## Error Codes
### Format
- Use a letter prefix followed by 5 digits: `E00001`, `W00001`
- Group related errors in ranges:
- **TBD**
### Documentation
**Each error code needs:**
- Brief description
- Links to documentation
**Optionally:**
- Common causes
- Example code that triggers the error
- Suggested fixes
## Suggestions and Fix-its
### Applicability Levels
1. **MachineApplicable**: Can be automatically applied
```
help: add missing semicolon
|
5 | return value;
| +
```
2. **HasPlaceholders**: Requires user input
```
help: specify the type explicitly
|
5 | let color: <type> = value;
| +++++++++
```
3. **MaybeIncorrect**: Suggestion might not be appropriate
```
help: consider adding the `[shader("compute")]` attribute
|
5 | [shader("compute")]
| +++++++++++++++++++
6 | void main() {
```
### Guidelines for Suggestions
- Provide fix-its only when confidence is high
- Show the exact change needed
- Use placeholders (`<type>`, `<name>`) when user input is required
- Prefer showing code transformations over textual descriptions
## Span and Location Information
### Primary Spans
- Point to the exact location of the error
- Keep spans as small as possible while remaining meaningful
- For multi-token constructs, highlight the most relevant part
### Secondary Spans
- Show related code that contributes to the error
- Use different labels to distinguish multiple spans
- Order spans by relevance, not just by source location
### Example
```
error[E0308]: type mismatch in function call
--> file.slang:10:11
|
8 | void expectInt(int x) { }
| ----- expected `int` here
9 |
10 | expectInt("hello");
| ^^^^^^^ found `string`
```
## Error Cascading Prevention
We shouldn't be generating many dependent errors from a single mistake.
We should at least be checking that there are no additional error messages in all our diagnostic tests. At the moment we generally only check for the presence of the tested diagnostic.
To avoid overwhelming users with follow-on errors:
1. **Stop type-checking** in a scope after critical type errors
2. **Mark symbols as poisoned** when their definition has errors
3. **Limit error propagation** from generic instantiation failures
4. **Track error origins** to suppress duplicate reports
Example:
```
error[E0412]: the type `MyTexture` is not defined
--> file.slang:5:5
|
5 | MyTexture tex;
| ^^^^^^^^^ type not found
|
= note: subsequent errors involving `tex` have been suppressed
```
## Diagnostic Priority and Limits
### Priority System
When multiple errors exist, show them in this order:
TBD
1. Syntax errors
2. Import/module errors
3. Type definition errors
4. Interface implementation errors
5. Type mismatch errors
6. Other semantic errors
7. Warnings
8. Remarks
### Error Limits
- Configurable via `-max-errors=N`
- Show message when limit reached:
```
error: aborting due to 20 previous errors; use `-max-errors=N` to see more
```
## Lint System
Lints are a good opportunity to attach fix-its for a LSP or LLM.
### Lint Naming
- Use snake_case
- Name should make sense with "allow": `allow unused_variables`
- Be specific about what is being checked
- Group related lints with common prefixes
### Lint Levels
1. **allow**: Off by default
2. **warn**: On by default, produces warnings
3. **deny**: On by default, produces errors
### Lint Groups
Define logical groups:
- **style**: Code formatting and naming conventions
- NON_CAMEL_CASE_NAMES
- NON_UPPER_CASE_CONSTANTS
- INCONSISTENT_SPACING
- **correctness**: Potential bugs or incorrect usage
- **performance**: Performance-related suggestions
## Special Diagnostic Features
### Generic Type Diffing
When dealing with complex generic types, highlight differences:
```
error[E0308]: type mismatch
= note: expected `RWStructuredBuffer<float4>`
found `RWStructuredBuffer<float3>`
^^^^^^ types differ here
```
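A toy sketch in Python of the argument-level comparison behind such a diff (a hypothetical helper, not the compiler's actual implementation; it assumes a single level of `<>` nesting purely for illustration):

```python
# Toy generic-type diff: given two type strings with the same generic head
# (e.g. RWStructuredBuffer<...>), report the first differing type argument.
# Assumes a single level of <> nesting, purely for illustration.
def diff_generic(expected: str, found: str):
    head_e, args_e = expected.split("<", 1)
    head_f, args_f = found.split("<", 1)
    if head_e != head_f:
        return None  # heads differ, so this is not an argument-level diff
    args_e = [a.strip() for a in args_e.rstrip(">").split(",")]
    args_f = [a.strip() for a in args_f.rstrip(">").split(",")]
    for i, (a, b) in enumerate(zip(args_e, args_f)):
        if a != b:
            return (i, a, b)  # index plus the two differing arguments
    return None
```

For the example above, `diff_generic("RWStructuredBuffer<float4>", "RWStructuredBuffer<float3>")` points at the `float4`/`float3` argument pair.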
### Macro Expansion Context
Show the expansion chain for errors in macros:
```
error[E0369]: invalid operation
--> file.slang:20:5
|
20 | MY_MACRO!(x + y);
| ^^^^^^^^^^^^^^^^^ in this macro invocation
|
::: macros.slang:5:10
|
5 | $left + $right
| ^ cannot add these types
```
### Similar Name Suggestions
```
error[E0425]: cannot find `printn` in scope
--> file.slang:5:5
|
5 | printn("hello");
| ^^^^^^ not found
|
= help: a similar function exists: `println`
help: did you mean `println`?
|
5 | println("hello");
| ~~~~~~~
```
## IDE Integration
### LSP-Specific Formatting
Optimize diagnostics for Language Server Protocol:
- Include `DiagnosticRelatedInformation` for secondary spans
- Provide `CodeAction` items for fix-its
- Support incremental diagnostic updates
- Include diagnostic tags (deprecated, unnecessary)
### Inline Error Markup
Specifications for IDE display:
```json
{
"severity": "error",
"range": {
"start": { "line": 10, "character": 5 },
"end": { "line": 10, "character": 10 }
},
"message": "undefined variable `count`",
"code": "E00123",
"codeDescription": { "href": "https://docs.shader-slang.org/errors/E00123" }
}
```
### Quick-Fix Protocol
Standardized fix communication:
```json
{
"title": "Add missing interface implementation",
"kind": "quickfix",
"diagnostics": ["E00987"],
"edit": {
"changes": {
"file.slang": [
{
"range": { "start": { "line": 15, "character": 0 } },
"newText": "interface MyStruct : IRenderable {\n // implementation\n}\n"
}
]
}
}
}
```
### Diagnostic Severity Mappings
Map compiler levels to IDE severity:
- `error` → `DiagnosticSeverity.Error` (1)
- `warning` → `DiagnosticSeverity.Warning` (2)
- `remark` → `DiagnosticSeverity.Information` (3)
- `note` → `DiagnosticSeverity.Hint` (4)
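A minimal sketch of that mapping, assuming a hypothetical Python-based LSP shim (the integer codes are the standard LSP `DiagnosticSeverity` values):

```python
# Map Slang diagnostic levels to LSP DiagnosticSeverity integers.
LSP_SEVERITY = {
    "error": 1,    # DiagnosticSeverity.Error
    "warning": 2,  # DiagnosticSeverity.Warning
    "remark": 3,   # DiagnosticSeverity.Information
    "note": 4,     # DiagnosticSeverity.Hint
}

def to_lsp_severity(level: str) -> int:
    # Fall back to Information for any level not in the table.
    return LSP_SEVERITY.get(level, 3)
```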
## Internationalization
TBD (can we use LLMs here?)
## Testing Diagnostics
### Diagnostic Verification
TBD: test file syntax to be parsed and checked against machine-readable output.
FileCheck-style test descriptions, which can be verified using the machine-readable output.
```
void test() {
int x = "string";
// ERROR: type mismatch
// ^^^^^^^^ expected `int`, found `string`
// HELP: change the type annotation
}
```
### Test Coverage Requirements
- Each diagnostic should have at least one test
- Test both positive and negative cases
- Verify fix-its compile successfully
- Check error recovery after applying suggestions
## Progressive Disclosure
### Beginner-Friendly Defaults
- Show simple, actionable messages by default
- Hide implementation details unless relevant
- Provide links to learn more
## Performance Considerations
1. Don't compute expensive diagnostics unless needed
2. Avoid reporting the same error multiple times
3. Cache diagnostic messages for repeated errors
4. Use error limits to prevent runaway diagnostics
## Command-Line Interface
### Display Options
- `-error-format=json`: Machine-readable output
- `-color=auto|always|never`: Control color output
- `-show-error-codes`: Display error codes
- `-explain E00001`: Show detailed error explanation
- `-verbose-diagnostics`: Show additional diagnostic information
- `-max-errors=N`: Set maximum error count
- `-show-type-aliases=always|helpful|never`: Control type alias display
### Verbose Mode
With `-verbose-diagnostics`:
- Show full type signatures including type aliases
- Include compiler passes information
- Show all possible fixes, not just the most likely
- Display internal compiler state when relevant
### Example JSON Output
```json
{
"level": "error",
"code": "E0308",
"message": "type mismatch",
"spans": [
{
"file": "main.slang",
"line": 10,
"column": 15,
"text": "float3 color = float4(1, 0, 0, 1);",
"label": "expected `float3`, found `float4`"
}
],
"children": [
{
"level": "help",
"message": "use `.xyz` to extract the first three components",
"spans": [
{
"file": "main.slang",
"line": 10,
"column": 35,
"suggestion": ".xyz"
}
]
}
],
"documentation_url": "https://docs.shader-slang.org/errors/E00345"
}
```
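A downstream tool might consume this JSON to harvest fix-it suggestions; the following is a hypothetical consumer sketch whose field names follow the example output above:

```python
import json

# Collect every `suggestion` carried by `help` children of a diagnostic,
# as (line, column, replacement) tuples an editor could offer as quick-fixes.
def collect_suggestions(diagnostic_json: str):
    diag = json.loads(diagnostic_json)
    fixes = []
    for child in diag.get("children", []):
        if child.get("level") != "help":
            continue
        for span in child.get("spans", []):
            if "suggestion" in span:
                fixes.append((span["line"], span["column"], span["suggestion"]))
    return fixes
```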
## Best Practices Checklist
Before adding a new diagnostic:
- [ ] Is the message clear and actionable?
- [ ] Is the span as precise as possible?
- [ ] Would a fix-it help?
- [ ] Has an error code been assigned?
- [ ] Is the severity level appropriate?
- [ ] Are related locations shown with notes?
- [ ] Is the message properly capitalized and punctuated, with correct grammar?
- [ ] Will this message make sense in different contexts?
- [ ] Have we considered error cascading?
- [ ] Is there a relevant documentation link?
- [ ] Does the documentation have examples?
- [ ] Have we added tests for this diagnostic?
## Examples of Good Diagnostics
### Type Mismatch
```
error[E0308]: mismatched types
--> src/main.slang:5:16
|
4 | float3 expectVec3(float3 v) { return v; }
| ------- expected due to this parameter type
5 | expectVec3(float4(1, 0, 0, 1));
| ^^^^^^^^^^^^^^^^^^^ expected `float3`, found `float4`
|
= help: use `.xyz` to extract the first three components
= note: see https://docs.shader-slang.org/types/vectors for vector swizzling
```
### Missing Interface Implementation
```
error[E0277]: type `String` doesn't implement interface `IArithmetic`
--> src/main.slang:10:24
|
10 | String result = s1 + s2;
| ^ operator `+` requires `IArithmetic` interface
|
= note: the interface `IArithmetic` is not implemented for `String`
= note: string concatenation requires explicit method calls
= help: use `s1.concat(s2)` instead
= note: see https://docs.shader-slang.org/interfaces/operators
```
These guidelines should be treated as living documentation that evolves with the Slang compiler's needs and user feedback. Regular reviews and updates ensure diagnostics remain helpful and relevant.
Slang Doc System
================
Slang contains a rudimentary documentation generation system. The mechanism used to mark up source is similar to [doxygen](https://www.doxygen.nl/manual/docblocks.html). Namely
```
/**
... text ... (JavaDoc style)
*/
void someFunctionA() {}
/*!
.. text .. (QT style)
another line
*/
void someFunctionB() {}
/// ... text ... (Multi line)
/// another line
void someFunctionC() {}
//!... text ... (QT Multi line)
//! another line
void someFunctionD() {}
```
All of the above examples will add the documentation for the declaration that appears after them. Also note that this slightly diverges from doxygen in that an empty line before and after in a multi line comment is *not* required.
We can also document the parameters to a function similarly
```
/// My function
void myFunction(
/// The A parameter
int a,
/// The B parameter
int b);
```
If you just need a single line comment to describe something, you can place the documentation after the parameter as in
```
/// My function
void myFunction( int a, //< The A parameter
int b) //< The B parameter
{}
```
This same mechanisms work for other kinds of common situations such as with enums
```
/// An enum
enum AnEnum
{
Value, ///< A value
/// Another value
/// With a multi-line comment
AnotherValue,
};
```
Like `doxygen` we can also have multi line comments after a declaration for example
```
/// An enum
enum AnEnum
{
Value, ///< A value
///< Some more information about `Value`
/// Another value
/// With a multi-line comment
AnotherValue,
};
```
To actually get Slang to output documentation you can use the `-doc` option from the `slangc` command line, or pass it in as a parameter to `spProcessCommandLineArguments` or `processCommandLineArguments`. The documentation is currently output by default to the same `ISlangWriter` stream as diagnostics. So for `slangc` this will generally mean the terminal/stderr.
Currently the Slang doc system does not support any of the 'advanced' doxygen documentation features. If you add documentation to a declaration it is expected to be in [markdown](https://guides.github.com/features/mastering-markdown/).
Currently the only documentation style supported is a single file 'markdown' output. Future versions will support splitting into multiple files and linking between them. Also future versions may also support other documentation formats/standards.
It is possible to generate documentation for the slang core module. This can be achieved with `slangc` via
```
slangc -doc -compile-core-module
```
The documentation will be written to a file `stdlib-doc.md`.
It should be noted that it is not necessary to add markup to a declaration for the documentation system to output documentation for it. Without markup, however, the documentation will be very limited: in essence only stating that the declaration exists, along with whatever can be derived from the source. This may not be very helpful. For this and other reasons there is a mechanism to control the visibility of items in your source.
There are three visibility levels: 'public', 'internal', and 'hidden'/'private'. There is a special comment that controls visibility for subsequent lines. The special comment starts with `//@` as shown below.
```
//@ public:
void thisFunctionAppearsInDocs() {}
//@ internal:
void thisFunctionCouldAppearInInternalDocs() {}
//@ hidden:
void thisFunctionWillNotAppearInDocs() {}
```
Frequently Asked Questions
==========================
### How did this project start?
The Slang project forked off from the ["Spire"](https://github.com/spire-lang/spire) shading language research project.
In particular, Slang aims to take the lessons learned in that research effort (about how to make more productive shader compilation languages and tools) and apply them to a system that is easier to adopt, and hopefully more amenable to production use.
### Why should I use Slang instead of glslang, hlsl2glslfork, the Microsoft open-source HLSL compiler, etc.?
If you are mostly just shopping around for a tool to get HLSL shaders working on other graphics APIs, then [this](http://aras-p.info/blog/2014/03/28/cross-platform-shaders-in-2014/) blog post is probably a good place to start.
If one of those tools meets your requirements, then you should probably use it.
Slang is a small project, and early in development, so you might find that you hit fewer bumps in the road with one of the more established tools out there.
The goal of the Slang project is not to make "yet another HLSL-to-GLSL translator," but rather to create a shading language and supporting toolchain that improves developer productivity (and happiness) over the existing HLSL language and toolchain, while providing a reasonable adoption path for developers who have an existing investment in HLSL shader code.
If you think that is something interesting and worth supporting, then please get involved!
### What would make a shading language more productive?
This is probably best answered by pointing to the most recent publication from the Spire research project:
[Shader Components: Modular and High Performance Shader Development](http://graphics.cs.cmu.edu/projects/shadercomp/)
Some other papers for those who would like to read up on our inspiration:
[A System for Rapid Exploration of Shader Optimization Choices](http://graphics.cs.cmu.edu/projects/spire/)
[Spark: Modular, Composable Shaders for Graphics Hardware](https://graphics.stanford.edu/papers/spark/)
### Who is using Slang?
Right now the only user of Slang is the [Falcor](https://github.com/NVIDIA/Falcor) real-time rendering framework developed and used by NVIDIA Research.
The implementation of Slang has so far focused heavily on the needs of Falcor.
### Won't we all just be using C/C++ for shaders soon?
The great thing about both Vulkan and D3D12 moving to publicly-documented binary intermediate languages (SPIR-V and DXIL, respectively) is that there is plenty of room for language innovation on top of these interfaces.
Having support for writing GPU shaders in a reasonably-complete C/C++ language would be great.
We are supportive of efforts in the "C++ for shaders" direction.
The Slang effort is about trying to solve the challenges that are unique to the real-time graphics domain, and that won't magically get better by switching to C++.
### Derivatives In Compute
An entry point may be decorated with `[DerivativeGroupQuad]` or `[DerivativeGroupLinear]` to specify how to use derivatives in compute shaders.
GLSL syntax may also be used, but is not recommended (`derivative_group_quadsNV`/`derivative_group_linearNV`).
Targets:
* **_SPIRV:_** Enables `DerivativeGroupQuadsNV` or `DerivativeGroupLinearNV`.
* **_GLSL:_** Enables `derivative_group_quadsNV` or `derivative_group_linearNV`.
* **_HLSL:_** Does nothing. `sm_6_6` is required to use derivatives in compute shaders. HLSL uses an equivalent of `DerivativeGroupQuad`.
Texture Footprint Queries
=========================
Slang supports querying the *footprint* of a texture sampling operation: the texels that would be accessed when performing that operation.
This feature is supported on Vulkan via the `GL_NV_shader_texture_footprint` extension, and on D3D12 via the `NvFootprint*` functions exposed by NVAPI.
# Background
There are many GPU rendering techniques that involve generating a texture (e.g., by rendering to it) and then sampling from that texture in a 3D rendering pass, such that it is difficult to predict *a priori* which parts of the texture will be accessed, or not.
As one example, consider rendering a shadow map that will be accessed when shading a g-buffer.
Depending on the geometry that was rendered into the g-buffer, and the occlusion that might exist, some parts of the shadow map might not be needed at all.
In principle, an application could use a compute pass on the g-buffer to compute, for each pixel, the part of the shadow-map texture that it will access - its footprint.
The application could then aggregate these footprints into a stencil mask or other data structure that could be used to optimize the rendering pass that generates the shadow map.
Unfortunately, it is almost impossible for applications to accurately and reliably predict the texel data that particular sampling operations will require, once non-trivial texture filtering modes are considered.
Sampling operations support a wide variety of state that affects the lookup and filtering of texels. For example:
* When bilinear filtering is enabled, a sampling operation typically accesses the four texels closest to the sampling location and blends them.
* When trilinear filtering is enabled, a sampling operation may access texels at two different mip levels.
* When anisotropic filtering is enabled, a sampling operation may take up to N *taps* (where N is the maximum supported degree of anisotropy), each of which may itself access a neighborhood of texels to produce a filtered value for that tap.
* When sampling a cube map, a sampling operation may straddle the "seam" between two or even three cube faces.
Texture footprint queries are intended to solve this problem by providing application developers with a primitive that can query the footprint of a texture sampling operation using the exact same sampler state and texture coordinates that will be used when sampling the texture later.
# Slang Shader API
Rather than exactly mirror the Vulkan GLSL extension or the NVAPI functions, the Slang core module provides a single common interface that can map to either of those implementations.
## Basics
A typical 2D texture sampling operation is performed using the `Sample()` method on `Texture2D`:
```hlsl
Texture2D<float4> texture = ...;
SamplerState sampler = ...;
float2 coords = ...;
// Sample a 2D texture
float4 color = texture.Sample(
sampler, coords);
```
To query the footprint that would be accessed by this operation, we can use an operation like:
```hlsl
uint granularity = ...;
TextureFootprint2D footprint = texture.queryFootprintCoarse(granularity,
sampler, coords);
```
Note that the same arguments used to call `Sample` above are here passed to `queryFootprintCoarse` in the exact same order.
The returned `footprint` encodes a conservative footprint of the texels that would be accessed by the equivalent `Sample` operation above.
Texture footprints are encoded in terms of blocks of texels, and the size of those blocks determines the *granularity* of the footprint.
The `granularity` argument to `queryFootprintCoarse` above indicates the granularity of blocks that the application requests.
In cases where a filtering operation might access two mip levels - one coarse and one fine - a footprint query only returns information about one of the two levels.
The application selects between these options by calling either `queryFootprintCoarse` or `queryFootprintFine`.
## Variations
A wide range of footprint queries are provided, corresponding to various cases of texture sampling operations with different parameters.
For 2D textures, the following functions are supported:
```hlsl
TextureFootprint2D Texture2D.queryFootprintCoarse(
uint granularity, SamplerState sampler, float2 coords);
TextureFootprint2D Texture2D.queryFootprintFine(
uint granularity, SamplerState sampler, float2 coords);
TextureFootprint2D Texture2D.queryFootprintCoarseBias(
uint granularity, SamplerState sampler, float2 coords,
float lodBias);
TextureFootprint2D Texture2D.queryFootprintFineBias(
uint granularity, SamplerState sampler, float2 coords,
float lodBias);
TextureFootprint2D Texture2D.queryFootprintCoarseLevel(
uint granularity, SamplerState sampler, float2 coords,
float lod);
TextureFootprint2D Texture2D.queryFootprintFineLevel(
uint granularity, SamplerState sampler, float2 coords,
float lod);
TextureFootprint2D Texture2D.queryFootprintCoarseGrad(
uint granularity, SamplerState sampler, float2 coords,
float2 dx, float2 dy);
TextureFootprint2D Texture2D.queryFootprintFineGrad(
uint granularity, SamplerState sampler, float2 coords,
float2 dx, float2 dy);
// Vulkan-only:
TextureFootprint2D Texture2D.queryFootprintCoarseClamp(
uint granularity, SamplerState sampler, float2 coords,
float lodClamp);
TextureFootprint2D Texture2D.queryFootprintFineClamp(
uint granularity, SamplerState sampler, float2 coords,
float lodClamp);
TextureFootprint2D Texture2D.queryFootprintCoarseBiasClamp(
uint granularity, SamplerState sampler, float2 coords,
float lodBias,
float lodClamp);
TextureFootprint2D Texture2D.queryFootprintFineBiasClamp(
uint granularity, SamplerState sampler, float2 coords,
float lodBias,
float lodClamp);
TextureFootprint2D Texture2D.queryFootprintCoarseGradClamp(
uint granularity, SamplerState sampler, float2 coords,
float2 dx, float2 dy,
float lodClamp);
TextureFootprint2D Texture2D.queryFootprintFineGradClamp(
uint granularity, SamplerState sampler, float2 coords,
float2 dx, float2 dy,
float lodClamp);
```
For 3D textures, the following functions are supported:
```hlsl
TextureFootprint3D Texture3D.queryFootprintCoarse(
uint granularity, SamplerState sampler, float3 coords);
TextureFootprint3D Texture3D.queryFootprintFine(
uint granularity, SamplerState sampler, float3 coords);
TextureFootprint3D Texture3D.queryFootprintCoarseBias(
uint granularity, SamplerState sampler, float3 coords,
float lodBias);
TextureFootprint3D Texture3D.queryFootprintFineBias(
uint granularity, SamplerState sampler, float3 coords,
float lodBias);
TextureFootprint3D Texture3D.queryFootprintCoarseLevel(
uint granularity, SamplerState sampler, float3 coords,
float lod);
TextureFootprint3D Texture3D.queryFootprintFineLevel(
uint granularity, SamplerState sampler, float3 coords,
float lod);
// Vulkan-only:
TextureFootprint3D Texture3D.queryFootprintCoarseClamp(
uint granularity, SamplerState sampler, float3 coords,
float lodClamp);
TextureFootprint3D Texture3D.queryFootprintFineClamp(
uint granularity, SamplerState sampler, float3 coords,
float lodClamp);
TextureFootprint3D Texture3D.queryFootprintCoarseBiasClamp(
uint granularity, SamplerState sampler, float3 coords,
float lodBias,
float lodClamp);
TextureFootprint3D Texture3D.queryFootprintFineBiasClamp(
uint granularity, SamplerState sampler, float3 coords,
float lodBias,
float lodClamp);
```
## Footprint Types
Footprint queries on 2D and 3D textures return values of type `TextureFootprint2D` and `TextureFootprint3D`, respectively, which are built-in `struct`s defined in the Slang core module:
```
struct TextureFootprint2D
{
typealias Anchor = uint2;
typealias Offset = uint2;
typealias Mask = uint2;
typealias LOD = uint;
typealias Granularity = uint;
property anchor : Anchor { get; }
property offset : Offset { get; }
property mask : Mask { get; }
property lod : LOD { get; }
property granularity : Granularity { get; }
property isSingleLevel : bool { get; }
}
struct TextureFootprint3D
{
typealias Anchor = uint3;
typealias Offset = uint3;
typealias Mask = uint2;
typealias LOD = uint;
typealias Granularity = uint;
property anchor : Anchor { get; }
property offset : Offset { get; }
property mask : Mask { get; }
property lod : LOD { get; }
property granularity : Granularity { get; }
property isSingleLevel : bool { get; }
}
```
A footprint is encoded in terms of *texel groups*, where the `granularity` determines the size of those groups.
When possible, the returned footprint will match the granularity passed into the query operation, but a larger granularity may be selected in cases where the footprint is too large to encode at the requested granularity.
The `anchor` property specifies an anchor point in the texture, in the vicinity of the footprint. Its components are in multiples of 8 texel groups.
The `offset` property specifies how the bits in `mask` map to texel groups in the vicinity of the `anchor` point.
The `mask` property is a 64-bit bitfield (encoded as a `uint2`), where each bit represents footprint coverage of one texel group, within an 8x8 (for 2D textures) or 4x4x4 (for 3D textures) neighborhood of texel groups.
The `lod` property indicates the mipmap level that would be accessed by the sampling operation.
The `isSingleLevel` property indicates if the sampling operation is known to access only a single mip level.
Note that this property will always be `false` when using the D3D/NVAPI path.
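Putting these pieces together, a query can be issued and its encoded result inspected as follows. This is a minimal sketch: `gTexture`, `gSampler`, `uv`, and the granularity value are illustrative assumptions, and the exact granularity encoding follows the underlying NVAPI/Vulkan extension rather than anything defined here:

```hlsl
// Assumed declarations, for illustration only:
Texture2D gTexture;
SamplerState gSampler;

void inspectFootprint(float2 uv)
{
    // Issue a coarse footprint query at an assumed granularity value.
    TextureFootprint2D fp = gTexture.queryFootprintCoarse(128, gSampler, uv);

    uint2 anchor = fp.anchor;   // anchor point, in multiples of 8 texel groups
    uint2 offset = fp.offset;   // maps mask bits to groups near the anchor
    uint2 mask   = fp.mask;     // 64-bit coverage bitfield over an 8x8 group neighborhood
    uint  lod    = fp.lod;      // mip level the sampling operation would access

    // Always false on the D3D/NVAPI path:
    bool single = fp.isSingleLevel;
}
```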
Slang Language Guide
====================
This document will try to describe the main characteristics of the Slang language that might make it different from other shading languages you have used.
The Basics
----------
Slang is similar to HLSL, and it is expected that many HLSL programs can be used as Slang code with no modifications.
Big-picture stuff that is supported:
* A C-style preprocessor
* Ordinary function, `struct`, `typedef`, etc. declarations
* The standard vector/matrix types like `float3` and `float4x4`
* The less-used explicit `vector<T,N>` and `matrix<T,R,C>` types
* `cbuffer` declarations for uniform parameters
* Global-scope declarations of texture/sampler parameters, including with `register` annotations
* Entry points with varying `in`/`out` parameters using semantics (including `SV_*` system-value semantics)
* The built-in templated resource types like `Texture2D<T>` with their object-oriented syntax for sampling operations
* Attributes like `[unroll]` are parsed, and passed along for HLSL/DXBC output, but dropped for other targets
* `struct` types that contain textures/samplers as well as ordinary uniform data, both as function parameters and in constant buffers
* The built-in functions up through Shader Model 6.0 (as documented on MSDN) are supported
New Features
------------
### Import Declarations
In order to support better software modularity, and also to deal with the issue of how to integrate shader libraries written in Slang into other languages, Slang introduces an `import` declaration construct.
The basic idea is that if you write a file `foo.slang` like this:
```hlsl
// foo.slang
float4 someFunc(float4 x) { return x; }
```
you can then import this code into another file in Slang, HLSL, or GLSL:
```hlsl
// bar.slang
import foo;
float4 someOtherFunc(float4 y) { return someFunc(y); }
```
The simplest way to think of it is that the `import foo` declaration instructs the compiler to look for `foo.slang` (in the same search paths it uses for `#include` files), and give an error if it isn't found.
If `foo.slang` is found, then the compiler will go ahead and parse and type-check that file, and make any declarations there visible to the original file (`bar.slang` in this example).
When it comes time to generate output code, Slang will output any declarations from `import`ed files that were actually used (it skips those that are never referenced), and it will cross-compile them as needed for the chosen target.
A few other details worth knowing about `import` declarations:
* The name you use on the `import` line gets translated into a file name with some very simple rules. An underscore (`_`) in the name turns into a dash (`-`) in the file name, and dot separators (`.`) turn into directory separators (`/`). After these substitutions, `.slang` is added to the end of the name.
* If there are multiple `import` declarations naming the same file, it will only be imported once. This is also true for nested imports.
* Currently importing does not imply any kind of namespacing; all global declarations still occupy a single namespace, and collisions between different imported files (or between a file and the code it imports) are possible. This is a bug.
* If file `A.slang` imports `B.slang`, and then some other file does `import A;`, then only the names from `A.slang` are brought into scope, not those from `B.slang`. This behavior can be controlled by having `A.slang` use `__exported import B;` to also re-export the declarations it imports from `B`.
* An import is *not* like a `#include`, and so the file that does the `import` can't see preprocessor macros defined in the imported file (and vice versa). Think of `import foo;` as closer to `using namespace foo;` in C++ (perhaps without the same baggage).
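Applying the name-translation rules above, for example (file names are illustrative):

```hlsl
import foo_utils;       // resolves to "foo-utils.slang"
import shaders.lights;  // resolves to "shaders/lights.slang"
```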
### Explicit Parameter Blocks
One of the most important new features of modern APIs like Direct3D 12 and Vulkan is an interface for providing shader parameters using efficient *parameter blocks* that can be stored in GPU memory (these are implemented as descriptor tables/sets in D3D12/Vulkan, and argument buffers in Metal).
However, HLSL and GLSL don't support explicit syntax for parameter blocks, and so shader programmers are left to manually pack parameters into blocks either using `register`/`layout` modifiers, or with API-based remapping (in the D3D12 case).
Slang supports a simple and explicit syntax for exploiting parameter blocks:
```hlsl
struct ViewParams
{
float3 cameraPos;
float4x4 viewProj;
TextureCube envMap;
};
ParameterBlock<ViewParams> gViewParams;
```
In this example, the fields of `gViewParams` will be assigned to registers/bindings in a way that supports allocating them into a single parameter block.
For example, when generating GLSL for Vulkan, the Slang compiler will generate a single `uniform` block (for `cameraPos` and `viewProj`) and a global `textureCube` for `envMap`, both decorated with the same `layout(set = ...)`.
### Interfaces
Slang supports declaring `interface`s that user-defined `struct` types can implement.
For example, here is a simple interface for light sources:
```hlsl
// light.slang
struct LightSample { float3 intensity; float3 direction; };
interface ILight
{
LightSample sample(float3 position);
}
```
We can now define a simple user type that "conforms to" (implements) the `ILight` interface:
```hlsl
// point-light.slang
import light;
struct PointLight : ILight
{
float3 position;
float3 intensity;
LightSample sample(float3 hitPos)
{
float3 delta = hitPos - position;
float distance = length(delta);
LightSample sample;
sample.direction = delta / distance;
sample.intensity = intensity * falloff(distance);
return sample;
}
}
```
### Generics
Slang supports *generic* declarations, using the common angle-bracket (`<>`) syntax from languages like C#, Java, etc.
For example, here is a generic function that works with any type of light:
```hlsl
// diffuse.slang
import light;
float4 computeDiffuse<L : ILight>( float4 albedo, float3 P, float3 N, L light )
{
LightSample sample = light.sample(P);
float nDotL = max(0, dot(N, sample.direction));
return albedo * nDotL;
}
```
The `computeDiffuse` function works with any type `L` that implements the `ILight` interface.
Unlike with C++ templates, the `computeDiffuse` function can be compiled and type-checked once (you won't suddenly get unexpected error messages when plugging in a new type).
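For instance, given the `PointLight` type from the interfaces example above, the generic can be called like an ordinary function, with `L` inferred from the argument type. A hedged sketch (the `import`s assume the earlier example files):

```hlsl
// shade.slang
import diffuse;
import point_light;

float4 shadePointLight(float4 albedo, float3 P, float3 N, PointLight light)
{
    // The compiler infers L = PointLight from the final argument.
    return computeDiffuse(albedo, P, N, light);
}
```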
#### Global-Scope Generic Parameters
Putting generic parameters directly on functions is helpful, but in many cases existing HLSL shaders declare their parameters at global scope.
For example, we might have a shader that uses a global declaration of material parameters:
```hlsl
Material gMaterial;
```
In order to allow such a shader to be converted to use a generic parameter for the material type (to allow for specialization), Slang supports declaring type parameters at the global scope:
```hlsl
type_param M : IMaterial;
M gMaterial;
```
Conceptually, you can think of this syntax as wrapping your entire shader program in a generic with parameter `<M : IMaterial>`.
This isn't beautiful syntax, but it may help when incrementally porting an existing HLSL codebase to use Slang's features.
### Associated Types
Sometimes it is difficult to define an interface because each type that implements it might need to make its own choice about some intermediate type.
As a concrete example, suppose we want to define an interface `IMaterial` for material surface shaders, where each material might use its own BRDF.
We want to support evaluating the *pattern* of the surface separate from the reflectance function.
```hlsl
// A reflectance function
interface IBRDF
{
float3 eval(float3 wi, float3 wo);
}
struct DisneyBRDF : IBRDF { ... };
struct KajiyaKay : IBRDF { ... };
// a surface pattern
interface IMaterial
{
??? evalPattern(float3 position, float2 uv);
}
```
What is the type `???` that `evalPattern` should return? We know that it needs to be a type that supports `IBRDF`, but *which* type?
One material might want to use `DisneyBRDF` while another wants to use `KajiyaKay`.
The solution in Slang, as in modern languages like Swift and Rust, is to use *associated types* to express the dependence of the BRDF type on the material type:
```hlsl
interface IMaterial
{
associatedtype B : IBRDF;
B evalPattern(float3 position, float2 uv);
}
struct MyCoolMaterial : IMaterial
{
typedef DisneyBRDF B;
B evalPattern(float3 position, float2 uv)
{ ... }
}
```
Associated types are an advanced concept, and we only recommend using them when they are needed to define a usable interface.
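Generic code can then refer to the associated type through its conforming type parameter. A minimal sketch, assuming member-type lookup syntax `M.B` (the function name `shade` is illustrative, not part of any interface):

```hlsl
float3 shade<M : IMaterial>(M material, float3 P, float2 uv, float3 wi, float3 wo)
{
    // M.B is whichever BRDF type the concrete material chose.
    M.B brdf = material.evalPattern(P, uv);
    return brdf.eval(wi, wo);
}
```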
Future Extensions
-----------------
### Implicit Generics Syntax
The syntax for generics and interfaces in Slang is currently explicit, but verbose:
```hlsl
float4 computeDiffuse<L : ILight>( L light, ... )
{ ... }
```
As a future change, we would like to allow using an interface like `ILight` as an ordinary parameter type:
```hlsl
float4 computeDiffuse( ILight light, ... )
{ ... }
```
This simpler syntax would act like "syntactic sugar" for the existing explicit generics syntax, so it would retain all of the important performance properties.
### Returning a Value of Interface Type
While the above dealt with using an interface as a parameter type, we would eventually like to support using an interface as the *return* type of a function:
```hlsl
ILight getALightSource(Scene scene) { ... }
```
Implementing this case efficiently is more challenging. In most cases, an associated type can be used instead when an interface return type would be desired.
Not Supported
-------------
Some features of the current HLSL language are not supported, but probably will be given enough time/resources:
* Local variables of texture/sampler type (or that contain these)
* Matrix swizzles
* Explicit `packoffset` annotations on members of `cbuffer`s
Some things from HLSL are *not* planned to be supported, unless there is significant outcry from users:
* Pre-D3D10 and D3D11 syntax and operations
* The "effect" system, and the related `<>` annotation syntax
* Explicit `register` bindings on textures/samplers nested in `cbuffer`s
* Any further work towards making HLSL a subset of C++ (simply because implementing a full C++ compiler is way out of scope for the Slang project)
> Note: This document is a work in progress. It is both incomplete and, in many cases, inaccurate.
Slang Language Reference
========================
Contents
--------
* [Introduction](introduction.md)
* [Basic Concepts](basics.md)
* [Lexical Structure](lexical-structure.md)
* [Preprocessor](preprocessor.md)
* [Types](types.md)
* [Expressions](expressions.md)
* [Statements](statements.md)
* [Declarations](declarations.md)
* [Attributes](attributes.md)
* [Graphics Shaders and Compute Kernels](shaders-and-kernels.md)
* [Glossary](glossary.md)